Just An Application

August 25, 2015

Swift Miscellany: Parsing With Generics And Protocols

The pseudo-code to parse a parenthesis delimited comma separated list of homogeneous elements, such as this one

    tuple-pattern : '(' tuple-pattern-element-list? ')'
    
    tuple-pattern-element-list : tuple-pattern-element
                               | tuple-pattern-element ',' tuple-pattern-element-list

or this

    parenthesized-expression : '(' expression-element-list? ')'
    
    expression-element-list : expression-element
                            | expression-element ',' expression-element-list

looks something like this

    ...
    
    if next token != '('
    {
        error
    }
    
    initialize element list
    
    while peek token != ')'
    {
        element = try to parse element
        
        add element to element list
    }
    if next token != ')'
    {
        really bad error
    }
    return element list
    
    ...

The differences in each case will be the types of ‘element’ and ‘element list’ and exactly what is involved in parsing the element.

The number of times you get to write the actual code in practice will depend to a large extent on the programming language you are using.

Since Swift supports generics we should be able to get away with writing the actual code just once.

We can define a function with a signature something like this

    func list<T>(tokens:TokenStream, 'something to parse the element of type T') -> [T]?

One choice for the ‘something to parse the element of type T’ is another function.

The function is passed the tokens and it returns an instance of type T or nil if there was an error which gives us the signature

    (tokens:TokenStream) -> T?

which in turn gives us the function

    func list(tokens:TokenStream, parser:(tokens:TokenStream) -> T?)  -> [T]?
    {
        if !next(tokens, .LEFT_PAREN)
        {
            return nil
        }
    
        var elements = [T]()
    
        while !peek(tokens, .RIGHT_PAREN)
        {
            let element = parser(tokens:tokens)
    
            if element == nil
            {
                return nil
            }
    
            elements.append(element!)
    
            if !peek(tokens, .COMMA)
            {
                break
            }
            assertNext(tokens, .COMMA)
        }
        assertNext(tokens, .RIGHT_PAREN)
        return elements
    }

In Swift 2.0 we can make things simpler by dispensing with the returning nil on error convention.

    func list(tokens:TokenStream, parser:(tokens:TokenStream) throws -> T) throws  -> [T]
    {
        try next(tokens, .LEFT_PAREN)
    
        var elements = [T]()
    
        while !peek(tokens, .RIGHT_PAREN)
        {
            elements.append(try parser(tokens:tokens))
    
            if !peek(tokens, .COMMA)
            {
                break
            }
            assertNext(tokens, .COMMA)
        }
        assertNext(tokens, .RIGHT_PAREN)
        return elements
    }

Taking A Slightly More Object-oriented Approach

For practical or ideological reasons we might want to use an ‘object’ or ‘objects’ to do the parsing rather than functions.

In this case we might want to pass an instance of a type capable of doing the parsing of the element rather than using one these old-fangled closure thingys.

For the method, as it now is, to remain generic one possibility would be to pass an instance of a generic class in which case the function signature would look something like this

    func list<T>(parser:GenericListElementParser<T>) throws -> [T]

This is pretty straightforward but not that flexible.

The original function could take any function, method, or closure with the right signature, but in this case any object passed would either have to an instance of the generic class itself or a sub-class of it.

It might be better if we could define a protocol which the object must implement but there seems to be a problem.

    protocol ListElementParser
    {
        func parse() throws -> ????
    }

How do we define the return type of the method the object must implement ?

Unlike classes, enums and structs, protocol definitions cannot have a generic argument clause so how do we define the return type ?

The answer is to define an ‘associated type’ that can then be referenced in subsequent declarations in the protocol

    protocol ListElementParser
    {
        typealias ElementType
        
        func parse() throws -> ElementType
    }

For each type that implements the protocol ListElementParser, for example

    final class ExpressionElementParser: ListElementParser
    {
        func parse() throws -> ExpressionElement
        {
            ...
        }
    }

the concrete type corresponding to ElementType is inferred from the declaration of the parse method.

The function signature now has to be generic in the argument type and the return type but what does the constraint look like ?

    func list<????>(parser:P) throws -> T

It has to include both P and T, and we have to constrain the type P to implement the ListElementParser.

The constraint on the the type P can be expressed like this

    P:ListElementParser

so how about

    func list<P:ListElementParser, T>(p:P) throws -> [T]
    {
        try next(.LEFT_PAREN)
    
        var elements = [T]()
    
        while !peek(.RIGHT_PAREN)
        {
            elements.append(try p.parse())
    
            if !peek(.COMMA)
            {
                break
            }
            assertNext(.COMMA)
        }
        assertNext(.RIGHT_PAREN)
        return elements
    }

Nope, the compiler is having none of it.

It states, not unreasonably

    Cannot invoke 'append' with an argument list of type '(P.ElementType)'

Just plonking the T in the generic argument list tells the compiler nothing.

We have to relate the type T to the type ElementType being returned from the call to the parse method.

Fortunately there is a place in the generic parameter list where we can put requirements like this.

The requirement we have is that

the type ElementType of the ListElementParser type is the same as the type T

which in generic parameter clause requirements speak can be expressed very simply

    P.ElementType == T

We can add this to the generic parameter clause in the ‘requirements section’ and the compiler is happy.

    func list<P:ListElementParser, T where P.ElementType == T>(p:P) throws -> [T]
    {
        try next(.LEFT_PAREN)
    
        var elements = [T]()
    
        while !peek(.RIGHT_PAREN)
        {
            elements.append(try p.parse())
    
            if !peek(.COMMA)
            {
                break
            }
            assertNext(.COMMA)
        }
        assertNext(.RIGHT_PAREN)
        return elements
    }

An Incremental Improvement

To use the function we need to first construct an instance of the parser which is a pain if the only reason for constructing it is to pass it to the function.

Why can’t the function construct the parser itself ?

In Swift it can.

We need to change the function signature to take a type which implements ListElementParser.

    func list<P:ListElementParser, T where P.ElementType == T>(pType:P.Type) throws -> [T]

and we need to specify that the type implements the necessary constructor.

    protocol ListElementParser
    {
        typealias ElementType
    
        init(context:ParserContext, tokens:TokenStream)
    
        func parse() throws -> ElementType
    }

To construct an instance we simply do

    let parser = pType.init(context:context, tokens:tokens)

The function now looks like this

    func list<P:ListElementParser, T where P.ElementType == T>(pType:P.Type) throws -> [T]
    {
        try next(.LEFT_PAREN)
    
        let parser   = pType.init(context:context, tokens:tokens)
        var elements = [T]()
    
        while !peek(.RIGHT_PAREN)
        {
            elements.append(try parser.parse())
    
            if !peek(.COMMA)
            {
                break
            }
            assertNext(.COMMA)
        }
        assertNext(.RIGHT_PAREN)
        return elements
    }

To invoke the method we now pass the type of the parser

    try list(ExpressionElementParser.self)

Another Incremental Improvement

Now it is no longer necessary to contruct an instance of the parser for type T but it is still necessary to know what the type of parser is.

It would be simpler if it was possible to simply pass the type of T and decouple the caller from the details of how T‘s get parsed.

We can also do this.

To do so we need another protocol.

    protocol ListElement
    {
        typealias ParserType: ListElementParser
    
        static var parserType : ParserType.Type { get }
    }

and the method now looks like this

    func list<T:ListElement where T.ParserType.ElementType == T>(tType:T.Type) throws -> [T]
    {
        try next(.LEFT_PAREN)
            
        let parser   = tType.parserType.init(context:context, tokens:tokens)
        var elements = [T]()
            
        while !peek(.RIGHT_PAREN)
        {
            elements.append(try parser.parse())
            
            if !peek(.COMMA)
            {
                break
            }
            assertNext(.COMMA)
        }
        assertNext(.RIGHT_PAREN)
        return elements
    }

For this to work in the ExpressionElementParser case the ExpressionElement type has to implement the ListElement protocol.

    enum ExpressionElement: ListElement
    {
        static let parserType = ExpressionElementParser.self
    
        //
    
        case EXPRESSION(Expression)
        case NAMED_EXPRESSION(String, Expression)
    }

Now we can simply pass the type of the elements in the list.

    try list(ExpressionElement.self)

Via the extension mechanism this approach can be made to work with any class, enum, or struct type.

Defining

    final class StringListElementParser: ListElementParser
    {
        init(context:ParserContext, tokens:TokenStream)
        {
            // ...
        }
    
        func parse() throws -> String
        {
            // ...
        }
    }

and

    extension String: ListElement
    {
        static let parserType = StringListElementParser.self
    }

means we can do

    try list(String.self)

Copyright (c) 2015 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

November 23, 2014

Swift vs. The Compound File Binary File Format (aka OLE/COM): Part Nine — Decompression Time

The compressed data in the dir stream object should comprise a signature byte, which is always one, followed by one or more ‘chunk’s each of which represents 4096 bytes of uncompressed data.

Compressed data should always be at the end of a stream object so the end of the compressed data is indicated by reaching the end of the stream object.

On that basis we can process the compressed data like this

    func decompress() -> ByteData?
    {
        if dataIn[0] != 1
        {
            return nil
        }
        dataInCurrent = 1
        while dataInCurrent < dataInEnd
        {
            if !chunk()
            {
                return nil
            }
        }
        return makeByteData()
    }

where dataIn is the compressed data to be processed.

A chunk comprises a two byte header followed by at least one ‘token sequence’.

A chunk header comprises

  • a twelve bit size

  • a three bit signature, and

  • a one bit flag

The size specifies the number of bytes in the chunk minus three.

The signature is always three (0x03).

The flag is one if the chunk contains compressed data, and zero otherwise.

We can process a chunk of compressed data like this

    private func chunk() -> Bool
    {
        let chunkInStart = dataInCurrent
        let header       = getUnsignedInt16()
    
        dataInCurrent += 2
    
        let size        = Int(header & 0x0FFF) + 3
        let signature   = (header >> 12) & 0x07
        let flag        = (header & 0x8000) != 0
    
        if signature != 3
        {
            return false
        }
    
        chunkOutStart = dataOutCurrent
    
        let chunkInEnd = min(chunkInStart + size, dataInEnd)
        
        if flag
        {
            while dataInCurrent < chunkInEnd
            {
                if !tokenSequence(chunkInEnd)
                {
                    return false
                }
            }
            return true
        }
        else
        {
            // uncompressed data not supported
            return false
        }
    }

A token sequence comprises a ‘flags byte’ followed by either

  • eight ‘token’s if it is not the last one in a chunk, or

  • between one and eight ‘token’s otherwise.

The i‘th bit in the flag byte specifies the type of the i‘th token.

If the bit is zero the token is a one byte ‘literal token’ otherwise it is a two byte ‘copy token’

To obtain the uncompressed data represented by a token sequence we need to read the flag byte from the compressed data and then iterate over all the bits or until we reach the end of the compressed data.

In each iteration we need to handle either a literal or a copy token depending upon whether the bit is zero or one.

    func tokenSequence(chunkInEnd:Int) -> Bool
    {
        let flagsByte = Int(dataIn[dataInCurrent])
    
        dataInCurrent += 1
    
        if dataInCurrent < chunkInEnd
        {
            for i in 0 ..< 8
            {
                switch ((flagsByte >> i) & 1) == 1
                {
                    case false where (dataInCurrent + 1) <= chunkInEnd:
        
                        literalToken()
        
                    case true where (dataInCurrent + 2) <= chunkInEnd:
        
                        copyToken()
        
                    default:
        
                        return false
                }
                if dataInCurrent == chunkInEnd
                {
                    // end of chunk no more tokens
                    break
                }
            }
            return true
        }
        else
        {
            // must be at least one token
            return false
        }
    }

A literal token is a single byte of uncompressed data so we copy it from the compressed data to the decompressed data.

    private func literalToken()
    {
        dataOut.append(dataIn[dataInCurrent])
        ++compressedDataCurrent
        ++decompressedDataCurrent
    }

A copy token is a little-endian 16 bit unsigned integer interpreted as two unsigned integers.

The unsigned integer in the low-order bits denotes a ‘length’ and the unsigned integer in the high-order bits denotes an ‘offset’.

The size of ‘offset’ can vary between a minimum of four bits and a maximum of twelve bits, with the size of ‘length’ correspondingly varying between a maximum of twelve bits and a minimum of four bits .

The size of ‘offset’ when a copy token is processed is a function of the amount of decompressed data at that point.

It is the smallest number of bits, nBits, that can be used to represent the amount of decompressed data minus one, or four if nBits is less than four.

To obtain the uncompressed data represented by a copy token we first need to compute the size of ‘offset’ and hence ‘length’ and extract the values.

Given an unsigned 32 bit integer i we can compute the smallest number of bits needed to represent it by counting the number of leading zeros.

This is a very brute force approach

    private func numberOfLeadingZeros(i: UInt32) -> UInt
    {
        var n    : UInt   = 0
        var mask : UInt32 = 0x80000000
    
        while (mask != 0) && ((i & mask) == 0)
        {
            mask >>= 1
            ++n
        }
        return n
    }

and this is a slightly less brute force approach.

    private func numberOfLeadingZerosAlt(var i: UInt32) -> UInt
    {
        var s : UInt = 0
    
        if (i & 0xFFFF0000) != 0
        {
            i >>= 16
            s = 16
        }
        if (i & 0xFF00) != 0
        {
            i >>= 8
            s += 8
        }
        if (i & 0xF0) != 0
        {
            i >>= 4
            s += 4
        }
        if (i & 0x0C) != 0
        {
            i >>= 2
            s += 2
        }
        if (i & 0x03) != 0
        {
            i >>= 1
            s += 1
        }
        if (i & 0x01) != 0
        {
            s += 1
        }
        return 32 - s
    }

The computed ‘offset’ + 1 gives the offset back from the end of the decompressed data at which to start copying.

The computed ‘length’ + 3 gives the amount of data to copy.

The data to copy is appended to the end of decompressed data.

    private func copyToken()
    {
        let copyToken = getUnsignedInt16()
    
        dataInCurrent += 2
    
        let chunkOutSize = dataOutCurrent - chunkOutStart
        let nBits        = max(32 - numberOfLeadingZerosAlt(UInt32(chunkOutSize - 1)), 4)
        let lengthMask   = 0xFFFF >> nBits
        let length       = (copyToken & lengthMask) + 3
        let offsetMask   = ~lengthMask
        let offset       = ((copyToken & offsetMask) >> (16 - nBits)) + 1
        let source       = dataOutCurrent - offset
        
        for i in 0 ..< length
        {
            dataOut.append(dataOut)
        }
        dataOutCurrent += length
    }

And thats it.

We should now have the original data that was compressed and stored in the dir stream object.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

June 11, 2013

Programming With Rust — Part Three: Things We Need To Know – Modules

1.0 Modules

A Rust module defines both a namespace and an access-control boundary.

In this example

    mod fish
    {
        enum FlatFish
        {
            Dab,
            Halibut,
            Flounder,
            Sole,
            Turbot
        }
    }

the type FlatFish could be referenced from within the fish module using the identifier FlatFish.

However it is not visible outside the module.

2.0 Access Control

To make it visible it is necessary to make it public it like this.

    mod fish
    {
        pub enum FlatFish
        {
            Dab,
            Halibut,
            Flounder,
            Sole,
            Turbot
        }
    }

The FlatFish type can now be referenced from other modules.

3.0 Paths

Now that the FlatFish type is public it can be referenced from outside the fish module using its path, that is, its fully qualified name, which is

    fish::FlatFish

4.0 Name Binding

Alternatively the name FlatFish can be bound locally in another module like this

    use fish::FlatFish;

Now the local name FlatFish can be used to reference the name FlatFish in the fish module.

5.0 Nested Modules

Modules can be nested.

For example

    mod underwater_creatures
    {
        mod fish
        {
            pub enum FlatFish
            {
                Dab,
                Halibut,
                Flounder,
                Sole,
                Turbot
            }
        }
    }

The path of the FlatFish type is, as you might expect

    underwater_creatures::fish::FlatFish

and to bind the name FlatFish locally you do this

    use underwater_creatures::fish::FlatFish;

6.0 Binding Module Names

You can also bind the name of a module locally, but visibility constraints apply to modules as well.

Given the example above you cannot do this

    use underwater_creatures::fish;

because the fish module is not public.

If the fish module is made public

    mod underwater_creatures
    {
        pub mod fish
        {
            pub enum FlatFish
            {
                Dab,
                Halibut,
                Flounder,
                Sole,
                Turbot
            }
        }
    }

then you can do this

    use underwater_creatures::fish;

and you can then refer to the FlatFish type using the path

    fish::FlatFish

This is useful if you want to use a prefix to qualify a name either to avoid name clashes and/or to identify where the name is from, but you do not want to use its path because of its length.

7.0 Aliasing

A name can be bound to an alias like this

    use ff = fish::FlatFish;

Now the local name ff can be used to reference the name FlatFish in the fish module.

8.0 Access Control And Name Bindings

By default name bindings are not visible outside the module in which they occur, but they can be made public in the same way
as enum types and modules.

The effect of doing this is that the local name is now visible outside the module and it in turn can now be referenced using its path and bound to a local name in another module and so on.

For example, if we do this

    mod underwater_creatures
    {
        pub use underwater_creatures::fish::FlatFish;
    
        mod fish
        {
            pub enum FlatFish
            {
                Dab,
                Flounder,
                Halibut,
                Sole,
                Turbot
            }
        }
    }

then you can use the path

   underwater_creatures::FlatFish

to reference the FlatFish type.

Alternatively you can bind the name locally like this

   use underwater_creatures::FlatFish

and the local name FlatFish will reference the FlatFish type.

Using aliasing in conjunction with a public name binding it is possible to re-name things in very confusing ways if you choose to do so.

For example,

    mod underwater_creatures
    {
        pub use FlatFish = underwater_creatures::fish::OtherFish;
    
        mod fish
        {
            pub enum FlatFish
            {
                Dab,
                Flounder,
                Halibut,
                Sole,
                Turbot
            }
        
            pub enum OtherFish
            {
                Perch,
                Roach,
                Tench
            }
        }
    }

Now the path

   underwater_creatures::FlatFish

references the OtherFish type in the fish module.


Copyright (c) 2013 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Blog at WordPress.com.

%d bloggers like this: