Just An Application

November 23, 2014

Swift vs. The Compound File Binary File Format (aka OLE/COM): Part Nine — Decompression Time

The compressed data in the dir stream object should comprise a signature byte, which is always one, followed by one or more ‘chunk’s each of which represents 4096 bytes of uncompressed data.

Compressed data should always be at the end of a stream object so the end of the compressed data is indicated by reaching the end of the stream object.

On that basis we can process the compressed data like this

    func decompress() -> ByteData?
    {
        if dataIn[0] != 1
        {
            return nil
        }
        dataInCurrent = 1
        while dataInCurrent < dataInEnd
        {
            if !chunk()
            {
                return nil
            }
        }
        return makeByteData()
    }

where dataIn is the compressed data to be processed.

A chunk comprises a two byte header followed by at least one ‘token sequence’.

A chunk header comprises

  • a twelve bit size

  • a three bit signature, and

  • a one bit flag

The size specifies the number of bytes in the chunk minus three.

The signature is always three (0x03).

The flag is one if the chunk contains compressed data, and zero otherwise.

We can process a chunk of compressed data like this

    private func chunk() -> Bool
    {
        let chunkInStart = dataInCurrent
        let header       = getUnsignedInt16()
    
        dataInCurrent += 2
    
        let size        = Int(header & 0x0FFF) + 3
        let signature   = (header >> 12) & 0x07
        let flag        = (header & 0x8000) != 0
    
        if signature != 3
        {
            return false
        }
    
        chunkOutStart = dataOutCurrent
    
        let chunkInEnd = min(chunkInStart + size, dataInEnd)
        
        if flag
        {
            while dataInCurrent < chunkInEnd
            {
                if !tokenSequence(chunkInEnd)
                {
                    return false
                }
            }
            return true
        }
        else
        {
            // uncompressed data not supported
            return false
        }
    }

A token sequence comprises a ‘flags byte’ followed by either

  • eight ‘token’s if it is not the last one in a chunk, or

  • between one and eight ‘token’s otherwise.

The i‘th bit in the flag byte specifies the type of the i‘th token.

If the bit is zero the token is a one byte ‘literal token’ otherwise it is a two byte ‘copy token’

To obtain the uncompressed data represented by a token sequence we need to read the flag byte from the compressed data and then iterate over all the bits or until we reach the end of the compressed data.

In each iteration we need to handle either a literal or a copy token depending upon whether the bit is zero or one.

    func tokenSequence(chunkInEnd:Int) -> Bool
    {
        let flagsByte = Int(dataIn[dataInCurrent])
    
        dataInCurrent += 1
    
        if dataInCurrent < chunkInEnd
        {
            for i in 0 ..< 8
            {
                switch ((flagsByte >> i) & 1) == 1
                {
                    case false where (dataInCurrent + 1) <= chunkInEnd:
        
                        literalToken()
        
                    case true where (dataInCurrent + 2) <= chunkInEnd:
        
                        copyToken()
        
                    default:
        
                        return false
                }
                if dataInCurrent == chunkInEnd
                {
                    // end of chunk no more tokens
                    break
                }
            }
            return true
        }
        else
        {
            // must be at least one token
            return false
        }
    }

A literal token is a single byte of uncompressed data so we copy it from the compressed data to the decompressed data.

    private func literalToken()
    {
        dataOut.append(dataIn[dataInCurrent])
        ++compressedDataCurrent
        ++decompressedDataCurrent
    }

A copy token is a little-endian 16 bit unsigned integer interpreted as two unsigned integers.

The unsigned integer in the low-order bits denotes a ‘length’ and the unsigned integer in the high-order bits denotes an ‘offset’.

The size of ‘offset’ can vary between a minimum of four bits and a maximum of twelve bits, with the size of ‘length’ correspondingly varying between a maximum of twelve bits and a minimum of four bits .

The size of ‘offset’ when a copy token is processed is a function of the amount of decompressed data at that point.

It is the smallest number of bits, nBits, that can be used to represent the amount of decompressed data minus one, or four if nBits is less than four.

To obtain the uncompressed data represented by a copy token we first need to compute the size of ‘offset’ and hence ‘length’ and extract the values.

Given an unsigned 32 bit integer i we can compute the smallest number of bits needed to represent it by counting the number of leading zeros.

This is a very brute force approach

    private func numberOfLeadingZeros(i: UInt32) -> UInt
    {
        var n    : UInt   = 0
        var mask : UInt32 = 0x80000000
    
        while (mask != 0) && ((i & mask) == 0)
        {
            mask >>= 1
            ++n
        }
        return n
    }

and this is a slightly less brute force approach.

    private func numberOfLeadingZerosAlt(var i: UInt32) -> UInt
    {
        var s : UInt = 0
    
        if (i & 0xFFFF0000) != 0
        {
            i >>= 16
            s = 16
        }
        if (i & 0xFF00) != 0
        {
            i >>= 8
            s += 8
        }
        if (i & 0xF0) != 0
        {
            i >>= 4
            s += 4
        }
        if (i & 0x0C) != 0
        {
            i >>= 2
            s += 2
        }
        if (i & 0x03) != 0
        {
            i >>= 1
            s += 1
        }
        if (i & 0x01) != 0
        {
            s += 1
        }
        return 32 - s
    }

The computed ‘offset’ + 1 gives the offset back from the end of the decompressed data at which to start copying.

The computed ‘length’ + 3 gives the amount of data to copy.

The data to copy is appended to the end of decompressed data.

    private func copyToken()
    {
        let copyToken = getUnsignedInt16()
    
        dataInCurrent += 2
    
        let chunkOutSize = dataOutCurrent - chunkOutStart
        let nBits        = max(32 - numberOfLeadingZerosAlt(UInt32(chunkOutSize - 1)), 4)
        let lengthMask   = 0xFFFF >> nBits
        let length       = (copyToken & lengthMask) + 3
        let offsetMask   = ~lengthMask
        let offset       = ((copyToken & offsetMask) >> (16 - nBits)) + 1
        let source       = dataOutCurrent - offset
        
        for i in 0 ..< length
        {
            dataOut.append(dataOut)
        }
        dataOutCurrent += length
    }

And thats it.

We should now have the original data that was compressed and stored in the dir stream object.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: