Just An Application

November 22, 2014

Swift vs. The Compound File Binary File Format (aka OLE/COM): Part Eight — What Is It We Are Looking For Again ?

Using a CompoundFile instance we can create a Stream instance for any stream object in the file as long as we know what it is called.

If we continue to assume that whatever this particular compound file does is being done by macros, then we need to know where they are stored in a ‘word’ document.

To find this out we need to consult a second specification pithily entitled

    [MS-WORD]: Word (.doc) Binary File Format

According to section 2.1.9 Macros Storage

The Macros storage is an optional storage that contains the macros for the file. If present, it MUST be a Project Root Storage as defined in [MS-OVBA] section 2.2.1.

Curiously, every other section which describes a storage explicitly specifies its name, for example

The Custom XML Data storage is an optional storage whose name MUST be “MsoDataStore”.

Section 2.1.9 is the only one that does not, so we will have to assume for the moment that its name is going to be

    "Macros"

or something of that ilk.

Moving on to specification number three

    [MS-OVBA]: Office VBA File Format Structure

as referenced from section 2.1.9 quoted above, section 2.2.1 Project Root Storage starts

A single root storage. MUST contain VBA Storage (section 2.2.2) and PROJECT Stream (section 2.2.7).

Going further down the rabbit hole we find section 2.2.2 VBA Storage

A storage that specifies VBA project and module information. MUST have the name “VBA” (case-insensitive). MUST contain _VBA_PROJECT Stream (section 2.3.4.1) and dir Stream (section 2.3.4.2). MUST contain a Module Stream (section 2.2.5) for each module in the VBA project.

It's not obvious from that where the actual code is, but a quick look at section 2.2.5 tells us.

A stream (1) that specifies the source code of modules in the VBA project. The name of this stream is specified by MODULESTREAMNAME (section 2.3.4.2.3.2.3). MUST contain data as specified by Module Stream (section 2.3.4.3).

So that's where the source code is, but the name of the stream is elsewhere, it appears.

In fact the name is in a MODULESTREAMNAME record, which is in a MODULE record, which is in a PROJECTMODULES record, which is in the “dir” stream.

In the face of all that it's tempting to just guess which stream it must be. There can’t be that many of them, can there ?

Assuming we can find it, what’s in it ?

A module stream, it turns out, contains a variable length record followed by the compressed source code, so even if we guess which stream it is, that is not going to do us much good.

The length of the first variable length record is defined in a MODULEOFFSET record which is also contained within a MODULE record and so on and so forth.
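
Once that offset is known the split itself is trivial. A minimal Java sketch, assuming the offset has already been dug out of the MODULEOFFSET record and that it marks where the compressed source begins. The class and method names are made up for illustration and are not part of the Swift code in these posts, and decompressing the result is a separate step which is not shown.

```java
import java.util.Arrays;

// Hypothetical helper: splits a module stream at a known offset.
final class ModuleStreamSplitter {
    // Returns the bytes from textOffset to the end of the stream,
    // i.e. the still-compressed source code.
    static byte[] compressedSource(byte[] moduleStream, int textOffset) {
        if (textOffset < 0 || textOffset > moduleStream.length) {
            throw new IllegalArgumentException("offset outside stream: " + textOffset);
        }
        return Arrays.copyOfRange(moduleStream, textOffset, moduleStream.length);
    }
}
```

Everything before the offset is the variable length record, which this sketch simply skips.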

There is nothing for it: we are going to have to get hold of the “dir” stream.

We are looking for a stream named “dir” within a storage object which is definitely named “VBA” or “vba” or “VbA” or something like that, which is within a storage object which might be called “Macros”, maybe.

Trying

    let cff       = CompoundFileFactory()
    let cf        = cff.open(argv[1])
    let dirStream = cf?.getStream(storage: ["Macros", "VBA"], name: "dir")
    let data      = dirStream?.data()

results in a non-nil value for data, so we are nearly there.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

August 28, 2014

Anatomy Of A PDF Continued: #4 — Part Two: What Is It With This Obfuscation Lark ?

Javascript Obfuscation Considered …

What IS the point of doing this

    ...
    
    var sekritVar0001 = [0x30,0x74,0x67,0x72,0x6E,0x63,0x65,0x67, 1, 10, 40];
    
    ...

and then this somewhere else ?

    ...
    
    for(var q = 0; q < sekritVar0001.length-3; q++)
    sekritVar0002 += String.fromCharCode(sekritVar0001[q]-2);

    ...

Anybody who has managed to lever open a malformed PDF and get to the point where they can actually see it is hardly likely to give up at this juncture just because they are going to have to iterate over an array, subtract 2 from every element, and turn the result into a character, now are they ?

It says

    .replace

for pity’s sake.
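
For the record, the whole of that 'protection' amounts to something like the following Java sketch. The class and method names are invented for illustration; the array contents and the constants 3 and 2 are taken from the two snippets above.

```java
// Illustrative only: mirrors the loop in the obfuscated script.
final class ShiftedCodes {
    // Subtracts `shift` from each of the first (codes.length - skipAtEnd)
    // values and turns each result into a character.
    static String decode(int[] codes, int skipAtEnd, int shift) {
        StringBuilder sb = new StringBuilder();
        for (int q = 0; q < codes.length - skipAtEnd; q++) {
            sb.append((char) (codes[q] - shift));
        }
        return sb.toString();
    }
}
```

Calling `ShiftedCodes.decode` with the eleven values from the script, a `skipAtEnd` of 3 and a `shift` of 2 gives back the eight characters of `.replace`.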

And then this.

    ...
    
    var p1 = "(\/[^\\/";
    
    ...
    
    var p2 = "(\/[\\/";
    var sekritVar0003 = "x" + sekritVar0002 + p1 + "\\d]\/g,'')";
    var sekritVar0004 = "z" + sekritVar0002 + p2 + "]\/g,',')";
    
    ...

Oh no the regular expressions are in two halves ! Oh woe is me !

The first one says strip out everything that isn’t a digit or the character ‘/’.

The second one says turn all the ‘/’s into ‘,’s.

If you apply the second to the result of the first you end up with a lot of comma separated integers, and here’s the start of some non-base64 encoded image data.

    fpo10/t10/hmA32/Ac32/nCK32/XX32/yCO32/R32 ...

What a coincidence.
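
A sketch of those two steps in Java, using the data fragment above as input. The class and method names are invented for illustration, and the patterns are a reading of the two reassembled expressions rather than a copy of anything in the script.

```java
// Hypothetical helper: the two replace steps applied to the pseudo-image data.
final class PseudoImageCleaner {
    // Step one: drop every character that is not a digit or '/'.
    static String stripNoise(String s) {
        return s.replaceAll("[^/\\d]", "");
    }

    // Step two: turn each '/' into ','.
    static String slashesToCommas(String s) {
        return s.replace('/', ',');
    }
}
```

Applied in that order to `fpo10/t10/hmA32/Ac32/nCK32/XX32/yCO32/R32` the result is `10,10,32,32,32,32,32,32`.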

As for this

    ...
    
    function sekritFun0001(x)
    {
        var s = [];
    
        var z = sekritFun0002(sekritVar0003);
        z = sekritFun0002(sekritVar0004);
    
        var ar = sekritFun0002("[" + z + "]");
    
        for (var i = 0; i < ar.length; i ++)
        {
            var j = ar[i];
            if ((j >= 33) && (j <= 126))
            {
                s[i] = String.fromCharCode(33 + ((j + 14) % 94));
            }
            else
            {
                s[i] = String.fromCharCode(j);
            }
        }
        return s.join('');
    }
    
    ...

A variable z which we have worked out is a string that looks like a comma separated list of integers is topped and tailed with what looks suspiciously like the delimiters of a Javascript array literal and is then passed to sekritFun0002 and the result is assigned to the variable ar.

Then there is a for loop which is under the mistaken impression that ar references an array.

Now let me guess. The function sekritFun0002 is really the well known turn strings into arrays as if by magic function for which Javascript fortunately defines a four letter abbreviation.

Then there is the body of the loop.

To cut a long story short, here is some Java code which does the same thing as the Javascript function, written in the increasingly popular no-regexp idiom

    private static String decode(byte[] theBytes)
    {
        StringBuilder builder = new StringBuilder();
        int           nBytes  = theBytes.length;
        int           v       = 0;

        for (int i = 0; i < nBytes; i++)
        {
            byte b = theBytes[i];
    
            if (b == '/')
            {
                // end of number
    
                int c = 0;
    
                if ((v >= 33) && (v <= 126))
                {
                    c = 33 + ((v + 14) % 94);
                }
                else
                {
                    c = v;
                }
                builder.append((char)c);
                v = 0;
            }
            else if ((b >= '0') && (b <= '9'))
            {
                v *= 10;
                v += b - '0';
            }
        }
        return builder.toString();
    }

If you run it on the two non-images needless to say you get Javascript.
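
Incidentally the arithmetic in the loop body looks like the rotation usually known as ROT47, that is the 94 printable ASCII characters from 33 to 126 rotated by 47 places, which would make it its own inverse. A small Java sketch to check that, with hypothetical names:

```java
// Hypothetical check that the mapping undoes itself when applied twice.
final class Rot47Check {
    // Same arithmetic as the loop body: values 33..126 are rotated,
    // everything else passes through unchanged.
    static int rotateCode(int v) {
        return (v >= 33 && v <= 126) ? 33 + ((v + 14) % 94) : v;
    }

    // Applies rotateCode to every character of a string.
    static String rotateText(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) rotateCode(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

Applying `rotateText` twice to any string gives back the original, so the same few lines would serve for encoding the Javascript in the first place.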

… Pointless ?

If the point of the obfuscation is that re-obfuscation will result in the hash of the PDF changing then, as I have already observed, there are much much easier ways to achieve the same effect.

If the point of the obfuscation is to conceal the existence of certain strings that could be used to identify the PDF as malicious it is only necessary if the assumption is that something has

  1. parsed the PDF file

  2. extracted Object 1 into a usable state which so far has always required inflating it twice

  3. parsed the XML of the XFA form and identified the different elements

After step 1 the structure of the PDF is apparent and can be considered characteristic of this particular PDF.

After step 2 the ratio of the original to the final size of Object 1 is apparent and can also be considered to be characteristic of this particular PDF.

After step 3 another characteristic of this particular PDF is apparent without even looking at the Javascript elements.

To misquote Mae West, is that an image in your form or are you just pleased to see me ?

The content of the form is dominated by one thing, an image. To all intents and purposes that’s all it is.

If whatever it is has got this far it can be about 99% certain that it is looking at an instance of this particular malicious PDF.

It can increase the certainty still further by inspecting the image without even decoding it.

In short, unless you can conceal the elephant in the form, there is no need to obfuscate the Javascript other than for amusement.

Postscript (Not The Language)

Even if it was a valid PDF I don’t think this one would do what it was intended to do either.



Anatomy Of A PDF Continued: #4 — Part One: Now What ?

My collection of a single PDF continues to grow apace whether I want it to or not.

I got three ‘invoices’ in one go the other day but they were ZIPs, and ZIPs usually mean .exes, and so it proved.

Then today a PDF arrived which not only supposedly originated in a completely different continent to the other three, but is four times as big as the others. That has got to be good hasn’t it ?

It turned out to be a bit of a disappointment taken as a PDF because technically it isn’t one. I know I got it for nothing and everything but honestly it comes to something when people can’t even be bothered to get the format right.

Still, what's the point of writing a whacking great chunk of Objective-C to read well-formed PDFs if you can’t hack lumps out of it until it can read things that are not well-formed PDFs ? After some judicious hard-wiring of this and that I managed to extract Object 1 once again and inflate it twice as per usual.

And ?

And the XML is exactly the same as all the others ?

Yes and no. The overall structure is pretty much the same but the data for the first two images isn’t.

It does not look like Base64 encoded data and Base64 decoding it definitely does not produce Javascript.

A quick look for the Base64 alphabet construction kit reveals that it has gone walkabout, but both images are referenced in the Javascript so the supposition has got to be that it is still really Javascript but it's not Base64 encoded. It's a bit of a disappointment but you have got to move with the times I suppose. Base64 encoding was so last week.

Time to find out what this week's fashionable Javascript encoding technology is, I suppose.



August 26, 2014

Anatomy Of A PDF: Afterword

Filed under: CVE, CVE-2013-2729, PDF, PDF Vulnerability, Security — Tags: , , , — Simon Lewis @ 7:14 am

Since I started writing these posts anonymous benefactors have very kindly presented me with two further versions of the original PDF to add to my collection.

I say versions because although they both possess exactly the same structure as the original, the size of Object 1 is slightly different in each one and the binary sludge is different.

This of course means that the hash of the file will be different in each case, which in turn means it is very likely that any hash based AV scanner will miss these slightly different versions unless they are kept updated with the hashes of these new versions as they appear.
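
To see why, here is a minimal Java sketch. SHA-256 is purely an assumption, since which hash any particular scanner uses is not something these posts know, and the inputs in the usage note are stand-ins rather than real files. The class name is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hex SHA-256 of a byte array.
final class HashDemo {
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Two inputs which differ in a single byte, say the ASCII bytes of `sludge-A` and `sludge-B`, produce unrelated hashes, so a list of known bad hashes says nothing about a file it has not seen before.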

Looking at the actual XML it is apparent that the obfuscation of the Javascript has resulted in different variable names in each version but that there is no difference between what is obfuscated and what is not in any of them.

One additional thing all three versions have in common is that I suspect they won’t actually work.

There is no question what they are trying to do and how they are trying to do it, and at least one working version has been seen in the wild by someone else, but it is a distinct possibility that the versions I have do not work.

I have no way of proving this one way or another as I do not have access to the appropriate environment.

If I am wrong and they will in fact do what they are intended to do I would obviously be interested in knowing why I am wrong. It is all grist to the mill.

If I am right then it is all to the good, at least these particular versions cannot cause any damage, so for obvious reasons I am not going to say why they will not work as intended other than that it is a very simple mistake.



August 25, 2014

Anatomy Of A PDF: Part Eight — The Denouement

Filed under: BMP, PDF, Security, XFA — Tags: , , , , , — Simon Lewis @ 9:42 am

Given that this thing is in the wild and putting aside the possibility that it is a piece of very elaborate performance art then it must be targeting an actual vulnerability.

It's pretty obvious what the program is and what the platform is, so typing those along with terms like PDF, XFA, and BMP in some combination into your search engine of choice turns up all sorts of stuff, but it looks like CVE-2013-2729 is the vulnerability in question.

See here and here for the gory details of how it actually exploits the heap corruption and what happens once it has done so.

The heap implementation targeted is the Low Fragmentation Heap (LFH). See the paper “Understanding the Low Fragmentation Heap” by Chris Valasek for a detailed description of how it works and how to do unpleasant things to it. Be aware that this is a PDF which in the circumstances … ! You can find it here.

For details of how synthesized x86 machine code can be made to run when it shouldn’t be see here. Warning, may contain assembler.



August 24, 2014

Anatomy Of A PDF: Part Five — Q: When Is A Form Not A Form ? A: When It Is A Can Of Worms

Filed under: BMP, Document Format, Image Formats, Javascript, PDF, Programming Languages, XFA — Tags: , , , , — Simon Lewis @ 12:56 pm

Top-Level Structure

This is the top-level structure of the XFA form lurking in Object 1.

I have omitted all the non-interesting elements, elided the contents of the image and script elements, and renamed a few things but apart from that this is what 91MB of XML looks like.

Impressive isn’t it ?


    <xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/" timeStamp="2014-01-21T18:18:41Z">
        <template xmlns="http://www.xfa.org/schema/xfa-template/3.1/">
            <?formServer defaultPDFRenderFormat acrobat9.1static?>
            ...
            <subform ...>
                
                ...
                <field name="Image_1">
                    <ui>
                        <imageEdit/>
                    </ui>
                    <value>
                        <image>...</image>
                    </value>
                </field>
                <field name="Image_2">
                    <ui>
                        <imageEdit/>
                    </ui>
                    <value>
                        <image>...</image>
                    </value>
                </field>
                <variables>
                    <script name="..." contentType="application/x-javascript">...</script>
                    <script name="..." contentType="application/x-javascript">...</script>
                    <?templateDesigner expand 1?>
                </variables>
                <subform ...>
                    <field name="Image_3">
                        <ui>
                            <imageEdit/>
                        </ui>
                        <value>
                            <image>...</image>
                        </value>
                    </field>
                </subform>
                <event activity="initialize" name="...">
                    <script contentType="application/x-javascript">...</script>
                </event>
                <event activity="docReady" ref="$host" name="...">
                    <script contentType="application/x-javascript">...</script>
                </event>
            </subform>
            ...
        </template>
        
        
        ...
        
    </xdp:xdp>    

Scripts

As predicted the form contains scripts.

As you can see there are four script elements containing chunks of Javascript.

Taking all four together there is approximately 20KB of Javascript.

Two of the script elements are associated with events so one lot of Javascript will get to run when the “initialize” event occurs and the other lot when the “docReady” event occurs.

Images

As you can see there are three images.

The default encoding for image data in XFA is Base64. This is not over-ridden anywhere so the data for each image is Base64 encoded.

Obfuscation

What you cannot see, because I’ve omitted it, is that the chunks of Javascript are partially and mildly obfuscated.

More Scripts

To see what I mean by mildly obfuscated consider the following.

One of the Javascript chunks contains these strings.

    "VW~`~`~XYZa!~`bcde!fghij~`~klm~``~~nopqrs~````tuvwx~~~yz01!234~~56789+``/~~~"
    
    "AB!~CDEF!`GHI!``JKL!~~MNOP!```QRSTU"

If you bolt the first on to the end of the second and then strip out all the occurrences of the characters ‘!’, ‘~’, and ‘`’ you end up with

    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

which looks remarkably like the ‘Base 64 Alphabet’ as described in RFC 4648 because that’s what it is.
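
The reassembly can be checked with a few lines of Java. This is a sketch only: the two strings are copied from above and the class and method names are invented for illustration.

```java
// Hypothetical helper: rebuilds the alphabet from the two obfuscated strings.
final class AlphabetKit {
    static final String FIRST =
        "VW~`~`~XYZa!~`bcde!fghij~`~klm~``~~nopqrs~````tuvwx~~~yz01!234~~56789+``/~~~";
    static final String SECOND =
        "AB!~CDEF!`GHI!``JKL!~~MNOP!```QRSTU";

    // Appends FIRST to SECOND and removes every '!', '~' and '`'.
    static String assemble() {
        return (SECOND + FIRST).replaceAll("[!~`]", "");
    }
}
```

`AlphabetKit.assemble()` returns a 64 character string which matches the alphabet shown above character for character.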

There are also unobfuscated references to the images I’ve named “Image_1” and “Image_2” in the Javascript.

Given that

  1. the image data is Base64 encoded, and

  2. that there are the makings of a ‘Base 64 Alphabet’ sat in the Javascript,

it doesn’t take an enormous leap of imagination to wonder whether the referenced images are really images at all or whether the Javascript is going to decode them and turn them into something else entirely.

Extracting the image data into files and Base64 decoding them produces two more chunks of Javascript.
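
A minimal Java sketch of that decoding step, assuming the text of an image element has already been extracted into a string. The class and method names are hypothetical, and the choice of the MIME flavour of decoder is an assumption made so that any line breaks in the XML content are skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: Base64 decodes the text content of an image element.
final class PseudoImageDecoder {
    // The MIME decoder ignores line breaks and other characters
    // outside the Base64 alphabet.
    static String decodeToText(String imageElementText) {
        byte[] raw = Base64.getMimeDecoder().decode(imageElementText);
        // ISO-8859-1 maps every byte to exactly one character, so nothing is lost.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

If what comes out reads like script rather than pixels, it was never an image.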

One of them contains a table indexed by version number indicating presumably that the Javascript can tailor its behaviour depending on what version of the target executable it finds itself running in.

Dark Matter

OK so there’s around 20KB of Javascript.

The two pseudo images taken together are about 10KB.

There is maybe another 10KB of XFA related random angle bracket action.

Where is the other ~90.95MB ?

It's the data for what I’ve named “Image_3”.

A Really Really Big Image

Apparently Image_3 is a really really big image, but is it ?

Given that Image_1 and Image_2 turned out not to be images at all is Image_3 also something else in disguise ?

Base64 decoding the data does not produce Javascript.

What it produces is this

    0000000 42 4d 00 00 00 00 00 00 00 00 00 00 00 00 40 00
    0000010 00 00 2e 01 00 00 01 00 00 00 01 00 08 00 01 00
    0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
    0000030 00 00 00 00 00 00 52 47 42 41 52 47 42 41 00 02
    0000040 ff 00 00 02 ff 00 00 02 ff 00 00 02 ff 00 00 02
    *
    4040440 f8 00 00 08 01 00 00 00 00 00 27 05 00 02 00 ff
    4040450 00 02 00 ff 00 02 00 ff 00 02 00 ff 00 02 00 ff
    *
    4040470 00 02 00 ff 00 0a 58 58 58 58 58 58 58 58 58 58
    4040480

Yes that is the entire thing.

As you can see although it starts off promisingly it becomes tediously predictable almost straightaway with the four bytes

    00 02 ff 00

repeating over and over and over again.

That’s what’s IN it, but what IS it ?

The first two bytes are the ASCII characters ‘B’ and ‘M’ which would seem to indicate that it is a BMP image which is a pain because the BMP image format is not officially documented.

According to the various bits of unofficial documentation BMP images can have a bewildering variety of headers and this one seems to have an OS/2 2.x header.

Treating it as an OS/2 2.x header would mean that the compression type of the image is RLE/8 where RLE means run-length encoded. This is supported by the image data which follows the header which makes sense as RLE/8 data.
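
The relevant fields can be picked out of the dump by hand, or with something like the following Java sketch. The field offsets are an assumption based on the same kind of unofficial documentation (a 14 byte file header followed by an info header starting with its own size), and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: reads a handful of BMP header fields.
final class BmpHeaderPeek {
    final int headerSize;    // 4 bytes at offset 14
    final int width;         // 4 bytes at offset 18
    final int height;        // 4 bytes at offset 22
    final int bitsPerPixel;  // 2 bytes at offset 28
    final int compression;   // 4 bytes at offset 30

    private BmpHeaderPeek(ByteBuffer b) {
        headerSize   = b.getInt(14);
        width        = b.getInt(18);
        height       = b.getInt(22);
        bitsPerPixel = b.getShort(28) & 0xFFFF;
        compression  = b.getInt(30);
    }

    static BmpHeaderPeek parse(byte[] data) {
        if (data.length < 34 || data[0] != 'B' || data[1] != 'M') {
            throw new IllegalArgumentException("not a BMP file header");
        }
        // Multi-byte BMP fields are little-endian.
        return new BmpHeaderPeek(ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN));
    }
}
```

Fed the first 34 bytes of the dump above this gives a header size of 64, a width of 302, a height of 1, 8 bits per pixel and a compression type of 1, which is the value the unofficial documentation gives for RLE/8.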

Javascript + BMP Image == What ?

So there you have it.

An XFA form with four chunks of Javascript partially and not very successfully obfuscated, two hidden chunks of Javascript likewise, and one great big run-length encoded BMP image.

Why ?

What is this particular can of worms going to do when unleashed on a poor, unsuspecting, but probably not very little executable ?



August 23, 2014

Anatomy Of A PDF: Part Four — Beware Of Objects Bearing Gifts Or Containing Forms

Filed under: Document Format, PDF, PDF Objects — Tags: , , , — Simon Lewis @ 7:07 pm

The start of Object 1 looks like this.

    1 0 obj
    <</Filter [/Fl /Fl] /Length 13178 >>
    stream
    
    ...

and the end of it looks like this

    ...
    
    endstream
    endobj
    
    ...

and in between there are 13178 bytes of assorted binary sludge.

This is because the contents of the object have been filtered before being written to the file.

This is what the presence of the Filter entry in the object’s dictionary is telling us.

The associated value is an array which specifies the filters that have been applied.

    Fl

is an abbreviation for

    FlateDecode

There are two entries, so the FlateDecode filter has been applied twice, first to the original object, and then to the output produced by the first pass.

The FlateDecode filter decodes data which has been compressed using the zlib/deflate method, so what is in the file is DEFLATE compressed data.

To find out what is in Object 1 we are going to have to inflate it twice.
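
A minimal Java sketch of the double inflation, assuming the raw bytes between stream and endstream have already been extracted. It uses the standard java.util.zip.Inflater rather than the Objective-C used for these posts, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper: undoes a [/Fl /Fl] filter chain.
final class DoubleInflate {
    // Inflates one layer of zlib wrapped DEFLATE data.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no end marker: the input is incomplete.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // One pass per entry in the Filter array.
    static byte[] inflateTwice(byte[] streamBytes) {
        return inflate(inflate(streamBytes));
    }
}
```

The whole output is held in memory, which is fine for a sketch but worth bearing in mind when the result is a lot bigger than the input.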

Seconds Out, Round One

Inflating it once gives us 408521 bytes of compressed XML, up from 13178, so it's obviously got a lot of zeroes in it or something.

Seconds Out, Round Two

Inflating our collection of 408521 newly minted bytes we end up with a decidedly non-trivial 91044256 bytes of XML which is a whole lot of angle brackets.

It's going to take forever to fill in all the fields in this thing.



Anatomy Of A PDF: Part Three — Objects, Objects, Objects

According to the cross-reference table there are six objects in use starting with object number 1.

Starting from the root object as specified by the trailer.

Object 3

Object 3 starts at offset 13294 which is 0x33EE.

    3 0 obj
    <<
    /Extensions <</ADBE <</ExtensionLevel 3 /BaseVersion /1.7 >> >>
    /AcroForm 2 0 R
    /Type /Catalog
    /Pages 4 0 R
    /NeedsRendering true
    >>
    endobj

The root object should be a Catalog and the Type entry shows that it is.

AcroForm

The presence of the AcroForm entry indicates that the PDF contains an interactive form.

The entry references Object 2 which should therefore be an interactive form dictionary specifying the form.

Pages

The Pages entry references Object 4 which should therefore be a page tree node.

NeedsRendering

The NeedsRendering entry is another clue that the document contains a form.

Object 2

Object 2 starts at offset 13263 which is 0x33CF and it looks like this.

    2 0 obj
    <</XFA 1 0 R >>
    endobj

According to Object 3, the Document Catalog, this should be an interactive form dictionary and the presence of the XFA entry confirms that it is.

XFA is the XML Forms Architecture. Amongst other things it supports scripting, and in the Adobe implementation the scripting support includes Javascript, which in this context seems highly significant.

The XFA entry identifies the object which contains the XML which describes the XFA form.

The referenced object is Object 1, so ten to one on there are scripts in Object 1 and they do something unpleasant.

Object 1

Object 1 starts at offset 15.

According to Object 2 it should contain XML specifying an XFA form.

It is the biggest object by far so it definitely contains something, and presumably something nasty at that, so we’ll save it for later.

Object 4

Object 4 starts at 13443 which is 0x3483 and it looks like this

    4 0 obj
    <<
        /Count 1
        /Kids [5 0 R]
        /Type /Pages
    >>
    endobj

According to Object 3, the Document Catalog, this should be a page tree node with a Type of Pages and it is.

The value of the Kids (sic) entry is an Array of Object References of length one.

This entry identifies the children of this object, which are either collections of pages or individual pages which make up the document.

Object 5

Object 5 starts at 13499 which is 0x34BB and it looks like this

    5 0 obj
    <<
        /Parent 4 0 R
        /Type /Page
        /Contents 6 0 R
        /Resources << /Font <</F1 <</BaseFont /Helvetica /Subtype /Type1 /Name /F1 >> >> >>
    >>
    endobj

According to Object 4, the Pages Object, this should be either another page tree node or a page object. The Type entry Page tells us that it is a page object.

The Parent entry is a reference to Object 4 which is indeed the parent of this object.

The Contents entry is a reference to Object 6.

The Resources entry is an example of an entry whose value is a dictionary, which is itself an example of a dictionary with entries whose values are dictionaries.

Object 6

Object 6 starts at 13644 which is 0x354c.

    6 0 obj
    <</Length 23 >>
    stream
    BT /F1 24 Tf 100 100 Td
    endstream
    endobj

We know that it is the contents of the page represented by Object 5.

It is in fact a Content Stream which comprises a series of graphic operators and their operands, with the operands preceding their operators (cough, Postscript, cough).

For what it's worth

    BT

is the ‘begin text’ operator.

    Tf

is the ‘set text font and size’ operator.

It is preceded by its operands

    /F1 24

with F1 being the font as defined in Object 5 and 24 being the size.

    Td

is the ‘move text’ operator.

It is preceded by its operands

    100 100

And that’s it. Not a very exciting page it has to be said.
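
The operands-first ordering makes a content stream easy to pick apart. A minimal Java sketch, assuming whitespace separated tokens and recognising only the three operators discussed above; a real content stream has many more operators, and strings and arrays which this does not handle. The class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups operands with the operator that follows them.
final class ContentStreamSketch {
    // One operator together with the operands that preceded it.
    static final class Operation {
        final String operator;
        final List<String> operands;

        Operation(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    // Only the operators which appear in Object 6.
    private static final List<String> OPERATORS = Arrays.asList("BT", "Tf", "Td");

    static List<Operation> parse(String content) {
        List<Operation> operations = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (OPERATORS.contains(token)) {
                // An operator consumes everything collected since the last one.
                operations.add(new Operation(token, new ArrayList<>(pending)));
                pending.clear();
            } else {
                pending.add(token);
            }
        }
        return operations;
    }
}
```

Parsing the 23 bytes of Object 6 yields three operations: BT with no operands, Tf with `/F1` and `24`, and Td with `100` and `100`.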



Anatomy Of A PDF: Part Two — Its One Of Those Start At The End Of Formats …

The portable document format started life as one of those formats where you have to start at the end, like ZIPs. It seems to have been all the rage at one time, ‘tho not so much these days.

Times having changed, and document formats where you have to start at the end having gone out of fashion, it is now possible to produce PDFs which can be read from the front, but where’s the fun in that ?

Anyway, here’s the very end of my PDF file

    ...
    
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    0003630  \n   <   <   /   R   o   o   t       3       0       R       /
    0003640   S   i   z   e       7       >   >  \n   s   t   a   r   t   x
    0003650   r   e   f  \n   1   3   7   1   6  \n   %   %   E   O   F
    000365f

You can tell it's the end because it says

    %%EOF

at the end, which is helpful.

As you can see it's mostly printable ASCII, except for the line endings, which aren’t.

For some reason the internal data in a PDF file is often human readable and the actual contents aren’t.

Formatting the ASCII bit according to the line endings gives us

    ...
    
    trailer
    <</Root 3 0 R /Size 7 >>
    startxref
    13716
    %%EOF

Dictionaries

A lot of the internal data in a PDF file is in the form of dictionaries which are collections of key/value pairs as you might expect.

Dictionaries are delimited by pairs of angle brackets like so

    << ... >>

In the example above you can see that on the line following the word trailer is a dictionary

    ...
    
    <</Root 3 0 R /Size 7 >>
    
    ...

with two entries

    Root

and

    Size

Dictionary keys are Names and Names are always prefixed with a '/'.

The corresponding values may be of any PDF data type including Names, Arrays, and Dictionaries.

Dictionaries often contain a Type entry which specifies the type of the dictionary from which it is possible to determine, via the PDF specification, what other entries the dictionary must, or can, contain.

Dictionaries do not have to be written on a single line, it just so happens that the one in the trailer is.

Objects And Object References

A PDF document comprises a number of objects.

Each object has a number and a generation number.

An object is referred to using both its object number and its generation number.

In the internal data of a PDF file object references are written like so

    object-number generation-number 'R'

The value of the Root entry in the dictionary shown above

    3 0 R

is an example of an object reference as it appears in internal data.

Objects in a PDF file always start with the object number and the generation number so when following references to objects for example, you can always work out whether the object you’ve got is the one you are expecting.
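
That check can be sketched in a few lines of Java. The class and method names are hypothetical, and the sketch assumes the reference and the first line of the object have already been isolated as strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: an object reference such as "3 0 R".
final class ObjectRef {
    final int objectNumber;
    final int generation;

    private static final Pattern REF = Pattern.compile("(\\d+)\\s+(\\d+)\\s+R");

    private ObjectRef(int objectNumber, int generation) {
        this.objectNumber = objectNumber;
        this.generation = generation;
    }

    static ObjectRef parse(String text) {
        Matcher m = REF.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not an object reference: " + text);
        }
        return new ObjectRef(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }

    // True if a first line such as "3 0 obj" belongs to the referenced object.
    boolean matchesHeader(String headerLine) {
        return headerLine.trim().equals(objectNumber + " " + generation + " obj");
    }
}
```

For the Root entry above, `ObjectRef.parse("3 0 R")` gives object number 3 and generation 0, and `matchesHeader("3 0 obj")` confirms that the right object has been found.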

The Trailer

The trailer of a PDF file starts with the line trailer, followed by a dictionary, followed by the line startxref, followed by an offset followed by the EOF marker.

The Root entry in the trailer dictionary specifies the root object of the document. From the root object you can find all the other objects in the file one way or another.

The Size entry in the trailer dictionary specifies the total number of entries in the document’s cross-reference table.

The offset following the startxref line is the offset of the document’s cross-reference table within the file.
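Working backwards from the end of the file to that offset might look something like the following Python sketch. The function name find_startxref is an assumption of this example, and it assumes a file with a single, well-formed trailer like the one shown here.

```python
def find_startxref(data: bytes) -> int:
    """Return the cross-reference table offset given after 'startxref'."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no EOF marker")
    pos = data.rfind(b"startxref", 0, eof)
    if pos == -1:
        raise ValueError("no startxref line")
    # The offset is the first token between 'startxref' and '%%EOF'.
    fields = data[pos + len(b"startxref"):eof].split()
    if not fields:
        raise ValueError("no offset after startxref")
    return int(fields[0].decode("ascii"))

tail = b"trailer\n<</Root 3 0 R /Size 7 >>\nstartxref\n13716\n%%EOF"
xref_offset = find_startxref(tail)
print(xref_offset, hex(xref_offset))
```

Searching from the end is the natural way round, since the trailer is the one part of the file whose position you know without reading anything else.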

The Cross-Reference Table

The cross-reference table starts at 13716 which is 0x3594.

    ...
    
    0003590   o   b   j  \n   x   r   e   f  \n   0       7  \n   0   0   0
    00035a0   0   0   0   0   0   0   0       6   5   5   3   5       f
    00035b0  \n   0   0   0   0   0   0   0   0   1   5       0   0   0   0
    00035c0   0       n      \n   0   0   0   0   0   1   3   2   6   3
    00035d0   0   0   0   0   0       n      \n   0   0   0   0   0   1   3
    00035e0   2   9   4       0   0   0   0   0       n      \n   0   0   0
    00035f0   0   0   1   3   4   4   3       0   0   0   0   0       n
    0003600  \n   0   0   0   0   0   1   3   4   9   9       0   0   0   0
    0003610   0       n      \n   0   0   0   0   0   1   3   6   4   4
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    
    ...

Formatting the bit starting at 0x3594 by obeying the line endings gives us

    ...
    
    xref
    0 7
    0000000000 65535 f
    0000000015 00000 n
    0000013263 00000 n
    0000013294 00000 n
    0000013443 00000 n
    0000013499 00000 n
    0000013644 00000 n
    
    ...

The second line of the cross-reference table

    0 7

specifies the number of the first object which has an entry in this table, and the number of entries.

In this case the number of the first object is 0 and there are seven entries.

The following seven lines are the entries for objects 0 to 6.

Each entry specifies the offset of the object within the file, the generation number of the object and whether it is in use (‘n’) or free (‘f’).

For example, the second entry

    0000000015 00000 n

tells us that object 1 is at offset 15 within the file, its generation number is 0, and it is in use.
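Reading the table programmatically is straightforward. Here is a minimal Python sketch, using the made-up function name parse_xref, which assumes the table has a single subsection as this one does and that it has already been extracted as text.

```python
def parse_xref(text: str) -> dict:
    """Map object number -> (offset, generation number, in_use)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "xref":
        raise ValueError("not a cross-reference table")
    # Second line: number of the first object, then the number of entries.
    first, count = (int(field) for field in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, kind = line.split()
        entries[first + index] = (int(offset), int(generation), kind == "n")
    return entries

table = """xref
0 7
0000000000 65535 f
0000000015 00000 n
0000013263 00000 n
0000013294 00000 n
0000013443 00000 n
0000013499 00000 n
0000013644 00000 n
"""

for number, entry in parse_xref(table).items():
    print(number, entry)
```

The entry for object 3, the root object named in the trailer, gives the offset at which to start reading the document proper.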


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part One — Enter A PDF Stage Left

Filed under: Document Format, PDF — Tags: , , — Simon Lewis @ 6:11 am

Some nice people have just sent me a PDF file.

I didn’t ask them to but they sent it nonetheless.

It’s a shame they cannot spell the word ‘invoice’ properly but it’s the thought that counts isn’t it ?

Since I once made the mistake of finding out how PDFs ‘work’, so to speak (you really don’t want to know, but I am going to tell you anyway), I was curious as to what exactly my unsolicited PDF file might contain instead of an ‘inovice’, and what would have happened had I been unwise enough to attempt to open it using, I would assume, Acrobat Reader running on Windows.

Here’s the edited output of

    hexdump -c

run on my little, it’s only 14KB, PDF.


    0000000   %   P   D   F   -   1   .   5  \n   %  [.] [.] [.] [.] \n   1
    0000010       0       o   b   j  \n   <   <   /   F   i   l   t   e   r
    0000020       [   /   F   l       /   F   l   ]       /   L   e   n   g
    0000030   t   h       1   3   1   7   8       >   >  \n   s   t   r   e

    ...
    
    00033c0   d   s   t   r   e   a   m  \n   e   n   d   o   b   j  \n   2
    00033d0       0       o   b   j  \n   <   <   /   X   F   A       1
    00033e0   0       R       >   >  \n   e   n   d   o   b   j  \n   3
    00033f0   0       o   b   j  \n   <   <   /   E   x   t   e   n   s   i
    0003400   o   n   s       <   <   /   A   D   B   E       <   <   /   E
    0003410   x   t   e   n   s   i   o   n   L   e   v   e   l       3
    0003420   /   B   a   s   e   V   e   r   s   i   o   n       /   1   .
    0003430   7       >   >       >   >       /   A   c   r   o   F   o   r
    0003440   m       2       0       R       /   T   y   p   e       /   C
    0003450   a   t   a   l   o   g       /   P   a   g   e   s       4
    0003460   0       R       /   N   e   e   d   s   R   e   n   d   e   r
    0003470   i   n   g       t   r   u   e       >   >  \n   e   n   d   o
    0003480   b   j  \n   4       0       o   b   j  \n   <   <   /   C   o
    0003490   u   n   t       1       /   K   i   d   s       [   5       0
    00034a0       R   ]       /   T   y   p   e       /   P   a   g   e   s
    00034b0       >   >  \n   e   n   d   o   b   j  \n   5       0       o
    00034c0   b   j  \n   <   <   /   P   a   r   e   n   t       4       0
    00034d0       R       /   T   y   p   e       /   P   a   g   e       /
    00034e0   C   o   n   t   e   n   t   s       6       0       R       /
    00034f0   R   e   s   o   u   r   c   e   s       <   <   /   F   o   n
    0003500   t       <   <   /   F   1       <   <   /   B   a   s   e   F
    0003510   o   n   t       /   H   e   l   v   e   t   i   c   a       /
    0003520   S   u   b   t   y   p   e       /   T   y   p   e   1       /
    0003530   N   a   m   e       /   F   1       >   >       >   >       >
    0003540   >       >   >  \n   e   n   d   o   b   j  \n   6       0
    0003550   o   b   j  \n   <   <   /   L   e   n   g   t   h       2   3
    0003560       >   >  \n   s   t   r   e   a   m  \n   B   T       /   F
    0003570   1       2   4       T   f       1   0   0       1   0   0
    0003580   T   d  \n   e   n   d   s   t   r   e   a   m  \n   e   n   d
    0003590   o   b   j  \n   x   r   e   f  \n   0       7  \n   0   0   0
    00035a0   0   0   0   0   0   0   0       6   5   5   3   5       f
    00035b0  \n   0   0   0   0   0   0   0   0   1   5       0   0   0   0
    00035c0   0       n      \n   0   0   0   0   0   1   3   2   6   3
    00035d0   0   0   0   0   0       n      \n   0   0   0   0   0   1   3
    00035e0   2   9   4       0   0   0   0   0       n      \n   0   0   0
    00035f0   0   0   1   3   4   4   3       0   0   0   0   0       n
    0003600  \n   0   0   0   0   0   1   3   4   9   9       0   0   0   0
    0003610   0       n      \n   0   0   0   0   0   1   3   6   4   4
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    0003630  \n   <   <   /   R   o   o   t       3       0       R       /
    0003640   S   i   z   e       7       >   >  \n   s   t   a   r   t   x
    0003650   r   e   f  \n   1   3   7   1   6  \n   %   %   E   O   F
    000365f


I’ve chopped out the bit in the middle because, as we will see, in its current format it’s not very interesting at all.

You can tell it’s a PDF because it says so at the front.
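The same check can be made in code. A minimal Python sketch, with the invented function name pdf_version, which assumes the header sits at the very start of the data as it does in this file:

```python
import re

# Hypothetical helper: matches a header of the form %PDF-M.N at offset 0.
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")

def pdf_version(data: bytes):
    """Return (major, minor) from the header, or None if it is missing."""
    m = HEADER.match(data)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(pdf_version(b"%PDF-1.5\n"))
```

Of course, all this tells you is what the file claims to be.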

