Just An Application

August 28, 2014

Anatomy Of A PDF Continued: #4 — Part Two: What Is It With This Obfuscation Lark ?

Javascript Obfuscation Considered …

What IS the point of doing this

    ...
    
    var sekritVar0001 = [0x30,0x74,0x67,0x72,0x6E,0x63,0x65,0x67, 1, 10, 40];
    
    ...

and then this somewhere else ?

    ...
    
    for(var q = 0; q < sekritVar0001.length-3; q++)
    sekritVar0002 += String.fromCharCode(sekritVar0001[q]-2);

    ...

Anybody who has managed to lever open a mal-formed PDF and get to the point where they can actually see it is hardly likely to give up at this juncture just because they are going to have to iterate over an array subtract 2 from every element and turn into a character now are they ?

It says

   .replace

for pity’s sake.

And then this.

   ...
    
    var p1 = "(\/[^\\/";
    
    ...
    
    var p2 = "(\/[\\/";
    var sekritVar0003 = "x" + sekritVar0002 + p1 + "\\d]\/g,'')";
    var sekritVar0004 = "z" + sekritVar0002 + p2 + "]\/g,',')";
    
    ...

Oh no the regular expressions are in two halves ! Oh woe is me !

The first one says strip out everything that isn’t a digit or the character ‘/’.

The second one says turn all the ‘/’s into ‘,’s.

If you apply the second to the result of the first you end up with a lot of comma separated integers, and here’s the start of some non-base64 encoded image data.

    fpo10/t10/hmA32/Ac32/nCK32/XX32/yCO32/R32 ...

What a coincidence.

As for this

    ...
    
    function sekritFun0001(x)
    {
        var s = [];
    
        var z = sekritFun0002(sekritVar0003);
        z = sekritFun0002(sekritVar0004);
    
        var ar = sekritFun0002("[" + z + "]");
    
        for (var i = 0; i < ar.length; i ++)
        {
            var j = ar[i];
            if ((j >= 33) && (j <= 126))
            {
                s[i] = String.fromCharCode(33 + ((j + 14) % 94));
            }
            else
            {
                s[i] = String.fromCharCode(j);
            }
        }
        return s.join('');
    }
    
    ...

A variable z which we have worked out is a string that looks like a comma separated list of integers is topped and tailed with what looks suspiciously like the delimiters of a Javascript array literal and is then passed to sekritFun0002 and the result is assigned to the variable ar.

Then there is a for loop which is under the mistaken impression that ar references an array.

Now let me guess. The function sekritFun0002 is really the well known turn strings into arrays as if by magic function for which Javascript fortunately defines a four letter abbreviation.

Then there is the body of the loop.

To cut a long story short here is some Java code which does the same thing as the Javascript function written in the increasingly popular no-regexp idiom

    private static String decode(byte[] theBytes)
    {
        StringBuilder builder = new StringBuilder();
        int           nBytes  = theBytes.length;
        int           v       = 0;

        for (int i = 0; i < nBytes; i++)
        {
            byte b = theBytes[i];
    
            if (b == '/')
            {
                // end of number
    
                int c = 0;
    
                if ((v >= 33) && (v <= 126))
                {
                    c = 33 + ((v + 14) % 94);
                }
                else
                {
                    c = v;
                }
                builder.append((char)c);
                v = 0;
            }
            else
            if (b >= '0' && (b <= '9'))
            {
                v *= 10;
                v += b - '0';
            }
        }
        return builder.toString();
    }

If you run it on the two non-images needless to say you get Javascript.

… Pointless ?

If the point of the obfuscation is that re-obfuscation will result in the hash of the PDF changing then, as I have already observed, there are much much easier ways to achieve the same effect.

If the point of the obfuscation is to conceal the existence of certain strings that could be used to identify the PDF as malicious it is only necessary if the assumption is that something has

  1. parsed the PDF file

  2. extracted Object 1 into a usable state which so far has always required inflating it twice

  3. parsed the XML of the XFA form and identified the different elements

After step 1 the structure of the PDF is apparent and can be considered characteristic of this particular PDF.

After step 2 the ratio of the original to the final size of Object 1 is apparent and can also considered to be characteristic of this particular PDF.

After step 3 another characteristic of this particular PDF is apparent without even looking at the Javascript elements.

To misquote Mae West is that an image in your form or are you just pleased to see me ?

The content of the form is dominated by one thing, an image. To all intents and purposes that’s all it is.

If whatever it is has got this far it can be about 99% certain that it is looking at an instance of this particular malicious PDF.

It can increase the certainty still further by inspecting the image without even decoding it.

In short, unless you can conceal the elephant in the form, there is no need to obfuscate the Javascript other than for amusement.

Postscript (Not The Language)

Even if it was a valid PDF I don’t think this one would do what it was intended to do either.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF Continued: #4 — Part One: Now What ?

My collection of a single PDF continues to grow apace whether I want it to or not.

I got three ‘invoices’ in one go the other day but they were ZIPs and ZIPs usually means .exes and so it proved.

Then today a PDF arrived which not only supposedly originated in a completely different continent to the other three, but is four times as big as the others. That has got to be good hasn’t it ?

It turned out to be a bit of disappointment taken as PDF because technically it isn’t one. I know I got it for nothing and everything but honestly it comes to something when people can’t even be bothered to get the format right.

Still whats the point of writing a whacking great chunk of Objective-C to read well-formed PDFs if you can’t hack lumps out of it until it can read things that are not well-formed PDFs ? After some judicious hard-wiring of this and that I managed to extract Object 1 once again and inflate it twice as per usual.

And ?

And the XML is exactly the same as all the others ?

Yes and no. The overall structure is pretty much the same but the data for the first two images isn’t.

It does not look like Base64 encoded data and Base64 decoding it definitely does not produce Javascript.

A quick look for the Base64 alphabet construction kit reveals that it has gone walk about, but both images are referenced in the Javascript so the supposition has got to be that it is still really Javascript but its not Base64 encoded. Its a bit of a disappointment but you have got to move with the times I suppose. Base64 encoding was so last week.

Time to find out what this weeks fashionable Javascript encoding technology is i suppose.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

August 26, 2014

Anatomy Of A PDF: Afterword

Filed under: CVE, CVE-2013-2729, PDF, PDF Vulnerability, Security — Tags: , , , — Simon Lewis @ 7:14 am

Since I started writing these posts anonymous benefactors have very kindly presented me with two further versions of the original PDF to add to my collection.

I say versions because although they both possess exactly the same structure as the original, the size of Object 1 is slightly different in each one and the binary sludge is different.

This of course means that the hash of the file will be different in each case, which in turn means it is very likely that any hash based AV scanner will miss these slightly different versions unless they are kept updated with the hashes of these new versions as they appear.

Looking at the actual XML it is apparent that the obfuscation of the Javascript has resulted in different variable names in each version but that there is no difference between what is obfuscated and what is not in any of them.

One additional thing all three versions have in common is that I suspect they won’t actually work.

There is no question what they are trying to do and how they are trying to do it, and at least one version has been seen in the wild by someone else that works, but it is a distinct possibility that the versions I have do not.

I have no way of proving this one way or another as I do not have access to the appropriate environment.

If I am wrong and they will in fact do what they are intended to do I would obviously be interested in knowing why I am wrong. It is all grist to the mill.

If I am right then it is all to the good, at least these particular versions cannot cause any damage, so for obvious reasons I am not going to say why they will not work as intended other than that it is a very simple mistake.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

August 25, 2014

Anatomy Of A PDF: Part Eight — The Denouement

Filed under: BMP, PDF, Security, XFA — Tags: , , , , , — Simon Lewis @ 9:42 am

Given that this thing is in the wild and putting aside the possibility that it is a piece of very elaborate performance art then it must be targeting an actual vulnerability.

Its pretty obvious what the program is and what the platform is, so typing those along with terms like PDF, XFA, and BMP in some combination into your search engine of choice turns up all sorts of stuff but it looks like CVE-2013-2729 is the vulnerability in question.

See here and here for the gory details of how it actually exploits the heap corruption and what happens once it has done so.

The heap implementation targeted is the Low Fragmentation Heap (LFH). See the paper “Understanding the Low Fragmentation Heap” by Chris Valasek for a detailed description of how it works and how to do unpleasant things to it. Be aware that this is a PDF which in the circumstances … ! You can find it here.

For details of how synthesized x86 machine code can be made to run when it shouldn’t be see here. Warning, may contain assembler


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part Seven — How To Overflow An Unsigned Integer Using Nothing But Bytes !

Filed under: BMP, BMP Run Length Encoding, Image Formats, Security — Tags: , , , — Simon Lewis @ 8:33 am

The theory of overflowing unsigned integers is, fortunately, well understood.

You fill them up with ‘F’s, you can use ‘f’s instead if you prefer, and then you add 1, like so.

    ...
    
    uint32_t i = 0xFFFFFFFF;

    i += 1;
    
    ...

and i is now zero.

Nothing to it.

Alternative combinations of ‘F’s not to mention other hex digits and increments are possible but its best to start off with the base case and make sure you’ve got the hang of it before moving on to the more advanced stuff.

One obvious application of this is when you have an excess of large unsigned integers and a shortage of zeroes. You can use this technique to turn the former into the latter.

Another application is this

    0000000 42 4d 00 00 00 00 00 00 00 00 00 00 00 00 40 00
    0000010 00 00 2e 01 00 00 01 00 00 00 01 00 08 00 01 00
    0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
    0000030 00 00 00 00 00 00 52 47 42 41 52 47 42 41 00 02
    0000040 ff 00 00 02 ff 00 00 02 ff 00 00 02 ff 00 00 02
    *
    4040440 f8 00 00 08 01 00 00 00 00 00 27 05 00 02 00 ff
    4040450 00 02 00 ff 00 02 00 ff 00 02 00 ff 00 02 00 ff
    *
    4040470 00 02 00 ff 00 0a 58 58 58 58 58 58 58 58 58 58
    4040480

which is the really really big BMP that came free with the PDF.

The result of Base64 decoding the image data in the XML clocks in at 67,372,160 bytes, which is surprisingly large for an image which is only 302×1.

The reason that you can fit 67MB of data on nine lines of hexdump output is that the data is intensively repetitive.

The vast majority of the file comprises the four bytes

    00 02 ff 00

The first occurence is at 0x00003E and the last occurence ends at 0x404043E which gives a grand total of 0x4040400 bytes or 0x1010100 repetitions.

If you multiply 0xFF by 0x1010100 you get 0xFFFFFF00 which is a mere 0xFF+1 away from being 0 in certain circumstances.

This is interesting because that multiplication is equivalent to the effect of the four bytes

    00 02 ff 00

being repeated 16843008 times in the BMP image data.

As we know the header specifies that the image is run length encoded which means that the image data is a series of commands which are evaluated at runtime to produce the bytes that comprise the image.

The image is built up left to right, bottom to top as the commands are evaluated.

If the result of evaluating a command is a sequence of bytes, those bytes are appended at the offset within the current line specified by the current x position.

Both the current x position and the current y position, which specifies the current line, can be changed using a delta command.

The four bytes

    00 02 ff 00

are an example of a delta command the effect of which is to add 0xFF to the current x position.

If that command is successfully evaluated 16843008 times then you know

  1. that whatever is doing the evaluating is not doing any kind of bounds checking in this particular case, and

  2. that the current x position is now 0xFFFFFF00

The command at 0x404043E which immediately follows the repetitions is another delta command

    00 02 f8 00

except that this time it adds 0xF8 to the current x position which brings it up 0xFFFFFFf8 as well as creating the expectation that the next command is going to feature the number 8.

The next command is at 0x4040442 and it is this

    00 08 01 00 00 00 00 00 27 05

and lo and behold there is the number 8.

The effect of the command is to add the last eight bytes of the command as image data at the current x position

Clearly at this point it would be a very irresponsible of whatever it is that is evaluating the commands not to perform a bounds check, which means that we are going to discover exactly how the current x position is being represented.

The width and height fields of this particular BMP are signed 32-bit integers so it would not be unreasonable for the current x position to be represented using an unsigned 32 bit integer.

If this command succeeds than either

  • there is no bounds checking

OR

  • the current x position is represented by the command evaluator as an unsigned 32 bit integer

Whichever it is, those eight bytes are going to end up somewhere where they are going to make no useful contribution to the image at all.

After all the excitement of adding some actual bytes, albeit in the wrong place, it all seems to go horribly wrong.

Starting at 0x404044C another delta command

    00 02 00 ff

which adds 0xFF to the current y position, is repeated ten times.

The final command at 0x4040474

    00 0a 58 58 58 58 58 58 58 58 58 58

then adds ten bytes at the current x position, except of course that it doesn’t.

The current y position is out of bounds, the image is supposed to be 302×1 after all, and nothing like large enough to benefit from any kind of unsigned integer overflow and that is intentional.

Causing the image load to fail results in the memory allocated for the image being freed.

If the eight bytes that actually got written have ended up in the right wrong place then presumably their effect is to cause the heap management code to free the wrong object at this point.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part Six — The Allocation And Deallocation Of Strings By Javascript Considered Harmful ?

Filed under: Javascript, Security — Tags: , — Simon Lewis @ 6:53 am

When it is not Base64 decoding other bits of Javascript disguised as images the Javascript in the XFA form is mostly messing about with strings which on the face of it doesn’t look as though it could be that dangerous.

The messing about occurs in three distinct phases.

The Allocation And Selective Deallocation Phase

The Javascript triggered by the initialize event starts by creating four arrays.

It then proceeds to set each element of the first array to a newly allocated string.

Each string is the result of concatenating two distinctive patterns plus two elements computed using the current iteration index.

Interestingly the size of the created strings looks very much as though it may be the same as the width of the really really big BMP image.

Having gone to all that trouble to create all those different strings it then proceeds to repeatedly null out every tenth element of the first array.

Search Phase

The search phase is performed by the Javascript triggered by the docReady event.

The code iterates over all the elements of the first array. If the element is not null the first character of the string is compared with the value that was used to initialize it when the string was created.

If they are not equal the index of the element is noted, the contents of part of the changed string are saved and the iteration terminates.

Targetted Deallocation and Second Allocation Phase

This phase is also performed by the Javascript triggered by the docReady event but only occurs if a changed string was found during the search phase.

This phase begins by assembling a new string from various bits of hex.

The element in the first array which references the changed string and the element before it in the array are then repeatedly set to null.

Each element of the second array is then set to a string created from the new string constructed at the start of this phase.

It is at this point there is a reference to “Image_2” which we know to be Base64 encoded Javascript.

The image data is decoded and then handed off to eval which is masequerading under an alias !

The “Image_2” code builds another string out of information from the version indexed table in “Image_1”, large amounts of repeated hex, some of which is in all probability x86 code, the strings

    "VirtualProtect"

and

    "KERNEL32"

plus the result of unescaping a big chunk of data which, it turns out, contains, amongst other things, the hard-wired URL of a .exe.

I cannot begin to imagine what they are going to try and do with all of that !

The string is then repeatedly concatenated with itself until the result is a string which occupies at least 1MB of memory.

One hundred copies of this string are then created, presumably just to be on the safe side ?

By this point the heap must surely consist almost entirely of x86 code !

Conjecture

The order of events is

  1. the initialize event code runs.

  2. System attempts to load really, really, big BMP

  3. the docReady event code runs.

and what happens is

  1. the initialize event code sets up a known heap state based on the allocation of strings with distinctive contents, interspersed with holes created by freeing every tenth string.

  2. the System image loading code corrupts the heap attempting to load the really, really, big BMP

  3. he docReady event code finds the location of the heap corruption on the basis of the known state set up by the initialize event code and exploits it.

Given that the Javascript it looking for a string that has changed underneath it so to speak the assumption must be that the string has been freed despite still being in use. Once freed it is re-used by non Javascript code which changes it.

This in turn implies that the heap corruption involves the data used by the heap management code to keep track of which pieces of memory in the heap are in use and which are free. Such data is often contiguous with the memory that is allocated, in which case the easiest way to corrupt it is to write beyond the limits of a piece of memory as allocated which is what the really really big BMP is for.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate
and specific direction to the original content.

August 24, 2014

Anatomy Of A PDF: Part Five — Q: When Is A Form Not A Form ? A: When It Is A Can Of Worms

Filed under: BMP, Document Format, Image Formats, Javascript, PDF, Programming Languages, XFA — Tags: , , , , — Simon Lewis @ 12:56 pm

Top-Level Structure

This is the top-level structure of the XFA form lurking in Object 1.

I have omitted all the non-interesting elements, elided the contents of the image and script elements, and renamed a few things but apart from that this is what 91MB of XML looks like.

Impressive isn’t it ?


    <xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/" timeStamp="2014-01-21T18:18:41Z">
        <template xmlns="http://www.xfa.org/schema/xfa-template/3.1/">
            <?formServer defaultPDFRenderFormat acrobat9.1static?>
            ...
            <subform ...>
                
                ...
                <field name="Image_1">
                    <ui>
                        <imageEdit/>
                    </ui>
                    <value>
                        <image>...</image>
                    </value>
                </field>
                <field name="Image_2">
                    <ui>
                        <imageEdit/>
                    </ui>
                    <value>
                        <image>...</image>
                    </value>
                </field>
                <variables>
                    <script name="..." contentType="application/x-javascript">...</script>
                    <script name="..." contentType="application/x-javascript">...</script>
                    <?templateDesigner expand 1?>
                </variables>
                <subform ...>
                    <field name="Image_3">
                        <ui>
                            <imageEdit/>
                        </ui>
                        <value>
                            <image>...</image>
                        </value>
                    </field>
                </subform>
                <event activity="initialize" name="...">
                    <script contentType="application/x-javascript">...</script>
                </event>
                <event activity="docReady" ref="$host" name="...">
                    <script contentType="application/x-javascript">...</script>
                </event>
            </subform>
            ...
        </template>
        
        
        ...
        
    </xdp:xdp>    

Scripts

As predicted the form contains scripts.

As you can see there are four script elements containing chunks of Javascript.

Taking all four together there is approximately 20KB of Javascript.

Two of the script elements are associated with events so one lot of Javascript will get to run when the “initialize” event occurs and the other lot when the “docReady” event occurs.

Images

As you can see there are three images.

The default encoding for image data in XFA is Base64. This is not over-ridden anywhere so the data for each image is Base64 encoded.

Obfuscation

What you cannot see because I’ve omitted it, is that the chunks of Javascript are partially and mildly obfuscated.

More Scripts

To see what I mean by mildly obfuscated consider the following.

One of the Javascript chunks contains these strings.

    "VW~`~`~XYZa!~`bcde!fghij~`~klm~``~~nopqrs~````tuvwx~~~yz01!234~~56789+``/~~~"
    
    "AB!~CDEF!`GHI!``JKL!~~MNOP!```QRSTU"

If you bolt the first on to the end of the second and then strip out all the occurences of the characters ‘!’, ”~’, and ‘`’ you end up with

    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

which looks remarkably like the ‘Base 64 Alphabet’ as described in RFC 4648 because that’s what it is.

There are also unobfuscated references to the images I’ve named “Image_1” and “Image_2” in the Javascript.

Given that

  1. the image data is Base64 encoded, and

  2. that there are the makings of a ‘Base 64 Alphabet’ sat in the Javascript,

it doesn’t take an enormous leap of imagination to wonder whether the referenced images are really images at all or whether the Javascript is going to decode them and turn them into something else entirely.

Extracting the image data into files and Base64 decoding them produces two more chunks of Javascript.

One of them contains a table indexed by version number indicating presumably that the Javascript can tailor its behaviour depending on what version of the target executable it finds itself running in.

Dark Matter

OK so there’s around 20KB of Javascript.

The two pseudo images taken together are about 10KB.

There is maybe another 10KB of XFA related random angle bracket action.

Where is the other ~90.95MB ?

Its the data for what I’ve named “Image_3”.

A Really Really Big Image

Apparently Image_3 is a really really big image, but is it ?

Given that Image_1 and Image_2 turned out not to be images at all is Image_3 also something else in disguise ?

Base64 decoding the data does not produce Javascript.

What it produces is this

    0000000 42 4d 00 00 00 00 00 00 00 00 00 00 00 00 40 00
    0000010 00 00 2e 01 00 00 01 00 00 00 01 00 08 00 01 00
    0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
    0000030 00 00 00 00 00 00 52 47 42 41 52 47 42 41 00 02
    0000040 ff 00 00 02 ff 00 00 02 ff 00 00 02 ff 00 00 02
    *
    4040440 f8 00 00 08 01 00 00 00 00 00 27 05 00 02 00 ff
    4040450 00 02 00 ff 00 02 00 ff 00 02 00 ff 00 02 00 ff
    *
    4040470 00 02 00 ff 00 0a 58 58 58 58 58 58 58 58 58 58
    4040480

Yes that is the entire thing.

As you can see although it starts off promisingly it becomes tediously predictable almost straightaway with the four bytes

    00 02 ff 00

repeating over and over and over again.

That’s what’s IN it, but what IS it ?

The first two bytes are the ASCII characters ‘B’ and ‘M’ which would seem to indicate that it is a BMP image which is a pain because the BMP image format is not officially documented.

According to the various bits of unofficial documentation BMP images can have a bewilderingly variety of headers and this one seems to have an OS/2 2.x header.

Treating it as an OS/2 2.x header would mean that the compression type of the image is RLE/8 where RLE means run-length encoded. This is supported by the image data which follows the header which makes sense as RLE/8 data.

Javascript + BMP Image == What ?

So there you have it.

An XFA form with four chunks of Javascript partially and not very successfully obfuscated, two hidden chunks of Javascript likewise, and one great big run-length encoded BMP image.

Why ?

What is this particular can of worms going to do when unleashed on a poor, unsuspecting, but probably not very little executable ?


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

August 23, 2014

Anatomy Of A PDF: Part Four — Beware Of Objects Bearing Gifts Or Containing Forms

Filed under: Document Format, PDF, PDF Objects — Tags: , , , — Simon Lewis @ 7:07 pm

The start of Object 1 looks like this.

    1 0 obj
    <</Filter [/Fl /Fl] /Length 13178 >>
    stream
    
    ...

and the end of it looks like this

    ...
    
    endstream
    endobj
    
    ...

and in between there are 13178 bytes of assorted binary sludge.

This is beacause the contents of the object have been filtered before being written to the file.

This is what the presence of the Filter entry in the object’s dictionary is telling us.

The associated value is an array which specifies the filters that have been applied.

    Fl

is an abbreviation for

    FlateDecode

There are two entries, so the FlateDecode filter has been applied twice, first to the original object, and then to the output produced by the first pass.

The FlateDecode filter produces data in the DEFLATE compressed data format.

To find out what is in Object 1 we are going to have to inflate it twice.

Seconds Out, Round One

Inflating it once gives us 408521 bytes of compressed XML up from 13178 so its obviously got a lot of zeroes in it or something.

Seconds Out, Round Two

Inflating our collection of 408521 newly minted bytes we end up with a decidedly non-trivial 91044256 bytes of XML which is a whole lot of angle brackets.

Its going to take forever to fill in all the fields in this thing.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part Three — Objects, Objects, Objects

According to the cross-reference table there are six objects in use starting with object number 1.

Starting from the root object as specified by the trailer.

Object 3

Object 3 starts at 13294 which 0x33EE.

    3 0 obj
    <<
    /Extensions <</ADBE <</ExtensionLevel 3 /BaseVersion /1.7 >> >>
    /AcroForm 2 0 R
    /Type /Catalog
    /Pages 4 0 R
    /NeedsRendering true
    >>
    endobj

The root object should be a Catalog and the Type entry shows that it is.

AcroForm

The presence of the AcroForm entry indicates that the PDF contains an interactive form.

The entry references Object 2 which should therefore be an interactive form dictionary specifying the form.

Pages

The Pages entry references Object 4 which should therefore be a page tree node.

NeedsRendering

The NeedsRendering entry is another clue that the document contains a form.

Object 2

Object 2 start at offset 13263 which is 0x33CF and it looks like this.

    2 0 obj
    <</XFA 1 0 R >>
    endobj

According to Object 3, the Document Catalog, this should be an interactive form dictionary and the presence of the XFA entry confirms that it is.

XFA is the XML Forms Architecture. Amongst other things it supports scripting, and the Adobe implementation the scripting support includes Javascript, which in this context seems highly significant.

The XFA entry identifies the object which contains the XML which describes the XFA form.

The referenced object is Object 1, so ten to one on there are scripts in Object 1 and they do something unpleasant.

Object 1

Object 1 starts at offset 15.

According to Object 2 it should contain XML specifying an XFA form.

It is the biggest object by far so it definitely contains something, and presumably something nasty at that, so we’ll save it for later.

Object 4

Object 4 starts at 13443 which is 0x3483 and it looks like this

    4 0 obj
    <<
        /Count 1
        /Kids [5 0 R]
        /Type /Pages
    >>
    endobj

According to Object 3, the Document Catalog, this should be a page tree node with a Type of Pages and it is.

The value of the Kids (sic) entry is an Array of Object References of length one.

This entry identifies the children of this object, which are either collections of pages or individual pages which make up the document.

Object 5

Object 5 starts at 13499 which is 0x34BB and it looks like this

    5 0 obj
    <<
        /Parent 4 0 R
        /Type /Page
        /Contents 6 0 R
        /Resources << /Font <&lt/F1 <</BaseFont /Helvetica /Subtype /Type1 /Name /F1 >> >> >>
    >>
    endobj

According to Object 4, the Pages Object, this should be either another page tree node or page object. The Type entry Page tells us that it is a page object.

The Parent entry is a reference to Object 4 which is indeed the parent of this object.

The Contents entry is a reference to Object 6.

The Resources entry is an example of an entry whose value is a dictionary, which is itself an example of a dictionary with entries whose values are dictionaries.

Object 6

Object 6 starts at 13644 which is 0x354c.

    6 0 obj
    <</Length 23 >>
    stream
    BT /F1 24 Tf 100 100 Td
    endstream
    endobj

We know that it is the contents of the page represented by Object 5.

It is in fact a Content Stream which comprises a series of graphic operators and their operands, with the operands preceding their operators (cough, Postscript, cough).

For what its worth

    BT

is the ‘begin text’ operator.

    Tf

is the ‘set text font and size’ operator.

It is preceded by its operands

    /F1 24

with F1 being the font as defined in Object 5 and 24 being the size.

    Td

is the ‘move text’ operator.

It is preceded by its operands

    100 100

And that’s it. Not a very exciting page it has to be said.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part Two — Its One Of Those Start At The End Of Formats …

The portable document format started life as one of those formats where you have to start at the end, like ZIPs. It seems to have been all the rage at one time, ‘tho not so much these days.

Times having changed, and document formats where you have to start at the end having gone out of fashion, it is now possible to produce PDFs which can be read from the front, but where’s the fun in that ?

Anyway, here’s the very end of my PDF file

   ...
    
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    0003630  \n   <   <   /   R   o   o   t       3       0       R       /
    0003640   S   i   z   e       7       >   >  \n   s   t   a   r   t   x
    0003650   r   e   f  \n   1   3   7   1   6  \n   %   %   E   O   F
    000365f

You can tell its the end because it says

    %%EOF

at the end, which is helpful.

As you can see its mostly ASCII, except for the line endings, which aren’t.

For some reason the internal data in a PDF file is often human readable and the actual contents aren’t.

Formtting the ASCII bit according to the line endings gives us

    ...
    
    trailer
    <</Root 3 0 R /Size 7 >>
    startxref
    13716
    %%EOF

Dictionaries

A lot of the internal data in a PDF file is in the form of dictionaries which are collections of key/value pairs as you might expect.

Dictionaries are delimited by pairs of angle brackets like so

    << ... >>

In the example above you can see that on the line following the word trailer is a dictionary

    ...
    
    <</Root 3 0 R /Size 7 >>
    
    ...

with two entries

    Root

and

    Size

Dictionary keys are Names and Names are always prefixed with a '/'.

The corresponding values may be of any PDF data type including Names, Arrays, and Dictionaries.

Dictionaries often contain a Type entry which specifies the type of the dictionary from which it is possible to determine, via the PDF specification, what other entries the dictionary must, or can, contain.

Dictionaries do not have to be written on a single line, it just so happens that the one in the trailer is.

Objects And Object References

A PDF document comprises a number of objects.

Each object has a number and a generation number.

An object is referred to using both its object number and its generation number.

In the internal data of a PDF file object references are written like so

    object-number genration-number 'R'

The value of the Root entry in the dictionary shown above

    3 0 R

is an example of an object reference as it appears in internal data.

Objects in a PDF file always start with the object number and the generation number so when following references to objects for example, you can always work out whether the object you’ve got is the one you are expecting.

The Trailer

The trailer of a PDF file starts with the line trailer, followed by a dictionary, followed by the line startxref, followed by an offset followed by the EOF marker.

The Root entry in the trailer dictionary specifies the root object of the document. From the root object you can find all the other objects in the file one way or another.

The Size entry in the trailer dictionary specifies the total number of entries in the document’s cross-reference table.

The offset following the word startxref line is the offset of the document’s cross-reference table within the file.

The Cross-Reference Table

The cross-reference table starts at 13716 which is 0x3594.

    ...
    
    0003590   o   b   j  \n   x   r   e   f  \n   0       7  \n   0   0   0
    00035a0   0   0   0   0   0   0   0       6   5   5   3   5       f
    00035b0  \n   0   0   0   0   0   0   0   0   1   5       0   0   0   0
    00035c0   0       n      \n   0   0   0   0   0   1   3   2   6   3
    00035d0   0   0   0   0   0       n      \n   0   0   0   0   0   1   3
    00035e0   2   9   4       0   0   0   0   0       n      \n   0   0   0
    00035f0   0   0   1   3   4   4   3       0   0   0   0   0       n
    0003600  \n   0   0   0   0   0   1   3   4   9   9       0   0   0   0
    0003610   0       n      \n   0   0   0   0   0   1   3   6   4   4
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    
    ...

Formatting the bit starting at 0x3594 by obeying the line endings gives us

    ...
    
    xref
    0 7
    0000000000 65535 f
    0000000015 00000 n
    0000013263 00000 n
    0000013294 00000 n
    0000013443 00000 n
    0000013499 00000 n
    0000013644 00000 n
    
    ...

The second line of the cross-reference table

    0 7

specifies the number of the first object which has an entry in this table, and the number of entries.

In this case the number of the first object is 0 and there are seven entries.

The following seven lines are the entries for objects 0 to 6.

Each entry specifies the offset of the object within the file, the generation number of the object and whether it is in use (‘n’) or free (‘f’).

For example, the second entry

    0000000015 00000 n

tells us that object 1 is at offset 15 within the file, its generation number is 0, and it is in use.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Older Posts »

Create a free website or blog at WordPress.com.