Just An Application

August 23, 2014

Anatomy Of A PDF: Part Three — Objects, Objects, Objects

According to the cross-reference table there are six objects in use starting with object number 1.

Starting from the root object as specified by the trailer.

Object 3

Object 3 starts at 13294 which 0x33EE.

    3 0 obj
    <<
    /Extensions <</ADBE <</ExtensionLevel 3 /BaseVersion /1.7 >> >>
    /AcroForm 2 0 R
    /Type /Catalog
    /Pages 4 0 R
    /NeedsRendering true
    >>
    endobj

The root object should be a Catalog and the Type entry shows that it is.

AcroForm

The presence of the AcroForm entry indicates that the PDF contains an interactive form.

The entry references Object 2 which should therefore be an interactive form dictionary specifying the form.

Pages

The Pages entry references Object 4 which should therefore be a page tree node.

NeedsRendering

The NeedsRendering entry is another clue that the document contains a form.

Object 2

Object 2 start at offset 13263 which is 0x33CF and it looks like this.

    2 0 obj
    <</XFA 1 0 R >>
    endobj

According to Object 3, the Document Catalog, this should be an interactive form dictionary and the presence of the XFA entry confirms that it is.

XFA is the XML Forms Architecture. Amongst other things it supports scripting, and the Adobe implementation the scripting support includes Javascript, which in this context seems highly significant.

The XFA entry identifies the object which contains the XML which describes the XFA form.

The referenced object is Object 1, so ten to one on there are scripts in Object 1 and they do something unpleasant.

Object 1

Object 1 starts at offset 15.

According to Object 2 it should contain XML specifying an XFA form.

It is the biggest object by far so it definitely contains something, and presumably something nasty at that, so we’ll save it for later.

Object 4

Object 4 starts at 13443 which is 0x3483 and it looks like this

    4 0 obj
    <<
        /Count 1
        /Kids [5 0 R]
        /Type /Pages
    >>
    endobj

According to Object 3, the Document Catalog, this should be a page tree node with a Type of Pages and it is.

The value of the Kids (sic) entry is an Array of Object References of length one.

This entry identifies the children of this object, which are either collections of pages or individual pages which make up the document.

Object 5

Object 5 starts at 13499 which is 0x34BB and it looks like this

    5 0 obj
    <<
        /Parent 4 0 R
        /Type /Page
        /Contents 6 0 R
        /Resources << /Font <&lt/F1 <</BaseFont /Helvetica /Subtype /Type1 /Name /F1 >> >> >>
    >>
    endobj

According to Object 4, the Pages Object, this should be either another page tree node or page object. The Type entry Page tells us that it is a page object.

The Parent entry is a reference to Object 4 which is indeed the parent of this object.

The Contents entry is a reference to Object 6.

The Resources entry is an example of an entry whose value is a dictionary, which is itself an example of a dictionary with entries whose values are dictionaries.

Object 6

Object 6 starts at 13644 which is 0x354c.

    6 0 obj
    <</Length 23 >>
    stream
    BT /F1 24 Tf 100 100 Td
    endstream
    endobj

We know that it is the contents of the page represented by Object 5.

It is in fact a Content Stream which comprises a series of graphic operators and their operands, with the operands preceding their operators (cough, Postscript, cough).

For what its worth

    BT

is the ‘begin text’ operator.

    Tf

is the ‘set text font and size’ operator.

It is preceded by its operands

    /F1 24

with F1 being the font as defined in Object 5 and 24 being the size.

    Td

is the ‘move text’ operator.

It is preceded by its operands

    100 100

And that’s it. Not a very exciting page it has to be said.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Anatomy Of A PDF: Part Two — Its One Of Those Start At The End Of Formats …

The portable document format started life as one of those formats where you have to start at the end, like ZIPs. It seems to have been all the rage at one time, ‘tho not so much these days.

Times having changed, and document formats where you have to start at the end having gone out of fashion, it is now possible to produce PDFs which can be read from the front, but where’s the fun in that ?

Anyway, here’s the very end of my PDF file

   ...
    
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    0003630  \n   <   <   /   R   o   o   t       3       0       R       /
    0003640   S   i   z   e       7       >   >  \n   s   t   a   r   t   x
    0003650   r   e   f  \n   1   3   7   1   6  \n   %   %   E   O   F
    000365f

You can tell its the end because it says

    %%EOF

at the end, which is helpful.

As you can see its mostly ASCII, except for the line endings, which aren’t.

For some reason the internal data in a PDF file is often human readable and the actual contents aren’t.

Formtting the ASCII bit according to the line endings gives us

    ...
    
    trailer
    <</Root 3 0 R /Size 7 >>
    startxref
    13716
    %%EOF

Dictionaries

A lot of the internal data in a PDF file is in the form of dictionaries which are collections of key/value pairs as you might expect.

Dictionaries are delimited by pairs of angle brackets like so

    << ... >>

In the example above you can see that on the line following the word trailer is a dictionary

    ...
    
    <</Root 3 0 R /Size 7 >>
    
    ...

with two entries

    Root

and

    Size

Dictionary keys are Names and Names are always prefixed with a '/'.

The corresponding values may be of any PDF data type including Names, Arrays, and Dictionaries.

Dictionaries often contain a Type entry which specifies the type of the dictionary from which it is possible to determine, via the PDF specification, what other entries the dictionary must, or can, contain.

Dictionaries do not have to be written on a single line, it just so happens that the one in the trailer is.

Objects And Object References

A PDF document comprises a number of objects.

Each object has a number and a generation number.

An object is referred to using both its object number and its generation number.

In the internal data of a PDF file object references are written like so

    object-number genration-number 'R'

The value of the Root entry in the dictionary shown above

    3 0 R

is an example of an object reference as it appears in internal data.

Objects in a PDF file always start with the object number and the generation number so when following references to objects for example, you can always work out whether the object you’ve got is the one you are expecting.

The Trailer

The trailer of a PDF file starts with the line trailer, followed by a dictionary, followed by the line startxref, followed by an offset followed by the EOF marker.

The Root entry in the trailer dictionary specifies the root object of the document. From the root object you can find all the other objects in the file one way or another.

The Size entry in the trailer dictionary specifies the total number of entries in the document’s cross-reference table.

The offset following the word startxref line is the offset of the document’s cross-reference table within the file.

The Cross-Reference Table

The cross-reference table starts at 13716 which is 0x3594.

    ...
    
    0003590   o   b   j  \n   x   r   e   f  \n   0       7  \n   0   0   0
    00035a0   0   0   0   0   0   0   0       6   5   5   3   5       f
    00035b0  \n   0   0   0   0   0   0   0   0   1   5       0   0   0   0
    00035c0   0       n      \n   0   0   0   0   0   1   3   2   6   3
    00035d0   0   0   0   0   0       n      \n   0   0   0   0   0   1   3
    00035e0   2   9   4       0   0   0   0   0       n      \n   0   0   0
    00035f0   0   0   1   3   4   4   3       0   0   0   0   0       n
    0003600  \n   0   0   0   0   0   1   3   4   9   9       0   0   0   0
    0003610   0       n      \n   0   0   0   0   0   1   3   6   4   4
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    
    ...

Formatting the bit starting at 0x3594 by obeying the line endings gives us

    ...
    
    xref
    0 7
    0000000000 65535 f
    0000000015 00000 n
    0000013263 00000 n
    0000013294 00000 n
    0000013443 00000 n
    0000013499 00000 n
    0000013644 00000 n
    
    ...

The second line of the cross-reference table

    0 7

specifies the number of the first object which has an entry in this table, and the number of entries.

In this case the number of the first object is 0 and there are seven entries.

The following seven lines are the entries for objects 0 to 6.

Each entry specifies the offset of the object within the file, the generation number of the object and whether it is in use (‘n’) or free (‘f’).

For example, the second entry

    0000000015 00000 n

tells us that object 1 is at offset 15 within the file, its generation number is 0, and it is in use.


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.

Create a free website or blog at WordPress.com.