Just An Application

August 23, 2014

Anatomy Of A PDF: Part Two — Its One Of Those Start At The End Of Formats …

The Portable Document Format started life as one of those formats where you have to start at the end, like ZIP files. Starting at the end seems to have been all the rage at one time, though not so much these days.

Times having changed, and document formats where you have to start at the end having gone out of fashion, it is now possible to produce PDFs which can be read from the front, but where’s the fun in that?

Anyway, here’s the very end of my PDF file

   ...
    
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    0003630  \n   <   <   /   R   o   o   t       3       0       R       /
    0003640   S   i   z   e       7       >   >  \n   s   t   a   r   t   x
    0003650   r   e   f  \n   1   3   7   1   6  \n   %   %   E   O   F
    000365f

You can tell it’s the end because it says

    %%EOF

at the end, which is helpful.

As you can see it’s mostly printable ASCII, except for the line endings, which show up as \n.

For some reason the internal data in a PDF file is often human readable while the actual contents aren’t.

Formatting the ASCII bit according to the line endings gives us

    ...
    
    trailer
    <</Root 3 0 R /Size 7 >>
    startxref
    13716
    %%EOF
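
That formatting step can be sketched in a few lines of Python. This is only an illustration: the name tail_lines and the 64 byte window are my own choices, and it assumes the whole file has already been read into memory as bytes.

```python
def tail_lines(data: bytes, window: int = 64) -> list[str]:
    """Return the lines found in the last `window` bytes of `data`."""
    tail = data[-window:]
    # splitlines() copes with \n, \r and \r\n line endings.
    # latin-1 maps every byte to a character, so decoding cannot fail.
    return [line.decode("latin-1") for line in tail.splitlines()]


# The bytes from the dump above, starting at the word "trailer".
sample = b"trailer\n<</Root 3 0 R /Size 7 >>\nstartxref\n13716\n%%EOF"
lines = tail_lines(sample)
print(lines[-1])   # %%EOF
```

If the window happens to start in the middle of a line, the first element is only a fragment, so it is the last few elements that matter.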

Dictionaries

A lot of the internal data in a PDF file is in the form of dictionaries, which are collections of key/value pairs as you might expect.

Dictionaries are delimited by pairs of angle brackets like so

    << ... >>

In the example above you can see that the line following the word trailer is a dictionary

    ...
    
    <</Root 3 0 R /Size 7 >>
    
    ...

with two entries

    Root

and

    Size

Dictionary keys are Names and Names are always prefixed with a '/'.

The corresponding values may be of any PDF data type including Names, Arrays, and Dictionaries.

Dictionaries often contain a Type entry which specifies the type of the dictionary. From the type it is possible to determine, via the PDF specification, what other entries the dictionary must, or can, contain.

Dictionaries do not have to be written on a single line; it just so happens that the one in the trailer is.
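
To make the key/value structure concrete, here is a rough Python sketch that reads a dictionary like the one in the trailer. It is deliberately limited: the name parse_flat_dict is hypothetical, and it only understands a single level whose values are Names, integers or object references (the "3 0 R" form used by the Root entry). Nested dictionaries, arrays and strings are not handled.

```python
import re

# A token is a name (/Root), an object reference (3 0 R) or an integer.
TOKEN = re.compile(
    r"/(?P<name>[^\s/<>\[\]()]+)|(?P<ref>\d+\s+\d+\s+R)\b|(?P<int>-?\d+)"
)


def parse_flat_dict(text: str) -> dict:
    """Parse a one-level dictionary of names, integers and references."""
    text = text.strip()
    if not (text.startswith("<<") and text.endswith(">>")):
        raise ValueError("not a dictionary")
    tokens = [(m.lastgroup, m.group(m.lastgroup))
              for m in TOKEN.finditer(text[2:-2])]
    result = {}
    # Tokens alternate key, value, key, value ...
    for (key_kind, key), (value_kind, value) in zip(tokens[0::2], tokens[1::2]):
        if key_kind != "name":
            raise ValueError(f"expected a name as key, got {key!r}")
        if value_kind == "ref":
            number, generation, _ = value.split()
            result[key] = (int(number), int(generation))  # reference as a tuple
        elif value_kind == "int":
            result[key] = int(value)
        else:
            result[key] = "/" + value  # keep name values recognisable
    return result


print(parse_flat_dict("<</Root 3 0 R /Size 7 >>"))
# {'Root': (3, 0), 'Size': 7}
```

The same function copes with the dictionary being spread over several lines, since the tokens are separated by any whitespace.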

Objects And Object References

A PDF document comprises a number of objects.

Each object has a number and a generation number.

An object is referred to using both its object number and its generation number.

In the internal data of a PDF file object references are written like so

    object-number generation-number 'R'

The value of the Root entry in the dictionary shown above

    3 0 R

is an example of an object reference as it appears in internal data.

Objects in a PDF file always start with the object number and the generation number, so when following a reference to an object you can always check that the object you’ve got is the one you were expecting.
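
As a sketch of that check in Python, the code below parses a reference and then tests whether the bytes at a given offset begin with the matching numbers. The names Ref, parse_ref and starts_object are hypothetical, and so is the tiny file fragment used to try it out. The obj keyword after the two numbers is standard PDF syntax, though it isn't shown in this part.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    number: int
    generation: int


def parse_ref(text: str) -> Ref:
    """Turn 'object-number generation-number R' into a Ref."""
    m = re.fullmatch(r"\s*(\d+)\s+(\d+)\s+R\s*", text)
    if m is None:
        raise ValueError(f"not an object reference: {text!r}")
    return Ref(int(m.group(1)), int(m.group(2)))


def starts_object(data: bytes, offset: int, ref: Ref) -> bool:
    """Check that the bytes at `offset` begin with '<number> <generation> obj'."""
    m = re.match(rb"(\d+)\s+(\d+)\s+obj\b", data[offset:offset + 32])
    return m is not None and Ref(int(m.group(1)), int(m.group(2))) == ref


# A made-up fragment, not taken from the file examined in this post.
fragment = b"%PDF-1.4\n3 0 obj\n<< >>\nendobj\n"
root = parse_ref("3 0 R")
print(starts_object(fragment, 9, root))   # True
```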

The Trailer

The trailer of a PDF file starts with the line trailer, followed by a dictionary, followed by the line startxref, followed by an offset, followed by the EOF marker.

The Root entry in the trailer dictionary specifies the root object of the document. From the root object you can find all the other objects in the file one way or another.

The Size entry in the trailer dictionary specifies the total number of entries in the document’s cross-reference table.

The offset on the line following startxref is the offset of the document’s cross-reference table within the file.
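
Pulling that offset out of the end of a file might look something like this in Python. It is only a sketch: the name find_startxref and the 1024 byte window are my assumptions, and it expects the whole file as bytes.

```python
import re


def find_startxref(data: bytes, window: int = 1024) -> int:
    """Return the offset that follows the last 'startxref' near the end."""
    tail = data[-window:]
    # Take the last match in case the keyword appears more than once.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


# The trailer bytes from the dump at the top of this post.
trailer = b"trailer\n<</Root 3 0 R /Size 7 >>\nstartxref\n13716\n%%EOF"
print(find_startxref(trailer))   # 13716
```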

The Cross-Reference Table

The cross-reference table starts at offset 13716, which is 0x3594 in hexadecimal.

    ...
    
    0003590   o   b   j  \n   x   r   e   f  \n   0       7  \n   0   0   0
    00035a0   0   0   0   0   0   0   0       6   5   5   3   5       f
    00035b0  \n   0   0   0   0   0   0   0   0   1   5       0   0   0   0
    00035c0   0       n      \n   0   0   0   0   0   1   3   2   6   3
    00035d0   0   0   0   0   0       n      \n   0   0   0   0   0   1   3
    00035e0   2   9   4       0   0   0   0   0       n      \n   0   0   0
    00035f0   0   0   1   3   4   4   3       0   0   0   0   0       n
    0003600  \n   0   0   0   0   0   1   3   4   9   9       0   0   0   0
    0003610   0       n      \n   0   0   0   0   0   1   3   6   4   4
    0003620   0   0   0   0   0       n      \n   t   r   a   i   l   e   r
    
    ...

Formatting the bit starting at 0x3594 by obeying the line endings gives us

    ...
    
    xref
    0 7
    0000000000 65535 f
    0000000015 00000 n
    0000013263 00000 n
    0000013294 00000 n
    0000013443 00000 n
    0000013499 00000 n
    0000013644 00000 n
    
    ...

The second line of the cross-reference table

    0 7

specifies the number of the first object which has an entry in this table, and the number of entries.

In this case the number of the first object is 0 and there are seven entries.

The following seven lines are the entries for objects 0 to 6.

Each entry specifies the offset of the object within the file, the generation number of the object and whether it is in use (‘n’) or free (‘f’). For a free entry the first field is not an offset but the number of the next free object.

For example, the second entry

    0000000015 00000 n

tells us that object 1 is at offset 15 within the file, its generation number is 0, and it is in use.
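
Reading the whole table can be sketched in Python as follows. The name parse_xref is hypothetical, and the code assumes a table with a single run of entries, as in this file. Tables with several runs would need a loop around the header line.

```python
def parse_xref(text: str) -> dict:
    """Map object number -> (first field, generation, in_use) for one table."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("cross-reference table must start with 'xref'")
    # Header line: number of the first object, then the number of entries.
    first, count = (int(field) for field in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:2 + count]):
        first_field, generation, flag = line.split()
        if flag not in ("n", "f"):
            raise ValueError(f"unexpected entry type: {flag!r}")
        # For 'n' entries the first field is the offset of the object.
        entries[first + index] = (int(first_field), int(generation), flag == "n")
    return entries


table = parse_xref("""xref
0 7
0000000000 65535 f
0000000015 00000 n
0000013263 00000 n
0000013294 00000 n
0000013443 00000 n
0000013499 00000 n
0000013644 00000 n
""")
print(table[1])     # (15, 0, True)
print(len(table))   # 7
```

The seven entries agree with the Size entry in the trailer dictionary.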


Copyright (c) 2014 By Simon Lewis. All Rights Reserved.

Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and owner Simon Lewis is strictly prohibited.

Excerpts and links may be used, provided that full and clear credit is given to Simon Lewis and justanapplication.wordpress.com with appropriate and specific direction to the original content.
