The Portable Document Format (PDF) started life as one of those formats where you have to start at the end, like ZIP files. Such formats seem to have been all the rage at one time, though not so much these days.
Times having changed, and start-at-the-end document formats having gone out of fashion, it is now possible to produce PDFs which can be read from the front, but where’s the fun in that?
Anyway, here’s the very end of my PDF file
...
0003620 0 0 0 0 0 n \n t r a i l e r
0003630 \n < < / R o o t 3 0 R /
0003640 S i z e 7 > > \n s t a r t x
0003650 r e f \n 1 3 7 1 6 \n % % E O F
000365f
You can tell it’s the end because it says
%%EOF
at the end, which is helpful.
As you can see, it’s mostly printable ASCII, except for the line endings, which aren’t printable.
For some reason the internal data in a PDF file is often human readable and the actual contents aren’t.
Formatting the ASCII bit according to the line endings gives us
...
trailer
<</Root 3 0 R /Size 7 >>
startxref
13716
%%EOF
Dictionaries
A lot of the internal data in a PDF file is in the form of dictionaries, which are collections of key/value pairs, as you might expect.
Dictionaries are delimited by pairs of angle brackets like so
<< ... >>
In the example above you can see that on the line following the word trailer
is a dictionary
...
<</Root 3 0 R /Size 7 >>
...
with two entries
Root
and
Size
Dictionary keys are Names, and Names are always prefixed with a '/'.
The corresponding values may be of any PDF data type including Names, Arrays, and Dictionaries.
Dictionaries often contain a Type
entry which specifies the type of the dictionary. From the type it is possible to determine, via the PDF specification, which other entries the dictionary must, or may, contain.
Dictionaries do not have to be written on a single line; it just so happens that the one in the trailer is.
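To make the key/value structure concrete, here is a minimal Python sketch that parses a flat dictionary like the one in the trailer. It is an illustration only: the function name parse_flat_dict and the tuple used to represent references are my own inventions, and it assumes the values are limited to Names, integers and object references, with no nested Dictionaries, Arrays or strings.

```python
import re

# Matches, in order of preference: an object reference ("3 0 R"),
# a Name ("/Root"), or a plain integer ("7").
TOKEN = re.compile(rb"(\d+)\s+(\d+)\s+R\b|/([^\s/<>\[\]()]+)|(-?\d+)")


def parse_flat_dict(data: bytes) -> dict:
    """Parse a flat dictionary such as b'<</Root 3 0 R /Size 7 >>'.

    Assumption: no nested dictionaries, arrays or strings.
    """
    start = data.index(b"<<") + 2
    end = data.index(b">>", start)
    tokens = []
    for m in TOKEN.finditer(data[start:end]):
        if m.group(1) is not None:
            # (object number, generation number)
            tokens.append(("ref", int(m.group(1)), int(m.group(2))))
        elif m.group(3) is not None:
            tokens.append(("name", m.group(3).decode("ascii")))
        else:
            tokens.append(int(m.group(4)))
    # Tokens alternate key, value, key, value, ...
    result = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if not (isinstance(key, tuple) and key[0] == "name"):
            raise ValueError("dictionary key is not a Name")
        result[key[1]] = value
    return result


trailer_dict = parse_flat_dict(b"<</Root 3 0 R /Size 7 >>")
print(trailer_dict)
```

A real parser needs a proper tokenizer that recurses into nested values; the regular expression here is only enough for the trailer dictionary shown above.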
Objects And Object References
A PDF document comprises a number of objects.
Each object has a number and a generation number.
An object is referred to using both its object number and its generation number.
In the internal data of a PDF file object references are written like so
object-number generation-number 'R'
The value of the Root
entry in the dictionary shown above
3 0 R
is an example of an object reference as it appears in internal data.
An object in a PDF file always starts with its object number and generation number, so when following a reference you can always check whether the object you’ve got is the one you were expecting.
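That check can be sketched in a few lines of Python. The helper names parse_reference and object_matches are made up for this example, and the sketch assumes the object header takes the usual form of object number, generation number and the keyword obj, separated by whitespace.

```python
import re

# "3 0 R" as it appears in internal data
REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")
# "3 0 obj" as it appears at the start of an object
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def parse_reference(text: bytes):
    """Return (object number, generation number) for a reference like b'3 0 R'."""
    m = REF.fullmatch(text.strip())
    if m is None:
        raise ValueError("not an object reference: %r" % text)
    return int(m.group(1)), int(m.group(2))


def object_matches(data: bytes, offset: int, ref) -> bool:
    """Check that the object starting at offset is the one ref points to."""
    m = OBJ_HEADER.match(data, offset)
    return m is not None and (int(m.group(1)), int(m.group(2))) == ref


# A tiny made-up file fragment, just to exercise the helpers.
sample = b"%PDF-1.4\n3 0 obj\n<< >>\nendobj\n"
root = parse_reference(b"3 0 R")
print(root, object_matches(sample, sample.index(b"3 0 obj"), root))
```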
The Trailer
The trailer of a PDF file starts with the line trailer
, followed by a dictionary, followed by the line startxref
, followed by an offset followed by the EOF marker.
The Root
entry in the trailer dictionary specifies the root object of the document. From the root object you can find all the other objects in the file one way or another.
The Size
entry in the trailer dictionary specifies the total number of entries in the document’s cross-reference table.
The offset following the startxref
line is the byte offset, from the start of the file, of the document’s cross-reference table.
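Starting at the end can be sketched like this in Python. The function name find_startxref and the 1024-byte search window are my assumptions, not anything the format mandates, and the sketch assumes the file has a single, classic trailer like the one shown earlier.

```python
def find_startxref(data: bytes, window: int = 1024) -> int:
    """Return the offset that follows the startxref line at the end of a PDF.

    Assumption: the trailer lies within the last `window` bytes of the file.
    """
    tail = data[-window:]
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    key = tail.rfind(b"startxref", 0, eof)
    if key == -1:
        raise ValueError("no startxref line found")
    # Everything between the keyword and the EOF marker should be the offset.
    fields = tail[key + len(b"startxref"):eof].split()
    if not fields:
        raise ValueError("startxref is not followed by an offset")
    return int(fields[0])


# The end of the file from the dump above, reassembled by hand.
end_of_file = b"trailer\n<</Root 3 0 R /Size 7 >>\nstartxref\n13716\n%%EOF"
print(find_startxref(end_of_file))
```

With a real file you would pass in the bytes read from disk, for example the result of reading the file opened in binary mode.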
The Cross-Reference Table
The cross-reference table starts at offset 13716, which is 0x3594.
...
0003590 o b j \n x r e f \n 0 7 \n 0 0 0
00035a0 0 0 0 0 0 0 0 6 5 5 3 5 f
00035b0 \n 0 0 0 0 0 0 0 0 1 5 0 0 0 0
00035c0 0 n \n 0 0 0 0 0 1 3 2 6 3
00035d0 0 0 0 0 0 n \n 0 0 0 0 0 1 3
00035e0 2 9 4 0 0 0 0 0 n \n 0 0 0
00035f0 0 0 1 3 4 4 3 0 0 0 0 0 n
0003600 \n 0 0 0 0 0 1 3 4 9 9 0 0 0 0
0003610 0 n \n 0 0 0 0 0 1 3 6 4 4
0003620 0 0 0 0 0 n \n t r a i l e r
...
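If you want to convince yourself that the decimal offset from the trailer lands on the x of xref in the dump, the arithmetic can be checked with a short Python sketch. The function name locate_in_dump is hypothetical, and it assumes the dump shows 16 bytes per row with hexadecimal addresses, as the one above does.

```python
def locate_in_dump(offset: int, width: int = 16):
    """Return (row address, column) of a byte offset in a hex-addressed dump."""
    row, column = divmod(offset, width)
    return row * width, column


row_address, column = locate_in_dump(13716)
# 13716 is 0x3594, so it falls in row 0x3590 at column 4,
# which is where the 'x' of xref sits after 'o', 'b', 'j' and '\n'.
print(hex(13716), hex(row_address), column)
```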
Formatting the bit starting at 0x3594 by obeying the line endings gives us
...
xref
0 7
0000000000 65535 f
0000000015 00000 n
0000013263 00000 n
0000013294 00000 n
0000013443 00000 n
0000013499 00000 n
0000013644 00000 n
...
The second line of the cross-reference table
0 7
specifies the number of the first object which has an entry in this table, and the number of entries.
In this case the number of the first object is 0
and there are seven entries.
The following seven lines are the entries for objects 0 to 6.
Each entry specifies the offset of the object within the file, the generation number of the object and whether it is in use (‘n’) or free (‘f’). For a free entry the first field is not an offset but the number of the next free object.
For example, the second entry
0000000015 00000 n
tells us that object 1 is at offset 15 within the file, its generation number is 0, and it is in use.
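Reading the table programmatically might look something like the following Python sketch. The function name parse_xref_section and the tuple layout are my own choices, and it assumes a table with a single subsection, as in this file; a table may contain several subsections, which this does not handle.

```python
def parse_xref_section(data: bytes, offset: int) -> dict:
    """Parse a cross-reference table with one subsection, starting at offset.

    Returns {object number: (first field, generation number, in_use)}.
    """
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    # Second line: number of the first object, then the number of entries.
    first, count = (int(field) for field in lines[1].split())
    entries = {}
    for number, line in enumerate(lines[2:2 + count], start=first):
        value, generation, flag = line.split()
        # For free ('f') entries the first field is the next free object
        # number rather than a file offset.
        entries[number] = (int(value), int(generation), flag == b"n")
    return entries


# The table from the dump above, reassembled by hand.
table = (
    b"xref\n0 7\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000013263 00000 n \n"
    b"0000013294 00000 n \n"
    b"0000013443 00000 n \n"
    b"0000013499 00000 n \n"
    b"0000013644 00000 n \n"
    b"trailer\n"
)
xref = parse_xref_section(table, 0)
print(xref[1])
```

In a real file the offset argument would be the value found after startxref, 13716 in this case.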
Copyright (c) 2014 By Simon Lewis. All Rights Reserved.