Bug: Problem parsing PDF headers which contain bytes before %PDF

Started by LeoC, April 10, 2018, 05:18:23 AM


LeoC

Hi -

I've run into a problem parsing some PDF files: it seems that some 'valid' PDF files contain additional random bytes before the magic %PDF header.

System Type: Linux
Exiftool Version: 10.91
Command: exiftool (filename.pdf)
Output: "Error: File format error"

An example of such a PDF can be found here:
http://noc.twaren.net/noc_2008/Download/download_file.php?id=3960301

There are various discussions on this topic elsewhere:
https://stackoverflow.com/questions/32178603/pdf-and-docx-magic-numbers
https://stackoverflow.com/questions/6186980/determine-if-a-byte-is-a-pdf-file
https://stackoverflow.com/questions/2731917/how-to-detect-if-a-file-is-pdf-or-tiff

My interpretation of the discussion is that although the spec states that the first bytes should be "%PDF", some (most?) reader implementations will accept files which contain the "%PDF" somewhere in the first 1024 bytes.

I made some quick changes to the regexes used to detect PDFs:


  • in ExifTool.pm (839)
    PDF => '%PDF-\d+\.\d+',
    becomes
    PDF => '.*%PDF-\d+\.\d+',
  • in PDF.pm (2088)
    $buff =~ /^%PDF-(\d+\.\d+)/ or return 0;
    becomes
    $buff =~ /%PDF-(\d+\.\d+)/ or return 0;

These changes seem to solve my problem, but of course I have done no regression testing and don't know if there is any side effect.
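To illustrate what dropping the `^` anchor changes, here is a minimal Python sketch. It is not ExifTool's Perl code: the `pdf_version` helper and the three sample byte strings are made up for this example, and only the two patterns are taken from the PDF.pm lines above.

```python
import re

# Anchored form, as in the original PDF.pm line: header must start the buffer.
STRICT = re.compile(rb"^%PDF-(\d+\.\d+)")
# Unanchored form, as in the proposed change: header may appear anywhere.
LOOSE = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(buff, pattern):
    """Return the version string found by `pattern`, or None if no match."""
    m = pattern.search(buff)
    return m.group(1).decode("ascii") if m else None


# Hypothetical sample buffers (assumptions, not real files).
clean = b"%PDF-1.4\n"                      # header at offset 0
padded = b"   \n\n  %PDF-1.4\n"            # whitespace before the header
not_pdf = b"GIF89a ... see %PDF-1.4 spec"  # another format mentioning %PDF

for name, buff in [("clean", clean), ("padded", padded), ("not_pdf", not_pdf)]:
    print(name, pdf_version(buff, STRICT), pdf_version(buff, LOOSE))
```

The unanchored pattern accepts the padded file, but it also accepts the third buffer, which only mentions the header text. That is one possible side effect of the change.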

Could these changes be considered for a future release?

Thanks -

Leo


Phil Harvey

Hi Leo,

The PDF file won't be read/written properly with the changes you made.  The problem is that (for the sample you provided at least) all offsets are relative to the PDF header, while ExifTool assumes they are relative to the start of file.  If the PDF header isn't at the start of the file, then we have a problem seeking to the correct offsets for the objects in the PDF file.
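The offset problem can be sketched in a few lines of Python. This is an illustration only, not how ExifTool handles it: `find_header` and `to_file_offset` are hypothetical helpers, the 1024-byte search window is borrowed from the Acrobat behaviour mentioned in this thread, and the tiny PDF body is a made-up fragment rather than a complete file.

```python
def find_header(data, window=1024):
    """Return the file position of '%PDF-' within the first `window` bytes, or -1."""
    return data[:window].find(b"%PDF-")


def to_file_offset(data, pdf_offset):
    """Convert an offset counted from the PDF header into a file position."""
    base = find_header(data)
    if base < 0:
        raise ValueError("no PDF header found")
    return base + pdf_offset


# Made-up fragment: object "1 0 obj" starts 9 bytes after the header begins.
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
obj_offset = body.index(b"1 0 obj")

# The same bytes with three bytes of whitespace in front of the header.
padded = b"  \n" + body

# Seeking to the stored offset from the start of the file lands in the wrong place.
naive = padded[obj_offset:obj_offset + 7]
# Adding the header position first lands on the object.
pos = to_file_offset(padded, obj_offset)
fixed = padded[pos:pos + 7]
print(naive, fixed)
```

With the header at position 0 the two calculations agree, which is why the problem only shows up in files with leading bytes.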

The specific example you gave just has whitespace before the PDF header.  I don't object to making an accommodation for this.  However, if I allow up to 1024 random bytes before the PDF header (as apparently Adobe Reader does), this would substantially increase the possibility of mis-identifying some other file type as PDF.  So I don't like this idea.

Do you have any files for which a change from "%PDF" to "\s*%PDF" doesn't identify the file as PDF?  If so, could you post a couple?  Thanks.

I like this quote from a reference you gave:

The problem is the PDF spec says the %PDF-1.x only needs to be in the first 1024 bytes and not the first 4 - This is wrong, the specification (ISO 32000-1) clearly says "The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7". Even the Adobe PDF references similarly say "The first line of a PDF file is a header identifying the version of the PDF specification to which the file conforms" and offer the same variants as the specification. Merely the implementation notes of the Adobe PDF references say that "Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file." Thus, "Some programs will add information before %PDF and still be valid." is wrong, the created PDFs are not valid, they merely are accepted and displayed by a number of viewers in spite of being broken; they also are rejected by numerous other PDF processors. – mkl Mar 11 '16 at 11:34

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

LeoC

Phil - Thanks for taking the time to look at this.

I've been trying to find more examples without much success.
I do have one more, but unfortunately I can't share the whole document with you (the content is personal).

I can share the headers though - at least the first 16 bytes look like this:
0x20 0x20 0x20 0x20 0x0A 0x0A 0x20 0x20 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x34
..so this is again just various whitespace characters before the header.
I changed my ExifTool.pm to use "\s*%PDF" instead of ".*%PDF" (ExifTool.pm:839) and as you would expect, it gives me the same output.
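As a quick check of the whitespace-only pattern against the 16 bytes quoted above, here is a minimal Python sketch. It assumes Python's `re` module stands in for the Perl regex; the `check_header` helper and the junk-prefixed counterexample are made up for illustration.

```python
import re

# The 16 header bytes quoted above, as written in the post.
HEX = "0x20 0x20 0x20 0x20 0x0A 0x0A 0x20 0x20 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x34"
header = bytes(int(tok, 16) for tok in HEX.split())

# Leading whitespace only, then the header; re.match anchors at the buffer start.
WS_THEN_PDF = re.compile(rb"(\s*)%PDF-(\d+\.\d+)")


def check_header(buff):
    """Return (leading_whitespace_bytes, version) or None if the pattern fails."""
    m = WS_THEN_PDF.match(buff)
    if m is None:
        return None
    return len(m.group(1)), m.group(2).decode("ascii")


print(check_header(header))           # the quoted bytes
print(check_header(b"JUNK%PDF-1.4"))  # hypothetical non-whitespace prefix
```

The quoted bytes match with eight bytes of whitespace before the header, while a prefix containing anything other than whitespace is rejected. This keeps the check much narrower than allowing arbitrary leading bytes.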

I'm going to keep looking for another example that I can share with you in whole. I presume the changes to accommodate the offset are a bit more complex than just a change to a regex!

Thanks again -

Leo


Phil Harvey

Hi Leo,

ExifTool 10.92 (released 2 days ago) should read/write these files OK.

- Phil

dgpickett@aol.com

I got this today after downloading exiftool yesterday: "Warning: [minor] PDF header is not at start of file - <file_name>.pdf".  Using cat -vte, it seems that a lot of blank lines (25) precede the header, occasionally with a space (1), so maybe just skip leading whitespace?

$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
%PDF-1.7$
%M-bM-cM-OM-S$
1 0 obj$
<</Type/XObject/Subtype/Form/Resources<</Font<</ArialMT 2 0 R>>>>/BBox[0 0 147.79 14.85]/FormType 1/Matrix [1 0 0 1 0 0]/Length 107/Filter/FlateDecode>>stream$
xM-^\+T^HTM-P^OM-)PpM-ruV(T0^@BC^Ss=sK M-%gaM-*PM-^TM-*^PM-.M-^P^GM-^Tq$

Phil Harvey

This is just a warning.  ExifTool should read/write this file anyway.

- Phil