Title: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: LeoC on April 10, 2018, 05:18:23 AM

Hi -

I've run into a problem parsing some PDF files. It seems that some 'valid' PDF files contain additional random bytes before the magic %PDF header.

System Type: Linux
Exiftool Version: 10.91
Command: exiftool (filename.pdf)
Output: "Error: File format error"

An example of such a PDF can be found here:
http://noc.twaren.net/noc_2008/Download/download_file.php?id=3960301

There are various discussions on this topic elsewhere:
https://stackoverflow.com/questions/32178603/pdf-and-docx-magic-numbers
https://stackoverflow.com/questions/6186980/determine-if-a-byte-is-a-pdf-file
https://stackoverflow.com/questions/2731917/how-to-detect-if-a-file-is-pdf-or-tiff

My interpretation of the discussion is that although the spec states that the first bytes should be "%PDF", some (most?) reader implementations will accept files which contain the "%PDF" somewhere in the first 1024 bytes.
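
The lenient behaviour described above can be sketched in a few lines. This is a Python illustration only, not ExifTool code: the function name `find_pdf_header` and the constant `SEARCH_WINDOW` are made up for the example, and the 1024-byte window is taken from the linked discussions rather than from the PDF specification.

```python
import re

# Same header shape as the '%PDF-\d+\.\d+' check quoted in this thread.
HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")

# Assumption: scan only the first 1024 bytes, as lenient viewers reportedly do.
SEARCH_WINDOW = 1024


def find_pdf_header(data: bytes):
    """Return (offset, version) of the first header in the window, or None."""
    match = HEADER_RE.search(data[:SEARCH_WINDOW])
    if match is None:
        return None
    return match.start(), match.group(1).decode("ascii")


if __name__ == "__main__":
    print(find_pdf_header(b"%PDF-1.4\n"))                # header at offset 0
    print(find_pdf_header(b"   \n\n  %PDF-1.7\n"))       # header after 7 whitespace bytes
    print(find_pdf_header(b"x" * 2000 + b"%PDF-1.4\n"))  # header beyond the window
```

Returning the offset as well as the version matters, because any bytes in front of the header shift everything that follows.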

I made some quick changes to the regexes used to detect PDFs:

In ExifTool.pm (line 839):
PDF => '%PDF-\d+\.\d+',
becomes
PDF => '.*%PDF-\d+\.\d+',

In PDF.pm (line 2088):
$buff =~ /^%PDF-(\d+\.\d+)/ or return 0;
becomes
$buff =~ /%PDF-(\d+\.\d+)/ or return 0;

These changes seem to solve my problem, but of course I have done no regression testing and don't know if there are any side effects.

Could these changes be considered for a future release?

Thanks -

Leo

Title: Re: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: Phil Harvey on April 10, 2018, 07:46:35 AM

Hi Leo,

The PDF file won't be read/written properly with the changes you made. The problem is that (for the sample you provided at least) all offsets are relative to the PDF header, while ExifTool assumes they are relative to the start of file. If the PDF header isn't at the start of the file, then we have a problem seeking to the correct offsets for the objects in the PDF file.
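
To make the offset problem concrete, here is a toy Python sketch. It is not how ExifTool handles this: `header_base` and `read_at` are hypothetical helpers, and the byte string is a made-up miniature, not a real PDF. It only shows that an offset recorded relative to the header points at the wrong bytes once padding is added in front, unless the header position is added back in.

```python
def header_base(data: bytes) -> int:
    """Offset of the '%PDF-' marker in the file, or -1 if it is absent."""
    return data.find(b"%PDF-")


def read_at(data: bytes, offset: int, length: int, base: int = 0) -> bytes:
    """Read `length` bytes at `offset`, where `offset` is measured from `base`."""
    start = base + offset
    return data[start:start + length]


# Toy "file": the object offset is recorded relative to the header (here 9).
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
obj_offset = body.index(b"1 0 obj")

# The same content with 8 bytes of whitespace in front of the header.
padded = b"    \n\n  " + body
base = header_base(padded)

print(read_at(padded, obj_offset, 7))        # offset from start of file: wrong bytes
print(read_at(padded, obj_offset, 7, base))  # offset from the header: the object
```

Relaxing only the detection regex gets the file recognised as PDF, but every later seek still needs the header position as its base.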

The specific example you gave just has whitespace before the PDF header. I don't object to making an accommodation for this. However, if I allow up to 1024 random bytes before the PDF header (as apparently Adobe Reader does), this would substantially increase the possibility of mis-identifying some other file type as PDF. So I don't like this idea.

Do you have any files for which a change from "%PDF" to "\s*%PDF" doesn't identify the file as PDF? If so, could you post a couple? Thanks.

I like this quote from a reference you gave (https://stackoverflow.com/questions/6186980/determine-if-a-byte-is-a-pdf-file):

The problem is the PDF spec says the %PDF-1.x only needs to be in the first 1024 bytes and not the first 4 - This is wrong, the specification (ISO 32000-1) clearly says "The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7". Even the Adobe PDF references similarly say "The first line of a PDF file is a header identifying the version of the PDF specification to which the file conforms" and offer the same variants as the specification. Merely the implementation notes of the Adobe PDF references say that "Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file." Thus, "Some programs will add information before %PDF and still be valid." is wrong, the created PDFs are not valid, they merely are accepted and displayed by a number of viewers in spite of being broken; they also are rejected by numerous other PDF processors. – mkl Mar 11 '16 at 11:34

- Phil

Title: Re: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: LeoC on April 12, 2018, 05:20:57 AM

Phil - Thanks for taking the time to look at this.

I've been trying to find more examples without much success.
I do have one more, but unfortunately I can't share the whole document with you (the content is personal).

I can share the headers though - at least the first 16 bytes look like this:
0x20 0x20 0x20 0x20 0x0A 0x0A 0x20 0x20 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x34
So this is again just various whitespace characters before the header.
I changed my ExifTool.pm to use "\s*%PDF" instead of ".*%PDF" (ExifTool.pm:839) and as you would expect, it gives me the same output.
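
For comparison, the three patterns discussed in this thread can be tried against the 16 bytes quoted above. This is a Python sketch, not the Perl used in ExifTool. It assumes each pattern is anchored at the start of the buffer (which `re.match` does) and that `.` may match newlines (hence `re.DOTALL`). The `matches` helper and the `junk` input are invented for the illustration.

```python
import re

# The first 16 bytes quoted in this post: whitespace, then "%PDF-1.4".
SAMPLE = bytes([0x20, 0x20, 0x20, 0x20, 0x0A, 0x0A, 0x20, 0x20,
                0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x34])

STRICT = re.compile(rb"%PDF-\d+\.\d+")                    # original check
WHITESPACE_OK = re.compile(rb"\s*%PDF-\d+\.\d+")          # leading whitespace only
ANYTHING_OK = re.compile(rb".*%PDF-\d+\.\d+", re.DOTALL)  # any leading bytes


def matches(pattern, data: bytes) -> bool:
    """True if `pattern` matches starting at the first byte of `data`."""
    return pattern.match(data) is not None


if __name__ == "__main__":
    # Hypothetical non-PDF data that merely contains a header-like string.
    junk = b"GIF89a junk %PDF-1.4"
    for name, pattern in [("strict", STRICT),
                          ("whitespace", WHITESPACE_OK),
                          ("anything", ANYTHING_OK)]:
        print(name, matches(pattern, SAMPLE), matches(pattern, junk))
```

Only the whitespace-tolerant pattern accepts the sample while still rejecting the junk input, which is the trade-off raised earlier in the thread about mis-identifying other file types.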

I'm going to keep looking for another example that I can share with you in whole. I presume the changes to accommodate the offset are a bit more complex than just a change to a regex!

Thanks again -

Leo

Title: Re: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: Phil Harvey on April 12, 2018, 06:58:08 AM

Hi Leo,

ExifTool 10.92 (released 2 days ago) should read/write these files OK.

- Phil

Title: Re: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: dgpickett@aol.com on July 20, 2020, 06:48:48 PM

I got this today after downloading exiftool yesterday:

Warning: [minor] PDF header is not at start of file - <file_name>.pdf

Using cat -vte, it seems that a lot of blank lines (25) precede the header, occasionally with a space (1), so maybe just skip leading whitespace?

$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
%PDF-1.7$
%M-bM-cM-OM-S$
1 0 obj$
<</Type/XObject/Subtype/Form/Resources<</Font<</ArialMT 2 0 R>>>>/BBox[0 0 147.79 14.85]/FormType 1/Matrix [1 0 0 1 0 0]/Length 107/Filter/FlateDecode>>stream$
xM-^\+T^HTM-P^OM-)PpM-ruV(T0^@BC^Ss=sK M-%gaM-*PM-^TM-*^PM-.M-^P^GM-^Tq$

Title: Re: Bug: Problem parsing PDF headers which contain bytes before %PDF
Post by: Phil Harvey on July 21, 2020, 06:33:48 AM

This is just a warning. ExifTool should read/write this file anyway.

- Phil

ExifTool Forum
