Bug: Parsing PDF headers which contain *non-printable* chars before %PDF

crembo · June 13, 2019, 05:04:20 AM

Hello Phil,

The original ticket covered "printable" characters showing before the %PDF header:
https://exiftool.org/forum/index.php/topic,9086.0.html

However, some non-English PDF documents appear to have non-printable characters (0xca, 0xff) before the %PDF marker, so the fix introduced in the above ticket does not identify them as PDFs. Obviously, these documents are not compliant with the standard, but apparently some tools still produce them. Would it be possible to modify the regex to use .*%PDF instead of \s*%PDF in both PDF.pm and ExifTool.pm?

Regards,
Mike
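
For readers who want to see the difference between the two patterns in the request above, here is a minimal sketch in Python rather than ExifTool's Perl. It assumes the pattern is applied anchored at the start of the file buffer (hence `re.match`). The helper name `matches` and the sample byte strings are made up for illustration and are not taken from ExifTool.

```python
import re

# The two patterns quoted in the post, compiled for byte strings.
# Anchoring at the start of the buffer is an assumption, not ExifTool's code.
WHITESPACE_PREFIX = re.compile(rb"\s*%PDF")
ANY_PREFIX = re.compile(rb".*%PDF")


def matches(pattern, header):
    """Return True if the pattern matches at the start of the header bytes."""
    return pattern.match(header) is not None


clean = b"%PDF-1.4\n"            # compliant header
junk = b"\xca\xff%PDF-1.4\n"     # non-printable bytes as described in the report

print(matches(WHITESPACE_PREFIX, clean))
print(matches(WHITESPACE_PREFIX, junk))
print(matches(ANY_PREFIX, junk))
```

The whitespace-only pattern rejects the header with the 0xca 0xff prefix, while the unrestricted pattern accepts it. Note that `.` does not match a newline by default, so `.*%PDF` only tolerates junk that sits on the same line as the marker.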

Phil Harvey · June 13, 2019, 09:36:36 AM

Hi Mike,

As I wrote:

Quote from: Phil Harvey on April 10, 2018, 07:46:35 AM
If I allow up to 1024 random bytes before the PDF header (as apparently Adobe Reader does), this would substantially increase the possibility of mis-identifying some other file type as PDF. So I don't like this idea.

I would prefer not to do this.

- Phil
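
The mis-identification concern in the reply above can be illustrated with a small Python sketch. The functions `looks_like_pdf_strict` and `looks_like_pdf_lenient` are hypothetical and do not reflect how ExifTool detects file types. The 1024-byte window is the figure from the quoted reply.

```python
MARKER = b"%PDF-"
MAX_PREFIX = 1024  # figure mentioned in the quoted reply


def looks_like_pdf_strict(data: bytes) -> bool:
    """Hypothetical strict check: the marker must be the first bytes."""
    return data.startswith(MARKER)


def looks_like_pdf_lenient(data: bytes, max_prefix: int = MAX_PREFIX) -> bool:
    """Hypothetical lenient check: the marker may start at any offset
    from 0 to max_prefix, whatever bytes come before it."""
    return data.find(MARKER, 0, max_prefix + len(MARKER)) != -1


# A malformed PDF of the kind described in the report.
malformed_pdf = b"\xca\xff%PDF-1.4\n"

# A plain-text note that merely mentions the marker; not a PDF at all.
text_note = b"Reminder: a valid file starts with %PDF-1.4\n"

print(looks_like_pdf_lenient(malformed_pdf))
print(looks_like_pdf_strict(text_note))
print(looks_like_pdf_lenient(text_note))
```

The lenient check accepts the malformed PDF, but it also accepts the text note, which is the kind of false positive the reply is worried about.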
