Title: Bug: Parsing PDF headers which contain *non-printable* chars before %PDF
Post by: crembo on June 13, 2019, 05:04:20 AM

Hello Phil,

The original ticket covered "printable" characters showing before the %PDF header:
https://exiftool.org/forum/index.php/topic,9086.0.html

However, some non-English PDF documents appear to have non-printable bytes (0xca, 0xff) before the %PDF marker, so the fix introduced in the above ticket cannot identify them as PDFs. Obviously, these documents are not compliant with the standard, but apparently some tools still produce them. Would it be possible to modify the regex to use

.*%PDF

instead of

\s*%PDF

in both PDF.pm and ExifTool.pm?

Regards,
Mike

Title: Re: Bug: Parsing PDF headers which contain *non-printable* chars before %PDF
Post by: Phil Harvey on June 13, 2019, 09:36:36 AM

Hi Mike,

As I wrote:

Quote from: Phil Harvey on April 10, 2018, 07:46:35 AM
If I allowed up to 1024 random bytes before the PDF header (as apparently Adobe Reader does), this would substantially increase the possibility of mis-identifying some other file type as PDF. So I don't like this idea.

I would prefer not to do this.

- Phil
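
The mis-identification concern can be illustrated with a small Python sketch. This is not how ExifTool is implemented: the function names and the sample data are hypothetical, and only the 1024-byte window comes from the quoted post.

```python
import re

HEADER_WINDOW = 1024  # window size mentioned in the quoted post


def looks_like_pdf_strict(data: bytes) -> bool:
    """Accept only optional whitespace before the %PDF marker."""
    return re.match(rb"\s*%PDF", data) is not None


def looks_like_pdf_permissive(data: bytes) -> bool:
    """Accept %PDF anywhere within the first HEADER_WINDOW bytes."""
    return b"%PDF" in data[:HEADER_WINDOW]


# Hypothetical non-PDF file that merely mentions the marker in its text.
not_a_pdf = b"<html><body>PDF files start with %PDF-1.4</body></html>"

print(looks_like_pdf_strict(not_a_pdf))      # False
print(looks_like_pdf_permissive(not_a_pdf))  # True
```

The permissive check accepts the malformed PDFs described in the first post, but it also flags any file that happens to contain the marker near its start, which is the trade-off described above.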
