Bug: Parsing PDF headers which contain *non-printable* chars before %PDF

Started by crembo, June 13, 2019, 05:04:20 AM


crembo

Hello Phil,

The original ticket covered "printable" characters showing before the %PDF header:
https://exiftool.org/forum/index.php/topic,9086.0.html

However, some non-English PDF documents appear to have non-printable characters (0xca, 0xff) before the %PDF marker, so the fix introduced in the above ticket cannot identify them as PDFs. Obviously, these documents do not comply with the standard, but apparently some tools still produce them. Would it be possible to modify the regex to use .*%PDF instead of \s*%PDF in both PDF.pm and ExifTool.pm?

Regards,
Mike

Phil Harvey

Hi Mike,

As I wrote:

Quote from: Phil Harvey on April 10, 2018, 07:46:35 AM
If I allow up to 1024 random bytes before the PDF header (as apparently Adobe Reader does), this would substantially increase the possibility of mis-identifying some other file type as PDF.  So I don't like this idea.

I would prefer not to do this.

- Phil
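
To illustrate the trade-off discussed in this thread, here is a minimal sketch in Python (not Perl, and not ExifTool's actual code). The function names and the 1024-byte window parameter are hypothetical, chosen only to mirror the two approaches: accepting only whitespace before %PDF, versus accepting arbitrary bytes within a limited window.

```python
import re

# Hypothetical helpers for illustration only; ExifTool's real checks live in
# PDF.pm and ExifTool.pm and may differ from this sketch.

# Strict check: only whitespace may precede the %PDF marker.
STRICT = re.compile(rb"\s*%PDF-")

def looks_like_pdf_strict(head: bytes) -> bool:
    """Return True if the data starts with optional whitespace, then %PDF-."""
    return STRICT.match(head) is not None

def looks_like_pdf_loose(head: bytes, window: int = 1024) -> bool:
    """Return True if %PDF- starts anywhere within the first `window` bytes."""
    marker = b"%PDF-"
    # bytes.find only reports matches that lie wholly inside [0, end).
    return head.find(marker, 0, window + len(marker)) != -1

clean = b"%PDF-1.4\n"
junk_prefix = b"\xca\xff%PDF-1.4\n"                   # non-printable bytes first
not_a_pdf = b"<html>PDF files start with %PDF-1.4</html>"

for name, data in [("clean", clean), ("junk", junk_prefix), ("html", not_a_pdf)]:
    print(name, looks_like_pdf_strict(data), looks_like_pdf_loose(data))
```

In this sketch the loose check accepts the file with the 0xca 0xff prefix, but it also accepts the HTML snippet that merely mentions %PDF-, which is the kind of mis-identification the reply above is concerned about.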