Literal string in /ID file identifier produces corrupted PDF file

Started by thomasbachem, January 13, 2016, 06:35:42 AM


thomasbachem

GhostScript may create PDF files that have a literal string (as opposed to a hex string) in the /ID property, e.g.:

/ID [('{fuRA\307xG\(\017\303K7\030_)('{fuRA\307xG\(\017\303K7\030_)]

It seems to happen rarely, and I don't know what triggers this behavior, but according to the discussion at http://bugs.ghostscript.com/show_bug.cgi?id=690820 this is valid, and none of the typical PDF tools (qpdf, pdfinfo, Acrobat) complain about the file.

After updating the metadata with ExifTool, however, the file gets corrupted. I suppose that has something to do with the file identifier getting modified to:

/ID [ ('{fuRA\307xG\(\017\303K7\030_) (({fuRA\307xG\(\017\303K7\030_) ]

So spaces were added between the brackets (no idea whether that is problematic), but more importantly, (' became (( in the second part.

Output from qpdf --check pdf-after-exiftool.pdf:


WARNING: pdf-after-exiftool.pdf: file is damaged
WARNING: pdf-after-exiftool.pdf (trailer, file position 26767): EOF while reading token
WARNING: pdf-after-exiftool.pdf: Attempting to reconstruct cross-reference table
pdf-after-exiftool.pdf (trailer, file position 26767): EOF while reading token
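
The "EOF while reading token" message is consistent with how PDF literal strings are delimited: unescaped parentheses are allowed inside a literal string only when they are balanced, so an extra unescaped "(" leaves the string without a matching close. A minimal Python sketch of that rule follows. The function name literal_string_end is made up for illustration and is not part of qpdf or ExifTool:

```python
def literal_string_end(data: bytes, start: int) -> int:
    """Return the index just past the ')' that closes the PDF literal
    string starting at data[start], or -1 if it is never closed.
    Illustrative helper only, not taken from any PDF library."""
    depth = 0
    i = start
    while i < len(data):
        c = data[i]
        if c == 0x5C:          # backslash: skip the escaped character
            i += 2             # (octal digits after it are not parentheses,
            continue           #  so skipping one character is enough here)
        if c == 0x28:          # "(" opens a nesting level
            depth += 1
        elif c == 0x29:        # ")" closes one
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1                  # ran off the end: unterminated string


# The second /ID string before and after the ExifTool update (from this post)
before = rb"('{fuRA\307xG\(\017\303K7\030_)"
after = rb"(({fuRA\307xG\(\017\303K7\030_)"

print(literal_string_end(before, 0) == len(before))  # string closes normally
print(literal_string_end(after, 0))                  # -1: never closes
```

With the original string the scan ends on the final ")". With the modified one the nesting depth never returns to zero, so a parser keeps reading past the "]" looking for the close.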


You'll find the original PDF file (pdf-before-exiftool.pdf) with the literal string /ID attached. You can execute e.g.

exiftool -Title="foo" pdf-before-exiftool.pdf -o pdf-after-exiftool.pdf

to corrupt the file.

I'm using exiftool v9.96 on OS X 10.11 and Ubuntu 14.04.

Please let me know if I can further assist in getting this fixed.

Phil Harvey

Hi Thomas,

Thanks for this report.  You are correct that the change to ID is the problem.  ExifTool needs to change the ID when a file is changed, but is causing problems when changing an ID in this format.  I haven't seen this type of ID before.  I need to look into this in more detail to see what possible formats the ID may take, and come up with a strategy to handle all of them properly.

You may assume that this problem will be fixed in ExifTool 10.10 unless I post back here otherwise.

Thanks again.  And thanks for the sample so I could reproduce this problem.

- Phil

Edit:  I have uploaded an ExifTool 10.10 pre-release that should solve this problem.  It would be helpful if you could test this to see if it works for you.  Thanks.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

thomasbachem

Thank YOU for all the great work and your quick response, Phil!

Am I right to assume that the file becomes corrupted only when an unescaped parenthesis shows up in /ID after ExifTool has changed the ID? So to hotfix our code, I just need to ensure that whenever an unescaped parenthesis appears inside one of the strings in /ID, we replace it with some other random character to fix the PDF?

thomasbachem

Wow, I just saw the pre-release now, that was really quick!

# avoid generating characters that could cause problems

So that seems to answer my question from above ;).

Yes, it works! Thanks again!

Phil Harvey

Right.  If, after writing with an older version of ExifTool, the first character in the second string is an unescaped "(" or ")", then change it to any other character.  There is a possibility that ExifTool could generate a "\" incorrectly as the first character, but this would be more difficult to detect.
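
A rough Python sketch of such a hotfix, for illustration only and not the code used in ExifTool. The names patch_id and literal_string_end and the replacement character "A" are assumptions. It checks every /ID array it finds (a file may contain more than one trailer), and it overwrites a single byte so the file length and the xref offsets stay unchanged. It does not attempt to detect the "\" case:

```python
import re

def literal_string_end(data: bytes, start: int) -> int:
    """Index just past the ')' closing the literal string at data[start],
    or -1 if unterminated. Hypothetical helper for this sketch."""
    depth, i = 0, start
    while i < len(data):
        c = data[i]
        if c == 0x5C:                  # backslash escapes the next character
            i += 2
            continue
        if c == 0x28:                  # "("
            depth += 1
        elif c == 0x29:                # ")"
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1

def patch_id(data: bytes, replacement: bytes = b"A") -> bytes:
    """Replace an unescaped "(" or ")" at the start of the second /ID string."""
    out = bytearray(data)
    for m in re.finditer(rb"/ID\s*\[\s*\(", data):
        first_end = literal_string_end(data, m.end() - 1)
        if first_end < 0:
            continue                   # first string is itself unterminated
        pos = first_end
        while data[pos:pos + 1].isspace():
            pos += 1                   # skip whitespace between the strings
        if data[pos:pos + 1] != b"(":
            continue                   # second entry is not a literal string
        first_char = pos + 1
        ch = data[first_char:first_char + 1]
        if ch == b")":
            # Assumption: "()" directly followed by "]" is a genuine empty
            # string, not a damaged one, so leave it alone.
            nxt = first_char + 1
            while data[nxt:nxt + 1].isspace():
                nxt += 1
            if data[nxt:nxt + 1] == b"]":
                continue
        if ch in (b"(", b")"):
            out[first_char] = replacement[0]   # same length, offsets unchanged
    return bytes(out)
```

Calling patch_id on the bytes of a damaged file and writing the result back would turn the "((" from the report into "(A" and leave well-formed IDs untouched.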

Please grab the pre-release again.  Just before I posted this I updated it with an extra test to nail down what I think is the last potential pitfall here (the case where the first character is "(", ")" or "\" to begin with).

- Phil

thomasbachem

The new pre-release works as well.

When do you expect the 10.10 release? And do you feel comfortable about running it in production to manipulate PDF metadata?

I need to assess whether we should update the servers to 10.10p, wait a few days for 10.10 or deploy our own hotfix and use a stable version.

Phil Harvey

Version 10.10p should be stable and reliable, but may not include all the features/bug fixes of the final 10.10 version.  There haven't been any major changes since 10.09 that I see as causing any problems for you.

- Phil

thomasbachem

I'm just asking because we're using v9.96 right now and you write on http://www.exiftool.org/history.html that "The most recent production release is Version 10.00." :).

Phil Harvey

Ah.  Well, you should definitely update then to take advantage of the bug fixes since 9.96.

- Phil

Edit:  Oh, I see what you mean.  I am working toward 10.10 being another production release, so it will be as stable as possible.  (Not that the dev releases are unstable, but just that I like to allow the dust to settle for a while after a new feature is added before calling it a production release -- this gives me time to tweak the new feature without potentially affecting as many users.)