PDF metadata salad

Started by mazeckenrode, July 04, 2023, 07:14:40 AM


Phil Harvey

Darn. The Subject value is too long and gets cut off.  I don't see any control characters in what is shown so far:

Quote  | 6)  Subject = <feff004100420043004400450046004700480049004a00200054006f0077006e0073[snip]
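For anyone following along, the visible part of that hex string can be decoded to check what it contains. This is only a minimal sketch in Python (not something ExifTool does for you here), and it uses just the portion quoted above, so it stops where the "[snip]" starts:

```python
# Hex string exactly as quoted above (the rest was cut off at "[snip]").
hex_subject = "feff004100420043004400450046004700480049004a00200054006f0077006e0073"

raw = bytes.fromhex(hex_subject)

# The leading FE FF is a big-endian byte order mark. Python's "utf-16"
# codec reads it, picks the byte order, and drops it from the result.
decoded = raw.decode("utf-16")

print(decoded)  # ABCDEFGHIJ Towns
```

The decoded text is plain ASCII up to the cut-off point, which matches the observation that no control characters are visible in the part shown.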

If you could send me the PDF I can take a look at it myself.  My email is philharvey66 at gmail.com

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

mazeckenrode

The PDF in question is the one named "2023-07-04 13;44;01 - Metadata Test - [4] PDF-Xchange+QPDF+ExifToolImport+QPDF.pdf" inside the 7-zip I uploaded, attached to post #4.

If there are any control characters, shouldn't they also show up in the exported JSON?

Phil Harvey

The PDF:Subject contains 2 special characters: 0x201c and 0x201d (left and right double parentheses), which is why ExifTool has to store it as hex-encoded UTF-16.

- Phil

mazeckenrode

Parentheses? According to this table, 0x201c and 0x201d are left and right double quotation marks. Typo?

Anyway, those two characters also exist as extended ASCII (hex 93/94, decimal 147/148), which I admittedly use. I also semi-routinely use the following extended ASCII characters in metadata:

‘   0x91   145
’   0x92   146
–   0x96   150
—   0x97   151
©   0xa9   169
®   0xae   174
é   0xe9   233
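As a side note, the relationship between these byte values and the Unicode code points mentioned earlier can be checked with a short Python sketch. It assumes that "extended ASCII" here means the Windows-1252 code page (Python's `cp1252` codec); that is an assumption, since the thread does not name the code page. The helper name `cp1252_to_codepoint` is made up for this illustration.

```python
# Byte values from the list above, plus 0x93/0x94 (the double quotation marks).
cp1252_bytes = [0x91, 0x92, 0x93, 0x94, 0x96, 0x97, 0xA9, 0xAE, 0xE9]


def cp1252_to_codepoint(byte_value: int) -> int:
    """Return the Unicode code point for one Windows-1252 byte."""
    return ord(bytes([byte_value]).decode("cp1252"))


# Map each byte to its Unicode code point.
table = {b: cp1252_to_codepoint(b) for b in cp1252_bytes}

for b, cp in table.items():
    print(f"0x{b:02x}  {b:3d}  U+{cp:04X}  {chr(cp)}")
```

Under that assumption, bytes 0x93 and 0x94 map to U+201C and U+201D, so they are the same two characters discussed above, just in a different encoding. The last three entries (0xa9, 0xae, 0xe9) keep the same numeric value in Unicode, while the quotes and dashes map to code points above U+2000.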

You previously posted:

Quote from: Phil Harvey on July 27, 2023, 02:54:53 PM
ExifTool only encodes in hex if the string contains non-ASCII characters.

Later that day, you posted:

Quote from: Phil Harvey on July 28, 2023, 03:42:18 PM
Looking at the ExifTool source code, I was wrong about my non-ASCII assumption. Strings are written as hex if they contain any control character: \x00-\x08, \x0a-\x1f, \x7f or \xff.
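To make the quoted rule concrete, here is a rough Python sketch that tests a byte string against exactly the ranges listed in that quote. It is only a transcription of the quoted ranges, not ExifTool's actual code, and the names `CONTROL_CHARS` and `has_listed_control_char` are invented for this example.

```python
import re

# Byte ranges as listed in the quote: \x00-\x08, \x0a-\x1f, \x7f, \xff.
# Note that \x09 (tab) is not in the list.
CONTROL_CHARS = re.compile(rb"[\x00-\x08\x0a-\x1f\x7f\xff]")


def has_listed_control_char(data: bytes) -> bool:
    """Return True if data contains any byte from the quoted ranges."""
    return CONTROL_CHARS.search(data) is not None


print(has_listed_control_char(b"ABCDEFGHIJ Towns"))   # plain ASCII text
print(has_listed_control_char(b"\x93quoted\x94"))     # bytes 0x93/0x94
print(has_listed_control_char(b"line one\nline two")) # contains \x0a
```

Taken literally, the quoted ranges do not include bytes 0x91-0x97, 0xa9, 0xae or 0xe9, so the single-byte forms of the characters listed above would not match this particular test.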

Are you now saying it's both? Either way, from my perspective, if any of the extended ASCII characters listed above are supposed to trigger encoding as hex strings, ExifTool isn't being consistent: not every PDF that has those characters in PDF:Subject (and/or PDF:Title, PDF:Author, etc.) and goes through the same ExifTool processing ends up with a garbled display in DOpus. Please see the attached new 7-zip, which contains another batch of step-by-step copies of a different PDF, every one of which uses all of the extended ASCII characters above in multiple metadata fields. As with the files in the previously uploaded 7-zip, the filenames reflect the order of processing as follows:

[1] Original PDF as created and metadata-populated by PDF-XChange Editor
[2] PDF with PDF/XMP creation and modification dates copied from filename
[2a] JSON exported from [2]
[2b] Edited copy of [2a] JSON to be used for re-import
[3] PDF with edited metadata imported from [2b] JSON
[4] Linearized copy of [3] PDF

None of the above have a garbled metadata display in DOpus.

Attached: "2023-08-04 10;18;59 - MAZE - Metadata Test ASCII 128-255.7z" (3,510)

Contents:

"2023-08-04 10;18;59 - MAZE - Metadata Test ASCII 128-255\"
   "2023-08-04 10;18;59 - MAZE - [1] Metadata Test ASCII 128-255.pdf" (5,293)
   "2023-08-04 10;18;59 - MAZE - [2] Metadata Test ASCII 128-255.pdf" (11,178)
   "2023-08-04 10;18;59 - MAZE - [2a] Metadata Test ASCII 128-255 [export].pdf.json" (1,889)
   "2023-08-04 10;18;59 - MAZE - [2b] Metadata Test ASCII 128-255 [import].pdf.json" (1,727)
   "2023-08-04 10;18;59 - MAZE - [3] Metadata Test ASCII 128-255.pdf" (11,858)
   "2023-08-04 10;18;59 - MAZE - [4] Metadata Test ASCII 128-255.pdf" (7,512)


Phil Harvey

Sorry, you're right.  Double quotation marks.

Quote from: mazeckenrode on August 04, 2023, 11:14:37 AM
Are you now saying it's both?

I'm saying I don't remember details about the code that I wrote 16 years ago, and don't have time to study it in detail to give you a precise answer.

And I don't know if it is worth my spending more time on this since this is clearly a DOpus deficiency and not a problem with ExifTool.

(sorry to be blunt)

- Phil