XMP PDF encoding

Started by Kento, May 07, 2019, 01:15:15 PM

Kento

Hi

I am using exiftool to write metadata to PDF files and I notice that if the data I am writing contains specific characters they are encoded inside the XMP.

So for example if I run the following:
exiftool -Keywords="Test1 & Test2" file.pdf

and then open the PDF file using notepad, I see that the keywords XMP is the following:
<pdf:Keywords>Test1 &amp; Test2</pdf:Keywords>

Is there a flag to turn off the encoding so that the XMP contains exactly the text I wrote?

Phil Harvey

If you did this then you would have invalid XMP.  The ampersand must be escaped in any XML value (XMP follows the RDF/XML syntax).

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Kento

The issue is we have an SDK that reads and writes metadata to PDF but it doesn't encode or decode double quotes (").
When we write metadata with exiftool our existing SDK can't read the data because the " is encoded.
The problem is that the existing SDK has already been deployed to a lot of machines. Updating all of them is not feasible.

Are there any flags or exposed APIs that we can use to disable encoding of "?

Thank you for taking the time to look into this.

Hayo Baan

Quote from: Kento on May 07, 2019, 02:07:02 PM
The issue is we have an SDK that reads and writes metadata to PDF but it doesn't encode or decode double quotes (").
When we write metadata with exiftool our existing SDK can't read the data because the " is encoded.
The problem is that the existing SDK has already been deployed to a lot of machines. Updating all of them is not feasible.

As it turns out the quotes (single and double) and > don't have to be escaped in text, so your software is actually correct.

Quote from: Kento on May 07, 2019, 02:07:02 PM
Are there any flags or exposed APIs that we can use to disable encoding of "?

Thank you for taking the time to look into this.

Not to my knowledge. BUT you could perhaps extract the XMP as a binary block, manipulate it, then write it back.

exiftool -b -xmp FILE | sed 's/&quot;/"/g;' | exiftool '-xmp<xmp' -tagsfromfile - FILE

Darn, that doesn't work; the double quote still gets written as &quot;, so we're still at square one.

Update: YES, found a way that works:

exiftool -b -xmp FILE | sed 's/&quot;/"/g;' | exiftool '-xmp<=-' FILE
Hayo Baan – Photography
Web: www.hayobaan.nl

Kento

I just tried doing the same thing with Adobe Acrobat, and I see that double quotes are not encoded in the XMP but & is. I tried it with Keywords as well as custom properties in the PDF. I have attached the file so you can examine the XMP data.

It seems that Adobe does not encode " for PDF. Maybe that's why our SDK encodes everything except ".

Could the XMP standard in PDF be different from other file types?

StarGeek

To follow up on Hayo's solution, to replace ampersand as well as the quotes, I believe this will work (test it out, I don't usually use sed)

exiftool -b -xmp FILE | sed 's/&quot;/"/g;s/&amp;/\&/g;' | exiftool '-xmp<=-' FILE
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

Phil Harvey

It is true that double quotes don't need to be escaped in text strings (see this stackoverflow answer), but currently there is no option in ExifTool to change this.

- Phil

Hayo Baan

Right, so your software is correct after all; "'> don't need to be escaped in text (& does!). Here's the command to convert them back to unescaped characters:

exiftool -b -xmp FILE | sed $'s/&quot;/"/g;s/&apos;/\'/g;s/&gt;/>/g;' | exiftool '-xmp<=-' FILE

Note: the $ in front of the ' in the sed expression is required to be able to escape the ' inside the expression.

The command should work on both Linux and Mac. For Windows you'll have to find another way to do the replacement, but the approach (extract, modify, embed) would be the same.
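For anyone scripting the extract, modify, embed approach, here is a minimal sketch, assuming a POSIX shell with sed, cmp and mktemp available. The function names unescape_xmp_text and fix_pdf_xmp are made up for illustration. It uses separate -e expressions so the $'...' quoting is not needed, goes through temporary files instead of a pipe, and only runs exiftool a second time when the filter actually changed something, which might help with run time when most files contain no escaped quotes.

```shell
# Filter: turn &quot; &apos; and &gt; back into literal characters.
# &amp; and &lt; are deliberately left alone, since those must stay escaped.
# Caveat: this is applied blindly to the whole packet, attribute values included.
unescape_xmp_text() {
    sed -e 's/&quot;/"/g' -e "s/&apos;/'/g" -e 's/&gt;/>/g'
}

# Hypothetical wrapper: extract the XMP block, filter it, and write it back
# only if the filter changed anything.
fix_pdf_xmp() {
    file=$1
    tmp=$(mktemp) || return 1
    exiftool -b -xmp "$file" > "$tmp.in"
    unescape_xmp_text < "$tmp.in" > "$tmp.out"
    if ! cmp -s "$tmp.in" "$tmp.out"; then
        exiftool "-xmp<=$tmp.out" "$file"
    fi
    rm -f "$tmp" "$tmp.in" "$tmp.out"
}
```

Usage would be something like fix_pdf_xmp file.pdf. Because the filter does not distinguish element text from attribute values, a double quote inside an attribute value would also be unescaped, which would not be valid XML; the sketch assumes that case does not occur in your files.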

Kento

Thank you so much for your solution. We tried it and it worked.
I will look into a way to do it in Windows.

Kento

In terms of performance, this made our software much slower (it takes about twice as long on Mac).
I really appreciate all the effort you have put into ExifTool, but I was wondering if there are any plans to change this in the future, either by not encoding double quotes or by providing a flag to turn the encoding off?


Phil Harvey

I don't see the point in modifying ExifTool to accommodate some other buggy software.  I would think a better solution would be to fix your other software so that it accepts valid XML.  From the stackoverflow article I referenced:

"The safe way is to escape all five characters in text"

... which is what ExifTool is doing, and any reasonable XML reader should accept this.

- Phil

StarGeek

Quote from: Kento on May 08, 2019, 07:55:55 AM
I will look into a way to do it in Windows.

You can find a Windows port of sed in Windows UnixUtils.

Download link

void

Adding some information on the specs on all of this. It doesn't matter if you encode extra characters if you control both sides of the equation: reading/writing data. If we're writing data which we expect others to read, we should follow the spec. If it is in fact safer to encode all the 5 characters, that would be part of the spec.

With the current solution, any data written by software that follows the spec will end up being double-decoded, which is a very difficult issue to catch.

XMP Specs:

XMP specs do not refer to double quotes directly, they only mention "&", "<", ">".

Quote:

The rules from section 2.4 reduce escaping of "&", "<", ">" and the other characters in the RestrictedChar set.
Reference: https://wwwimages2.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMP%20SDK%20Release%20cc-2016-08/XMPSpecificationPart1.pdf

XML Specs:

Single-quote and double-quote characters are encoded in attributes only.

Quote:

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
Reference: https://www.w3.org/TR/REC-xml/#syntax

Adobe's Library for XMP - XMPToolkit:

XMPToolkit does not encode double quotes in element values, which matches the behaviour of Acrobat itself.

Note: Adobe created XMP

Comment from XMPToolkit Code:

// Append a property or qualifier value to the output with appropriate XML escaping. The escaped
// characters for elements and attributes are '&', '<', '>', and ASCII controls (tab, LF, CR). In
// addition, '"' is escaped for attributes. For efficiency, this is done in a double loop. The outer
// loop makes sure the whole value is processed. The inner loop does a contiguous unescaped run
// followed by one escaped character (if we're not at the end).
Reference: https://www.adobe.com/devnet/xmp.html
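
The rule described in that comment can be sketched with sed. This is only an illustration of the rule, not the XMPToolkit code: the function names escape_xml_text and escape_xml_attr are hypothetical, and the ASCII control characters (tab, LF, CR) mentioned in the comment are left out to keep it short.

```shell
# Escape element text: only & < > are replaced.
# & goes first so the entities produced by the later steps are not re-escaped.
escape_xml_text() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Escape an attribute value: same as element text, plus the double quote.
escape_xml_attr() {
    escape_xml_text | sed -e 's/"/\&quot;/g'
}
```

For example, piping Test1 & "Test2" through escape_xml_text leaves the double quotes alone, while escape_xml_attr turns them into &quot;.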

Phil Harvey

From the XML 1.1 specification:

4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]

All XML processors must recognize these entities whether they are declared or not. [...]
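
On the reading side, recognizing all five predefined entities can be sketched as follows. This is a rough sed illustration, not a real XML parser: the name decode_xml_entities is hypothetical, and numeric character references such as &#60; are not handled. Note that &amp; is decoded last, so that text like &amp;quot; comes out as the literal string &quot; instead of being decoded twice.

```shell
# Decode the five predefined XML entities in character data.
# &amp; must be handled last to avoid decoding the same text twice.
decode_xml_entities() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" -e 's/&amp;/\&/g'
}
```

A reader built along these lines accepts both Test1 &amp; "Test2" and Test1 &amp; &quot;Test2&quot; and produces the same value for each.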


Also, for what it's worth, ExifTool has been writing XMP for 14 years, and I can't recall any reader having a problem like this with ExifTool-generated XMP in the past.

- Phil