Display of keywords metadata from PDF file not intuitive

robbie.morrison · February 12, 2017, 04:46:32 PM

Hello Phil, all

ExifTool version 10.42
Ubuntu 16.10
Linux 4.8.0-37-generic

First up, I read FAQ 3a carefully and searched the Forum. The FAQ discusses the problem I am raising. This is therefore not a bug in ExifTool, but rather a question of what one expects to be reported. Say I add two keywords to a PDF file:

$ exiftool \
    -keywords="keyword one" \
    -keywords="keyword two" \
    target.pdf

And then run ExifTool on the resulting file in default mode. I get:

Keywords                        : keyword two

If I do the same with pdfinfo, I get:

  Keywords:       keyword one, keyword two

The same is reported if I use evince (a PDF viewer in Linux) and select File > Properties > General and examine the Keywords field.

The problem becomes clear if I run ExifTool to display more information:

$ exiftool -duplicates -groupHeadings target.pdf

This yields:

---- PDF ----
Keywords                        : keyword one, keyword two

---- XMP ----
Keywords                        : keyword two

Would it not be better to use the PDF variant of Keywords rather than the XMP variant, when calling ExifTool without any options? This would certainly be more intuitive and would align with the practice adopted by pdfinfo and evince.

Thank you for developing and maintaining this wonderful utility.

with best wishes, Robbie Morrison

Phil Harvey · February 12, 2017, 05:26:28 PM

Hi Robbie,

Thanks for the feedback. It's actually a bit worse than you explained, because the order depends on the specific PDF file. For example:

> exiftool -keywords="keyword one" -keywords="keyword two" a.pdf
    1 image files updated

> exiftool -G1 -keywords a.pdf
[PDF]           Keywords                        : keyword one, keyword two

> exiftool -a -G1 -keywords a.pdf
[PDF]           Keywords                        : keyword one, keyword two
[XMP-pdf]       Keywords                        : keyword two

The thing is that without the -duplicates (-a) option, the tag returned depends on the order in the file (which is apparently different in my test above).

There are some cases where I override this behaviour to prefer a specific type of metadata, and for PDF files I would likely choose the native PDF information over XMP if I had to choose.

But perhaps it makes sense to demote XMP-pdf:Keywords specifically because it is a crippled tag (i.e. it doesn't properly support lists of values). I'll think about this.

- Phil

Edit: I just checked, and I had already lowered the priority of XMP-pdf:Keywords, so I can't understand why it seems to take priority in your output. Could you send me the sample PDF so I can see what is happening? (philharvey66 at gmail.com)

robbie.morrison · February 12, 2017, 06:16:41 PM

Hello Phil. PDF (2017-fsfe-position-feedback.pdf) sent by email as requested. Robbie

robbie.morrison · February 13, 2017, 02:42:19 AM

Hello again Phil

Why don't you instead maintain the XMP-pdf::Keywords value as a stringified list, appending a comma-separated string on each new keyword addition? That assumes the use of a comma as a separator is standardized. This would then keep the reporting from PDF::Keywords and XMP-pdf::Keywords consistent, and it would not matter which one was recovered. The code for keyword deletions would need to reflect this logic, though. Just a thought.

best wishes, Robbie

Phil Harvey · February 13, 2017, 07:44:02 AM

Hi Robbie,

I got the file, thanks. I see the reason: PDF:Keywords also has a lowered priority, to work around an apparent bug in Adobe Acrobat which can result in duplicate Info dictionaries. I will lower the XMP-pdf:Keywords priority further so that it falls below PDF:Keywords.
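To illustrate why equal priorities make the result depend on file order, here is a minimal Python sketch. It is not ExifTool's code: the `Tag` type, the `pick` function, the numeric priority values and the tie-break rule (a later tag of equal priority replaces an earlier one) are all assumptions made for the illustration.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    group: str     # metadata group, e.g. "PDF" or "XMP-pdf"
    value: str     # the Keywords value stored in that group
    priority: int  # hypothetical numeric priority; higher wins


def pick(tags_in_file_order: List[Tag]) -> Tag:
    """Return the one tag reported when duplicates are suppressed.

    Assumption for this sketch: the highest priority wins, and on a tie
    the tag found later in the file replaces the earlier one.
    """
    winner = tags_in_file_order[0]
    for tag in tags_in_file_order[1:]:
        if tag.priority >= winner.priority:
            winner = tag
    return winner


pdf = Tag("PDF", "keyword one, keyword two", priority=0)
xmp_equal = Tag("XMP-pdf", "keyword two", priority=0)
xmp_lower = Tag("XMP-pdf", "keyword two", priority=-1)

# Equal priorities: the winner flips with the order in the file.
print(pick([pdf, xmp_equal]).group)
print(pick([xmp_equal, pdf]).group)

# XMP-pdf lowered below PDF: the PDF value wins in either order.
print(pick([pdf, xmp_lower]).group)
print(pick([xmp_lower, pdf]).group)
```

Whichever way the tie is actually broken, the point is the same: once the two tags no longer share a priority, the order in the file stops mattering.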

Note that Adobe Bridge uses XMP:Subject now instead of the Keywords tags in PDF files.

Quote from: robbie.morrison on February 13, 2017, 02:42:19 AM
Why don't you instead maintain the XMP-pdf::Keywords value as stringified list, appending a comma-separated string on each new keyword addition.

It would be nice if there was some standardization for this, but I can't just make it up myself. If I use a comma then I still need some standardized way of escaping a comma inside a keyword.
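As a rough illustration of the escaping problem, the Python sketch below round-trips a keyword that itself contains a comma. The backslash convention is purely hypothetical (no such standard exists, which is the point), and the function names are made up for the example.

```python
from typing import List


def naive_join(keywords: List[str]) -> str:
    # Plain comma-separated string, as suggested in the quote above.
    return ", ".join(keywords)


def naive_split(value: str) -> List[str]:
    return [part.strip() for part in value.split(",")]


def escaped_join(keywords: List[str]) -> str:
    # Hypothetical convention: backslash-escape backslashes and commas.
    return ", ".join(
        k.replace("\\", "\\\\").replace(",", "\\,") for k in keywords
    )


def escaped_split(value: str) -> List[str]:
    parts, current, i = [], [], 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            current.append(value[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == ",":
            parts.append("".join(current).strip())  # unescaped comma ends a keyword
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current).strip())
    return parts


keywords = ["Paris, France", "travel"]
print(naive_split(naive_join(keywords)))      # the first keyword is split in two
print(escaped_split(escaped_join(keywords)))  # the original list is recovered
```

The escaped form only round-trips if every reader agrees on the convention. Any other PDF tool would simply display the backslash.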

- Phil

robbie.morrison · February 13, 2017, 09:47:18 AM

Hello Phil, many thanks for addressing the issue, also for your comment on comma-separated lists, Robbie

robbie.morrison · February 14, 2017, 06:09:30 PM

Hello again Phil

I have a related but not identical issue. My LaTeX base file contains the following commands:

\usepackage[backref=page]{hyperref}
\hypersetup{pdfkeywords={git date: \gitCommitterIsoDate{} | git hash: \gitHash{}}}

I know this is an abuse of the concept of a keyword. Nonetheless pdfinfo and evince report:

  Keywords:       git date: 2017-02-13 21:44:45 +0100 | git hash: 01a774f00fc2e7b98326cee1555955f2f0d02265

While exiftool (without any options) reports:

Keywords                        : git, date:, 2017-02-13, 21:44:45, +0100, |, git, hash:, 01a774f00fc2e7b98326cee1555955f2f0d02265

Note the commas. If this issue is of interest, I can create a minimal working example and email you the TeX and PDF files.

with best wishes, Robbie

Phil Harvey · February 15, 2017, 08:03:09 AM

Hi Robbie,

Augh. This is a problem of list separators again. I had forgotten about this, but other software does not use consistent separators in PDF:Keywords, so ExifTool will split on white space if there are no commas in the keyword string. You can disable this with -api NoPDFList
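The splitting rule described above can be sketched in a few lines of Python. This only mimics the behaviour as described in this post and is not ExifTool's implementation. The function name and the `no_pdf_list` parameter are hypothetical stand-ins for the NoPDFList option.

```python
from typing import List


def split_keywords(raw: str, no_pdf_list: bool = False) -> List[str]:
    """Illustrative only: split a PDF Keywords string as described above."""
    if no_pdf_list:
        # Splitting disabled: report the stored string unchanged.
        return [raw]
    if "," in raw:
        # Commas present: treat them as the list separator.
        return [part.strip() for part in raw.split(",") if part.strip()]
    # No commas: fall back to splitting on white space.
    return raw.split()


value = "git date: 2017-02-13 21:44:45 +0100 | git hash: 01a774f"
print(split_keywords(value))                    # one item per word
print(split_keywords(value, no_pdf_list=True))  # the string kept whole
print(split_keywords("keyword one, keyword two"))
```

With no comma in the string, every word becomes its own list item, which is where the commas in the reported output come from.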

- Phil

robbie.morrison · February 15, 2017, 09:01:49 AM

Hi Phil, option -api NoPDFList does indeed work. That is all I need. Robbie
