Display of keywords metadata from PDF file not intuitive

Started by robbie.morrison, February 12, 2017, 04:46:32 PM


robbie.morrison

Hello Phil, all

ExifTool version 10.42
Ubuntu 16.10
Linux 4.8.0-37-generic

First up, I read FAQ 3a carefully and searched the Forum.  The FAQ discusses the problem I am raising.  This is therefore not a bug in ExifTool, but rather a question of what users expect to be reported.  Say I add two keywords to a PDF file:


$ exiftool \
    -keywords="keyword one" \
    -keywords="keyword two" \
    target.pdf


And then run ExifTool on the resulting file in default mode.  I get:


Keywords                        : keyword two


If I do the same with pdfinfo, I get:


  Keywords:       keyword one, keyword two


The same is reported if I use evince (a PDF viewer in Linux) and select File > Properties > General and examine the Keywords field.

The problem becomes clear if I run ExifTool to display more information:


$ exiftool -duplicates -groupHeadings target.pdf


This yields:


---- PDF ----
Keywords                        : keyword one, keyword two

---- XMP ----
Keywords                        : keyword two


Would it not be better to use the PDF variant of Keywords rather than the XMP variant, when calling ExifTool without any options?  This would certainly be more intuitive and would align with the practice adopted by pdfinfo and evince.

Thank you for developing and maintaining this wonderful utility.

with best wishes, Robbie Morrison

Phil Harvey

Hi Robbie,

Thanks for the feedback.  It's actually a bit worse than you explained, because the order depends on the specific PDF file.  For example:

> exiftool -keywords="keyword one" -keywords="keyword two" a.pdf
    1 image files updated

> exiftool -G1 -keywords a.pdf
[PDF]           Keywords                        : keyword one, keyword two

> exiftool -a -G1 -keywords a.pdf
[PDF]           Keywords                        : keyword one, keyword two
[XMP-pdf]       Keywords                        : keyword two


The thing is that without the -duplicates (-a) option, the tag returned depends on the order in the file (which is apparently different in my test above).

There are some cases where I override this behaviour to prefer a specific type of metadata, but for PDF files I would likely choose the native PDF information over XMP if I had to choose.

But perhaps it makes sense to demote XMP-pdf:Keywords specifically because it is a crippled tag (i.e. it doesn't properly support lists of values).  I'll think about this.
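
To make the order dependence concrete, here is a toy model in Python.  It is only an illustration under stated assumptions, not ExifTool's actual code: it assumes tags are read in file order, that a later tag with the same name replaces an earlier one unless its priority is lower, and the function name and priority numbers are hypothetical.

```python
# Toy model only: NOT ExifTool's implementation.
# Assumption: a later tag replaces an earlier same-named tag
# unless the later tag has a lower priority.

def visible_tags(tags):
    """tags: list of (group, name, value, priority) in file order."""
    chosen = {}
    for group, name, value, priority in tags:
        current = chosen.get(name)
        if current is None or priority >= current[2]:
            chosen[name] = (group, value, priority)
    return {name: (group, value) for name, (group, value, _) in chosen.items()}

PDF = ("PDF", "Keywords", "keyword one, keyword two")
XMP = ("XMP-pdf", "Keywords", "keyword two")

# Equal priorities: the result depends on which block comes last in the file.
equal_pdf_first = visible_tags([PDF + (1,), XMP + (1,)])
equal_xmp_first = visible_tags([XMP + (1,), PDF + (1,)])

# XMP-pdf demoted (hypothetical priority 0): PDF wins in either order.
demoted_pdf_first = visible_tags([PDF + (1,), XMP + (0,)])
demoted_xmp_first = visible_tags([XMP + (0,), PDF + (1,)])

for label, result in [
    ("equal, PDF first", equal_pdf_first),
    ("equal, XMP first", equal_xmp_first),
    ("XMP demoted, PDF first", demoted_pdf_first),
    ("XMP demoted, XMP first", demoted_xmp_first),
]:
    print(label, "->", result["Keywords"])
```

In this model, demoting one tag is what makes the default output independent of the order in the file.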

- Phil

Edit:  I just checked, and I had already lowered the priority of XMP-pdf:Keywords, so I can't understand why it seems to take priority in your output.  Could you send me the sample PDF so I can see what is happening?  (philharvey66 at gmail.com)
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

robbie.morrison

Hello Phil. PDF (2017-fsfe-position-feedback.pdf) sent by email as requested. Robbie

robbie.morrison

Hello again Phil

Why don't you instead maintain the XMP-pdf::Keywords value as a stringified list, appending to a comma-separated string on each new keyword addition? That assumes the use of a comma as a separator is standardized. This would keep the reporting from PDF::Keywords and XMP-pdf::Keywords consistent, and it would not matter which one was recovered. The code for keyword deletions would need to reflect this logic though. Just a thought.

best wishes, Robbie

Phil Harvey

Hi Robbie,

I got the file, thanks.  I see the reason:  PDF:Keywords also has a lowered priority, to work around an apparent bug in Adobe Acrobat which can result in duplicate Info dictionaries.  I will further lower the XMP-pdf:Keywords priority so that it falls below this.

Note that Adobe Bridge uses XMP:Subject now instead of the Keywords tags in PDF files.

Quote from: robbie.morrison on February 13, 2017, 02:42:19 AM
Why don't you instead maintain the XMP-pdf::Keywords value as a stringified list, appending to a comma-separated string on each new keyword addition?

It would be nice if there was some standardization for this, but I can't just make it up myself.  If I use a comma then I still need some standardized way of escaping a comma inside a keyword.
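
A small Python sketch of the escaping problem, for illustration only.  The keyword values and function names are made up, and the join/split logic is just the naive approach, not anything ExifTool does:

```python
# Illustration only: naive comma joining loses information
# when a keyword itself contains a comma.

def join_keywords(keywords):
    """Store a list of keywords as one comma-separated string."""
    return ", ".join(keywords)

def split_on_commas(text):
    """Recover a keyword list by splitting on commas."""
    return [part.strip() for part in text.split(",")]

# Hypothetical keywords; the first one contains a comma.
original = ["Wellington, New Zealand", "energy policy"]

stored = join_keywords(original)
round_tripped = split_on_commas(stored)

print(stored)
print(round_tripped)
print("round trip preserved the list:", round_tripped == original)
```

Two keywords go in and three come out, which is why some agreed escaping convention would be needed before a comma-separated string could be treated as a real list.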

- Phil

robbie.morrison

Hello Phil, many thanks for addressing the issue, also for your comment on comma-separated lists, Robbie

robbie.morrison

Hello again Phil

I have a related but not identical issue.  My LaTeX base file contains the following commands:

\usepackage[backref=page]{hyperref}
\hypersetup{pdfkeywords={git date: \gitCommitterIsoDate{} | git hash: \gitHash{}}}

I know this is an abuse of the concept of a keyword.  Nonetheless pdfinfo and evince report:


  Keywords:       git date: 2017-02-13 21:44:45 +0100 | git hash: 01a774f00fc2e7b98326cee1555955f2f0d02265


While exiftool (without any options) reports:


Keywords                        : git, date:, 2017-02-13, 21:44:45, +0100, |, git, hash:, 01a774f00fc2e7b98326cee1555955f2f0d02265


Note the commas.  If this issue is of interest, I can create a minimal working example and email you the TeX and PDF files.

with best wishes, Robbie

Phil Harvey

Hi Robbie,

Augh.  This is a problem of list separators again.  I had forgotten about this, but the separators are not consistent as used by other software in the PDF:Keywords, so ExifTool will split on white space if there are no commas in the keyword string.  You can disable this with -api NoPDFList
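
The splitting rule described above can be sketched roughly as follows in Python.  This is a guess at the behaviour as stated in this post (split on commas if any are present, otherwise on white space), not ExifTool's actual code, and the function name is hypothetical:

```python
# Rough sketch of the rule as described in this thread.
# NOT ExifTool's implementation; function name is hypothetical.

def split_pdf_keywords(raw):
    """Split a PDF Keywords string into a list of keywords."""
    if "," in raw:
        # Commas present: treat them as the list separator.
        return [part.strip() for part in raw.split(",") if part.strip()]
    # No commas: fall back to splitting on white space.
    return raw.split()

with_commas = split_pdf_keywords("keyword one, keyword two")
without_commas = split_pdf_keywords(
    "git date: 2017-02-13 21:44:45 +0100 | git hash: "
    "01a774f00fc2e7b98326cee1555955f2f0d02265"
)

print(with_commas)
print(", ".join(without_commas))
```

Under this rule the hyperref string contains no commas, so it is broken up at every space, which matches the comma-separated output shown earlier in the thread.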

- Phil

robbie.morrison

Hi Phil, option -api NoPDFList does indeed work. That is all I need. Robbie