Preserving spaces and newlines in metadata

Started by robonabike, August 13, 2015, 06:57:09 AM


robonabike

Apologies if this has been asked before, I searched as thoroughly as I could and couldn't find a similar question.

I'm trying to use exiftool to extract the metadata from PDFs. There is a "keywords" field whose value is a string of keywords and their values, of the form:

KEYWORD1=value1\r\nKEYWORD2=value2\r\nKEYWORD3=value3 contains spaces\r\nKEYWORD4=value4 

...and so forth.  I need to get this with both the spaces in the values and the newlines preserved.  If I use -b, I get every space replaced with a newline, and if I don't, I get both newlines and spaces replaced with the active separator character.  I really need to know which is which.

Am I overlooking something obvious here?  Can anyone point me in the right direction?

Thanks,
Rob


Phil Harvey

Hi Rob,

-b doesn't do any character substitution.  However, for list-type tags with multiple values, a newline is used as a separator.

The spaces should absolutely not be replaced by newlines, so I'm not sure what you are doing here.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

robonabike

Thank you so much for your swift response, Phil.

This is kind of odd, then.  Exiftool version is 9.9.7.0

Looking directly at the "keywords" text within the pdf in an editor I find:

Keywords(BBR.BRANCH=0\r\nBCH.CLAIM.NO=\r\nBCL.CAUSE.OF.LOSS=\r\nBCL.CLAIM.NUMBER=\r\nBCL.CLAIM.TYPE=\r\nBCL.CLIENT.REF=\r\nBCL.DATE.OF.LOSS=\r\nBCL.DESCRIPTION=\r\nBCM.NAME=\r\nBCM.REFNO=\r\nBDI.NOTE=\r\nBDI.S.ENTRY.DATE=11/07/2012\r\nBLT.DOC.TYPE=\r\nBLT.REMARKS=\r\nBMM.DOC.TYPE=\r\nBPY.AGENT=\r\nBPY.ANPREM=\r\nBPY.EXEC=\r\nBPY.EXEC2=\r\nBPY.INSCO=\r\nBPY.NOTES1=\r\nBPY.POL.REF=\r\nBPY.POLNO=\r\nBPY.RDAT=\r\nBPY.REFNO=\r\nBPY.SHORT.POLREF=\r\nDocumentName=APM GAP IN COVER.pdf\r\nDP0.NAME=Ian Tester                    \r\nDP0.REFNO=TEIA001\r\nDPP.AGENT=SBRI\r\nDPP.BRANCH=0\r\nDPP.EXEC=252 \r\nDPP.INSURER=KennCo Underwriting Ltd\r\nDPP.NOTES=Reg:            Make: MITSUBISHI    Model: MIRAGE                       \r\nDPP.POLNO=\r\nDPP.POLREF=TEIA001PC1\r\nDPP.SHORT.POLREF=PC1\r\nP.PY.PROSPECT.REF=\r\nP.PY.SHORT.POLREF=\r\n)

exiftool.exe -keywords "C:\testarea\ImportServices\ODC\APM_00_TEIA001_APM GAP IN COVER.pdf" -b

gives me:

BBR.BRANCH=0
BCH.CLAIM.NO=
BCL.CAUSE.OF.LOSS=
BCL.CLAIM.NUMBER=
BCL.CLAIM.TYPE=
BCL.CLIENT.REF=
BCL.DATE.OF.LOSS=
BCL.DESCRIPTION=
BCM.NAME=
BCM.REFNO=
BDI.NOTE=
BDI.S.ENTRY.DATE=11/07/2012
BLT.DOC.TYPE=
BLT.REMARKS=
BMM.DOC.TYPE=
BPY.AGENT=
BPY.ANPREM=
BPY.EXEC=
BPY.EXEC2=
BPY.INSCO=
BPY.NOTES1=
BPY.POL.REF=
BPY.POLNO=
BPY.RDAT=
BPY.REFNO=
BPY.SHORT.POLREF=
DocumentName=APM
GAP
IN
COVER.pdf
DP0.NAME=Ian
Tester
DP0.REFNO=TEIA001
DPP.AGENT=SBRI
DPP.BRANCH=0
DPP.EXEC=252
DPP.INSURER=KennCo
Underwriting
Ltd
DPP.NOTES=Reg:
Make:
MITSUBISHI
Model:
MIRAGE
DPP.POLNO=
DPP.POLREF=TEIA001PC1
DPP.SHORT.POLREF=PC1
P.PY.PROSPECT.REF=
P.PY.SHORT.POLREF=


Each line in the output above is separated by a LF character, whether the original text has "\r\n" or a run of whitespace at that point.

Without -b, thus:

exiftool.exe -keywords "C:\testarea\ImportServices\ODC\APM_00_TEIA001_APM GAP IN COVER.pdf"

I get:

Keywords                        : BBR.BRANCH=0, BCH.CLAIM.NO=, BCL.CAUSE.OF.LOSS=, BCL.CLAIM.NUMBER=, BCL.CLAIM.TYPE=, BCL.CLIENT.REF=, BCL.DATE.OF.LOSS=, BCL.DESCRIPTION=, BCM.NAME=, BCM.REFNO=, BDI.NOTE=, BDI.S.ENTRY.DATE=11/07/2012, BLT.DOC.TYPE=, BLT.REMARKS=, BMM.DOC.TYPE=, BPY.AGENT=, BPY.ANPREM=, BPY.EXEC=, BPY.EXEC2=, BPY.INSCO=, BPY.NOTES1=, BPY.POL.REF=, BPY.POLNO=, BPY.RDAT=, BPY.REFNO=, BPY.SHORT.POLREF=, DocumentName=APM, GAP, IN, COVER.pdf, DP0.NAME=Ian, Tester, DP0.REFNO=TEIA001, DPP.AGENT=SBRI, DPP.BRANCH=0, DPP.EXEC=252, DPP.INSURER=KennCo, Underwriting, Ltd, DPP.NOTES=Reg:, Make:, MITSUBISHI, Model:, MIRAGE, DPP.POLNO=, DPP.POLREF=TEIA001PC1, DPP.SHORT.POLREF=PC1, P.PY.PROSPECT.REF=, P.PY.SHORT.POLREF=

And there I get a comma wherever there's "\r\n" or a whitespace section in the original text.  If I explicitly specify a different separator character, then I get that character in place of either "\r\n" or a whitespace section.

I'm sure you can see that what I'm trying to do is pull out each field from the keywords separately.

It's all very curious.

Best,
Rob

Phil Harvey

Hi Rob,

Ah.  Thanks for posting the original PDF metadata.

You are right.  ExifTool is breaking these keywords on any white space.

The PDF specification is incomplete when it comes to specifying how keywords are separated.  It seems that most applications use spaces as separators, so this is the fallback position for ExifTool.  Other applications use commas, so ExifTool will split on commas if they exist in the keywords string.  I haven't yet seen the newline being used as a separator, but thought about checking for newlines and splitting on them if they exist.  The problem is that I have a sample here that has the keywords stored like this:

/Keywords (book Specification PostScript syntax graphics fonts image \rpatterns col\
or)


This is obviously space delimited, but contains a CR too. :(

It currently extracts as

Keywords                        : book, Specification, PostScript, syntax, graphics, fonts, image, patterns, color

which is probably what one wants.
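
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the behaviour described in this post, not ExifTool's actual code: the function name split_pdf_keywords is made up for the example, and it assumes the PDF string escapes (such as \r and the backslash-newline continuation) have already been decoded.

```python
def split_pdf_keywords(raw):
    """Hypothetical sketch: split on commas if any exist, otherwise on white space."""
    if "," in raw:
        parts = raw.split(",")
    else:
        # str.split() with no argument splits on any run of white space,
        # which includes spaces, CR and LF alike.
        parts = raw.split()
    # Trim each piece and drop empty ones.
    return [p.strip() for p in parts if p.strip()]


# The sample above, with "col\<newline>or" already joined into "color".
sample = "book Specification PostScript syntax graphics fonts image \rpatterns color"
print(split_pdf_keywords(sample))

# Rob's data has no commas, so values containing spaces get broken up too.
print(split_pdf_keywords("DocumentName=APM GAP IN COVER.pdf\r\nDP0.REFNO=TEIA001"))
```

The second call shows the ambiguity: once CR/LF and spaces are both treated as separators, there is no way to tell from the output which one was in the original string.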

I'm not sure how to best handle this, unless it is to add a new API option to allow the user to specify the PDF list separator.  But this isn't very appealing because the user would then have the same trouble in determining the proper separator.  Also, inserting newlines on the command line is tricky.

So I don't see a clear course of action to take here.

- Phil

Phil Harvey

I have an idea.  I could change ExifTool to avoid splitting PDF List-type tags when the -b option is used.  I don't think this should break things for anyone else, and I think it would do what you are looking for.

- Phil

robonabike

First of all, thanks for getting back to me again so quickly and for confirming that I'm not going insane, at least not in this respect.

Your -b idea would, I think, be perfect.  Anything that gives me the tag completely unchanged would be fine - that's what the Adobe interface seems to give me at present and I can work with that, but for various reasons, it's not practical to keep on using that anyway.

Let me know if there's anything further I can supply in the way of test data - I'm out of the loop in Norway for about ten days, but I'll check replies on my return.

Best,
Rob

Phil Harvey

Hi Rob,

I implemented this for ExifTool 10.00, but I forgot to update something in the Windows version so you need to specify -api NoPDFList as well as -b for the Windows version of 10.00.  I'll fix this in 10.01 so -b alone will be sufficient.

- Phil
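
For anyone wanting to script this, here is a minimal Python sketch of reading the unsplit value and pulling out each field. It uses only the options mentioned in this thread (-b, -keywords and -api NoPDFList). The function names are made up for the example, and it assumes that exiftool 10.00 or later is on the PATH, that the output is UTF-8, and that the \r\n escapes in the PDF string come out as real line breaks.

```python
import subprocess


def read_raw_keywords(pdf_path, exiftool="exiftool"):
    """Return the Keywords value without list splitting (assumes ExifTool 10.00+)."""
    # -api NoPDFList is only needed for the Windows build of 10.00, but it is
    # passed unconditionally here on the assumption that it does no harm elsewhere.
    cmd = [exiftool, "-b", "-api", "NoPDFList", "-keywords", pdf_path]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def parse_keyword_fields(raw):
    """Split the raw value into {name: value}, one NAME=value pair per line."""
    fields = {}
    for line in raw.splitlines():  # handles \r\n, \r and \n
        if "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)  # only the first "=" separates name and value
        fields[name] = value  # spaces inside the value are left untouched
    return fields


# Example with a fragment of the data posted earlier in the thread.
raw = "BBR.BRANCH=0\r\nDocumentName=APM GAP IN COVER.pdf\r\nDPP.EXEC=252 \r\n"
print(parse_keyword_fields(raw))
```

Trailing spaces in values (as in DPP.EXEC above) are kept on purpose; strip them in the caller if they are not wanted.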

robonabike

Many thanks, Phil.
I'll give this a try as soon as I get the chance.

robonabike

This seems to work perfectly.  Thank you again.
--
Rob