delete a tag that is available multiple times

Started by danisowa, January 15, 2013, 08:18:52 AM

danisowa

Is there a way to remove a tag that appears more than once in a PDF file?

I have a PDF document with the tags:
Author
Author (1)
Author (2)
Author (3)
Author (4)
Author (5)

I want to remove them all and set only the Author field to my new value.

I have tried several things, but without success :-(


Phil Harvey

Where are these tags stored (extract with -a -G1)?

Without seeing the file I can only suggest things to try, but it is possible that this may take 2 commands:

exiftool -author= FILE

exiftool -author="some author" FILE

But if you want to email me the file (philharvey66 at gmail.com), I may be able to help more.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

danisowa

I've extracted the tags on the command line.

All Author values are stored in [PDF].

The commands

exiftool -author= FILE

exiftool -author="some author" FILE

have no effect on the file :-/

Phil Harvey

Interesting.  Could you send me the file?

- Phil

Phil Harvey

Thanks for the sample.

Here is what I get:

> exiftool ~/Desktop/test.pdf -author -a -G1
[PDF]           Author                          : all,,DCFR,,Vx
[PDF]           Author                          : all,,DCFR,,Vx
[PDF]           Author                          : all,,DCFR,,Vx

> exiftool ~/Desktop/test.pdf -author=me
    1 image files updated

> exiftool ~/Desktop/test.pdf -author -a -G1
[PDF]           Author                          : me
[XMP-pdf]       Author                          : me
[PDF]           Author                          : all,,DCFR,,Vx
[PDF]           Author                          : all,,DCFR,,Vx


So ExifTool reports 3 Author tags, but only changes one of them.
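
To spot this kind of duplication without reading the listing by eye, the output of -a -G1 can be tallied per group and tag. The following is only a rough Python sketch, not part of ExifTool: the function name count_tags and the regular expression are assumptions made for illustration, and the sample text is the listing shown above.

```python
import re
from collections import Counter

# One line of `exiftool -a -G1` output looks like: "[Group]   TagName   : value"
LINE = re.compile(r"^\[(?P<group>[^\]]+)\]\s+(?P<tag>.+?)\s*: (?P<value>.*)$")

def count_tags(output):
    """Count how often each (group, tag) pair occurs in the listing."""
    counts = Counter()
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[(match["group"], match["tag"])] += 1
    return counts

# The listing printed after writing -author=me
listing = """\
[PDF]           Author                          : me
[XMP-pdf]       Author                          : me
[PDF]           Author                          : all,,DCFR,,Vx
[PDF]           Author                          : all,,DCFR,,Vx
"""

counts = count_tags(listing)
# Keep only the pairs that appear more than once
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # prints {('PDF', 'Author'): 3}
```

Any pair with a count above 1 is a tag that exists several times in the same group, which is the situation described here.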

Looking more closely at the PDF (using the ExifTool -v option) I can see that it has been modified twice, apparently using Hewlett Packard MFP software.  I do not believe that it was updated correctly because the Info dictionary is duplicated each time instead of being replaced.  This results in 3 copies of the Author tag, but ExifTool will edit only the first one.

I don't think that multiple Info dictionaries are allowed by the PDF specification, so I can't fault ExifTool's behaviour here.
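
A quick way to check whether a file is in this state is to look at the /Info references in its trailers. The Python sketch below is only an approximation under stated assumptions: it scans the raw bytes with a regular expression instead of parsing the PDF, the name info_references is hypothetical, and the sample bytes (including the object numbers and the /Prev offset) are made up for illustration.

```python
import re

# Matches indirect references of the form "/Info <object number> <generation number> R"
INFO_REF = re.compile(rb"/Info\s+(\d+)\s+(\d+)\s+R")

def info_references(pdf_bytes):
    """Return the (object, generation) pair of every /Info reference, in file order."""
    return [(int(obj), int(gen)) for obj, gen in INFO_REF.findall(pdf_bytes)]

# Made-up, heavily trimmed file: an original trailer plus one appended update
sample = (
    b"%PDF-1.4\n"
    b"trailer\n<< /Size 5 /Root 1 0 R /Info 4 0 R >>\n%%EOF\n"
    b"trailer\n<< /Size 22 /Root 1 0 R /Info 21 0 R /Prev 116 >>\n%%EOF\n"
)

refs = info_references(sample)
print(refs)                 # [(4, 0), (21, 0)]
print(len(set(refs)) > 1)   # True: the trailers point at different Info objects
```

Several trailers that all reference the same Info object are normal for an updated file. Different object numbers suggest that each update added its own Info dictionary.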

I tried rewriting the PDF using Adobe Bridge, and it fixed the duplicate Info dictionary problem.  After this, writing or deleting the Author tag with ExifTool behaves as one would expect:

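  use Image::ExifTool;
  my $exifTool = Image::ExifTool->new;
  # delete Author (setting a tag with no value deletes it)
  $exifTool->SetNewValue('Author');
  # rewrite test.pdf in place
  $exifTool->WriteInfo('test.pdf');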


- Phil

danisowa

Hi Phil,

Thanks for your answer.

I have tried to reproduce the "wrong" PDF.

I was able to reproduce it by doing the following:
create a new PDF and set the author to test
save the document
open the document with Acrobat Pro
change the author to test1
save the PDF

Then I have two entries for the author, and ExifTool only changes the first one.

For me, that means Acrobat Pro produces a PDF that does not conform to the PDF standard, right?

I was able to remove the duplicate entries by saving the PDF as a reduced-size file (with Acrobat Pro).

Phil Harvey

So Acrobat Pro behaves differently than Bridge.  Odd.

But if Acrobat writes it like this, it must be OK.  (Adobe defines the standard.)

So ExifTool must be wrong by displaying information from the other Info dictionaries.

Could you send me a copy of the PDF after you set Author to "test" using Acrobat Pro?  (I don't have Acrobat Pro myself.)

Thanks.

- Phil

Phil Harvey

Thanks for the sample.

I re-read the PDF 1.7 specification, and all I can say is that Adobe sucks.  It is clear from the specification that any modified object in an incremental PDF update should have the same object and generation number as before:

Page 63:
Together, the combination of an object number and a generation number uniquely identifies an indirect object. The object retains the same object number and generation number throughout its existence, even if its value is modified.

Page 99:
Because updates are appended to PDF files, a file can have several copies of an object with the same object identifier (object number and generation number). This can occur, for example, if a text annotation (see Section 8.4, "Annotations") is changed several times and the file is saved between changes. Because the text annotation object is not deleted, it retains the same object number and generation number as before.

Also, this is how it is done in the examples in appendix G.6 (page 1075) when modifying text annotations.

But for some reason Acrobat Pro is creating a new Info object instead of replacing the old one (in this case, the new Info object/generation number is 21/0, and the old Info object is 4/0).  Grrrr...  It really seems to me as if they are ignoring their own specification here. :(

The effect is that the old Info object remains visible since it still exists in the cross reference table as a valid entry.
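
The difference between the two update styles can be shown with a toy model of the cross-reference table. This Python sketch is a simplification under stated assumptions: it ignores free entries and treats each cross-reference section as a plain dictionary, the name visible_objects is hypothetical, and the byte offsets are made up. Only the identifiers 4/0 and 21/0 come from the file discussed here.

```python
def visible_objects(sections):
    """Merge cross-reference sections, oldest first; later entries override earlier ones."""
    table = {}
    for section in sections:
        table.update(section)
    return table

# Keys are (object number, generation number); values are made-up byte offsets
original = {(1, 0): 15, (4, 0): 230}

# Update that reuses identifier 4/0, as the quoted specification text describes
reused = visible_objects([original, {(4, 0): 900}])

# Update that allocates a new identifier 21/0 and leaves 4/0 untouched
appended = visible_objects([original, {(21, 0): 900}])

print(sorted(reused))    # [(1, 0), (4, 0)]
print(sorted(appended))  # [(1, 0), (4, 0), (21, 0)]
```

In the first case the old copy of 4/0 is shadowed by the new offset. In the second case both Info objects stay reachable, which matches what ExifTool sees in this file.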

I am convinced that Acrobat Pro is updating the PDF Info dictionary incorrectly, and will submit a bug report.  For this, I need to know the version of Acrobat Pro that you are using, and what system you are running.

The bottom line is that I don't want to patch ExifTool to read only the most recent Info dictionary.  But what I will do is change the priority of the tags in ExifTool so that tags in this Info dictionary take precedence.  This will at least display only the most recently written value of a tag when the -a option is not used.
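
The intended effect of that priority change can be pictured as "first value wins" when duplicates are suppressed. The Python sketch below is only a model of the idea, not ExifTool's actual code: the name first_value_per_tag is hypothetical, and the values test1 and test are taken from the reproduction steps earlier in this thread.

```python
def first_value_per_tag(entries):
    """Keep only the first value seen for each tag name (entries are in priority order)."""
    chosen = {}
    for tag, value in entries:
        # setdefault stores a value only if the tag has not been seen yet
        chosen.setdefault(tag, value)
    return chosen

# Ordered from the most recently written Info dictionary to the oldest
entries = [("Author", "test1"), ("Author", "test")]
print(first_value_per_tag(entries))  # {'Author': 'test1'}
```

With -a, every entry in the list would still be shown. Without it, only the value from the newest Info dictionary would appear.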

- Phil

Edit: Thanks, I got the Acrobat Pro version via email, and have submitted the bug report to Adobe.