exif returns information where no information should be available

Started by danisowa, January 22, 2013, 02:44:48 AM

danisowa

I have another problem with reading pdf properties:

What I have done in the example is:

Created a new PDF with keywords "exiftool,,library,,perl,,test"
Saved as "this is a test PDF with no Keywords.pdf"

Modified the file using Adobe Acrobat and removed the Keywords
Saved the file as "this is a test PDF with no Keywords modified.pdf"


The exif result is really strange!

exiftool this\ is\ a\ test\ PDF\ with\ no\ Keywords.pdf | grep Keywords
File Name                       : this is a test PDF with no Keywords.pdf
Keywords                        : exiftool,,library,,perl,,test

exiftool this\ is\ a\ test\ PDF\ with\ no\ Keywords\ modified.pdf | grep Keywords
File Name                       : this is a test PDF with no Keywords modified.pdf
Keywords                        : exiftool, library, perl, test

@Phil: I will send you both PDF files by email.

Phil Harvey

This is expected because Acrobat doesn't remove the old Info dictionary when updating a PDF.

Adobe haven't responded yet to my bug report.

From the screen dump, it looks as if Acrobat is displaying the XMP-pdf:Keywords:

> exiftool ~/Desktop/ -ext pdf -keywords -G1 -a
======== /Users/phil/Desktop/this is a test PDF with no Keywords modified.pdf
[PDF]           Keywords                        : exiftool, library, perl, test
======== /Users/phil/Desktop/this is a test PDF with no Keywords.pdf
[XMP-pdf]       Keywords                        : exiftool,,library,,perl,,test
[PDF]           Keywords                        : exiftool, library, perl, test
    1 directories scanned
    2 image files read
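
To illustrate why the old keywords can still be read: assuming Acrobat saved the change as an incremental update, the new revision is appended to the end of the file and the earlier objects stay where they were. The following is only a rough Python sketch, not how ExifTool detects the problem. The function name `summarize_pdf_revisions` is made up for this example, and it does nothing more than scan the raw bytes for `%%EOF` markers and `/Info n g R` references.

```python
import re


def summarize_pdf_revisions(data: bytes) -> dict:
    """Count end-of-file markers and Info references in raw PDF bytes.

    This is a heuristic byte scan, not a PDF parser.
    """
    # Each saved revision normally ends with its own %%EOF marker.
    eof_markers = len(re.findall(rb"%%EOF", data))
    # Trailer entries pointing at an Info dictionary look like "/Info 12 0 R".
    refs = re.findall(rb"/Info\s+(\d+)\s+(\d+)\s+R", data)
    distinct = sorted({(int(num), int(gen)) for num, gen in refs})
    return {
        "eof_markers": eof_markers,
        "info_refs": distinct,
        "multiple_info_objects": len(distinct) > 1,
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(summarize_pdf_revisions(handle.read()))
```

More than one `%%EOF` marker suggests the file has been updated incrementally, and more than one distinct Info reference suggests an older Info dictionary is still present. Neither result is proof either way.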


- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

danisowa

Any news about the Adobe bug?

Is there a link where I can "monitor" the Adobe bug?

Thanks a lot!

Phil Harvey

No word back yet on the bug report.  And unfortunately I don't see anywhere on their system where I can view the status of the bug report.  I'm guessing that they just filed this one under "ignore". :(

I have a number of contacts at Adobe, but unfortunately nobody in the PDF group.  Without a good contact, sometimes it is difficult to get action on very technical issues like this.

- Phil

Phil Harvey

OK, since it appears that Adobe isn't going to fix this at their end, I will patch ExifTool 9.17 to check for a duplicate Info dictionary, and issue the following warning:

"[Minor] Ignored duplicate Info dictionary"

and the metadata in the older Info dictionary will be ignored.  The -m option may be used to ignore this warning and extract the duplicate information (same as the old behaviour).
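
For anyone scripting this, here is a small Python sketch of the two ways to call exiftool once the patch is in. The helper name `keywords_command` and the file name are invented for the example. The options themselves (-m, -a, -G1, -Keywords) are the ones used in this thread.

```python
import shlex
from typing import List


def keywords_command(pdf_path: str, include_duplicate_info: bool = False) -> List[str]:
    """Build an exiftool command line that lists every Keywords tag by group."""
    cmd = ["exiftool"]
    if include_duplicate_info:
        # -m ignores minor warnings, so the older Info dictionary is read too.
        cmd.append("-m")
    # -a allows duplicate tags, -G1 shows the specific group of each tag.
    cmd += ["-a", "-G1", "-Keywords", pdf_path]
    return cmd


# Print the command so it can be pasted into a shell.
print(shlex.join(keywords_command("modified.pdf", include_duplicate_info=True)))
```

To run the command instead of printing it, the list can be passed to `subprocess.run`, provided exiftool is on the PATH.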

- Phil

P.S.  I will probably be merging this thread with your other thread

stan

I came across some PDFs I wanted to pass on to someone but wanted to remove all tags from first. As usual I ran exiftool -all= -overwrite_original <name_of_pdf> on copies of the files. Unfortunately the info was not removed.

When I ran exiftool <name_of_pdf> I saw the "[Minor] Ignored duplicate Info dictionary" warning. Searched the net for it, found this thread, read it. Adding -m to the command line allowed ExifTool to display the various tags such as Author, Creator, Producer, Create Date, Modify Date etc. So far so good. :)

Problem is -m doesn't seem to affect the first command (for tag removal) in any way! -a did not help either. So how do I remove this pesky duplicate Info dictionary?

Phil Harvey

You should note that ExifTool never actually removes anything from a PDF (see the PDF Tags documentation), so the most you can expect is that ExifTool would hide the info directory through the incremental update.  However, this doesn't appear to work when there is a duplicate.  I haven't looked into this, so I can't say exactly why.

Do you have the ability to linearize a PDF?  This may actually remove the information after ExifTool "deletes" (actually hides) it via an incremental update.  If so, I would try this:

1) delete all PDF info with exiftool
2) linearize the PDF
3) delete all PDF info again with exiftool
4) linearize again

There is a small chance that this may do what you want.
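
The four steps above could be scripted roughly as follows. This is a sketch under a few assumptions: qpdf is the linearizing tool (any other would do), the function names and the three file names are placeholders, and the first step modifies `src` in place, so it should be run on a copy.

```python
import subprocess
from typing import List


def scrub_commands(src: str, work: str, out: str) -> List[List[str]]:
    """Return the delete / linearize / delete / linearize command sequence.

    qpdf writes to a separate output file, hence the three file names.
    """
    return [
        ["exiftool", "-all=", "-overwrite_original", src],   # 1) delete all PDF info
        ["qpdf", "--linearize", src, work],                  # 2) linearize
        ["exiftool", "-all=", "-overwrite_original", work],  # 3) delete again
        ["qpdf", "--linearize", work, out],                  # 4) linearize again
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(scrub_commands("copy.pdf", "step2.pdf", "clean.pdf"))
```

Whether the result is actually free of the old metadata still has to be checked afterwards.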

- Phil

stan

Quote from: Phil Harvey on September 13, 2014, 07:23:26 AM
You should note that ExifTool never actually removes anything from a PDF (see the PDF Tags documentation), so the most you can expect is that ExifTool would hide the info directory through the incremental update. 

"All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file." Ack! Here I thought the tags were actually being deleted, just like with images. Ok, this is a real problem for me now because it means I've sent PDFs out without actually removing info that shouldn't be shared. I took ExifTool's PDF behavior for granted I suppose based on how well it worked for images, but I guess I should've done some research first. :(

Quote from: Phil Harvey on September 13, 2014, 07:23:26 AM
Do you have the ability to linearize a PDF?  This may actually remove the information after ExifTool "deletes" (actually hides) it via an incremental update.  If so, I would try this:

1) delete all PDF info with exiftool
2) linearize the PDF
3) delete all PDF info again with exiftool
4) linearize again

There is a small chance that this may do what you want.

I don't know what "linearize a PDF" means and will have to look it up, as well as how to accomplish it. But if even after that there is only a small chance it will actually succeed, I have to wonder: why is PDF metadata so resistant to editing? Do I have to switch to Adobe's bloated tools in order to have any chance of getting rid of all Info dictionaries in a PDF? >:(

Edit: Ok, so I ran qpdf <in.pdf> <out.pdf> and then checked with exiftool and it reported no tags. Ran exiftool -PDF-update:all= <name_of_pdf> and got the following output:

Error: File contains no previous ExifTool update - <name_of_pdf>
    0 image files updated
    1 files weren't updated due to errors

My PDF viewer does not display any tags now. So is that it finally or can metadata still be lurking somewhere inside the file? How can I be sure?

Phil Harvey

With PDF files, it is difficult to be sure.  They are very (VERY!) complex, and this is the reason that ExifTool uses the simpler incremental update technique for writing.

After running qpdf, it is likely that the main info metadata you deleted is gone for good.  However, there are so many nooks and crannies where metadata can hide in a PDF (e.g. each embedded object may have its own metadata, and ExifTool doesn't even extract this), that I would never trust PDF 100% if you are concerned about sharing unwanted metadata.
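
As a rough sanity check, the raw bytes of the file can be searched for a few well-known metadata markers. This Python sketch is only a heuristic. The marker list and the name `find_metadata_markers` are made up for the example, and it cannot see anything stored inside compressed or encrypted streams, so zero hits does not prove the file is clean.

```python
import re
from typing import Dict

# Byte patterns that commonly accompany PDF metadata (an illustrative list only).
MARKERS = {
    "Info reference": rb"/Info\s+\d+\s+\d+\s+R",
    "Metadata reference": rb"/Metadata\s+\d+\s+\d+\s+R",
    "Author key": rb"/Author\b",
    "Keywords key": rb"/Keywords\b",
    "Producer key": rb"/Producer\b",
    "XMP packet": rb"<x:xmpmeta",
}


def find_metadata_markers(data: bytes) -> Dict[str, int]:
    """Count how often each marker appears in the uncompressed parts of the file."""
    return {name: len(re.findall(pattern, data)) for name, pattern in MARKERS.items()}


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for name, count in find_metadata_markers(handle.read()).items():
            print(f"{name}: {count}")
```

Any non-zero count is a reason to look closer. All zeros only means nothing obvious is left in plain sight.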

- Phil

Edit: I was thinking about how to make this behaviour more obvious to the user, and will add this warning when someone tries to delete all metadata from a PDF:

Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered

stan

Thanks for the information about PDFs. And yes, I think displaying Warning: [Minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! would be a great idea.