Metadata recoverable after deletion

Started by penguin, November 17, 2012, 08:04:47 PM

penguin

Today I discovered a serious bug in the removal of metadata from PDF files, e.g. for privacy reasons.

After running exiftool -all= [file.pdf] (on Ubuntu 12.04) on several PDF files, I tested whether metadata such as the author was still readable in several PDF readers.

The results are:

Neither exiftool, Adobe Reader, Windows Explorer, Evince, Okular, Foxit Reader nor Foca showed any information (apart from the creation date, which was x.x.1970).

But PDF-XChange Viewer shows all of the original data.

I tried this with several pdf files on Windows 7 and on Windows XP.

Neither operating system ever touched the original files; both only saw the PDF files after exiftool had been used to clean them.

After installing the PDF-XChange shell extension, Windows Explorer shows the metadata too.
Other readers still show no metadata other than what exiftool edited (in my case, nothing).

For everyone who uses exiftool to clean their documents for privacy reasons, this is a serious bug.

edit: My version of exiftool is 8.60

edit2: exiftool 9.60 (windows-version) also does not show the "hidden" meta data. Tested on Windows 7


More information about the PDF files I tested this on:

PDF Version:
1.3
1.4
1.5
1.6

Programs originally used to create the PDF files:

Adobe Distiller (Version 6, 8, and 9.0.0)
cairo 1.9.5
pdfsam-console (Version 1.1.2e)
bullzip pdf printer
pdfTeX-1.40.3
Microsoft Office (Version 2003,2007,2010, and 2013 Preview, also Mac OS version)
OpenOffice (Version 3.1)
LibreOffice (Version 3.5)


edit3:

When I use Notepad++ to manually edit the XML tag that contains the metadata, PDF-XChange Viewer no longer shows the deleted information either. Therefore I assume that the metadata is not properly deleted by exiftool and remains readable to software that looks beyond what the standard specification requires.
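
Editor's note: the observation above, that the old metadata can still be found inside the file, can be checked without any particular viewer. The following is only a rough sketch in Python. The function name find_leftover_metadata and the two patterns are illustrative assumptions, not part of ExifTool, and the scan only sees plain, uncompressed literal strings; metadata that is hex-encoded, stored as UTF-16, held in compressed object streams, or encrypted will not be found this way.

```python
import re

# Assumed, simplified patterns for two common metadata carriers in a PDF:
# literal-string entries of the Info dictionary, e.g. "/Author (Jane Doe)",
# and the start marker of an XMP packet.
INFO_ENTRY = re.compile(
    rb"/(Author|Title|Subject|Keywords|Creator|Producer)\s*\(([^)]*)\)"
)
XMP_START = b"<?xpacket begin"


def find_leftover_metadata(data: bytes) -> dict:
    """Scan raw PDF bytes for uncompressed metadata remnants."""
    entries = [
        (m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
        for m in INFO_ENTRY.finditer(data)
    ]
    return {"info_entries": entries, "xmp_packets": data.count(XMP_START)}
```

To try it, read a "cleaned" file with open(path, "rb").read() and pass the bytes to the function. Any hits suggest that old metadata objects are still physically present in the file.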




Phil Harvey

ExifTool can not be used to irreversibly wipe metadata from PDF images.

This is explained in the PDF Tags documentation.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

penguin

Quote from: Phil Harvey on November 18, 2012, 10:33:31 AM
ExifTool can not be used to irreversibly wipe metadata from PDF images.

This is explained in the PDF Tags documentation.

- Phil

Is there no possibility to permanently delete those tags with exiftool?


I would like to recommend a notification about this, since there are many users out there who rely on exiftool to securely delete those tags.

Phil Harvey

A notification?  Could you be more specific?

I think you can probably delete them permanently by running something like the PDF distiller to linearize the PDF after editing with ExifTool.  There isn't much chance that I will add a permanent delete feature to ExifTool because the PDF structure is very complicated and doing this would be a lot of work.  Basically, the only reason I was able to add a write feature for PDF at all is because I was able to do an incremental update (which avoids the problem of having to rewrite the entire file).

- Phil
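
Editor's note: an incremental update appends a new section to the end of the file and leaves the earlier bytes untouched, so every earlier revision still ends at its own %%EOF marker. The sketch below illustrates that idea in Python. The names revision_ends and earlier_revisions are hypothetical, and the marker count is only a heuristic: a linearized PDF may contain more than one %%EOF without ever having been updated.

```python
EOF_MARKER = b"%%EOF"


def revision_ends(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker in the file."""
    ends = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        ends.append(pos + len(EOF_MARKER))
        pos = data.find(EOF_MARKER, pos + 1)
    return ends


def earlier_revisions(data: bytes) -> list[bytes]:
    """Return the file truncated at every %%EOF marker except the last one."""
    return [data[:end] for end in revision_ends(data)[:-1]]
```

If earlier_revisions returns a non-empty list for a file edited by ExifTool, each element is a candidate copy of the document as it was before the edit, which is why the deletion is reversible.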

penguin

Quote from: Phil Harvey on November 18, 2012, 11:43:33 AM
A notification?  Could you be more specific?

I think a simple warning in the manpage/manual (about some formats not being securely deleted) should do.
And maybe a warning output when deleting PDF metadata (and for all other formats this might apply to).


Quote from: Phil Harvey on November 18, 2012, 11:43:33 AM
I think you can probably delete them permanently by running something like the PDF distiller to linearize the PDF after editing with ExifTool.  There isn't much chance that I will add a permanent delete feature to ExifTool because the PDF structure is very complicated and doing this would be a lot of work.  Basically, the only reason I was able to add a write feature for PDF at all is because I was able to do an incremental update (which avoids the problem of having to rewrite the entire file).
- Phil

Well, there are PDF printers that simply generate a new PDF file without the old metadata. How about integrating parts of one of those open-source printer projects into exiftool? At least there should be an option to rewrite the PDF file, even if it takes more time.

Out of curiosity: is all metadata newly generated when the PDF is generated, as with the PDF printers?




penguin

Quote from: Phil Harvey on November 18, 2012, 11:43:33 AM
Basically, the only reason I was able to add a write feature for PDF at all is because I was able to do an incremental update (which avoids the problem of having to rewrite the entire file).

Is this incremental update within the PDF specification?

I ask because when I try to edit the metadata of PDF files (that were "cleaned" by exiftool) with other tools, I get an error message about unreadable content (regarding the metadata only).

Phil Harvey

Quote from: penguin on November 18, 2012, 11:54:54 AM
I think a simple warning in the manpage/manual (about some formats not being securely deleted) should do.

The question is where.  I'll think about this.  It already exists in the PDF documentation (the Image::ExifTool::TagNames man page).

Quote:
And maybe a warning output when deleting pdf metadata. (And all other formats this might apply to)

You mean a runtime warning?  I'd really have to issue a warning then whenever anything is written to any PDF (since the old information is never overwritten).  I think this would be annoying for most users.  This problem doesn't apply to any other file formats.

Quote:
Well, there are PDF printers that simply generate a new PDF file without the old metadata. How about integrating parts of one of those open-source printer projects into exiftool? At least there should be an option to rewrite the PDF file, even if it takes more time.

This is not an option for ExifTool because I try hard to keep it independent of other packages to keep it portable and easy to install (no compilation necessary).  If you want to use these other libraries you are free to do this outside exiftool.

Quote:
Out of curiosity: is all metadata newly generated when the PDF is generated, as with the PDF printers?

By other tools?  I have no idea.

Quote from: penguin on November 18, 2012, 12:07:08 PM
Is this incremental update within the pdf specifications?

Yes, absolutely.

Quote:
I ask because when I try to edit the metadata of PDF files (that were "cleaned" by exiftool) with other tools, I get an error message about unreadable content (regarding the metadata only).

Which tools?  I have tested this extensively with the Adobe tools and haven't found any compatibility issues.

- Phil
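
Editor's note: the suggestion in this thread to rewrite the PDF with an external tool after the ExifTool edit could be scripted roughly as follows. This is a sketch under assumptions, not something proposed in the thread itself: it assumes the separate qpdf utility is installed and uses its --linearize option to write a fresh copy of the file, which should leave out objects that are no longer referenced. The function names build_commands and clean_pdf are hypothetical, and the result should still be checked with a viewer that exposed the old data.

```python
import subprocess


def build_commands(src: str, dst: str) -> list[list[str]]:
    """Build the two commands: strip metadata in place, then rewrite the file."""
    return [
        # ExifTool appends an incremental update; the old data stays in src.
        # -overwrite_original avoids leaving a "_original" backup copy behind.
        ["exiftool", "-all=", "-overwrite_original", src],
        # Assumption: qpdf is available on PATH. It writes a new file to dst.
        ["qpdf", "--linearize", src, dst],
    ]


def clean_pdf(src: str, dst: str) -> None:
    """Run both steps and raise CalledProcessError if either one fails."""
    for cmd in build_commands(src, dst):
        subprocess.run(cmd, check=True)
```

Calling clean_pdf("in.pdf", "out.pdf") leaves the cleaned result in out.pdf; in.pdf still contains the old data after the first step and should not be shared.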

penguin

Quote from: Phil Harvey on November 18, 2012, 12:12:04 PM
Quote:
I ask because when I try to edit the metadata of PDF files (that were "cleaned" by exiftool) with other tools, I get an error message about unreadable content (regarding the metadata only).

Which tools?  I have tested this extensively with the Adobe tools and haven't found any compatibility issues.

When editing the metadata with BeCyPDFMetaEdit (or even just opening the file with it), I get:

(translated from German)

Operation cannot be completed because of an error. (path to file)

cross-reference table is broken

Illegal/unexpected symbol  while parsing

Phil Harvey

I suggest that you report this problem to BeCyPDFMetaEdit and send them a sample to see what they say.  I'm confident that ExifTool is writing the PDF correctly.

- Phil

P.S. I added a note to the ExifTool home page pointing out the reversibility of PDF edits, and will add a note to the application documentation in the next release as well.