Metadata Deletion on PDF Files

Started by woop, March 29, 2019, 06:48:36 PM


woop

Hey everyone,

I recently became quite interested in cleaning metadata from PDF files in depth. I have been experimenting with metadata quite a bit and have been using exiftool to delete it. In the course of my research, I got stuck on a couple of questions that I can't seem to find answers to, so I was wondering if you could point me in the right direction:

- How does exiftool's way of deleting metadata work? Does it simply unreference certain data fields, meaning the information stays in the file but isn't referenced and is therefore harder to find? Is there a manual way to check for unreferenced but still existing data fields (I imagine a binary analysis might be needed, no?)? I ask because exiftool seems to be able to restore those fields easily using exiftool -pdf-update:all= <file>, is that correct? Ideally I want to be able to reconstruct them myself using Python - do you have any experience with this?
- In the following post, https://dustri.org/b/cleaning-pdf-metadata-in-depth.html, the author describes the approach to deleting metadata that MAT takes. Why does re-rendering the PDF file remove all the metadata from images? Does this happen by default whenever a PDF is re-rendered, or is it something that is passed as a parameter to the re-render function in order to remove the metadata? Do you happen to know?

I'm looking forward to hearing from you!

Julius


StarGeek

Quote from: woop on March 29, 2019, 06:48:36 PM
- How does exiftool's way of deleting metadata work? Does it simply unreference certain data fields, meaning the information stays in the file but isn't referenced and is therefore harder to find? Is there a manual way to check for unreferenced but still existing data fields (I imagine a binary analysis might be needed, no?)? I ask because exiftool seems to be able to restore those fields easily using exiftool -pdf-update:all= <file>, is that correct?

See the PDF Tag page for details. 

Unfortunately, I have nothing to help you with regards to the rest of your post.

woop

Thank you, StarGeek! I did read through the PDF documentation of exiftool, but it does not completely answer my question.

From what I can gather, exiftool uses the incremental update technique: instead of rewriting the whole PDF file (which would be slower and more difficult to code), it simply appends its changes to the end of the file.

The following link (https://resources.infosecinstitute.com/pdf-file-format-basic-structure/) seems to suggest that every incremental update adds the following components to the file:

  • Body update
  • Cross-reference section
  • Updated trailer
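
As a rough way to see those appended sections for yourself, here is a minimal Python sketch that counts the %%EOF markers and reads the offsets stored after each startxref keyword. The function names and the sample bytes are made up for illustration, and the sample is not a valid PDF (its offsets are arbitrary). It also assumes the markers appear as plain bytes; note that a linearized file can contain more than one %%EOF without having been incrementally updated.

```python
import re

def find_eof_markers(data: bytes) -> list[int]:
    """Return the end offset of every %%EOF marker in the raw bytes."""
    return [m.end() for m in re.finditer(rb"%%EOF", data)]

def startxref_offsets(data: bytes) -> list[int]:
    """Return the number that follows each 'startxref' keyword."""
    return [int(m.group(1)) for m in re.finditer(rb"startxref\s+(\d+)", data)]

# Hand-made stand-in for a PDF with an original body plus one update.
# The offsets (50, 150) are placeholders, not real byte positions.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Title (Secret report) >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 /Info 1 0 R >>\n"
    b"startxref\n50\n%%EOF\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 /Info 1 0 R /Prev 50 >>\n"
    b"startxref\n150\n%%EOF\n"
)

eof_count = len(find_eof_markers(sample))
xref_offsets = startxref_offsets(sample)
print(eof_count, xref_offsets)
```

On a real file you would read the bytes with open(path, "rb").read() and pass them to the same two functions.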



All the incremental update method does is append new information rather than delete the old data, is that correct? Thus, by looking at the entries that the cross-reference sections mark as "deleted", you can recover the old metadata, is that correct?

What happens then when you call exiftool -pdf-update:all= <file>? Does it simply delete the last incremental update that was added to delete the metadata?

Do you happen to have any information on how, under the hood, linearization removes this metadata?

Phil Harvey

Quote from: woop on March 30, 2019, 04:48:17 AM
All the incremental update method does is append new information rather than delete the old data, is that correct? Thus, by looking at the entries that the cross-reference sections mark as "deleted", you can recover the old metadata, is that correct?

Yes.
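
To illustrate why the old values remain recoverable, here is a small Python sketch, not ExifTool's implementation. It slices the file at each %%EOF marker to get the earlier revisions and searches the raw bytes for a literal string value. The helper names and the sample data are hypothetical, and the regex only handles uncompressed dictionaries with simple literal strings (no nested parentheses or escapes); metadata inside compressed streams would need decompressing first.

```python
import re

def revisions(data: bytes) -> list[bytes]:
    """Return the file as it stood at each %%EOF marker, oldest first."""
    return [data[:m.end()] for m in re.finditer(rb"%%EOF", data)]

def literal_values(data: bytes, key: bytes = b"Title") -> list[bytes]:
    """Find simple literal-string values such as /Title (text) anywhere in the bytes."""
    pattern = rb"/" + re.escape(key) + rb"\s*\(([^)]*)\)"
    return re.findall(pattern, data)

# Made-up stand-in for a PDF whose Info dictionary was emptied by an update.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Title (Secret report) >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 /Info 1 0 R >>\n"
    b"startxref\n50\n%%EOF\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 /Info 1 0 R /Prev 50 >>\n"
    b"startxref\n150\n%%EOF\n"
)

old_titles = literal_values(sample)      # the "deleted" title is still in the bytes
first_revision = revisions(sample)[0]    # the file as it was before the update
print(old_titles)
```

Writing first_revision back to disk gives the state before the last update in this toy case; whether that is safe on a real file depends on how the update was produced.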

Quote: What happens then when you call exiftool -pdf-update:all= <file>? Does it simply delete the last incremental update that was added to delete the metadata?

It deletes everything that ExifTool has added to the file.

Quote: Do you happen to have any information on how, under the hood, linearization removes this metadata?

I don't. But effectively it just rewrites the file, discarding the unused sections.
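
The idea of rewriting while discarding unused sections can be sketched with a toy model in Python. This is not how any particular tool does it, and the data structure is invented for illustration: each object number maps to the object numbers it references, and only objects reachable from the trailer's roots are kept.

```python
def reachable(objects: dict[int, list[int]], roots: list[int]) -> set[int]:
    """Follow references from the roots and collect every object number reached."""
    seen: set[int] = set()
    stack = list(roots)
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(objects[num])
    return seen

def rewrite(objects: dict[int, list[int]], roots: list[int]) -> dict[int, list[int]]:
    """Keep only the objects that are still reachable; everything else is dropped."""
    keep = reachable(objects, roots)
    return {num: refs for num, refs in objects.items() if num in keep}

# Toy model (hypothetical): object number -> object numbers it references.
objects = {
    1: [2],  # catalog -> page tree
    2: [3],  # page tree -> page
    3: [],   # page
    4: [],   # old Info dictionary, no longer referenced by the trailer
    5: [],   # new Info dictionary
}
roots = [1, 5]  # what the current trailer points at (/Root and /Info)

cleaned = rewrite(objects, roots)
print(sorted(cleaned))
```

A real rewrite would also renumber objects and regenerate the cross-reference table, which this sketch leaves out.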

- Phil