File modifications when adding/removing metadata...

Started by karasako, November 05, 2020, 09:30:08 AM


karasako

Hi all,

I want to be able to deterministically add some metadata and then remove those metadata and get the original hash back.


I created test.config that allows for adding a custom tag.

$ exiftool -config test.config  -XMP-xxx:MyCustomText="1" test1.pdf

Then I do:

$ exiftool -G0:1 -a -s  test1.pdf

and I can see the tag just fine.

$ sha256sum test1.pdf
b74adde186b5ddfa0866e7ceb14251e892e44d74d42544b01451f0826658b812  test1.pdf


If I rerun with the same value:
$ exiftool -config test.config  -XMP-xxx:MyCustomText="1" test1.pdf

The hash has changed:
$ sha256sum test1.pdf
2d41068c07698ef02aec3a80dccf148d1ddc3617b9ce08cefde6c71daed2f345  test1.pdf


Inside the metadata I noticed that File:System's FileModifyDate, FileAccessDate and FileInodeChangeDate have been changed after I updated the tag with the same value. However, this is information from the OS, not the file itself, right?

It seems so, because when I change File:System's FileModifyDate, FileAccessDate and FileInodeChangeDate by running touch test1.pdf, they do change but the hash remains the same.

So it seems that some other part of the file is being modified by exiftool that I don't know of.

Any ideas?

Thank you in advance,
Kostas

StarGeek

The short version is basically FAQ #13, but the fact that you are using a PDF adds another layer of complexity.

Quote from: karasako on November 05, 2020, 09:30:08 AM
I want to be able to deterministically add some metadata and then remove those metadata and get the original hash back.

You can't.  Exiftool is going to write the metadata differently than, say, Adobe, which will write things differently than Foxit.  Every program is going to write things differently depending upon its code.

Quote
Inside the metadata I noticed that File:System's FileModifyDate, FileAccessDate and FileInodeChangeDate have been changed after I updated the tag with the same value. However, this is information from the OS, not the file itself, right?

Yes, those are file system tags and are designed to change when a file is changed.  That is especially true given the way exiftool edits files, which is to write a new copy of the file as it makes the edits.  Basically (if you're old enough to remember them), it's like a library catalog card: the card tells you where the book is and some details about it, but changes to the card are not reflected in the book itself.

Quote
It seems so, because when I change File:System's FileModifyDate, FileAccessDate and FileInodeChangeDate by running touch test1.pdf, they do change but the hash remains the same.

That's because these values are not actually part of the file. They're part of the file system entry that keeps track of such data, along with the permissions and the location of the disk blocks where the file is stored.
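The same point can be checked without exiftool. Below is a minimal Python sketch, assuming a POSIX-like file system where timestamps can be set freely; the helper names `sha256_of` and `hash_before_and_after_touch` are made up for this illustration, and the sample bytes are a placeholder, not a real PDF.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_before_and_after_touch(path):
    """Hash the file, move its timestamps back one day, then hash again."""
    before = sha256_of(path)
    info = os.stat(path)
    # Similar in spirit to `touch`: only the file system entry is updated,
    # the bytes stored in the file are left alone.
    os.utime(path, (info.st_atime - 86400, info.st_mtime - 86400))
    after = sha256_of(path)
    return before, after


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(b"placeholder bytes, not a real PDF\n")
    before, after = hash_before_and_after_touch(tmp.name)
    print("hash unchanged:", before == after)
    os.remove(tmp.name)
```

The hash only covers the file's contents, so a timestamp change alone should never alter it. A changed hash means the bytes themselves differ.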

Quote
So it seems that some other part of the file is being modified by exiftool that I don't know of.

As I said above, the entire file is changed because no two programs are going to write metadata in exactly the same way.  This is why programs must parse the file according to its structure and not assume that a particular piece of data is at a particular offset, which is something that badly written programs will do.

This is further complicated by the fact that exiftool uses Incremental Updates (just the first link I found to describe the subject, not promoting the product) to change the metadata in PDFs.  As it says on the PDF tags page (3rd paragraph), this has the advantage of "being both fast and reversible".  But it also increases the file size, and it means exiftool alone cannot be used to remove sensitive/personal metadata, because the earlier data stays in the file.
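A rough way to see incremental updates in a file is to count its end-of-file markers. The Python sketch below assumes that each appended update ends with its own %%EOF marker; linearized PDFs can contain more than one marker without any update, so the count is only a hint. The function names and the sample bytes are invented for the illustration and are not real PDF content.

```python
EOF_MARKER = b"%%EOF"


def revision_ends(pdf_bytes):
    """Return the offset just past each %%EOF marker found in the data."""
    ends = []
    start = 0
    while True:
        index = pdf_bytes.find(EOF_MARKER, start)
        if index == -1:
            return ends
        end = index + len(EOF_MARKER)
        ends.append(end)
        start = end


def count_revisions(pdf_bytes):
    """Count %%EOF markers as a rough estimate of the number of revisions."""
    return len(revision_ends(pdf_bytes))


if __name__ == "__main__":
    # Placeholder data: an "original" file and the same file with an
    # update appended to it, the way an incremental update works.
    original = b"%PDF-1.4\nbody\n%%EOF\n"
    updated = original + b"appended update\n%%EOF\n"
    print(count_revisions(original), count_revisions(updated))
    # The old bytes are still there, which is why the update is
    # reversible and also why old metadata is not really gone.
    print(updated.startswith(original))
```

Because an update is appended rather than written in place, the earlier revision survives as a prefix of the new file. That is what makes the edit reversible, and also what keeps removed metadata recoverable.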
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

karasako

Thank you for the reply.

I did not intend to use a different program. I am assuming that all modifications are going to happen with exiftool.   

Incremental updates would seem to be an issue anyway, though. Is there a way to turn them off?

I checked the PDF structure and the only thing that changes after a modification is an ID at the end of the PDF:

< /ID [ <807174CBBC05262E0FA1EFC645DFFE8F> <9b7174CBBC05262E0FA1EFC645DFFE8F> ]

The first byte of the second number is incremented by one every time I update the value. Otherwise it seems to be identical. That is probably what creates the hash difference.
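The comparison can be scripted. This is a small Python sketch, assuming the /ID array is stored as two hex strings in plain text as in the line above (it will not find an ID held inside a compressed cross-reference stream); `last_trailer_id` and `differing_byte_positions` are hypothetical helper names.

```python
import re

# Matches a trailer entry such as: /ID [ <0123...> <4567...> ]
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]"
)


def last_trailer_id(pdf_bytes):
    """Return the (first, second) hex strings of the last /ID array, or None."""
    matches = ID_PATTERN.findall(pdf_bytes)
    if not matches:
        return None
    first, second = matches[-1]
    return first.decode("ascii").upper(), second.decode("ascii").upper()


def differing_byte_positions(hex_a, hex_b):
    """List the byte offsets at which two equal-length hex strings differ."""
    bytes_a = bytes.fromhex(hex_a)
    bytes_b = bytes.fromhex(hex_b)
    if len(bytes_a) != len(bytes_b):
        raise ValueError("ID strings have different lengths")
    return [i for i, (a, b) in enumerate(zip(bytes_a, bytes_b)) if a != b]


if __name__ == "__main__":
    sample = (
        b"/ID [ <807174CBBC05262E0FA1EFC645DFFE8F> "
        b"<9b7174CBBC05262E0FA1EFC645DFFE8F> ]"
    )
    first, second = last_trailer_id(sample)
    print(differing_byte_positions(first, second))
```

Running it against the file before and after each update shows which half of the ID changes and at which byte.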

StarGeek

Quote from: karasako on November 05, 2020, 02:08:33 PM
I did not intend to use a different program. I am assuming that all modifications are going to happen with exiftool.

I'm not saying to use a different program; I'm saying every program will write the file differently.

Quote
Incremental updates would seem to be an issue anyway, though. Is there a way to turn them off?

No, this is how exiftool adds metadata to PDF files.