Checksums of metadata in media files.

Started by ScannerBoy, March 07, 2023, 08:45:06 PM


ScannerBoy

As part of my efforts to update metadata in media files for my genealogical app (Gramps), I have come to realize that changing any metadata within any media file will change the MD5 checksum of the media file.
(At present, I am still investigating just what consequences this change will have for Gramps, but I do know that the MD5 checksum is saved in the app's database and used in some ways to identify the specific media file.)

This is of course not unexpected, but what I am trying to figure out is whether there are ways to calculate checksums for different segments of media files, so that one can take the unavoidable changes into account.
Naturally, this will require different approaches for different file types.

In working with metadata for JPG files in the past, I do seem to recall finding several tags which seemed to imply that they represent some sort of checksum or other change indicator.

At the time, I did not really pay enough attention to the specific fields or their possibilities and, right now, I am rather hazy even on which data fields these were. In addition, there are likely other fields which might be useful for this goal.

Hence I thought it helpful to raise this as a rather general question here, in the hope of getting some advice and finding any possible options.

Phil Harvey

#1
For JPEG files this is the technique that I have been recommending to get a metadata-independent checksum:

exiftool -all= -o - FILE | md5

This works for JPEG files, but in general doesn't work for most other types.
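
Since the application in question is written in Python, the same pipeline can also be driven from a script. The following is only a sketch: it assumes the exiftool executable is installed and on the PATH, and both function names are made up for illustration. It hashes whatever the command writes to stdout, which for -all= -o - is the file with its metadata stripped.

```python
import hashlib
import subprocess


def md5_of_command_output(cmd):
    """Run cmd (a list of arguments) and return the MD5 hex digest of its stdout."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return hashlib.md5(result.stdout).hexdigest()


def metadata_independent_md5(path):
    """MD5 of a JPEG with all metadata stripped; the file on disk is not modified.

    Mirrors the command line: exiftool -all= -o - FILE | md5
    Assumes exiftool is installed and on the PATH.
    """
    return md5_of_command_output(["exiftool", "-all=", "-o", "-", path])
```

Calling metadata_independent_md5("photo.jpg") before and after a metadata edit should then give the same digest, as long as the edit did not touch the image data.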

- Phil

Edit: As of ExifTool 12.58 there is a new ImageDataMD5 tag which returns an MD5 digest of the image data only.

ScannerBoy

Thank you.
That is a good start, as most of the files that are of interest are JPGs, though not all.

I suppose there is the option to convert other formats/files of interest to JPG and transfer the metadata  ;D

StarGeek

In theory, using a perceptual hash (pHash) would be a good solution, though extremely minute differences might return the same pHash.  Unfortunately, I never found a good tool for it.  There are pHash libraries, but I never found a command line program that you could use the same way as md5.

It's been a while since I've checked, though.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

ScannerBoy

I had never come across that hash; thank you.
As the job of adding/modifying the metadata would mean an external app in any case, libraries implementing some of the hash functionality would be quite acceptable. Since Gramps is written in Python and I have tried to work in Python for my editing app, Python libraries would, of course, be preferred  ;)

Looks like this will take a fair bit more thought and work  :-\
For my personal work, the main file formats would be JPGs, PNGs, TIFFs and PDFs.

My current thoughts are that for genealogy work, using specially prepared/preprocessed media files of 'reasonable size' would be acceptable.
To get files of 'reasonable size', some of the original files included in the package, such as TIFFs and some of the more recent (and larger) image file formats, would have to be preprocessed in any case to reduce the file size.
With this in mind, it ought to be possible to convert them to a common file format (likely JPG) and at the same time 'standardize' the metadata, both with respect to where the metadata is located within the media file and with respect to a minimum set of tags.

If one could add some sort of 'history', that would be a big bonus.
No doubt, there is a need to keep at least a record and reference to some of the original and potentially larger files, even if they are/might be archived separately.

StarGeek

Quote from: ScannerBoy on March 08, 2023, 01:15:57 PM
Python libraries would, of course, be preferred  ;)

When I last looked, I did see some Python libraries for phashes.  But I don't know Python and couldn't figure out a simple way to create a script to just output a phash.

And I'm not sure how well files such as RAW or 16 bit TIFFs are handled.  As an experiment, somewhere here I created a config file to use a Perl phash library, but it isn't easy to set up and it dies completely and badly with file types that have greater than 8 bit color.

ScannerBoy

After looking at some of the description of the phash, I don't think it is what I am looking for.
In changing the metadata, I am (hopefully) not affecting the image data at all. Any changes to the image would be unintentional.

In fact, I am hoping that only the metadata would be modified. The solution Phil mentioned would do that, just pulling out the image, without any metadata at all.
Presumably, the same idea ought to work for the image types of interest to me, though I have not explored this avenue as yet.
Of course, this would require some 'adjustments' to Gramps and the media verification utility as well.
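
For PNG, one way the same idea might be tried without any external tool is to hash only the chunks that define the pixels and skip the text, EXIF and time chunks. This is a sketch under assumptions, not what ExifTool does internally: the function name and the choice of chunks (IHDR, PLTE, IDAT) are my own, and chunks such as tRNS or gAMA, which also affect rendering, are ignored for brevity.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Chunks that define the pixels; text/EXIF/time chunks are deliberately left out.
IMAGE_CHUNKS = {b"IHDR", b"PLTE", b"IDAT"}


def png_image_md5(data):
    """MD5 over the bodies of the image-defining chunks of a PNG given as bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    digest = hashlib.md5()
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype in IMAGE_CHUNKS:
            digest.update(body)
        pos += 12 + length
        if ctype == b"IEND":
            break
    return digest.hexdigest()
```

The digest stays the same when metadata chunks are added or removed, but it will change if a tool recompresses the image data while rewriting the file, so it would need testing against whatever editor is actually used.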

StarGeek

Quote from: ScannerBoy on March 09, 2023, 12:38:39 PM
After looking at some of the description of the phash, I don't think it is what I am looking for.
In changing the metadata, I am (hopefully) not affecting the image data at all.

Changing the metadata will not change the image, unless you are editing color based tags such as the ICC_Profile.  See FAQ #13.

But creating a phash doesn't edit the image data either.  Two identical images will have the same phash regardless of metadata.  And two similar files will have phashes that will be only slightly different, unlike an MD5 which will be completely different for each file.

This is how image duplicate checkers try to find duplicates.
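
To illustrate the difference between the two kinds of hash, here is a toy sketch in plain Python. It uses an average hash, which is a much simpler relative of pHash and not the pHash algorithm itself, on a tiny hand-made grayscale grid; real tools first shrink the decoded image to a small fixed size. All names and pixel values are made up for the example.

```python
import hashlib


def average_hash(pixels):
    """Average hash of a grayscale image given as rows of 0-255 ints.

    Each bit is 1 where the pixel is brighter than the image mean.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def pixel_md5(pixels):
    """MD5 of the raw pixel values, for comparison."""
    return hashlib.md5(bytes(p for row in pixels for p in row)).hexdigest()


original = [[10, 200, 10, 200],
            [200, 10, 200, 10],
            [10, 200, 10, 200],
            [200, 10, 200, 10]]

# Same picture with one pixel very slightly brightened.
tweaked = [row[:] for row in original]
tweaked[0][0] = 12

# Same picture with one pixel changed from dark to bright.
changed = [row[:] for row in original]
changed[0][0] = 200

for name, img in (("tweaked", tweaked), ("changed", changed)):
    distance = hamming_distance(average_hash(original), average_hash(img))
    print(name, "hash distance:", distance,
          "md5 equal:", pixel_md5(original) == pixel_md5(img))
```

The hash distance stays small for small changes to the picture, while the MD5 differs as soon as a single pixel value changes.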

ScannerBoy

That is my understanding as well.
However, how this is handled within Gramps is not really under my control. I am trying to work with Gramps, which stores the MD5 of the image when it is first submitted.

The media verify utility then looks for images which have changed, have been deleted/lost, or have been moved.

It defines a 'missing' file as one whose MD5 checksum does not match, even if the file name & path remain the same.
While the utility is 'sort of' under my control because the code is public domain & I can modify it to my heart's content, adding support to the main app is another matter, even though its code is also available.

Handling changes in metadata thus becomes a whole lot more complicated, because it also implies a fair bit more work in setting things right within the main database. Still, I believe it ought to be possible from within an updated/modified verify utility.
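
As a rough sketch of what such a modified verify step might look like: keep a second checksum of the image data alongside the whole-file MD5, and use it to tell a metadata edit apart from a real change. Everything here is hypothetical. MediaRecord and classify are invented names, not part of Gramps, and how the image-only bytes are obtained (for example via ExifTool) is left to the caller.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class MediaRecord:
    """Hypothetical record; per this thread Gramps stores only the whole-file MD5."""
    file_md5: str    # checksum of the complete file
    image_md5: str   # checksum of the image data only


def md5_hex(data):
    """MD5 hex digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def classify(record, file_bytes, image_bytes):
    """Compare a stored record with the current file and its image data.

    file_bytes is None when the file cannot be found at its recorded path.
    """
    if file_bytes is None:
        return "missing"
    if md5_hex(file_bytes) == record.file_md5:
        return "unchanged"
    if md5_hex(image_bytes) == record.image_md5:
        return "metadata changed"
    return "image changed"
```

In the "metadata changed" case the utility could then refresh the stored whole-file MD5 instead of reporting the file as missing.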

My reason for asking about this here is/was because the members of this forum are much more knowledgeable on metadata than I am and also because I am using Exiftool to modify the metadata as part of my utility.  :)