compute image digest

Started by babar, March 21, 2013, 11:19:52 AM


babar

Would it be possible for ExifTool to compute an image digest (MD5, for example) and
- store it
- compare it with the stored value and log any differences as warnings?

Storage is already possible with DNG files; we could use that XMP field on other image types.
-> I find it very hard today to check whether my files are corrupted or not:

Corruption of metadata -> ExifTool can check that in batch and log all the errors.
Corruption of actual image data -> no one does it. MD5 checksum tools are useless here because they report a changed sum as soon as the image metadata is edited, which may happen frequently (adding a location, adding a keyword, changing the caption or title...), and none of them stores the sum in the file, which makes it complicated to deal with.

-> So I sometimes find myself facing corrupted images that have been backed up as such several times, with no good backup left from which to retrieve the image.
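
A minimal sketch of the store-and-compare idea described above, assuming the image data has already been separated from the metadata and a digest was saved earlier. The function name `check_image_data` and the sample payload are hypothetical; nothing here is an ExifTool feature.

```python
import hashlib


def check_image_data(image_bytes: bytes, stored_md5: str) -> list[str]:
    """Return a list of warnings; the list is empty when the digest matches."""
    actual = hashlib.md5(image_bytes).hexdigest()
    if actual != stored_md5.lower():
        return [
            f"Warning: image data digest mismatch "
            f"(stored {stored_md5}, computed {actual})"
        ]
    return []


# Stand-in for image data with the metadata already removed (assumption).
payload = b"example image data"
stored = hashlib.md5(payload).hexdigest()

print(check_image_data(payload, stored))           # unchanged data: no warnings
print(check_image_data(payload + b"\x00", stored)) # altered data: one warning
```

Because the digest covers only the image data, editing a keyword or title would not trigger a warning, while a flipped byte in the image would.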

I am sure it would serve many people

It would really be a great tool

thanks
regards

Phil Harvey

To get an MD5 of the image only, you can do this:

exiftool -all= -o - image.jpg | md5

- Phil

Edit: ExifTool 12.58 has a new ImageDataMD5 tag which may be used.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

et2511299562

> To get an MD5 of the image only, you can do this:
> exiftool -all= -o - image.jpg | md5

I'm trying to compute image digests as well, but for RAW files.

Is there an equivalent command?

Thanks

Phil Harvey

No, this won't work for RAW files.  I can't think of an easy way to do this.

- Phil

Edit: As of ExifTool 12.58 there is a new Extra ImageDataMD5 tag which returns an MD5 digest of the image data only for JPEG and TIFF-based files.

et2511299562

Would it be possible to force ExifTool to treat IFD0, ExifIFD, SubIFD, etc. as deletable groups/tags by using a custom config file, in a manner similar to the one outlined in this previous post?

Quote from: Phil Harvey on July 09, 2013, 07:20:11 AM
If I add the ability to strip out this one tag, then I need to add the ability to add it back again.  As you point out, the Canon serial number is inconsistent.  It would be wrong to allow the serial number to be added with cameras that don't write it in the first place, but preventing this would require extra logic that is specific to each model, which is a maintenance nightmare for me.

But luckily, ExifTool provides a way for you to do whatever you want, regardless of whether I want to add this ability to the production ExifTool or not. The following config file will give you the ability to delete the Canon SerialNumber:

%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Canon::Main' => {
        0x0c => {
            Name => 'SerialNumber',
            Writable => 'int32u',
            Permanent => 0, # allow this tag to be deleted
        },
    },
);
1;  #end


See the config file documentation for more information about the ExifTool config file.

- Phil

This would just be for the intermediate step of calculating a unique image digest.

Thanks

Phil Harvey

Quote from: et2511299562 on April 26, 2021, 09:42:24 PM
Would it be possible to force ExifTool to treat IFD0, ExifIFD, SubIFD, etc. as deletable groups/tags

If you delete IFD0 from most raw files you get a file with zero length (i.e. everything is gone).

The question is: What do you want to hash?  If you just want to make sure the raw image isn't modified, you could hash the raw data alone.  Doing this might be possible, but it wouldn't be easy.

- Phil

et2511299562

Quote from: Phil Harvey on April 27, 2021, 07:03:14 AM
The question is: What do you want to hash?  If you just want to make sure the raw image isn't modified, you could hash the raw data alone.  Doing this might be possible, but it wouldn't be easy.

Yes, I just want to hash the raw data alone.

I am trying to consolidate a large number of RAW image files (over 300,000) from various sources and backups made at different stages of editing.

I would like to find the duplicate files that have the exact same raw image data, even though the EXIF data might have been modified to add GPS coordinates, copyright information, timezone, star rating, etc.

Hashing the raw image data seems to me to be the correct way to do this, so I'd like to figure out a solution if one exists, unless you can think of a better way.

I did want to mention that over the years, I have seen that other people have made similar requests here. I think being able to deterministically calculate a unique hash based on the raw image data would be a useful feature, and would obviously be different than assigning a random unique image identifier.
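
As a rough sketch of the duplicate search described above, assuming a digest of the raw image data has already been computed for each file by some means: group the files by digest and report every group with more than one member. The function name `group_by_digest`, the file names, and the digest strings are made up for illustration.

```python
from collections import defaultdict


def group_by_digest(digests: dict[str, str]) -> list[list[str]]:
    """Map of file name -> digest in, lists of files sharing a digest out."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, digest in digests.items():
        groups[digest.lower()].append(name)
    # Only groups with two or more files are duplicates.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical digests of the raw image data only (not of the whole file).
example = {
    "2013/a.arw": "aaaa1111",
    "backup/a.arw": "aaaa1111",  # same image data, different metadata
    "2013/b.arw": "bbbb2222",
}
print(group_by_digest(example))
# [['2013/a.arw', 'backup/a.arw']]
```

Keeping only the digest strings in memory, rather than the image data, should scale comfortably to a few hundred thousand files.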

StarGeek

Quote from: et2511299562 on April 27, 2021, 12:23:47 PM
I am trying to consolidate a large number of RAW image files (over 300,000) from various sources and backups made at different stages of editing.

I would like to find the duplicate files that have the exact same raw image data, even though the EXIF data might have been modified to add GPS coordinates, copyright information, timezone, star rating, etc.

Are you using a Digital Asset Manager (DAM)?  Most DAMs have duplicate image finding ability.  And since the image data on a RAW file cannot be edited, any matches should be duplicates.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

et2511299562

Quote from: StarGeek on April 27, 2021, 03:24:37 PM
Are you using a Digital Asset Manager (DAM)?  Most DAMs have duplicate image finding ability.  And since the image data on a RAW file cannot be edited, any matches should be duplicates.

I'm not currently using a digital asset manager.

Right now, finding the duplicate images is a short-term goal.

Ultimately, I would like to calculate a meaningful unique and deterministic identifier for each image so I can write these identifiers as ImageUniqueID tags, and also physically organize my files (and derivatives) using a content-addressable storage scheme based on these unique identifiers.
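
One possible content-addressable layout, sketched under assumptions: the digest is a hex string, and files are fanned out into two directory levels taken from its first four characters. The function name `cas_path` and the layout itself are illustrative choices, not something specified in this thread.

```python
from pathlib import PurePosixPath


def cas_path(digest_hex: str, extension: str) -> PurePosixPath:
    """Build a relative storage path such as d4/1d/<digest><extension>."""
    digest_hex = digest_hex.lower()
    if len(digest_hex) < 4:
        raise ValueError("digest too short for a two-level layout")
    int(digest_hex, 16)  # raises ValueError if the string is not hexadecimal
    return PurePosixPath(digest_hex[:2], digest_hex[2:4], digest_hex + extension)


print(cas_path("D41D8CD98F00B204E9800998ECF8427E", ".arw"))
# d4/1d/d41d8cd98f00b204e9800998ecf8427e.arw
```

Derivatives of the same image could sit next to the original under the same digest with a different extension.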

Regardless of the use case, I still think that being able to deterministically calculate a unique hash based on the raw image data would be a useful feature.

Thanks

priort

I was looking for a similar thing and came across this: https://exiftool.org/forum/index.php?topic=6659.msg56358#msg56358 I am not sure whether it would work with raw files.

priort

I think Glen Butcher is making some progress on a possible solution. It's not using ExifTool, but it's open source: https://discuss.pixls.us/t/dumping-unmodified-raw-image-data-from-raw-files/24791/21

et2511299562

Quote from: Phil Harvey on April 27, 2021, 07:03:14 AM
The question is: What do you want to hash?  If you just want to make sure the raw image isn't modified, you could hash the raw data alone.  Doing this might be possible, but it wouldn't be easy.

Can I use exiftool to determine the location in the file where the raw image data starts and where it ends?

Maybe that would be a different way to do this. Ultimately, I just want to dump the unmodified file contents between the start and the end of the raw image data, from point A to point B.

For Sony ARW files, I can use StripOffsets for the starting location and StripByteCounts for the data size. But this won't work for all raw file formats.

StarGeek

Just a thought: for RAW files, hash the embedded preview image.  Editing the metadata on the RAW won't change the preview image.
exiftool -b -PreviewImage file.arw | md5

HarGeel

For a long time I wanted to try something like this. Thanks for this easy line of code:

Quote from: Phil Harvey on March 21, 2013, 11:47:35 AM
To get an MD5 of the image only, you can do this:

exiftool -all= -o - image.jpg | md5

- Phil

Now it looks almost too easy. However, when I tested it on some images, I found that an Adobe tag is still left behind. No big drama, as I can take care of this with:
exiftool -all= -adobe:all= -o - image.jpg | md5

Is that behaviour intended?

Thanks for this wonderful program!

Phil Harvey

Yes, this is intended.  From the application documentation:

Note that [...] the JPEG APP14 "Adobe" group is not removed by default with -All= because it may affect the appearance of the image.

- Phil