Feature request: ImageDataMD5 tag - extra related tags for start and end / size

Started by sebutzu, April 30, 2023, 05:47:55 PM


sebutzu

It would be really useful if you could add two more tags related to the ImageDataMD5 tag: ImageDataMD5Start (a byte offset?) and ImageDataMD5End (or a size in bytes), describing the data used for the MD5 computation. Then, if two files have the same ImageDataMD5, we could compare the bytes from the start position to the end position to confirm they are identical, without having to duplicate the "magic" that is inside exiftool itself. This would help guard against MD5 collisions.

Thanks.

Phil Harvey

The image is broken into many blocks in some formats, so this is more complicated than you are thinking, and it would be some work to implement.  Also, I don't think that anyone else would ever benefit from this.

- Phil

sebutzu

How about a find duplicates option then directly in exiftool, that would benefit a lot of people?
Something like if you pass multiple files (folders) to it, to be able to extract actual duplicates by content (the one you do MD5 on)? Would that be possible? I am fighting a lot with the issue of finding duplicate files after I add tags to some versions and not to others.

StarGeek

Exiftool isn't set up to read multiple files and compare the data between them.

I would suggest using a tool more suited for finding duplicate images.  Both DupeGuru and Czkawka are good at finding duplicate images.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

olegos

Quote from: sebutzu on May 01, 2023, 02:46:10 PM
How about a find duplicates option then directly in exiftool, that would benefit a lot of people?
Something like if you pass multiple files (folders) to it, to be able to extract actual duplicates by content (the one you do MD5 on)? Would that be possible? I am fighting a lot with the issue of finding duplicate files after I add tags to some versions and not to others.

Here's a way to do this in bash (Linux, macOS, or WSL on Windows):

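declare -A files
exiftool -T -r -p '$ImageDataMD5 $Directory/$FileName' . | \
  while read -r sum file; do
    # -T implies -f, so a file with no ImageDataMD5 is reported as "-"; skip it
    [[ "$sum" == "-" ]] && continue
    if [[ -z "${files[$sum]}" ]]; then files[$sum]="$file"   # first file with this hash
    else echo "$sum ${files[$sum]} $file"; fi               # repeat: print the pair
  done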

This prints each file in the current directory and its subdirectories whose ImageDataMD5 matches an earlier file, paired with the first file seen with that hash and the hash itself.

Here's the same in PowerShell (I'm not too proficient with PowerShell, so I asked Bard (Google's AI) to convert the above from bash to PowerShell, and made a couple of small fixes. The comments too were written by Bard.)

# Create an empty associative array
$files = @{}

# Get the output of the exiftool command
exiftool.exe -T -r -p '$ImageDataMD5 $Directory/$FileName' . | ForEach-Object {
  # Get the MD5 hash and the file name
  ($md5, $file) = $_ -split " ", 2

  # Check if the MD5 hash already exists in the array
  if ($files.ContainsKey($md5)) {
    # If it does, print both files
    Write-Host $md5 $files[$md5] $file
  } else {
    # If it doesn't, set the value to the file name
    $files[$md5] = $file
  }
}
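
As a variation on the bash version above, here is a minimal sketch that collects every file sharing a hash onto one line instead of printing pairs. The function name group_by_hash is made up for illustration; it only assumes its input is lines of the form "HASH PATH", which is what the -p format string above produces. It needs bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# group_by_hash (hypothetical helper): read "HASH PATH" lines on stdin and
# print one line per hash that occurs more than once: the hash followed by
# every path that had it, tab-separated.
group_by_hash() {
  local -A groups=() counts=()
  local sum file tab=$'\t'
  while read -r sum file; do
    # Skip blank lines and the "-" placeholder printed for a missing tag.
    [[ -z "$sum" || "$sum" == "-" ]] && continue
    # Append the path, adding a tab separator only if the group is non-empty.
    groups[$sum]+="${groups[$sum]:+$tab}$file"
    counts[$sum]=$(( ${counts[$sum]:-0} + 1 ))
  done
  for sum in "${!groups[@]}"; do
    if [[ ${counts[$sum]} -gt 1 ]]; then
      printf '%s\t%s\n' "$sum" "${groups[$sum]}"
    fi
  done
  return 0
}
```

It could be fed from the same command, for example: exiftool -T -r -p '$ImageDataMD5 $Directory/$FileName' . | group_by_hash. The order in which groups are printed is not defined, so pipe through sort if a stable order matters.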


sebutzu

If I do this, I am not 100% sure the images are duplicates, only that they have the same MD5. That is why, when the MD5 matches, I would want to check whether the actual "content" is "exactly" the same. But I do not know which bytes go into that MD5 and which I should ignore, so I cannot do this processing myself. I was just trying to avoid having to understand "deeply" the formats involved in order to check accurately whether 2 files (images or movies) are the same (just with different metadata).

StarGeek

If the files have the same ImageDataMD5 number, then the image data will be exactly the same.  The metadata may be different.

There are things to watch out for.  For example, if one file includes an ICC_Profile and the other doesn't, they will look different in any program that is aware of the profile, even though a tool such as ImageMagick's compare function would show no difference between them.  See this post for an example.

Additionally, the JPEG algorithm allows the exact same image data to be encoded in different ways without any loss of data.  The biggest example of this would be a regular JPEG vs. a progressive JPEG.  The ImageDataMD5 would be different, but the result would display exactly the same.
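
For a pair of files that already share an ImageDataMD5, one cheap follow-up check is a whole-file comparison: it separates byte-for-byte copies from files that differ somewhere outside the hashed image data, which per the above would be the metadata. This is only a sketch; the function name classify_pair and its output labels are made up for illustration, and it assumes the caller has already matched the pair by hash. It does not re-verify the image data itself, so it cannot rule out a hash collision for files that differ.

```shell
#!/usr/bin/env bash
# classify_pair A B (hypothetical helper): for two files ASSUMED to have the
# same ImageDataMD5, report whether the whole files are byte-identical or
# differ somewhere (by assumption, outside the hashed image data).
classify_pair() {
  local a=$1 b=$2
  cmp -s -- "$a" "$b"
  case $? in
    0) printf 'identical\t%s\t%s\n' "$a" "$b" ;;          # exact copies
    1) printf 'metadata-differs\t%s\t%s\n' "$a" "$b" ;;   # same hash, different bytes
    *) printf 'cannot compare: %s %s\n' "$a" "$b" >&2     # missing or unreadable file
       return 2 ;;
  esac
}
```

Pairs reported as identical are safe to treat as true duplicates; pairs reported as metadata-differs are the tagged and untagged versions of the same image that the original question was about.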