Incorporating checksum verification into ExifTool

Started by ww408, February 27, 2025, 03:40:48 PM


ww408

The fact that a new copy of a file is required just to add tags to it makes me nervous, so I'm wondering if one can enhance the copy procedure with checksum verification. Grok gave me an elaborate answer about this, but I wanted to run it by real people before considering it.


You're absolutely right that verifying a file copy with a checksum becomes tricky when tools like ExifTool modify metadata, potentially altering the file's structure and size, even if the core content—like the image data—remains unchanged. Standard checksums (e.g., MD5, SHA-1, or SHA-256) work by hashing the entire file, byte-for-byte, so any change in metadata, unused data blocks, or offsets will produce a different hash, even if the "important" parts (like the image itself) are identical. Let's break this down and explore your options.

From what you've shared and what's known about ExifTool, rewriting a file can shrink or expand it due to things like removing unused data blocks (common in JPEGs or RAW files from cameras), adjusting XMP padding, or reorganizing metadata into a more compact form. The image data itself isn't touched—ExifTool is lossless in that sense—but the file as a whole changes. So, if you're copying files and using ExifTool to edit metadata, or even if your hard drive copy process somehow triggers metadata updates (unlikely, but possible with some filesystem quirks), a traditional checksum won't cut it for verifying that the "essence" of the file (the image) stayed intact.

Here's how you could approach this:

1. Hash Only the Image Data
Since ExifTool doesn't alter the actual image data, you could use a tool or script to extract and hash just that portion of the file, ignoring metadata. For JPEGs, this would mean isolating the compressed image stream (everything after the Start of Image marker, excluding EXIF, XMP, or IPTC segments). Tools like exiv2 or even ExifTool itself can help here:
  • Run exiftool -b -JpgFromRaw file.jpg > extracted.jpg to pull out the embedded JPEG from a RAW file (the JpgFromRaw tag only exists in RAW formats; for a plain JPEG you'd need a different extraction step, depending on the format).
  • Compute a checksum (e.g., sha256sum extracted.jpg) on that extracted portion.
  • Compare the hashes of the extracted image data from both the original and copied/edited files.
This way, metadata changes won't affect the result. The downside? It's not a one-step process, and you'd need to script it for multiple files.
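For a batch of files, the second and third bullets are easy to script. A minimal Python sketch (the file names are placeholders for whatever the extraction step produced):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large images aren't read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder names: compare the payloads extracted from the original and the copy.
# same = sha256_of("original_extracted.jpg") == sha256_of("copy_extracted.jpg")
```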

2. Custom Checksum Script
You're spot-on that a "super intelligent checksum" could solve this. A script could:
  • Parse the file structure (using ExifTool's -htmlDump output or a library like libexif).
  • Identify and exclude metadata sections (EXIF, XMP, IPTC, etc.) and unused data blocks.
  • Hash only the image payload.
This would require some programming (Python with a library like piexif or exifread could work), but it'd give you a metadata-agnostic way to verify integrity. It's more work upfront, but reusable.
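As a concrete illustration of the idea for JPEGs, here is a sketch in pure Python (no external libraries) that walks the segment structure and hashes everything except the APPn and COM segments, which is where EXIF, XMP, IPTC, and comments live. It only handles well-formed baseline files; a production version would need more defensive parsing:

```python
import hashlib

# APPn segments (markers 0xE0-0xEF) carry EXIF/XMP/JFIF/etc.; COM (0xFE) is a comment.
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}

def jpeg_payload_hash(path):
    """SHA-256 over a JPEG's bytes with APPn/COM metadata segments skipped.

    Sketch only: assumes a well-formed baseline JPEG (SOI, marker segments,
    one SOS, entropy-coded data, EOI) with no standalone markers before SOS.
    """
    with open(path, "rb") as f:
        data = f.read()
    assert data[:2] == b"\xff\xd8", "not a JPEG"
    h = hashlib.sha256()
    h.update(data[:2])                     # keep the SOI marker
    i = 2
    while i < len(data):
        assert data[i] == 0xFF, "expected a marker"
        marker = data[i + 1]
        if marker == 0xD9:                 # EOI with no scan data
            h.update(data[i:i + 2])
            break
        length = int.from_bytes(data[i + 2:i + 4], "big")
        if marker not in METADATA_MARKERS:
            h.update(data[i:i + 2 + length])
        i += 2 + length
        if marker == 0xDA:                 # SOS: hash the scan data through EOI
            end = data.rfind(b"\xff\xd9")
            assert end != -1, "missing EOI"
            h.update(data[i:end + 2])
            break
    return h.hexdigest()
```

With this, rewriting or re-padding the metadata changes the whole-file hash but leaves jpeg_payload_hash unchanged, because only the skipped segments differ.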

3. Verify Copy Without Metadata Changes First
If your goal is to ensure the hard drive copy operation worked before ExifTool touches the files, checksum the originals and copies before any metadata editing. Use md5sum or sha256sum on both sets. If they match, the copy was perfect. Then, after ExifTool runs, accept that the checksums will differ but trust that ExifTool's lossless nature preserved the image data. This splits the problem: copy verification is handled traditionally, and ExifTool's edits are a separate step you can validate differently (e.g., visually or via option 1).
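That pre-edit verification step is also easy to script. A Python sketch (function names are illustrative; it reads each file whole, which is fine for photos but should be streamed for very large files):

```python
import hashlib
from pathlib import Path

def sha256_manifest(root):
    """Map each file's relative path under root to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_copy(original_dir, copy_dir):
    """True only if the copy holds byte-identical versions of every original."""
    return sha256_manifest(original_dir) == sha256_manifest(copy_dir)
```

Run verify_copy before any metadata editing; once ExifTool has rewritten the copies, the manifests will legitimately diverge.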

4. Filesystem-Level Verification
If you're worried about the hard drive mishandling the copy, tools like rsync with the --checksum option can verify files during transfer by comparing content, not just timestamps or sizes. This still uses whole-file checksums, so it's best done pre-ExifTool. Post-edit, you'd need one of the above methods.

Why It's Not Impossible
A checksum can verify the file if you define what "verify" means to you—likely that the image data is intact, not the entire file bit-for-bit. The standard approach doesn't work here, but tailoring the process to focus on the unchanged parts (image data) makes it feasible. There's no off-the-shelf "ExifTool-aware" checksum tool, but with a bit of scripting, you can get there.

What's your setup like? Are you copying tons of files, or just a few? And are you more worried about ExifTool's edits or the copy process itself? That'll help narrow down the best practical solution.

Phil Harvey

The ExifTool Extra tags contain some very useful and interesting features.  The ImageDataHash tag may be used to generate an md5, sha256 or sha512 hash of the image data alone.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

StarGeek

And the OriginalImageHash and OriginalImageHashType tags were created to hold the hash and the hash algorithm. See the XMP-et tags. The -api ImageHashType option is used to specify the algorithm.

exiftool "-OriginalImageHash<ImageDataHash" -OriginalImageHashType=SHA512 -api ImageHashType=SHA512 /path/to/files/
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype