Image data corruption when updating a large number of raw files with exiftool

Started by sidneyd, February 25, 2023, 06:46:31 AM


sidneyd

The additional data testing was conducted using an 18MB Nikon D3 NEF file and a 45MB Nikon D810 file. Testing was primarily conducted on the AMD 5900X AM4 system with MSI motherboard, 64GB RAM, and three different NVMe SSDs. To provide additional data points, I also ran these same tests on an Intel i5-1030U with NVMe SSD, an Intel i5-6600 with SATA SSD and an Intel i5-6600 with HDD.

I used various batch files to create a spread of testing scenarios with the same D3, D800 or D810 file copied and renamed into directories with 200, 400, 800, 1200, 1600 and 2200 copies of the same NEF image. Using the same image 6,400 times with different filenames made checking for corruption easier, as I could calculate a CRC from the "non-tag" portion of the file.  A set of 4 batch files then executed either a simple or complex set of tag changes on all files in a given directory using either the standard or alternative exiftool Windows executable. This allowed testing up to 2,200 files processed with one exiftool command.  Additional wrapper batch scripts were then used to run through 30 iterations of each test – which resulted in 192,000 file updates for each of the 4 variants.  A batch script could then be used to check for variations in the non-tag data CRC compared to a control file, with any files showing variations flagged and renamed for additional visual comparison in tools such as Adobe Bridge.

During testing, I found some tag options that seemed to change what I believe should be "non-tag data" in the NEF file.  See later section "Issues getting just the raw data without tags for CRC calculations".

So what did the results show?

On all systems, the memory subsystem was checked and passed without error by running PassMark MemTest86, and in the case of the AMD platform this was run for 12 hours without errors. I verified that the latest firmware was installed on all SSDs, which it was.

When using a Nikon D3 NEF file, with all 4 scenarios run through 60 iterations (meaning about 1.5 million NEF file updates), I was never able to detect any file corruption via CRC or visual changes on any of the test platforms.

However, using virtually identical batch scripts and tests with the 45MB Nikon D810 NEF files, intermittent data corruption occurred. When running through a 30-iteration cycle of the test, 192,000 file updates were performed by each test variant. Errors would only occur in folders with more than 400 images, meaning exiftool was processing 400 or more files via one command, rather than each file being updated by a separate instance of exiftool.

Further testing showed that if the copyright tag was not updated as part of the tag set, then these tests would result in no errors. If the copyright tag was updated as part of the multiple tag update with the 45MB Nikon D810 NEF files, then corruption would typically occur at about 2 per 100,000 updates on Intel platforms (and later on the AMD platform).  Initially on the AMD platform, corruption when setting the copyright tag in the D810 NEF files occurred at a significantly higher rate of more than 20 per 100,000 updates.
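To put these rates in context, here is a minimal sketch of the arithmetic behind the test counts. The folder sizes, the 30 iterations and the rates per 100,000 updates come from the posts above; the function names are illustrative only, and the expected-count calculation assumes a constant, independent error rate across all updates, which the thread does not establish (errors were only seen in the larger folders).

```python
# Folder sizes and iteration count as described in the test setup above.
FOLDER_SIZES = [200, 400, 800, 1200, 1600, 2200]
ITERATIONS = 30


def updates_per_variant(folder_sizes=FOLDER_SIZES, iterations=ITERATIONS):
    """Total file updates for one test variant over all folders and iterations."""
    return sum(folder_sizes) * iterations


def expected_corruptions(updates, rate_per_100k):
    """Expected corrupted files, ASSUMING a constant and independent error rate."""
    return updates * rate_per_100k / 100_000


def observed_rate_per_100k(corruptions, updates):
    """Convert an observed corruption count into a rate per 100,000 updates."""
    return corruptions * 100_000 / updates
```

With these numbers, one pass covers 6,400 files and one variant performs 192,000 updates, so a rate of 2 per 100,000 corresponds to roughly 4 corrupted files per variant, and 20 per 100,000 to roughly 38.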

With the data corruption occurring at a higher rate on the AMD 5900X system, many days were spent running additional tests and different scenarios to see if any contributing factors could be uncovered.  Many different options and configurations were tried; the only two which resulted in any change in corruption rates were as follows:
-    Running the test on a SATA-attached SSD rather than one of the three NVMe SSDs I had previously been using. Interestingly, with this slower storage the rate of corruption was about 1/5 of that in the previous tests.
-    Upgrading the UEFI firmware on the MSI motherboard, which included the AGESA 1.2.0.8 update. The corruption rate then dropped to be in line with the Intel systems, and subsequent use of SATA or NVMe storage did not appear to make any discernible difference. The only information I could find about the AGESA 1.2.0.7 to AGESA 1.2.0.8 update is that it fixed several security items in the UEFI firmware, so I am at a loss to explain why there is now a dramatic change in the corruption occurring when using exiftool with large NEF files on the AMD system – perhaps some rare corner-case timing issue Phil had mentioned?

In Summary:

Testing was conducted on three Intel systems and one AMD 5900X system, with over 10 million NEF file updates being performed. All systems had thorough memory tests conducted and hardware burn-in tests run. For the AMD 5900X system, once the UEFI firmware on the MSI motherboard was updated to the latest version, which included AGESA 1.2.0.8, the NEF corruption rate reduced to be in line with levels seen on the Intel systems.

No corruptions were ever detected for smaller NEF files such as the D3, or even D7000 files.

The file corruption issue only occurs with exiftool when updating large NEF files such as a 45MB D810 NEF, when processing 400 or more files at once and when the copyright tag is included in the group of tags to be updated.  Once the AMD issue was resolved, corruption occurred at a very low rate of about 2 per 100,000 updates on all platforms, AMD and Intel.

The chunk size enhancement has been mentioned as a possible workaround for large files.


Issues getting just the raw data without tags for CRC calculations:

To conduct these tests, I developed some batch commands for exiftool to hopefully output just the raw data without tags so a checksum could be calculated before and after any tag updates.  However, if either of the tags "-*:Software=" or "-Copyright=(c) something" was used (but not -rights= or -CopyrightNotice=), then the CRC calculated for what I assumed should be the non-tagged portion of the file changed.

I used a myriad of different exiftool commands in an attempt to get just the raw component of the NEF file minus any tags.  The following command produced the best results, but still had variations when the copyright tag had been updated on the image.

exiftool -F -EXIF:all= -IPTC:all= -XMP:all= -allDates= -all= -q -q -o - _DSC00000.nef | MD5 -n

I even tried using "-makernotes=" and some other variations, but to no avail. Maybe Phil can comment on why exiftool still seems to output some copyright data.


Enhancement Ideas:

Chunk size - As previously discussed in this thread, it could be useful to consider having a larger chunk size option for exiftool, as this could reduce the possibility of any bit flipping as Phil mentioned.  It may also make exiftool more efficient on newer systems.  One idea is that if the file size is larger than 20MB, exiftool could automatically use a larger chunk size. Not too sure of Phil's plans for implementing this?

Raw only data - if there could be some option made to just output the raw data or generate a CRC for the raw data this would be great.

Lastly, there is always the utopia of a CRC being performed on the data as it is chunked from source to destination.

Phil Harvey

Wow, thanks for all of the very thorough testing.

Very interesting.  So the error rate dropped by a factor of 10x with the motherboard firmware update on the AMD 5900X?  And dropped another 5x when you used a SATA-attached SSD?  It is worrying that it never went to zero.

Here is a version of 12.57 (12.57p) which uses a 1 MB buffer instead of 64 kB.  There is also the alternate Windows version to test (but this still uses a 64 kB buffer).

I won't be happy until we get an error rate of zero.

Unfortunately, doing a CRC would have a huge impact on performance, so this isn't really an option.

I don't know why you need a raw-data-only CRC for testing.  If you are writing the same thing to the same files then the result should be the same.  You can just do a CRC on the entire file to see if there was any corruption.  (You are using a set of identical source files, correct?)
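The whole-file check described above could be sketched roughly as follows. This is an illustration only, not part of ExifTool: it uses MD5 from Python's standard library in place of a CRC, the function names are made up for the example, and it assumes every copy started from the same source file and received exactly the same tag values as the control file (no per-file names or timestamps written into the metadata).

```python
import hashlib
from pathlib import Path


def file_md5(path, block_size=1024 * 1024):
    """Return the MD5 hex digest of the whole file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(control_file, directory, pattern="*.nef"):
    """List files in `directory` whose whole-file MD5 differs from the control.

    The control file is assumed to have had the same tag update applied
    as every file being checked.
    """
    expected = file_md5(control_file)
    return sorted(
        path for path in Path(directory).glob(pattern)
        if file_md5(path) != expected
    )
```

Calling find_mismatches with a known-good updated file and a test folder returns the paths that differ, which could then be set aside for visual comparison. Note that the glob pattern is case-sensitive on some platforms, so files ending in .NEF may need a separate pattern.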

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Phil Harvey

Quote from: Phil Harvey on March 06, 2023, 01:10:16 PM
Unfortunately, doing a CRC would have a huge impact on performance, so this isn't really an option.

...also, it isn't clear to me that this would detect the error.  We don't know if the bit error is in the disk read, disk cache, i/o, ram cache, ram, or disk write phase, and doing a CRC would test only a subset of these.  (Of course, if it is software related then the problem occurs while data is in ram, but I still don't see how this could be the case -- software writes whole bytes, it would be a very unique failure to toggle individual bits.)

- Phil

sidneyd

Thanks, Phil, I had previously tested with the alt version and that had typically performed the same as the standard version.

I have now tested with the larger buffer version (12.57p) and after a quick 24,000 test run, there were 7 corruptions – so somehow things actually seemed worse.  ::)

I wondered if there was some issue with the self-extracting Perl environment, so I deleted the cache-exiftool-12.57p folder in the temporary par-7369646e6579 directory. Then I reran this short test.  This time the results of the short test were good, with zero errors. I then deleted the cache folder another time & repeated the tests, again all OK.  :o

I quickly did the same tests on one of the other Intel machines, deleting the cache after each run. One out of 10 times, a similar bunch of errors showed up on that machine. Again, when rerunning the test after the cache was deleted, there were no errors. ???

This of course makes one wonder why deleting the cache and then rerunning the test should cause a different result, and why, when things do go wrong, they seem to end up with weird bit flips happening. I am also wondering if somehow the block copy function could be tested independently, as my SSDs are getting some serious wear.
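One way to exercise a block copy on its own, outside exiftool, might look like the sketch below. It is not ExifTool's copy routine: it is a plain Python loop that copies a file in fixed-size chunks (64 kB by default, matching the buffer size mentioned earlier in the thread) and compares digests of the source, the bytes read during the copy, and the bytes read back from the copy. All names are hypothetical. Pointing the scratch path at a RAM disk would spare the SSDs, at the cost of no longer exercising the disk write path, and a read-back may be served from the OS cache rather than from the device.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def chunked_copy(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in fixed-size chunks; return the MD5 of the bytes read."""
    digest = hashlib.md5()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def stress_copy(src, scratch_path, rounds=1000, chunk_size=64 * 1024):
    """Copy src to scratch_path repeatedly; count rounds where any digest differs."""
    reference = file_md5(src)
    failures = 0
    for _ in range(rounds):
        read_digest = chunked_copy(src, scratch_path, chunk_size)
        written_digest = file_md5(scratch_path)
        if read_digest != reference or written_digest != reference:
            failures += 1
    return failures
```

If a loop like this never fails over a comparable number of copies of a 45MB file, that would point away from the raw read/write path; if it does fail, the problem is reproducible without exiftool at all.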

As to the second part – regarding a CRC/MD5 of the raw portion:

Given the corruption issues which have been experienced, and with a library of over 400,000 NEF files, you can probably understand why there is some degree of anxiety. I wanted some way to double-check image integrity, so I could be sure that no silent corruption is creeping into the image library when tags get changed.

I have something close which can export the "non-tag" portion, but there still seems to be some embedded data which gets changed by exiftool and cannot be excluded from the output. The closest I have come to getting non-tag data from the NEF is using:
%EXIFTOOL% -F "-EXIF:all=" "-IPTC:all=" "-XMP:all=" "-allDates=" "-all=" -q -q -o - filename
This seems to work except when -Copyright= or some embedded date (perhaps the GPS date) has been changed.

Based on the above exiftool command, I have built a set of batch files to store an MD5 checksum in a CSV file for each folder. Other batch files can then perform validation against this data and inform me if some alteration to the non-tag portion of a file has occurred.  Doing this has already shown up just over 20 other corruptions in my raw image archive, which would not otherwise have been detected.

Phil Harvey

Wow.  Darn.  OK.  So changing things that I thought may make a difference (exiftool version and buffer size) didn't help at all.  But changing something that should be unrelated (clearing the ExifTool temporary files) does have an effect.  The alternate version has a different method for unpacking the temporary files, but it also shows problems.

So we're no closer to finding the source of the bit errors.

Even if you come up with a method to verify the raw data itself is OK, I don't know if we can disregard the possibility of a bit error in the rest of the file, which could also make it unreadable. :(

I can't run any tests myself to see if I could reproduce this problem because I don't have enough free disk space on my system here, but I would be very surprised if I was able to replicate this behaviour on MacOS.

I don't know what to suggest at this point.

- Phil

sidneyd

Yes I am left with a bit of proverbial head scratching also...

Wondering if it would be worth considering grabbing perl and trying a setup that way?
If so where is a good guide to do that?

Phil Harvey

Doing this with ActivePerl is easy.  Just install ActivePerl then you can run the pure Perl version of ExifTool from any directory (just unpack and run -- no need to install).  (But you may need to run exiftool by typing "perl exiftool" instead of "exiftool".)

- Phil

Phil Harvey

I see how it could be useful to verify the image data after writing, so I'm going to look into adding an ImageDataMD5 tag which would represent the MD5 of the image data only.  This tag wouldn't be generated unless specifically requested.

- Phil

sidneyd

Thanks Phil, that would be a great benefit.

When you do, could you look at supporting this - both writing the value into a tag and generating the MD5 as an output would be good ideas.  I have added three batch scripts to the previous share so you can see examples of how my current method of running an external MD5 mechanism was performed.

I will be happy to do any testing for you on this.

Regards
Sidney

Phil Harvey

Hi Sidney,

Quote from: sidneyd on March 12, 2023, 08:35:23 AM
could you look at supporting this - both writing the value into a tag and generating the MD5 as an output would be good ideas.

These are built-in features for all tags, including an ImageDataMD5 tag if I can add it.

- Phil

Phil Harvey

I've just released version 12.58 with the new ImageDataMD5 feature.  To write this to a tag, you would do this:

exiftool "-SOMETAG<imagedatamd5" FILE

- Phil

Note that for some JPEG images the ImageDataMD5 value will change in the next ExifTool release (version 12.59).  In this version I will also add JPEG RST segments to the MD5 calculation.

Also note:  I ran a number of tests trying to reproduce this problem on my MacOS system without success (see this post).

sidneyd

Phil,

Thanks for the new functionality and the additional guidance on using this new MD5 feature. I have run a quick test to generate a CSV file containing MD5 image data checksums and it is looking good. I will run this against the much larger collection over the next week.

I have included some examples of the commands I am using to do these tests, which may be a useful reference for others:

To generate a CSV file of checksums:
exiftool -p "$filename, $imagedatamd5" -ext nef . > checksum.csv
To write an MD5 image checksum to xmp:identifier:
exiftool -overwrite_original -preserve -F "-FileModifyDate=Now" "-xmp:identifier<imagedatamd5" -ext nef .
To simply display the filename, imagedatamd5 and xmp:identifier side by side for comparison:
exiftool -p "$filename, $imagedatamd5, $xmp:identifier" -ext nef .
To list files in which xmp:identifier is missing or empty:
exiftool -p "$filename" -if "not defined $xmp:identifier or $xmp:identifier eq ''" -ext nef .
To write a value to xmp:identifier if it does not exist, with the MD5 image data:
exiftool -if "not defined $xmp:identifier or $xmp:identifier eq ''" -overwrite_original -preserve -F "-FileModifyDate=Now" "-xmp:identifier<imagedatamd5" -ext nef .
To display all MD5 validated images:
exiftool -q -p "MD5 OK for: $filename" -if "$xmp:identifier eq $imagedatamd5" -ext nef .
To display all images in which the MD5 image data checksum is different to xmp:identifier:
exiftool -p "Bad MD5 in: $filename" -if "$xmp:identifier ne $imagedatamd5" -ext nef .
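As a complement to the commands above, two checksum CSV files produced at different times (for example before and after a batch of tag updates) could be compared with a short script. The sketch below is a hypothetical helper, not an ExifTool feature. It assumes each line has the form "filename, md5" as written by the -p "$filename, $imagedatamd5" command above, that the files are plain ASCII or UTF-8 text, and that filenames are unique within each CSV, since $filename carries no directory.

```python
def load_checksums(csv_path):
    """Parse lines of the form 'filename, md5' into a {filename: md5} dict."""
    checksums = {}
    with open(csv_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split on the LAST comma so filenames containing commas still parse.
            name, _, md5 = line.rpartition(",")
            if not name:
                continue  # skip lines that do not contain a comma
            checksums[name.strip()] = md5.strip().lower()
    return checksums


def compare_checksums(baseline_csv, current_csv):
    """Return (changed, missing) filenames relative to the baseline CSV.

    changed: present in both files but with a different checksum.
    missing: present in the baseline but absent from the current file.
    """
    baseline = load_checksums(baseline_csv)
    current = load_checksums(current_csv)
    changed = sorted(
        name for name in baseline
        if name in current and baseline[name] != current[name]
    )
    missing = sorted(name for name in baseline if name not in current)
    return changed, missing
```

Any filename reported as changed has a different image data checksum than it had when the baseline was recorded and is worth inspecting; a filename reported as missing may simply have been moved or renamed, or may no longer yield an ImageDataMD5 value.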

sidneyd

It seems that $imagedatamd5 is only working for Nikon raw images and older Canon CR2 files at the moment, if I am correct?

- $imagedatamd5 works for all Nikon .nef or .nrw raw images which I could find, even the latest Z9.
- $imagedatamd5 works for all Canon .cr2 raw images.
- It does not seem to work for Canon .cr3
- It does not seem to work for Minolta .mrw
- It does not seem to work for Panasonic .rw2

In cases where it does not work, issuing a command such as the following results in a warning:
exiftool -p "$Filename,$ImageDataMD5" IMG_0344.CR3
Warning: [Minor] Tag 'ImageDataMD5' not defined - ./IMG_0344.CR3

Phil Harvey

Yes, some file types are not yet supported -- I'll be adding more in the next release.  Currently only JPG and TIFF-based formats (except Panasonic raw) are supported.

- Phil

sidneyd

Thanks Phil for the confirmation, just wanted to make sure what was currently supported is working  ;D