Image data corruption when updating a large number of raw files with exiftool

Started by sidneyd, February 25, 2023, 06:46:31 AM


sidneyd

While I know that exiftool is not supposed to touch the image data itself, when updating a large number of files I suspect there could be some memory leak or other issue which leads to random corruption of the image data itself.

I have been using exiftool (12.57) via a third-party tool GeoSetter and having some corruption occur in the Nikon raw images (.NEF files) for several months (including with prior versions) whenever there is a large number of files updated.

To check that it was not the third-party tool, I spent a few days performing tests with exiftool on a Windows 10 x64 system with a fast AMD 5900X CPU and 64GB RAM. I created a few folders with a larger number of files, 1361 Nikon raw images (.NEF), then created some batch scripts and examined the NEF files in Adobe Photoshop, Adobe Bridge and some other tools to check for any visual corruption of the image data both before and after running exiftool.

1. When running exiftool as a single command (as per example 1 batch file) against all 1361 NEF files, some files will be intermittently corrupted - that is the image data itself gets damaged. 

2. If I reduce the number of images in the directory to a smaller number, let's say 200, then repeat the task, then the same command never leads to any image data corruption no matter how many times the activity is repeated.

3. If I use a for loop in the batch script (as per example 2 below), which calls exiftool with a practically identical command but only changes one NEF file at a time, then despite multiple runs of the command with slight variations in tag data, there was never any data corruption.



Example Trial 1 Batch Script
"C:\Program Files (x86)\Geosetter\tools\exiftool" -v0 -overwrite_original -preserve -F ^
 "-FileModifyDate=Now" "-FileCreateDate<DateTimeOriginal" ^
  "-GPSLatitude=52 00 0.00 N" ^
  "-GPSLongitude=01 01 0.00 W" ^
  "-GPSAltitude=222" ^
  "-GPSDateTime<DateTimeOriginal" ^
  "-CountryCode=GBR" ^
  "-IPTC:Country-PrimaryLocationCode:GBR"^
  "-IPTC:Province-State=England" ^
  "-IPTC:Sub-location=RightHere" ^
  "-Location=LOCATION" ^
  "-City=City Name" ^
  "-Title=The Title" ^
  "-ObjectName=Object Name" ^
  "-Headline=The Headline of Someone" ^
  "-Copyright=(c) Copyright 2222 ABC, all rights reserved" ^
  -ext nef .

Example Trial 2 Batch Script
setlocal enabledelayedexpansion
cd /d %~dp0
FOR %%A IN (*.nef) DO (
  Echo Processing %%A
  "C:\Program Files (x86)\Geosetter\tools\exiftool" -v0 -overwrite_original -preserve -F ^
  "-FileModifyDate=Now" "-FileCreateDate<DateTimeOriginal" ^
  "-GPSLatitude=52 00 0.00 N" ^
  "-GPSLongitude=01 01 0.00 W" ^
  "-GPSAltitude=222" ^
  "-GPSDateTime<DateTimeOriginal" ^
  "-CountryCode=GBR" ^
  "-IPTC:Country-PrimaryLocationCode:GBR"^
  "-IPTC:Province-State=England" ^
  "-IPTC:Sub-location=RightHere" ^
  "-Location=LOCATION" ^
  "-City=City Name" ^
  "-Title=The Title" ^
  "-ObjectName=Object Name" ^
  "-Headline=The Headline of Someone" ^
  "-Copyright=(c) Copyright 2222 ABC, all rights reserved" ^
    %%A
)

Phil Harvey

Can you repeat this test with your antivirus software disabled and on another disk drive?

I suspect either a failing disk drive or interference by antivirus software.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

sidneyd

I had already tried on different SSDs, so that is not the cause. ;D

I will check antivirus, though do not understand why antivirus would affect it when it operates without an issue in one by one mode vs processing large number of files.

Phil Harvey

Quote from: sidneyd on February 25, 2023, 08:39:09 AM
do not understand why antivirus would affect it when it operates without an issue in one by one mode vs processing large number of files.

Ditto for any exiftool problem.  A memory leak in ExifTool would cause an out-of-memory crash, not the symptoms you are seeing.

If disabling the AV doesn't work, send me one of the corrupted files (and the original too if you can), and I'll take a close look at it to see if I can come up with any theories.  My email is philharvey66 at gmail.com

- Phil

StarGeek

How were you able to detect corrupted NEFs?  I want to try and replicate your results but individually loading up 1,000+ files is a formidable task.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

sidneyd

How I detected the corrupted NEF files was through a rather painful process of visual inspection in a tool such as Adobe Bridge, Thumbs Plus etc. which can decode the file. I wish there had been another method. I looked at a number of tools, but nothing worked better than visual inspection.  The long pull is that you have to wait for the program to scan through over 1000 40MB images before you can browse the folder. Then it is usually very clear to see the damaged files, as often they have marked colour bands which start part way through the image and cover the rest of the image, an artifact that the original did not have.

As to Phil's question - is there some other way to send the files to you Phil as eMail gets upset at sending the over 40MB Nikon D810 NEF files?

For further information, I ran the following tests on different hardware and different SSDs, with AV on or off:
PC Config (all W10x64 21H2)         Test 1 (1361 NEF)   Test 1B (200 NEF)   Test 2 (1361 NEF one by one)
AMD 5900X 64GB RAM SSD D:           Corruption          All OK              All OK
AMD 5900X 64GB RAM SSD D: AV Off    Corruption          All OK              All OK
AMD 5900X 64GB RAM SSD E:           Corruption          All OK              All OK
AMD 5900X 64GB RAM SSD E: AV Off    Corruption          All OK              All OK
I5-10310U 32GB RAM SSD D:           Corruption          All OK              All OK
I5-10310U 32GB RAM SSD D: AV Off    Corruption          All OK              All OK
I5-10310U 16GB RAM SSD D:           Corruption          All OK              All OK
I5-10310U 16GB RAM SSD D: AV Off    Corruption          All OK              All OK
I5-6600 16GB RAM SSD D:             Corruption          All OK              All OK
I5-6600 16GB RAM SSD D: AV Off      Corruption          All OK              All OK
I5-6600 16GB SSD D:                 Corruption          All OK              All OK
I5-6600 16GB SSD D: AV Off          Corruption          All OK              All OK
I5-9400 8GB SSD D:                  Corruption          All OK              All OK
I5-9400 8GB SSD D: AV Off           Corruption          All OK              All OK

For test 1, this was repeated multiple times on each system configuration specified using different permutations of SSD and antivirus state to rule out that possibility. When corruptions using test 1 occurred, corruptions would always occur randomly throughout the folder and never were the same files. If exiftool was called separately for each file (as in test 2) or with a smaller number of raw files (as in test 1B with just 200 NEF files in the folder), no matter what CPU, SSD, memory or antivirus setting permutations there would never be a corruption.

This is highly suggestive of some scaling issue - where some resource, pointer etc is being overwritten, out of bounds etc.  When I was involved in product R&D these were the kinds of scaling issues which Systems and Test Engineering would invariably love inflicting on the development team to stress test the program or system.


philbond87

@sidneyd,

Out of curiosity, have you tested this with any other file types?

Thanks,
Phil (not the Phil)

StarGeek

For what it's worth, I was not able to replicate this problem.  But my NEFs are from a 5100, so their size is from 15-20MB instead of 40MB.

I ran your exiftool command over 1,454 random NEFs.  Then I opened Bridge and checked the folder.  No corruption on any file.  Opened up IMatch, made sure it was set to not use WIC codecs but its own RAW processing, and loaded up the files in that.  None of the files showed any sign of corruption.

Phil Harvey

I compared the raw data from two of your files (_DSC9212.nef and _DSC9212_CORRUPT.nef).  There were 14 single-bit differences in the data (see the "hex" column in the binary difference output below).

      offset        char    hex       long    short1 short2    float      double      date
-----------------   ---- -------- ----------- ------ ------ ---------- ----------- ----------
23426844 0165771c   Y/.] 592f9b5d  1570451289  12121  23963  1.398e+18   4.549e-25 2019-10-07 (60%)
23426844 0165771c > Y/.] 592f1b5d  1562062681  12121  23835  6.989e+17   4.549e-25 2019-07-02 (60%)
23426852 01657724   .09. 9b3039bc -1137102693  12443 -17351    -0.0113  -1.543e-10 1933-12-20 (60%)
23426852 01657724 > .0.. 9b30b9bc -1128714085  12443 -17223   -0.02261  -1.543e-10 1934-03-27 (60%)
23426892 0165774c   CC^. 43435e8c -1939979453  17219 -29602 -1.712e-31  -1.759e+30 1908-07-11 (60%)
23426892 0165774c > CCN. 43434e8c -1941028029  17219 -29618 -1.589e-31  -1.759e+30 1908-06-29 (60%)
23426900 01657754   .Ul. c1556c82 -2106829375  21953 -32148 -1.736e-37  4.601e-105 1903-03-29 (60%)
23426900 01657754 > .U.. c155ec82 -2098440767  21953 -32020 -3.473e-37  4.601e-105 1903-07-04 (60%)
23426908 0165775c   .`.. b560ccfc   -53714763  24757   -820  -8.49e+36  -9.355e-15 1968-04-19 (60%)
23426908 0165775c > .`L. b5604cfc   -62103371  24757   -948 -4.245e+36  -9.355e-15 1968-01-13 (60%)
23426916 01657764   ..Y. f0a159e3  -480665104 -24080  -7335 -4.015e+21  -1.03e+128 1954-10-08 (60%)
23426916 01657764 > .... f0a1d9e3  -472276496 -24080  -7207 -8.029e+21  -1.03e+128 1955-01-13 (60%)
23426924 0165776c   .Xb# e3586223   593647843  22755   9058  1.227e-17  2.734e+199 1988-10-23 (60%)
23426924 0165776c > .X.# e358e223   602036451  22755   9186  2.454e-17  2.734e+199 1989-01-29 (60%)
23426948 01657784   f... 660fe496 -1763438746   3942 -26908 -3.685e-25  3.381e+112 1914-02-13 (60%)
23426948 01657784 > f... 660ff496 -1762390170   3942 -26892 -3.943e-25  3.381e+112 1914-02-25 (60%)
23426956 0165778c   ..2. b4ac320a   171093172 -21324   2610  8.603e-33  4.628e-303 1975-06-04 (60%)
23426956 0165778c > ..". b4ac220a   170044596 -21324   2594  7.832e-33  4.628e-303 1975-05-23 (60%)
23427324 016578fc   .!.Q 8121a551  1369776513   8577  20901  8.865e+10  -6.096e+97 2013-05-28 (60%)
23427324 016578fc > .!.Q 8121b551  1370825089   8577  20917  9.724e+10  -6.096e+97 2013-06-10 (60%)
23427796 01657ad4   .... 9fd90a94 -1811228257  -9825 -27638  -7.01e-27    1.8e+304 1912-08-09 (60%)
23427796 01657ad4 > .... 9fd98a94 -1802839649  -9825 -27510 -1.402e-26    1.8e+304 1912-11-14 (60%)
23427804 01657adc   .C.. 9e438f92 -1836104802  17310 -28017 -9.041e-28 -2.974e-297 1911-10-26 (60%)
23427804 01657adc > .C.. 9e430f92 -1844493410  17310 -28145 -4.521e-28 -2.974e-297 1911-07-21 (60%)
23428012 01657bac   .#1. 19233116   372319001   8985   5681  1.431e-25  -1.467e+78 1981-10-19 (60%)
23428012 01657bac > .#.. 1923b116   380707609   8985   5809  2.862e-25  -1.467e+78 1982-01-24 (60%)
23428044 01657bcc   .... e4f1e6cb  -874057244  -3612 -13338 -3.027e+07   7.669e-70 1942-04-21 (60%)
23428044 01657bcc > .... e4f1f6cb  -873008668  -3612 -13322 -3.237e+07   7.669e-70 1942-05-03 (60%)

It is interesting that many of the errors are spaced by exactly 8 bytes, and all of the errors are in the same byte-mod-8, with all in either bit 4 or bit 7.  I see no possible way that ExifTool could cause a problem like this.  It is most certainly a hardware problem.  Bit flips like this can't be software when you are just copying blocks directly from disk to disk.  FYI, here is the ExifTool code that does the copy:

#------------------------------------------------------------------------------
# Copy data block from RAF to output file in max 64kB chunks
# Inputs: 0) RAF ref, 1) outfile ref, 2) block size
# Returns: 1 on success, 0 on read error, undef on write error
sub CopyBlock($$$)
{
    my ($raf, $outfile, $size) = @_;
    my $buff;
    for (;;) {
        last unless $size > 0;
        my $n = $size > 65536 ? 65536 : $size;
        $raf->Read($buff, $n) == $n or return 0;
        Write($outfile, $buff) or return undef;
        $size -= $n;
    }
    return 1;
}

Your theory about pointers being out-of-bounds or overwritten doesn't wash.  This is a hardware issue cut-and-dried.

Since it isn't the disk or AV, it must be a RAM or cache issue of some kind in your system.

- Phil

Edit:  Hmm.  After reading your last post fully I see you have run on multiple hardware systems.  Is there any commonality between these systems other than ExifTool?  If not, you make a strong case, but I still can't see how ExifTool could be the cause.  One more thing to try (although I hate to suggest it because if this fixes the problem we are no closer to finding the cause) is to use one of the other ExifTool packages:  Either the alternate Windows version, or the pure Perl version if you have Perl installed.

sidneyd

Thanks for the investigation and, as noted in the footer above, yes, I have found this problem on multiple systems, which rules out any hardware such as CPU, RAM, SSD, HDD, GPU.

As to Software commonality, they all run Windows 10 Pro x64 21H2 (19044.2604) which is the latest Windoze 10 and which all have the latest drivers for their hardware and any MS patches.

What I will do, though it will take some time, is dig through my archives to see if I have any files smaller than the large 14-bit encoded NEFs produced by the D810. Then I will also create some batch processes to make some variations on the stress test, with the number of NEFs per folder starting at 200 and going up to 2200.

I will also try doing identical runs with the alternate Windows version you mention above to get more data points.

This may take a day or more to run, so stay tuned.

Phil Harvey

I analyzed the other 2 samples you sent.  Same thing, but fewer bit errors.  There was 1 bad bit in _DSC9366_CORRUPT.nef and 2 bad bits in _DSC9717_CORRUPT.nef .  The 2 bad bits were again separated by a multiple of 8 bytes, and all were either bit 4 or bit 7.

> subfile ~/Desktop/forum14536/_DSC9366.nef t1 0xf6a18
> subfile ~/Desktop/forum14536/_DSC9366_CORRUPT.nef t2 0xf6a74
> phdump t1 t2
      offset        char    hex       long    short1 short2    float      double      date
-----------------   ---- -------- ----------- ------ ------ ---------- ----------- ----------
 4306456 0041b618   f.G. 66c247c8  -934821274 -15770 -14265 -2.046e+05  1.195e+243 1940-05-18 (10%)
 4306456 0041b618 > v.G. 76c247c8  -934821258 -15754 -14265 -2.046e+05  1.195e+243 1940-05-18 (10%)
> subfile ~/Desktop/forum14536/_DSC9717.nef t1 0xb3034
> subfile ~/Desktop/forum14536/_DSC9717_CORRUPT.nef t2 0xb3094
> phdump t1 t2
      offset        char    hex       long    short1 short2    float      double      date
-----------------   ---- -------- ----------- ------ ------ ---------- ----------- ----------
24596684 017750cc   &O.. 264fe7be -1092137178  20262 -16665    -0.4518  1.604e+299 1935-05-24 (63%)
24596684 017750cc > &O.. 264fe7ae -1360572634  20262 -20761 -1.052e-10  1.604e+299 1926-11-20 (63%)
24597316 01775344   }.~. 7dff7ee7  -411107459   -131  -6274 -1.204e+24  -3.704e+57 1956-12-21 (63%)
24597316 01775344 > }.~g 7dff7e67  1736376189   -131  26494  1.204e+24  -3.704e+57 2025-01-08 (63%)
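
For anyone who wants to check the single-bit differences in the dumps above, here is a minimal Python sketch (not part of ExifTool, and not the tool Phil used). It XORs two equal-length byte strings and lists which bits differ. The function name `diff_bits` is made up for illustration, and bit 0 is assumed to be the least significant bit of each byte, which matches the "bit 4 or bit 7" wording in this thread.

```python
def diff_bits(original: bytes, modified: bytes, base_offset: int = 0):
    """Return (byte_offset, bit_index) for every differing bit; bit 0 is the LSB."""
    if len(original) != len(modified):
        raise ValueError("inputs must be the same length")
    flips = []
    for i, (a, b) in enumerate(zip(original, modified)):
        x = a ^ b  # a set bit in x marks a bit that changed
        for bit in range(8):
            if x & (1 << bit):
                flips.append((base_offset + i, bit))
    return flips

# The first pair of 4-byte words from the _DSC9212 dump earlier in the thread.
flips = diff_bits(bytes.fromhex("592f9b5d"), bytes.fromhex("592f1b5d"),
                  base_offset=0x0165771C)
for offset, bit in flips:
    print(f"offset 0x{offset:08x} (mod 8 = {offset % 8}): bit {bit} flipped")
# prints: offset 0x0165771e (mod 8 = 6): bit 7 flipped
```

Running the same function over the other word pairs in the dumps gives one flipped bit per pair, always bit 4 or bit 7.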

- Phil

Phil Harvey

I should also point out that most of the bit errors happened at around 24 to 25 MB into the file, but one occurred at around the 5 MB mark.

- Phil

sidneyd

I have currently run about 1 million updates with exiftool and will share the data in the next few days when all the test runs are complete. Preliminary data does hint that large files such as D810 or other high resolution raw files are more susceptible to corruption.

While I am still conducting additional tests to gather more data points, I started thinking laterally and wondered if there could be an enhancement to exiftool such as a -RobustCopyImage flag. ;D   If this flag were set, the program would compute a CRC checksum of the source image data and compare it with a CRC checksum of the destination image data.  If there were a mismatch, exiftool could generate a major warning or error and not write the output file.  While not addressing the problem head-on, it could provide a higher-confidence mode that the image data was copied intact.
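
To make the checksum idea concrete, here is a rough Python sketch of it, not anything that exists in ExifTool. The helper names `crc_of_region` and `image_data_matches` are hypothetical. The byte offsets and size of the image data in each file are assumed to be known already, since the image data sits at a different offset once the metadata has changed.

```python
import zlib

CHUNK = 65536  # read in 64 kB pieces so a 40 MB file is never held in memory whole

def crc_of_region(path, offset=0, size=None):
    """CRC-32 of `size` bytes of `path` from `offset` (to end of file if size is None)."""
    crc = 0
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = size
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            buf = f.read(n)
            if not buf:
                break  # reached end of file
            crc = zlib.crc32(buf, crc)  # continue the running CRC
            if remaining is not None:
                remaining -= len(buf)
    return crc

def image_data_matches(src_path, src_offset, dst_path, dst_offset, size):
    """True when the same-sized region has the same CRC-32 in both files."""
    return (crc_of_region(src_path, src_offset, size)
            == crc_of_region(dst_path, dst_offset, size))
```

A caller would run `image_data_matches(original, src_offset, rewritten, dst_offset, size)` after the write and refuse to replace the original on a `False` result.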


Phil Harvey

I have had to deal with many instances of bit errors in the past, and detecting them is not always as easy as you think.  If I added the -RobustCopyImage flag then the read-back would certainly be fast enough to come out of the disk memory cache, which may not reflect the actual value stored.  (Of course, here I'm still assuming some sort of hardware issue, which is still contentious, but has been the source of 100% of problems like this I have seen in the past.)

My first inclination for an ExifTool mod to patch this problem would be to change the 64 kB buffer size to 1 MB to reduce the frequency of read/write cycles.  I could see this being much more efficient on modern systems, and I'm not sure how common it would be for a system to be asked to switch this quickly between read and write for a large data transfer.  It's a long shot, but it is possible that radiated energy at this switching frequency is exciting the data lines of the two error bits, leading to the problem.  A problem like this would be common to all systems with motherboards of the same layout, but changing the frequency would fix it in all cases.  However, we can strategize more after you present your new test results.
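
To put numbers on the read/write cycle count, here is an illustrative Python rendering of a chunked copy with an adjustable buffer size. It is a sketch only, not the ExifTool code: the name `copy_block`, the `chunk` parameter and the cycle counter are assumptions added for the example.

```python
import io

def copy_block(src, dst, size, chunk=65536):
    """Copy `size` bytes from src to dst in pieces of at most `chunk` bytes.

    Returns the number of read/write cycles used; raises EOFError on a short read.
    """
    cycles = 0
    while size > 0:
        n = min(chunk, size)
        buf = src.read(n)
        if len(buf) != n:
            raise EOFError("short read")
        dst.write(buf)
        size -= n
        cycles += 1
    return cycles

data = bytes(40 * 1024 * 1024)  # stand-in for a 40 MB block of NEF image data
print(copy_block(io.BytesIO(data), io.BytesIO(), len(data)))                 # 640
print(copy_block(io.BytesIO(data), io.BytesIO(), len(data), chunk=1 << 20))  # 40
```

For a 40 MB block, going from 64 kB to 1 MB chunks cuts the number of read/write switches from 640 to 40.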

- Phil


sidneyd

Hi Phil,

I know from many years working in R&D that these types of issues can be extremely challenging.

Certainly a different chunk size could be an interesting data point or option.  If you did add that mod, perhaps leave 64k as the default in case it breaks something else for other people, and use the bigger chunks when an option such as -LargeChunks is set.

There is about one more day of testing to run, as I want to ensure I get a full picture covering as many different permutations on one platform, then also to see if that holds true when shifting to a different system with different CPU (AMD vs Intel), Motherboard, Memory, SSD etc.

Regards
Sidney