[Originally posted by jlbec on 2007-03-21 01:14:32-07]Phil and friends,
A while back I posted about modifying tags in a batch. I'm some ways along now, but I have a question about data safety:
How do I validate that my new copy is good before I delete the old?
In the exiftool manpage, the modification options correctly say "make sure that your operation is successful before deleting the original". However, in a large batch, it is not feasible to ask the user to hand-validate 100 files. Currently, my program does:
generate list of tags to apply > args-file
tmpname=$(mktemp)
cp original.raw "$tmpname"
exiftool -overwrite_original_in_place -tagsFromFile @ -@ args-file "$tmpname"
if (all commands returned 0) mv -f "$tmpname" original.raw
First, we only do work on the temporary copy, keeping the original.raw safe. I know that "all commands returned 0" is a decent guarantee that the result is good, but it isn't perfect. What I'd like to do is grab the actual RAW image data from "original.raw" and "tmpname" and compare. Strip all EXIF info, all previews, just the actual image data. Is there an easy way to do this that is valid for all raw file types?
Joel
[Originally posted by exiftool on 2007-03-21 11:35:31-07]
Hi Joel,
The logic you are using here is already done internally inside exiftool
(as I realized after I did the same thing when I posted a script in here
recently). So doing it again in your script is redundant unless you add
another test like you are proposing.
The ideal way to verify that the image data is still intact would be to
use dcraw to generate an image from the raw data. You could generate
an image before and after, and if they are the same then the file must
be fine. Unfortunately, this would be a very time-consuming step, but
would be an absolute test to ensure image integrity. Just validating the
raw data isn't enough because information like white balance is required
to properly generate an image.
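The before/after comparison could be scripted along these lines. This is a hypothetical sketch, not anything exiftool or dcraw provides: it assumes you have already decoded each file to a byte stream (for example by capturing the output of `dcraw -c`), and the function names are made up for illustration:

```python
import hashlib

def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of a byte stream."""
    return hashlib.sha256(data).hexdigest()

def image_data_unchanged(before: bytes, after: bytes) -> bool:
    # `before`/`after` would be the decoded pixel streams of the
    # original and modified files (e.g. piped from `dcraw -c`),
    # not the raw container bytes, which differ after a tag edit.
    return digest(before) == digest(after)
```

Comparing digests rather than the full streams means you don't need to hold both decoded images in memory at once if you hash them incrementally.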
However, this won't guarantee that any utility can still read the modified
raw file, only that dcraw can do it. To be sure any given utility will be
able to read the file, you need to test with that specific utility. In the
past I have seen poorly written utilities that assume fixed offsets for
information, and if this is the case almost any change to the raw file
will break the image for the utility.
- Phil
[Originally posted by jlbec on 2007-03-21 15:45:45-07]
Phil,
Hmm, dcraw would, indeed, be the surest and slowest way :-) You make a good point there.
I don't see what you mean about my logic being redundant. Do you mean that exiftool, when overwriting an image in-place, validates its changes before rewriting the image? What about if write errors happen during the final write? Does "overwrite in place" actually not write in place? I say this because I am definitely using "overwrite in place" to try and preserve characteristics.
I'm not concerned about poor utilities. Any EXIF data modification would break those utilities, whether done by exiftool, Adobe Photoshop/Lightroom, Photo Mechanic, etc. I'm expressly looking at modifying IPTC Core information and the like, and I expect my image utilities to be capable of handling that :-)
Thanks, Joel
[Originally posted by exiftool on 2007-03-21 16:05:59-07]
Hi Joel,
Yes, the overwrite options go through a temporary file, and only
overwrite the original if there were no errors creating the temporary
file. With -overwrite_original, the final overwrite is accomplished by
renaming the temporary file, so attributes of the original file are lost.
With -overwrite_original_in_place, the final overwrite is
accomplished by opening the original file for writing and copying the
temporary file's data into it. So -overwrite_original_in_place
is slower because it involves an additional copy of the data.
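The difference between the two modes can be sketched in Python. This is a rough analogy, not exiftool's actual implementation (which is Perl), and the function names are made up:

```python
import os
import shutil

def overwrite_by_rename(tmp_path: str, orig_path: str) -> None:
    # Like -overwrite_original: atomically rename the temporary file
    # over the original. Fast, but the original file's attributes
    # (ownership, creation date, etc.) go away with the old inode.
    os.replace(tmp_path, orig_path)

def overwrite_in_place(tmp_path: str, orig_path: str) -> None:
    # Like -overwrite_original_in_place: open the existing original
    # for writing and copy the temporary file's bytes into it, so the
    # file keeps its attributes. Slower: one extra copy of the data.
    with open(tmp_path, "rb") as src, open(orig_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(tmp_path)
```

Note that opening the original with mode "wb" truncates it immediately, which is why the second mode is only safe if a complete temporary file is standing by.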
- Phil
[Originally posted by jlbec on 2007-03-21 23:07:47-07]
Phil,
With -overwrite_original_in_place, what happens on a write error? Is the result corrupt? Does it move the temporary copy on top? Does it have a backup it can restore?
I'm thinking about the case where another process happens to fill the filesystem at the same time you're doing the final overwrite. Do you do the final overwrite with O_TRUNC, or do you leave the full file size around? Do you truncate out to the final length if the image has grown?
Also, what if the disk throws an error in the middle? Is there a backup copy to restore (and yes, I know that a disk can throw an error on that)?
I'm OK with slower. I'm just trying to understand the protection envelope before I give up my logic and trust exiftool's :-)
Joel
[Originally posted by exiftool on 2007-03-22 00:17:08-07]Hi Joel,
ExifTool is designed so that it should never produce a corrupted
file. With -overwrite_original_in_place, if an error occurs
while writing the temporary file then the temporary file is erased
and the original file is left untouched. If an error occurs while
copying the temporary file on top of the original file, then the
temporary file is renamed to replace the original (ie. the same
behaviour as with -overwrite_original). This includes
the case where there is a disk error in the middle of the copy,
but not the case where the original file couldn't be opened for
writing. If the original can't be written, then the temporary is
erased and an error message is printed. If the rename
fails, then you are left with the temporary file which you can
rename manually, but this case is extremely unlikely.
The file is not opened for writing, not appending, so the file
is essentially truncated to the new length if shorter.
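A rough sketch of this recovery logic in Python. This is not exiftool's actual code (which is Perl); the function name and return values are made up for illustration:

```python
import os
import shutil

def finish_overwrite_in_place(tmp_path: str, orig_path: str) -> str:
    """Copy the finished temporary file over the original; on a
    mid-copy failure, fall back to a rename so that a complete
    file always survives."""
    try:
        dst = open(orig_path, "r+b")
    except OSError:
        # Original can't be opened for writing: erase the temporary
        # file and report an error (no rename in this case).
        os.remove(tmp_path)
        raise
    try:
        with dst, open(tmp_path, "rb") as src:
            dst.truncate(0)
            shutil.copyfileobj(src, dst)
    except OSError:
        # Disk full or I/O error partway through the copy: the
        # temporary file is still intact, so rename it into place
        # (same end result as -overwrite_original).
        os.replace(tmp_path, orig_path)
        return "renamed"
    os.remove(tmp_path)
    return "copied"
```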
All of this is probably easier to explain with a flow chart, but
hopefully it makes sense to you. Feel free to look at the code
to see how it is done. Look for the following line in "exiftool":
if ($overwriteOrig > 1) {
In my day job I write data acquisition and data flow systems for
some big physics experiments (we're talking data worth many
hundreds of millions of dollars here), so I am aware of these issues
and take data integrity very seriously.
- Phil
[Originally posted by jlbec on 2007-03-22 00:37:06-07]
Phil,
Ok, I see you are doing what I would do (write error on copy means rename the temporary). Cool.
I assume you meant "The file is opened for writing, not appending" in your second paragraph, not "The file is >not< opened for writing, not appending". I was thinking about longer lengths, though. That is, if you start writing a longer new file, it can still run out of space. Say the disk has enough room for original.raw at 28MB plus tmpfile.raw at 28.5MB, but not enough room to extend original.raw out to 28.5MB. Of course, that's a problem whether you detect it up front or after copying 28MB of tmpfile.raw into original.raw. Based on what you say, detecting it up front means printing an error, while detecting it in the middle means renaming tmpfile.raw to original.raw. I'm not sure which is better :-)
I appreciate that you've taken the data integrity seriously (much better than I would have expected the "average" developer to do), and you can see I do the same. Image data is very sensitive, as I'm quite certain you are aware.
Joel
[Originally posted by exiftool on 2007-03-22 01:26:58-07]Hi Joel,
You wrote:
I assume you meant "The file is opened for writing, not appending"
in your second paragraph, not "The file is >not< opened for writing, not appending".
Right. Sorry about that.
I don't even try to detect out-of-space conditions in advance (I don't believe you could
do this reliably across all the different platforms where exiftool runs). So I think the
best strategy is to rename the temporary file on a copy failure. A warning is printed in
this case (not an error because the operation was completed), so at least you have an
indication that this happened.
I'm glad you think about these sorts of things too, and I'm happy to answer your
questions because I think it is an important topic.
- Phil