ExifTool as Photo Librarian - eliminating duplicates

Started by hroth, December 11, 2011, 06:22:24 PM


hroth

I just discovered ExifTool and it's been fantastic.  I have been using the following command to take my umpteen sources of photos (all JPEGs) and organize them in one directory:

exiftool -overwrite_original -r -d %Y/%m/%d/%Y%m%d-%H%M%S '-filename<${CreateDate}-${MyModel}-${MyFileSize}.%e' .


(see ExifTool_config at end of post).

Thus, IMG_0330.JPG automagically becomes "2011/07/21/20110721-185915-iPhone_4-2107683.jpg"

This works great!  And initially it found duplicates and left them alone, which was exactly what I needed.  I used the file size as a simple way to distinguish photos that were taken with the same camera at the same CreateDate (down to the second).  I had found that photos would sometimes share the same CreateDate but have different FileSize values, so I made the size part of the filename.

However, now I'm finding when I run multiple passes on the directory, the change in metadata adjusts the filesize and I end up with duplicates again.

In googling "md5" and exiftool I realize exiftool does no processing of the image payload.  I'm wondering if there is a way to define a custom tag for ExifTool that simply reports the file size of the raw image (without the metadata).  Of course, a custom tag that calculated the md5/crc32/sha-1/etc. of the raw image would be even better.

-------------------------------------------------------------------------------------------------------------------------
For ExifTool_config, I found this code snippet that works well:

%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Composite' => {
        MyModel => {
            Require => 'Model',
            # translate spaces to underscores
            ValueConv => '$val =~ tr/ /_/; $val',
        },
        MyFileSize => {
            Require => 'FileSize',
            # FileSize arrives here as a plain byte count, so this tr is effectively a no-op
            ValueConv => '$val =~ tr/ /_/; $val',
        },
    },
);

1;  #end


Phil Harvey

You may not need the MyFileSize user-defined tag.  It should give the same result as -filesize# (i.e. ${FileSize#} in the filename string).

If you have an md5 application it should be possible to get the md5 of the image only by doing something like this:

exiftool image.jpg -all= -o - | md5 -

Maybe this will give you what you want.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).
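
A minimal sketch of how the pipeline above might be used to list duplicates. The function names image_md5 and list_duplicates are made up for illustration, md5sum is assumed to be available (macOS ships md5 instead), and paths are assumed to contain no tabs or newlines.

```shell
#!/bin/sh
# Hash only the image data of one JPEG by stripping all metadata first.
# This is the pipeline from the post above, with md5sum assumed.
image_md5() {
    exiftool "$1" -all= -o - | md5sum | awk '{print $1}'
}

# Read "hash<TAB>path" lines on stdin and print every path whose hash has
# already been seen, i.e. the second and later copies of each image.
list_duplicates() {
    awk -F '\t' 'seen[$1]++ { print $2 }'
}

# Example (not run here): hash every JPEG under the current directory.
# find . -type f -name '*.[jJ][pP][gG]' | while IFS= read -r f; do
#     printf '%s\t%s\n' "$(image_md5 "$f")" "$f"
# done | list_duplicates
```

This only catches copies whose image data is byte-for-byte identical. As comes up later in the thread, a losslessly rotated copy has different image data and therefore a different MD5.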

hroth

Thanks for the reply! 

It sounds like I will have to write a script with the command you suggest, and I'm happy to do that and share that with everyone.

md5 is pretty overkill - I'm just trying to catch multiple images that share the same CreateDate and they should be pretty different, especially in JPEG.

I was wondering if it is possible to set up a custom tag in ExifTool_config that would excerpt the last few bytes of the JPEG image payload, for example

exiftool -overwrite_original -r -d %Y/%m/%d/%Y%m%d-%H%M%S '-filename<${CreateDate}-${MyModel}-${Last4bytes}.%e' .


so that

IMG_0330.JPG becomes "2011/07/21/20110721-185915-iPhone_4-0915.jpg"

Thanks again!

hroth

Ugh, I can answer my own question.  Unfortunately, it seems that ExifTool isn't the proper tool for this type of librarian activity: I ended up with duplicates of almost every photo in different orientations.  I'm still figuring out why, but once that happened my md5sums didn't match.

For those curious, here is the script:

find . -name '*.[jJ][pP][gG]' -print0 | xargs -0 -n1 -P8 -Ixyzzy sh -c "exiftool xyzzy -all= -o - | md5sum - | awk '{print \"mv \" \"xyzzy\", \"Dump/\"\$1\".jpg\"}'"


That command prints a list of mv commands (in effect, a second shell script) that move each file to Dump/<md5>.jpg in a flat directory.

Then one can run the first exiftool command above (keeping the MD5-based file name so that the new names are unique) and throw out the duplicates.

Unfortunately, it doesn't work - I will have to understand how I ended up with different orientations and different md5sums.

Phil Harvey

If you losslessly rotate an image, then the JPEG image data is modified so the MD5 will change.

My suggestion would be to add a unique identifier (Exif:ImageUniqueID) at the start of processing, then use this to catch duplicates.  But of course, this doesn't help with what you have now.

- Phil
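
A rough sketch of the suggestion above, under some assumptions: the helper names new_image_id and tag_if_missing are hypothetical, and a random 32-character hex string is only one possible choice of identifier. The -if condition is meant to skip files that already carry an ImageUniqueID, so a second pass does not replace it.

```shell
#!/bin/sh
# Generate a 32-character hexadecimal identifier from 16 random bytes.
new_image_id() {
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Write the identifier to ImageUniqueID, but only when the file has none
# yet, so repeated runs leave an existing identifier untouched.
tag_if_missing() {
    exiftool -if 'not $ImageUniqueID' -overwrite_original \
        "-ImageUniqueID=$(new_image_id)" "$1"
}

# Example (not run here): tag one newly imported file.
# tag_if_missing IMG_0330.JPG
```

Copies made after tagging share the identifier even if they are later rotated or have other metadata edited, so ${ImageUniqueID} could in principle take the place of the file size in the filename template. Copies that were made before tagging each get a different identifier, which is why this does not help with an existing collection.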

hroth

Whew.  Found what I needed! Here is my modest internet contribution of the year. :)

It requires not just ExifTool but two other great tools:

1) GNU Parallel
2) GraphicsMagick

So, for the record:

Step A. Put this in ~/.ExifTool_config

%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Composite' => {
        MyModel => {
            Require => 'Model',
            # translate spaces to underscores
            ValueConv => '$val =~ tr/ /_/; $val',
        },
    },
);
1;  #end


Step B. Commands:

$ find . -type f -name '*.[jJ][pP][gG]' | parallel --progress 'fingerprint=`gm identify -size 8x8 -format "%k" {}`; mv {} {//}/$fingerprint.jpg'


This takes all JPEGs and establishes a "fingerprint" for each: it scales the image down to a small size (the command says 8x8, but GraphicsMagick actually goes down to something larger, like 484x324) and counts the number of unique colors. It's very quick, especially when used with GNU Parallel, and seems to be reasonably 'collision-free'.

Did anyone know that the number of unique colors changes after a JPEG lossless rotation?  However, once the image is resized down this small, the count is invariant under lossless rotation.

This then renames it by its fingerprint number, e.g.

DSC_1024.JPG => 63214.jpg

Then

$ exiftool -overwrite_original -r -d %Y/%m/%d/%Y%m%d-%H%M%S '-filename<${CreateDate}-${MyModel}-%f.%e' .


This then moves it into the main directory tree, e.g.

63214.jpg => 2011/01/02/20110102-125953-NIKON_D80-63214.jpg

Yay!  Thanks for your tips along the way, Phil.

--

The last thing I need to figure out is how to operate the Fingerprint and ExifTool in a "dump" directory so I don't run this over and over on the main directory.  I.e.

If the main directory tree is /Pictures/2010/..., /Pictures/2011/..., and so on, I should probably copy new photos into /Pictures/Dump and sort them from there.

Is there a better way than:

$ exiftool -overwrite_original -r -d /Pictures/%Y/%m/%d/%Y%m%d-%H%M%S '-filename<${CreateDate}-${MyModel}-%f.%e' .



Phil Harvey

Thanks for the tip.

About moving the pictures, that command seems OK (I'm not 100% clear on what you are doing), but the -overwrite_original option has no effect when renaming files because exiftool will not overwrite an existing file.  So you must make sure the original files are moved out of the way before moving them back again.

- Phil
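
To illustrate the no-overwrite behaviour described above, here is a small stand-alone sketch. The function move_no_clobber is hypothetical and only imitates the effect from the shell; it is not how exiftool is implemented.

```shell
#!/bin/sh
# Move SRC to DEST only when DEST does not exist yet. Returns 1 and leaves
# SRC where it is otherwise, so files that collide with an existing name
# stay behind in the source directory.
move_no_clobber() {
    src=$1 dest=$2
    if [ -e "$dest" ]; then
        echo "kept (destination exists): $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" && mv "$src" "$dest"
}

# Example (not run here):
# move_no_clobber Dump/63214.jpg /Pictures/2011/01/02/20110102-125953-NIKON_D80-63214.jpg
```

With this behaviour, whatever remains in the dump directory after a pass is a file whose target name was already taken, which makes the leftovers a convenient list of candidate duplicates to check before deleting.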

hroth

I see, yes, that's fine - all that's left over (not overwritten) is all the "duplicates" which is nice for verification.  Then I can delete the files to save space.

Over 16,200 photos ExifTool did remarkably well.  It actually crashed once (some Perl error message) which unfortunately I didn't save for you, but I just re-ran it and it completed the job without any error.  Photo library done!

The other minor issue is that one of the many camera models had a forward-slash "/" in it so that it created subdirectories.  I'm sure there's a proper way to replace a "/" with a "_" in the model but it didn't happen too often.

Thanks again!

Phil Harvey

Quote from: hroth on December 20, 2011, 12:07:05 PM
Over 16,200 photos ExifTool did remarkably well.  It actually crashed once (some Perl error message) which unfortunately I didn't save for you

I don't count that as doing well.  If you ever have another crash, please provide as many details as possible.  ExifTool should not crash.

Quote: The other minor issue is that one of the many camera models had a forward-slash "/" in it so that it created subdirectories.  I'm sure there's a proper way to replace a "/" with a "_" in the model but it didn't happen too often.

This can be done by modifying the MyModel logic from above:

        # translate spaces and slashes to underscores
        ValueConv => '$val =~ tr{ /}{__}; $val',


- Phil
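
The effect of that translation can be tried out in the shell before editing the config. This is only an equivalent written with the tr utility, and the function name sanitize_model is made up for the example.

```shell
#!/bin/sh
# Replace every space and forward slash with an underscore, the same
# mapping as the Perl tr{ /}{__} in the config line above.
sanitize_model() {
    printf '%s' "$1" | tr ' /' '__'
}

sanitize_model 'NIKON D80'   # prints NIKON_D80
echo
```

A model name containing a slash comes out with an underscore in its place, so it no longer creates an extra subdirectory when used in the filename.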