How does exiftool determine that a file "already exists"?

Started by Mark Vang, December 11, 2016, 12:32:27 PM


Mark Vang

I'm working with a photo collection with 415335 jpg images. I'll include the steps below:

start: /photos = 415335 .JPG files

>> converts (camera format) IMG_####.JPG to date/time format "030801-151040.jpg"
exiftool -overwrite_original '-filename<CreateDate' -d %y%m%d-%H%M%S%%-c.%%le -r /photos

>> check for any duplicate files (also run fslint) NO duplicate files reported.
fdupes -r /photos

>> sort the photos into folders by year/month
exiftool '-Directory<CreateDate' -d /sortbydate/%y/%y%m -r /photos

>> 116,666 files are left in /photos as exiftool reports: "already exists" when running.
>> Another (fdupes/fslint) comparison of the folder with 116,666 to the destination /sortbydate returns NO duplicate files.

What is exiftool looking at when it decides that a file "already exists"? Is there a way to get it to look only at filenames?

It's a given that there are probably some duplicate filenames in the original photo archive because the camera only uses a 4-digit # when assigning filenames. That's why I rename them all in the first step.
Since I'm combining photo archives for two photographers it's also possible that some photos will have the same CreateDate even accounting for the use of "%%-c" which would catch photos taken in rapid sequence in the same folder. Two photographers in two locations could presumably take photos at the same time but I can't see that accounting for 116,666 duplicates.
Finally, fdupes/fslint report no duplicate filenames before I run the command to sort by date so I'm stumped.

I've run through this process a few times now and it's pretty time-consuming to re-do 400k+ file batches so help is certainly appreciated.

StarGeek

The only criterion is that there is already a file with the same name in the location you are trying to move to.  Exiftool doesn't do any file comparison like fdupes does.  fdupes considers two files with different contents to not be duplicates even if they have the same name.
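
To illustrate the distinction, here is a minimal shell sketch. It is an analogy built from cmp and a plain existence test, not exiftool's or fdupes' actual code, and the directory and file names are made up. It shows two files that differ in content while the destination name is still taken:

```shell
# Illustration only: plain shell, hypothetical paths inside a temp directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/dest"
printf 'photo one\n' > "$tmp/a/030801-151040.jpg"
printf 'photo two\n' > "$tmp/dest/030801-151040.jpg"

# Content check (the kind of test a duplicate finder makes): the files differ.
if cmp -s "$tmp/a/030801-151040.jpg" "$tmp/dest/030801-151040.jpg"; then
    same_content=yes
else
    same_content=no
fi

# Name check (the test described above): the destination name is already used.
if [ -e "$tmp/dest/$(basename "$tmp/a/030801-151040.jpg")" ]; then
    name_taken=yes
else
    name_taken=no
fi

echo "same content: $same_content, name taken: $name_taken"
rm -rf "$tmp"
```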

What you might want to do is combine the two commands into a single move and rename.  That way the %-c option will be able to deal with collisions, which isn't happening when two files with the same createdate are in different starting directories.
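
One way to predict which files would collide is to list every base name that occurs more than once in the tree. This is only a sketch, and it assumes the rename step has already run, so that the file name alone decides the destination. The temporary directory and the tripA/tripB folders below are hypothetical stand-ins for /photos and its subfolders.

```shell
# Sketch: find file names that appear in more than one directory under a tree.
tmp=$(mktemp -d)
mkdir -p "$tmp/tripA" "$tmp/tripB"
touch "$tmp/tripA/030801-151040.jpg" "$tmp/tripB/030801-151040.jpg"
touch "$tmp/tripA/030801-151041.jpg"

# Strip the directory part of every path, then keep names seen more than once.
# (Assumes no file name contains a newline.)
dupes=$(find "$tmp" -type f | sed 's|.*/||' | sort | uniq -d)
echo "$dupes"
rm -rf "$tmp"
```

Against the real tree, the same pipeline would be run with /photos in place of "$tmp".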

Try something like this.  Note that -overwrite_original isn't needed since you're not actually changing the file contents, just its location.  Of course, test this command out beforehand to make sure.
exiftool '-filename<CreateDate' -d '/sortbydate/%y/%y%m/%y%m%d-%H%M%S%%-c.%%le' -r /photos

* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

Mark Vang

Thanks StarGeek I'll have a test batch ready soon and I'll test out the command you suggest.

I've been thinking of combining the steps but I figured it would be easier to "debug" issues if I kept things in clear steps. Obviously not so as you've explained. Also interesting fact about fdupes. I just started using it two days ago and I didn't realize that.

To try and visualize what was going on I renamed the leftovers adding a batch number using a 6-digit random (hex) string.
exiftool -overwrite_original '-filename<CreateDate' -d %y%m%d-%H%M%S%%-c_$(openssl rand -hex 3).%%le -r /photos

It took me three times to get them all renamed to where they would move into the date/sort folders but I'm now able to view them side-by-side to compare. I know for a fact that the photographers had some duplicate folders but I just thought 116k was too high a number to account for that.

I'll run that test batch this am but it will take a couple of days to run a full set so I'll post an update then. Thanks again.

StarGeek

Quote from: Mark Vang on December 12, 2016, 08:51:22 AM
Also interesting fact about fdupes.

The thing to remember about duplicate checking programs and metadata is that the slightest difference, such as adding a copyright, means that it will be considered a different file.  It's even possible for two jpegs to be graphically identical, but encoded differently (e.g. progressive vs. regular encoding), which would be considered different files by such programs.  That usually wouldn't need to be considered, though, as someone would have to have gone out of their way to do this.

There are programs out there that will compare graphics data only, though I'm unfamiliar with Mac/Linux so I can't recommend anything for those platforms.

Quote: I know for a fact that the photographers had some duplicate folders but I just thought 116k was too high a number to account for that.

That does seem high but it can happen depending upon the circumstances.  For example, I would shoot with two photographer friends at various comic conventions and if I wasn't careful to keep the sources separate, there would be multiple file name collisions as we were often shooting a lot of images (burst shots) at the same time.  Of course, it would be completely different shooting at the zoo.

Mark Vang

Regarding fdupes I should note that to simplify my post I left out a step:

exiftool -v "-location<filepath" "-description<filepath" -artist="Bird Explorers" -Author="Bird Explorers" -xmp:copyright="www.birdexplorers.com" -r -overwrite_original /photos

The photographers use file paths to record information on location/date/identification that could look something like this:
/original_photos//*Africa/RSA/RSA 2004 #1 /J'Burgh HIGC #1 02:03:04/IMG_5121Cape Wagtail.JPG

I didn't want to lose that information so I write that to the location/description tags while the original folder structure is intact. Based on what you are telling me, that explains why fdupes is saying "no dupes" while exiftool is saying "already exists". I'll work through the last batch, comparing tags, etc. to confirm. (Final batch is 417485 .jpg files which will include plenty of duplicates to analyze.)
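
A small sketch of why that metadata step would hide the duplicates from a content-based comparison. Here cksum stands in for whatever comparison the duplicate finder really performs, and the appended text stands in for a tag written by exiftool; both are assumptions made for illustration, and the file names are made up.

```shell
# Sketch: a byte-identical copy stops matching once one copy is modified.
tmp=$(mktemp -d)
printf 'identical image bytes' > "$tmp/original.jpg"
cp "$tmp/original.jpg" "$tmp/copy.jpg"

sum_original=$(cksum < "$tmp/original.jpg")
sum_before=$(cksum < "$tmp/copy.jpg")      # same as the original at this point

# Stand-in for a metadata write that differs per file (e.g. the source path).
printf ' description=/some/other/path' >> "$tmp/copy.jpg"
sum_after=$(cksum < "$tmp/copy.jpg")       # no longer matches the original

echo "before: $sum_before"
echo "after:  $sum_after"
rm -rf "$tmp"
```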

Also, when I write that path to the description and upload photos to Google Drive, the meta description is imported to the image description in Drive and is searchable.

I've got a new batch set up to start running today so I'll be back in a couple days with an update.

RE: comparing photos, I've installed a program called dupeGuru (Linux/Win/Mac?) but haven't tested it yet. It is supposed to compare image data.