Ignore extension when reading?

Started by fracai, February 08, 2015, 03:40:53 PM

fracai

Is there a way to ignore the file extension when determining the filetype?

I'm working on a script to detect duplicate images, but ignoring any metadata. To do this I use the following:

exiftool /path/to/image -all= -o - | md5sum

This works great, but fails if the image has the wrong extension. In other words, if I test a JPG file stored with a PNG extension, exiftool only outputs the following message:

Error: Not a valid PNG (looks more like a JPEG) - IMG_0076_00.png

If I just change the extension to JPG, the file is processed normally. I figure that if exiftool can figure out it's probably a JPG, it should be able to process it as one.

Ignoring that I can, should, and probably will just first fix all the incorrectly named files, is there a way to ignore the file extension?
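
For anyone who wants to fix the names first, here is a minimal sketch of checking a file's leading bytes instead of trusting its extension. The helper name sniff_extension and the SIGNATURES table are made up for illustration, only three common signatures are covered, and this is not how ExifTool itself identifies files.

```python
# Hypothetical helper: identify a few image types from their leading bytes,
# ignoring the file extension entirely. Only three formats are covered.
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_extension(path):
    """Return the extension implied by the file's contents, or None if unknown."""
    with open(path, "rb") as handle:
        header = handle.read(8)  # the longest signature above is 8 bytes
    for magic, extension in SIGNATURES.items():
        if header.startswith(magic):
            return extension
    return None
```

A script could compare the result with the current extension and rename only the files where the two disagree.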

Phil Harvey

I'll look into this, but currently ExifTool considers an incorrect extension to be a serious error, hence it won't process the file.

If you want, you could work around it like this:

cat /path/to/image | exiftool -all= - | md5sum

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

fracai

Ah, excellent idea. cat will work nicely and doesn't add much at all to processing. I can even avoid cat with:

exiftool -all= - < /path/to/image | md5sum

Thanks a lot.

I would appreciate it if there was a way to ignore the extension error, but it's not a critical need.
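
The stdin workaround above could be wrapped in a script roughly like this. It is only a sketch: the function name metadata_free_md5 and the command parameter are invented here, and the default arguments are simply the ones used in this thread ("-all=" and "-"). Because the file arrives on stdin, the tool never sees the file name or its extension.

```python
import hashlib
import subprocess

# Assumed default: the same arguments used in the thread ("-all= -").
EXIFTOOL_COMMAND = ("exiftool", "-all=", "-")


def metadata_free_md5(path, command=EXIFTOOL_COMMAND):
    """Feed the file to `command` on stdin and return the MD5 of its stdout."""
    with open(path, "rb") as source:
        result = subprocess.run(
            command, stdin=source, stdout=subprocess.PIPE, check=True
        )
    return hashlib.md5(result.stdout).hexdigest()
```

With check=True a non-zero exit status raises subprocess.CalledProcessError, so a file the tool rejects is not silently hashed as empty output.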

Phil Harvey

I looked into this.  I think it is just safer to refuse writing a file with the wrong extension.  This could be an indication of a corrupted file.  At the very least, it should be fixed because some other utilities will have trouble reading a file like this.

I thought about making this a minor error, but I'm not too enthusiastic about changing this -- you are the first one in 10 years to complain about it.

- Phil

fracai

Would it help if I just got more vocal? ;-)

I agree about just correcting the extension.

I know I can use $filetype when renaming, but is there a way to get "jpg" instead of "JPEG"? The examples so far look like I'd have to take a separate pass for each filetype that I want to handle.

Thanks for your help.

Phil Harvey

Unfortunately, FileType is not designed to be used as an extension, so I wouldn't recommend doing this blindly.  Specifically, some file types with ambiguous extensions have been given FileType names which are unique but don't represent a viable extension (like "Canon 1D RAW" for example, which has a "TIFF" extension).

I would recommend something which is safer but less efficient, and doing one type at a time, as you mentioned, like this:

exiftool -if '$fileType eq "JPEG"' -filename=%f.jpg -v DIR

Here I added the -v option so you would have an indication of which files were renamed.

If it weren't for these ambiguous extensions, you could have done something like this:

exiftool '-filename<%f.${filetype;s/JPEG/JPG/;$_=lc($_)}' -v DIR

to get lower case "jpg" extension for JPEG files, and lower case extensions for other types.

- Phil
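
For readers doing the renaming from their own script rather than on the exiftool command line, the s/JPEG/JPG/ plus lc() expression above corresponds roughly to the sketch below. The function name is hypothetical, and the rule of rejecting any FileType that is not purely alphanumeric is an assumption added here to skip names like "Canon 1D RAW"; it is not something ExifTool does.

```python
def extension_from_file_type(file_type):
    """Mimic s/JPEG/JPG/ followed by lc(); return None for unusable names."""
    # Assumption: names with spaces or punctuation (e.g. "Canon 1D RAW")
    # are not viable extensions, so they are rejected instead of guessed at.
    if not file_type or not file_type.isalnum():
        return None
    # Replace only the first match, as a Perl s/// without /g would.
    return file_type.replace("JPEG", "JPG", 1).lower()
```

A caller would leave the file alone whenever the function returns None.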

TippyTurtle

I have had good luck with _all_ file formats on my big ugly 4 TB file server with this script:
http://code.activestate.com/recipes/362459-dupinator-detect-and-delete-duplicate-files/
MP3s, photos, etc. seem to magically duplicate themselves with new file names.  This has found the cruft for me without any scary moments...so far.

I realize if a single byte has changed in a copy of a file, say because EXIF data was updated, the two files won't be considered duplicates.  Even so, with a family of 5 copying movies/photos/etc off multiple devices at random times, this script has probably saved me terabytes.  :-)

I did add a few lines at the end so I would know more of what actually happened:
# Summary counts; the parenthesised form works in both Python 2 and Python 3.
print('Found %d files.' % len(sizes))
print('Found %d sets of potential duplicates.' % potentialCount)
print('Tried to delete %d actual duplicates.' % len(dupes))


Hope this helps,
Todd
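
The byte-for-byte approach described above can be sketched as follows. This is not the Dupinator recipe, just a simplified illustration with a made-up function name: files are grouped by size first, and only files that share a size are hashed. It reads each candidate fully into memory, so it would need chunked reads for very large files.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Return groups of paths under `root` sharing both size and SHA-256 digest."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_digest[digest].append(path)
        duplicates.extend(
            sorted(group) for group in by_digest.values() if len(group) > 1
        )
    return duplicates
```

As noted above, two copies of a photo whose metadata differs will not match here, which is the case the exiftool -all= approach earlier in the thread is meant to handle.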

Phil Harvey

I am thinking about adding a "FileTypeExtension" tag for this purpose.  Currently, my only concerns are:

1) what to do if the file type doesn't normally have an extension

Presumably, in this case the FileTypeExtension would be an empty string, but this leads to the problem of how to avoid adding the trailing "." when using FileTypeExtension to rename files if the FileTypeExtension is empty.

2) what to do if the file extension isn't known

Presumably it would be best to avoid generating the FileTypeExtension tag here.  But then you must be able to handle a missing FileTypeExtension.

I'm willing to listen to any suggestions that anyone may have.

- Phil

fracai

You could have "FileTypeExtension" which provides "png" and "FileTypeExtensionWithDot" that provides ".png". In both cases, this would be an empty string if there isn't normally an extension.

I'd think it'd be easy enough to avoid renaming if the extension is an empty string by testing it in an "-if" argument.

Not appending a dot could be handled the same way, but the additional "WithDot" tag would probably be useful enough to justify it.

Phil Harvey

Thanks for the suggestion.  I would prefer not to add another tag for this (I was even dubious about adding FileTypeExtension).

Thinking about it, maybe the advanced formatting expression will provide a reasonable alternative, something like this:

exiftool "-filename<test%c${FileTypeExtension;$_=qq(.$_) if $_}" DIR

- Phil
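
The two concerns raised earlier (an empty extension and a missing extension) and the "$_=qq(.$_) if $_" expression above can be spelled out in a small sketch. The function name build_filename is invented, and using None to stand for "the FileTypeExtension tag was not generated" is an assumption for illustration only.

```python
def build_filename(base, file_type_extension):
    """Return the new name, or None when no extension tag was generated."""
    if file_type_extension is None:
        return None  # tag missing: leave the file alone
    if file_type_extension == "":
        return base  # type has no usual extension: no trailing "."
    return f"{base}.{file_type_extension}"
```

The dot is added only when there is something to put after it, which is the same idea as the conditional in the formatting expression.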

plaw

Hi Phil, just wondering if this is still the case that there's no flag for ignoring the "file type inaccurate" error? I tried -m but I guess you consider it more than minor?

It turns out that users upload all sorts of files with the wrong filetypes. I can do some processing on my end to fix it before using exiftool, but I'm just checking whether there's a flag or something to use first.

Phil Harvey

Currently the only way around this is to fix the extension since it still isn't a minor error.

- Phil