I'm working on a program similar to ExifToolGUI and I'm trying to make it as fast, robust and as 'bullet proof' as possible.
I have an option in the program to turn on or off filtering files by their file name extension so as to only look at the 'supported' file extensions. And in general that narrows things down pretty well unless there is something that has the wrong extension on it like an .XMP extension but is really plain old XML (ie the output of the -listx command saved with a .XMP extension).
But I have also found that with file filtering off, I can have a JPEG file with an extension of .zzz and exiftool reads it just fine and knows the FileType is JPEG. So I want to be able to select any file and then, if it is a FileType that is appropriate for exiftool, process it, and otherwise just ignore it. The problem comes about when there is a file that really is not appropriate for exiftool, but once exiftool starts processing the file it can take a long time and generate a big output. (I'm using the -X option to get my output in XML format from exiftool.) An example of a bad file to feed exiftool is the XML output of exiftool from the -listx command. If this 8 MB file is fed into exiftool and processed with the -X -l -t options, it generates 97 MB of output and takes about 95 seconds on my system.
To get around this I could call exiftool first to get just the File:FileType and then, based on that, either ignore or process the file. But then I have to call exiftool twice for every file. Calling this on a normal JPEG file takes about 30-40ms and then a second call to get all the tags in the JPEG file takes about 50ms, so making the extra call to get the FileType increases the time to get the tag values by about 80%, which is not an optimal solution.
So I implemented logic that, when I am getting the XML output from exiftool (a line at a time), looks for the <File:FileType> element, and if it is FileType XML then I ignore the rest of the output that exiftool returns. This dramatically reduces the output down to 2.5k but the execution time (as expected) is still about 95 seconds.
I know this is a weird case that I really should not even be handling, but in my quest to make my program as bullet proof and as robust and as fast as possible, I was hoping for a way in a single call to exiftool (using -X -l -t) to get all the tags for a file BUT if exiftool determines the FileType was XML (and possibly other 'bad' FileTypes) to have exiftool immediately stop processing the file.
Any way I can do that? or other ideas on how to stop or not start exiftool processing 'inappropriate' FileTypes?
Thanks!
Curtis
Edit:
Also, I do give the user a chance to stop/abort when exiftool processing takes longer than a user-specified amount of time, but this can be annoying with a large number of 'inappropriate' files, because the exiftool process has to be stopped (via SIGINT) and then restarted.
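For readers who want to try the time-limit idea described in the edit above, here is a minimal sketch in Python. It is not Curtis's implementation: the helper name `run_with_timeout` is made up for illustration, and it relies on `subprocess.run` killing the child process when the timeout expires rather than sending SIGINT.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd and return its stdout, or None if it ran past the limit.

    cmd is an argument list such as ["exiftool", "-X", "-l", "-t", path]
    (the options used in this thread). On timeout, subprocess.run kills
    the child and raises TimeoutExpired, which is turned into None here.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=seconds
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

A caller would treat `None` as "skipped, took too long". This starts a fresh process per file, so it does not address the restart cost mentioned above.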
Have you already tried using the -if clause? I'm quite sure this will help speed things up.
The -if option won't help here because ExifTool processes the entire input file before evaluating the condition.
Let me think about this.
- Phil
Quote from: Phil Harvey on November 03, 2014, 07:10:40 AM
The -if option won't help here because ExifTool processes the entire input file before evaluating the condition.
Yes, it will have processed the whole file, but it won't try to form the xml output, and I'm quite certain that will speed things up (too).
Hayo,
I tried your suggestion and it did help. For my test 'bad' XML file the processing time went from 95 sec to 63 sec, quite an improvement. And as expected the output size was 0. So this is a much better way to do what I was doing by looking for the <File:FileType> element and then discarding the output from that point on. (The code for that seemed very kludgy to me and I am glad to get rid of it!)
But, it does seem that since exiftool must early on determine the type of file it is about to process that as a special case a test on the FileType tag could be used to stop exiftool immediately from any further processing. Hopefully Phil has some ideas for that. :)
Thank you for your idea, it really helped!
Curtis
Hi Curtis,
I looked into this. You're right that ExifTool knows early for XML files, but this isn't the case in general (the FileType for TIFF files may be determined from the Compression tag if it exists), so it wouldn't be possible in the general case to add an option which aborts processing after identification. (Also, there are a number of ZIP-based files which require parsing of an embedded XML file to determine the file type.) :(
- Phil
Thanks for looking into this Phil,
Hayo's idea helped in a 'clean' way (compared to me looking for the <File:FileType> element) and I'll be using -if. It does save considerable time, and there are enough other options I give the user to not run into this problem too often.
Thanks again!
Curtis
I had an idea. I could extend the -fast option to add a -fast3 feature which avoids processing the file entirely. This option couldn't set the FileType tag (at least, not reliably), but I could add a new FastFileType as the initial guess at the file type. In this model, the initial guess for all XML-format files would be "XMP". Would this help?
- Phil
Hi Phil,
I probably need to better understand how exiftool can use/uses XML and XMP files (and possibly other file types) in order to answer your question.
First:
Is there any reason to try to get/extract tags from XML file type?
Is there any reason to try to get/extract tags from XMP file type? (seems like XMP files are used by exiftool as input to get tag values from to write to other files, correct?)
If the answer is No for both questions, then your solution would greatly help in that I could just always not process (ie get tags for) any XML file types. If yes just for XMP file type, then your solution would still be helpful in that I could give the user the option to filter out all XML file types (even though XMP is a valid file type), which may be the most common situation anyways, and this could really save on execution time.
For some background, basically, my program is in many ways similar to ExifToolGUI in that I can browse for files and get a list of files, etc. Then I send all the files (in separate thread(s)/task(s)) to exiftool to get all the tags for all the files (using -X -l -t options). I want to give the user the option to get all the tags for all 'appropriate' file types or just for file types the user selects. (There may be other file types besides XML that are not 'appropriate' to get tags from and I'd want to also not get tags from those file types.)
For my own use, I would mainly be using my program for getting and editing tags in jpg and pdf files, but I want to eventually make it available to anyone to use and that is why I'm trying to make it as versatile as possible by trying to 'intelligently' deal with any file type (even though at this point I may not fully understand all the nuances of the uses of the other file types).
For your idea, it seems like once you have determined it is an XML formatted file, then it would be possible to determine for sure it is an XMP file type by looking at the first few lines. Although these lines are optional in the XMP specification, if they are present you would know for sure you have an XMP file type but if they were not there, you would not know for sure you don't have an XMP file type.
Optional XMP elements:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='any text here'>
It seems like it may be better to return for FastFileType XML file type for XML unless by looking at the first few lines it was determined to be XMP file type then return XMP. I know that the XMP elements could be embedded down deep inside the XML and if that were the case then this logic would return XML file type for an XML file type that has XMP inside it.
I appreciate your time looking at this and your ideas...
Thanks!
Curtis
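The "look at the first few lines" idea from the post above can be sketched roughly as follows. This is only an illustration of that heuristic, not how ExifTool decides: the function names are hypothetical, the text is assumed to be UTF-8, and because both markers are optional in XMP, a result of "XML" does not rule out XMP.

```python
import re

# Markers quoted in the post above. Both are optional in XMP, so their
# absence does not prove that a file is plain XML.
XPACKET_RE = re.compile(r"<\?xpacket\s+begin=")
XMPMETA_RE = re.compile(r"<x:xmpmeta\b")


def guess_xml_or_xmp(head):
    """Classify the leading text of a file as 'XMP', 'XML' or 'unknown'."""
    if XPACKET_RE.search(head) or XMPMETA_RE.search(head):
        return "XMP"
    # Skip an optional byte order mark and whitespace before the first tag.
    if head.lstrip("\ufeff \t\r\n").startswith("<"):
        return "XML"
    return "unknown"


def sniff_file(path, nbytes=1024):
    """Read only the first nbytes of a file and classify them."""
    with open(path, "rb") as fh:
        head = fh.read(nbytes).decode("utf-8", errors="replace")
    return guess_xml_or_xmp(head)
```

Reading a fixed-size prefix keeps the check cheap even for very large files, at the cost of missing XMP that is embedded deeper in the document.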
Hi Curtis,
Quote from: Curtis on November 04, 2014, 02:54:28 PM
First:
Is there any reason to try to get/extract tags from XML file type?
Is there any reason to try to get/extract tags from XMP file type? (seems like XMP files are used by exiftool as input to get tag values from to write to other files, correct?)
The reasons depend entirely on your requirements. Some people use this ability, but I can't say if you will. However, I will point out that XML is problematic, and ExifTool doesn't officially support it (you will notice that XML isn't in the list of supported file types).
Quote:
It seems like it may be better to return for FastFileType XML file type for XML unless by looking at the first few lines it was determined to be XMP file type then return XMP. I know that the XMP elements could be embedded down deep inside the XML and if that were the case then this logic would return XML file type for an XML file type that has XMP inside it.
This is true. Currently ExifTool identifies any RDF/XML as XMP. It would make sense if FastFileType did this too, but that would be more work for me. I will look into this.
- Phil
Quote from: Phil Harvey on November 04, 2014, 07:20:53 PM
Currently ExifTool identifies any RDF/XML as XMP.
Note that this includes the exiftool -X output, which is RDF/XML.
Quote:
It would make sense if FastFileType did this too, but that would be more work for me. I will look into this.
OK, I did the work. It was easiest if I just aborted the XMP parser after the initial type determination, which means that it was easier to just keep it as FileType instead of FastFileType, although now I need to document why the FileType may change with the -fast3 option, and I'm sure I'll get some questions about this.
- Phil
Phil you are fantastic! :D You have made your Perl code easy to accommodate even the weirdest requests! Thank you!
So, if I wanted exiftool to not process XML files (but XMP files I do) then I would use:
-fast3 'if $File:FileType ne "XML"'
Would these options have exiftool get files it determines to be XMP but skip files that are plain XML?
Are there other file types that -fast3 would cause filetype to be determined quick? All file types?
Thanks!
Curtis
Hi Curtis,
Quote from: Curtis on November 04, 2014, 09:13:27 PM
So, if I wanted exiftool to not process XML files (but XMP files I do) then I would use:
-fast3 'if $File:FileType ne "XML"'
The syntax would be:
-fast3 -if '$File:FileType ne "XML"'
But it seems as if you want ExifTool to process the file and extract all tags if the condition is true. This won't happen. You would call ExifTool once with -fast3 to get FileType, then call it again for files you are interested in. So I don't think that the -if condition makes much sense.
With -fast3, ExifTool will return quickly for all file types (and not extract any embedded metadata).
- Phil
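The two-call approach described above might look something like this from a calling program. This is a sketch under assumptions, not a recommended implementation: it uses exiftool's -j (JSON) output for the first pass because it is easy to parse, the helper names are invented, and files with no reported FileType are simply skipped.

```python
import json
import subprocess


def fast_type_args(paths):
    """Argument list for the quick first pass: FileType only, as JSON."""
    return ["exiftool", "-fast3", "-j", "-FileType", *paths]


def select_by_type(json_text, skip_types=frozenset({"XML"})):
    """Pick the files worth a full second pass.

    json_text is the -j output: a list of objects with "SourceFile" and,
    when the type was recognized, "FileType".
    """
    if not json_text.strip():
        return []
    entries = json.loads(json_text)
    return [
        e["SourceFile"]
        for e in entries
        if e.get("FileType") is not None and e["FileType"] not in skip_types
    ]


def two_pass(paths):
    """First pass with -fast3, second pass with -X -l -t on the keepers."""
    first = subprocess.run(fast_type_args(paths), capture_output=True, text=True)
    keep = select_by_type(first.stdout)
    if not keep:
        return ""
    second = subprocess.run(
        ["exiftool", "-X", "-l", "-t", *keep], capture_output=True, text=True
    )
    return second.stdout
```

The set of types to skip is a parameter so that a user-chosen list (XML only, or XML plus others) can be passed in.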
I am just running some tests on this feature. Getting FileType for my test set of 8200 files takes 140 seconds without -fast3, and 8 seconds with it.
There are some annoying inconsistencies between the FileType returned with and without -fast3, but nothing serious (mainly ZIP-based formats).
- Phil
Hi Phil,
Ahhh, I understand now how it will all work...
I hope that for my 'normal' files (jpeg, etc.) the overhead of calling exiftool twice does not outweigh the benefit of avoiding potentially long processing times for the 'bad' filetypes (XML). I did try this before with FileType, as mentioned in my original post, and it cost about 30-40ms to get the FileType for a 'normal' jpg file.
Would be nice if -fast3 could take an argument list of comma separated FileTypes to process or ignore, OR allow multiple -fast3 each with just one optional filetype
maybe this would not process XML and XMP filetypes, but all others
--fast3 'XML,XMP' or --fast3 XML --fast3 XMP
this would only process jpg and pdf files
-fast3 'JPEG,PDF' or -fast3 JPEG -fast3 PDF
and just this would do what you have it doing now
-fast3
I'm just concerned about adding extra processing time to every file, when the majority of the files are ones I do want to process, just to save processing time for the few that may have long processing times.
Thanks so much!
Curtis
Hi Curtis,
If you are using the -stay_open option, then the overhead should be minimal.
I can't change the -fast option to accept an argument while still maintaining backward compatibility.
The 30-40 ms should be greatly reduced with -fast3 (assuming you are using -stay_open).
- Phil
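For anyone unfamiliar with -stay_open, here is a rough sketch of how a program can keep one exiftool process alive and send it commands. It assumes the documented protocol: arguments are written one per line to the argument file (stdin here, via `-@ -`), `-execute` runs them, and exiftool prints `{ready}` when the command has finished. The function names are hypothetical.

```python
import subprocess

READY_MARKER = "{ready}"  # printed by exiftool after each -execute


def command_block(args):
    """Format one command: one argument per line, then -execute."""
    return "".join(f"{arg}\n" for arg in args) + "-execute\n"


def read_response(stream):
    """Collect output lines up to (not including) the {ready} marker."""
    lines = []
    for line in iter(stream.readline, ""):
        if line.strip() == READY_MARKER:
            break
        lines.append(line)
    return "".join(lines)


def start_exiftool(exe="exiftool"):
    """Launch a long-lived exiftool that reads arguments from stdin."""
    return subprocess.Popen(
        [exe, "-stay_open", "True", "-@", "-"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def run_command(proc, args):
    """Send one command to the running process and return its output."""
    proc.stdin.write(command_block(args))
    proc.stdin.flush()
    return read_response(proc.stdout)


def stop_exiftool(proc):
    """Ask exiftool to exit cleanly and wait for it."""
    proc.stdin.write("-stay_open\nFalse\n")
    proc.stdin.flush()
    proc.wait()
```

With this in place, `run_command(proc, ["-fast3", "-FileType", path])` followed by a second `run_command` for the full tag dump avoids paying the process start-up cost twice per file.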
Hi Phil,
OK, let me know when I can get the new version and I'll do some timing tests.
What about a new option called -filetype (or something else besides -fast) that could take arguments and would work as I described for -fast3 with arguments?
Maybe a format like
-notfiletype=XML to exclude all but listed filetype or
-filetype=JPEG to only include listed filetype
You have already done a lot on this, so if the above still would not work well, I understand. Looking forward to doing the above mentioned timing tests.
Thanks again!
Curtis
Hi Curtis,
Quote from: Curtis on November 05, 2014, 12:35:45 PM
What about a new option called -filetype (or something else besides -fast) that could take arguments and would work as I described for -fast3 with arguments?
I really try to resist adding new options because there are already too many, unless it would be useful for many people. The reason I thought that -fast3 was an acceptable addition is that I have had other requests for a way to quickly extract the system tags.
- Phil
Sounds very reasonable..... looking forward to next version! Curtis
PS: In rereading your last post, I am curious: with -fast3, what tags will be available besides File:FileType? You mentioned "system tags", so does that include all System: and File: tags?
Hi Curtis,
Any tag that doesn't require processing of the file. For example:
> exiftool a.jpg -fast3 -G1
[ExifTool] ExifTool Version Number : 9.76
[System] File Name : a.jpg
[System] Directory : .
[System] File Size : 4.7 MB
[System] File Modification Date/Time : 2014:11:04 11:42:07-05:00
[System] File Access Date/Time : 2014:11:05 13:09:11-05:00
[System] File Inode Change Date/Time : 2014:11:04 11:42:07-05:00
[System] File Permissions : rw-r--r--
[File] File Type : JPEG
[File] MIME Type : image/jpeg
- Phil
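If a program wanted to consume the -G1 listing shown above directly, a parser along these lines would do. This is only a sketch that assumes the "[Group] Description : value" layout from that example; the function name is made up, and the -j or -X output formats are likely more robust for real use.

```python
import re

# "[Group] Description : value", splitting at the first ": " so that
# values containing colons (dates, times) stay intact.
LINE_RE = re.compile(r"^\[([^\]]+)\]\s+(.*?)\s*: (.*)$")


def parse_g1_output(text):
    """Map (group, description) to value for each recognizable line."""
    tags = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            group, description, value = match.groups()
            tags[(group, description)] = value
    return tags
```

Lines that do not fit the pattern, such as the echoed command line, are ignored.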
Great! That is all the tags I would want to get for files with the FileTypes I don't want to process.... I see in the output it is exiftool 9.76.... when do you expect it to be available?
Curtis
Hi Curtis,
I may be busy this weekend, so it could be next weekend before this gets released. But if I get a chance I'll do it this weekend.
- Phil
Great! Thank you again for your time on this. :)
Hi Curtis,
Version 9.76 is out now.
- Phil
Thanks Phil! I'll give it a try shortly.
Curtis
Hi Phil,
Got the new exiftool.exe. Tested the new -fast3 and it works well. It costs about 10-20ms for typical .jpg files or 'bad' XML files, much faster than before (up to 90 sec!). So having this as an option is great!
So... next is there a -list type option that will give a list of all the FileTypes that exiftool supports/recognizes? or is that already what the -listf or -listr give?
Curtis
Hi Curtis,
The -listf, -listr and -listwf options give the list of supported, recognized and writable file extensions respectively.
- Phil
So the tag value for File:FileType gives the same values as for the supported/recognized file name extensions given by -listr and -listf ? correct?
Curtis
PS... also always wanted to ask, what is the difference between supported files types and recognized file types?
Quote from: Curtis on November 17, 2014, 12:07:24 PM
So the tag value for File:FileType gives the same values as for the supported/recognized file name extensions given by -listr and -listf ? correct?
This is not strictly true, but it does hold for most common types. (An example of an exception would be the Canon 1D raw file which has an extension of TIFF but returns a FileType of "Canon 1D RAW".)
Quote:
PS... also always wanted to ask, what is the difference between supported files types and recognized file types?
ExifTool will return the FileType for a recognized file, but will only extract metadata from the file if it is supported.
- Phil
OK, sounds like there are a few exceptions where the FileType value is not the same as the values given by -listr and -listf, and there is no -list option to get all the actual FileType values such as "Canon 1D RAW".
Thanks for all the info!
Curtis
Hi Phil....
(sorry to be a pest....) :-\
What I'm looking for is a list of FileTypes (strings), (which I would display to the user and allow them to pick from) that I can then use to compare against the FileTypes coming from exiftool (specifically when I'm using the -fast3 option).
For things like JPEG FileType, it may come from a file with really any file name extension, but typically from .jpg. But FileType returned as a tag value by exiftool would never be JPG, is that correct? JPG is listed from both -listr and -listf.
It seems like the list of all FileTypes returned as a tag value from exiftool is a subset of what -listr or -listf give and in a few cases as you mentioned the FileType value is valid but not listed in -listr or -listf.
If it is 'easy' would be nice to get the actual list of FileType values that exiftool can possibly return (whether or not they are supported or just recognized).
Below is an attachment showing a screen shot of some pertinent info from my program. The File Type column is the exiftool FileType tag and the Type column is the file type returned by Windows (just for info).
Thanks so much for your time on this... this is not a critical need...
Curtis
Hi Curtis,
What you are requesting is difficult, unfortunately, since extensions do not map 1:1 onto file types. (As you point out, as well as multiple FileTypes for a single extension, there are also multiple extensions for a single FileType, like JPEG, JPG and JPE for FileType JPEG.) Also unfortunately, the FileType logic is not centralized in the code, so there is no way to dig out all of the possible FileType values automatically. But I'll give this a bit of thought.
- Phil
Thanks Phil, like I said it is not critical... no need to spend too much time on this.
Thanks!
Curtis
Just FYI, here is a list of file types that differ from the extensions from my set of test files:
> exiftool ../pics -if 'uc $fileextension ne $filetype' -fileextension -filetype -T | sort | uniq
aif AIFF
ait AI
azw3 MOBI
cos XML
dcm ACR
dcm DICOM
dic DICOM
djvu DJVU (multi-page)
eip IIQ
epsf EPS
exe ELF executable
exe Mach-O executable
fff FLIR
gz GZIP
htm HTML
icm ICC
indt INDD
jpg CR2
jpg JPEG
lfr LFP
m4a MP4
m4v MP4
mov MP4
mp4 M4P
mpg MPEG
mts M2TS
nrw NEF
ogg OGV
pct PICT
pspimage PSP
qt MOV
tif Canon 1D RAW
tif IIQ
tif TIFF
tiff IIQ
torrent Torrent
ts M2T
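The same comparison can be done on the application side once extension and FileType pairs have been collected. The sketch below is illustrative only: the function names are invented, and the pairs are assumed to come from whatever the program already gathers (for example from a -fast3 pass).

```python
from collections import defaultdict


def mismatches(pairs):
    """Unique (extension, file_type) pairs where the upper-cased
    extension differs from the reported FileType, sorted."""
    return sorted({(ext, ftype) for ext, ftype in pairs if ext.upper() != ftype})


def types_by_extension(pairs):
    """Group FileType values under each (lower-cased) extension, which
    makes the one-to-many cases such as 'tif' easy to spot."""
    grouped = defaultdict(set)
    for ext, ftype in pairs:
        grouped[ext.lower()].add(ftype)
    return {ext: sorted(ftypes) for ext, ftypes in grouped.items()}
```

Feeding in pairs such as `("tif", "TIFF")` and `("tif", "Canon 1D RAW")` shows directly why a FileType pick list cannot be derived from extensions alone.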
Thanks Phil, I appreciate the list... pretty cool how you can use exiftool to make such a list so easily!