Title: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 26, 2022, 11:34:22 AM

Hi guys,
I am currently running an experiment with exiftool.
My API receives a large number of calls, each pointing to a different file in a different location from the previous one. I would like to apply every performance improvement I can to reduce computation time, but I cannot use bulk processing for the reason just mentioned: each file is known only at call time, and the files are in different locations.
So I am wondering about 2 points:
1) does the -stay_open option really fit my case, avoiding the loading overhead on every call?
Could you by any chance give me an example where I load 2 different files with this technique?
From what I understood, my ARGFILE is a file without an extension that looks like this:

-stay_open
True
-@
test/firstFile.jpg

and as soon as I receive another call it becomes:

-stay_open
True
-@
anotherTest/secondFile.jpg

Or am I missing something?

2) there is no streaming available, right? I was wondering whether exiftool is capable of streaming the content of the file only until it has read enough to get the metadata. Example: a .mov is 200 GB, but only the first 50 MB contain valuable metadata while the rest is body and therefore does not need to be loaded into memory.

Thanks in advance!

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 26, 2022, 08:40:34 PM

Quote from: Enrico on October 26, 2022, 11:34:22 AM
1) does the -stay_open option really fit my case, avoiding the loading overhead on every call?

Yes.

Quote
Could you by any chance give me an example where I load 2 different files with this technique?

"load"? You need to be more specific. Are you just reading the files? What are you reading? Where is the output going (console or file)? What format is the output?

Quote
From what I understood, my ARGFILE is a file without an extension that looks like this:

The extension is totally up to you. I usually use ".args" myself.

Quote
-stay_open
True
-@
test/firstFile.jpg

and as soon as I receive another call it becomes:

-stay_open
True
-@
anotherTest/secondFile.jpg

Or am I missing something?

Actually, you are missing everything. The -@ option specifies the input argfile, not an image file. The argfile should look more like this:

test/firstFile.jpg
-execute
anotherTest/secondFile.jpg
-execute

But if you are doing the same thing to every file, then the -stay_open option doesn't make sense. Just do this:

exiftool test/firstFile.jpg anotherTest/secondFile.jpg

Quote
2) there is no streaming available, right? I was wondering whether exiftool is capable of streaming the content of the file only until it has read enough to get the metadata. Example: a .mov is 200 GB, but only the first 50 MB contain valuable metadata while the rest is body and therefore does not need to be loaded into memory.

Yes, ExifTool does this in some cases. Using the -fast option helps here, but this only works for some file formats.

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 27, 2022, 06:09:06 AM

Quote
"load"? You need to be more specific. Are you just reading the files? What are you reading? Where is the output going (console or file)? What format is the output?

You are right! So, let me explain where I am coming from.
We need to build a service (no strict requirements on the implementation) that is in charge of extracting metadata from any file it is hit with. This service can be a running web service, a simple do-the-work-and-stop executable, or anything else.
Whenever this service is hit with a request to extract metadata from a certain file, it has to do the extraction and return the result as soon as possible.
I am trying to understand whether, by having a singleton receive the requests on the (web) API side, I can make sure that exiftool is loaded into memory only once (maybe through this -stay_open parameter), so that subsequent requests hitting our service don't need to reload exiftool.

Quote
exiftool test/firstFile.jpg anotherTest/secondFile.jpg

I hope I managed to explain why this is not good for us: firstFile.jpg and secondFile.jpg hit our service at 2 different moments, and we need to return each result asap.

Quote
Using the -fast option helps here, but this only works for some file formats

Can I read more about this somewhere? Is there a size constraint, and which formats does this apply to?

Thanks again for all the help!

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 27, 2022, 11:52:31 AM

Hi Enrico,

OK, so you definitely want to use the -stay_open to reduce latency. You'll then either pipe arguments to ExifTool via stdin or use some temporary file (stdin is preferable).
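
As an illustration of the stdin approach, here is a minimal Python sketch that keeps one exiftool process alive and sends it one request per file. It relies on the documented -stay_open behaviour (arguments are read one per line from the -@ source, -execute runs them, and "{ready}" is printed when each command finishes). The class and function names, the choice of -json and -fast as per-request options, and the assumption that exiftool is on the PATH are all illustrative, not part of the thread.

```python
import subprocess

READY = "{ready}"  # sentinel ExifTool prints after each -execute in stay_open mode


def build_request(path, options=("-json", "-fast")):
    """Return the argument lines for one file: one argument per line, then -execute."""
    return "\n".join([*options, path, "-execute"]) + "\n"


def read_until_ready(stream):
    """Collect output lines up to (not including) the {ready} sentinel."""
    lines = []
    for line in stream:
        if line.strip() == READY:
            break
        lines.append(line)
    return "".join(lines)


class ExifToolSession:
    """Hypothetical wrapper: one long-lived exiftool process shared by all requests."""

    def __init__(self, executable="exiftool"):
        # "-@ -" tells exiftool to read its arguments from stdin.
        self.proc = subprocess.Popen(
            [executable, "-stay_open", "True", "-@", "-"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def read_metadata(self, path):
        """Send one request and block until exiftool reports it is done."""
        self.proc.stdin.write(build_request(path))
        self.proc.stdin.flush()
        return read_until_ready(self.proc.stdout)

    def close(self):
        """Ask exiftool to exit, then wait for the process to finish."""
        self.proc.stdin.write("-stay_open\nFalse\n")
        self.proc.stdin.flush()
        self.proc.wait()
```

A web service would create one ExifToolSession at start-up and call read_metadata("test/firstFile.jpg"), then later read_metadata("anotherTest/secondFile.jpg"), as requests arrive. The sketch handles one request at a time, so concurrent callers would need a lock or a pool of sessions.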

The application documentation (https://exiftool.org/exiftool_pod.html#fast-NUM) explains the -fast option. No more detail about how it works for various file types is documented. Basically, try it out and see if it speeds things up.

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: StarGeek on October 27, 2022, 11:59:35 AM

Quote from: Enrico on October 26, 2022, 11:34:22 AM
Example: a .mov is 200 GB, but only the first 50 MB contain valuable metadata while the rest is body and therefore does not need to be loaded into memory.

It should be noted that some video files place the metadata at the end of the file instead of at the beginning. See this post (https://exiftool.org/forum/index.php?topic=6080.msg29931#msg29931).

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 28, 2022, 08:56:19 AM

Quote from: Phil Harvey on October 27, 2022, 11:52:31 AM
The application documentation (https://exiftool.org/exiftool_pod.html#fast-NUM) explains the -fast option. No more detail about how it works for various file types is documented. Basically, try it out and see if it speeds things up.

Hi Phil, I did read that documentation, but my question was more about whether anything is out of scope. The reason is that I read this in the -fast docs:

Quote
ExifTool will not scan to the end of a JPEG image to check for an AFCP or PreviewImage trailer, or past the first comment in GIF images or the audio/video data in WAV/AVI files to search for additional metadata.

So I was wondering whether the -fast speed-up applies exclusively to the formats mentioned here or extends to others where possible.
Thanks!

Kind regards,

Enrico Ribelli

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 28, 2022, 01:50:56 PM

Without looking into this in detail, the -fast (API FastScan) option is used by the following modules:

> grep -rl FastScan lib/Image/ExifTool
lib/Image/ExifTool/QuickTime.pm
lib/Image/ExifTool/RIFF.pm
lib/Image/ExifTool/AFCP.pm
lib/Image/ExifTool/EXE.pm
lib/Image/ExifTool/GIF.pm
lib/Image/ExifTool/XMP.pm
lib/Image/ExifTool/Font.pm
lib/Image/ExifTool/Real.pm
lib/Image/ExifTool/Exif.pm
lib/Image/ExifTool/Validate.pm
lib/Image/ExifTool/TagNames.pod
lib/Image/ExifTool/PNG.pm
lib/Image/ExifTool/DNG.pm
lib/Image/ExifTool/Text.pm
lib/Image/ExifTool/Writer.pl
lib/Image/ExifTool/M2TS.pm
lib/Image/ExifTool/AIFF.pm
lib/Image/ExifTool/PostScript.pm
lib/Image/ExifTool/VCard.pm

So it should have some effect for all of these formats. (Note that many different audio/video file formats are based on QuickTime and RIFF.)

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 31, 2022, 10:46:28 AM

Quote from: Phil Harvey on October 28, 2022, 01:50:56 PM
Without looking into this in detail, the -fast (API FastScan) option is used by the following modules:

> grep -rl FastScan lib/Image/ExifTool
lib/Image/ExifTool/QuickTime.pm
lib/Image/ExifTool/RIFF.pm
lib/Image/ExifTool/AFCP.pm
lib/Image/ExifTool/EXE.pm
lib/Image/ExifTool/GIF.pm
lib/Image/ExifTool/XMP.pm
lib/Image/ExifTool/Font.pm
lib/Image/ExifTool/Real.pm
lib/Image/ExifTool/Exif.pm
lib/Image/ExifTool/Validate.pm
lib/Image/ExifTool/TagNames.pod
lib/Image/ExifTool/PNG.pm
lib/Image/ExifTool/DNG.pm
lib/Image/ExifTool/Text.pm
lib/Image/ExifTool/Writer.pl
lib/Image/ExifTool/M2TS.pm
lib/Image/ExifTool/AIFF.pm
lib/Image/ExifTool/PostScript.pm
lib/Image/ExifTool/VCard.pm

So it should have some effect for all of these formats. (Note that many different audio/video file formats are based on QuickTime and RIFF.)

- Phil

Thanks for the answer Phil.
I gave it a try, and unfortunately our main problem seems to be big files located over the internet.
All our files are located in an Amazon S3 bucket and therefore need to be referenced through an HTTP call.
So, I stored a big mp4 file (around 3 GB) in there and tried:

curl -s https://ours3bucket.amazonaws.com/BigVideo.mp4 | exiftool -fast1 -

Unfortunately it takes ages and ages to load: about 30 minutes or so.
I guess I am using the tool as intended and not missing anything, right?
Please don't get me wrong, I love exiftool :) I am just trying to understand whether it covers all our cases.

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 31, 2022, 11:27:13 AM

Hi Enrico,

So we're talking about MP4 videos here. Unfortunately for many MP4 videos the metadata is stored at the end of the file. If you show me the output of the exiftool -v3 command, I can tell you exactly where the metadata is in your test file.

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 31, 2022, 11:44:37 AM

Mmmm, understood.
You can find the result in the attachment.
So I believe that, even though streaming is available, exiftool doesn't have the capability of saying "skip the body and load only the metadata, even if it is only at the end"?
For MP4 too, if I remember correctly, the information on where the body starts and ends is given at the very beginning of the file, which makes it possible to skip all the body bytes and reposition the reader right after them.

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 31, 2022, 12:10:21 PM

For your file the metadata is contained within the first 5 MB. There is nothing after the 2+ GB 'mdat' atom.

I could enhance the -fast2 option to stop processing at 'mdat', which would work well for this file (then only reading the first 5 MB), but it may miss reading some metadata in other files.

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 31, 2022, 12:14:16 PM

Re: "skipping" the middle. ExifTool actually does "skip" over the mdat by seeking forward in the file, but over a pipe (eg. via curl) the entire data still needs to be read.
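
To illustrate the difference between seeking and reading, here is a rough Python sketch (not ExifTool's own code) that walks the top-level atoms of an MP4/MOV file. It assumes the standard atom header layout: a 4-byte big-endian size and a 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning the atom runs to the end of the file. The helper name top_level_atoms is hypothetical.

```python
import io
import struct


def top_level_atoms(f):
    """Yield (type, offset, size) for each top-level atom of a seekable MP4/MOV file."""
    f.seek(0, io.SEEK_END)
    end = f.tell()
    offset = 0
    while offset + 8 <= end:
        f.seek(offset)
        size, kind = struct.unpack(">I4s", f.read(8))
        header = 8
        if size == 1:
            # A size of 1 means the real size is a 64-bit value after the type.
            extended = f.read(8)
            if len(extended) < 8:
                break  # truncated header
            (size,) = struct.unpack(">Q", extended)
            header = 16
        elif size == 0:
            # A size of 0 means the atom extends to the end of the file.
            size = end - offset
        if size < header:
            break  # malformed atom, stop rather than loop forever
        yield kind.decode("latin-1"), offset, size
        # Jump over the atom body without reading it.
        offset += size
```

On a seekable file only 8 or 16 bytes are read per atom, so a multi-gigabyte 'mdat' costs a single seek. A pipe cannot seek, so the same jump has to be done by reading and discarding every byte of the body, which is why the curl pipeline still transfers the whole file.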

- Phil

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Enrico on October 31, 2022, 04:30:50 PM

Quote from: Phil Harvey on October 31, 2022, 12:14:16 PM
Re: "skipping" the middle. ExifTool actually does "skip" over the mdat by seeking forward in the file, but over a pipe (eg. via curl) the entire data still needs to be read.

- Phil

So, if I understand correctly, exiftool has no way to read the metadata of a file on the internet (via URL) other than through piping (e.g. via curl), right?
And therefore the file would in any case need to be downloaded in full before it can be processed?

Title: Re: how to improve performance when bulk analysis is not possible
Post by: Phil Harvey on October 31, 2022, 04:56:24 PM

Correct.
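
One possible workaround outside ExifTool, sketched here under stated assumptions rather than as a recommended method: when the metadata is known to sit near the start of the file (within the first 5 MB for the test file above), the caller can download only that part with an HTTP Range request and pipe it to exiftool. This assumes the server honours Range requests (S3 generally does) and that exiftool is on the PATH. The function names and the 5 MB default are illustrative. ExifTool will see a truncated file, so it may print warnings, and any metadata stored at the end of the file will be missed.

```python
import subprocess
import urllib.request


def range_header(length, start=0):
    """HTTP Range value for `length` bytes beginning at `start` (the end is inclusive)."""
    return f"bytes={start}-{start + length - 1}"


def read_remote_head(url, length=5 * 1024 * 1024, executable="exiftool"):
    """Download at most the first `length` bytes of `url` and pipe them to exiftool."""
    request = urllib.request.Request(url, headers={"Range": range_header(length)})
    with urllib.request.urlopen(request) as response:
        # Cap the read so a server that ignores Range cannot make us
        # download the whole file.
        head = response.read(length)
    result = subprocess.run(
        [executable, "-fast", "-"],  # "-" makes exiftool read the file from stdin
        input=head,
        capture_output=True,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For files that keep their metadata at the end, the same idea would need a second ranged request for the tail of the file, which this sketch does not attempt.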

ExifTool Forum

ExifTool => Newbies => Topic started by: Enrico on October 26, 2022, 11:34:22 AM