Out of memory error when using curl to pipe large file from URL to ExifTool

Started by martinrwilson, October 04, 2018, 05:58:33 AM

martinrwilson

We want ExifTool to process an image from a URL (rather than a file) so we are using curl like this:

curl -s "https://somewhere/abigimage.png" | exiftool -fast -j -c %+.6f -

This works fine for files of up to a certain size but, for larger ones, it is failing (the kernel is killing the process). I suspect this is because curl is downloading the file faster than ExifTool can process it and so one of the tools (or pipe?) is buffering it in memory. Upping the amount of memory in the server makes it work.

I guess this is probably a question about curl or pipe rather than ExifTool but - is there a way round this that will ensure it works for files of any size?

Phil Harvey

Actually, ExifTool is buffering the file in memory.  Changing this would only be possible for certain file types (PNG is one of them), but it would be a lot of work to do this.  Let me think about it.

- Phil

Edit:  This would require changing my file i/o object, which hasn't seen any code changes in 10 years!
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

martinrwilson

Thanks Phil.
We're trying to avoid downloading the whole file if the information ExifTool needs is near the start of the file, but I guess we could switch to downloading the whole file first if it's above a certain size.
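
A minimal sketch of that size check, assuming the server answers HEAD requests with a Content-Length header (not every server does). The helper names content_length and extract_metadata are hypothetical, and the 50 MB threshold is an arbitrary placeholder rather than a value from this thread.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the last
# Content-Length value (the last one is the relevant one after redirects).
content_length() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-length" { len = $2 } END { if (len != "") print len }'
}

# Arbitrary placeholder threshold in bytes (50 MB); tune for the server's memory.
THRESHOLD=52428800

# Hypothetical wrapper: pipe small files to ExifTool, download large or
# unknown-size files to a temporary file first.
extract_metadata() {
    url=$1
    size=$(curl -sIL "$url" | content_length)
    if [ -n "$size" ] && [ "$size" -le "$THRESHOLD" ]; then
        curl -s "$url" | exiftool -fast -j -c %+.6f -
    else
        tmp=$(mktemp) || return 1
        curl -s -o "$tmp" "$url" && exiftool -fast -j -c %+.6f "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

Called as extract_metadata "https://somewhere/abigimage.png". If the size cannot be determined, the sketch takes the download path, which is the safer choice for memory.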

Phil Harvey

What file formats do you do this for?

I can avoid buffering PNG, JPG and QuickTime-based files (and have a working test version that does this now -- OK, so it wasn't as much work as I thought for these), but other files such as TIFF-based will require buffering due to the way they are structured.

- Phil

martinrwilson

We can't predict the file type but most will be popular image formats, so JPEG and PNG will cover a good proportion of them. That sounds great - we can just fall back to downloading the whole file to disc for other large files not in these formats. Thanks!
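
Since the type isn't known in advance, one possible way to pick a path is to look at the first few bytes before committing to a full transfer. This is only a sketch: sniff_format is a hypothetical helper, the signatures are the standard PNG, JPEG and 'ftyp' markers, and older QuickTime files that don't start with an 'ftyp' box would be classed as "buffered" here.

```shell
#!/bin/sh
# Hypothetical helper: classify a file from its first 12 bytes (read on stdin).
# Prints "streamable" for PNG, JPEG and 'ftyp'-based QuickTime/MP4 files,
# and "buffered" for everything else (TIFF-based files included).
sniff_format() {
    # Dump the first 12 bytes as one lowercase hex string.
    hex=$(od -An -tx1 -N12 | tr -d ' \t\n')
    case $hex in
        89504e470d0a1a0a*) echo streamable ;;   # PNG signature
        ffd8ff*)           echo streamable ;;   # JPEG start-of-image marker
        ????????66747970*) echo streamable ;;   # 'ftyp' at byte offset 4
        *)                 echo buffered ;;     # fall back to a full download
    esac
}
```

For example, curl -s -r 0-11 "$url" | sniff_format would fetch just the leading bytes, assuming the server honours range requests.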

Phil Harvey


martinrwilson

Is it possible to determine programmatically whether a particular file has its metadata at the start of the file? If not, is it the case that files of certain formats will always have their metadata at the start? If so, how can I find out which formats?

We want to use the following logic:
- If the metadata is at the start of the file, use curl piping to exiftool to avoid having to download the whole file.
- Otherwise, download the whole file and then run exiftool on the file.

Does this seem reasonable?
Thanks!

Phil Harvey

Generally, for files with metadata at the start, it should appear in the first 2 MB or so.  So you could limit the amount you send to ExifTool to 2 MB, and if you don't see the metadata, then download the whole thing.  I think you may be able to do it like this:

curl --header "Range: bytes=0-2000000" -s URL | exiftool -fast ...

- Phil
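
A rough sketch of that two-step approach, under the assumption that "not seeing the metadata" means one particular tag comes back empty. The names range_header and read_tag are hypothetical, as is the choice of tag; note that a server which ignores the Range header will simply send the whole file.

```shell
#!/bin/sh
# Hypothetical helper: build a Range header covering the first N bytes.
# HTTP byte ranges are inclusive, so the last byte index is N - 1.
range_header() {
    echo "Range: bytes=0-$(( $1 - 1 ))"
}

# Hypothetical wrapper: look for one tag in the first LIMIT bytes and fall
# back to downloading the whole file if the tag is not found there.
read_tag() {
    url=$1 tag=$2 limit=${3:-2000000}
    # -s3 prints the tag value only, so an empty result means "not found".
    value=$(curl -s --header "$(range_header "$limit")" "$url" | exiftool -fast -s3 "-$tag" - 2>/dev/null)
    if [ -z "$value" ]; then
        tmp=$(mktemp) || return 1
        curl -s -o "$tmp" "$url" && value=$(exiftool -s3 "-$tag" "$tmp" 2>/dev/null)
        rm -f "$tmp"
    fi
    printf '%s\n' "$value"
}
```

Usage would look something like read_tag "$url" DateTimeOriginal. A file that genuinely lacks the tag still triggers the full download, so this suits tags that are expected to be present.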

martinrwilson

Sorry for the very delayed reply!
For various reasons related to performance, we are now revisiting the idea of streaming the file direct from S3 but reverting to downloading the entire file to disk if the metadata is not at the start of the file.

My question is: how do we tell if the metadata was (were?) found or not if we use the following command to pipe the first (say) 10MB of the file to ExifTool?
curl --header "Range: bytes=0-10000000" -s https://mybucket.s3.eu-west-1.amazonaws.com/some.mov | exiftool -fast -

It seems that if metadata is truncated, we see an error like this:
Warning : Truncated 'moov' data (missing 8712544 bytes)

Is this reliable, i.e. can we use this warning in our logic of "if the metadata was found in the first 10MB then great, otherwise download the whole file"?

Phil Harvey

This depends on what metadata you want to extract.  Timed metadata is generally found interleaved with the video stream, and in this case the whole file would need downloading.  Static metadata should exist within the moov chunk, so if you don't get this warning then I think you have a good chance of seeing all of the static metadata.

- Phil
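
The check discussed above might be sketched like this. It only looks for the exact warning text quoted in this thread, so it is an assumption that no other truncation message matters for these files; moov_truncated and static_metadata are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the ExifTool output on stdin
# contains the truncated 'moov' warning quoted in this thread.
moov_truncated() {
    grep -q "Truncated 'moov' data"
}

# Hypothetical wrapper: read the first LIMIT bytes and download the whole
# file only when the 'moov' data was cut off.
static_metadata() {
    url=$1 limit=${2:-10000000}
    # Capture stderr as well, in case the warning is reported there.
    out=$(curl -s --header "Range: bytes=0-$limit" "$url" | exiftool -fast - 2>&1)
    if printf '%s\n' "$out" | moov_truncated; then
        tmp=$(mktemp) || return 1
        curl -s -o "$tmp" "$url" && exiftool -fast "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
    printf '%s\n' "$out"
}
```

As noted above, this covers static metadata only; timed metadata interleaved with the video stream would still need the whole file.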

martinrwilson

We're only interested in the static metadata so we'll give this a try. Thanks Phil :-)