We want ExifTool to process an image from a URL (rather than a file) so we are using curl like this:
curl -s "https://somewhere/abigimage.png" | exiftool -fast -j -c %+.6f -
This works fine for files of up to a certain size but, for larger ones, it is failing (the kernel is killing the process). I suspect this is because curl is downloading the file faster than ExifTool can process it and so one of the tools (or pipe?) is buffering it in memory. Upping the amount of memory in the server makes it work.
I guess this is probably a question about curl or pipe rather than ExifTool but - is there a way round this that will ensure it works for files of any size?
Actually, ExifTool is buffering the file in memory. Changing this would only be possible for certain file types (PNG is one of them), but it would be a lot of work to do this. Let me think about it.
- Phil
Edit: This would require changing my file i/o object, which hasn't seen any code changes in 10 years!
Thanks Phil.
We're trying to avoid downloading the whole file if the information ExifTool needs is near the start of the file, but I guess we could switch to downloading the whole file first if it's above a certain size.
What file formats do you do this for?
I can avoid buffering PNG, JPG and QuickTime-based files (and have a working test version that does this now -- OK, so it wasn't as much work as I thought for these), but other files such as TIFF-based will require buffering due to the way they are structured.
- Phil
We can't predict the file type but most will be popular image formats, so JPEG and PNG will cover a good proportion of them. That sounds great - we can just fall back to downloading the whole file to disc for other large files not in these formats. Thanks!
OK, great. Expect to see this update in ExifTool 11.13.
- Phil
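For anyone wanting to script the "stream these formats, download the rest" decision described above, here is a minimal sketch. It assumes the file type can be guessed from the URL's extension, which may not hold for every URL. The function name can_stream and the list of QuickTime-based extensions (mov, mp4, m4v) are illustrative assumptions, not something ExifTool provides.

```shell
# Hypothetical helper: succeed (exit 0) if the URL's extension suggests one of
# the formats mentioned above (PNG, JPG, QuickTime-based) that can be read from
# a pipe without buffering the whole file. The extension list is an assumption.
can_stream() {
    # Drop any query string, then take the text after the last dot.
    cs_path=${1%%\?*}
    cs_ext=$(printf '%s' "${cs_path##*.}" | tr '[:upper:]' '[:lower:]')
    case "$cs_ext" in
        png|jpg|jpeg|mov|mp4|m4v) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller might pipe curl into exiftool when can_stream succeeds and download to disk otherwise. Where the extension is missing or unreliable, the decision would have to be based on something else, such as the response's Content-Type.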
Is it possible to determine programmatically whether a particular file has its metadata at the start of the file? If not, is it the case that files of certain formats will always have their metadata at the start? If so, how can I find out which formats?
We want to use the following logic:
- If the metadata is at the start of the file, use curl piping to exiftool to avoid having to download the whole file.
- Otherwise, download the whole file and then run exiftool on the file.
Does this seem reasonable?
Thanks!
Generally, for files with metadata at the start, it should appear in the first 2 MB or so. So you could limit the amount you send to ExifTool to 2 MB, and if you don't see the metadata, then download the whole thing. I think you may be able to do it like this:
curl --header "Range: bytes=0-2000000" -s URL | exiftool -fast ...
- Phil
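A small sketch of the range request above, wrapped in functions so the byte limit is easy to change. The names range_header and exif_from_head are hypothetical. Note that HTTP byte ranges are inclusive, so "bytes=0-2000000" asks for 2,000,001 bytes; the helper below subtracts one to request exactly N bytes. It also assumes the server honours the Range header. A server that ignores it would send the whole file.

```shell
# Build an HTTP Range header for the first N bytes of a resource.
# Byte ranges are inclusive, so the last offset requested is N - 1.
range_header() {
    printf 'Range: bytes=0-%d' "$(( $1 - 1 ))"
}

# Hypothetical helper: pipe only the first N bytes of a URL to ExifTool.
# Usage: exif_from_head URL N
exif_from_head() {
    curl -s --header "$(range_header "$2")" "$1" | exiftool -fast -j -
}
```

For example, exif_from_head "https://somewhere/abigimage.png" 2000000 would send roughly the first 2 MB to ExifTool. Deciding whether the tags you need were present in that output is left to the caller.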
Sorry for the very delayed reply!
For various reasons related to performance, we are now revisiting the idea of streaming the file direct from S3 but reverting to downloading the entire file to disk if the metadata is not at the start of the file.
My question is: how do we tell if the metadata was (were?) found or not if we use the following command to pipe the first (say) 10MB of the file to ExifTool?
curl --header "Range: bytes=0-10000000" -s https://mybucket.s3.eu-west-1.amazonaws.com/some.mov | exiftool -fast -
It seems that if metadata is truncated, we see an error like this:
Warning : Truncated 'moov' data (missing 8712544 bytes)
Is this reliable, i.e. can we use this warning in our logic of "if the metadata was found in the first 10MB then great, otherwise download the whole file"?
This depends on what metadata you want to extract. Timed metadata is generally found interleaved with the video stream, and in this case the whole file would need downloading. Static metadata should exist within the moov atom, so if you don't get this warning then I think you have a good chance of seeing all of the static metadata.
- Phil
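The fallback logic discussed above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the 10 MB default is the figure from the question, and the check assumes the warning text always has the form shown in the example ("Truncated '...' data"), which may not cover every file type.

```shell
# Succeed (exit 0) if ExifTool's output contains a truncation warning of the
# form seen in this thread, e.g. "Truncated 'moov' data (missing N bytes)".
metadata_truncated() {
    printf '%s\n' "$1" | grep -q "Truncated '.*' data"
}

# Hypothetical helper: try the first LIMIT bytes, then fall back to
# downloading the whole file to disk if the metadata looks truncated.
# Usage: read_static_metadata URL [LIMIT]
read_static_metadata() {
    rsm_url=$1
    rsm_limit=${2:-10000000}
    # Capture stdout and stderr so the warning is seen wherever it is printed.
    rsm_out=$(curl -s --header "Range: bytes=0-$(( rsm_limit - 1 ))" "$rsm_url" \
        | exiftool -fast - 2>&1)
    if metadata_truncated "$rsm_out"; then
        rsm_tmp=$(mktemp) || return 1
        curl -s -o "$rsm_tmp" "$rsm_url" && exiftool -fast "$rsm_tmp"
        rsm_status=$?
        rm -f "$rsm_tmp"
        return "$rsm_status"
    fi
    printf '%s\n' "$rsm_out"
}
```

As noted above, this only makes sense for static metadata. Timed metadata interleaved with the video stream would still need the whole file.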
We're only interested in the static metadata so we'll give this a try. Thanks Phil :-)