ExifTool PHP Fast Processing Script using StayOpen and Gearman

Started by TSM, November 05, 2013, 08:13:39 AM


TSM

I've created a script that runs ExifTool in -stay_open mode from PHP.
The class can be used as a singleton or as a regular instance.
I've also supplied two additional scripts that wrap the class for use with Gearman, so the work can be scaled out.
The class does not apply any logic to the parameters supplied to ExifTool: whatever you push through gets passed through, and the result is returned.
The class detects if ExifTool has died and restarts it on the next call.

It still needs some work and cleaning up, but it seems to work well in my environment and scales out quite nicely.
Originally I was using PHP streams, but this proved to be a problem, so it now uses fgets to parse the return.
Note that the script is hard-coded to work with ExifTool 9.03+ only, because that is all I have been testing against.
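
For readers unfamiliar with -stay_open: exiftool prints a "{ready}" line after each -execute, so the fgets approach amounts to reading lines until that marker appears. The following is a minimal sketch of that idea, not the code from the repository; readUntilReady() is a hypothetical name, and an in-memory stream stands in for exiftool's stdout pipe.

```php
<?php
// Hypothetical helper, not taken from the ExifTool_PHP_Stayopen repository.
// Reads lines from $stream until the "{ready}" marker that exiftool prints
// after each -execute, and returns everything that came before the marker.
function readUntilReady($stream): string
{
    $output = '';
    while (($line = fgets($stream)) !== false) {
        // The marker may carry a number ("{ready123}") when -executeNUM is used.
        if (preg_match('/^\{ready\d*\}\r?\n?$/', $line)) {
            return $output;
        }
        $output .= $line;
    }
    throw new RuntimeException('Stream closed before {ready} was seen');
}

// Demonstration: an in-memory stream stands in for exiftool's stdout.
// The JSON below is made-up sample text, not real exiftool output.
$fake = fopen('php://memory', 'r+');
fwrite($fake, "[{\"SourceFile\":\"test1.jpg\"}]\n{ready}\nleftover\n");
rewind($fake);
$json = readUntilReady($fake);
```

In a real setup the stream would be the stdout pipe returned by proc_open(); reading stops at the marker, so anything belonging to the next job stays in the pipe.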

Let me know what you think, or suggest any changes that would make it better.

https://github.com/tsmgeek/ExifTool_PHP_Stayopen


Performance tests.
These figures are a guide to the performance increase possible with the supplied scripts; they will vary depending on your hardware setup and the arguments supplied to ExifTool.

100 iterations fetching metadata from a JPEG (-use MWG -g -j -*:*)

1 GM Instance - 52s
2 GM Instances - 25s
3 GM Instances - 17.5s
4 GM Instances - 12.5s

Usage

Below is a basic example of how to use this class.
Put all your commands in an array and push it onto the stack using $exif->add(). You can add multiple jobs before calling fetch/fetchAll.

getInstance() - set up the class as a singleton
setExifToolPath($path) - set/change the path of ExifTool if not supplied at start
close() - terminate the ExifTool background process
start() - start the ExifTool background process
test() - check if ExifTool is running
clear() - clear the stack
fetch() - return one processed item off the stack at a time
fetchAll() - return a single array with all items in the stack processed

There are also fetchDecoded/fetchAllDecoded calls, which essentially decode the output in one step if your default arguments contain '-j' (JSON output). The script's defaults are ('-g','-j') to assist with this.

As you fetch items they are taken off the stack.





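// Arguments are passed to ExifTool unchanged; the last entry is the file to read.
$data = array('-use MWG', '-g', '-j', '-*:*', 'test1.jpg');
// getInstance() returns the shared singleton.
$exif = ExifToolBatch::getInstance('/path/to/exiftool');
$exif->add($data);            // push one job onto the stack
$result = $exif->fetchAll();  // process every queued job and return the results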

Phil Harvey

This looks great, but I'm a bit surprised by the slow speed.  On my iMac here, running exiftool on 100 random JPEG images:

> time exiftool -use MWG -g -j -all:all tmp3 > out.txt
    1 directories scanned
  100 image files read
1.041u 0.017s 0:01.15 91.3% 0+0k 1+2io 0pf+0w


That is 1.15 clock seconds for all 100 files (or 11.5 ms per file).  But you are getting 52 seconds for 100 files?  Are you sure you aren't somehow launching a separate ExifTool application for each file?  That would explain the difference (45x slower).

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

TSM

Hmm, I think it all got slow once I moved to using fgets on a buffered socket. Before that I was using streams, but found them incompatible with different versions of PHP.
I've checked, and the PID of the underlying perl process does not change once it has been started, so it's related to the fgets.
I'll look into it and hopefully get it sorted.

TSM

OK, I was going in circles, then tested the file I had used for the original performance tests and found it was the JPG that had the problem.
It was the same JPG that caused this issue: https://exiftool.org/forum/index.php/topic,5074.msg24427.html#msg24427

So I retested with another file.

Note that this puts 100 individual tasks into the GM queue and runs them as individual '-execute' calls to exiftool; all GM instances were running on the same VMware machine.

1 GM Instance - 1.3s
2 GM Instances - 0.9s
3 GM Instances - 0.7s
4 GM Instances - 0.6s


More stats
300 iterations with 4 GM Instances - 1.6s
100 iterations 3 files with 4 GM Instances - 1.4s
50 iterations 6 files with 4 GM Instances - 1.1s
100 iterations 10 files with 4 GM Instances - 3.1s

Note that all testing was looping over the same file stored locally.

I think I can get a little more speed out of the script by changing whether it uses buffered or unbuffered streams.


TSM

Is it normal that in stay_open mode you cannot set params such as '-use MWG' on each cycle? It seems I have to set it when I first open the script; setting it afterwards can cause issues such as the script returning nothing rather than an error.

Phil Harvey

The -use option is funny that way:

       -use MODULE
            Add features from specified plug-in MODULE.  Currently, the MWG
            module is the only plug-in module distributed with exiftool.  This
            module adds read/write support for tags as recommended by the
            Metadata Working Group.  To save typing, "-use MWG" is assumed if
            the "MWG" group is specified for any tag on the command line.  See
            the MWG Tags documentation for more details.  (Note that this
            option is not reversible, and remains in effect until the
            application terminates, even across the "-execute" option.)


I should really stop quoting the documentation in here.

I will remove the parentheses in the documentation, since they may obscure this important point.

- Phil
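
Since -use stays in effect until the application terminates, one way to handle it from PHP is to put such options on the command that launches exiftool rather than sending them with each job. The sketch below assumes that approach; buildLaunchCommand() is a hypothetical helper, not part of the class discussed in this thread.

```php
<?php
// Hypothetical helper: builds the shell command that launches exiftool in
// -stay_open mode. Options such as "-use MWG" persist for the life of the
// process, so they go on the launch command instead of being sent per job.
function buildLaunchCommand(string $exiftoolPath, array $startupArgs = []): string
{
    $parts = [escapeshellarg($exiftoolPath)];
    foreach ($startupArgs as $arg) {
        $parts[] = escapeshellarg($arg);
    }
    // "-stay_open True -@ -" makes exiftool keep reading arguments from stdin.
    array_push($parts, '-stay_open', 'True', '-@', '-');
    return implode(' ', $parts);
}

// '/path/to/exiftool' is a placeholder path.
$cmd = buildLaunchCommand('/path/to/exiftool', ['-use', 'MWG']);
```

The resulting string could be handed to proc_open(); per-job arguments would then be written to the process's stdin without repeating -use.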

TSM

Ahhhaaa

I'll modify my script so that the startup params are configurable in the class.

P.S. I've updated GitHub with the new code and changes.
There is no need for users to use streams, as they are very CPU-intensive on the parent PHP process compared to fgets. This may be due to the way I pull the data back, as I've used streams very little.

New functions.

getInstance($path,$args) - You can now pass both the path and arguments (as an array) when you init the class
setDefaultExecArgs($args) - Set default args when starting exiftool (e.g. -use MWG)
getDefaultExecArgs() - Get extra args used for starting exiftool
setDefaultArgs($args) - Set default args used when processing commands; these will be applied before your supplied arguments
getDefaultArgs() - Get default args used for processing commands
run() - Handles -execute sequencing

ankutsa

Very useful class, thanks. But the exiftool process keeps running even after close() is called. I think it's because exiftool doesn't listen for the SIGTERM signal that proc_terminate() is sending :(

Phil Harvey

The proper way to exit the exiftool process is to send the following arguments (as per step 5 in the -stay_open documentation):

-stay_open
False


I don't think the PHP interface is doing this.

The other way of terminating the exiftool process is by sending a SIGINT, but this will interrupt any processing that exiftool is currently doing.  I would not recommend using SIGTERM because exiftool may leave behind temporary files if it was in the process of writing to a file.

- Phil

P.S.  I also found that I needed to create a watchdog process to terminate exiftool in the case where the main program exits abnormally.  See my new C++ Interface for ExifTool for an example of how I did this in C++.

ankutsa

Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P
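
For anyone writing their own wrapper, the shutdown step can be sketched on its own as below. requestShutdown() is a hypothetical name, and an in-memory stream stands in for the stdin pipe that proc_open() would provide.

```php
<?php
// Hypothetical helper: asks a -stay_open exiftool process to exit by itself.
// $stdin is the writable pipe connected to exiftool's standard input.
function requestShutdown($stdin): bool
{
    // One argument per line, as in an exiftool argument file.
    $written = fwrite($stdin, "-stay_open\nFalse\n");
    fflush($stdin);
    return $written !== false;
}

// Demonstration: capture what would be sent to exiftool.
$pipe = fopen('php://memory', 'r+');
$ok = requestShutdown($pipe);
rewind($pipe);
$sent = stream_get_contents($pipe);
```

With a real process, the pipes would then be closed with fclose() and proc_close() called to wait for exiftool to exit, instead of relying on proc_terminate().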

Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.


Phil Harvey

Quote from: ankutsa on January 01, 2014, 10:51:13 AM
Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P

Great.

Quote: Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.

-q suppresses the "{ready}" response, which would break most -stay_open implementations.  If you really need to use -q but still require the "{ready}" response, you can use these arguments:

-q
-echo3
{ready}


- Phil
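
A wrapper could add those three arguments automatically whenever quiet output is requested. The sketch below assumes jobs are sent as one argument per line followed by -execute; buildJob() is a hypothetical helper, not a method of the class in this thread.

```php
<?php
// Hypothetical helper: formats one job for a -stay_open exiftool process.
function buildJob(array $args, bool $quiet = false): string
{
    if ($quiet) {
        // -q suppresses "{ready}", so ask exiftool to echo it explicitly
        // to stdout once processing is complete.
        $args = array_merge(['-q', '-echo3', '{ready}'], $args);
    }
    $args[] = '-execute';
    // One argument per line, terminated by a newline.
    return implode("\n", $args) . "\n";
}

$job = buildJob(['-j', 'test1.jpg'], true);
```

The returned string is what would be written to exiftool's stdin; code waiting for "{ready}" on stdout can then stay the same whether or not -q is in use.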

TSM

Quote from: ankutsa on January 01, 2014, 10:51:13 AM
Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P

Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.

I've added that fix to the GitHub repository.

TSM

I've changed some of the internal code.

Added: setEchoMode() allows easy setting of the echo mode
Changed: execute() now only merges args and default args; execute_args() can be called separately if you want to process some arguments without the default arguments

TSM

Quote from: Phil Harvey on January 01, 2014, 02:52:43 PM
-q suppresses the "{ready}" response, which would break most -stay_open implementations.  If you really need to use -q but still require the "{ready}" response, you can use these arguments:

-q
-echo3
{ready}


- Phil

Is there a way to reverse the -q when in stay_open mode?