ExifTool PHP Fast Processing Script using StayOpen and Gearman

Started by TSM, November 05, 2013, 08:13:39 AM


TSM

I've created a script that can be used to run ExifTool in StayOpen mode within PHP.
The script can be initialised as a singleton or as an instance.
I've also supplied two additional scripts that wrap the class, ready for use with Gearman for scale-out.
The class does not apply any logic to the parameters supplied to ExifTool; whatever you push through gets passed through, and the result is then returned.
The class detects if ExifTool has died and restarts it on the next call.

It still needs some work and cleaning up, but it seems to work well in my environment and scales out quite nicely.
Originally I was using PHP streams, but this proved to be a problem, so instead I am using fgets to parse the return.
Note that the script is hard-coded to work with ExifTool 9.03+ only, because this is all I have been testing against.
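
For readers unfamiliar with the fgets approach, here is a minimal sketch of the idea, not the code from the repository: read the process output line by line until the "{ready}" marker that exiftool prints after each -execute in -stay_open mode. The helper name readUntilReady is hypothetical, and an in-memory stream stands in for the stdout pipe of the exiftool process. It assumes the marker arrives on a line of its own.

```php
<?php
// Hypothetical helper: collect output lines from a stream until the
// "{ready}" marker (optionally carrying a sequence number) appears.
function readUntilReady($stream): string
{
    $output = '';
    while (($line = fgets($stream)) !== false) {
        // Matches "{ready}" as well as "{ready123}" on a line of its own.
        if (preg_match('/^\{ready\d*\}\s*$/', $line)) {
            break;
        }
        $output .= $line;
    }
    return $output;
}

// Stand-in for the process's stdout pipe, for illustration only.
$fake = fopen('php://memory', 'w+');
fwrite($fake, "[{\"SourceFile\":\"test1.jpg\"}]\n{ready}\n");
rewind($fake);

$result = readUntilReady($fake);
echo $result;
```

In real use the stream would be the stdout pipe returned by proc_open(), and the loop blocks until exiftool has finished the command.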

Let me know what you think or any changes to make it better.

https://github.com/tsmgeek/ExifTool_PHP_Stayopen


Performance tests.
These figures are to be taken as a guide of performance increase possible with supplied scripts but will vary depending on your hardware setup and arguments supplied to ExifTool.

100 iterations fetching metadata from a JPEG (-use MWG -g -j -*:*)

1 GM Instance - 52s
2 GM Instances - 25s
3 GM Instances - 17.5s
4 GM Instances - 12.5s

Usage

Below is a basic example of how to use this class.
Put all your commands in an array and push it onto the stack using the $exif->add() function. You can add multiple jobs to process before calling fetch/fetchAll.

getInstance() - set up the class as a singleton
setExifToolPath($path) - set/change the path of ExifTool if not supplied at start
close() - terminate the ExifTool background process
start() - start the ExifTool background process
test() - check if ExifTool is running
clear() - clear the stack
fetch() - return one processed item off the stack at a time
fetchAll() - return a single array with all items in the stack processed

There are also fetchDecoded/fetchAllDecoded calls, which essentially decode the output in one step if your default arguments contain '-j' (JSON output). The default for the script is ('-g','-j') to assist in this.

As you fetch items they are taken off the stack.





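// Arguments for one job, followed by the file to process
$data=array('-use MWG','-g','-j','-*:*','test1.jpg');
// Get the shared instance, pointing it at the exiftool executable
$exif = ExifToolBatch::getInstance('/path/to/exiftool');
$exif->add($data);
// Process everything on the stack and return all results
$result=$exif->fetchAll();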
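
To illustrate what happens to such an array, here is a rough sketch of how arguments can be serialised for a -stay_open process: exiftool reads one argument per line and runs the command when it sees -execute. The helper buildCommandBlock is hypothetical and is not taken from the class; appending a number to -execute makes exiftool echo it back in the "{ready}" message.

```php
<?php
// Hypothetical helper: serialise an argument array into the text block
// that a -stay_open exiftool process reads from its argument stream.
function buildCommandBlock(array $args, int $seq): string
{
    // One argument per line; "-executeN" makes exiftool reply "{readyN}".
    return implode("\n", $args) . "\n-execute" . $seq . "\n";
}

$block = buildCommandBlock(array('-g', '-j', 'test1.jpg'), 1);
echo $block;
```

The sequence number lets the caller match each response to the job that produced it when several jobs are queued.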

Phil Harvey

This looks great, but I'm a bit surprised by the slow speed.  On my iMac here, running exiftool on 100 random JPEG images:

> time exiftool -use MWG -g -j -all:all tmp3 > out.txt
    1 directories scanned
  100 image files read
1.041u 0.017s 0:01.15 91.3% 0+0k 1+2io 0pf+0w


That is 1.15 clock seconds for all 100 files (or 11.5 ms per file).  But you are getting 52 seconds for 100 files?  Are you sure you aren't somehow launching a separate ExifTool application for each file?  That would explain the difference (45x slower).

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

TSM

Hmmm, I think it all got slow once I moved to using fgets on a buffered socket. Before that I was using streams, but found them incompatible with different versions of PHP.
I've checked, and the PID of the underlying perl process does not change once it has been started, so it's related to the fgets.
I'll look into it and hopefully get it sorted.

TSM

OK, I was going in circles, then tested the file I had done the original performance tests on and found it was the JPG that had the problem.
It was the same JPG that caused this issue: https://exiftool.org/forum/index.php/topic,5074.msg24427.html#msg24427

So I retested with another file.

Note that this is putting 100 individual tasks into the GM queue and running them as individual '-execute' calls to exiftool. All GM instances were running on the same VMware machine.

1 GM Instance - 1.3s
2 GM Instances - 0.9s
3 GM Instances - 0.7s
4 GM Instances - 0.6s


More stats
300 iterations with 4 GM Instances - 1.6s
100 iterations 3 files with 4 GM Instances - 1.4s
50 iterations 6 files with 4 GM Instances - 1.1s
100 iterations 10 files with 4 GM Instances - 3.1s

Note that all testing was looping over the same file stored locally.

I think I can get a little more speed out of the script by changing whether it uses buffered or unbuffered streams.

Phil Harvey


TSM

Is it normal that in stay_open mode you cannot set params such as '-use MWG' on each cycle? It seems that I have to set it when I first open the script; setting it afterwards can cause issues, such as the script returning nothing rather than an error.

Phil Harvey

The -use option is funny that way:

       -use MODULE
            Add features from specified plug-in MODULE.  Currently, the MWG
            module is the only plug-in module distributed with exiftool.  This
            module adds read/write support for tags as recommended by the
            Metadata Working Group.  To save typing, "-use MWG" is assumed if
            the "MWG" group is specified for any tag on the command line.  See
            the MWG Tags documentation for more details.  (Note that this
            option is not reversible, and remains in effect until the
            application terminates, even across the "-execute" option.)


I should really stop quoting the documentation in here.

I will remove the parentheses in the documentation, since they may obscure this important point.

- Phil

TSM

Ahhhaaa

I'll modify my script so that the startup params are configurable in the class.

P.S. I've updated GitHub with the new code and changes.
There is no need for users to use streams, as they are very CPU-intensive on the parent PHP process compared to fgets. This may be due to the way I pull the data back, as I've used streams very little.

New functions.

getInstance($path,$args) - You can now pass both the path and arguments (as an array) when you initialise the class
setDefaultExecArgs($args) - Set default args when starting exiftool (e.g. -use MWG)
getDefaultExecArgs() - Get extra args used for starting exiftool
setDefaultArgs($args) - Set default args used when processing commands; these will be applied before your supplied arguments
getDefaultArgs() - Get default args used for processing commands
run() - Handles -execute sequencing
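
The split between start-up arguments and per-command arguments could look roughly like the sketch below. This is not the class's actual code: buildStartCommand and mergeArgs are hypothetical names, and placing the start-up options on the command line that launches the -stay_open process is an assumption. Options such as -use, which stay in effect until exiftool terminates, belong in the first group.

```php
<?php
// Hypothetical sketch. Assumption: start-up options (e.g. -use MWG) are
// placed on the command line that launches the -stay_open process.
function buildStartCommand(string $path, array $execArgs): string
{
    $argv = array_merge(
        array($path),
        $execArgs,
        array('-stay_open', 'True', '-@', '-') // "-@ -" reads arguments from stdin
    );
    return implode(' ', array_map('escapeshellarg', $argv));
}

// Default arguments are applied before the caller's own arguments.
function mergeArgs(array $defaultArgs, array $args): array
{
    return array_merge($defaultArgs, $args);
}

$start   = buildStartCommand('/path/to/exiftool', array('-use', 'MWG'));
$command = mergeArgs(array('-g', '-j'), array('-*:*', 'test1.jpg'));

echo $start, "\n";
print_r($command);
```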

ankutsa

Very useful class, thanks. But the exiftool process keeps running even after close() is called. I think it's because exiftool doesn't listen for the SIGTERM signal that proc_terminate() is sending :(

Phil Harvey

The proper way to exit the exiftool process is to send the following arguments (as per step 5 in the -stay_open documentation):

-stay_open
False


I don't think the PHP interface is doing this.

The other way of terminating the exiftool process is by sending a SIGINT, but this will interrupt any processing that exiftool is currently doing.  I would not recommend using SIGTERM because exiftool may leave behind temporary files if it was in the process of writing to a file.

- Phil

P.S.  I also found that I needed to create a watchdog process to terminate exiftool in the case where the main program exits abnormally.  See my new C++ Interface for ExifTool for an example of how I did this in C++.

ankutsa

Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P
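
A slightly fuller sketch of that shutdown step, for illustration only: sendStayOpenFalse is a hypothetical helper rather than the class's close() method, and an in-memory stream stands in for the stdin pipe of the exiftool process.

```php
<?php
// Hypothetical helper: ask a -stay_open exiftool process to exit cleanly by
// writing the documented shutdown arguments to its argument stream.
function sendStayOpenFalse($stdin): bool
{
    $msg = "-stay_open\nFalse\n";
    $written = fwrite($stdin, $msg);
    fflush($stdin);
    return $written === strlen($msg);
}

// In real use $stdin would be $pipes[0] from proc_open(), followed by
// fclose() on the pipes and proc_close() on the process handle.
$stdin = fopen('php://memory', 'w+');
$ok = sendStayOpenFalse($stdin);

rewind($stdin);
$sent = stream_get_contents($stdin);
echo $sent;
```

Sending the arguments first gives exiftool the chance to finish any write in progress, which a signal would interrupt.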

Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.


Phil Harvey

Quote from: ankutsa on January 01, 2014, 10:51:13 AM
Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P

Great.

Quote
Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.

-q suppresses the "{ready}" response, which would break most -stay_open implementations.  If you really need to use -q but still require the "{ready}" response, you can use these arguments:

-q
-echo3
{ready}


- Phil

TSM

Quote from: ankutsa on January 01, 2014, 10:51:13 AM
Adding
fwrite($this->_pipes[0], "-stay_open\nFalse\n");
in the close() method did it. Thank you :P

Also this class doesn't work with the -q argument, which is supposed to suppress info messages or warnings.

I've added that fix to the GitHub repository.

TSM

I've changed some of the internal code.

Added: setEchoMode() allows easy setting of the echo mode
Changed: execute() now only merges args and default args; execute_args() can be called separately if you want to process some arguments without the default arguments

TSM

Quote from: Phil Harvey on January 01, 2014, 02:52:43 PM
-q suppresses the "{ready}" response, which would break most -stay_open implementations.  If you really need to use -q but still require the "{ready}" response, you can use these arguments:

-q
-echo3
{ready}


- Phil

Is there a way to reverse the -q when in stay_open?

Phil Harvey

-q affects only one command, so its effects are reversed after the next -execute.  There is no way to un-do it in the same command in which it was used, if that is what you are asking.

- Phil

TSM

Quote from: Phil Harvey on January 11, 2014, 08:51:36 PM
-q affects only one command, so its effects are reversed after the next -execute.  There is no way to un-do it in the same command in which it was used, if that is what you are asking.

- Phil

Is the same true of the -echo3 option? Is it per run or for the lifetime of stay_open?
If -echo3 replaces -execute in your example, then it causes a problem for my script, as -execute allows a sequence number to be appended, but your example seems to imply that -echo3 does the -execute anyway.

I have tried to make it work but it does not. When using -echo3 I only get the warnings on stdout, not the data. I've tried -echo4 as well.
Is the order of the arguments when using -echo very important?

Phil Harvey

I think you are confused.  I didn't mean to imply that -echo takes the place of -execute.  I just meant to say that if you add -q, then you must add an echo if you still want to receive the "{ready}" message.  You can echo whatever you want, so you could add a sequence number if you want too.  In all cases, -execute is still required.  Read the description of the -echo option in the application documentation.  It gives all of the details about what this option does.  There is no black magic.

- Phil

TSM

Right, I was going around in circles until I checked the code.

This has only been available since Jan. 27, 2013 (version 9.15). Hmm, I'm still on 9.02.

Time for an upgrade, methinks.

Also, to help others, the syntax should be as follows. Remember that each arg needs to be on its own line.

-q
-echo3
{ready}
testfile.jpg
-execute


Now I can update the PHP script to allow for this.
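
As a rough sketch of how a script might generate that sequence, here is a hypothetical helper (buildQuietBlock is not part of the class). Because -q suppresses the normal "{ready}" response, the marker is echoed with -echo3 instead, and a sequence number can be embedded in the echoed text since any text may be echoed.

```php
<?php
// Hypothetical helper: build a quiet (-q) command block for -stay_open mode.
// -echo3 writes the given text to stdout after processing is complete, so it
// can stand in for the "{ready}" response that -q suppresses.
function buildQuietBlock(array $args, int $seq): string
{
    $lines = array_merge(
        array('-q', '-echo3', '{ready' . $seq . '}'),
        $args,
        array('-execute') // -execute is still required to run the command
    );
    return implode("\n", $lines) . "\n";
}

$block = buildQuietBlock(array('testfile.jpg'), 7);
echo $block;
```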

Phil Harvey

Ah.  I forgot to check to see if you were using a recent version.  Glad you caught this.

- Phil

TSM

Another thing: when posting multiline data such as IPTC:Caption-Abstract, we normally use \n to split the lines as we are Linux-based, but this causes an issue with stay_open as it uses \n to separate arguments. What do you do in this instance?

Phil Harvey

There are various work-arounds.  See FAQ 21 for details.  The first technique won't work for argfiles, but the others will.

- Phil

TSM

Hmmm interesting.

It would seem the best approach is to accept HTML entities and then use &#xa; in my case (option b), as I'm feeding in the tags directly for each image I process.
It may be an idea for me to add this directly into my API so others do not come across the problem: make it a toggle to switch it on/off and have the API handle the conversion itself.
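
A conversion of that kind might look like the sketch below. It is only an illustration under stated assumptions: encodeTagValue and buildTagArg are hypothetical names, the -E option must be in effect for the command so that exiftool decodes the entities, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: make a tag value safe to send on a single line.
// Assumes the command is run with -E so exiftool decodes HTML entities.
function encodeTagValue(string $value): string
{
    // Escape characters that would otherwise be read as entity syntax.
    $escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    // Replace every kind of line break with the newline entity.
    return str_replace(array("\r\n", "\r", "\n"), '&#xa;', $escaped);
}

// No quoting around the value: everything after "=" up to the end of the
// line is taken as the tag data.
function buildTagArg(string $tag, string $value): string
{
    return '-' . $tag . '=' . encodeTagValue($value);
}

$arg = buildTagArg('IPTC:Caption-Abstract', "Line one\nLine two & more");
echo $arg, "\n";
```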

Also, when writing tags on the CLI I would normally wrap the data of each tag in quotes, but if this is done with stay_open the quotes end up in the written value. Would I be correct in saying that the quotes on the CLI are interpreted by the shell before the args are passed to exiftool, whereas with stay_open the args go straight to the app, so it considers everything after the '=' up to the '\n' character to be the data for that tag?

Phil Harvey

Quote from: TSM on May 21, 2014, 01:06:34 PM
would I be correct in saying the quotes on the CLI are interpreted by the shell before passing the args to exiftool but when using stay_open it is already within the app it does not and so it considers everything after the '=' to be the data to the '\n' character for that tag?

Correct.

  -@ ARGFILE
            Read command-line arguments from the specified file.  The file
            contains one argument per line (NOT one option per line -- some
            options require additional arguments, and all arguments must be
            placed on separate lines).  Blank lines and lines beginning with
            "#" are ignored.  Normal shell processing of arguments is not
            performed, which among other things means that arguments should
            not be quoted and spaces are treated as any other character.


- Phil

TSM

I hope this is the final one for today.

When using stay_open for writing to a file, I never seem to get any return data to say it's done or that there was an error. I've forced errors, such as the file not existing, but still nothing.
The only core params that are being set before the write args are:
-g
-j
-coordFormat
%.6f
-E

Is this normal behaviour?

Phil Harvey

Typically the errors go to stderr.  Are you reading this exiftool output?

- Phil

TSM

I am, but only to a log file. I'll have a look at parsing the data, but there are a fair few things to check.
It would be nice, if possible, to have a basic output on STDOUT when writing files that can be parsed easily, with a more verbose log on STDERR.
Also, the error log does not contain any marker such as {ready}. The only way I could see to work with it is to read all the data into a variable, parse it after STDOUT has returned {ready}, then reset the buffer on the next cycle and start again.

Phil Harvey

You can add "{ready}" to the stderr output with -echo4.

See cpp_exiftool for an example of how to do this in C++.
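
In PHP, the same idea might be sketched as follows. Both helpers are hypothetical names, not functions from the class or from cpp_exiftool, and the stderr text in the example is a made-up placeholder rather than real exiftool output. An in-memory stream stands in for the stderr pipe.

```php
<?php
// Hypothetical helper: append a stderr marker (-echo4) to a command block so
// the reader knows where the messages for this command end.
function buildBlockWithStderrMarker(array $args, int $seq): string
{
    $lines = array_merge(
        $args,
        array('-echo4', '{ready' . $seq . '}', '-execute' . $seq)
    );
    return implode("\n", $lines) . "\n";
}

// Hypothetical helper: collect stderr lines up to the marker for $seq.
function readStderrUntil($stderr, int $seq): array
{
    $marker = '{ready' . $seq . '}';
    $messages = array();
    while (($line = fgets($stderr)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === $marker) {
            break;
        }
        if ($line !== '') {
            $messages[] = $line;
        }
    }
    return $messages;
}

$block = buildBlockWithStderrMarker(array('-Comment=test', 'missing.jpg'), 3);

// Placeholder stderr content, for illustration only.
$fakeErr = fopen('php://memory', 'w+');
fwrite($fakeErr, "Sample error message\n{ready3}\n");
rewind($fakeErr);
$messages = readStderrUntil($fakeErr, 3);
print_r($messages);
```

An empty array then means the command produced no messages on stderr.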

- Phil

TSM

Thank you, that was helpful, and I've taken a few functions from the C++ implementation into the PHP one.

I've updated the API to implement getError/getErrorStr/getSummary

getError($id) - false if there is no error, otherwise the last error string
getErrorStr($id) - returns STDERR output regardless of whether there was an error
getSummary($msg) - returns the value for a summary message

Note that $id should only be supplied if you have passed a batch of different requests and executed fetchAll(), as the return data will be an array of results indexed by $id. If you leave it blank, it will only return a value if you had used fetch().

Phil Harvey

I just took a look at this code again.  By odd coincidence I'm using the -php output formatting for the C++ ExifTool communication.  But you're using -json format with PHP.  :P

I think the reason I did this was because the -json option didn't support binary output when I wrote the C++ ExifTool interface.  But the binary output in JSON is a bit of a kludge, so I don't regret this.  There may have also been a problem with JSON requiring well-formed UTF-8.

- Phil

TSM

Those are just the defaults for the API. As it happens, we store the JSON data directly in the DB instead of using PHP serialize/unserialize, as we considered JSON to be more portable than PHP arrays. Pros and cons for everything.
It does not cause any problem to put it in PHP mode in my script: if you use fetch() it does not try to decode the output and passes it back to you raw to do with as you please. If you use fetchDecoded(), it will require the output format to be JSON.
Why I did this I do not know, but it works for us.
I'm just altering our internal app so that we move all exiftool writing over to use the API via Gearman instead of running locally. Before, we were only optimizing the reading.
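
For anyone calling fetch() and decoding the result themselves, a minimal sketch of the decoding step follows. decodeExifJson is a hypothetical helper, not the class's fetchDecoded(), and the JSON string is a made-up sample in the general shape of -g -j output rather than real output.

```php
<?php
// Hypothetical helper: decode raw -j output into PHP arrays, failing loudly
// if the output is not JSON (e.g. because a different format was requested).
function decodeExifJson(string $raw): array
{
    $decoded = json_decode($raw, true);
    if (!is_array($decoded)) {
        throw new RuntimeException('Output is not valid JSON: ' . json_last_error_msg());
    }
    return $decoded;
}

// Made-up sample: one object per file, tags nested under their group.
$raw  = '[{"SourceFile":"test1.jpg","File":{"FileType":"JPEG"}}]';
$meta = decodeExifJson($raw);

echo $meta[0]['File']['FileType'], "\n";
```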

klarakos

Hmmm, i think its all got slow once i moved to using fgets using a buffered socket, before i was using streams but found it incompatible with different version of php.
Ive checked and the PID does not change of the underlying perl once it has been started so its related to the fgets.
Ile look into it and get it sorted hopefully.

TSM

Quote from: klarakos on August 01, 2014, 05:52:49 AM
Hmmm, i think its all got slow once i moved to using fgets using a buffered socket, before i was using streams but found it incompatible with different version of php.
Ive checked and the PID does not change of the underlying perl once it has been started so its related to the fgets.
Ile look into it and get it sorted hopefully.

??

mauricio

I have written a simple PHP script to run exiftool. It works correctly, but I am not sure if it is the best way to implement it. Could you give me advice or your point of view?
Thank you in advance for your cooperation.


<?php
$env = null;
$cwd = ".";

// stdin (0) is a pipe the child reads from; stdout (1) is a pipe it writes to
$descriptorspec = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
);

$command = escapeshellcmd("exiftool -json imagenes/test3.jpg");
$process = proc_open($command, $descriptorspec, $pipes, $cwd, $env);

if (is_resource($process)) {
    fwrite($pipes[0], "-stay_open\nFalse\n");
    fclose($pipes[0]);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    proc_close($process);
}
?>


jaireaux

In case someone finds this forum post, as I did, seven years later, I was just looking for some good examples of using exiftool in PHP and came across this thread. I'm not working with a massive number of photos so I almost ignored it but tested it for just 10 pictures and the speed increase is dramatic. Here is the debug output I've used to time the difference between approaches.

using direct exec start time is 2020-03-13 11:04:14:773
using direct exec end time is 2020-03-13 11:04:16:900

using exif handler start time is 2020-03-13 11:04:16:900
using exif handler end time is 2020-03-13 11:04:17:495


So with this small test, using 'shell_exec("exiftool...")' took 2,127 milliseconds and using '$exif->...' took 595 milliseconds. That's roughly a 72% reduction in run time.

In summary, thanks to everyone who's kept this script alive. It's going to save me a lot of time, even with a smaller pool of images.

jlb30504

Please excuse the newbie question.  I take it that ExifToolBatch.php does not work in the Apache environment since PCNTL doesn't work in Apache.  Is this correct?