Run Exiftool using multiple cores

Started by RossTP, October 17, 2016, 02:01:14 AM

Previous topic - Next topic

Hayo Baan

Here is a Perl script that will process all files in parallel. Save the text below to a file (e.g. exiftoolParallel), edit it to suit your needs (all you should need to change are the first three variables), and run it (e.g. with perl exiftoolParallel). I used Phil's one-liner version of your command, so things should really be snappy :)

Let me know if you have any questions.

Enjoy,
Hayo

P.S. Regarding StarGeek's comment: while disk I/O will eventually become the bottleneck, I have found in practice that processing files in parallel still speeds things up considerably.

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Temp qw(tempfile);

# The list of directories with all the images
my @imagedir_roots = ("/Users/Ross/Desktop/Images");

# Number of parallel processes
my $parallel = 3;

# The exiftool command (the files to process will be added automatically, so do not include them here!)
my $exiftool_command = 'exiftool -all= -tagsfromfile @ -all:all --gps:all --xmp:geotag -unsafe -icc_profile -overwrite_original';


################################################################################

# Create the (temporary) -@ files
my @atfiles;
my @atfilenames;
for (my $i = 0; $i < $parallel; ++$i) {
    my ($fh, $filename) = tempfile(UNLINK => 1);
    push @atfiles, $fh;
    push @atfilenames, $filename;
}

# Gather all JPG image files and distribute them over the -@ files
my $nr = 0;
find(sub { print { $atfiles[$nr++ % $parallel] } "$File::Find::name\n" if (-f && /\.(?:jpg|jpeg)$/i); }, @imagedir_roots);

# Process all images in parallel
printf("Processing %d JPG files...\n", $nr);
for (my $i = 0; $i < $parallel; ++$i) {
    close($atfiles[$i]); # So it is fully written to when using it in exiftool
    my $pid = fork();
    if (!$pid) {
        # Child process: run exiftool on this batch, then exit
        system qq{$exiftool_command -@ "$atfilenames[$i]"};
        exit;
    }
}

# Wait for processes to finish
while (wait() != -1) {}
Hayo Baan – Photography
Web: www.hayobaan.nl

RossTP

Hi Hayo,

Thank you so much for spending the time to write that script. It worked perfectly! I'm not entirely sure what disk I/O is, but I'm using a solid-state drive, so I'm hoping the bottleneck won't be too severe. I've just run your script on a batch of 10,000 images and it only took a couple of minutes. This is a huge improvement over my previous workflow.

Thanks again to everyone that commented and assisted in finding these solutions. I really do appreciate it.

Cheers,
Ross

Phil Harvey

...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).


RossTP

Good day Hayo,

I wonder if you wouldn't mind helping me expand your current (amazing!) Perl script? I'm looking to extract image metadata to a CSV file, but using multiple cores to speed up the process. So far I've been running this from Terminal using a single core:

exiftool -r -csv /Volumes/HDD_CBASE/Images > /Users/Ross/Documents/metadata.csv

I figured your Perl script should be very useful here, but I don't know how to specify different csv files for each core, and then combine them once all cores are finished. Is this even possible, or am I expecting too much?

Your Perl script has really increased my efficiency (and that of my computer), so I must thank you again for spending the time to develop it.

Thanks in advance for any assistance you might be able to provide.

Regards,
Ross

Phil Harvey

Hi Ross,

The columns in the -csv output depend on the information in the processed files, so you can't really combine the output of -csv from different files.
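To see why (a toy illustration with invented tag values): each parallel run emits only the columns its own files happen to contain, so the headers of two per-core outputs need not match, and simply concatenating them would misalign the columns.

```shell
# Toy illustration (invented tag values, not real exiftool output): each
# parallel run writes only the columns its own files contain.
printf 'SourceFile,ISO,FNumber\na.jpg,100,2.8\n' > batch1.csv
printf 'SourceFile,GPSLatitude\nb.jpg,52.1\n'    > batch2.csv
head -1 batch1.csv   # SourceFile,ISO,FNumber
head -1 batch2.csv   # SourceFile,GPSLatitude
# The headers differ, so `cat batch1.csv batch2.csv` would misalign columns.
```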

- Phil

RossTP

Hi Phil,

That's a great point, which I clearly didn't think of.

I'll be importing the CSV file(s) into R, so I think I'll be able to combine them there. The only trouble now is how to specify a different CSV file for each core and write to it from Hayo's script.
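If combining in R ever becomes awkward, the per-core files could also be merged at the shell level. A hedged sketch (merge_csv is an invented helper, not from the thread; it assumes simple CSV fields with no embedded commas or quotes):

```shell
# Invented helper: merge CSV files whose headers differ, producing a header
# that is the union of all columns. Assumes no quoted/embedded commas.
merge_csv() {
  awk -F, '
    FNR == 1 {                            # header line of each input file
      for (k in colmap) delete colmap[k]  # reset the per-file column map
      for (i = 1; i <= NF; i++) {
        colmap[i] = $i
        if (!($i in seen)) { seen[$i] = 1; order[++ncols] = $i }
      }
      next
    }
    {
      for (i = 1; i <= NF; i++) row[NR, colmap[i]] = $i
      rows[++nrows] = NR
    }
    END {
      hdr = order[1]
      for (c = 2; c <= ncols; c++) hdr = hdr "," order[c]
      print hdr
      for (r = 1; r <= nrows; r++) {
        line = row[rows[r], order[1]]
        for (c = 2; c <= ncols; c++) line = line "," row[rows[r], order[c]]
        print line
      }
    }' "$@"
}

# Demo with two toy per-core outputs (filenames and tag values are invented):
printf 'SourceFile,ISO\na.jpg,100\n'          > outfile0.csv
printf 'SourceFile,GPSLatitude\nb.jpg,52.1\n' > outfile1.csv
merge_csv outfile0.csv outfile1.csv > combined.csv
cat combined.csv
# → SourceFile,ISO,GPSLatitude
#   a.jpg,100,
#   b.jpg,,52.1
```

Missing columns come out as empty fields, which R will read as NA.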

Appreciate any assistance.

Regards,
Ross

Phil Harvey

Hi Ross,

I think you may be able to write separate output files from the script with something like this:

        my $result = `$exiftool_command -@ "$atfilenames[$i]"`;
        open OUTFILE, ">outfile$i.csv";
        print OUTFILE $result;
        close OUTFILE;


- Phil

RossTP

Hi Phil,

Thank you very much for your help.

I'm afraid I don't quite know where to put the additional lines of code (my knowledge of Perl is practically non-existent). My attempts have resulted in syntax errors, compilation errors, and warnings about forgetting to declare variables. At the moment I've got this (without your additional code):

Have I defined the $exiftool_command correctly for this task?

Appreciate any assistance.
Regards,
Ross

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Temp qw(tempfile);

# The list of directories with all the images
my @imagedir_roots = ("/Users/Ross/Desktop/Images");

# Number of parallel processes
my $parallel = 8;

# The exiftool command (the files to process will be added automatically, so do not include them here!)
my $exiftool_command = 'exiftool -r -csv';

################################################################################

# Create the (temporary) -@ files
my @atfiles;
my @atfilenames;
for (my $i = 0; $i < $parallel; ++$i) {
    my ($fh, $filename) = tempfile(UNLINK => 1);
    push @atfiles, $fh;
    push @atfilenames, $filename;
}

# Gather all JPG image files and distribute them over the -@ files
my $nr = 0;
find(sub { print { $atfiles[$nr++ % $parallel] } "$File::Find::name\n" if (-f && /\.(?:jpg|jpeg)$/i); }, @imagedir_roots);

# Process all images in parallel
printf("Processing %d JPG files...\n", $nr);
for (my $i = 0; $i < $parallel; ++$i) {
    close($atfiles[$i]); # So it is fully written to when using it in exiftool
    my $pid = fork();
    if (!$pid) {
        # Child process: run exiftool on this batch, then exit
        system qq{$exiftool_command -@ "$atfilenames[$i]"};
        exit;
    }
}

# Wait for processes to finish
while (wait() != -1) {}

Phil Harvey

Hi Ross,

All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

- Phil

Hayo Baan

Quote from: Phil Harvey on February 07, 2017, 07:24:07 AM
All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

I think it should work too. Another solution could be to add > outfile$i.csv just before the closing backtick on the system line.

RossTP

Thanks Phil and Hayo,

As soon as I'm back at my computer I'll give this a try. Will revert back if I stumble again...

Cheers,
Ross

Hayo Baan

Quote from: Hayo Baan on February 07, 2017, 04:52:56 PM
Quote from: Phil Harvey on February 07, 2017, 07:24:07 AM
All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

I think it should work too. Another solution could be to add > outfile$i.csv just before the closing backtick on the system line.
To be sure, I meant to say curly brace...

Here's the complete line: system qq{$exiftool_command -@ \"$atfilenames[$i]\" > outfile$i.csv}

chuck lee

Dear all,
  Thanks for the posts. They gave me the idea to split the files into several groups for exiftool to process in parallel. It really speeds things up; I am using an SSD.

job_round_robin(){
  if [[ ${#img[@]} -gt 0 ]]
  then
    local -n img_tmp="$1" || return 1
    img_lot=(${img[@]:0:$lot})   # take the next lot of files
    img=(${img[@]:$lot})         # drop them from the main list
    img_tmp+=(${img_lot[@]})     # append them to this job's array
  fi
}

mp_exiftool(){

# unset job arrays 
unset img 
# img array will be used to store all the file names

for j in $(seq 1 1 $threads); do unset img$j; done
# split the files into array img1, img2,...  later

[[ "$recursive" == "r" ]] && img=(**) || img=(*) 
# check variable $recursive
# img array with all file names

img=(${img[@]/\**/})  # no file -- array has (*) or (**), remove the element in the array

# round-robin all the files to each job
  while [[ ${#img[@]} -gt 0 ]] 
  do
    for job in $(seq 1 $threads)
    do
       [[ ${#img[@]} -gt 0 ]] && job_round_robin "img$job"
    done
  done 
 
# send the job(array) to exiftool in background mode multi_processing

  for job in $(seq 1 $threads)
  do
    eval [[ \${#img${job}[@]} -gt 0 ]] &&  eval exiftool -fast2 -q -overwrite_original '-subject\<filename' -ext jpg  \${img${job}[@]} &
  done 

  wait  # in case next process needs the output of these processes
}


threads=10
# number of threads used for this program
lot=200
# minimum number of images per exiftool process; tune the number to fit your workload


# replace [space], [(], or [)] with [_] in the filename for all the files
rename -n --nopath 's/(\s|\(|\))+/_/g' *    # -n means No action: print names of files to be renamed, but don't rename.
                                            # remove -n if the outcome is correct
                                            # it will rename the files in the current directory
                                           
# to recursively rename the files in the directories                                             
# shopt -s globstar
# rename -n --nopath 's/(\s|\(|\))+/_/g' **  # remove -n if the outcome is correct

mp_exiftool
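The round-robin idea at the heart of the script can be sketched in isolation like so (a minimal POSIX-shell sketch with invented bucket names; the script above uses bash arrays and a nameref, while this version keeps buckets as space-separated strings):

```shell
# Minimal sketch of round-robin distribution: deal items out over N buckets.
# Bucket names are illustrative; eval plays the role of the nameref above.
threads=3
items="a b c d e f g"
bucket1=""; bucket2=""; bucket3=""
i=0
for item in $items; do
  j=$(( i % threads + 1 ))                 # target bucket, 1-based
  eval "bucket$j=\"\${bucket$j} \$item\""  # append item to bucket$j
  i=$(( i + 1 ))
done
echo "bucket1:$bucket1"   # bucket1: a d g
echo "bucket2:$bucket2"   # bucket2: b e
echo "bucket3:$bucket3"   # bucket3: c f
```

Each bucket would then be handed to one background exiftool invocation, as in the dispatch loop of the script above.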




Many thanks to you all.

Chuck Lee


Phil Harvey

Hi Chuck,

You can speed things up quite a bit more if you keep exiftool running with the -stay_open option and pass the arguments via stdin rather than launching a new instance for each command.
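For reference, the -stay_open protocol works by feeding exiftool one argument per line from an argfile or pipe: -execute runs the accumulated command, and -stay_open False shuts the process down. The sketch below only builds such a stream (a.jpg and b.jpg are placeholder filenames); actually consuming it requires exiftool itself, e.g. exiftool -stay_open True -@ args.txt.

```shell
# Build a -stay_open argument stream: one argument per line, -execute runs a
# command, and -stay_open/False terminates the process. (a.jpg and b.jpg are
# placeholders; running the stream requires exiftool:
#   exiftool -stay_open True -@ args.txt)
{
  printf '%s\n' -csv a.jpg -execute   # first command
  printf '%s\n' -csv b.jpg -execute   # second command, same process
  printf '%s\n' -stay_open False      # tell exiftool to exit
} > args.txt
cat args.txt
```

This way a single exiftool process handles every command, avoiding the Perl startup cost of launching a new instance per batch.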

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).