Run Exiftool using multiple cores

Started by RossTP, October 17, 2016, 02:01:14 AM


RossTP

Does Exiftool run on multiple cores if they're available? If not, is there a way to do this using something like GNU Parallel? I'm trying to clear GPS metadata from >50,000 images (a task I run almost every week), but it takes quite a long time to run.

To do this I first need to clear the Makernotes (because there are major and minor errors in many of the images I need to work with):

exiftool -r -all= -tagsfromfile @ -all:all -unsafe -icc_profile -overwrite_original -ext jpg .

And then I clear the GPS metadata using this:

exiftool -r -gps:all= -xmp:geotag= -overwrite_original -ext jpg .

Any ideas on how to speed this process up?
FYI - I always work on a copy of my data, hence why I overwrite originals in the above code.

Thanks in advance!

Phil Harvey

You could separate your images into groups and run multiple exiftool commands simultaneously, one on each group.

What platform are you on?  It shouldn't be too hard to create a script to do this without the need to physically separate the images into different directories.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

RossTP

Hi Phil,

I run both Mac (OS X El Capitan) and Windows-based machines. It would be great if the same code could work for both, but if not, then a Windows 10 system would be the most convenient.

Appreciate any help.

Cheers,
Ross

Phil Harvey

Hi Ross,

If scripted, the script would have to be different for Windows and Mac.  I wouldn't know how to write the Windows script (or .bat file).

But if you could find a clear way to divide the images then it could work for both.

- Phil

RossTP

Hi Phil,

So I've managed to figure out how to split the image folder up into smaller folders using terminal/bash, but once I've done that, how do I run multiple exiftool commands simultaneously on these folders? I currently have three smaller folders, containing ±10,000 images each.

Thanks in advance.
Ross

Hayo Baan

On a Mac/Linux system, you can easily start any command in the background by putting an & at the end of the command. E.g. exiftool ARGS DIR & will run it as a background process, allowing you to do this multiple times. The output of each background process will mingle with that of your current session, so you'll probably want to capture the output of each background process in a separate file. If your main script needs to wait for the background processes to finish, you can use e.g. the wait built-in; see the man page of your shell for more information.
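For instance, a rough bash sketch of this pattern (untested; the folder names 1.folder, 2.folder, 3.folder are just examples, and the first line stubs exiftool with an echo so the sketch can be tried even on a machine without exiftool installed):

```shell
#!/bin/bash
# Stub exiftool if it is not installed, so the sketch runs anywhere.
# Remove this line for real use.
command -v exiftool >/dev/null 2>&1 || exiftool() { echo "exiftool $*"; }

# One background job per (hypothetical) sub-folder, each with its own log.
for d in 1.folder 2.folder 3.folder; do
    exiftool -r -gps:all= -xmp:geotag= -overwrite_original -ext jpg "$d" \
        > "$d.log" 2>&1 &
done

wait   # block here until every background job has finished
echo "all jobs finished"
```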

On Windows, I don't know of a way to do this (at least the standard command line does not support it, but PowerShell might have ways).
Hayo Baan – Photography
Web: www.hayobaan.nl

Alan Clifford

Quote from: RossTP on October 18, 2016, 12:33:29 AM
Hi Phil,

... but once I've done that, how do I run multiple exiftool commands simultaneously on these folders? I've currently got three smaller folders, containing ±10,000 images in each.


You can open three terminal windows, cd to a different directory in each window, then type the appropriate exiftool command in each window.


RossTP

Thanks Phil, Hayo and Alan,

Really appreciate all the help. So in summary, I managed to split the image folder into smaller sub-folders of 10,000 images using this code (for mac):

#!/bin/bash
# Split the files in the current directory into numbered sub-folders
# (1.folder, 2.folder, ...) of 10,000 files each.
x=0
y=0
for i in *
do
  if [ "$x" = "10000" ]; then
    x=0
  fi
  if [ "$x" = "0" ]; then
    y=`expr $y + 1`
    mkdir "$y.folder"
    echo -n "."
  fi
  x=`expr $x + 1`
  mv "$i" "$y.folder"
done


Then I opened three terminal windows and ran the exiftool commands in parallel. A bit of a roundabout way of achieving the objective, but it works.

Thanks again.
Ross

Hayo Baan

From your original question, I gather you do this often? If so, I'd suggest changing the script so it automatically calls exiftool on the created directories. As said, if you add an & at the end of the command, it will run in the background, so all directories will then be processed in parallel.

Alan Clifford

Quote from: Hayo Baan on October 19, 2016, 01:34:08 AM
From your original question, I gather you do this often? If so, I'd suggest changing the script so it automatically calls exiftool on the created directories. As said if you add an & at the end of the command, it will be run in the background, so it will then process all directories in parallel.

I'd possibly agree with the background '&' method but suggested the three terminals because it is conceptually easier if someone is not guru level with unix style terminal commands. Personally, I'd be happier with the three windows.

Phil Harvey

Since you are running a script anyway, an alternative would be to create separate lists of files to process instead of moving the files to separate directories.  Then you could use the exiftool -@ option to process files in each list.
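A rough sketch of that approach in bash (the demo tree, list-file names, and log names here are invented for illustration; the real exiftool call is left commented out):

```shell
#!/bin/bash
# Create a small demo tree (a stand-in for the real image directory).
mkdir -p demo/a demo/b
touch demo/a/one.jpg demo/b/two.jpg demo/b/three.jpg demo/b/four.jpg

N=3   # one list per parallel exiftool process
i=0
# Distribute the files round-robin over list1.txt .. listN.txt.
find demo -type f -iname '*.jpg' | while read -r f; do
    echo "$f" >> "list$(( i % N + 1 )).txt"
    i=$(( i + 1 ))
done

# Each list would then get its own background process, e.g.:
# exiftool -gps:all= -xmp:geotag= -overwrite_original -@ list1.txt > log1.txt 2>&1 &
wc -l list*.txt
```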

- Phil

Hayo Baan

Quote from: Phil Harvey on October 19, 2016, 07:09:56 AM
Since you are running a script anyway, an alternative would be to create separate lists of files to process instead of moving the files to separate directories.  Then you could use the exiftool -@ option to process files in each list.

Actually this is best as it saves you from having to move the files altogether. If you tell me the exiftool command you'd like to run on the files and how you specify the location of the files/dirs to process, I'll look into creating a little script for you to do this.

RossTP

Hi all,

Thanks very much for this assistance, I really do appreciate it.

Hayo – there are two commands that need to be run. First, I need to rewrite all the metadata using:

exiftool -r -all= -tagsfromfile @ -all:all -unsafe -icc_profile -overwrite_original -ext jpg .

Then I need to clear the GPS metadata using:

exiftool -r -gps:all= -xmp:geotag= -overwrite_original -ext jpg .

Let's assume that the folder (which can contain 30,000 to 150,000 images) is on my desktop: /Users/Ross/Desktop/images

Thanks again in advance!

Phil Harvey

I'll just point out that these two commands may be done in a single operation:

exiftool -r -all= -tagsfromfile @ -all:all --gps:all --xmp:geotag -unsafe -icc_profile -overwrite_original -ext jpg .

- Phil

(2x the speed without any effort)

StarGeek

A little late to the thread, but I really think that disk I/O is going to be more of a bottleneck than CPU use. I've tried running a bunch of commands simultaneously, and after about 4-5 commands things slow down on my old computer due to disk use, while there's still plenty of CPU power available.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

Hayo Baan

Here is a (Perl) script that will process all files in parallel. Save the text below in a file (e.g. exiftoolParallel), edit it to cater for your needs (all you should need to change are the first three variables), and run it (e.g. with perl exiftoolParallel). I used Phil's one-liner version of your command, so things should really be snappy :)

Let me know if you have any questions.

Enjoy,
Hayo

P.S. Regarding StarGeek's comment: while disk I/O will eventually become a bottleneck, I have found in practice that processing things in parallel still speeds things up considerably.

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Temp qw(tempfile);

# The list of directories with all the images
my @imagedir_roots = ("/Users/Ross/Desktop/Images");

# Number of parallel processes
my $parallel = 3;

# The exiftool command (the files to process will be added automatically, so do not include them here!)
my $exiftool_command = 'exiftool -all= -tagsfromfile @ -all:all --gps:all --xmp:geotag -unsafe -icc_profile -overwrite_original';


################################################################################

# Create the (temporary) -@ files
my @atfiles;
my @atfilenames;
for (my $i = 0; $i < $parallel; ++$i) {
    my ($fh, $filename) = tempfile(UNLINK => 1);
    push @atfiles, $fh;
    push @atfilenames, $filename;
}

# Gather all JPG image files and distribute them over the -@ files
my $nr = 0;
find(sub { print { $atfiles[$nr++ % $parallel] } "$File::Find::name\n" if (-f && /\.(?:jpg|jpeg)$/i); }, @imagedir_roots);

# Process all images in parallel
printf("Processing %d JPG files...\n", $nr);
for (my $i = 0; $i < $parallel; ++$i) {
    close($atfiles[$i]); # So it is fully written to when using it in exiftool
    my $pid = fork();
    if (!$pid) {
        # Run exiftool in the background
        system qq{$exiftool_command -@ \"$atfilenames[$i]\"};
        last;
    }
}

# Wait for processes to finish
while (wait() != -1) {}

RossTP

Hi Hayo,

Thank you so much for taking the time to write that script. It worked perfectly! I'm not entirely sure what disk I/O is, but I'm using a solid-state drive, so I'm hoping the bottleneck won't be too severe. I've just run your script on a batch of 10,000 images and it only took a couple of minutes. This is a huge improvement over my previous workflow.

Thanks again to everyone that commented and assisted in finding these solutions. I really do appreciate it.

Cheers,
Ross



RossTP

Good day Hayo,

I wonder if you wouldn't mind assisting me in expanding on your current (amazing!) Perl script? I'm looking to extract image metadata to a csv file, but to use multiple cores to speed up the process. So far I've been running this from Terminal using a single core:

exiftool -r -csv /Volumes/HDD_CBASE/Images > /Users/Ross/Documents/metadata.csv

I figured your Perl script should be very useful here, but I don't know how to specify different csv files for each core, and then combine them once all cores are finished. Is this even possible, or am I expecting too much?

Your Perl script has really increased my efficiency (and that of my computer), so I must thank you again for spending the time to develop it.

Thanks in advance for any assistance you might be able to provide.

Regards,
Ross

Phil Harvey

Hi Ross,

The columns in the -csv output depend on the information in the processed files, so you can't really combine the output of -csv from different files.

- Phil

RossTP

Hi Phil,

That's a great point, which I clearly didn't think of.

I'll be importing the csv file/s into R, so I think I'll be able to combine the files there. The only trouble now is how to specify different csv files (one per core) and write to them using Hayo's script.

Appreciate any assistance.

Regards,
Ross

Phil Harvey

Hi Ross,

I think you may be able to write separate output files from the script with something like this:

        my $result = `$exiftool_command -@ \"$atfilenames[$i]\"`;
        open OUTFILE, ">outfile$i.csv";
        print OUTFILE $result;
        close OUTFILE;


- Phil

RossTP

Hi Phil,

Thank you very much for your help.

I'm afraid I don't quite know where to put the additional lines of code (my knowledge of Perl is practically non-existent). My attempts have resulted in syntax errors, compilation errors, and warnings about forgetting to declare variables. At the moment I've got this (without your additional code):

Have I defined the $exiftool_command correctly for this task?

Appreciate any assistance.
Regards,
Ross

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Temp qw(tempfile);

# The list of directories with all the images
my @imagedir_roots = ("/Users/Ross/Desktop/Images");

# Number of parallel processes
my $parallel = 8;

# The exiftool command (the files to process will be added automatically, so do not include them here!)
my $exiftool_command = 'exiftool -r -csv';

################################################################################

# Create the (temporary) -@ files
my @atfiles;
my @atfilenames;
for (my $i = 0; $i < $parallel; ++$i) {
    my ($fh, $filename) = tempfile(UNLINK => 1);
    push @atfiles, $fh;
    push @atfilenames, $filename;
}

# Gather all JPG image files and distribute them over the -@ files
my $nr = 0;
find(sub { print { $atfiles[$nr++ % $parallel] } "$File::Find::name\n" if (-f && /\.(?:jpg|jpeg)$/i); }, @imagedir_roots);

# Process all images in parallel
printf("Processing %d JPG files...\n", $nr);
for (my $i = 0; $i < $parallel; ++$i) {
    close($atfiles[$i]); # So it is fully written to when using it in exiftool
    my $pid = fork();
    if (!$pid) {
        # Run exiftool in the background
        system qq{$exiftool_command -@ \"$atfilenames[$i]\"};
        last;
    }
}

# Wait for processes to finish
while (wait() != -1) {}

Phil Harvey

Hi Ross,

All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

- Phil

Hayo Baan

Quote from: Phil Harvey on February 07, 2017, 07:24:07 AM
All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

I think it should work too. Another solution could be to add > outfile$i.csv just before the closing backtick on the system line.

RossTP

Thanks Phil and Hayo,

As soon as I'm back at my computer I'll give this a try. Will revert back if I stumble again...

Cheers,
Ross

Hayo Baan

Quote from: Hayo Baan on February 07, 2017, 04:52:56 PM
Quote from: Phil Harvey on February 07, 2017, 07:24:07 AM
All you do is replace the line starting with "system" with the lines I provided.  But I don't have time to test this out right now, so I can't guarantee that it will work.

I think it should work too. Another solution could be to add > outfile$i.csv just before the closing backtick on the system line.
To be sure, I meant to say curly brace...

Here's the complete line: system qq{$exiftool_command -@ \"$atfilenames[$i]\" > outfile$i.csv}

chuck lee

Dear all,
  Thanks for the posts. This gave me the idea to split the files into several groups for exiftool to process in parallel. It really speeds up the process; I am using an SSD.

job_round_robin(){
  if [[ ${#img[@]} -gt 0 ]]
  then
    local -n img_tmp="$1" || return 1
    img_lot=(${img[@]:0:$lot})   # take the next $lot file names
    img=(${img[@]:$lot})         # drop them from the master list
    img_tmp+=(${img_lot[@]})     # append them to this job's array
  fi
}

mp_exiftool(){

# unset job arrays 
unset img 
# img array will be used to store all the file names

for j in $(seq 1 1 $threads); do unset img$j; done
# split the files into array img1, img2,...  later

[[ "$recursive" == "r" ]] && img=(**) || img=(*) 
# check variable $recursive
# img array with all file names

img=(${img[@]/\**/})  # if nothing matched, the array holds a literal (*) or (**); remove that element

# round-robin all the files to each job
  while [[ ${#img[@]} -gt 0 ]] 
  do
    for job in $(seq 1 $threads)
    do
       [[ ${#img[@]} -gt 0 ]] && job_round_robin "img$job"
    done
  done 
 
# send the job(array) to exiftool in background mode multi_processing

  for job in $(seq 1 $threads)
  do
    eval [[ \${#img${job}[@]} -gt 0 ]] &&  eval exiftool -fast2 -q -overwrite_original '-subject\<filename' -ext jpg  \${img${job}[@]} &
  done 

  wait  # in case next process needs the output of these processes
}


threads=10
# number of threads used for this program
lot=200
# min number of images for each process(exiftool), tune the number to fit your process


# replace [space], [(], or [)] with [_] in the filename for all the files
rename -n --nopath 's/(\s|\(|\))+/_/g' *    # -n means No action: print names of files to be renamed, but don't rename.
                                            # remove -n if the outcome is correct
                                            # it will rename the files in the current directory
                                           
# to recursively rename the files in the directories                                             
# shopt -s globstar
# rename -n --nopath 's/(\s|\(|\))+/_/g' **  # remove -n if the outcome is correct

mp_exiftool




Many thanks to you all.

Chuck Lee


Phil Harvey

Hi Chuck,

You can speed things up quite a bit more if you keep exiftool running with the -stay_open option and pass the arguments via stdin rather than launching a new instance for each command.
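A minimal sketch of the -stay_open protocol (the file name cmds.txt is invented; here all the arguments are written up front, whereas a real script would keep appending to the file while exiftool runs — each command ends with -execute, and -stay_open False shuts the process down):

```shell
#!/bin/bash
# Write the argument stream for a single long-lived exiftool process.
cat > cmds.txt <<'EOF'
-ver
-execute
-ver
-execute
-stay_open
False
EOF

# Run it only if exiftool is actually installed.
if command -v exiftool >/dev/null 2>&1; then
    exiftool -stay_open True -@ cmds.txt
fi
```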

- Phil

chuck lee

Quote from: chuck lee on July 12, 2023, 05:18:20 AM
Dear all,

I am sorry, but the code will cause a problem when scanning directories:
.
.
.

[[ "$recursive" == "r" ]] && img=(**) || img=(*)
.
.
.
should be

[[ "$recursive" == "r" ]] && img=($(find . -type f -iregex '.*\.\(jpg\|heic\)$' )) || img=($(find . -maxdepth 1 -type f -iregex '.*\.\(jpg\|heic\)$'))



and Phil,

Thanks for your suggestion and I will look into it. 

regards,

Chuck Lee