Feature request: Option to suppress extraction of specific <rdf:Seq> elements

Started by johnrellis, November 15, 2021, 12:35:10 PM


johnrellis

Summary: An option to suppress extraction of specific <rdf:Seq> elements from XMP could significantly speed up searching of Lightroom .xmp sidecar files.

Details:

My Any Filter Lightroom plugin uses ExifTool to search metadata fields of thousands of photos and their XMP sidecars, and some searches of sidecars run over a hundred times slower than they ideally should.

Lightroom stores its develop settings in sidecars. With most common develop settings, the sidecars are small, typically 4KB, and ExifTool searches them on my computer at a rate of 240/sec. But it also stores the coordinates of adjustment brush strokes in <rdf:Seq> elements in sidecars, and in a fully retouched photo, those strokes could consume 100KB, 1MB, or even 20MB.  (See the attached files for examples.) ExifTool searches 100KB sidecars at a low rate of about 5.7/sec, and it takes 19 seconds to search a 20MB file.

The result is that, in a catalog with less than 5% of a user's photos fully retouched, Any Filter/ExifTool's searches can slow down by over 100x.
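As a rough check of that estimate, the arithmetic can be sketched in Python using the rates quoted above (240 small sidecars per second, 5.7 per second for 100KB sidecars, 19 seconds for a 20MB sidecar). The function name and the 5% mixes are illustrative assumptions, not measurements from a real catalog.

```python
def catalog_slowdown(retouched_fraction, slow_seconds_per_file,
                     fast_files_per_second=240.0):
    """Ratio of the average search time per sidecar to the all-small baseline."""
    fast_seconds = 1.0 / fast_files_per_second
    average = (retouched_fraction * slow_seconds_per_file
               + (1.0 - retouched_fraction) * fast_seconds)
    return average / fast_seconds

# Assumed mix: 5% of sidecars are 20MB (19 s each), the rest are small.
print(round(catalog_slowdown(0.05, 19.0)))

# Assumed mix: 5% of sidecars are 100KB (5.7 per second), the rest are small.
print(round(catalog_slowdown(0.05, 1.0 / 5.7), 1))
```

Under these assumptions the 20MB case lands well above 100x, while the 100KB case is only about 3x, so the 100x figure depends on the very large sidecars.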

ExifTool could search sidecars much faster if it didn't have to extract the fields containing the brushstrokes. By default, ExifTool extracts no more than 1000 <rdf:Seq> elements per XML tag, while the -m option extracts all of the elements.

Any Filter has been using "-m -q -q" to downgrade minor errors to warnings and to suppress those. That was causing ExifTool to extract all the points in every brushstroke, even though the brushstrokes themselves aren't being searched.  My timings show that omitting -m would speed up sidecar searching by roughly 1.5 - 2x, depending on the average number of elements per sidecar.

But there are two downsides to omitting -m:

- The first 1000 elements would still be extracted, which is still costly. I fed a bunch of timings into a two-variable regression (file size and number of elements) and estimated the time savings if no elements were extracted. Medium-sized sidecars with fewer than 1000 elements (50KB) would go 2x faster, and larger sidecars with 5,000 elements (261KB) would go 1.25 - 1.5x faster than when extracting a maximum of 1000 elements.

- It would stop the downgrading of minor errors to warnings.  Any Filter was using "-m -q -q" to ignore minor errors, while remaining errors would be elevated to the user. If minor errors aren't downgraded, what kinds of errors would additionally be displayed, and how would Any Filter recognize them? Are there other consequences for using "-q -q" instead of "-m -q -q"?

An option for entirely suppressing the extraction of specific XMP fields would give the most speedup, e.g.

-noExtract MaskGroupBasedCorrectionsCorrectionMasksMasksDabs
-noExtract RetouchAreasMasksDabs

A final note: On Windows, I compared the Oliver Betz/Strawberry Perl version of ExifTool (12.34) with the default ActivePerl version (12.33), and the Strawberry Perl version was consistently less than 1% faster when searching sidecars.

StarGeek

The ability to suppress tags already exists with the --TAG option.  And it allows the use of wildcards, so you could use
--RetouchAreaMask*
to suppress all tags starting with "RetouchAreaMask".

Phil Harvey

StarGeek is correct, but I don't think that will increase processing speed at all.

The root problem here is that Adobe has decided (in their infinite wisdom, and I have complained bitterly to them about this) to mix image editing data in with the metadata.  This is so obviously the wrong place for this information.  And here is one pitfall of this idiotic design decision of theirs.

But we have no choice but to deal with the mess that they have created.

This is tricky because ExifTool must process the metadata to determine the tag name, but this alone will slow things down.  Also, testing each tag to see if it is excluded will slow things down.

This could be done at a slightly higher level by allowing entire XMP namespaces to be ignored.  Perhaps something like --xmp-crs:all would do what you want if it prevented the XMP crs properties from being processed in the first place.  Would this be satisfactory?  I think this is do-able.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Phil Harvey

I have a hybrid solution working that allows XMP to be ignored based on ExifTool family 1 group name and XMP property name. I don't really like this asymmetry, but here is how it is working using your first attached file as an example:

> % time exiftool xmp4.xmp -aaa
Warning: [Minor] Extracted only 1000 crs:MaskGroupBasedCorrectionsCorrectionMasksMasksDabs items. Ignore minor errors to extract all - xmp4.xmp
1.616u 0.020s 0:01.64 99.3% 0+0k 0+0io 0pf+0w
> time exiftool xmp4.xmp -aaa --xmp-crs:all
0.219u 0.016s 0:00.23 95.6% 0+0k 0+0io 0pf+0w
> time exiftool xmp4.xmp -aaa --xmp-crs:dabs
0.904u 0.021s 0:00.92 100.0% 0+0k 0+0io 0pf+0w
> time exiftool xmp4.xmp -aaa --xmp-all:MaskGroupBasedCorrections
0.219u 0.017s 0:00.23 95.6% 0+0k 0+0io 0pf+0w


So the clock time to read this file drops from 1.64 seconds to 0.23 seconds when either ignoring all XMP-crs tags, or any MaskGroupBasedCorrections XMP property.  Ignoring just the Dabs themselves didn't give as much of a speed benefit.
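For anyone wanting to compare the options side by side, here is a minimal Python sketch that parses the timing lines above and computes the speedup of each exclusion relative to the plain run. The helper names are illustrative, and the regular expression assumes the csh-style `time` format shown in this post.

```python
import re

# Timing lines copied from the output above (csh-style `time` format).
TIMINGS = {
    "baseline": "1.616u 0.020s 0:01.64 99.3% 0+0k 0+0io 0pf+0w",
    "--xmp-crs:all": "0.219u 0.016s 0:00.23 95.6% 0+0k 0+0io 0pf+0w",
    "--xmp-crs:dabs": "0.904u 0.021s 0:00.92 100.0% 0+0k 0+0io 0pf+0w",
    "--xmp-all:MaskGroupBasedCorrections": "0.219u 0.017s 0:00.23 95.6% 0+0k 0+0io 0pf+0w",
}

TIME_LINE = re.compile(r"^([\d.]+)u ([\d.]+)s (\d+):([\d.]+) ")

def parse_time_line(line):
    """Return (user_seconds, system_seconds, elapsed_seconds) from one timing line."""
    match = TIME_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized timing line: {line!r}")
    user, system, minutes, seconds = match.groups()
    return float(user), float(system), int(minutes) * 60 + float(seconds)

def speedups(timings, baseline_key="baseline"):
    """Map each non-baseline option to (clock-time speedup, user-CPU speedup)."""
    base_user, _, base_elapsed = parse_time_line(timings[baseline_key])
    result = {}
    for option, line in timings.items():
        if option == baseline_key:
            continue
        user, _, elapsed = parse_time_line(line)
        result[option] = (base_elapsed / elapsed, base_user / user)
    return result

for option, (by_clock, by_cpu) in speedups(TIMINGS).items():
    print(f"{option}: {by_clock:.1f}x by clock time, {by_cpu:.1f}x by user CPU time")
```

The two ratios differ slightly because clock time includes system time and process start-up, while the user CPU figure does not.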

- Phil

johnrellis

Quote: So the clock time to read this file drops from 1.64 seconds to 0.23 seconds when either ignoring all XMP-crs tags, or any MaskGroupBasedCorrections XMP property.  Ignoring just the Dabs themselves didn't give as much of a speed benefit.

Wow, that 7.4x speedup is more than the 2x my naive regression predicted! That would make a big improvement to Any Filter's speed for the commercial photographers who do a lot of retouching.

A couple of questions:

1. Why is there a 4x difference between --xmp-crs:Dabs and --xmp-all:MaskGroupBasedCorrections?  In that test file, there are 15 <crs:Dabs>, but 52,410 elements within the Dabs.  Almost all the bytes and tags of the file are contained within the 15 <rdf:Seq> inside <crs:Dabs>.

2. I assume that multiple exclusions could be specified, e.g.

exiftool --xmp-all:MaskGroupBasedCorrections --xmp-all:RetouchAreas

Phil Harvey

Hi John,

It makes some sense if you look at the number of tags remaining after excluding the Dabs and the MaskGroupBasedCorrections:

> exiftool xmp4.xmp --xmp-crs:dabs | wc -l
     880
> exiftool xmp4.xmp --xmp-all:MaskGroupBasedCorrections | wc -l
     220


So it looks like there is significant overhead in storing the value of a tag and its associated properties.  Actually, excluding the Dabs only saves on a single tag (even though it is a list tag with many elements).

And yes, multiple exclusions are always possible.
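As an illustration of how a calling program might combine several exclusions, here is a minimal Python sketch that assembles the argument list. The helper name `build_exiftool_args` and the file name `photo.xmp` are hypothetical; the exclusion syntax is the one shown in the examples in this thread.

```python
def build_exiftool_args(sidecar_path, excluded_properties, extra_options=()):
    """Assemble an exiftool argument list with one --xmp-all:NAME exclusion per property."""
    args = ["exiftool", *extra_options]
    for name in excluded_properties:
        # Exclusion syntax as shown in this thread, e.g. --xmp-all:RetouchAreas
        args.append(f"--xmp-all:{name}")
    args.append(sidecar_path)
    return args

args = build_exiftool_args(
    "photo.xmp",  # placeholder file name
    ["MaskGroupBasedCorrections", "RetouchAreas"],
    extra_options=["-q", "-q"],
)
print(" ".join(args))
```

The resulting list could then be handed to something like `subprocess.run(args, capture_output=True, text=True)` on a machine where a version of ExifTool with this feature is installed.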

- Phil

johnrellis

Quote: So it looks like there is significant overhead in storing the value of a tag and its associated properties.

That makes sense (and contradicts my initial intuition). Another way of looking at it is to consider the amount of CPU time spent per line of the file:

53179 lines, 1.64 secs
1309 lines, 0.92 secs (excluding all elements of <crs:Dabs>)
285 lines, 0.23 secs (excluding the rest of <crs:MaskGroupBasedCorrections>)

From this, we can calculate that it costs 14 microsecs per Dabs element, and 674 microsecs per line for all the other tags.  So it's doing 50 times as much work for those other lines.  The elements are easy to process, just adding their text to an array (effectively). All the other lines involve storing hierarchical structure.
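The per-line figures can be reproduced with a short Python sketch, taking the differences between successive measurements above. The variable and function names are only illustrative.

```python
# (lines in the file as processed, clock seconds) from the three measurements above
full = (53179, 1.64)
without_dabs = (1309, 0.92)
without_mask_corrections = (285, 0.23)

def cost_per_line_us(larger, smaller):
    """Marginal cost in microseconds per line between two measurements."""
    lines = larger[0] - smaller[0]
    seconds = larger[1] - smaller[1]
    return seconds / lines * 1e6

dabs_cost = cost_per_line_us(full, without_dabs)
other_cost = cost_per_line_us(without_dabs, without_mask_corrections)

print(f"Dabs elements: {dabs_cost:.0f} microsecs per line")
print(f"Other MaskGroupBasedCorrections lines: {other_cost:.0f} microsecs per line")
print(f"Ratio: {other_cost / dabs_cost:.0f}x")
```

The ratio comes out a little under 50, which matches the rough "50 times" figure given the rounding in the timings.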

Phil Harvey

Hi John,

ExifTool 12.36 is now available with this new feature.

- Phil