Ability to search for a specific value across all tags?

Started by Athlete, June 06, 2025, 05:16:36 AM

Athlete

Day 2 of my induction/introduction to this software.

I have a large number of PDFs (>=10,000) whose metadata I would like to search, to find whether, and where, a search string exists, and to report a) the filename and b) either the tag name(s) that contain the string, together with their content, and/or c) all the tags for the files that match. Is this feasible?

The type of question I am trying to answer is something like:

"What files refer to 'Ennerdale' in their metadata, which tags do so, and what is their current content/context that relates to 'Ennerdale'?"

greybeard

On a Mac I would do the following:

exiftool -a -G1 -s -ext pdf . | egrep "Ennerdale|====="

Athlete

Thank you. I should have stated that I am running on a Windows platform.

I will have to look up how the command you provided needs to change to get the same result on Windows.

What is "egrep"?

Any pointers would be most welcome.

greybeard

Quote from: Athlete on June 06, 2025, 07:21:12 AM
Thank you. I should have stated that I am running on a Windows platform.

I will have to look up what the impact on the code you provide would be to get the same when running Windows.

What is "egrep"?

Any pointers would be most welcome.

I don't know about Windows. egrep is a command that runs on macOS or Linux and lets you filter output using a regular expression.

So I am piping the complete metadata from all of the PDF files to the egrep filter, which displays the name of every PDF file (regardless of whether there is a match), followed by the tag name and value for each tag that contains the "Ennerdale" string.

I would add -r to the command if the pdf files were contained within multiple sub-folders.
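
For readers without egrep (for example on Windows), the same line filter can be sketched in Python. This is only an illustration, not part of exiftool: the function name filter_lines and the file name tags.txt are made up for the example, and it assumes Python 3 is installed and that the exiftool output was first saved to a text file.

```python
import re


def filter_lines(text, pattern):
    """Keep only the lines of `text` that match the regular expression `pattern`.

    Like egrep, this tests the whole line, so a hit in a tag or group
    name counts as a match too, and the match is case-sensitive.
    """
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]


# Hypothetical usage, after running:
#   exiftool -a -G1 -s -ext pdf . > tags.txt
#
# with open("tags.txt", encoding="utf-8", errors="replace") as fh:
#     for line in filter_lines(fh.read(), r"Ennerdale|====="):
#         print(line)
```

The pattern "Ennerdale|=====" keeps the "========" header line that exiftool prints before each file, plus any line containing "Ennerdale".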

Athlete

Thank you. As far as I am aware "egrep" is not a command/function within Windows.

StarGeek

Unfortunately, it's not possible to do exactly as you say. There's no option to print only the tags that match a pattern. You would still end up printing most or all of the tags.

Try this (remove the i from "/Ennerdale/i" if you want a case-sensitive match):
exiftool -G1 -a -s -ext pdf -if "$All:All=~/Ennerdale/i" -PDF:All -XMP:All /path/to/files/

This will print out all PDF and XMP tags in a pdf if exiftool finds "Ennerdale" (case-insensitive) in any of the tags. From there, you would have to search the output to find exactly which tag contains "Ennerdale".

Exiftool's processing of PDF can be slow sometimes, especially when the PDF file is encrypted.
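
One way to do that last step, finding exactly which tag contains the string, is to have exiftool write JSON and filter it afterwards. The following is a minimal Python sketch, not an exiftool feature: read_metadata and find_matches are hypothetical helper names, and it assumes Python 3 and that exiftool is on the PATH. The -j option is exiftool's JSON output; with -G1 each key is prefixed with its group (such as PDF: or XMP-dc:).

```python
import json
import re
import subprocess


def read_metadata(directory):
    """Run exiftool on every PDF in `directory` and return the parsed JSON (a list of dicts)."""
    # Arguments are passed as a list, so no shell quoting is involved.
    result = subprocess.run(
        ["exiftool", "-j", "-G1", "-a", "-ext", "pdf",
         "-PDF:All", "-XMP:All", directory],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return json.loads(result.stdout) if result.stdout.strip() else []


def find_matches(records, term):
    """Return (file, tag, value) for every tag whose value contains `term`, ignoring case."""
    rx = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for record in records:
        source = record.get("SourceFile", "")
        for tag, value in record.items():
            if tag == "SourceFile":
                continue
            text = str(value)  # numbers and lists are compared as text
            if rx.search(text):
                hits.append((source, tag, text))
    return hits


# Hypothetical usage:
# for source, tag, value in find_matches(read_metadata("/path/to/files/"), "Ennerdale"):
#     print(source, tag, value)
```

Only tag values are tested here, so a tag or group name that happens to contain the search string does not produce a hit.
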
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

Athlete

Thank you. Can you point me in the direction of understanding the $All:All component?

Is it simply that you are searching across all tags?

Can that be refined to search just the PDF ones by using the $PDF:All syntax?

Having searched around to try to answer my original post, I came across the term "filter" as an API option. Does this functionality mimic that of -if?

StarGeek

Quote from: Athlete on June 06, 2025, 12:44:31 PM
Thank you. Can you point me in the direction of understanding the $All:All component?

From the docs on the -TAG option
Quote: A special tag name of All may be used to indicate all meta information (ie. -All)

Quote: Is it simply that you are searching across all tags?

Yes. It runs a Perl regular expression (RegEx) match against the values of all the tags.

Quote: Can that be refined to search just the PDF ones by using the $PDF:All syntax?

It doesn't look like it. It would work with XMP tags (and you want to include XMP tags because that is the more up-to-date PDF standard), but XMP:All is handled a bit differently than PDF:All.

Example:
C:\>exiftool -G1 -a -s -if "$all:all=~/Adobe Acrobat/i" -pdf:all test.pdf
[PDF]           PDFVersion                      : 1.6
[PDF]           Linearized                      : Yes
[PDF]           CreateDate                      : 2011:04:07 20:51:13-05:00
[PDF]           Creator                         : Adobe Acrobat 10.0
[PDF]           ModifyDate                      : 2012:09:06 13:46:21-05:00
[PDF]           Producer                        : Adobe Acrobat 10.0 Paper Capture Plug-in
[PDF]           PageCount                       : 66

C:\>exiftool -G1 -a -s -if "$pdf:all=~/Adobe Acrobat/i" -pdf:all test.pdf
    1 files failed condition

Quote: Having searched around to try to answer my original post, I came across the term "filter" as an API option. Does this functionality mimic that of -if?

No. The -api Filter option applies a bit of Perl code to all the tags in the file. A common use of this would be to replace some characters or words with others in multiple tags. For example, you might want to replace all Carriage Returns/Line Feeds with Line Feeds if the output is on a Linux/Mac system.

There's a way to use it with the -if option to find files that match things, but it doesn't work with PDF:All

Phil Harvey

Quote from: StarGeek on June 06, 2025, 02:01:05 PM
XMP:All is handled a bit differently than PDF:All

I don't understand this statement.  In a -if expression, any $GROUP:all variable should evaluate to 1 if any tag exists in that group.  From the -p option documentation:

            When "All" is used as a tag name, a
            value of 1 is returned if any tag exists in the specified group,
            or 0 otherwise (unless the "All" group is also specified, in which
            case the values of all matching tags are joined).


So you may return the values of all PDF tags as you wanted by using $all:pdf:all.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

StarGeek

Quote from: Phil Harvey on June 06, 2025, 03:13:16 PM
Quote from: StarGeek on June 06, 2025, 02:01:05 PM
XMP:All is handled a bit differently than PDF:All

I don't understand this statement.  In a -if expression, any $GROUP:all variable should evaluate to 1 if any tag exists in that group.

Sorry, you're correct. I thought I had used something that allowed a comparison against the XMP data in bulk, but I can't figure out what I may have done, so I must have been wrong.

Quote: So you may return the values of all PDF tags as you wanted by using $all:pdf:all.

So this -if option would work?
-if "$All:xmp:all=~/Search Term/i or $all:pdf:all=~/Search Term/i"

Phil Harvey

StarGeek

I think I figured out what led me to my error. I was trying a lot of things out and one thing I tried was
-api "filter=s/Adobe Acrobat//i" -if "$XMP# ne $XMP"

That would change the value from Binary data XXXX bytes to Binary data YYYY bytes and register as true. But since there isn't a similar tag for PDF, that was failing. I just forgot that I wasn't using XMP:All.

Phil Harvey

Ah.  Interesting way to do this.  Using $all:xmp:all=~/Adobe Acrobat/ is maybe a bit more straightforward, but note that this was a fairly recent feature:

Jan. 23, 2024 - Version 12.74
  - Enhanced tag name strings (eg. -if and -p option arguments) to allow values
    of multiple matching tags to be concatenated when a group name of "All" is
    specified


- Phil
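
Because the concatenating "All" group only exists from version 12.74 on, a script that depends on it could check the installed version first. A small Python sketch, with the function name supports_all_group made up for illustration; it assumes the text comes from exiftool -ver, which prints a plain decimal number such as 12.74.

```python
def supports_all_group(version_text):
    """Return True if an `exiftool -ver` string is 12.74 or later.

    Assumption: the version is a plain decimal number (for example
    "12.74" or "13.10"), so it can be compared numerically.
    """
    return float(version_text.strip()) >= 12.74


# Hypothetical usage, reading the version from the installed exiftool:
# import subprocess
# version = subprocess.run(["exiftool", "-ver"], capture_output=True, text=True).stdout
# print(supports_all_group(version))
```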

Athlete

Sorry for the late reply of "Thanks" as I have been offline for a couple of days.

I appreciate the answers and the insight into your "coding", but that was way beyond my level of understanding.

Should you be able to include the ability "to print only the tags that match a pattern" as a future enhancement, that would be an asset for me.

greybeard

This is fascinating - I'll make a note of this technique.

Much better than my first response, which doesn't strictly work because it also matches the search string in the tag and group names.