Cleaning up Hierarchical Keywords

Started by emalvick, May 22, 2014, 06:07:25 PM


emalvick

I have some images whose hierarchical keywords have, over time, become contaminated with "non-hierarchical" keywords:

e.g.

XMP:HierarchicalSubject   =   A|B|C, D, A|D|E, etc. 

in different variations.

I essentially want to remove the keywords that do not contain a | symbol, since the | is what marks a keyword as part of the hierarchy.

Is there a way to do this?  I've seen a lot of examples regarding specific keywords, but I suspect I would need regular expressions, and I'm not sure if it is possible to use regular expressions to match existing keywords for removal (or exclude from removal).


For instance, would it be possible to remove all keywords that include "A|" by using something like:

-XMP:HierarchicalSubject-="/^A\|/"   

Is using a regular expression in some way like that valid?  (I couldn't easily find it in the FAQ or anywhere else.)

And, what about the opposite... i.e. deleting keywords except ones that match a regular expression?

If this is too complicated, it's ok.  I'm searching a lot, but not finding much, which makes me think it isn't likely to be possible.

Phil Harvey

I think that this may do what you want:

exiftool "-keywords<${keywords;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

where FILE is one or more file and/or directory names.  Use single quotes instead of double quotes if you are on Mac or Linux.

To do this I have used the advanced formatting feature, which accepts regular expressions, but isn't for the faint of heart.

In this case, to do the opposite, you could use [^,]*\|[^,]* instead of [^,|]+ in the expression, although you may need to double the backslash depending on what shell you are using.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

emalvick

Thanks.  I'll check it out.  I can see the idea, but I'll have to read up a bit to understand.  I guess I also need to look at the advanced formatting feature.  I think I saw you post something similar in another thread, but I didn't really know what it was until your post here.  I'll report back once I get this working (or not).


emalvick

Ok

I'm using something like this:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

This mostly works, but it only seems to remove one of the offending keywords at a time.  However, it does remove the right ones, and if I run it multiple times on a file it eventually removes all of the unwanted keywords, leaving the desired ones in place.

Is there something I should add or remove to get this done in one pass?

Now, I'm also trying to figure out exactly what is going on to learn the program... 

It looks like the code you provided is looking for two things (thus the two /s operators). 
1. It looks like it is looking for the keywords with the | and if it finds one, keeps it. 
2. It looks for the keywords that don't have the | and replaces them with nothing (thus making them blank)... 

I guess the /g in the first /s operation is there to match each instance of |word| (or variation) that could occur.


Phil Harvey

Quote from: emalvick on May 22, 2014, 11:31:49 PM
exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

Ah, right.  HierarchicalSubject.

Quote: This mostly works, but it only seems to remove one of the offending keywords at a time.

Try this:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+//g;s/^, //}" -sep ", " FILE

Quote: It looks like the code you provided is looking for two things (thus the two /s operators).
1. It looks like it is looking for the keywords with the | and if it finds one, keeps it.

Close.  It is looking for the keywords without |, and then deletes them.

Quote: 2. It looks for the keywords that don't have the | and replaces them with nothing (thus making them blank)...

No.  The second expression was there to remove any dangling ", " at the end.  I have now changed it to remove the separator from the beginning instead, because the new expression leaves it there rather than at the end.  You need to worry about the edge cases and clean up any unnecessary separators that remain after deleting a keyword from the beginning or end of the string.

Quote: I guess the /g in the first /s operation is there to match each instance of |word| (or variation) that could occur.

Yes, but it wasn't finding all of them because the expression included the separators on both sides, so it would skip every other keyword in between.  I have tried to fix this with the new command.
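
The skipping can be reproduced outside ExifTool. The sketch below uses Python's re module as a stand-in for the Perl substitution (for this particular pattern the two engines should behave the same way). The function name first_attempt and the sample keyword string are made up for illustration.

```python
import re

# Expression from the first command: the match consumes the
# separator on BOTH sides of the keyword.
FIRST = re.compile(r"(^|, )[^,|]+(, |$)")

def first_attempt(keywords: str) -> str:
    # Mimics s/(^|, )[^,|]+(, |$)/$1/g;s/, $// on a ', '-joined string.
    out = FIRST.sub(r"\1", keywords)   # put back only the leading separator
    return re.sub(r", $", "", out)     # drop a dangling separator at the end

sample = "A|B, C, D, E, F|G"           # hypothetical keyword list
once = first_attempt(sample)           # first run of the command
twice = first_attempt(once)            # second run on the result
print(once)
print(twice)
```

After ", C, " is matched, the search resumes at "D", so the ", " that would have introduced D is already used up and D is skipped. A second run picks it up, which matches the behaviour reported above.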

- Phil

emalvick

Thanks... It's making more sense now.

The RegEx here is a little different than what I'm used to using, but I'm getting it.  Thanks.

Phil Harvey

Darn,  I realized that my 2nd attempt won't work either.  This certainly is a bit tricky.  I removed the check for the terminating separator, so now it can remove part of a keyword.  Ouch.

I had to consult my Perl book to figure this out:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(?=(, |$))//g;s/^, //}" -sep ", " FILE

I think this has a better chance of working, but unfortunately I don't have time to test it out right now.
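
For anyone who wants to check the difference without touching image files, here is a rough Python sketch of the second attempt and the lookahead version side by side. Python's re module is only a stand-in for the Perl expressions, and the function names and sample strings are invented for the illustration. The last function is an assumption: it applies the "opposite" substitution mentioned earlier ([^,]*\|[^,]* in place of [^,|]+) to the lookahead form, which was not posted in this thread.

```python
import re

def second_attempt(keywords: str) -> str:
    # Mimics s/(^|, )[^,|]+//g;s/^, //  (nothing checks what follows the match)
    out = re.sub(r"(^|, )[^,|]+", "", keywords)
    return re.sub(r"^, ", "", out)

def lookahead_version(keywords: str) -> str:
    # Mimics s/(^|, )[^,|]+(?=(, |$))//g;s/^, //
    # The lookahead requires a separator or end of string after the keyword
    # but does not consume it, so the next keyword can still be matched.
    out = re.sub(r"(^|, )[^,|]+(?=(, |$))", "", keywords)
    return re.sub(r"^, ", "", out)

def drop_hierarchical(keywords: str) -> str:
    # ASSUMPTION: the "opposite" filter, built by swapping [^,]*\|[^,]*
    # into the lookahead form; removes keywords that DO contain a |.
    out = re.sub(r"(^|, )[^,]*\|[^,]*(?=(, |$))", "", keywords)
    return re.sub(r"^, ", "", out)

sample = "A|B|C, D, E, A|D|E"          # hypothetical keyword list
print(second_attempt(sample))          # eats the first level of each hierarchy
print(lookahead_version(sample))       # keeps only keywords containing |
print(drop_hierarchical(sample))       # keeps only keywords without |
```

On the sample string the second attempt strips the leading "A" from the hierarchical keywords, which is the partial-keyword removal described above, while the lookahead version removes D and E in a single pass.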

Perhaps a simpler alternative is to do this with a user-defined Composite tag, since then the filtering is more straightforward.  The disadvantage is that you need to use a config file.  You should be able to search the forum for "UserDefined Keywords" for some examples.
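
To show why filtering is more straightforward when the keywords are available as a list rather than one joined string, here is the idea in Python. This is only a sketch of the logic, not ExifTool config-file syntax, and the name keep_hierarchical is hypothetical.

```python
def keep_hierarchical(keywords: list[str]) -> list[str]:
    # Keep only the entries that contain the hierarchy separator.
    return [k for k in keywords if "|" in k]

items = "A|B|C, D, E, A|D|E".split(", ")   # hypothetical keyword list
print(", ".join(keep_hierarchical(items)))
```

With a list there are no separators to clean up, so the edge cases at the beginning and end of the string disappear.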

- Phil

emalvick

That last version worked.  I hadn't even tried the previous version.

I had thought about trying a custom field suspecting that what I wanted to do might not be possible.

Thanks for your help.