ExifTool Forum

ExifTool => Newbies => Topic started by: emalvick on May 22, 2014, 06:07:25 PM

Title: Cleaning up Hierarchical Keywords
Post by: emalvick on May 22, 2014, 06:07:25 PM
I have some images that over time have had their hierarchical keywords contaminated with "non-hierarchical" keywords:

e.g.

XMP:HierarchicalSubject   =   A|B|C, D, A|D|E, etc. 

in different variations.

I essentially want to remove the keywords that do not contain a | symbol, since the | is what marks a keyword as part of the hierarchy.

Is there a way to do this?  I've seen a lot of examples regarding specific keywords, but I suspect I would need regular expressions, and I'm not sure if it is possible to use regular expressions to match existing keywords for removal (or exclude from removal).


For instance, would it be possible to remove all keywords that begin with "A|" by using something like:

-XMP:HierarchicalSubject-="/^A\|/"   

Is using a regular expression in some way like that valid? (I couldn't easily find it in the FAQ or anywhere else.)

And, what about the opposite... i.e. deleting keywords except ones that match a regular expression?

If this is too complicated, it's ok.  I'm searching a lot, but not finding much, which makes me think it isn't likely to be possible.
Title: Re: Cleaning up Hierarchical Keywords
Post by: Phil Harvey on May 22, 2014, 08:51:09 PM
I think that this may do what you want:

exiftool "-keywords<${keywords;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

where FILE is one or more file and/or directory names.  Use single quotes instead of double quotes if you are on Mac or Linux, so that the shell does not try to expand the ${...} expression itself.

To do this I have used the advanced formatting feature, which accepts regular expressions, but isn't for the faint of heart.

In this case, to do the opposite, you could use [^,]*\|[^,]* instead of [^,|]+ in the expression, although you may need to double the backslash depending on what shell you are using.

- Phil
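
To see which keywords each of the two character classes picks out, here is a minimal Python sketch. It is not part of the ExifTool command: Python's re module stands in for the Perl regular expressions, the name classify is made up for illustration, and each pattern is applied to a single keyword rather than to the joined string.

```python
import re

# The two character classes from the post. fullmatch() tests one keyword
# at a time, so no separators are involved here.
FLAT = re.compile(r"[^,|]+")                 # no comma and no "|"
HIERARCHICAL = re.compile(r"[^,]*\|[^,]*")   # contains at least one "|"

def classify(keyword):
    """Return 'flat', 'hierarchical', or 'neither' for a single keyword."""
    if FLAT.fullmatch(keyword):
        return "flat"
    if HIERARCHICAL.fullmatch(keyword):
        return "hierarchical"
    return "neither"

# Sample keywords taken from the example in the question.
sample = ["A|B|C", "D", "A|D|E"]
labels = {k: classify(k) for k in sample}
print(labels)
# {'A|B|C': 'hierarchical', 'D': 'flat', 'A|D|E': 'hierarchical'}
```

Swapping one class for the other in the substitution therefore swaps which kind of keyword gets deleted.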
Title: Re: Cleaning up Hierarchical Keywords
Post by: emalvick on May 22, 2014, 09:52:52 PM
Thanks.  I'll check it out.  I can see the idea, but I'll have to read up a bit to understand.  I guess I also need to look at the advanced formatting feature.  I think I saw you post something similar in another thread, but I didn't really know what it was until your post here.  I'll report back once I get this working (or not).

Title: Re: Cleaning up Hierarchical Keywords
Post by: emalvick on May 22, 2014, 11:31:49 PM
Ok

I'm using something like this:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

This mostly works, but it only seems to remove one of the offending keywords at a time.  However, it does remove the correct keywords, and if I run it multiple times on a file it will remove all the correct keywords, leaving the desired ones in place.

Is there something I should add or remove to get this done in one pass?

Now, I'm also trying to figure out exactly what is going on to learn the program... 

It looks like the code you provided is looking for two things (thus the two s/// operators).
1. It looks like it is looking for the keywords with the | and if it finds one, keeps it. 
2. It looks for the keywords that don't have the | and replaces them with nothing (thus making them blank)... 

I guess the /g in the first s/// operation is there to match each instance of |word| (or variation) that could occur.

Title: Re: Cleaning up Hierarchical Keywords
Post by: Phil Harvey on May 23, 2014, 07:22:21 AM
Quote from: emalvick on May 22, 2014, 11:31:49 PM
exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(, |$)/$1/g;s/, $//}" -sep ", " FILE

Ah, right.  HierarchicalSubject.

Quote: This mostly works, but it only seems to remove one of the offending keywords at a time.

Try this:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+//g;s/^, //}" -sep ", " FILE

Quote: It looks like the code you provided is looking for two things (thus the two s/// operators).
1. It looks like it is looking for the keywords with the | and if it finds one, keeps it.

Close.  It is looking for the keywords without |, and then deletes them.

Quote: 2. It looks for the keywords that don't have the | and replaces them with nothing (thus making them blank)...

No.  The second expression was there to remove any dangling ", " at the end.  I have now changed it to remove a dangling ", " from the beginning instead, because the new expression leaves it there rather than at the end.  You need to worry about the edge cases and clean up any unnecessary separators that remain after deleting a keyword from the beginning or end of the string.

Quote: I guess the /g in the first s/// operation is there to match each instance of |word| (or variation) that could occur.

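Yes, but it wasn't finding all of them because the expression matched the separators on both sides of a keyword.  After one match, the separator in front of the next keyword had already been consumed, so it would skip every other keyword in a run of adjacent matches.  I have tried to fix this with the new command.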

- Phil
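
The skipping can be reproduced outside ExifTool. What follows is a rough Python sketch, assuming that Python's re.sub behaves like Perl's s///g for this pattern (both scan left to right and do not reuse characters that an earlier match consumed). The function name first_attempt and the sample value are made up for illustration.

```python
import re

def first_attempt(value):
    """Python stand-in for: s/(^|, )[^,|]+(, |$)/$1/g; s/, $//"""
    # The match includes the separator on BOTH sides of the keyword.
    value = re.sub(r"(^|, )[^,|]+(, |$)", r"\1", value)
    # Clean up a separator left dangling at the end.
    return re.sub(r", $", "", value)

# Two flat keywords ("D" and "E") sit next to each other.
value = "A|B|C, D, E, A|D|E"

once = first_attempt(value)
twice = first_attempt(once)

print(once)   # A|B|C, E, A|D|E   ("E" survived the first pass)
print(twice)  # A|B|C, A|D|E
```

Matching ", D, " uses up the ", " that "E" would need in front of it, so "E" is only removed on the second run.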
Title: Re: Cleaning up Hierarchical Keywords
Post by: emalvick on May 23, 2014, 10:14:37 AM
Thanks... It's making more sense now.

The RegEx here is a little different than what I'm used to using, but I'm getting it.  Thanks.
Title: Re: Cleaning up Hierarchical Keywords
Post by: Phil Harvey on May 23, 2014, 10:54:32 AM
Darn, I realized that my 2nd attempt won't work either.  This certainly is a bit tricky.  I removed the check for the terminating separator, so now it can remove part of a keyword (for example, the leading "A" of "A|B|C").  Ouch.

I had to consult my Perl book to figure this out:

exiftool "-HierarchicalSubject<${HierarchicalSubject;s/(^|, )[^,|]+(?=(, |$))//g;s/^, //}" -sep ", " FILE

I think this has a better chance of working, but unfortunately I don't have time to test it out right now.
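
For comparison, here is a small Python sketch of the 2nd attempt and the lookahead version side by side. It is an illustration only: Python's re.sub stands in for the Perl substitutions, and the function names and sample values are hypothetical.

```python
import re

def second_attempt(value):
    """Stand-in for: s/(^|, )[^,|]+//g; s/^, //"""
    # No check for what follows the keyword, so a prefix such as "A"
    # of "A|B|C" also matches.
    value = re.sub(r"(^|, )[^,|]+", "", value)
    return re.sub(r"^, ", "", value)

def third_attempt(value):
    """Stand-in for: s/(^|, )[^,|]+(?=(, |$))//g; s/^, //"""
    # The lookahead (?=...) requires a separator or the end of the string
    # after the keyword, but does not consume it.
    value = re.sub(r"(^|, )[^,|]+(?=(, |$))", "", value)
    return re.sub(r"^, ", "", value)

value = "A|B|C, D, E, A|D|E"

print(second_attempt(value))         # |B|C|D|E
print(third_attempt(value))          # A|B|C, A|D|E
print(third_attempt("D, A|B|C, E"))  # A|B|C
```

Because the lookahead leaves the trailing ", " in place, it is still available as the leading separator of the next keyword, which is what lets adjacent flat keywords be removed in a single pass.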

Perhaps a simpler alternative is to do this with a user-defined Composite tag, since then the filtering is more straightforward.  The disadvantage is that you need to use a config file.  You should be able to search the forum for "UserDefined Keywords" for some examples.

- Phil
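
The reason filtering is more straightforward in that approach can be sketched as follows. This is Python rather than an ExifTool config file (which would be written in Perl), and it assumes the keywords are handled as a list of separate values instead of one joined string. The function names are hypothetical.

```python
def keep_hierarchical(keywords):
    """Keep only the keywords that contain the "|" hierarchy separator."""
    return [k for k in keywords if "|" in k]

def keep_flat(keywords):
    """The opposite: keep only the keywords without a "|"."""
    return [k for k in keywords if "|" not in k]

# Sample values taken from the example in the question.
keywords = ["A|B|C", "D", "A|D|E"]

print(keep_hierarchical(keywords))  # ['A|B|C', 'A|D|E']
print(keep_flat(keywords))          # ['D']
```

With a list there are no ", " separators to match or clean up, so the edge cases at the beginning and end of the string disappear.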
Title: Re: Cleaning up Hierarchical Keywords
Post by: emalvick on May 23, 2014, 01:33:50 PM
That last version worked.  I hadn't even tried the previous version.

I had thought about trying a custom field, suspecting that what I wanted to do might not be possible.

Thanks for your help.