ExifTool Forum

ExifTool => Newbies => Topic started by: Pantik on November 04, 2016, 07:17:29 PM

Title: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 04, 2016, 07:17:29 PM
Hi guys =) I have 1000 files with descriptions that contain repeated words (in XPSubject, I think), like this:

Cartoon horse on white background. Cartoon horse vector illustration. Cartoon cute horse farm animals happy mane stallion character design.

I need to delete (just cut) all duplicates of "Cartoon horse", but the first occurrence must be kept. The final result I need is:

Cartoon horse on white background. vector illustration. cute farm animals happy mane stallion character design.

Any ideas?
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Phil Harvey on November 05, 2016, 10:20:33 AM
What algorithm would you use to decide which words to remove?  I could imagine a way to remove all duplicate words, but then you would have words like "the" removed too, which you may want duplicates of.

- Phil
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 05, 2016, 11:06:04 AM
Quote from: Phil Harvey on November 05, 2016, 10:20:33 AM
What algorithm would you use to decide which words to remove?  I could imagine a way to remove all duplicate words, but then you would have words like "the" removed too, which you may want duplicates of.
- Phil

Hi, Phil. Words like "the", "is", "are" and so on don't matter. The main goal is to remove duplicated words, keeping only the first occurrence of each.
First of all, I am not a programmer, I am a beginner  :) So, as I understand it, I need a function like this Java one, but for ExifTool:

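import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class FindDuplicateWordsInText {

    // Returns the set of words that occur more than once in the text.
    public static Set<String> findDuplicateWordsInText(String text) {
        String[] words = text.split(" ");
        Set<String> duplicatesRemovedSet = new HashSet<>();
        // Set.add returns false for a word that was already added, so the filter keeps only repeats.
        Set<String> duplicatesSet = Arrays.stream(words).filter(string -> !duplicatesRemovedSet.add(string))
                .collect(Collectors.toSet());
        return duplicatesSet;
    }
}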


Another option is to export the metadata to temporary files, but that is still difficult for me  :o

I hope ExifTool has a function for this.
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 05, 2016, 11:07:58 AM
And thanks for your great product. I see many people use ExifTool 👍 :)
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 05, 2016, 11:53:34 AM
Another solution to the problem is to cut everything after the first "." character.

Was
Cartoon horse on white background. Cartoon horse vector illustration. Cartoon cute horse farm animals happy mane stallion character design.

Need
Cartoon horse on white background

This variant is good too  :D
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Phil Harvey on November 05, 2016, 03:38:07 PM
Cutting everything from the first "." onward is easy:

exiftool "-imagedescription<${imagedescription;s/\..*//}" DIR
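
For readers more at home in Java than Perl, here is a rough sketch of what the substitution s/\..*// in that command does to a single description string. The class and method names are made up for illustration, and this only mimics the string handling; it does not read or write any metadata.

```java
class CutAfterFirstPeriod {
    // Mirrors the Perl substitution s/\..*//: remove the first "." and
    // everything that follows it on the same line.
    public static String cutAfterFirstPeriod(String description) {
        return description.replaceFirst("\\..*", "");
    }

    public static void main(String[] args) {
        String before = "Cartoon horse on white background. Cartoon horse vector illustration.";
        // Prints: Cartoon horse on white background
        System.out.println(cutAfterFirstPeriod(before));
    }
}
```

Note that the period itself is removed too, which matches the "Need" example earlier in the thread.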

Removing duplicates is trickier:

exiftool "-imagedescription<${imagedescription;my (@a,%h);$h{lc $_} or push(@a,$_),$h{lc $_}=1 foreach split;$_=join ' ',@a}" DIR

- Phil
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 05, 2016, 04:55:38 PM
Thank you, Phil!
The first variant works, the second does not - no file specified
As I understand it, my is your own function?  :)

exiftool "-imagedescription<${imagedescription;my (@a,%h);$h{lc $_} or push(@a,$_),$h{lc $_}=1 foreach split;$_=join ' ',@a}" DIR

Anyway, I am trying to understand it, but I am still a slowpoke 8)
Title: Re: Remove duplicates (repeatet) words from image description
Post by: StarGeek on November 05, 2016, 05:35:28 PM
Quote from: Pantik on November 05, 2016, 04:55:38 PM
As I understand it, my is your own function?  :)

In this context, my declares the array @a and the hash %h.  Everything from the my to the closing brace is Perl code.
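
Since the original question included a Java snippet, the logic of that Perl expression can be sketched in Java roughly as follows. This is only an illustration of the string handling under my reading of the one-liner (split on whitespace, compare words case-insensitively, keep the first occurrence, rejoin with single spaces). The class and method names are hypothetical, and it does not touch any image metadata.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class RemoveRepeatedWords {
    // Keeps the first occurrence of each whitespace-separated word and drops
    // later repeats, comparing case-insensitively (the role of lc in the Perl).
    public static String removeRepeatedWords(String text) {
        Set<String> seen = new HashSet<>();    // plays the role of %h
        List<String> kept = new ArrayList<>(); // plays the role of @a
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens when the input is blank
            }
            // Set.add returns true only the first time a word is seen.
            if (seen.add(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String before = "Cartoon horse on white background. Cartoon horse vector illustration. "
                + "Cartoon cute horse farm animals happy mane stallion character design.";
        System.out.println(removeRepeatedWords(before));
    }
}
```

Unlike the Java snippet posted earlier, which returns the set of repeated words, this keeps the text and its word order. Punctuation stays attached to its word, so "background." and "background" would count as different words.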

Quote: the second does not - no file specified

Are you sure you remembered to replace DIR with the file or directory?  Or did you make sure to copy the quotes correctly?  This error indicates that a file to process was not included.
Title: Re: Remove duplicates (repeatet) words from image description
Post by: Pantik on November 05, 2016, 06:36:24 PM
Quote from: StarGeek on November 05, 2016, 05:35:28 PM
Are you sure you remembered to replace DIR with the file or directory?  Or did you make sure to copy the quotes correctly?  This error indicates that a file to process was not included.

Yes, my code is
exiftool "-XPSubject<${XPSubject; my(@a,%h); $h{lc $_} or push(@a,$_),$h{lc $_}=1 foreach split; $_=join ' ', @a}" C:\-\ -overwrite_original -r -k

The first command (the one that removes everything from the first ".") works with this same directory.

P.S. Thank you for explaining, I'll try to learn it!  :D