Need to remove a comma from +80.000 jpgs containing a comma in <mwg-rs:Name>

Started by brunos, February 03, 2020, 01:48:42 PM

brunos

First things first: sorry to be a pest and unable to find a solution on my own!

My archive of animal advocacy events and related photos in Italy already contains more than 80 thousand JPGs. I'm using Picasa 3.9 as the archive's frontend, and I'm adding face tags for all the known activists. The tags are stored both in the Picasa database and in the photo itself - in the internal XMP under <mwg-rs:Regions rdf:parseType="Resource">.

The typical face tag name is "code Firstname Surname, City, Association", but often also "code Firstname Surname,City,Association" or "code Firstname Surname City Association" - without commas. Since they were all typed manually, the consistency of the face tag names is unfortunately low.

A horrible discovery happened today while rebuilding a somehow-corrupted Picasa database for the first time in 5 years: many of the face tags didn't load as expected. Instead of getting the face tag name straight from the photo, I got "unnamed person" for thousands of photos. I stopped the database rebuild and then carefully re-checked the XMPs, identifying some photos whose face tag name is recognized automatically and some where it doesn't load. What I found out comparing the XMP is that all the images where the face tag name won't load automatically contain one or more commas. Looking more closely at Picasa's properties of multi-face photos, I understood it was my fault: about one year ago I started naming activists with the "comma" syntax. I never got that Picasa uses the comma as a delimiter between faces on multi-person photos. It doesn't raise an error: it just doesn't autoload the face tag. I had noticed some cases before, but I believed they were one-off mistakes of mine - I never realized that it was a rule, and not an exception.

Then I did a manual test on a small set of photos in an ad-hoc Picasa database created from scratch: I loaded the photos "as-is" in Picasa, modified the persons' names by removing the commas, removed the photos from Picasa, and imported them back - without commas it works! #@!

Now, as my main database is broken and the rebuild would take another 24+ hours (the files are on a mirrored disk array, not the fastest one in the world), I'm looking for a way to remove commas from <mwg-rs:Name>face tag name with commas</mwg-rs:Name> BEFORE attempting to rebuild the database. A real-life example of the line is <mwg-rs:Name>aN0F Rina Xenia Nannini, Alzate</mwg-rs:Name>. I was looking at how to do it with ExifTool launched from the command line on 17 thousand folders, all under the same root, but I'm afraid of breaking the XMPs - the photos take up more than 150 GB and I don't have enough space to back them up.

The XMP section looks like this:
         <mwg-rs:Regions rdf:parseType="Resource">
            <mwg-rs:AppliedToDimensions rdf:parseType="Resource">
               <stDim:w>5184</stDim:w>
               <stDim:h>3456</stDim:h>
               <stDim:unit>pixel</stDim:unit>
            </mwg-rs:AppliedToDimensions>
            <mwg-rs:RegionList>
               <rdf:Bag>
                  <rdf:li rdf:parseType="Resource">
                     <mwg-rs:Name>aN0F Rina Xenia Nannini, Alzate</mwg-rs:Name>
                     <mwg-rs:Type>Face</mwg-rs:Type>
                     <mwg-rs:Area rdf:parseType="Resource">
                        <stArea:x>0.640336</stArea:x>
                        <stArea:y>0.373553</stArea:y>
                        <stArea:w>0.276813</stArea:w>
                        <stArea:h>0.497685</stArea:h>
                        <stArea:unit>normalized</stArea:unit>
                     </mwg-rs:Area>
                  </rdf:li>
               </rdf:Bag>
            </mwg-rs:RegionList>
         </mwg-rs:Regions>

The root folder is E:\AA\EV, and the subfolders with JPG photos inside are named as 20200125,AC1,Parma,VVS042,(nnnn),0000000001

The comma should be replaced by a space because, where there is no space after the comma, the words would otherwise run together. I believe I can survive with two consecutive spaces in a face tag name. As I don't have enough disk space, I accept delete_originals. I use ExifTool every day to set the "date taken" on multiple folders (thanks again, Phil!) with delete_originals, and I've never had problems.

If a gentle guru posts the necessary command line, I will test it first on a small set (a copy of an actual folder), and then hooray!

Thanks a million and I apologize again for bothering!

Kind regards, Bruno

Phil Harvey

Hi Bruno,

The real concern here is that if some faces aren't named, and if you aren't very careful about how you do this, then names could easily get shifted to the unnamed faces when they are rewritten.  Otherwise, it would be as simple as this:

exiftool "-xmp-mwg-rs:regionname<${-xmp-mwg-rs:regionname;tr/,/./}" -sep "##" DIR

(changing all of the commas to periods)

But I would have to do some testing and more thinking to see what happens if there are regions without a name.  However, I don't have time to do this at the moment, so I'll come back to this when I get a chance.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

brunos

Great Phil, thank you very much. I will do some testing on a copy of a data set I can afford to sacrifice, containing some photos with faces and some without, and I will let you know. Thanks again for the super-fast answer!!!

Kind regards
Bruno

StarGeek

Quote from: brunos on February 03, 2020, 01:48:42 PM
What I've found out comparing XMP is that all the images where the face tag name won't load automatically, contain one or more comma signs.

Damn.  I knew that Picasa didn't properly load keywords that had commas in them, as it would always split them into separate keywords, but I didn't ever think to test if the regions had problems with commas.

Quote:
I never got that Picasa uses comma as a delimiter between each face on a multi-person photos.

Technically, it doesn't.  The region names are completely separate.  It appears that the parser is broken when reading region names that contain commas.

Quote from: Phil Harvey on February 03, 2020, 02:03:08 PM
The real concern here is that if some faces aren't named, and if you aren't very careful about how you do this, then names could easily get shifted to the unnamed faces when they are rewritten.

I don't believe that Picasa will embed unnamed regions in the file.  They're saved in the database and/or the .picasa.ini file.  I just double-checked and it didn't look like the unnamed regions were stored in the file.

Quote:
exiftool "-xmp-mwg-rs:regionname<${-xmp-mwg-rs:regionname;tr/,/./}" -sep "##" DIR

Shouldn't that be ${xmp-mwg-rs:regionname (no extra dash)?

And maybe use -api Filter to avoid any problems?  This is the region name replacement command I've been using
exiftool -if "$RegionName ne $RegionName#" -TagsFromFile @ -RegionName -api "Filter=s/Search/Replace/i" <FileOrDir>

I would think using Filter=s/, */ /g (/commaSpaceAsterisk/Space/) would replace a comma followed by 0 or more spaces with a single space.
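For anyone who wants to sanity-check that pattern before touching real files, here is a minimal Python sketch of the same regular expression applied to the name forms mentioned in this thread. It only illustrates what the substitution does to the strings; it does not call ExifTool, and the helper name strip_commas is made up for illustration.

```python
import re


def strip_commas(name: str) -> str:
    """Replace each comma, plus any spaces that follow it, with one space."""
    return re.sub(r", *", " ", name)


# Name forms taken from this thread.
samples = [
    "code Firstname Surname, City, Association",
    "code Firstname Surname,City,Association",
    "code Firstname Surname City Association",
    "aN0F Rina Xenia Nannini, Alzate",
]

for s in samples:
    print(repr(s), "->", repr(strip_commas(s)))
```

With this pattern the "comma plus space" and "comma without space" variants should both collapse to a single space, so no double spaces are introduced.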

brunos

Hi!

I did some testing, and the region name gets updated (the comma is removed), but it still doesn't autoload. I used Photoshop to read the XMP and compared the XMP structure between the original and the modified file.

In the original photo, there's only one <rdf:li section for each region name. The <mwg-rs:Name>aN0M Ugo Bettio, Pavia</mwg-rs:Name> still contains the comma:
                  <rdf:li rdf:parseType="Resource">
                     <mwg-rs:Name>aN0M Ugo Bettio, Pavia</mwg-rs:Name>
                     <mwg-rs:Type>Face</mwg-rs:Type>
                     <mwg-rs:Area rdf:parseType="Resource">
                        <stArea:x>0.468174</stArea:x>
                        <stArea:y>0.189091</stArea:y>
                        <stArea:w>0.329983</stArea:w>
                        <stArea:h>0.341725</stArea:h>
                        <stArea:unit>normalized</stArea:unit>
                     </mwg-rs:Area>
                  </rdf:li>

In the processed photo, the region name appears in its own <rdf:li section, while the region coordinates are in a second <rdf:li section. The comma is gone from <mwg-rs:Name>aN0M Ugo Bettio. Pavia</mwg-rs:Name>, replaced by a period:
                  <rdf:li rdf:parseType="Resource">
                     <mwg-rs:Name>aN0M Ugo Bettio. Pavia</mwg-rs:Name>
                  </rdf:li>
                  <rdf:li rdf:parseType="Resource">
                     <mwg-rs:Area rdf:parseType="Resource">
                        <stArea:h>0.341725</stArea:h>
                        <stArea:unit>normalized</stArea:unit>
                        <stArea:w>0.329983</stArea:w>
                        <stArea:x>0.468174</stArea:x>
                        <stArea:y>0.189091</stArea:y>
                     </mwg-rs:Area>
                     <mwg-rs:Type>Face</mwg-rs:Type>

I also tried StarGeek's syntax, modified to:
exiftool -if "$RegionName ne $RegionName#" -TagsFromFile @ -RegionName -api "Filter=s/, */ /g" C:\Users\Bruno\Desktop\folderstemp\0000

The replacement works, but the autoload doesn't, and the <rdf:li is split in two parts just as in the attempts with Phil's syntax.

Just to be sure, I did another test: I took the photo with Ugo Bettio, which by chance also contains the face of another activist without a comma in her name; her name autoloads, while Ugo's doesn't. I added a face tag to Ugo's face manually in Picasa, making sure it doesn't contain a comma, then saved the changes to XMP through the native Picasa function. I verified in Photoshop that the name doesn't contain a comma and that there is only one "li" section, and tested autoload - it worked.

So my guess would be that the two <rdf:li sections somehow confuse Picasa: its parser doesn't find the coordinates and therefore gives up. I have no idea whether the duplication of the <rdf:li is avoidable, or how.
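One way to look for this kind of split without opening every file in Photoshop might be a small script run on the XMP saved as text (for example, extracted with exiftool -xmp -b). The following is only a rough Python sketch: the function name incomplete_regions is hypothetical, the sample is a trimmed version of the processed snippet above, and the namespace declarations on the wrapper element are my assumption of what a full packet would declare.

```python
import xml.etree.ElementTree as ET

# Prefixes here are local to this script; the URIs are assumed to match
# the ones declared in the real XMP packet.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "mwgrs": "http://www.metadataworkinggroup.com/schemas/regions/",
}


def incomplete_regions(xmp_text: str) -> list[int]:
    """Return indexes of region list items lacking a Name or an Area."""
    root = ET.fromstring(xmp_text)
    items = root.findall(".//mwgrs:RegionList/rdf:Bag/rdf:li", NS)
    bad = []
    for i, li in enumerate(items):
        has_name = li.find("mwgrs:Name", NS) is not None
        has_area = li.find("mwgrs:Area", NS) is not None
        if not (has_name and has_area):
            bad.append(i)
    return bad


# Trimmed sample modelled on the processed photo: name and area are split.
SPLIT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:mwg-rs="http://www.metadataworkinggroup.com/schemas/regions/"
 xmlns:stArea="http://ns.adobe.com/xmp/sType/Area#">
 <rdf:Description rdf:about="">
  <mwg-rs:Regions rdf:parseType="Resource">
   <mwg-rs:RegionList>
    <rdf:Bag>
     <rdf:li rdf:parseType="Resource">
      <mwg-rs:Name>aN0M Ugo Bettio. Pavia</mwg-rs:Name>
     </rdf:li>
     <rdf:li rdf:parseType="Resource">
      <mwg-rs:Area rdf:parseType="Resource">
       <stArea:x>0.468174</stArea:x>
      </mwg-rs:Area>
      <mwg-rs:Type>Face</mwg-rs:Type>
     </rdf:li>
    </rdf:Bag>
   </mwg-rs:RegionList>
  </mwg-rs:Regions>
 </rdf:Description>
</rdf:RDF>"""

print(incomplete_regions(SPLIT))
```

An empty list would mean every region item carries both a name and an area. This says nothing about what Picasa actually requires; it only flags the split structure described above.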

The good news is that on the photos I tested with unnamed faces, there's no region at all in the XMP, and neither command line harmed them.

Thanks again for your precious help!
Kind regards,
Bruno

PS. I've also attached a screenshot (left side: rewritten XMP, right side: original JPG's XMP). The problem starts at line 26, which opens the <rdf:li section for the activist "aN0M Ugo Bettio. Pavia". Unlike the original, the <rdf:li section closes immediately after line 27, so we get two <rdf:li sections for that single region.

brunos

I just had a flash: perhaps the problem is not the doubled <rdf:li section, but the fact that <mwg-rs:Type>Face</mwg-rs:Type> is not in the same <rdf:li section as <mwg-rs:Name>aN0M Ugo Bettio. Pavia</mwg-rs:Name>?

Kind regards
Bruno

Phil Harvey

Right.  I thought something like this may happen.  I will need to find some time to investigate this further.

- Phil

Phil Harvey

What version of ExifTool are you using?  I don't get the same results when I try with a test XMP file that I have here.

It would be helpful if you could post your original xmp.  ie, the output from exiftool -xmp -b FILE > out.xmp

- Phil

brunos

Hi Phil!

The ExifTool version is below, and the out.xmp is attached. Thank you for looking into this!

C:\Users\Bruno\Desktop\folderstemp>exiftool -ver -v
ExifTool version 11.21
Perl version 5.024000 (-C0)
Platform: MSWin32
Optional libraries:
  Archive::Zip                 1.47
  Compress::Zlib               2.069
  Digest::MD5                  2.54
  Digest::SHA                  5.95
  IO::Compress::Bzip2          2.069
  Time::Piece                  1.31
  Unicode::LineBreak           (not installed)
  IO::Compress::RawDeflate     2.069
  IO::Uncompress::RawInflate   2.069
  Win32::API                   0.84
  Win32::FindFile              0.15
  Win32API::File               0.1203

Phil Harvey

I think the problem will go away if you update to the current version of ExifTool.  There were some bug fixes in this area since 11.21.  For example:

July 19, 2019 - Version 11.57
  - Fixed problem replacing multiple structure elements in lists of XMP structures
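Before launching a batch over thousands of folders, it may be worth confirming that the installed version is at least the one carrying this fix. A minimal Python sketch, assuming version strings have the major.minor form printed by exiftool -ver (such as 11.21 above); the function names are hypothetical, and 11.57 is only the example fix quoted here, so other fixes since 11.21 may matter too.

```python
def version_tuple(ver: str) -> tuple[int, ...]:
    """Turn a version string such as '11.57' into (11, 57)."""
    return tuple(int(part) for part in ver.strip().split("."))


def has_structure_fix(installed: str, fixed_in: str = "11.57") -> bool:
    """True if the installed version is at or past the quoted fix."""
    return version_tuple(installed) >= version_tuple(fixed_in)


print(has_structure_fix("11.21"))
print(has_structure_fix("11.85"))
```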


- Phil

brunos

That's brilliant, Phil! I've downloaded 11.85 and the problem is really gone! I processed the file, and it autoloads the name. Wow wow and superwow!

Thanks again, you're a superPhil!

Kindest regards,
Bruno