XML parser error for images containing [JSON] / [JUMBF] tags

Started by Mac2, February 07, 2024, 10:11:14 AM

Mac2

When I import data from ExifTool into my application (IMatch) and the files contain CAI/C2PA data, the import fails because the XML produced by ExifTool is invalid.

I've used the official sample images from the project's GitHub: https://github.com/c2pa-org/public-testfiles/tree/main/image/jpeg

Both the Microsoft XML parser (Windows) and the XML-Parser/Validator in Visual Studio Code report this node:

<JSON:Author>
  <rdf:Description et:id='author' et:table='JSON::Main'>
   <et:desc>Author</et:desc>
   <et:prt rdf:parseType='Resource'>
    <JSON:@type>Person</JSON:@type>
    <JSON:name>Adobe make_test</JSON:name>
   </et:prt>
  </rdf:Description>
 </JSON:Author>

and complain about the JSON:@type as "Element or attribute do not match QName production: QName::=(NCName':')?NCName".

When I replace the node name with <JSON:at-type></JSON:at-type> before parsing it with MSXML, the error is gone and my software can ingest the data as usual.
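
For anyone who needs the same stopgap, here is a minimal Python sketch of that pre-parse rename. It is not what IMatch or ExifTool actually does; the helper name `patch_at_names`, the regular expression, and the placeholder namespace URI are assumptions made for illustration. It only rewrites `@` at the start of the local part of a tag name and uses the standard-library parser in place of MSXML.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening or closing tag whose local name starts with '@',
# e.g. <JSON:@type> or </JSON:@type>. Text content is left untouched.
TAG_WITH_AT = re.compile(r"<(/?)([A-Za-z_][\w.-]*):@([\w.-]+)")

def patch_at_names(xml_text: str) -> str:
    """Hypothetical helper: turn <ns:@name> into <ns:at-name>."""
    return TAG_WITH_AT.sub(r"<\1\2:at-\3", xml_text)

# Cut-down sample; the namespace URI is a placeholder, not ExifTool's.
sample = (
    "<root xmlns:JSON='http://example.com/json'>"
    "<JSON:@type>Person</JSON:@type>"
    "</root>"
)

patched = patch_at_names(sample)
root = ET.fromstring(patched)  # parsing the unpatched sample raises ET.ParseError
```

Note that the rename changes the tag name the consumer sees (`at-type` instead of `@type`), so any later lookup has to use the patched name.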

Phil Harvey

I am having trouble trying to reproduce this.

Which specific sample image, and what ExifTool command line did you use?

- Phil

Edit: Ah, OK.  I can see it now by adding the -struct option.

Phil Harvey

Got it.  I'll add strict XML attribute name validation for structure elements in the next release (12.77), which should fix the issue.

- Phil

Mac2

Sorry, I should have included the parameters I use to extract the data, which do indeed use -struct.

Thanks for looking into this. It's not urgent, but I expect to see more and more images with embedded CAI data in the future.

Mac2

There are more XML errors.
The data in the attached JPG file fails to load, with the error message
<CBOR:actions[1].action rdf:parseType='Resource'>
'A name contained an invalid character.'
Every node name with [1] is rejected as illegal.
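
To find every offending node name in one pass rather than one parser error at a time, a rough check like the following can help. It is only a sketch: the pattern is an ASCII-only approximation of the XML name rules (the real specification also allows many non-ASCII characters), and the function names are made up for this example.

```python
import re

# ASCII-only approximation of an XML NCName (no colon allowed).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")
# Captures the name that follows '<' or '</', skipping comments and PIs.
TAG_NAME = re.compile(r"</?([^\s<>/!?][^\s<>/]*)")

def is_valid_qname(name: str) -> bool:
    """True if name looks like 'local' or 'prefix:local' with valid parts."""
    parts = name.split(":")
    return 1 <= len(parts) <= 2 and all(NCNAME.match(p) for p in parts)

def invalid_tag_names(xml_text: str) -> list:
    """Return the sorted set of tag names that fail the approximate check."""
    names = {m.group(1) for m in TAG_NAME.finditer(xml_text)}
    return sorted(n for n in names if not is_valid_qname(n))

snippet = (
    "<CBOR:actions[1].action rdf:parseType='Resource'>"
    "<et:desc>x</et:desc>"
    "</CBOR:actions[1].action>"
)
bad = invalid_tag_names(snippet)
```

Here `bad` contains only `CBOR:actions[1].action`, because `[` and `]` are not allowed in element names, while `et:desc` passes.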

I have created this JPG image for testing purposes with Stable Diffusion and used Photoshop to save it with Content Credentials enabled.

These are the parameters used to extract the metadata:

-overwrite_original
-charset
FILENAME=UTF8
-tagsfromfile
c:\images\001 copy.jpg
-all:all
-api
struct=2
-use
MWG
--preview:all
-@
v:\exiftool\arg_files\exif2xmp.args
--Exif:rating
-@
v:\exiftool\arg_files\iptc2xmp.args
-@
v:\exiftool\arg_files\gps2xmp.args
C:\temp\imt8B23A0F7-3976-422C-A096-2AA8F83C5D26.xmp
-execute



Phil Harvey

Thanks.  The patch will force all structure field names to conform with the XML specification, removing all invalid characters.  The only exception is that I will allow "xml" as the first 3 letters in a field name, which may not be strictly allowed by the spec, but I've tested and it still passes my XML validator.

- Phil
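
For readers curious what "removing all invalid characters" from a field name could look like, here is a small Python sketch. It is not ExifTool's actual code (ExifTool is written in Perl, and the real patch may differ); the function name, the ASCII-only character set, and the choice to prefix an underscore when the result does not start with a letter are all assumptions.

```python
import re

def sanitize_field_name(name: str) -> str:
    """Hypothetical sanitizer: drop characters not allowed in an XML name."""
    # Keep only ASCII name characters (an approximation of the XML rules).
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "", name)
    # A name must start with a letter or underscore; this fallback is assumed.
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned

examples = {n: sanitize_field_name(n) for n in ("@type", "actions[1].action", "1st")}
```

One thing to keep in mind with any approach that strips characters: two different source names, such as `@type` and `type`, can end up with the same sanitized name, so a consumer cannot always map the XML name back to the original field.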

Mac2

Excellent. Thank you.
Hopefully the Microsoft XML parser will accept these node names too.