XML Parsing

Started by blue-j, August 05, 2024, 02:45:07 PM

blue-j

I am always learning, and often wrong.  I'm curious if parsing XML-based metadata might be much easier if a DTD or XSD were provided for each supported namespace?  I see these:

    https://metacpan.org/dist/XML-Validator-Schema

    https://metacpan.org/pod/XML::LibXML::Schema

    https://xerces.apache.org/xerces-p/

From my amateur viewpoint, they look promising.  The idea: only XML that has a schema document would be recognized and parsed, and everything else would be ignored/unparsed.  Thoughts?

- J

Phil Harvey

I haven't considered using XSD.  Most of the XML that ExifTool parses is proprietary anyway, so I would have to generate the XSD myself, and write the code to interpret the XML based on the XSD.  It just doesn't sound like much fun.

- Phil

blue-j

Fair!  I have nothing but gratitude for your work.  : )

Upon more research, I've discovered there are a number of mature, respected utilities for converting multiple XML documents into a single XSD schema very quickly.  I'm installing a few now to test.  I don't have access to a Windows machine (macOS at home, Ubuntu on servers), but XMLSpy looks nice:

https://www.altova.com/blog/generating-a-schema-from-multiple-xml-instances/
https://www.altova.com/xmlspy-xml-editor

Not cheap though.  I'm currently testing Apache XMLBeans libraries first, and I see that Microsoft also has a very well-regarded tool that I've read can work with Mono on macOS.  Will keep you apprised!

- J



blue-j

Wow.  OK.  So, I installed Apache Ant and ensured I had JDK 1.8, then installed Apache XMLBeans.  I then used the command-line tool inst2xsd to assess a folder of XML documents and emit a schema.  (I am leaving out all the PATH party.)  (I also installed Apache Log4j for logging, which is optional.)

Because I was testing with Capture One Settings (.cos) files, and inst2xsd only processes files with the .xml extension, I wrote a command that chains three steps: rename the files, run inst2xsd, and rename them back.  This bash command only works on the current directory, and uses the defaults:

XML_DIR=$(pwd); for file in "$XML_DIR"/*; do mv "$file" "$file.xml"; done && inst2xsd "$XML_DIR"/*.xml && for file in "$XML_DIR"/*.xml; do mv "$file" "${file%.xml}"; done
Seems to work without any issues.  This was somewhat helpful: link
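
One caveat with the one-liner above: because the steps are joined with &&, a non-zero exit from inst2xsd skips the final loop and leaves every file with a stray .xml suffix. The glob also picks up anything else in the folder, including a schema emitted by an earlier run. A minimal sketch of an alternative, assuming it is acceptable to copy the files rather than rename them: stage copies with a .xml suffix in a scratch directory and point inst2xsd at those. The function name stage_as_xml and the scratch-directory layout are made up for illustration; only the bare inst2xsd invocation comes from the post.

```shell
#!/usr/bin/env bash
# Sketch: copy every regular file in SRC into DEST with ".xml" appended,
# leaving the originals untouched. stage_as_xml is a hypothetical helper.

stage_as_xml() {
  local src=$1 dest=$2 file
  mkdir -p "$dest"
  for file in "$src"/*; do
    [ -f "$file" ] || continue               # skip subdirectories and empty globs
    cp -- "$file" "$dest/${file##*/}.xml"    # ${file##*/} strips the directory part
  done
}

# Possible usage (assumes inst2xsd is on PATH, as in the post):
#   staging=$(mktemp -d)
#   stage_as_xml "$PWD" "$staging"
#   inst2xsd "$staging"/*.xml
#   rm -rf "$staging"
```

Since nothing is renamed in place, a failed or interrupted run leaves the original .cos files exactly as they were.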

i then validated:

XML_DIR=$(pwd); for file in "$XML_DIR"/*; do mv "$file" "$file.xml"; done && inst2xsd -validate "$XML_DIR"/*.xml && for file in "$XML_DIR"/*.xml; do mv "$file" "${file%.xml}"; done
And achieved total joy, as far as I can tell?
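
For a second opinion on the -validate result, the generated schema can also be checked with an independent validator. A sketch, assuming xmllint from libxml2 is installed (it is not mentioned anywhere in this thread) and that the schema path is passed in explicitly; validate_dir is a hypothetical helper name, not part of XMLBeans.

```shell
#!/usr/bin/env bash
# Sketch: validate every regular file in DIR against an XSD with xmllint.
# Assumption: xmllint (libxml2) is on PATH. validate_dir is a made-up name.

validate_dir() {
  local schema=$1 dir=$2 failed=0 file
  for file in "$dir"/*; do
    [ -f "$file" ] || continue               # skip subdirectories and empty globs
    case $file in *.xsd) continue ;; esac    # do not validate schema files themselves
    # --noout suppresses the document dump; --schema selects XSD validation
    if ! xmllint --noout --schema "$schema" "$file" 2>/dev/null; then
      printf 'INVALID: %s\n' "$file"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                        # succeed only if every file validated
}

# Possible usage:
#   validate_dir path/to/generated.xsd "$PWD" && echo "all files validate"
```

xmllint does not care about file extensions, so no renaming is needed for this check.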

The entire installation and test took a couple of hours.  I'll keep researching.

- J