I am always learning, and often wrong. I'm curious if parsing XML-based metadata might be much easier if a DTD or XSD were provided for each supported namespace? I see these:
https://metacpan.org/dist/XML-Validator-Schema
https://metacpan.org/pod/XML::LibXML::Schema
https://xerces.apache.org/xerces-p/
From my amateur viewpoint, they look promising. The idea would be that the only XML recognized and parsed is XML that has a schema document; everything else is ignored and left unparsed. Thoughts?
- J
I haven't considered using XSD. Most of the XML that ExifTool parses is proprietary anyway, so I would have to generate the XSD myself, and write the code to interpret the XML based on the XSD. It just doesn't sound like much fun.
- Phil
Fair! I have nothing but gratitude for your work. : )
Upon more research, I've discovered there are a number of mature, respected utilities for converting multiple XML documents into a single XSD schema very quickly. I'm installing a few now to test. I don't have access to a Windows machine (macOS at home, Ubuntu on servers), but XMLSpy (https://www.altova.com/xmlspy-xml-editor) looks nice:
https://www.altova.com/blog/generating-a-schema-from-multiple-xml-instances/
https://www.altova.com/xmlspy-xml-editor
Not cheap though. I'm currently testing Apache XMLBeans libraries (https://xmlbeans.apache.org/) first, and I see that Microsoft also has a very well-regarded tool (https://learn.microsoft.com/en-us/dotnet/standard/serialization/xml-schema-definition-tool-xsd-exe) that I've read can work with Mono on macOS. Will keep you apprised!
- J
Wow. OK. so, i installed Apache Ant (https://ant.apache.org/) and ensured i had JDK 1.8, then installed Apache XMLBeans (https://xmlbeans.apache.org/). i then used the command line tool inst2xsd to assess a folder of XML documents and emit a schema. (i am leaving out all the PATH party). (i also installed Apache Log4j (https://logging.apache.org/log4j/2.x/) for logging, which is optional.)
Because i was testing with Capture One Settings (.cos) files, and inst2xsd (https://xmlbeans.apache.org/guide/Tools.html#inst2xsd) only processes files with the .xml extension, i wrote a command that chains three steps: append .xml to every file, run inst2xsd, then rename the files back. this bash command only works on the current directory, and uses the inst2xsd defaults:
XML_DIR=$(pwd); for file in "$XML_DIR"/*; do mv "$file" "$file.xml"; done && inst2xsd "$XML_DIR"/*.xml && for file in "$XML_DIR"/*.xml; do mv "$file" "${file%.xml}"; done
seems to work without any issues. this was somewhat helpful: link (https://www.infoworld.com/article/2162353/generate-xml-schemas-from-xml-with-inst2xsd.html)
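for anyone repeating this, here is a more defensive sketch of the same workaround. instead of renaming the originals in place, it copies them into a scratch directory with .xml appended, so a failed run can't leave the source files renamed. the helper name stage_as_xml and the scratch-directory layout are my own assumptions, not part of XMLBeans, and the inst2xsd call is shown only as a comment:

```shell
#!/usr/bin/env bash
# Sketch only: stage copies of the input files under a .xml name.
# stage_as_xml is a hypothetical helper, not an XMLBeans tool.
stage_as_xml() {
  local src_dir="$1"
  local stage_dir="$2"
  local file
  mkdir -p "$stage_dir" || return 1
  for file in "$src_dir"/*; do
    # Skip subdirectories and the unexpanded glob of an empty directory.
    [ -f "$file" ] || continue
    # ${file##*/} strips the directory part, leaving the bare file name.
    cp "$file" "$stage_dir/${file##*/}.xml" || return 1
  done
}

# Usage (assumes inst2xsd is on PATH):
#   stage=$(mktemp -d)
#   stage_as_xml "$(pwd)" "$stage"
#   inst2xsd -outDir "$stage" "$stage"/*.xml
```

the originals are never touched, and the generated schema lands in the scratch directory rather than next to the source files, so a second run won't pick it up as an input.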
i then validated:
XML_DIR=$(pwd); for file in "$XML_DIR"/*; do mv "$file" "$file.xml"; done && inst2xsd -validate "$XML_DIR"/*.xml && for file in "$XML_DIR"/*.xml; do mv "$file" "${file%.xml}"; done
and achieved total joy, as far as I can tell?
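as an independent cross-check, the generated schema could also be tested with a second validator. this is a rough sketch assuming xmllint (from libxml2) is installed; validate_against is a hypothetical helper name, and the schema path is whatever file inst2xsd actually wrote:

```shell
#!/usr/bin/env bash
# Sketch only: re-check files against a generated schema with xmllint.
# validate_against is a hypothetical helper; pass the schema first, then files.
validate_against() {
  local schema="$1"
  shift
  local file
  local failed=0
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found" >&2
    return 127
  fi
  for file in "$@"; do
    # --noout suppresses the document dump; --schema validates against the XSD.
    if xmllint --noout --schema "$schema" "$file" >/dev/null 2>&1; then
      echo "valid: $file"
    else
      echo "INVALID: $file"
      failed=1
    fi
  done
  # Non-zero if any file failed validation.
  return "$failed"
}

# Usage (schema file name is an assumption; use the one inst2xsd produced):
#   validate_against schema0.xsd ./*.cos
```

xmllint does not care about the file extension, so the .cos files can be checked without the rename step.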
the entire installation and test took a couple hours. i'll keep researching.
- J