HeadingPairs structure

Started by thorsted, December 06, 2019, 02:16:35 PM


thorsted

I am working on transforming the XML from Office documents. One of the properties in the XML is XML:HeadingPairs, but it has children:
<XML:HeadingPairs>
  <rdf:Bag>
   <rdf:li>Title</rdf:li>
   <rdf:li>1</rdf:li>
  </rdf:Bag>
</XML:HeadingPairs>

The problem is that "Title" should be a property and "1" should be the value of that property. rdf:Bag indicates an unordered list, but here the sequence is significant. Would rdf:Seq be better?

In the original DOCX file we have:
    <HeadingPairs>
        <vt:vector size="2" baseType="variant">
            <vt:variant>
                <vt:lpstr>Title</vt:lpstr>
            </vt:variant>
            <vt:variant>
                <vt:i4>1</vt:i4>
            </vt:variant>
        </vt:vector>
    </HeadingPairs>

This would be easier to transform because the name and the value are tagged differently, rather than both being <rdf:li>.
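To illustrate why the typed form is easier to work with, here is a minimal sketch using Python's standard xml.etree.ElementTree. It assumes the vt prefix is bound to the usual docPropsVTypes namespace and that names (vt:lpstr) and counts (vt:i4) alternate as in the sample above; the function name heading_pairs is only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace normally bound to the "vt" prefix in docProps/app.xml.
VT = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"

SAMPLE = f"""<HeadingPairs xmlns:vt="{VT}">
  <vt:vector size="2" baseType="variant">
    <vt:variant><vt:lpstr>Title</vt:lpstr></vt:variant>
    <vt:variant><vt:i4>1</vt:i4></vt:variant>
  </vt:vector>
</HeadingPairs>"""


def heading_pairs(xml_text):
    """Return (name, count) tuples, using the element type to tell them apart."""
    root = ET.fromstring(xml_text)
    names, counts = [], []
    for variant in root.iter(f"{{{VT}}}variant"):
        for child in variant:
            if child.tag == f"{{{VT}}}lpstr":
                names.append(child.text)        # heading name
            elif child.tag == f"{{{VT}}}i4":
                counts.append(int(child.text))  # number of parts under it
    return list(zip(names, counts))


print(heading_pairs(SAMPLE))
```

Because the element type identifies each item, the pairing here does not have to rely purely on position.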

Is there a better way to do this?

Thanks.

Phil Harvey

If put directly into a list, then rdf:Seq is definitely the type to use.  But maybe something like this makes more sense:

<XML:HeadingPairs>
  <rdf:Bag>
    <rdf:li rdf:parseType='Resource'>
      <XML:name>Title</XML:name>
      <XML:value>1</XML:value>
    </rdf:li>
  </rdf:Bag>
</XML:HeadingPairs>


(although appropriating the "XML" namespace is probably a bad idea -- you should define your own namespace)
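A rough sketch of how a transform could produce that kind of structure from a flat name/value list, using Python's standard xml.etree.ElementTree. The "ex" prefix and its URI are hypothetical stand-ins for whatever namespace you define, and the sketch uses rdf:Seq on the assumption that the order of the headings matters; swap in Bag if it does not.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.com/ns/headingpairs/"  # hypothetical namespace, replace with your own

ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)


def build_heading_pairs(flat):
    """Turn ['Title', '1', ...] into one structured rdf:li per name/value pair."""
    if len(flat) % 2:
        raise ValueError("expected an even number of items (name, value, ...)")
    root = ET.Element(f"{{{EX}}}HeadingPairs")
    seq = ET.SubElement(root, f"{{{RDF}}}Seq")
    for name, value in zip(flat[0::2], flat[1::2]):
        li = ET.SubElement(seq, f"{{{RDF}}}li",
                           {f"{{{RDF}}}parseType": "Resource"})
        ET.SubElement(li, f"{{{EX}}}name").text = name
        ET.SubElement(li, f"{{{EX}}}value").text = value
    return root


print(ET.tostring(build_heading_pairs(["Title", "1"]), encoding="unicode"))
```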

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

thorsted

That does make more sense.

In my transform I am converting all the metadata into a key ID list, and HeadingPairs and TitlesOfParts are making it more complicated. I will use my own namespace.

Is this something you might change in the way ExifTool outputs the XML for Office Open XML? I was also curious about the use of the XML namespace in ExifTool.

Thanks.

Phil Harvey

ExifTool only composes XML with the -X option (and RDF/XML when writing XMP metadata).  Any XML it outputs otherwise is verbatim as it was found in the file.

- Phil

thorsted

Phil,

There is a little bit of restructuring. You can see from my first post the difference.

   <rdf:li>Title</rdf:li>
   <rdf:li>1</rdf:li>

vs.
   <vt:variant>
     <vt:lpstr>Title</vt:lpstr>
   </vt:variant>
   <vt:variant>
     <vt:i4>1</vt:i4>
   </vt:variant>


In the original, the title is a vt:lpstr and its value is a vt:i4, whereas both are rdf:li in the ExifTool output.

Thanks.

Phil Harvey

What exiftool command are you using?

thorsted

Phil,

I am using exiftool -X to get the rdf:li output.

I am looking at docProps/app.xml in the DOCX file for the actual metadata.
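For anyone wanting to look at the same data, a DOCX is a ZIP archive, so docProps/app.xml can be read directly. This is only a sketch with Python's standard library; it assumes the usual extended-properties and docPropsVTypes namespaces, and read_heading_pairs is a made-up helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces normally used in docProps/app.xml (assumed, not shown in the excerpt above).
EP = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
VT = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"


def read_heading_pairs(docx):
    """Return [(type, text), ...] for each item in HeadingPairs, in file order.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))
    pairs = root.find(f"{{{EP}}}HeadingPairs")
    if pairs is None:
        return []
    return [(child.tag.split("}", 1)[1], child.text)
            for variant in pairs.iter(f"{{{VT}}}variant")
            for child in variant]
```

Calling read_heading_pairs("sample.docx") (file name hypothetical) on a document like the one above would be expected to give the lpstr and i4 items with their types intact.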

Phil Harvey

Ah.  I didn't understand that is what you were doing.  We should start again from the beginning.

You are suggesting a change to the -X output?  It is unlikely that I will be able to easily accommodate this request, but if you send me the sample DOCX file I'll take a look.  My email is philharvey66 at gmail.com

- Phil

Phil Harvey

I got the sample file, thanks.  I didn't have time to think about this today, but I'll post back after I've had a chance to look into it.

- Phil

Phil Harvey

I've had a chance to play with this a bit.

ExifTool is converting the elements of the vt:vector from the XML file into the values of a List-type tag.

From there, the -X option outputs the list in RDF/XML format.  But currently the type of list (Bag, Seq, or Alt) is not saved in the intermediate step, so ExifTool outputs all lists as "Bag" in the -X output.  About the only change I could make here is to add a patch to maintain the list type and output this as an RDF/XML "Seq" list, but...

1. The patch would be ugly

2. The resulting change is not very significant, and I don't see how it would make much difference for you.

But since the values are stored as a vector of separate items, the list of values is definitely the right thing for ExifTool to output.
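On the consuming side, a flat list like this can still be paired up by position. The sketch below uses Python's standard xml.etree.ElementTree and assumes the items appear in document order as name, count, name, count; the URI bound to the XML prefix here is a placeholder, not the one ExifTool actually writes, and pair_list_items is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PLACEHOLDER = "http://example.com/ns/XML/"  # placeholder URI for the "XML" prefix

FRAGMENT = f"""<XML:HeadingPairs xmlns:XML="{PLACEHOLDER}" xmlns:rdf="{RDF}">
  <rdf:Bag>
   <rdf:li>Title</rdf:li>
   <rdf:li>1</rdf:li>
  </rdf:Bag>
</XML:HeadingPairs>"""


def pair_list_items(xml_text):
    """Pair alternating rdf:li items into {name: count}, relying on document order."""
    root = ET.fromstring(xml_text)
    items = [li.text for li in root.iter(f"{{{RDF}}}li")]
    if len(items) % 2:
        raise ValueError("expected an even number of list items")
    return {items[i]: int(items[i + 1]) for i in range(0, len(items), 2)}


print(pair_list_items(FRAGMENT))
```

This leans on the writer keeping the items in source order, which a Bag does not formally promise, so it is a workaround rather than a guarantee.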

It sounds like you would prefer a more structured output.  However, even if the source XML was structured, ExifTool's parsing of XML is very simplistic, so even then ExifTool would require significant patching to achieve this goal.   Note that ExifTool doesn't officially support reading of arbitrary XML files, for good reason.

So I'm not sure what to do here.  If just changing from Bag to Seq makes a difference for you then I could look into this further.  Beyond that, the only thing I could think of would be to add dedicated code to handle HeadingPairs and TitlesOfParts specially, but this solution would be very asymmetric and ugly to me.

- Phil