ExifTool Forum

General => Metadata => Topic started by: thorsted on December 06, 2019, 02:16:35 PM

Title: HeadingPairs structure
Post by: thorsted on December 06, 2019, 02:16:35 PM
I am working on transforming the XML from Office documents. One of the properties in the XML is XML:HeadingPairs, but it has children:
<XML:HeadingPairs>
  <rdf:Bag>
   <rdf:li>Title</rdf:li>
   <rdf:li>1</rdf:li>
  </rdf:Bag>
</XML:HeadingPairs>

The problem is, "Title" should be a property and "1" be the value of that property. rdf:Bag indicates an unordered list, but the sequence is significant. Maybe rdf:Seq is better?
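
To make the pairing concrete, here is a minimal Python sketch that reads the flat list and groups every two rdf:li items into a (name, value) tuple. It assumes the items always alternate name, value. The sample string and the namespace URI bound to the XML prefix are placeholders for illustration, not necessarily what ExifTool emits.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Placeholder URI for the "XML" prefix (an assumption for this sketch).
PLACEHOLDER_NS = "http://example.org/placeholder/XML/"

# Hypothetical sample shaped like the fragment quoted in this post.
SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:XML="{PLACEHOLDER_NS}">
  <XML:HeadingPairs>
    <rdf:Bag>
      <rdf:li>Title</rdf:li>
      <rdf:li>1</rdf:li>
    </rdf:Bag>
  </XML:HeadingPairs>
</rdf:RDF>
"""

def pair_list_items(container):
    """Group all rdf:li descendants of container into (name, value) tuples."""
    items = [li.text or "" for li in container.iter(f"{{{RDF_NS}}}li")]
    if len(items) % 2:
        raise ValueError("expected an even number of list items")
    # Even positions are names, odd positions are values.
    return list(zip(items[0::2], items[1::2]))

root = ET.fromstring(SAMPLE)
heading_pairs = root.find(f"{{{PLACEHOLDER_NS}}}HeadingPairs")
pairs = pair_list_items(heading_pairs)
print(pairs)
```

This only works if the document order of the items is preserved, which is exactly why the unordered rdf:Bag is uncomfortable here.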

In the original DOCX file we have:
    <HeadingPairs>
        <vt:vector size="2" baseType="variant">
            <vt:variant>
                <vt:lpstr>Title</vt:lpstr>
            </vt:variant>
            <vt:variant>
                <vt:i4>1</vt:i4>
            </vt:variant>
        </vt:vector>
    </HeadingPairs>

This would be easier to transform, since the name and the value are tagged differently (vt:lpstr and vt:i4) rather than both being <rdf:li>.
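
For comparison, a rough Python sketch of pulling the pairs straight out of the app.xml markup, using the element names to tell names from values. The sample string is a cut-down stand-in for a real docProps/app.xml, and the function name heading_pairs is just for illustration. It assumes each vt:lpstr is followed by its vt:i4.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Office Open XML extended properties (app.xml).
NS = {
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
    "vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
}

# Cut-down stand-in for docProps/app.xml (illustrative only).
SAMPLE_APP_XML = """<Properties
    xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
    xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
    <HeadingPairs>
        <vt:vector size="2" baseType="variant">
            <vt:variant><vt:lpstr>Title</vt:lpstr></vt:variant>
            <vt:variant><vt:i4>1</vt:i4></vt:variant>
        </vt:vector>
    </HeadingPairs>
</Properties>"""

def heading_pairs(app_xml):
    """Return [(name, count), ...] from the HeadingPairs vector."""
    root = ET.fromstring(app_xml)
    pairs = []
    name = None
    for variant in root.findall("ep:HeadingPairs/vt:vector/vt:variant", NS):
        if len(variant) == 0:
            continue  # skip empty variants
        child = variant[0]
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name == "lpstr":
            name = child.text          # a string variant starts a new pair
        elif local_name == "i4" and name is not None:
            pairs.append((name, int(child.text)))  # the integer completes it
            name = None
    return pairs

print(heading_pairs(SAMPLE_APP_XML))
```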

Is there a better way to do this?

Thanks.
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 06, 2019, 02:41:19 PM
If put directly into a list, then rdf:Seq is definitely the type to use.  But maybe something like this makes more sense:

<XML:HeadingPairs>
<rdf:Bag>
  <rdf:li rdf:parseType='Resource'>
   <XML:name>Title</XML:name>
   <XML:value>1</XML:value>
  </rdf:li>
</rdf:Bag>
</XML:HeadingPairs>


(although appropriating the "XML" namespace is probably a bad idea -- you should define your own namespace)
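
As a rough illustration of that structure with a separate namespace, here is a Python sketch that builds one structured rdf:li per name/value pair. The namespace URI and the "om" prefix are made up for the example; substitute whatever namespace you define.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace URI and prefix, standing in for "your own namespace".
MY_NS = "http://example.org/ns/office-meta/1.0/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("om", MY_NS)

def build_heading_pairs(pairs):
    """Build a HeadingPairs element with one structured rdf:li per (name, value)."""
    heading_pairs = ET.Element(f"{{{MY_NS}}}HeadingPairs")
    bag = ET.SubElement(heading_pairs, f"{{{RDF_NS}}}Bag")
    for name, value in pairs:
        # rdf:parseType='Resource' marks the list item as a structure.
        li = ET.SubElement(
            bag, f"{{{RDF_NS}}}li", {f"{{{RDF_NS}}}parseType": "Resource"}
        )
        ET.SubElement(li, f"{{{MY_NS}}}name").text = name
        ET.SubElement(li, f"{{{MY_NS}}}value").text = str(value)
    return heading_pairs

element = build_heading_pairs([("Title", 1)])
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Because each name travels with its value inside one list item, the order of the items no longer carries meaning, so an unordered Bag is acceptable in this form.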

- Phil
Title: Re: HeadingPairs structure
Post by: thorsted on December 06, 2019, 03:07:11 PM
That does make more sense.

In my transform I am converting all the metadata into a key ID list, and HeadingPairs and TitlesOfParts are making it more complicated. I will use my own namespace.

Is this something you might change in the way ExifTool outputs the XML for Office Open XML? I was also curious about the use of the XML namespace in ExifTool.

Thanks.
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 06, 2019, 10:12:34 PM
ExifTool only composes XML with the -X option (and RDF/XML when writing XMP metadata).  Any XML it outputs otherwise is verbatim as it was found in the file.

- Phil
Title: Re: HeadingPairs structure
Post by: thorsted on December 09, 2019, 08:39:35 AM
Phil,

There is a little bit of restructuring. You can see from my first post the difference.

   <rdf:li>Title</rdf:li>
   <rdf:li>1</rdf:li>

vs.
   <vt:variant>
       <vt:lpstr>Title</vt:lpstr>
   </vt:variant>
   <vt:variant>
       <vt:i4>1</vt:i4>
   </vt:variant>


In the original we have the title as vt:lpstr and the value as vt:i4, instead of both being rdf:li as in the exiftool output.

Thanks.
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 09, 2019, 10:13:31 AM
What exiftool command are you using?
Title: Re: HeadingPairs structure
Post by: thorsted on December 09, 2019, 10:20:58 AM
Phil,

I am running exiftool -X to get the rdf:li output.

I am comparing that with docProps/app.xml in the DOCX file, which holds the actual metadata.
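
For anyone following along, a DOCX is a ZIP archive, so docProps/app.xml can be read directly. A minimal Python sketch follows; read_app_xml is a name made up for this example, UTF-8 encoding is assumed, and the demo uses a tiny in-memory archive rather than a real document.

```python
import io
import zipfile

def read_app_xml(docx):
    """Return docProps/app.xml from a DOCX (path or file-like object) as text."""
    with zipfile.ZipFile(docx) as archive:
        # Assumes the part is UTF-8 encoded.
        return archive.read("docProps/app.xml").decode("utf-8")

# Demo with an in-memory stand-in for a DOCX file (not a real document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/app.xml", "<Properties/>")
buffer.seek(0)

app_xml = read_app_xml(buffer)
print(app_xml)
```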
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 09, 2019, 10:54:05 AM
Ah.  I didn't understand that is what you were doing.  We should start again from the beginning.

You are suggesting a change to the -X output?  It is unlikely that I will be able to easily accommodate this request, but if you send me the sample DOCX file I'll take a look.  My email is philharvey66 at gmail.com

- Phil
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 10, 2019, 03:49:35 PM
I got the sample file, thanks.  I didn't have time to think about this today, but I'll post back after I've had a chance to look into it.

- Phil
Title: Re: HeadingPairs structure
Post by: Phil Harvey on December 11, 2019, 01:08:12 PM
I've had a chance to play with this a bit.

ExifTool is converting the elements of the vt:vector from the XML file into the values of a List-type tag.

From there, the -X option outputs the list in RDF/XML format.  But currently the type of list (Bag, Seq, or Alt) is not saved in the intermediate step, so ExifTool outputs all lists as "Bag" in the -X output.  About the only change I could make here is to add a patch to maintain the list type and output this as an RDF/XML "Seq" list, but...

1. The patch would be ugly

2. The resulting change is not very significant, and I don't see how it would make much difference for you.

But since the values are stored as a vector of separate items, the list of values is definitely the right thing for ExifTool to output.
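
If the Bag/Seq distinction matters downstream, it can also be handled after the fact on the consumer side. The following Python sketch relabels rdf:Bag as rdf:Seq for chosen tags in an already-generated RDF/XML document. This is post-processing, not something ExifTool does, and the sample input with its placeholder namespace URI is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BAG = f"{{{RDF_NS}}}Bag"
SEQ = f"{{{RDF_NS}}}Seq"

# Hypothetical input; the URI bound to the "XML" prefix is a placeholder.
SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:XML="http://example.org/placeholder/XML/">
  <XML:HeadingPairs>
    <rdf:Bag><rdf:li>Title</rdf:li><rdf:li>1</rdf:li></rdf:Bag>
  </XML:HeadingPairs>
  <XML:Keywords>
    <rdf:Bag><rdf:li>a</rdf:li><rdf:li>b</rdf:li></rdf:Bag>
  </XML:Keywords>
</rdf:RDF>
"""

def bags_to_seqs(root, tag_names):
    """Rename rdf:Bag to rdf:Seq directly under elements whose local name is in tag_names."""
    changed = 0
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name not in tag_names:
            continue
        for child in element:
            if child.tag == BAG:
                child.tag = SEQ  # only the label changes; the items stay as they are
                changed += 1
    return changed

root = ET.fromstring(SAMPLE)
count = bags_to_seqs(root, {"HeadingPairs", "TitlesOfParts"})
print(count)
```

The list contents are untouched, so this only records that the order is significant; it does not pair names with values.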

It sounds like you would prefer a more structured output.  However, even if the source XML was structured, ExifTool's parsing of XML is very simplistic, so even then ExifTool would require significant patching to achieve this goal.   Note that ExifTool doesn't officially support reading of arbitrary XML files, for good reason.

So I'm not sure what to do here.  If just changing from Bag to Seq makes a difference for you then I could look into this further.  Beyond that the only thing I could think of would be to add dedicated code to handle HeadingPairs and TitlesOfParts specially, but this solution would be very asymmetric and ugly to me.

- Phil