Adding PDF/A Compliant XMP Tags

Started by archivistLiz, July 21, 2019, 05:27:46 PM


archivistLiz

New to the forum, but not totally new to ExifTool. My question is primarily a syntax one.

I've edited my config file so that I can add custom XMP tags that still conform to the structure required for PDF/A compliance:

%Image::ExifTool::UserDefined = (
    # new XMP namespaces (eg. xxx) must be added to the Main XMP table:
    'Image::ExifTool::XMP::Main' => {
        # namespace definition for examples 8 to 11
        pdfaExtension => { # <-- must be the same as the NAMESPACE prefix
            SubDirectory => {
                TagTable => 'Image::ExifTool::UserDefined::pdfaExtension',
                # (see the definition of this table below)
            },
        },
        # add more user-defined XMP namespaces here...
        premis => { # <-- must be the same as the NAMESPACE prefix
            SubDirectory => {
                TagTable => 'Image::ExifTool::UserDefined::premis',
                # (see the definition of this table below)
            },
        },
    },
);

# This is a basic example of the definition for a new XMP namespace.
# This table is referenced through a SubDirectory tag definition
# in the %Image::ExifTool::UserDefined definition above.
# The namespace prefix for these tags is 'pdfaExtension', which corresponds to
# an ExifTool family 1 group name of 'XMP-pdfaExtension'.
%Image::ExifTool::UserDefined::pdfaExtension = (
     GROUPS => { 0 => 'XMP', 1 => 'XMP-pdfaExtension' },
     NAMESPACE => { 'pdfaExtension' => 'http://www.aiim.org/pdfa/ns/extension/' },
     WRITABLE => 'string', # (default to string-type tags)
     schemas => {
         List => 'Bag',
         Struct => {
             NAMESPACE => {'pdfaSchema' => 'http://www.aiim.org/pdfa/ns/schema#'},
             schema => {},
             namespaceURI => {},
             prefix => {},
             property => {
                 List => 'Seq',
                 Struct => {
                     NAMESPACE => {'pdfaProperty' => 'http://www.aiim.org/pdfa/ns/property#'},
                     name => {},
                     valueType => {},
                     category => {},
                     description => {},
                 }
             }
         },
         valueType => {
             List => 'Seq',
             Struct => {
                 NAMESPACE => {'pdfaType' => 'http://www.aiim.org/pdfa/ns/type#'},
                 type => {},
                 namespaceURI => {},
                 prefix => {},
                 description => {},
             }
         },
     },
);



# This is a basic example of the definition for a new XMP namespace.
# This table is referenced through a SubDirectory tag definition
# in the %Image::ExifTool::UserDefined definition above.
# The namespace prefix for these tags is 'premis', which corresponds to
# an ExifTool family 1 group name of 'XMP-premis'.
%Image::ExifTool::UserDefined::premis = (
     GROUPS => { 0 => 'XMP', 1 => 'XMP-premis' },
     NAMESPACE => { 'premis' => 'http://www.loc.gov/premis/v3' },
     WRITABLE => 'string', # (default to string-type tags)
     EventType => { WRITABLE => 'string' },
     EventDateTime => { WRITABLE => 'date' },
);

print "Working!\n";

#------------------------------------------------------------------------------
1;  #end


In order for this to work, I then have to define the premis tags to fit the PDF/A schema. This would define the tag EventType:
~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property={name=EventType, valueType=Text, category=internal, description=Event Types according to LOC}}" /filepath/name.pdf

If I just define one property for the schema and pass it a value with ~/ExifTool/exiftool "-XMP-premis:EventType=migration", it still creates a valid PDF. (I'm validating with veraPDF, verapdf.org.) But if I repeat the schema definition with a second resource, it is no longer valid.

I cannot figure out the syntax to add more than one resource. (In the config file, I also defined EventDateTime.) I assume it is a question of how to nest the curly braces, or maybe of using square ones. I tried the following, and none of them worked:
~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property={name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}}}" /file.pdf
Working!
Warning: Invalid structure field at 'EventType, EventDateTime}, ...'
    1 image files updated

:~$ ~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, [property={name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}]}}" /file.pdf
Working!
Warning: Invalid structure field at '[property={name={EventType,...'
    1 image files updated

:~$ ~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property=[{name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}]}}" /file.pdf
Working!
Warning: Invalid structure field at 'EventType, EventDateTime}, ...'
    1 image files updated


I would greatly appreciate any pointers to where I am going wrong. Is it the syntax of defining the schema, or am I missing something in the config file? Thanks in advance!

Phil Harvey

Quote from: archivistLiz on July 21, 2019, 05:27:46 PM
if I repeat the schema definition with a second resource, it is no longer valid.

What do you mean it is no longer valid?  If you are referring to the results from verapdf.org, then you should ask them.  It looks valid to me.  Repeating this command would add a new "schemas" structure to the rdf:Bag of schemas.  I tried this and the resulting XMP looks fine (except that you now have entries with identical namespaceURIs in your list, which may be a problem if these are meant to be unique).

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

archivistLiz

Thanks for the reply! Yes, the non-unique namespace URI, prefix, and schema definition seem to be the problem. (I have been in touch with people from veraPDF. The tool is sensitive, but it should work if I can get the XMP formatting right.) Is there a way to add both properties to the same schema? I tried a couple of different ways of placing the curly or square brackets, but none of them worked.

Phil Harvey

If you show me the XMP that you want to generate then I can help you with the config file and exiftool command to do this.

- Phil

archivistLiz

Right now the XMP looks like this:
~/ExifTool/exiftool -XMP -b /media/liz/disk/testpdfa/E0001_006.pdf
Working!
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 11.08'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about=''
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:format>application/pdf</dc:format>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
  <pdf:Producer>Adobe PDF Scan Library 3.2</pdf:Producer>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:pdfaExtension='http://www.aiim.org/pdfa/ns/extension/'
  xmlns:pdfaProperty='http://www.aiim.org/pdfa/ns/property#'
  xmlns:pdfaSchema='http://www.aiim.org/pdfa/ns/schema#'>
  <pdfaExtension:schemas>
   <rdf:Bag>
    <rdf:li rdf:parseType='Resource'>
     <pdfaSchema:namespaceURI>http://ns.adobe.com/pdf/1.3/</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>pdf</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>A name object indicating whether the document has been modified to include trapping information</pdfaProperty:description>
        <pdfaProperty:name>Trapped</pdfaProperty:name>
        <pdfaProperty:valueType>Text</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>Adobe PDF Schema</pdfaSchema:schema>
    </rdf:li>
    <rdf:li rdf:parseType='Resource'>
     <pdfaSchema:namespaceURI>http://ns.adobe.com/xap/1.0/mm/</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>xmpMM</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>UUID based identifier for specific incarnation of a document</pdfaProperty:description>
        <pdfaProperty:name>InstanceID</pdfaProperty:name>
        <pdfaProperty:valueType>URI</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>XMP Media Management Schema</pdfaSchema:schema>
    </rdf:li>
    <rdf:li rdf:parseType='Resource'>
     <pdfaSchema:namespaceURI>http://www.aiim.org/pdfa/ns/id/</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>pdfaid</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>Part of PDF/A standard</pdfaProperty:description>
        <pdfaProperty:name>part</pdfaProperty:name>
        <pdfaProperty:valueType>Integer</pdfaProperty:valueType>
       </rdf:li>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>Amendment of PDF/A standard</pdfaProperty:description>
        <pdfaProperty:name>amd</pdfaProperty:name>
        <pdfaProperty:valueType>Text</pdfaProperty:valueType>
       </rdf:li>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>Conformance level of PDF/A standard</pdfaProperty:description>
        <pdfaProperty:name>conformance</pdfaProperty:name>
        <pdfaProperty:valueType>Text</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>PDF/A ID Schema</pdfaSchema:schema>
    </rdf:li>
    <rdf:li rdf:parseType='Resource'>
     <pdfaSchema:namespaceURI>http://www.loc.gov/premis/v3</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>premis</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>Event Types according to LOC</pdfaProperty:description>
        <pdfaProperty:name>EventType</pdfaProperty:name>
        <pdfaProperty:valueType>Text</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>premisV3</pdfaSchema:schema>
    </rdf:li>
    <rdf:li rdf:parseType='Resource'>
     <pdfaSchema:namespaceURI>http://www.loc.gov/premis/v3</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>premis</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>DateTime digitized</pdfaProperty:description>
        <pdfaProperty:name>EventDateTime</pdfaProperty:name>
        <pdfaProperty:valueType>Date</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>premisV3</pdfaSchema:schema>
    </rdf:li>
   </rdf:Bag>
  </pdfaExtension:schemas>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:pdfaid='http://www.aiim.org/pdfa/ns/id/'>
  <pdfaid:conformance>B</pdfaid:conformance>
  <pdfaid:part>1</pdfaid:part>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:premis='http://www.loc.gov/premis/v3'>
  <premis:EventDateTime>2018:01:17 15:47:08+01:00</premis:EventDateTime>
  <premis:EventType>migration</premis:EventType>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:xmp='http://ns.adobe.com/xap/1.0/'>
  <xmp:CreateDate>2018-01-17T15:47:08+01:00</xmp:CreateDate>
  <xmp:CreatorTool>PFU ScanSnap Manager 6.2.14 #SV600</xmp:CreatorTool>
  <xmp:MetadataDate>2018-01-17T15:47:19+01:00</xmp:MetadataDate>
  <xmp:ModifyDate>2018-01-17T15:47:19+01:00</xmp:ModifyDate>
</rdf:Description>

<rdf:Description rdf:about=''
  xmlns:xmpMM='http://ns.adobe.com/xap/1.0/mm/'>
  <xmpMM:DocumentID>uuid:a8032adb-efcf-476b-849a-f3582c3cf1be</xmpMM:DocumentID>
  <xmpMM:InstanceID>uuid:bef4f71b-340c-4cf0-b163-bb30e3ce8fe9</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>


The critical part of it should look like this instead:

     <pdfaSchema:namespaceURI>http://www.loc.gov/premis/v3</pdfaSchema:namespaceURI>
     <pdfaSchema:prefix>premis</pdfaSchema:prefix>
     <pdfaSchema:property>
      <rdf:Seq>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>Event Types according to LOC</pdfaProperty:description>
        <pdfaProperty:name>EventType</pdfaProperty:name>
        <pdfaProperty:valueType>Text</pdfaProperty:valueType>
       </rdf:li>
       <rdf:li rdf:parseType='Resource'>
        <pdfaProperty:category>internal</pdfaProperty:category>
        <pdfaProperty:description>DateTime digitized</pdfaProperty:description>
        <pdfaProperty:name>EventDateTime</pdfaProperty:name>
        <pdfaProperty:valueType>Date</pdfaProperty:valueType>
       </rdf:li>
      </rdf:Seq>
     </pdfaSchema:property>
     <pdfaSchema:schema>premisV3</pdfaSchema:schema>
    </rdf:li>


Thanks again!

Phil Harvey

I see.  What you want to do is add another SchemasProperty.  So the second command would look like this:

exiftool -XMP-pdfaExtension:schemasproperty+="{name=EventDateTime, valueType=Date, category=internal, description=DateTime digitized}" FILE

I assume there is some reason that you don't just write them both to begin with by having two elements in the Schemas Property list.

However, if there are other Properties you will have trouble adding elements to the proper one.  For complex structures like this it is best to write the whole structure at once.  This is the trouble with nested lists (see the "Tricky" comment here).

- Phil

archivistLiz

I do want to add two elements in the Schemas Property list, but I seem to have the syntax wrong for a nested list. If I add another property as you suggested, it ends up at the wrong place in the XMP (because there are multiple pdfaSchemas).  I tried several different versions of it, but none of them worked. (I'm writing the property list wrong, and I'm not sure if it's bracket placement or type of bracket. The examples have a maximum of two curly brackets, this would have three.)
1)
~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property={name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}}}" /file.pdf
Working!
Warning: Invalid structure field at 'EventType, EventDateTime}, ...'
    1 image files updated

2)
:~$ ~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, [property={name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}]}}" /file.pdf
Working!
Warning: Invalid structure field at '[property={name={EventType,...'
    1 image files updated

3)
:~$ ~/ExifTool/exiftool -XMP-pdfaExtension:schemas+="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property=[{name={EventType, EventDateTime}, valueType={Text, Text}, category={internal, internal}, description={Event Types according to LOC, DateTime digitized}]}}" /file.pdf
Working!
Warning: Invalid structure field at 'EventType, EventDateTime}, ...'
    1 image files updated


None of these worked. Should I be using square or curly brackets, and should they be around all the things I define in property or individually around name, valueType, category, and description? Sorry to be such a pain, but I feel like it's so close to working, but I'm just not getting the last bit on my own.
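
One thing that can be read off the three attempts: in 2) and 3) the square bracket is closed before the curly brace that was opened inside it, and in 1) curly braces wrap a plain list of values, while the braces in the single-property command that worked only ever hold field=value pairs. A rough way to catch the first kind of mistake before calling exiftool is a small shell check like the sketch below. check_nesting is a hypothetical helper, not part of ExifTool. It only tracks { } [ ] and knows nothing about ExifTool's own parsing or escaping rules, so it would not flag attempt 1).

```shell
# Hypothetical helper (not part of ExifTool): succeeds only if every
# '{' and '[' in the argument is closed by the matching bracket, in order.
check_nesting() {
    s=$1
    stack=''
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}                     # first character of what is left
        s=${s#?}                            # drop that character
        case $c in
            '{'|'[')
                stack=$stack$c ;;           # remember the opening bracket
            '}'|']')
                top=${stack#"${stack%?}"}   # most recently opened bracket
                if [ "$c" = '}' ]; then want='{'; else want='['; fi
                [ "$top" = "$want" ] || return 1
                stack=${stack%?} ;;         # pop it
        esac
    done
    [ -z "$stack" ]                         # nothing may be left open
}

# Shortened stand-ins for the real arguments:
good='{schema=premisV3, property=[{name=EventType},{name=EventDateTime}]}'
bad='{schema=premisV3, property=[{name=EventType]}}'
check_nesting "$good" && echo "good: nesting ok"
check_nesting "$bad"  || echo "bad: brackets close in the wrong order"
```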

Phil Harvey

It is easy to figure out how the structure should be formatted.  Just read back the XMP with the -struct option.

Here is the result for the attached XMP file:

> exiftool test.xmp -struct -schemas
Schemas                         : [{NamespaceURI=http://ns.adobe.com/pdf/1.3/,Prefix=pdf,Property=[{Category=internal,Description=A name object indicating whether the document has been modified to include trapping information,Name=Trapped,ValueType=Text}],Schema=Adobe PDF Schema},{NamespaceURI=http://ns.adobe.com/xap/1.0/mm/,Prefix=xmpMM,Property=[{Category=internal,Description=UUID based identifier for specific incarnation of a document,Name=InstanceID,ValueType=URI}],Schema=XMP Media Management Schema},{NamespaceURI=http://www.aiim.org/pdfa/ns/id/,Prefix=pdfaid,Property=[{Category=internal,Description=Part of PDF/A standard,Name=part,ValueType=Integer},{Category=internal,Description=Amendment of PDF/A standard,Name=amd,ValueType=Text},{Category=internal,Description=Conformance level of PDF/A standard,Name=conformance,ValueType=Text}],Schema=PDF/A ID Schema},{NamespaceURI=http://www.loc.gov/premis/v3,Prefix=premis,Property=[{Category=internal,Description=Event Types according to LOC,Name=EventType,ValueType=Text},{Category=internal,Description=DateTime digitized,Name=EventDateTime,ValueType=Date}],Schema=premisV3}]


- Phil

archivistLiz

Ahhh, thank you, I finally got it and was able to produce the right result. Thank you so much!
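
For anyone landing here later: the thread does not show the final command, so the following is only a sketch reconstructed from the -struct output above. It assumes the config file from the first post, and that lowercase field names are accepted as they were in the earlier single-property command. The pattern is that each {...} is one structure of field=value pairs, and the property list is a [...] holding one such structure per property. The shell variables prop1, prop2 and schema are just illustrative names, and FILE is a placeholder.

```shell
# One structure per property: {field=value, field=value, ...}
prop1='{name=EventType, valueType=Text, category=internal, description=Event Types according to LOC}'
prop2='{name=EventDateTime, valueType=Date, category=internal, description=DateTime digitized}'

# The schema entry is itself a structure; its "property" field is a list [...]
schema="{schema=premisV3, namespaceURI=http://www.loc.gov/premis/v3, prefix=premis, property=[$prop1,$prop2]}"

# Show the argument that would be passed to exiftool
printf '%s\n' "$schema"

# Then write the whole entry in one go (uncomment and replace FILE):
# exiftool -XMP-pdfaExtension:schemas+="$schema" FILE
```

Writing the entry in a single command follows the advice earlier in the thread to write complex nested structures all at once rather than appending to an inner list afterwards.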