Problem with list items

Started by barbaraf, March 04, 2011, 08:34:57 PM


barbaraf

We're having a problem with some pdfs where the author names are not being treated as individual list items.

I was hoping to be able to use ExifTool to extract the xmp info in a batch mode for reviewing to see which files had the problem.

Unfortunately, I can't seem to make ExifTool list the names as separate items, no matter what I try.

Is there any way to get the same style of output for lists using ExifTool that you get using Acrobat's Advanced Metadata dump? I haven't been able to find a way to make Acrobat dump the data from a batch of pdfs, which is what brought me to ExifTool in the first place, given its great batch capabilities.

This is what Acrobat gives:

         <dc:creator>
            <rdf:Seq>
               <rdf:li>A. B. Charles</rdf:li>
               <rdf:li>D. E. Frank</rdf:li>
               <rdf:li>G. H. Ingels</rdf:li>
            </rdf:Seq>
         </dc:creator>


This is what ExifTool gives (with 1 list item vs. separate items):

         <dc:creator>
            <rdf:Seq>
               <rdf:li>A. B. Charles, D. E. Frank, G. H. Ingels</rdf:li>
            </rdf:Seq>
         </dc:creator>

I'm using the following command:

c:\projects\xmp\bin\exiftool -xmp -b testfile.pdf > testfile.xmp

I have tried the Perl module as well as the command-line version, but I'm not having any luck there either.

I'm struggling trying to make sense of all of the options in both versions of the ExifTool interface, and welcome any help you can give.

Thanks,
Barbara

Phil Harvey

Hi Barbara,

I don't know how this information was written. Reading FAQ number 17 may help since it explains the technique of writing list-type tags.

Note that the corresponding PDF tag (PDF:Author) is not a list-type tag, so if you are copying from this tag you will not get a list.  (See the PDF Info tags documentation.)  However, if this is the case you could create a user-defined tag to split the string at commas -- I could help with this if you want.

- Phil

...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

barbaraf

Thanks for the info and help, Phil.

Part of the problem is that I don't know how the original information was written in the pdfs either, as we have a number of pdfs from different sources. We have to identify the pdfs where the author (specifically, the dc:creator) was stored incorrectly as a single string and correct these. For example (excerpt from Acrobat's Document Properties > Additional Metadata > Advanced > Save as xmp):


         <dc:creator>
            <rdf:Seq>
               <rdf:li>P. Tang, G. T. Fry; D. D. Davids, R. P. Rife</rdf:li>
            </rdf:Seq>
         </dc:creator>


...is wrong and needs to be split into separate list items. This pdf also shows the creator as 1 list item when I look at the data in the Acrobat application via the Document Properties > Additional Metadata > Advanced panel, while correct pdfs show a separate list item for each name.

ExifTool definitely lets me get the job done. Once I have extracted the data, I can parse it and resubmit it correctly as separate author fields.

I was just looking for something that would show me a before and after picture so that I could see whether the data was split into separate fields as it should be. My hope was that I could use ExifTool to dump all the pre-conversion data in the same format that Acrobat shows, with one or multiple list items. If I use the sep option like so:

c:\projects\xmp\bin\exiftool -sep ", " -tagsFromFile testfile.pdf -@ c:\projects\xmp\bin\pdf2xmp.args testfile.xmp

I get:


  <dc:creator>
   <rdf:Seq>
    <rdf:li>P. Tang</rdf:li>
    <rdf:li>G. T. Fry; D. D. Davids</rdf:li>
    <rdf:li>R. P. Rife</rdf:li>
   </rdf:Seq>
  </dc:creator>


It makes it look like there are 3 separate authors, where Acrobat is only seeing and showing this as one author.

I'm just not getting how I can extract the data in some way to show the problem files, which leaves me calling up the Doc Properties / Additional Metadata / Advanced panel for each and every pdf and reviewing the data that way. Am I just missing it?
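One possible way to get that before picture is sketched below in Python. It assumes each pdf's XMP has already been saved to its own file with the `exiftool -xmp -b` command shown earlier in the thread. The function names (`creator_items`, `looks_unsplit`) and the separator list are made up for illustration and are not part of ExifTool. The script reports the dc:creator items exactly as stored, one per rdf:li, so a file with a single comma-joined item stands out.

```python
import sys
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core as used in XMP.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Small stand-in for an extracted XMP file (hypothetical sample data).
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:creator>
    <rdf:Seq>
     <rdf:li>P. Tang, G. T. Fry; D. D. Davids, R. P. Rife</rdf:li>
    </rdf:Seq>
   </dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def creator_items(xmp):
    """Return the dc:creator entries as stored, one string per rdf:li."""
    root = ET.fromstring(xmp)
    return [li.text or "" for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]


def looks_unsplit(items, separators=(", ", "; ")):
    """Heuristic only: exactly one list item that contains a name separator."""
    return len(items) == 1 and any(sep in items[0] for sep in separators)


if __name__ == "__main__":
    # Usage: python check_creators.py file1.xmp file2.xmp ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            items = creator_items(fh.read())
        flag = "CHECK" if looks_unsplit(items) else "ok"
        print(f"{flag}\t{path}\t{len(items)} item(s)\t{items}")
```

The check is deliberately crude: a single author whose name legitimately contains a comma would also be flagged, so the output is a list of files to review rather than a verdict.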

Thanks,
Barbara

Phil Harvey

Hi Barbara,

I think I understand now, thanks for explaining.  The config file below will generate a user-defined tag called "CreatorSplit" only if XMP-dc:Creator holds a single value (a plain string or a one-item list) and that string contains one or more comma-space sequences.  With this config file, you can use the following command line to fix the problem files.  Any files without problems won't be affected:

exiftool "-xmp-dc:creator<creatorsplit" DIR

Note that this only handles the case of a comma-space separator between names.

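# User-defined Composite tag: CreatorSplit
%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Composite' => {
        CreatorSplit => {
            # only generated for files that contain XMP-dc:Creator
            Require => 'XMP-dc:Creator',
            # returns a list of names, or undef (no tag) if there is nothing to split
            ValueConv => q{
                if (ref $val eq 'ARRAY') {
                    return undef unless @$val == 1;
                    $val = $$val[0];
                }
                my @vals = split ', ', $val;
                return undef unless @vals > 1;
                return \@vals;
            },
        },
    },
);
1;  #end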
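
To make the splitting rule in the config above easier to follow, here is a rough Python mirror of the ValueConv logic. It is an illustration only, not something ExifTool runs, and the name `creator_split` is made up. It returns None where the config returns undef (meaning no CreatorSplit tag is generated and the file is left alone).

```python
def creator_split(val, sep=", "):
    """Mirror of the CreatorSplit ValueConv: split one stored value at comma-space."""
    if isinstance(val, list):
        # A list with zero or several items is left alone (already split, or empty).
        if len(val) != 1:
            return None
        val = val[0]
    parts = val.split(sep)
    # Perl's split drops trailing empty fields; do the same here.
    while parts and parts[-1] == "":
        parts.pop()
    # Nothing to fix if the string did not contain the separator.
    return parts if len(parts) > 1 else None


if __name__ == "__main__":
    print(creator_split("A. B. Charles, D. E. Frank, G. H. Ingels"))
    print(creator_split(["A. B. Charles", "D. E. Frank", "G. H. Ingels"]))
    print(creator_split("P. Tang, G. T. Fry; D. D. Davids, R. P. Rife"))
```

The last example shows the limitation of a comma-space-only rule: the semicolon-separated pair "G. T. Fry; D. D. Davids" stays together as one item, so files with mixed separators would still need a second pass or a broader pattern.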


(See the sample config file for instructions on installing the config file.)

- Phil

Note: If you are using Safari, cutting and pasting the above code may not work because you may get Unicode non-breaking spaces instead of normal spaces.  If this happens, either use another browser or replace every character that looks like a space with a normal space.

barbaraf

Hi Phil,

Sorry for the delay after your last post...Your suggestion was quite helpful. Thanks for all of your help on this.

Barbara