Extracting binary data from multiple XMP elements

Started by martinrwilson, September 16, 2011, 05:55:55 AM

martinrwilson

Hi,
I have an InDesign (.indd) document whose XMP data contains multiple elements with the same tag (<rdf:li> elements within an <rdf:Seq> element), each containing binary data.

I actually only want the data in the first element, which does seem to be what happens if I refer to the element by tag name, e.g.
exiftool -xmp:pageimage -b test.indd > thumbnail.jpg

So, my question is:
- What does ExifTool do with multiple elements that contain binary data with the same tag name? It seems to just output the first, which is what I want - is this the case and, if so, will this continue to be the case in the future? (Is there a safer way to get the first?)

I found some information in this post, but it doesn't make clear what happens when multiple elements match:
/exiftool/forum/index.php/topic,2105.msg9233.html#msg9233

Any help will be much appreciated!
Thanks,
Martin

Phil Harvey

Hi Martin,

You're quite right, I thought this was documented but I can't find where.  I will add it to the -b documentation.  However, the post you linked explains it well:

When extracting lists with the -b option, all list items from a single tag are extracted separated by newlines. (I think this is what you are calling "multiple matching elements".) This behaviour will not change.  Currently you can't address individual list items from the command line.  I have toyed with the idea of adding this feature, but the way I have done things this would be more difficult than it seems.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

martinrwilson

Thanks for your response.
Just to be clear - my observation that only the first image (in this case) is being output is wrong - in fact all the images (from data matching the tag) will be output, separated by newlines.
So I'm probably ending up with the bytes from two JPEGs in one file (which, surprisingly, displays fine as the first image).
FYI, the relevant XMP data is shown below, showing the two images (in <xmpGImg:image> elements):

<xmp:PageInfo>
    <rdf:Seq>
        <rdf:li rdf:parseType="Resource">
            <xmpTPg:PageNumber>1</xmpTPg:PageNumber>
            <xmpGImg:format>JPEG</xmpGImg:format>
            <xmpGImg:width>256</xmpGImg:width>
            <xmpGImg:height>256</xmpGImg:height>
            <xmpGImg:image>[lots of bytes making up the image]</xmpGImg:image>
        </rdf:li>
        <rdf:li rdf:parseType="Resource">
            <xmpTPg:PageNumber>2</xmpTPg:PageNumber>
            <xmpGImg:format>JPEG</xmpGImg:format>
            <xmpGImg:width>256</xmpGImg:width>
            <xmpGImg:height>256</xmpGImg:height>
            <xmpGImg:image>[lots of bytes making up the image]</xmpGImg:image>
        </rdf:li>
    </rdf:Seq>
</xmp:PageInfo>
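
For anyone who would rather read the images straight out of an XMP packet shaped like the one above, here is a rough Python sketch. It assumes the <xmpGImg:image> text is base64-encoded and that the packet uses the standard Adobe namespace URIs. The names page_images and sample_packet, and the stand-in payloads, are made up for illustration and do not come from ExifTool.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URIs assumed to be the standard Adobe/RDF ones.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "xmpTPg": "http://ns.adobe.com/xap/1.0/t/pg/",
    "xmpGImg": "http://ns.adobe.com/xap/1.0/g/img/",
}


def page_images(xmp_xml):
    """Return {page number: decoded image bytes} for each rdf:li under xmp:PageInfo."""
    root = ET.fromstring(xmp_xml)
    pages = {}
    for item in root.iterfind(".//xmp:PageInfo/rdf:Seq/rdf:li", NS):
        number = item.find("xmpTPg:PageNumber", NS)
        image = item.find("xmpGImg:image", NS)
        if number is None or image is None or not image.text:
            continue  # skip entries without a page number or image payload
        # Assumption: the element text is base64; b64decode ignores line breaks.
        pages[int(number.text)] = base64.b64decode(image.text)
    return pages


def sample_packet(payloads):
    """Build a tiny stand-in packet (hypothetical data, not a real InDesign file)."""
    decls = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    items = "".join(
        '<rdf:li rdf:parseType="Resource">'
        f"<xmpTPg:PageNumber>{n}</xmpTPg:PageNumber>"
        f"<xmpGImg:image>{base64.b64encode(data).decode('ascii')}</xmpGImg:image>"
        "</rdf:li>"
        for n, data in enumerate(payloads, start=1)
    )
    return (
        f"<rdf:RDF {decls}><rdf:Description>"
        f"<xmp:PageInfo><rdf:Seq>{items}</rdf:Seq></xmp:PageInfo>"
        "</rdf:Description></rdf:RDF>"
    )
```

Writing pages[1] to a file opened in binary mode would then give the first page image on its own.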

Many thanks,
Martin

martinrwilson

Ok, I've just verified that this is the case.
Thanks for your help.
Regards,
Martin

Phil Harvey

Hi Martin,

Yes, you can add any random data to the end of a JPEG image without causing problems.  In this case it isn't random data, but a newline followed by the other images in the list.

However, exiftool may also be used to remove any JPEG trailer:

exiftool -trailer:all= a.jpg

So doing this on the -b output effectively gives you the first JPEG from the list.

- Phil

jbverschoor

I'm kind of stuck :-)

I'd like to extract all the images :-)
Ideally, I'd first extract "Page Image Page Number"
And then extract each page in the list, or just a single page
Is this already possible?

Page Image Page Number          : 1, 2
Page Image Format               : JPEG, JPEG
Page Image Width                : 256, 256
Page Image Height               : 256, 256
Page Image                      : (Binary data 8544 bytes, use -b option to extract), (Binary data 6012 bytes, use -b option to extract)

Phil Harvey

Unfortunately the command-line application doesn't have a feature to allow a single item to be extracted from a list, so the best you can do is to write them all to a single file:

exiftool -pageimage -b SOURCEFILE > out.jpg

But then you would have to split up the output jpg to recover the individual pages.  The file would be split at the "ff d9 0a ff d8" (hex) pattern.
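
The split described above could be sketched in Python roughly as follows. This is only an illustration: the function name split_jpegs is hypothetical, and it assumes the five-byte pattern never occurs inside a single image, which is likely but not guaranteed.

```python
# End-of-image marker, newline separator, start-of-image marker.
BOUNDARY = bytes.fromhex("ffd90affd8")


def split_jpegs(blob):
    """Split newline-joined JPEGs at each 'ff d9 0a ff d8' boundary."""
    if not blob:
        return []
    images = []
    start = 0
    while True:
        hit = blob.find(BOUNDARY, start)
        if hit == -1:
            images.append(blob[start:])  # last (or only) image
            break
        images.append(blob[start:hit + 2])  # keep the closing ff d9
        start = hit + 3  # drop the newline; the next image starts at ff d8
    return images
```

For example, reading out.jpg in binary mode, passing the bytes to split_jpegs, and writing each returned item to its own file would recover the individual pages.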

An alternative is to write a simple Perl script using the ExifTool API to do what you want.  This is trivial, and I can help you with this if you have Perl installed.

Phil Harvey

I have a solution:

Since this has come up in the past, I will add a -listItem option to extract a specific item from a list.  Then you could do what you want with this:

exiftool -pageimage -b -listItem 0 SOURCEFILE > page1.jpg
exiftool -pageimage -b -listItem 1 SOURCEFILE > page2.jpg

This feature will appear in ExifTool 8.70.
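
If you want to script this, a minimal Python wrapper might look like the sketch below. It assumes ExifTool 8.70 or later is installed and reachable as "exiftool" on the PATH. The helper names list_item_command and extract_page are hypothetical, and the index is zero-based as in the commands above.

```python
import subprocess


def list_item_command(source, index, tag="pageimage"):
    """Build the exiftool argument list for one list item (index is zero-based)."""
    return ["exiftool", f"-{tag}", "-b", "-listItem", str(index), source]


def extract_page(source, index, dest):
    """Run exiftool and write the extracted bytes to dest.

    Assumes an 'exiftool' executable (8.70 or later) is on the PATH.
    """
    result = subprocess.run(
        list_item_command(source, index),
        check=True,           # raise if exiftool exits with an error
        capture_output=True,  # collect the binary output from stdout
    )
    with open(dest, "wb") as out:
        out.write(result.stdout)
```

Calling extract_page("test.indd", 0, "page1.jpg") would then correspond to the first command above.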

- Phil