Difficulty with PDF embedded images and reading specific values

Started by PiratePete, January 20, 2015, 12:53:22 PM

PiratePete

Apologies if it's covered somewhere, I did a fair amount of googling and experimentation before this post, and I've used this awesome tool before.

If I use the Verbose switch, I get all the data of the PDF in question. However, I cannot figure out what the switch(es) are for the specific values I want. I can see the tag names in the verbose readout, but when I try to request them I get no data returned. I'm using the command line directly.

If I just issue the regular exiftool command with no switch, I get output from which I can request specific values, such as PDFVersion, Linearized, etc. However, with the verbose switch, I see "PDF dictionary (1 of 1) with 4 entries:" and I am unable to crawl this data, which I assume is for the embedded images. Looking at the tags between " [  ]" I see familiar constructs such as Root, Pages, Kids, Resources, XObject and so on, including the information I want. However if I issue a command like

exiftool -PDF:Root:Pages:Kids:Resources:XObject myfile.pdf

I have also tried simpler things, such as:
exiftool -PDF:all myfile.pdf

And that offers no output either so I assume anything under the PDF tag will not work either.  I also thought perhaps those values were being calculated but -composite is empty as well. Any ideas? It's probably something dumb on my part that I glossed over...

With commands like these, I get no data returned. No errors, just no data. Here's a screenshot of the Verbose output that I'd like to capture specific values from. I just can't see what's wrong. I experimented with explicitly using -ee and a wide range of top-level tag names to get even the simplest output in return, in the hope of extending that, but no dice.

http://imgur.com/pgfDlNr

Phil Harvey

ExifTool will not extract just anything from a PDF file.  It is a metadata utility, so in general I restrict the pre-defined tags to metadata only.  See the PDF documentation for a list of tags that are extracted.
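
As a rough sketch, the pre-defined tag names can also be listed from the command line with the -list option restricted to the PDF group. This shows the names ExifTool knows about, not what a particular file contains. The guard around the call is an addition for illustration, so that the snippet only prints the command if exiftool is not on the PATH:

```shell
# Arguments: list the valid tag names, limited to the PDF group.
set -- -list -PDF:all

if command -v exiftool >/dev/null 2>&1; then
    exiftool "$@"
else
    # exiftool is not installed here, so just show the command.
    echo "would run: exiftool $*"
fi
```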

If you want to extract anything else, you will need to create user-defined tags in a config file, but this may be very complicated to do.

ExifTool really isn't designed to disassemble a PDF to extract all embedded files/images.  For this, there may be other utilities that are more suitable.

- Phil

Edit: Apparently it does more than I remembered.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

PiratePete

Thanks for the reply Phil. So then, should I assume that the tag names in the verbose output only coincidentally match those listed under the PDF tag in your link, and that they show up only as a result of the verbose switch "grabbing everything", so they can't be requested individually? I guess what I was originally after is how the verbose switch knows about these embedded images, their properties, and their values (per my screenshot) when the regular switches can't discern them.

I'll look into a custom defined file to attempt this then. Appreciate your time, thanks again for such a great utility.

Phil Harvey

Oh, right.  XObject is listed in the PDF tag documentation, and on reading the documentation I see that the contained image information is extracted when the ExtractEmbedded (-ee) option is used.  I had forgotten about this.  So you can extract all embedded images with this command:

exiftool -ee -embeddedimage -b -W %t%c.%s some.pdf

This will create a bunch of files called "embedded###.jpg" in the current directory (if the embedded images are JPG format).
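
To read the values for the embedded images rather than save the images themselves, a sketch along these lines may help. It combines -ee with -a (so tags repeated for each image are all shown), -G3 (to prefix each tag with its document number, such as Main or Doc1) and -s (to print tag names instead of descriptions). The file name some.pdf is a placeholder, and the guard is an addition for illustration so that the snippet only prints the command when exiftool or the file is missing:

```shell
pdf="some.pdf"   # placeholder file name

# -ee : also process embedded documents (here, the images inside the PDF)
# -a  : allow duplicate tag names, since each image repeats the same tags
# -G3 : show the document number group for each tag
# -s  : print tag names rather than descriptions
set -- -ee -a -G3 -s "$pdf"

if command -v exiftool >/dev/null 2>&1 && [ -f "$pdf" ]; then
    exiftool "$@"
else
    # exiftool or the file is missing, so just show the command.
    echo "would run: exiftool $*"
fi
```

Any tag name that appears in this listing can then be requested on its own together with -ee, in the same way -embeddedimage is used in the command above.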

I should learn not to rely on my memory when describing exiftool capabilities.  I am continually surprised by all of the cool things that it does. :)

Good thing it's all documented. ;)

- Phil