NEWBIE! PDF question

Started by DaveY, February 27, 2013, 07:26:25 PM

DaveY

Hi all,

Newbie here, so be gentle!

Is it possible, using ExifTool, to extract info about any embedded images and colour info from a PDF file?

Thanks in advance,
Dave.

Phil Harvey

Hi Dave,

Sorry, ExifTool doesn't currently extract information from embedded images in PDF files.

You can use the -v3 option to see everything that ExifTool is currently decoding in the PDF.  I'm not sure if you will be able to find the colour information, but it may be in there.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

DaveY

Thanks for the reply Phil.

Looks like I might have to use DynaPDF - rather expensive - might blow the budget for this particular project.

BTW: Exiftool, for other projects I've used it on - is a great tool!

Best regards,
Dave.

Phil Harvey

Hi Dave,

I have had another request for this feature recently, so I took a look to see what would be involved in extracting information from embedded objects... It looks pretty simple actually.  I'll see what I can do to add this ability to the next release.  The only trick will be if different applications store embedded images differently.  I just tried writing a PDF with Word, and with the -v option I can see that it puts the embedded image in a Root/Pages/Kids/Resources/XObject dictionary.  All I need is to parse the stream for each of these objects.

There is also a Root/Pages/Kids/Resources/ColorSpace dictionary.  Could this contain the colour info you wanted?

- Phil

DaveY

Hi again Phil - I didn't expect another response - so thank you kindly.

Had a look at the ColorSpace dictionary and I should be able to glean from that what I need.. great!

As for the image info extraction - if you reckon it can be done quite easily, then I'm more than happy to wait for your next build.

Best regards,
Dave.

1)  ColorSpace (SubDirectory) -->
  | | | | |     - Tag 'ColorSpace', direct dictionary
  | | | | | + [ColorSpace directory with 5 entries]
  | | | | | | 0)  Cs5 = [/Separation,/0r#2039g#2094b,/DeviceCMYK,ref(53 0 R)]
  | | | | | |     - Tag 'Cs5', indirect object (18 0 R)
  | | | | | | 1)  Cs3 = [/Separation,/Process#20Cyan,/DeviceCMYK,ref(54 0 R)]
  | | | | | |     - Tag 'Cs3', indirect object (14 0 R)
  | | | | | | 2)  Cs2 = [/ICCBased,ref(55 0 R)]
  | | | | | |     - Tag 'Cs2', indirect object (13 0 R)
  | | | | | | 3)  Cs1 = [/Separation,/All,ref(58 0 R),ref(59 0 R)]
  | | | | | |     - Tag 'Cs1', indirect object (7 0 R)
  | | | | | | 4)  Cs4 = [/Separation,/0c#20100m#20100y#200k,/DeviceCMYK,ref(57 0 R)]
  | | | | | |     - Tag 'Cs4', indirect object (16 0 R)
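
For anyone who wants to pull the colour names out of a dump like the one above, here is a minimal Python sketch. It assumes the verbose output keeps the exact "N)  Name = [...]" layout shown here, which may differ between ExifTool versions. The function names are made up for illustration. The "#20" sequences are PDF name escapes (two hex digits), so "#20" decodes to a space.

```python
import re

# Matches entry lines such as:
#   | | | 1)  Cs3 = [/Separation,/Process#20Cyan,/DeviceCMYK,ref(54 0 R)]
# (assumed layout, copied from the dump above)
ENTRY_RE = re.compile(r"\d+\)\s+(\w+)\s*=\s*\[(.*)\]\s*$")


def decode_pdf_name(name):
    """Decode #xx hex escapes in a PDF name, e.g. 'Process#20Cyan'."""
    return re.sub(r"#([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), name)


def parse_colorspaces(verbose_text):
    """Map each ColorSpace key (Cs1, Cs2, ...) to its list of array items."""
    spaces = {}
    for line in verbose_text.splitlines():
        m = ENTRY_RE.search(line)
        if not m:
            continue  # skip "- Tag ..." lines and directory headers
        key, body = m.groups()
        items = []
        for part in body.split(","):
            part = part.strip()
            if part.startswith("/"):
                # A PDF name: drop the slash and decode escapes
                items.append(decode_pdf_name(part[1:]))
            else:
                # Anything else (e.g. an object reference) is kept as text
                items.append(part)
        spaces[key] = items
    return spaces
```

With the dump above as input, the Separation entries would come out with readable colorant names such as "Process Cyan" and "0c 100m 100y 0k", and the object references are left as plain strings.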

Phil Harvey

Hi Dave,

I actually have a working version now that extracts information from embedded documents for which I have already written a filter.  I have run it on a bunch of PDFs already, and the filters I am missing are DCT, JPX, CCITTFax and LZW.  Adding these could be a pain.

The real problem I see with this new addition is that it is extremely SLOW.  Doing the decryption and decoding of embedded images is very time consuming (since it is a pure Perl implementation), and it seems that ExifTool can quite easily spend minutes doing this for a large PDF document with lots of embedded images.

- Phil

Phil Harvey

Update:

ExifTool 9.21 added the ability to extract metadata from embedded images (or the embedded images themselves) in PDF documents.

ExifTool 9.23 added the ability to write a separate output file for each extracted tag.  With these new features, embedded JPG and JP2 images in PDF documents can be extracted like this:

exiftool -ee -embeddedimage -b -W pics/%g3.%s FILE.pdf

(where FILE.pdf is the name of the source PDF file)

This command will create a "pics" directory containing the embedded images.  The image files will be named according to the embedded document number (the ExifTool family 3 group name), i.e. "Doc1.jpg", "Doc2.jp2", etc.
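
If the extraction needs to run from a script, something like the following Python sketch might do. It only wraps the command quoted above. The function names and the "pics" default are assumptions for illustration, and it assumes exiftool (9.23 or later) is on the PATH. Passing the arguments as a list avoids shell quoting issues altogether.

```python
import subprocess


def build_extract_command(pdf_path, out_dir="pics"):
    """Return the argument list for the extraction command quoted above."""
    return [
        "exiftool",
        "-ee",                        # process embedded documents
        "-embeddedimage",             # the tag holding each embedded image
        "-b",                         # output the tag value in binary form
        "-W", f"{out_dir}/%g3.%s",    # one output file per extracted tag
        pdf_path,
    ]


def extract_embedded_images(pdf_path, out_dir="pics"):
    """Run exiftool and return the CompletedProcess for inspection."""
    # Assumption: exiftool is installed and found on PATH.
    return subprocess.run(
        build_extract_command(pdf_path, out_dir),
        capture_output=True,
        text=True,
        check=False,
    )
```

Calling extract_embedded_images("FILE.pdf") should then behave like the command line above. The returned object's stdout and stderr hold whatever exiftool printed.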

- Phil