NEWBIE! PDF question

DaveY · February 27, 2013, 07:26:25 PM

Hi all,

Newbie here, so be gentle!

Is it possible, using ExifTool, to extract info about any embedded images and colour info from a PDF file?

Thanks in advance,
Dave.

Phil Harvey · February 27, 2013, 07:53:28 PM

Hi Dave,

Sorry, ExifTool doesn't currently extract information from embedded images in PDF files.

You can use the -v3 option to see everything that ExifTool is currently decoding in the PDF. I'm not sure if you will be able to find the colour information, but it may be in there.

- Phil

DaveY · February 28, 2013, 03:58:45 AM

Thanks for the reply Phil.

Looks like I might have to use DynaPDF - rather expensive - might blow the budget for this particular project.

BTW: ExifTool has been a great tool on the other projects I've used it for!

Best regards,
Dave.

Phil Harvey · February 28, 2013, 07:57:59 AM

Hi Dave,

I have had another request for this feature recently, so I took a look to see what would be involved in extracting information from embedded objects... It looks pretty simple actually. I'll see what I can do to add this ability to the next release. The only trick will be if different applications store embedded images differently. I just tried writing a PDF with Word, and with the -v option I can see that it puts the embedded image in a Root/Pages/Kids/Resources/XObject dictionary. All I need is to parse the stream for each of these objects.

There is also a Root/Pages/Kids/Resources/ColorSpace dictionary. Could this contain the colour info you wanted?

- Phil

DaveY · February 28, 2013, 12:15:34 PM

Hi again Phil - I didn't expect another response - so thank you kindly.

Had a look at the ColorSpace dictionary and I should be able to glean what I need from that... great! Here is the relevant part of the verbose output:

As for the image info extraction - if you reckon it can be done quite easily, then I'm more than happy to wait for your next build.

Best regards,
Dave.

1) ColorSpace (SubDirectory) -->
| | | | | - Tag 'ColorSpace', direct dictionary
| | | | | + [ColorSpace directory with 5 entries]
| | | | | | 0) Cs5 = [/Separation,/0r#2039g#2094b,/DeviceCMYK,ref(53 0 R)]
| | | | | | - Tag 'Cs5', indirect object (18 0 R)
| | | | | | 1) Cs3 = [/Separation,/Process#20Cyan,/DeviceCMYK,ref(54 0 R)]
| | | | | | - Tag 'Cs3', indirect object (14 0 R)
| | | | | | 2) Cs2 = [/ICCBased,ref(55 0 R)]
| | | | | | - Tag 'Cs2', indirect object (13 0 R)
| | | | | | 3) Cs1 = [/Separation,/All,ref(58 0 R),ref(59 0 R)]
| | | | | | - Tag 'Cs1', indirect object (7 0 R)
| | | | | | 4) Cs4 = [/Separation,/0c#20100m#20100y#200k,/DeviceCMYK,ref(57 0 R)]
| | | | | | - Tag 'Cs4', indirect object (16 0 R)
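
The colorant names in the dump above use the PDF name escape syntax, where `#20` is a hex-encoded space. A minimal Python sketch for pulling the colour space type and a readable colorant name out of lines like these follows. The function names `decode_pdf_name` and `parse_colorspaces` are made up for illustration, and the line layout is assumed to match the verbose output quoted in this post.

```python
import re

def decode_pdf_name(name: str) -> str:
    """Decode #xx hex escapes in a PDF name, e.g. '/Process#20Cyan' -> 'Process Cyan'."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name.lstrip("/"),
    )

# Matches lines such as "0) Cs5 = [/Separation,/0r#2039g#2094b,...]"
ENTRY = re.compile(r"(\w+) = \[(.*)\]")

def parse_colorspaces(verbose_text: str) -> dict:
    """Map each colour space key (Cs1, Cs2, ...) to (type, colorant name or None)."""
    spaces = {}
    for key, body in ENTRY.findall(verbose_text):
        parts = body.split(",")
        kind = parts[0].lstrip("/")
        # Only Separation spaces carry a colorant name as their second element.
        colorant = decode_pdf_name(parts[1]) if kind == "Separation" else None
        spaces[key] = (kind, colorant)
    return spaces

sample = """\
| 0) Cs5 = [/Separation,/0r#2039g#2094b,/DeviceCMYK,ref(53 0 R)]
| 2) Cs2 = [/ICCBased,ref(55 0 R)]
| 3) Cs1 = [/Separation,/All,ref(58 0 R),ref(59 0 R)]
"""

for key, (kind, colorant) in parse_colorspaces(sample).items():
    print(key, kind, colorant)
```

This only reads the text that `-v` prints. It does not follow the indirect references (for example `ref(55 0 R)`), so the ICC profile itself is not examined.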

Phil Harvey · February 28, 2013, 01:03:54 PM

Hi Dave,

I actually have a working version now that extracts information from embedded documents for which I have already written a filter. I have run it on a bunch of PDFs already, and the filters I am missing are DCT, JPX, CCITTFax and LZW. Adding these could be a pain.

The real problem I see with this new addition is that it is extremely SLOW. Doing the decryption and decoding of embedded images is very time consuming (since it is a pure Perl implementation), and it seems that ExifTool can quite easily spend minutes doing this for a large PDF document with lots of embedded images.

- Phil

Phil Harvey · March 10, 2013, 10:54:48 AM

Update:

ExifTool 9.21 added the ability to extract metadata from embedded images (or the embedded images themselves) in PDF documents.

ExifTool 9.23 added the ability to write a separate output file for each extracted tag. With these new features, embedded JPG and JP2 images in PDF documents can be extracted like this:

exiftool -ee -embeddedimage -b -W pics/%g3.%s FILE.pdf

(where FILE.pdf is the name of the source PDF file)

This command will create a "pics" directory containing the embedded images. The image files will be named according to the embedded document number (the ExifTool family 3 group name), i.e. "Doc1.jpg", "Doc2.jp2", etc.

- Phil
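
For anyone scripting this, the command above could be wrapped in Python roughly as follows. This is only a sketch: the helper names `build_extract_command` and `extract_images` are hypothetical, the options are exactly those quoted in the post, and it assumes ExifTool 9.23 or later is installed and on the PATH.

```python
import shutil
import subprocess

def build_extract_command(pdf_path: str, out_dir: str = "pics") -> list:
    """Build the argument list for the extraction command quoted in the post."""
    return [
        "exiftool",
        "-ee",                       # process embedded documents
        "-embeddedimage",            # the tag holding each embedded image
        "-b",                        # output the tag value as binary data
        "-W", f"{out_dir}/%g3.%s",   # one output file per tag: <family 3 group>.<extension>
        pdf_path,
    ]

def extract_images(pdf_path: str, out_dir: str = "pics") -> subprocess.CompletedProcess:
    """Run the command, failing early if exiftool is not available."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool not found on PATH")
    return subprocess.run(
        build_extract_command(pdf_path, out_dir),
        capture_output=True,
        text=True,
        check=False,  # leave it to the caller to inspect returncode and stderr
    )

if __name__ == "__main__":
    print(" ".join(build_extract_command("FILE.pdf")))
```

Passing the arguments as a list avoids shell quoting issues with the `%g3.%s` format string. As noted earlier in the thread, decoding embedded images can be slow for large PDFs, so a caller may want to allow for long run times.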
