exiftool copy MS Document Text

Started by etmmger, January 25, 2012, 04:34:12 PM

etmmger

Hello,

I am resampling TIFF images with ImageMagick convert. However, the resampled files lose the "MS Document Text" tag/value.
I tried to fix this with exiftool, using this command:
exiftool -tagsFromFile test.tif -msdocumenttext resampled.tif
where test.tif is my original (400dpi) TIFF file with the correct msdocumenttext tag, and resampled.tif is the resampled file (200dpi), which is missing the msdocumenttext tag.

I get the following output from exiftool:

Warning: Sorry, msdocumenttext is not writable - test.tif
Warning: No writable tags set from test.tif
    0 image files updated
    1 image files unchanged


How can I copy the msdocumenttext tag from test.tif to resampled.tif?

Kind regards, Marcel

Phil Harvey

Hi Marcel,

Currently this tag is not writable because I don't know the format of the text.  Just copying the information, however, is simpler since it should already be in the correct format.  I only have one sample with this information, and the -v3 option shows this:

  | 17) MSDocumentText = ..1. ..0. ..
  |     - Tag 0x932f (17 bytes, undef[17]):
  |         01c2: 01 00 0b 00 00 00 31 0c 20 0a 0d 30 0c 20 0a 0d [......1. ..0. ..]
  |         01d2: 00                                              [.]


Just for my interest, what does the -v3 dump show for your file?

Until such time as I add the ability to write this tag, the solution is to create a writable user-defined tag to override the existing MSDocumentText tag:

%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Exif::Main' => {
        0x932f => {
            Name => 'MSDocumentText',
            Writable => 'undef',
            WriteGroup => 'IFD0',
        },
    },
);
1; #end


With this config file properly activated, you should be able to copy the tag as you wanted.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

etmmger

Hi Phil,

thanks for your prompt answer. Where do I put the config file and how do I activate it?

Marcel
-------
The output of exiftool -v3 -msdocumenttext for a sample document is:

17) MSDocumentText = .%.IProject Management Institute. .ji P4 Project Management Pr[snip]
  |     - Tag 0x932f (3883 bytes, undef[3883]):
  |        126a6: 01 00 25 0f 00 00 49 50 72 6f 6a 65 63 74 20 4d [..%...IProject M]
  |        126b6: 61 6e 61 67 65 6d 65 6e 74 20 49 6e 73 74 69 74 [anagement Instit]
  |        126c6: 75 74 65 0c 20 0a 6a 69 20 50 34 20 50 72 6f 6a [ute. .ji P4 Proj]
  |        126d6: 65 63 74 20 4d 61 6e 61 67 65 6d 65 6e 74 20 50 [ect Management P]
  |        126e6: 72 6f 66 65 73 73 69 6f 6e 61 6c 0c 20 0a 43 6f [rofessional. .Co]
  |         [snip 3803 bytes]

Phil Harvey

Hi Marcel,

The sample config file linked from my post above (click on "user-defined") has all the details.
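
For readers who want a quick starting point, here is a minimal sketch of one way to activate such a config file when scripting the copy from Python. It relies on ExifTool's -config option, which must be the first option on the command line. The file name msdoctext.config and the helper build_copy_command are hypothetical, chosen only for illustration. The other route is to name the file .ExifTool_config and place it in your home directory or the exiftool application directory, in which case no -config option is needed.

```python
from typing import List


def build_copy_command(config_path: str, source: str, target: str) -> List[str]:
    """Build the exiftool argument list for copying MSDocumentText.

    -config has to come first on the command line, ahead of every
    other option, otherwise ExifTool will not load the config file.
    """
    return [
        "exiftool",
        "-config", config_path,     # hypothetical config holding the user-defined tag
        "-tagsFromFile", source,    # file that still carries the tag
        "-msdocumenttext",          # copy only this tag
        target,                     # resampled file to be updated
    ]


# File names taken from the thread; the config name is an assumption.
cmd = build_copy_command("msdoctext.config", "test.tif", "resampled.tif")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) once exiftool is on the PATH and the config file exists.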

- Phil

etmmger

Hi Phil,

thanks. It works!
About the MS Document tags, I found the following on http://social.msdn.microsoft.com/Forums/en-US/os_standocs/thread/03086d55-294a-49d5-967a-5303d34c40f8/

37679 - appears on every page, looks like the text version of the document contents. The contents are
0x01 0x00, followed by a length (4 byte aka long) which is 6 bytes less than the actual length of this
field (i.e. it is the remaining length), followed by the UTF8 text version. Each phrase is delimited by a
space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00.

37680 - only appears to occur on the first page, always appears to be length 4096, always starts with
0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1, then a string of zeros, and then varies. Perhaps some
kind of metadata dictionary? It is located at the end of the file, and there are 16-bit wide characters
that look like "Root Entry", "CONTENTS" (sometimes more than once, even if only one page), "prop2"
(sometimes more than once), "prop3" (sometimes more than once), "DICT", "Summary Information",
"Owner" and some names. There might be some random stuff / fill in there too. Also appears to be a
consistent bit of stuff "AuvsxjatP0udlw1Aaq5eubr5h" (this might not be ASCII though - there is a
0x05 0x00 always on the front of it).

37681 - appears on every page, always starts with 0x02 0x00 (+ 0x00, 0x00?), then varies. Possibly
the thumbnail image?


Regards, Marcel

PH Edit: Wrapped code block to make it readable.

Phil Harvey

Great.  Thanks for the reference.

- Phil