Using exiftool to fix MODI TIFFs

Started by Nicd, January 14, 2012, 02:24:37 PM


Nicd

Hi, I just recently discovered I could use this tool to extract JPEGs out of TIFFs created by Microsoft Office Document Imaging. I wrote a post about it on Google+ at https://plus.google.com/110693235602405640560/posts/hGH7UmykR3T, but I'll copy it here too for easier reading. This is just in case anyone out there is searching for it (I didn't find any info about this by searching Google or this forum).

Quote:
This spring my wife and I redirected our mail to my father-in-law because we were moving out of the country for exchange studies. I needed to set up a scanner so my father-in-law could scan any bills we might receive and send them to us via email. Turns out the scanner we had (CanoScan N650U if I recall correctly) was ancient and the scanning programs (by Canon) wouldn't install. The only working scanning program was MODI – Microsoft Office Document Imaging. It's a small utility that shipped with MS Office 2003 (and some earlier versions) and has since been removed.

I don't know what else MODI does, but it worked with the scanner (even running some OCR) and output TIFF files. But when I tried to open the files on OS X (my wife and I both run it), they wouldn't work. I read at http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging that MODI actually produces invalid TIFF files that can only be opened by MODI itself (and some limited MS products)!

I read on and found a link to http://suppressingfire.org/~burner/evil-mods-tiff/, which describes the format in more detail. Apparently the TIFF file contains some metadata and the OCR text, while the actual image is a JPEG inside the TIFF! The site suggested a data recovery tool called Foremost. Because it wasn't readily available as a package for OS X (I'm lazy), I tried another one called PhotoRec, with bad results (it only "recovered" the TIFF file itself).

After that I tried googling around for "extract jpeg" and ran into a nifty utility called ExifTool. It's originally for extracting metadata out of image files but can get other data as well. I decided to try it out of sheer curiosity, but I didn't believe it would work on such an obscure format. But it did! I found out the following:

* "exiftool -b scan.tiff > outputfile" "extracts" the TIFF itself
* "exiftool -b -JpgFromRaw > outputfile" extracts a small JPEG thumbnail
* "exiftool -b -OtherImage > outputfile" extracts the full scanned JPEG file

Thanks to this little Perl tool I'm now able to open scanned MODI TIFF files on my Mac (or *nix or Windows for that matter). As I didn't see this solution published anywhere, I decided to write it up. Thankfully, in MS Office 2010, MODI has been removed completely.
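
For a whole folder of scans, a loop along these lines should do the job. This is only a sketch: the function name extract_all_modi is made up for illustration, and it assumes every .tif/.tiff file in the folder is a MODI scan with the full image stored in the OtherImage tag. Each exiftool call is given the source file name as an argument.

```shell
# Sketch: extract the full-size JPEG from every .tif/.tiff file in a folder.
# extract_all_modi is a hypothetical helper name, not part of ExifTool.
extract_all_modi() {
    dir="${1:-.}"                       # default to the current folder
    for src in "$dir"/*.tif "$dir"/*.tiff; do
        [ -e "$src" ] || continue       # skip glob patterns that matched nothing
        # Write the embedded JPEG next to the source, e.g. scan.tiff -> scan.jpg
        exiftool -b -OtherImage "$src" > "${src%.*}.jpg"
    done
}
```

Calling "extract_all_modi ~/scans" would then leave one .jpg beside each TIFF in that folder.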

Phil Harvey

#1
Interesting, thanks for the post.  Just two comments about the commands:

Quote:
* "exiftool -b scan.tiff > outputfile" "extracts" the TIFF itself

This command will extract all metadata with newlines between the values.  It won't produce a file in any recognizable format.

Quote:
* "exiftool -b -JpgFromRaw > outputfile" extracts a small JPEG thumbnail
* "exiftool -b -OtherImage > outputfile" extracts the full scanned JPEG file

You need a source file name in these commands.  Other than that, they look OK.
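
With the source file name added, the two commands could be wrapped up roughly as follows. This is a sketch under assumptions: the function name extract_modi_jpegs and the _thumb/_full output naming are invented for illustration, and it presumes the file really carries both the JpgFromRaw and OtherImage tags.

```shell
# Sketch: pull both embedded JPEGs out of one MODI TIFF.
# extract_modi_jpegs and the output names are illustrative, not ExifTool features.
extract_modi_jpegs() {
    src="$1"                  # source file, e.g. scan.tiff
    base="${src%.*}"          # strip the extension: scan.tiff -> scan
    # -b outputs the tag value as raw binary; the source file follows the tag name
    exiftool -b -JpgFromRaw "$src" > "${base}_thumb.jpg"
    exiftool -b -OtherImage "$src" > "${base}_full.jpg"
}
```

For example, "extract_modi_jpegs scan.tiff" would write scan_thumb.jpg and scan_full.jpg.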

- Phil

Edit:  I just looked at your reference.  ExifTool recognizes the extra tags from this type of file.  You may also be able to extract the OCR text (if it exists) like this:

exiftool -msdocumenttext scan.tiff
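
To save that text to a file rather than print it, something like this may work. It is a sketch only: save_modi_text is a hypothetical helper name, the .txt output name is an assumption, and the tag will simply be empty if no OCR text was stored in the file.

```shell
# Sketch: write the OCR text from a MODI TIFF to a .txt file beside it.
# save_modi_text is a hypothetical helper name, not part of ExifTool.
save_modi_text() {
    src="$1"
    out="${src%.*}.txt"       # scan.tiff -> scan.txt
    # -b prints only the tag value, without the tag-name label
    exiftool -b -MSDocumentText "$src" > "$out"
    # If the file has no OCR text, remove the empty output and report it.
    if [ ! -s "$out" ]; then
        rm -f "$out"
        echo "no OCR text found in $src" >&2
        return 1
    fi
}
```

Running "save_modi_text scan.tiff" would produce scan.txt when the text is present, and return a non-zero status otherwise.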

Nicd

Quote from: Phil Harvey on January 14, 2012, 03:15:42 PM

Of course, thanks for the tips. I did forget the filename in the commands, I've edited the Google+ post to match this. And thanks for the great program, saved me a lot of trouble!