Title: Reading UTF characters from different image sources
Post by: webspider on July 23, 2011, 01:49:28 PM

Hi,

In my PHP script I build this command:

$meta_data = '/usr/bin/exiftool -exif:all -iptc:all -a -g -j -struct -c "%s" -fast -charset iptc=UTF-8 '.escapeshellarg($image_path);

and then JSON-decode the output for processing.

However, I am having trouble reading UTF-8 characters correctly. It works perfectly for some images but not for others.
I have also noticed a difference between images that come from a PC and images that come from a Mac.

Is there any way to detect which source an image comes from and adjust the script accordingly? Or can I convert the IPTC data in images from both
PC and Mac to UTF-8 and then extract it?

Your help is much appreciated.

Thanks,
Sourav

Title: Re: Reading UTF characters from different image sources
Post by: Phil Harvey on July 23, 2011, 08:30:14 PM

Assuming that IPTC is stored as UTF-8 will often be wrong. I would think the ExifTool default would give more reliable results (it assumes Latin1 unless UTF-8 is specified by the IPTC CodedCharacterSet). In the past, you might have been able to tell Windows and Mac images apart by looking at the byte order (Windows always uses little-endian), but Macs now use Intel CPUs, so this difference may be disappearing. You could also look at the Software tag to see if it gives any hint about the platform.
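
The hints mentioned here can be requested in a single call. The following is a minimal sketch, assuming exiftool lives at /usr/bin/exiftool as in the original command. The function names buildHintCommand and readEncodingHints are hypothetical, and how to interpret the Software string is left open, since no tag identifies the platform reliably.

```php
<?php
// Build a command that asks ExifTool only for tags that hint at the
// encoding or origin: IPTC CodedCharacterSet, the EXIF byte order,
// and the Software tag. (Function names here are illustrative.)
function buildHintCommand(string $imagePath): string
{
    return '/usr/bin/exiftool -j -IPTC:CodedCharacterSet -ExifByteOrder -Software '
        . escapeshellarg($imagePath);
}

// Run the command and return the decoded tags, or an empty array on failure.
function readEncodingHints(string $imagePath): array
{
    $json = shell_exec(buildHintCommand($imagePath));
    $decoded = is_string($json) ? json_decode($json, true) : null;
    // With -j, ExifTool prints a JSON array holding one object per file.
    return (is_array($decoded) && isset($decoded[0]) && is_array($decoded[0]))
        ? $decoded[0]
        : [];
}
```

If the returned array has no CodedCharacterSet key, the file did not declare an encoding, which is the case where the ExifTool default falls back to Latin1.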

- Phil

Title: Re: Reading UTF characters from different image sources
Post by: webspider on July 25, 2011, 12:33:09 AM

Hi Phil,

Thanks for your answer. However, is there any way in ExifTool to convert all IPTC data in an image to UTF-8 internally before extraction, regardless of the actual character encoding and regardless of whether it came from a PC or a Mac?

Thanks,
Sourav

Title: Re: Reading UTF characters from different image sources
Post by: Phil Harvey on July 25, 2011, 07:41:08 AM

If you have a heuristic that you can apply to decide what encoding to use, then one option would be to assume UTF-8 for IPTC as you were doing, which effectively disables conversion, then do the conversion yourself.

But there is no 100% reliable method to determine the encoding of IPTC. This is one reason why this information type lost favour.
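
One simple heuristic of this kind is sketched below: keep the bytes if they already form valid UTF-8, and otherwise treat them as a fallback encoding. It assumes PHP's mbstring extension is available. The function name iptcToUtf8 and the choice of ISO-8859-1 (Latin1) as the fallback are illustrative assumptions, not something ExifTool prescribes.

```php
<?php
// Return $raw as UTF-8. Bytes that already form valid UTF-8 are kept;
// anything else is treated as $fallback (an assumption, not a detected fact).
function iptcToUtf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// "café" stored as Latin1: the lone 0xE9 byte is invalid UTF-8, so it is converted.
echo iptcToUtf8("caf\xE9"), "\n";
// "café" already stored as UTF-8 passes through untouched.
echo iptcToUtf8("caf\xC3\xA9"), "\n";
```

This can still guess wrong: a legacy-encoded string may happen to be valid UTF-8, and files written in another legacy encoding (Mac Roman, for example) would need a different fallback.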

- Phil

Title: Re: Reading UTF characters from different image sources
Post by: webspider on July 26, 2011, 06:29:35 AM

Hi Phil,

I have one more question. I am trying to change the 'ExifUnicodeByteOrder' setting with this command:

exec("/usr/bin/exiftool -ExifUnicodeByteOrder='MM' ".$imagePath);

and then extract the image data, so that all images are in a common format before extraction. But that isn't changing the byte order.
Could you please tell me if I have missed something?

Many thanks,
Sourav

Title: Re: Reading UTF characters from different image sources
Post by: Phil Harvey on July 26, 2011, 07:35:07 AM

The ExifUnicodeByteOrder option specifies the byte order when writing, not when reading. When reading, ExifTool always uses a heuristic to determine the actual byte order used.

I don't understand how you expected to influence the byte order of the images. If ExifTool reads the Unicode in the wrong byte order, all you would get is garbage.

- Phil

ExifTool Forum
