Reading UTF characters from different image sources

Started by webspider, July 23, 2011, 01:49:28 PM


webspider

Hi,

In my PHP script I use this command:

$meta_data = '/usr/bin/exiftool -exif:all -iptc:all -a -g -j -struct -c "%s" -fast -charset iptc=UTF-8 '.$image_path;

and then JSON-decode the data for processing.

However, I am having trouble reading UTF-8 characters properly. This works perfectly for some images but not for others.
I have also noticed a difference between images that come from a PC and images that come from a Mac.

Is there any way to detect which source an image comes from and adjust the script accordingly? Or to convert the IPTC data in images from both PC and Mac to UTF-8 and then extract it?

Your help is much appreciated.

Thanks,
Sourav

Phil Harvey

Assuming that IPTC is stored as UTF-8 will often be wrong.  I would think the ExifTool default would give more reliable results (it assumes Latin1 unless UTF-8 is specified by the IPTC CodedCharacterSet).  In the past, you might have been able to tell Windows and Mac images apart by looking at the byte order (Windows always uses little-endian), but Macs now use Intel CPUs, so this difference may be disappearing.  You could also look at the Software tag to see if it gives any hint about the platform.
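
As a rough sketch of the suggestion above: drop the -charset override so the ExifTool default applies, then read CodedCharacterSet and Software from the decoded JSON. The function names are hypothetical, and the 'IPTC' and 'EXIF' group keys assume the -g -j output layout of the original command.

```php
<?php
// Hypothetical helper: same options as the original command, minus
// "-charset iptc=UTF-8", so ExifTool uses its default IPTC handling
// (Latin1 unless CodedCharacterSet specifies UTF-8).
function build_exiftool_command(string $imagePath): string
{
    // escapeshellarg() keeps paths with spaces or quotes from breaking the command.
    return '/usr/bin/exiftool -exif:all -iptc:all -a -g -j -struct -c "%s" -fast '
        . escapeshellarg($imagePath);
}

// Hypothetical helper: extracts the two hints mentioned above from the
// array produced by json_decode($output, true). Missing tags become null.
function encoding_hints(array $decoded): array
{
    $first = $decoded[0] ?? [];
    return [
        'coded_character_set' => $first['IPTC']['CodedCharacterSet'] ?? null,
        'software'            => $first['EXIF']['Software'] ?? null,
    ];
}
```

Typical use would be something like $hints = encoding_hints(json_decode(shell_exec(build_exiftool_command($image_path)), true)); a null coded_character_set means the image does not declare its IPTC encoding.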

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

webspider

Hi Phil,

Thanks for your answer. However, is there any way in ExifTool to convert all IPTC data in an image to UTF-8 internally before extraction, regardless of the actual character encoding and regardless of whether the image came from a PC or a Mac?

Thanks,
Sourav

Phil Harvey

If you have a heuristic that you can apply to decide what encoding to use, then one option would be to assume UTF-8 for IPTC as you were doing, which effectively disables conversion, then do the conversion yourself.

But there is no 100% reliable method to determine the encoding of IPTC.  This is one reason why this information type lost favour.
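
One possible heuristic, shown only as a sketch: keep a string that is already valid UTF-8, and otherwise treat it as Latin1. This assumes the PHP mbstring extension, assumes the raw IPTC bytes reach PHP unaltered, and uses a hypothetical function name. It will misclassify text in other legacy encodings, so the fallback is a guess, not a detection.

```php
<?php
// Hypothetical helper: "valid UTF-8, else assume a single-byte fallback".
// $raw is the undecoded IPTC value; $fallback is an assumed legacy encoding.
function iptc_to_utf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    // Pure ASCII and well-formed UTF-8 pass through unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Anything else is reinterpreted as the fallback encoding.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}
```

For example, the Latin1 bytes "caf\xE9" come back as the UTF-8 bytes "caf\xC3\xA9", while a string that is already UTF-8 is returned as-is.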

- Phil

webspider

Hi Phil,

I have one more question. I am trying to change the 'ExifUnicodeByteOrder' setting with this command:

exec("/usr/bin/exiftool -ExifUnicodeByteOrder='MM' ".$imagePath);

and then extract the image data, so that all images are in a common format before extraction. But this isn't changing the byte order.
Could you please tell me if I have missed something?

Many thanks,
Sourav


Phil Harvey

The ExifUnicodeByteOrder option specifies the byte order when writing, not when reading.  When reading, ExifTool always uses a heuristic to determine the actual byte order used.

I don't understand how you expected to influence the byte order of the images.  If ExifTool reads the Unicode in the wrong byte order, all you would get is garbage.
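
To illustrate the write-only nature of the option, here is a hedged sketch of a command that sets it while writing a tag. It assumes an ExifTool version recent enough to have the -api option, and it assumes EXIF:UserComment as the Unicode text being written; the function name is hypothetical.

```php
<?php
// Hypothetical helper: builds a write command that passes ExifUnicodeByteOrder
// as an API option (not as a tag) while writing EXIF:UserComment.
function build_usercomment_write_command(
    string $imagePath,
    string $comment,
    string $byteOrder = 'MM'
): string {
    if (!in_array($byteOrder, ['II', 'MM'], true)) {
        throw new InvalidArgumentException('Byte order must be II or MM');
    }
    // Note: escapeshellarg() may drop non-ASCII bytes unless a UTF-8 locale
    // is active (see setlocale()), so check the locale before relying on this.
    return '/usr/bin/exiftool -api ExifUnicodeByteOrder=' . $byteOrder
        . ' ' . escapeshellarg('-EXIF:UserComment=' . $comment)
        . ' ' . escapeshellarg($imagePath);
}
```

The option only affects what is written by that command; it does nothing when the same image is later read.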

- Phil