UTF-8 decoding

Started by cyber77, May 17, 2013, 07:28:29 AM

cyber77

I have a small question about character encoding in the Image::ExifTool API. If I understand the API documentation correctly, there is no way to have ExifTool methods return strings in Perl's internal format. They are always encoded in one of the character sets available through the Charset option, UTF-8 by default.

In my application, I let Perl do all the work of encoding to and decoding from its internal format, using Perl's IO layers and some Unicode pragmas.

To process the values I receive from an Image::ExifTool object, I currently call Encode::decode_utf8($some_exif_val). Is that the correct way, or is there a better way to get strings in Perl's internal representation?

At the moment this results in a two-way encoding / decoding process:

1. Convert the ExifTool values from UTF-8 into Perl's internal format
2. Convert from Perl's internal format into UTF-8 on all IO output operations

If ExifTool is itself doing something like Encode::encode_utf8($some_internal_val) just to hand the data to the user as UTF-8, that would already add a lot of overhead.
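
The round trip described in this post can be sketched roughly as follows. This is only an illustration using the core Encode module: the byte string is a hard-coded stand-in for a value returned by ExifTool, and the in-memory filehandle stands in for real output, so none of the variable names come from the ExifTool API.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Assumption: stand-in for a value returned by ExifTool, i.e. a byte
# string holding UTF-8 encoded text ("Zürich"; the u-umlaut is C3 BC).
my $exif_bytes = "Z\xC3\xBCrich";

# Step 1: UTF-8 bytes -> Perl character string.
my $chars = decode_utf8($exif_bytes);

# Step 2: character string -> UTF-8 bytes on output, done by the IO layer.
# An in-memory filehandle is used here so the result can be inspected.
my $out = '';
open(my $fh, '>:encoding(UTF-8)', \$out) or die "open failed: $!";
print {$fh} $chars;
close($fh);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($exif_bytes), length($chars), length($out);
```

The byte count and the character count differ because the two-byte sequence decodes to a single character, and the bytes written by the IO layer match the bytes that went in.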

Phil Harvey

From the API documentation:

ExifTool returns all values as byte strings of encoded characters. Perl wide characters are not used. See FAQ number 10 for details about the encodings. By default, most returned strings are encoded in UTF-8. For these, Encode::decode_utf8() may be used to convert to a sequence of logical Perl characters.

So, yes.  You are understanding things correctly.  Internally, ExifTool uses byte strings only, and never uses Perl wide characters.  But you should be careful to test for valid UTF-8, because the encoding of strings returned by ExifTool is not guaranteed.

- Phil
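
One possible way to follow the advice about testing for valid UTF-8 is to decode strictly and catch the failure. The sketch below uses only the core Encode module; decode_if_valid_utf8 is a hypothetical helper name, and the two byte strings are made-up stand-ins for values returned by ExifTool.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# if $bytes is not well-formed UTF-8.
sub decode_if_valid_utf8 {
    my ($bytes) = @_;
    return undef unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting replacement characters; LEAVE_SRC keeps decode()
    # from modifying its argument in place.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;   # undef if the eval died
}

# Assumption: stand-ins for values returned by ExifTool.
my $good = decode_if_valid_utf8("caf\xC3\xA9");  # well-formed UTF-8
my $bad  = decode_if_valid_utf8("caf\xE9");      # lone Latin-1 byte

print defined $good ? "good: decoded\n" : "good: not valid UTF-8\n";
print defined $bad  ? "bad: decoded\n"  : "bad: not valid UTF-8\n";
```

When the helper returns undef, the caller can decide what to do with the raw bytes, for example keep them undecoded or try another character set.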