ExifTool - problems to read Unicode characters

herb · December 11, 2010, 09:44:29 AM

Hello Phil,

I am working with ExifTool version 8.40 on a Windows 2000 system and I have some problems reading Unicode characters.

For these tests I used the EXIF tag "usercomment" of a JPEG image, which should store characters as Unicode when their value in the corresponding character map is higher than 127.

To write such characters I used this command (in a Windows DOS box):
exiftool -charset x -usercomment=XX t1.jpg

and I see the following results, where XX is always the character at hex'C4 in the character map of the given charset:
x=cp1251 (Cyrillic) XX=upper-case Cyrillic letter De ==> character_in_file=U+0414
x=cp1252 (Latin1) XX=upper-case German umlaut Ä ==> character_in_file=U+00C4
x=cp1253 (Greek) XX=upper-case Greek letter Delta ==> character_in_file=U+0394

This is absolutely correct.
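
The mapping listed above can be checked independently of ExifTool. The following is a minimal Python sketch that relies on Python's own cp1251/cp1252/cp1253 codecs (an assumption: ExifTool's internal tables are taken to agree with them for this byte); the helper name `decode_byte` is made up for illustration.

```python
import unicodedata


def decode_byte(value, charset):
    """Return the Unicode code point that one byte maps to in a code page."""
    return ord(bytes([value]).decode(charset))


# Byte 0xC4 is the character typed in each of the three write tests.
for charset in ("cp1251", "cp1252", "cp1253"):
    cp = decode_byte(0xC4, charset)
    print(f"{charset}: U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The same byte value yields three different code points, which is why the charset has to be given when writing.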

Next I set the content of the usercomment tag to hex'1404C4009403
(which is: Cyrillic De, German umlaut Ä and Greek Delta).

Reading the tag with the command
exiftool -charset x -usercomment t1.jpg >out.txt

I get the following content in the result-file
x=cp1251 ==> file_content_for_usercomment: hex'C4C43F
x=cp1252 ==> file_content_for_usercomment: hex'3FC43F
x=cp1253 ==> file_content_for_usercomment: hex'3FC4C4

where hex'C4 is the character that was written (see above)
and hex'3F is a question mark.

x=utf8 ==> file_content_for_usercomment: hex'D094C384CE94
(which is the correct hex sequence for these characters in UTF-8)
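
As a cross-check of the byte values quoted in this post, here is a short Python sketch. It assumes the prepared bytes hex'1404C4009403 are UTF-16 in little-endian order, which is inferred from the byte order of the values and not stated explicitly in the post.

```python
# Prepared tag content from the post, assumed to be UTF-16 little-endian.
raw = bytes.fromhex("1404C4009403")
text = raw.decode("utf-16-le")

# Code points of the three stored characters.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The same characters re-encoded as UTF-8, as hex.
utf8_hex = text.encode("utf-8").hex().upper()

print(code_points)
print(utf8_hex)
```

Under that assumption the UTF-8 hex string equals the sequence reported for x=utf8.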

Only for x=utf8 do I get a proper response.
To me, the output for x=cp1251/cp1252/cp1253 is not correct.

I would have expected the following output:
x=cp1251 ==> file_content_for_usercomment: hex'C43F3F
x=cp1252 ==> file_content_for_usercomment: hex'3FC43F (same as above)
x=cp1253 ==> file_content_for_usercomment: hex'3F3FC4

and in addition the warning:
"not all characters could be encoded to <charset>"
because, for example, the German umlaut Ä cannot be encoded in the Cyrillic or the Greek charset.
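
The expected behaviour described here (every unencodable character becomes a question mark, plus a warning) can be sketched in Python as follows. This illustrates the expectation only, not ExifTool's code; the function name `encode_with_warning` is hypothetical and the warning text is copied from the post.

```python
def encode_with_warning(text, charset):
    """Encode text; replace unencodable characters with '?' and report it."""
    try:
        return text.encode(charset), None
    except UnicodeEncodeError:
        warning = f"not all characters could be encoded to {charset}"
        return text.encode(charset, errors="replace"), warning


# Cyrillic De, German umlaut A, Greek Delta.
sample = "\u0414\u00C4\u0394"
for charset in ("cp1251", "cp1252", "cp1253"):
    data, warning = encode_with_warning(sample, charset)
    print(charset, data.hex().upper(), warning)
```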

Additional information:
When the usercomment tag in file t1.jpg contains the Chinese character shan (U+5C71),
reading the usercomment tag gives a proper result:
x=utf8 ==> file_content_for_usercomment: hex'E5B1B1

For x=cp... I get the warning that I had expected above.

Yes, I have read the help-file and also the FAQ.

Thanks for the great ExifTool, and thanks in advance for your comments and help.
Herb

Phil Harvey · December 11, 2010, 12:43:10 PM

Hi Herb,

Everything seems to be working as designed, although we can debate about whether or not the design should be changed.

1) The warning is issued when there are characters which cannot be converted, but with your command you need to extract the Warning tag to see it. When you extract a specific tag, the Warning tag is printed automatically only if that tag could not be extracted.

2) When a Unicode value can't be converted into a 1-byte character set, I simply pass the byte straight through if the codepoint is less than U+0100, or substitute a "?" (U+003f) otherwise. I believe this is consistent with what you are seeing. Perhaps it would be better to always convert to "?", but I may have had some reason (which I can't recall right now) for doing it this way.
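
The rule in point 2 can be sketched in Python to show that it reproduces the values reported in the first post. This is an illustration under stated assumptions, not ExifTool's Perl code: Python's codecs stand in for the translation tables, and the function name `encode_passthrough` is hypothetical.

```python
def encode_passthrough(text, charset):
    """Encode text to a 1-byte charset using the fallback from point 2.

    Characters the charset cannot represent are passed through as a raw
    byte when the code point is below U+0100, else replaced by '?'.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            cp = ord(ch)
            out.append(cp if cp < 0x100 else 0x3F)
    return bytes(out)


# Cyrillic De, German umlaut A, Greek Delta.
sample = "\u0414\u00C4\u0394"
for charset in ("cp1251", "cp1252", "cp1253"):
    print(charset, encode_passthrough(sample, charset).hex().upper())
```

For cp1251 and cp1253 the passed-through byte hex'C4 collides with a different character of the target charset (De and Delta respectively), which is the effect reported above.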

- Phil

herb · December 12, 2010, 04:33:37 AM

Hello Phil,

thank you very much for your fast reply.

ad 2): Yes, my observation is identical to what you have described.
I thought that I had found an error. I apologize.
Now it is clear to me that this is your design, although I do not see an advantage (at the moment) in passing the byte straight through.

ad 1): Because of your comment I repeated my tests, but instead of asking for the tag -usercomment only, I always asked for all tags (-all:all).
Now I have seen that the warning is printed.
BUT it is only printed when a question mark is inserted.

I think this warning should also be printed when the first byte of a non-convertible character is passed through.
(Otherwise the user thinks that everything is 100% OK.)
Would this be possible?

Thanks in advance
Herb

Phil Harvey · December 12, 2010, 09:24:15 AM

Hi Herb,

I looked into my implementation to refresh my memory. To improve speed and reduce file size and memory use I omit entries from my character translation tables which have the same value after translation. This is the reason I am currently passing them straight through.

However, I can add an extra step to test to see if the character exists in the table with a different codepoint. I don't think this will impact the performance much, and it would solve your specific problem, so it makes sense to add this. I will think about implementing this in the next release.
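
The extra step described above might look like the following Python sketch. It is an assumption-laden illustration, not the actual change made in ExifTool: the sparse table is built from Python's codecs, bytes that are undefined in a charset are simply treated like identity entries, and the names `sparse_table` and `encode_checked` are hypothetical.

```python
def sparse_table(charset):
    """Map byte -> code point, omitting entries where both are equal."""
    table = {}
    for b in range(0x80, 0x100):
        try:
            cp = ord(bytes([b]).decode(charset))
        except UnicodeDecodeError:
            continue  # byte undefined in this charset (simplification)
        if cp != b:
            table[b] = cp
    return table


def encode_checked(text, charset):
    """Return (bytes, lossy); pass a byte through only if it is unclaimed."""
    table = sparse_table(charset)
    inverse = {cp: b for b, cp in table.items()}
    out = bytearray()
    lossy = False
    for ch in text:
        cp = ord(ch)
        if cp in inverse:
            out.append(inverse[cp])   # translated via the table
        elif cp < 0x100 and cp not in table:
            out.append(cp)            # identity entry omitted from the table
        else:
            out.append(0x3F)          # byte belongs to another character
            lossy = True
    return bytes(out), lossy


# Cyrillic De, German umlaut A, Greek Delta.
sample = "\u0414\u00C4\u0394"
for charset in ("cp1251", "cp1252", "cp1253"):
    data, lossy = encode_checked(sample, charset)
    print(charset, data.hex().upper(), "warning" if lossy else "ok")
```

With this check, a code point below U+0100 is passed through only when the table has no differing entry for that byte value, so the result matches the output Herb expected and the lossy flag can drive the warning.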

Thanks for the report.

- Phil

herb · December 22, 2010, 11:23:46 AM

Hello Phil,

thank you very much for the correction/enhancement in version 8.43.
You are really great.

So I wish you a Merry Christmas and a Happy New Year.

Herb

Phil Harvey · December 22, 2010, 01:18:16 PM

I'm glad you mentioned this because somehow I incorrectly placed the bullet "Improved handling of character encoding errors" under version 8.42, but as you correctly point out the change was actually made in 8.43. I have fixed this in the online revision history.

...and a Merry Christmas to you too!

- Phil
