UserComment and surrogates

herb · January 07, 2016, 04:56:03 AM

Hello Phil,

First of all: I wish you a Happy and Prosperous New Year 2016.

While testing my application with Unicode characters above U+FFFF, I saw strange behaviour from ExifTool 10.08 on my Windows XP system.

I tested with characters from the "Linear B" Unicode block, because Code2001 is a font that can display them.
I will describe the problem using the character U+10010, which is called Linear B Syllable B044 KE.
U+10010 is 65552 in decimal; in UTF-8 it is F0 90 80 90 and in UTF-16 it is D800 DC10.
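
The values above can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs; it is not part of ExifTool, and the variable names are made up for the example.

```python
# Encode U+10010 (Linear B Syllable B044 KE) and show the bytes as hex.
ch = chr(0x10010)

utf8_hex = ch.encode("utf-8").hex().upper()
utf16be_hex = ch.encode("utf-16-be").hex().upper()

print(ord(ch))       # decimal value of the code point
print(utf8_hex)      # UTF-8 byte sequence
print(utf16be_hex)   # UTF-16 big-endian code units (a surrogate pair)
```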

I send commands to ExifTool's stdin using the -stay_open mechanism and receive all responses via a pipe to my stdin.
All strings are encoded in UTF-8.

When writing this Linear B syllable to an XMP tag and reading it back, everything is OK.

But when writing a string consisting of only this one syllable to the EXIF tag UserComment, I saw the following:
- the character is stored properly as a UTF-16 byte sequence.
- when reading this tag, I receive via the pipe the byte sequence C398E1839C for this one Unicode character
(and my application displays 2 characters on screen).

When writing a 2-character string, e.g. the ASCII character "a" followed by the Linear B syllable, everything is fine.

Can you give me a hint as to what is going wrong when writing only the single Syllable B044?

Best regards
Herb

Phil Harvey · January 07, 2016, 07:47:04 AM

Hi Herb,

The problem is that according to the EXIF specification, UserComment is encoded as UCS-2, not UTF-16. So surrogate pairs are not supported, and characters only up to U+FFFF may be used.

I could perhaps bend the interpretation of the spec if it is possible to distinguish UTF-16 from UCS-2. I'll look into this when I get a chance.

- Phil

herb · January 07, 2016, 09:01:23 AM

Hello Phil,

thanks for your quick reply and thanks for your explanations.

I do not know the differences between UCS-2 and UTF-16.
But please allow the following question:
- why does ExifTool store the syllable, which my application sends UTF-8-encoded, as correct UTF-16?

Best regards
Herb

Phil Harvey · January 07, 2016, 09:26:30 AM

Hi Herb,

UCS-2 is just UTF-16 without surrogate pairs.
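
As a rough illustration of that difference, here is a minimal Python sketch of how a code point maps to UTF-16 code units. The function name `to_utf16_units` is made up for this example, and it does not reject code points that are themselves in the surrogate range.

```python
def to_utf16_units(code_point):
    """Return the list of UTF-16 code units for one Unicode code point."""
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit, the same in UCS-2 and UTF-16.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into a surrogate pair (UTF-16 only).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(0x5470)])    # one unit
print([hex(u) for u in to_utf16_units(0x10010)])   # two units
```

A character above U+FFFF therefore always needs two 16-bit units, which is exactly what UCS-2 cannot express.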

I looked into this, and the answer wasn't what I thought...

ExifTool already reads/writes EXIF Unicode as UTF-16.

The problem is a result of the ill-specified byte ordering for EXIF Unicode strings. ExifTool tries to guess the byte ordering, and gets it wrong in the case where the string is a single Linear B syllable. A workaround is to start your string with a BOM (Byte Order Mark), so in this case your source UTF-8 will be the byte sequence EF BB BF F0 90 80 90.

Note that this problem shouldn't occur if the text contains longer strings, since then ExifTool should be able to correctly guess the byte ordering.
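
The symptom reported earlier in the thread can be reproduced outside ExifTool. The sketch below assumes the stored text bytes are D8 00 DC 10 (as reported above) and uses only Python's standard codecs; it illustrates the byte-order ambiguity and is not ExifTool's actual code.

```python
stored = bytes.fromhex("D800DC10")  # text bytes as reported in this thread

# Intended reading: big-endian UTF-16 gives the single Linear B syllable.
as_big_endian = stored.decode("utf-16-be")

# Wrong guess: little-endian gives the units 0x00D8 and 0x10DC, two ordinary characters.
as_little_endian = stored.decode("utf-16-le")

print([hex(ord(c)) for c in as_big_endian])
print([hex(ord(c)) for c in as_little_endian])

# UTF-8 form of the misread text, for comparison with the bytes received via the pipe.
misread_utf8 = as_little_endian.encode("utf-8").hex().upper()
print(misread_utf8)

# Source UTF-8 for the workaround: U+FEFF (BOM) followed by the syllable.
with_bom = ("\ufeff" + chr(0x10010)).encode("utf-8")
print(with_bom.hex().upper())
```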

- Phil

herb · January 07, 2016, 10:41:58 AM

Hello Phil,

thanks again for your investigations.

What ExifTool really does when the string starts with a BOM is unclear to me.

But I still have a question in order to understand Exiftool a little bit better.
When I write the Chinese character 0x5470, ExifTool will store it as 0x5470.
If another application had stored this character as 0x7054, ExifTool must guess, and it reads it as 0x7054, which is a different existing Chinese character.
This is all clear to me, and we discussed it a long time ago.
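
A small Python sketch of this ambiguity, using only the standard library: the same two bytes decode to a valid character in either byte order, so the bytes alone cannot settle the question.

```python
import unicodedata

stored = bytes.fromhex("5470")

big = stored.decode("utf-16-be")      # read as 0x5470
little = stored.decode("utf-16-le")   # read as 0x7054

# Both readings are assigned characters, so neither can be ruled out.
for ch in (big, little):
    print(hex(ord(ch)), unicodedata.name(ch))
```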

If the string contains an ASCII character, ExifTool can determine the correct byte order. This is also clear.

Unclear to me is the following:
- when Exiftool stores the D800DC10 as this sequence
- why does it not guess the proper sequence in case of reading?

Sorry for my annoying questions.

Best regards
Herb

Phil Harvey · January 07, 2016, 10:58:42 AM

Quote from: herb on January 07, 2016, 10:41:58 AM
Unclear to me is the following:
- when Exiftool stores the D800DC10 as this sequence

I don't understand what you mean here.

Quote
- why does it not guess the proper sequence in case of reading?

I could go into the details of the heuristic that ExifTool uses to guess the byte order, but I don't think it would help. Basically, ExifTool looks at the values in the high and low bytes and tries to figure out which way looks more "normal". The MWG specification has something to say about this:

The following heuristics SHOULD be applied when the big or little endian nature of UTF-16 text needs to be determined. These apply to a single item at a time, not uniformly to all UTF-16 text.

1. If a leading U+FEFF BOM is present, that indicates the byte order.
2. If only one of the byte orders is valid UTF-16, the valid form is the byte order. This MUST take into account surrogate pairs, and it MAY take into account specific invalid Unicode characters.
3. Count the number of unique values in the first and second bytes of the 16-bit storage units. The correct byte order is the one with the fewer unique values in the high order part.
4. Otherwise use the overall TIFF stream byte order.
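
For illustration only, the four MWG heuristics quoted above could be sketched in Python roughly as follows. This is one reading of the quoted text, not ExifTool's actual algorithm; the function names and the "big"/"little" strings are made up for the example, and validity is checked simply by letting Python's strict UTF-16 decoder reject unpaired surrogates.

```python
def is_valid_utf16(data, byte_order):
    """True if data decodes as UTF-16 in the given order ('big' or 'little')."""
    codec = "utf-16-be" if byte_order == "big" else "utf-16-le"
    try:
        data.decode(codec)  # strict decoding fails on unpaired surrogates
        return True
    except UnicodeDecodeError:
        return False


def guess_utf16_byte_order(data, tiff_byte_order):
    """Guess 'big' or 'little' for UTF-16 bytes, following the quoted heuristics."""
    # A leading U+FEFF BOM indicates the byte order.
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"

    # If only one byte order is valid UTF-16, that is the byte order.
    big_ok = is_valid_utf16(data, "big")
    little_ok = is_valid_utf16(data, "little")
    if big_ok != little_ok:
        return "big" if big_ok else "little"

    # The order with fewer unique values in the high-order byte wins.
    first_bytes = set(data[0::2])    # high-order bytes if big-endian
    second_bytes = set(data[1::2])   # high-order bytes if little-endian
    if len(first_bytes) != len(second_bytes):
        return "big" if len(first_bytes) < len(second_bytes) else "little"

    # Otherwise use the overall TIFF stream byte order.
    return tiff_byte_order


print(guess_utf16_byte_order("Hello".encode("utf-16-le"), "big"))
print(guess_utf16_byte_order(bytes.fromhex("D800DC10"), "little"))
print(guess_utf16_byte_order(bytes.fromhex("FEFFD800DC10"), "little"))
```

In this sketch the bytes D8 00 DC 10 are valid in both orders and have two unique values in each byte position, so the result is simply whatever TIFF byte order is passed in; only the BOM-prefixed form is decided by the data itself.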

I should maybe look into implementing the 2nd point. The 3rd point is part of the ExifTool heuristic, and was adopted by the MWG from the ExifTool algorithm.

- Phil

Edit: I checked, and unfortunately the byte sequence in question is valid UTF-16 in the wrong byte order. So checking for valid UTF-16 wouldn't have helped here. I think the only reliable solution is to add the BOM. ExifTool strips off the BOM when reading, but I don't know how other software deals with this.

Edit2: Added link to MWG spec. page.

herb · January 08, 2016, 05:12:40 AM

Hello Phil,

thanks for all your investigations and clarifications.

Maybe I will do some tests with BOM in future.
By the way: writing FEFF D800 DC10 into the Exif-field with a hex-editor helped and the Syllable was read properly.
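
The effect of the BOM can also be seen with Python's generic "utf-16" codec, which uses a leading BOM to choose the byte order and drops it from the result. This is only a sketch of the decoding step on the bytes mentioned above, not of how ExifTool reads the tag.

```python
raw = bytes.fromhex("FEFFD800DC10")  # BOM followed by the surrogate pair

# The BOM FE FF selects big-endian and is removed from the decoded text.
text = raw.decode("utf-16")

print(len(text), hex(ord(text)))
```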

Thanks again and
Best regards
Herb
