ExifTool Forum

ExifTool => The "exiftool" Application => Topic started by: herb on January 22, 2013, 06:07:30 AM

Title: ExifByteOrder question
Post by: herb on January 22, 2013, 06:07:30 AM
Hello Phil,

I have a JPG image which does not contain any metadata.
Now I write the tag -UserComment and I also tell ExifTool explicitly to use ExifByteOrder=Big-endian (Motorola,MM).
The usercomment string contains
- ASCII-characters,
- German umlauts, which are not ASCII characters but are contained in the code page of my Windows system, and also
- real Unicode characters (Chinese characters).

This all went OK.

Now I use the application Zoner Photo Studio 15 and write the tag -ModifyDate.
Zoner adds this tag,
it also changes the ExifByteOrder to Little-endian (I do not know why),
but it does not change the byte order of the characters inside -UserComment.

I have checked this with a hex editor.

Although the byte order in tag -UserComment is now wrong (from the point of view of the ExifByteOrder content),
ExifTool displays the "correct" (from my point of view) content of this tag, as long as ASCII characters are also contained.
No warning is raised either.

Now my question:
Has ExifTool implemented a correction mechanism, or
where am I wrong?

Thanks for your help in advance
Herb
Title: Re: ExifByteOrder question
Post by: Phil Harvey on January 22, 2013, 07:29:03 AM
Hi Herb,

The EXIF specification is not clear about the byte ordering of the UserComment.  The only logical byte order in my opinion is the same as the ExifByteOrder, but not all software is logical (Microsoft is the biggest offender here).

As a result, ExifTool has to be flexible about the byte order of EXIF Unicode text.  When writing, ExifTool uses the existing EXIF byte order unless a different ExifUnicodeByteOrder is specified.  When reading, ExifTool implements a correction algorithm.  I suggested this algorithm to the MWG, and they have incorporated it into their recommendations (http://metadataworkinggroup.com/specs/):

Quote: The following heuristics SHOULD be applied when the big or little endian nature of UTF-16 text needs to be determined. These apply to a single item at a time, not uniformly to all UTF-16 text.
  • If a leading U+FEFF BOM is present, that indicates the byte order.
  • If only one of the byte orders is valid UTF-16, the valid form is the byte order. This MUST take into account surrogate pairs, and it MAY take into account specific invalid Unicode characters.
  • Count the number of unique values in the first and second bytes of the 16-bit storage units. The correct byte order is the one with the fewer unique values in the high order part.
  • Otherwise use the overall TIFF stream byte order.
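
The four quoted heuristics can be sketched in Python roughly as follows. This is only an illustration of the quoted text, not ExifTool's actual code: the function name `guess_utf16_byte_order` and its `tiff_order` parameter are made up for this example, and the validity check relies on Python's strict UTF-16 decoder rejecting unpaired surrogates.

```python
def guess_utf16_byte_order(data: bytes, tiff_order: str = "big") -> str:
    """Return 'big' or 'little' for one UTF-16 item (hypothetical helper)."""
    data = data[: len(data) & ~1]  # ignore a trailing odd byte

    # 1. A leading U+FEFF BOM indicates the byte order.
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"

    # 2. If only one byte order is valid UTF-16, use it.
    def is_valid(codec: str) -> bool:
        try:
            data.decode(codec)  # strict mode fails on unpaired surrogates
            return True
        except UnicodeDecodeError:
            return False

    big_ok, little_ok = is_valid("utf-16-be"), is_valid("utf-16-le")
    if big_ok != little_ok:
        return "big" if big_ok else "little"

    # 3. The byte position with fewer unique values is the high-order byte.
    first_unique = len(set(data[0::2]))
    second_unique = len(set(data[1::2]))
    if first_unique < second_unique:
        return "big"      # first bytes vary less, so they are the high bytes
    if second_unique < first_unique:
        return "little"

    # 4. Otherwise fall back to the overall TIFF stream byte order.
    return tiff_order
```

For example, `guess_utf16_byte_order("abc".encode("utf-16-le"))` reaches step 3 and returns `"little"`, because the second byte of every unit is the same zero byte.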

- Phil
Title: Re: ExifByteOrder question
Post by: herb on July 07, 2013, 07:06:23 AM
Hello Phil,

Sorry that I re-open this topic.
First I want to say thanks for your explanations and for your very good correction algorithm, which solved so many of my problems.
I fully agree with the current solution.

But now I have run into a situation where your algorithm cannot decide.
This happens e.g. with the Chinese character 山 ('shan' - mountain - H'5C71 - 山). The byte-swapped value H'715C is also a valid Chinese character.
I think there is no option that lets the user tell ExifTool which encoding to use.
What do you think about also allowing the extra tag ExifUnicodeByteOrder when reading?

Best regards - and hoping not to annoy you
Herb
Title: Re: ExifByteOrder question
Post by: Phil Harvey on July 07, 2013, 08:57:09 AM
Hi Herb,

As you have discovered, the detection algorithm doesn't work if the text is a single non-ASCII character.  In this case, the fallback is to use the EXIF byte ordering.
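
The ambiguity can be reproduced with a few lines of Python. This is just a demonstration of the two stored bytes from Herb's example decoded both ways; it says nothing about how ExifTool is implemented.

```python
# The two bytes of the character as stored in big-endian order.
raw = bytes([0x5C, 0x71])

as_big = raw.decode("utf-16-be")        # read as U+5C71
as_little = raw.decode("utf-16-le")     # read as U+715C

# Both decodes succeed, so validity alone cannot pick a byte order.
print(f"big-endian:    U+{ord(as_big):04X} {as_big}")
print(f"little-endian: U+{ord(as_little):04X} {as_little}")
```

Both interpretations are valid UTF-16 with one unique value in each byte position, so none of the first three heuristics applies and only the fallback remains.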

Unfortunately I can't allow -exifUnicodeByteOrder=SOMETHING when reading, because the way the application works, you enter write mode whenever you assign a tag value.  So the only way to handle this would be with yet another ExifTool option, and I don't think it is worth it for this case.

To work around this problem, you could add an ASCII character to the text.  Even a space will do it.  I have an additional piece of logic that the MWG doesn't use that checks for ASCII character codes.
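
Why a single space is enough can be sketched as follows. The exact ASCII check in ExifTool is not described in this thread, so the rule used here (prefer the byte order that yields more 16-bit units in the ASCII range) is an assumption, and `ascii_units` and `guess_by_ascii` are hypothetical names.

```python
def ascii_units(data: bytes, byteorder: str) -> int:
    """Count 16-bit units below 0x80 when read in the given byte order."""
    units = (
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data) - 1, 2)
    )
    return sum(1 for unit in units if unit < 0x80)


def guess_by_ascii(data: bytes, fallback: str = "big") -> str:
    """Assumed rule: the order producing more ASCII units wins."""
    big = ascii_units(data, "big")
    little = ascii_units(data, "little")
    if big != little:
        return "big" if big > little else "little"
    return fallback  # undecidable, e.g. a single non-ASCII character


# One character alone is undecidable; adding a space settles it.
alone = "山".encode("utf-16-be")
with_space = "山 ".encode("utf-16-be")
print(guess_by_ascii(alone, fallback="little"))
print(guess_by_ascii(with_space, fallback="little"))
```

With the space appended, the big-endian reading contains the unit 0x0020 while the little-endian reading contains 0x2000, so only one reading has an ASCII character.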

- Phil
Title: Re: ExifByteOrder question
Post by: herb on July 07, 2013, 02:15:53 PM
Hello Phil,

Thanks for your clear words.

Best regards
Herb