ExifTool Forum

General => Metadata => Topic started by: joakimk on May 28, 2016, 03:41:45 PM

Title: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 28, 2016, 03:41:45 PM
I'm trying to figure out how to write EXIF data (tags) to JPEGs, on Android using the Apache Commons Imaging (Sanselan) library.
I think I'm creating and writing the tag (EXIF_TAG_USER_COMMENT) correctly, but I'm having problems with "international characters" (i.e. anything outside US-ASCII).
I've been told that the field is encoded as US-ASCII?

Anyway, here are two JPEGs, where I've written the string "æøå" using two apps: My app, and "Photo Exif Editor (https://play.google.com/store/apps/details?id=net.xnano.android.photoexifeditor&hl=en)" (from Google Play).
If I run the JPEGs through ExifTool, both seem to have garbled UserComments:

other-app.JPG
User Comment                    : ├ª├©├Ñ
(http://bildr.no/thumb/QzQxdStk.jpeg) (http://bildr.no/view/QzQxdStk)

my-app.JPG
User Comment                    : ýĪýÄ©ýÄÑ
(http://bildr.no/thumb/VWFhYjBx.jpeg) (http://bildr.no/view/VWFhYjBx)

Also, the two apps cannot read each other's encoding (each shows nonsense in UserComment when opening the other app's image).

However, in Windows Explorer (right-click > Properties) the Comment in other-app.JPG is shown correctly as "æøå".
(http://bildr.no/thumb/Mnp2RDNr.jpeg) (http://bildr.no/view/Mnp2RDNr)

So how can that be? I've read on this forum something about it being possible to write non-ASCII characters using "ANSI encoding".
Do you guys know something about such internationalization problems with EXIF data? Maybe you can find some clues in the attached JPEGs?

Thanks!
Joakim


Ps. sorry if this becomes a double post -- I had some problems with posting this question.
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 28, 2016, 04:49:49 PM
I got this info on the Sanselan mailing list.

Quote:
I've looked at the code of
org.apache.commons.imaging.formats.tiff.taginfos.TagInfoGpsText
(ExifTagConstants.EXIF_TAG_USER_COMMENT is an instance of TagInfoGpsText).
Here are my observations:

  • The FieldType parameter, which you have set to
    TiffFieldTypeConstants.FIELD_TYPE_ASCII, is never used in the implementation
    of encodeValue(FieldType, Object, ByteOrder)
  • When converting the input String to a byte array, String.getBytes(String
    charsetName) is used
  • For charsetName, "US-ASCII" is always used (it cannot be configured by
    the user)
So my guess is that the code will not handle characters outside the
US-ASCII charset correctly.
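
The effect described in the quote can be reproduced in plain Java without Sanselan. The following is only a minimal sketch: the class and method names (`AsciiLossDemo`, `encodeAsAscii`) are made up for illustration and are not part of the library. It shows that encoding "æøå" with the US-ASCII charset replaces each character with a `?` byte (0x3f), so the original text is lost before it ever reaches the file.

```java
import java.nio.charset.StandardCharsets;

public class AsciiLossDemo {
    /** Encodes text with the US-ASCII charset (hypothetical helper for this demo). */
    static byte[] encodeAsAscii(String text) {
        // String.getBytes replaces unmappable characters with the charset's
        // replacement byte, which is '?' (0x3f) for US-ASCII.
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAsAscii("\u00e6\u00f8\u00e5"); // the string "æøå"
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        System.out.println(hex); // prints "3f 3f 3f"
    }
}
```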
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Phil Harvey on May 28, 2016, 07:38:47 PM
See FAQ 10 for help with this (https://exiftool.org/faq.html#Q10), specifically the EXIF section.

Also, it may be useful to use the exiftool -v3 option to see the raw data for this tag.

- Phil
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 29, 2016, 05:35:30 AM
Seems I can write international characters if I encode to UTF-8:

                               
// Encode the comment as raw UTF-8 bytes (no EXIF character-code header is written)
byte[] bytes = textToSet.getBytes("UTF-8");
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
        TiffFieldTypeConstants.FIELD_TYPE_ASCII, bytes.length, bytes);


Then "æøå" comes up nicely in both apps, as well as in Windows.
However, as long as there are non-ASCII characters in the UserComment field, ExifTool's -v3 output looks less happy:

If text = "Test", then we get:
  | | 15) UserComment = ASCIITest
  | |     - Tag 0x9286 (12 bytes, undef[12]):
  | |         5c39: 41 53 43 49 49 00 00 00 54 65 73 74             [ASCII...Test]


but if text = "æøå", we get:
  | | 18) UserComment = ......
  | |     - Tag 0x9286 (6 bytes, string[6] read as undef[6]):
  | |         216f: c3 a6 c3 b8 c3 a5                               [......]


if text = "eggæ", we get:
  | | 18) UserComment = egg..
  | |     - Tag 0x9286 (5 bytes, string[5] read as undef[5]):
  | |         2a55: 65 67 67 c3 a6                                  [egg..]


Should ExifTool be able to print (echo) the content properly in the Windows CMD output, or is this maybe just a result of Windows/CMD not being able to print the Unicode characters?
The fact that ExifTool does not show "ASCII" is maybe because... it's UTF-8?

Here is the JPEG with text = "æøå"  :-)
(http://bildr.no/thumb/aFNVTHV0.jpeg) (http://bildr.no/view/aFNVTHV0)


Thanks again!
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Hayo Baan on May 29, 2016, 06:41:12 AM
With -v3 you are looking at the internal (raw) byte data. In UTF-8, non-ASCII characters are represented by multiple bytes (e.g., hex codes c3 and a6 for æ). That's what you are seeing. I bet if you have exiftool display the user comment tag normally it will show up correctly ;)
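
As a quick cross-check of the -v3 dumps above, here is a small sketch in plain Java that prints the UTF-8 bytes of the two test strings. The class and helper names (`Utf8HexDump`, `hex`, `utf8Hex`) are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {
    /** Formats bytes as lowercase, space-separated hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns the UTF-8 encoding of the text as a hex string. */
    static String utf8Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00e6\u00f8\u00e5")); // "æøå" -> c3 a6 c3 b8 c3 a5
        System.out.println(utf8Hex("egg\u00e6"));          // "eggæ" -> 65 67 67 c3 a6
    }
}
```

Both lines match the raw bytes shown by -v3, which confirms that the file contains bare UTF-8 with no leading character code.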
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 29, 2016, 06:48:46 AM
Yeah, I see :-)
Two bytes per char. But if you look at the first post, the comment in other-app.JPG (æøå) does not print readably with exiftool when run normally.

So I'm at a loss ;-) Always an extra challenge for non-US programmers, this charset/Unicode business. Thanks for replying!
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Phil Harvey on May 29, 2016, 08:41:47 AM
The library you are using isn't writing the special characters correctly.  You can see what should happen if you use ExifTool to write special characters to this tag:

% exiftool a.jpg -usercomment=æøå
    1 image files updated
% ./exiftool a.jpg -usercomment
User Comment                    : æøå
% exiftool a.jpg -v3
[...]
  | | 14) UserComment = UNICODE...
  | |     - Tag 0x9286 (14 bytes, undef[14]):
  | |         0bc2: 55 4e 49 43 4f 44 45 00 00 e6 00 f8 00 e5       [UNICODE.......]
[...]


Note that I am on a Mac, and the Terminal is UTF-8 by default, so I didn't need to worry about the -charset option when writing this tag.

- Phil
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 29, 2016, 09:22:49 AM
But could it be that I'm just not creating the tag properly, that the library itself isn't to blame? Seems like I've tried everything, I just can't figure out how to write UTF8/Unicode to the JPEG.

Does it look like I'm creating/encoding the tag properly?
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Hayo Baan on May 29, 2016, 02:20:01 PM
Well, even though officially you may need to have your library add the UNICODE bit and use the true Unicode character values like Phil mentioned, your version of the file shows just fine in exiftool for me. But, like Phil, I am on a Mac, meaning that our character encoding is UTF-8 by default. If you add -charset EXIF=UTF8 to your command, I'd guess the user comment will show up just fine for you too.

Cheers,
Hayo
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 29, 2016, 03:04:12 PM
I checked the images on my Mac (my development laptop is a Dell laptop, running Win10), and -- yes -- other-app.JPG does come through as "æøå" :)
I tried adding -charset EXIF=UTF8 in Windows, but I still keep getting ├ª├©├Ñ
Anyway -- that was the image handled by the other app, so not very interesting for me except it gives me a reliable benchmark (check on the Mac).

So, checking images from all my various experiments, I found that this approach seems to actually do the trick:
(this is borrowed from the Users Common Apache forum (http://osdir.com/ml/user-commons-apache/2012-03/msg00046.html))

/*
   For UserComment what Sanselan does is autodetect the encoding: this
   will work for ASCII and write ASCII, but in 0.97 it wrongly assumed
   that the unicode encoding is UTF-8, whereas it's actually UTF-16 with
   byte ordering depending on the file's byte ordering. So in 0.97 you
   have to encode this manually yourself using a big hack:
*/
byte[] unicodeMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 };
byte[] comment = textToSet.getBytes("UTF-16LE"); // OR UTF-16BE if the file is big-endian!
byte[] bytesComment = new byte[unicodeMarker.length + comment.length];
System.arraycopy(unicodeMarker, 0, bytesComment, 0, unicodeMarker.length);
System.arraycopy(comment, 0, bytesComment, unicodeMarker.length, comment.length);
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
                                        TiffFieldTypeConstants.FIELD_TYPE_ASCII, bytesComment.length, bytesComment);


textToSet = "æøå" gives:
  | | 18) UserComment = UNICODE...
  | |     - Tag 0x9286 (14 bytes, string[14] read as undef[14]):
  | |         1d37: 55 4e 49 43 4f 44 45 00 e6 00 f8 00 e5 00       [UNICODE.......]

... identical to what you posted, Phil  :D
... and on the Mac, I get "UserComment: æøå"
(http://bildr.no/thumb/SWI1MVNw.jpeg) (http://bildr.no/view/SWI1MVNw)
(sorry, it's a banana)

What's puzzling/bugging me now is that this approach -- with the manual Unicode marker, null-termination etc. -- loses the ability to play nice with Windows Explorer.
Right-click the file > Properties shows an empty "Comments" field.
(http://bildr.no/thumb/NWhnR1hB.jpeg) (http://bildr.no/view/NWhnR1hB)

It is possible, because the other app manages it.

Title: Re: Charset and non-ASCII characters in UserComment
Post by: Phil Harvey on May 29, 2016, 05:45:05 PM
The way you are encoding your UNICODE text is correct now, provided that the byte ordering is consistent with the EXIF.  But as far as I know, Windows doesn't treat UNICODE text properly, and always assumes little-endian byte ordering.  So for EXIF to be both correct and compatible with Windows, the ExifByteOrder must be Little-Endian.  However, ExifTool allows you to set ExifUnicodeByteOrder when writing if you want to make the Unicode text byte ordering inconsistent with the EXIF byte order.  This lets you write Little-Endian Unicode text to a Big-Endian EXIF, which I think would make Windows happy.  (See the Extra Tags documentation (https://exiftool.org/TagNames/Extra.html) for a description of these special tags.)

- Phil
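
To make the byte-order point concrete, here is a minimal sketch in plain Java (no Sanselan calls) that builds the UserComment payload for either byte order. The class name `UserCommentBuilder` and its `build` and `hex` helpers are made up for illustration. With big-endian order it yields the bytes from the ExifTool-written file earlier in the thread (`00 e6 00 f8 00 e5` after the marker); with little-endian order it yields the bytes from the hand-encoded version (`e6 00 f8 00 e5 00`).

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UserCommentBuilder {
    // 8-byte character code "UNICODE\0" that precedes the text
    private static final byte[] UNICODE_ID =
            {0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00};

    /** Builds "UNICODE\0" followed by UTF-16 text in the given byte order (no BOM). */
    static byte[] build(String text, ByteOrder order) {
        Charset cs = (order == ByteOrder.BIG_ENDIAN)
                ? StandardCharsets.UTF_16BE
                : StandardCharsets.UTF_16LE;
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[UNICODE_ID.length + body.length];
        System.arraycopy(UNICODE_ID, 0, out, 0, UNICODE_ID.length);
        System.arraycopy(body, 0, out, UNICODE_ID.length, body.length);
        return out;
    }

    /** Formats bytes as lowercase, space-separated hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00e6\u00f8\u00e5"; // "æøå"
        System.out.println(hex(build(text, ByteOrder.BIG_ENDIAN)));
        System.out.println(hex(build(text, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

Passing the byte order as a parameter keeps the choice explicit, so the caller can either follow the file's EXIF byte order or deliberately pick little-endian for Windows compatibility.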
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 30, 2016, 08:37:13 AM
So, just an update. Thanks for the patience and for allowing me so much space on this forum :)

I tried changing to UTF-16BE (BigEndian), but no improvement. And, if I understand your comment below, LittleEndian is the correct encoding (for Windows, at least)?
In any case, nothing seemed to have an effect. The situation remained:


Rather than playing around more with the encoding, I kept it at UTF-16LE and instead tried changing how the tag itself is created. After changing the FieldType from FIELD_TYPE_ASCII to FIELD_TYPE_UNDEFINED, it seems the first problem (file properties > Comment) is resolved. It still works on the Mac.

So maybe this is the full, correct code for encoding and creating a UserComment tag in Java, using Sanselan? Hopefully someone else might benefit.


byte[] unicodeMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 };
byte[] comment = textToSet.getBytes("UTF-16LE"); // OR UTF-16BE if the file is big-endian!
byte[] bytesComment = new byte[unicodeMarker.length + comment.length];
System.arraycopy(unicodeMarker, 0, bytesComment, 0, unicodeMarker.length);
System.arraycopy(comment, 0, bytesComment, unicodeMarker.length, comment.length);
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
                                        TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length, bytesComment);


(http://bildr.no/thumb/YVFNc3ZI.jpeg) (http://bildr.no/view/YVFNc3ZI)

(http://bildr.no/thumb/cmhKVFlr.jpeg) (http://bildr.no/view/cmhKVFlr)

Thanks for all the help, and please let me know if something appears odd or wrong with the JPEG or EXIF data within.
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Phil Harvey on May 30, 2016, 11:51:27 AM
Is your first linked image the one that is supposed to have a UserComment in it?  I don't see one.

- Phil
Title: Re: Charset and non-ASCII characters in UserComment
Post by: joakimk on May 30, 2016, 01:40:14 PM
Here's what I'm getting:

  | | 15) UserComment = UNICODE...
  | |     - Tag 0x9286 (14 bytes, undef[14]):
  | |         593e: 55 4e 49 43 4f 44 45 00 e6 00 f8 00 e5 00       [UNICODE.......]

(or, on the Mac, simply UserComment: æøå)

Maybe the metadata doesn't survive the picture hosting website, bildr.no?
I'll try to attach the image here (but that hasn't always worked on this forum, unfortunately, at least in my experience)
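
For reading the tag back, the dumps in this thread suggest roughly the following decoding logic: an 8-byte character code (`ASCII\0\0\0` or `UNICODE\0`) followed by the text. This is only a sketch in plain Java with made-up names (`UserCommentDecoder`, `decode`). The fallback that treats a value without a recognized character code as bare UTF-8 is an assumption, chosen only because that is what the non-conforming writers in this thread produced.

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UserCommentDecoder {
    /**
     * Decodes a raw UserComment value (hypothetical helper).
     * unicodeOrder is the byte order assumed for UTF-16 text.
     */
    static String decode(byte[] raw, ByteOrder unicodeOrder) {
        if (raw.length >= 8) {
            String id = new String(raw, 0, 8, StandardCharsets.US_ASCII);
            byte[] body = Arrays.copyOfRange(raw, 8, raw.length);
            if (id.equals("UNICODE\0")) {
                Charset cs = (unicodeOrder == ByteOrder.BIG_ENDIAN)
                        ? StandardCharsets.UTF_16BE
                        : StandardCharsets.UTF_16LE;
                return new String(body, cs);
            }
            if (id.equals("ASCII\0\0\0")) {
                return new String(body, StandardCharsets.US_ASCII);
            }
        }
        // Assumption: no recognized character code, so guess bare UTF-8.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] littleEndian = {0x55, 0x4e, 0x49, 0x43, 0x4f, 0x44, 0x45, 0x00,
                (byte) 0xe6, 0x00, (byte) 0xf8, 0x00, (byte) 0xe5, 0x00};
        byte[] bareUtf8 = {(byte) 0xc3, (byte) 0xa6, (byte) 0xc3, (byte) 0xb8,
                (byte) 0xc3, (byte) 0xa5};
        // Both calls should yield the same three characters; how they display
        // depends on the console encoding.
        System.out.println(decode(littleEndian, ByteOrder.LITTLE_ENDIAN));
        System.out.println(decode(bareUtf8, ByteOrder.LITTLE_ENDIAN));
    }
}
```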
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Hayo Baan on May 30, 2016, 02:19:22 PM
I saw the usercomment just fine, indeed encoded like you showed. (@Phil, were you perhaps looking at the small image reference, not at the full image as found on bildr.no?)

Anyway, looks like everything is working as you want now :)
Title: Re: Charset and non-ASCII characters in UserComment
Post by: Phil Harvey on May 30, 2016, 04:33:58 PM
Quote from: Hayo Baan on May 30, 2016, 02:19:22 PM
(@Phil, were you perhaps looking at the small image reference, not at the full image as found on bildr.no?)

Right.  That was it.

Quote:
Anyway, looks like everything is working as you want now :)

Yup!

- Phil