Mixed encoding of JSON output

Started by RalfM, May 23, 2010, 08:31:11 AM

RalfM

Hi,

When I used EXIFtool (called by another program) to extract some MP3 tags and pass them to that program in JSON format, it worked at first, but at the first filename with non-ASCII characters the other program complained about encoding errors (invalid UTF8). Examining what was going on showed that some of the values returned by EXIFtool were encoded as UTF8, others as Latin1.

Details:
OS: German WinXP SP3
EXIFtool version: Windows executable, versions 8.12, 8.15 and 8.19 behave alike

cmdline:
  exiftool -j John*CD1\01*.mp3

output:
[{
  "SourceFile": "John le Carré (2009) The Spy Who Came in from the Cold CD1/01. The Spy Who Came in from the Cold 01.mp3",
  "ExifToolVersion": 8.19,
  "FileName": "01. The Spy Who Came in from the Cold 01.mp3",
  "Directory": "John le Carré (2009) The Spy Who Came in from the Cold CD1",
  "FileSize": "5.0 MB",
  "FileModifyDate": "2010:05:09 22:02:46+02:00",
  "FilePermissions": "rw-rw-rw-",
  "FileType": "MP3",
  "MIMEType": "audio/mpeg",
  "MPEGAudioVersion": 1,
  "AudioLayer": 3,
  "AudioBitrate": 128000,
  "SampleRate": 44100,
  "ChannelMode": "Joint Stereo",
  "MSStereo": "On",
  "IntensityStereo": "Off",
  "CopyrightFlag": false,
  "OriginalMedia": true,
  "Emphasis": "None",
  "VBRFrames": 8129,
  "VBRBytes": 5248719,
  "VBRScale": 90,
  "ID3Size": 513,
  "EncoderSettings": "LAME 32bits version 3.98.2 (http://www.mp3dev.org/)",
  "Title": "The Spy Who Came in from the Cold 01",
  "Artist": "John le Carré",
  "Album": "The Spy Who Came in from the Cold CD1",
  "Year": 2009,
  "Track": "01",
  "Genre": "Radio Play",
  "Length": "212.306 s",
  "Comment": "",
  "DateTimeOriginal": 2009,
  "Duration": "03:32 (approx)"
}]

What can be seen:
The values of "SourceFile" and "Directory" are encoded as Latin1 (as you can see from the last letter of Carré), while the value of "Artist" is encoded as UTF8 (Carré).

According to RFC 4627 all JSON *must* be in Unicode, so the Latin1-encoded values cause errors in applications that read JSON.

The EXIFtool help only mentions that UTF8 is the default for all EXIFtool output, and that the -charset option doesn't work with -j, so I guess I didn't do anything wrong there.

Further tests showed that this behavior is not limited to .mp3, but also can be seen with .jpg files.
I attached a small example (the ExifTool.jpg used by the EXIFtool distribution for testing, renamed to include äöü and an XPTitle added which also includes äöü).

EXIFtool output (many lines omitted):
[{
  "SourceFile": "ExifTooläöümod.jpg",
  "ExifToolVersion": 8.19,
  "FileName": "ExifTooläöümod.jpg",
  "Directory": ".",
  "FileSize": "21 kB",
  "FileModifyDate": "2009:08:24 16:36:54+02:00",
  "FilePermissions": "rw-rw-rw-",
  "FileType": "JPEG",
...
  "SceneType": "Directly photographed",
  "XPTitle": "Titel äöü Titel",
  "ThumbnailImage": "(Binary data 1558 bytes)",
...
  "FocalLength35efl": "6.0 mm (35 mm equivalent: 41.4 mm)",
  "HyperfocalDistance": "0.75 m"
}]

What can I do to get UTF8-only output?

Best regards
Ralf M.

Phil Harvey

#1
Hi Ralf,

This is a big problem.  Many (older) metadata formats do not adequately specify character encoding.  Often the specification is written assuming simple ASCII but other applications expand this to include whatever local character set the system is using.  In cases like this it is difficult or impossible to determine the character encoding for some strings.

ExifTool should convert anything in a known character set to whatever you specify with the -charset option, but for the reason mentioned above this won't include everything.

I am surprised about XPTitle though, since this should be stored as UCS2 and should be decoded properly by ExifTool.

[Edit] I just noticed you attached a sample, thanks.  The XPTitle works fine for me, but the filename is odd:

exiftool ~/Desktop/ExifToolA\314\203\302\244A\314\203\302\266A\314\203\302\274mod.jpg -xptitle -json
[{
  "SourceFile": "/Users/phil/Desktop/ExifTooläöümod.jpg",
  "XPTitle": "Titel äöü Titel"
}]


I think we have a problem with whatever you are using to display the exiftool output.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

RalfM

Hi Phil,

thanks for the hints and ideas. I only found time to follow your suggestions this weekend; results in short:

  • The output really is not UTF8, so this is not a viewer problem.
  • Your example can be explained, however I had to make an assumption for that.
  • Encoding of tag values in the files doesn't seem to be the problem here.
  • The core problem might be filename handling in windows.
The long version follows below, including an explanation why I think the topic is important for JSON output.
It would be nice if you could read the text below (even if it's somewhat lengthy) and tell me whether there is something in it or how I can get the expected result. If I can help by doing further tests don't hesitate to let me know (only don't expect an answer within 24 hours).

Best regards
Ralf M.


My "viewer"

For the previous posting I redirected the EXIFtool output to a file, opened that with Notepad and copied the contents via the clipboard to the forum. To double-check, I now ran:

exiftool -j -xptitle ExifTooläöümod.jpg > redirected.json
exiftool -j -xptitle ExifTooläöümod.jpg -w json
fc /b redirected.json ExifTooläöümod.json


Result of the binary comparison in the last line: Both files have the same contents.
So redirection doesn't change the contents.
Viewing the content in a hex viewer: (See screenshot in attached imagefile redirected.json.ashex.jpg)

The highlighted byte e4 represents the letter ä, thus the SourceFile is encoded as cp1252, not UTF8.
In the XPTitle the original letter ä is represented by the bytes c3 a4 (shown on the right as Ã¤), which is UTF8.

Thus EXIFtool under WinXP produces non-UTF-output as written in my first posting, it is not a viewer problem.
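
The same check can be done without a hex viewer. The following is only a minimal Python sketch using the two byte sequences named above; the helper name is_valid_utf8 is made up for illustration and is not part of EXIFtool or any library.

```python
# Byte sequences as described in the hex view above.
latin_a_umlaut = b"\xe4"      # found in the SourceFile value
utf8_a_umlaut = b"\xc3\xa4"   # found in the XPTitle value


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xE4 starts a multi-byte UTF-8 sequence that never completes.
print(is_valid_utf8(latin_a_umlaut))
print(is_valid_utf8(utf8_a_umlaut))

# Both byte sequences stand for the same letter once the right codec is used.
print(latin_a_umlaut.decode("cp1252") == utf8_a_umlaut.decode("utf-8"))
```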

BTW: The same is true for the standard output (without -j): If several files are given, the output of the filenames (after ========) is in cp1252, while the rest is UTF8. However here it doesn't do as much harm as with json (see at the end).


Your example

In your example output it is just the other way around, XPTitle looks like CP1252, SourceFile like UTF8 (assuming one letter per byte is shown) - this is very strange.
Even stranger is the filename in your EXIFtool invocation (but examining that leads to a possible explanation):
The name of the original file I attached was
ExifTooläöümod.jpg
i.e. "ExifTool" followed by the three letters "äöü" ("aou" with double dots) followed by "mod.jpg"
In your example it becomes
ExifToolA\314\203\302\244A\314\203\302\266A\314\203\302\274mod.jpg
i.e. each of the letters "äöü" is replaced by "A" followed by what looks like four octal numbers.
It took me some time to find a kind of explanation for that; the best I could make out of it is that on the way from my computer via the ExifTool Forum to your computer the filename was mangled as follows:

  • The name was converted to UTF8 ("ä" -> \303\244)
  • The bytes of the result were (incorrectly) interpreted as ISO8859-1 characters and converted to unicode (\303\244 -> 'LATIN CAPITAL LETTER A WITH TILDE' + 'CURRENCY SIGN')
  • The result was normalized to the form NFD ('LATIN CAPITAL LETTER A WITH TILDE' + 'CURRENCY SIGN' -> 'LATIN CAPITAL LETTER A' + 'COMBINING TILDE' + 'CURRENCY SIGN')
  • The result was encoded as UTF8 (resulting in \101\314\203\302\244 or shorter A\314\203\302\244)
In short: The filename was converted from ISO to UTF8 *twice* with some decomposing normalization in between.
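
The four steps listed above can be replayed in a few lines of Python. This is only a sketch of the assumed mangling, not a statement about what the forum software really does; the variable names are illustrative.

```python
import unicodedata

original = "ä"

# Step 1: encode the name as UTF-8 (bytes c3 a4, i.e. \303\244).
step1 = original.encode("utf-8")

# Step 2: misread those bytes as ISO8859-1 characters.
step2 = step1.decode("iso-8859-1")

# Step 3: normalize to NFD, which splits the A WITH TILDE into A + COMBINING TILDE.
step3 = unicodedata.normalize("NFD", step2)

# Step 4: encode the result as UTF-8 a second time.
step4 = step3.encode("utf-8")

print([unicodedata.name(c) for c in step3])
# Print the final bytes in the same octal notation used in the post.
print("".join(f"\\{b:03o}" for b in step4))
```

The octal string printed at the end can be compared directly with the \101\314\203\302\244 sequence derived by hand above.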

This leads to a possible explanation for the output you posted:
Assuming that your terminal program does not show one character per byte but interprets the byte stream as UTF8, the SourceFile value looks like UTF8 because of the double conversion described above (and the terminal only undoes one of these), while the XPTitle value is shown with its real value even though it is in UTF8 internally.
You seem to use some kind of Unix and I don't know Unix too well, but I heard that current flavours of Unix use UTF8 as filename encoding, so it might at least be possible that a terminal program understands UTF8.

Could you check this, simply by redirecting the output of EXIFtool to a file and inspect that file with a hex viewer?


Encodings

You mention the difficulties with older metadata formats that don't specify an encoding. I don't envy you the task of handling all these. Despite your doubts you do a pretty good job with my examples:

  • XPTitle is internally encoded in UCS2 (or possibly UTF16) as a look with a hex viewer shows, and you convert it correctly to UTF8 on output.
  • Artist in MP3 files is internally encoded in CP1252 (or possibly ISO8859-1) as a look with a hex viewer shows, and you convert it correctly to UTF8 on output.

The problem in the cases I came across seems to be the filename and directory only.


What might be the core problem

On the EXIFtool homepage you write "In Windows, ExifTool will not process files with Unicode characters in the file name. This is due to an underlying lack of support for Unicode filenames in the Windows standard C I/O libraries."
Thus you probably get filenames in the system encoding (similar to cp1252 on machines in western Europe and North America). It looks like you are writing these cp1252 filenames to the output byte by byte without converting them to UTF8 first.


Why UTF8 is important especially for JSON

JSON output is readable by humans, but it is even better suited to be read by other programs. For many programming languages libraries exist that read JSON and allow easy access to the data structure read. These libraries assume that their input conforms to the JSON-spec, RFC4627, and that clearly states that JSON consists of Unicode characters and shall be encoded as UTF8 (default), UTF16 or UTF32. RFC4627 even describes an algorithm on how to detect from the first four bytes of the JSON byte stream which of the UTF-encodings was used.
When I feed the byte stream shown in the hex viewer screenshot (see attached image file redirected.json.ashex.jpg) to a JSON library (of Python, to be precise) I get an exception because the byte stream is not valid UTF8.
The bad thing is that this happens on reading the data, before I get access to the contents, so that I don't have a chance to recode an offending value. (And as the offending value in my case is the filename, I cannot suppress it either: --SourceFile doesn't work.)
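
One possible workaround on the reading side, until the output is UTF8 throughout, is to decode the byte stream before handing it to the JSON library. The Python sketch below rests on several assumptions that are not from EXIFtool's documentation: that each tag sits on its own line (as in the output shown in this thread), that the non-UTF8 lines are cp1252, and that no cp1252 value happens to form valid UTF8 by coincidence. The function name load_mixed_json is hypothetical.

```python
import json


def load_mixed_json(raw: bytes, fallback: str = "cp1252"):
    """Decode line by line: UTF-8 where valid, otherwise the fallback codec.

    Hypothetical helper; the fallback codec is an assumption about the system
    that produced the data.
    """
    lines = []
    for line in raw.split(b"\n"):
        try:
            lines.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            lines.append(line.decode(fallback))
    return json.loads("\n".join(lines))


# Sample mimicking the mixed output: cp1252 file name, UTF-8 XPTitle.
sample = (
    b'[{\n'
    b'  "SourceFile": "ExifTool\xe4\xf6\xfcmod.jpg",\n'
    b'  "XPTitle": "Titel \xc3\xa4\xc3\xb6\xc3\xbc Titel"\n'
    b'}]\n'
)

data = load_mixed_json(sample)
print(data[0]["SourceFile"])
print(data[0]["XPTitle"])
```

Because the decision is made per line, a single offending value does not prevent the rest of the document from being read.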

Phil Harvey

Quote from: RalfM on May 30, 2010, 08:12:55 PM
The highlighted byte e4 represents the letter ä, thus the SourceFile is encoded as cp1252, not UTF8.

Thus EXIFtool under WinXP produces non-UTF-output as written in my first posting, it is not a viewer problem.

Unfortunate.  I see no easy way to fix this since I don't know how to determine the Windows encoding for file names.  Presumably they are encoded using whatever local encoding is used on the particular system.

It is interesting that you are able to run exiftool at all on files with unicode characters in their names.  A known problem is that exiftool will not process these files but apparently on some systems it does (I don't understand why).  The only work-around I can suggest is to not use special characters in file names on Windows.

Quote
In your example output it is just the other way around, XPTitle looks like CP1252, SourceFile like UTF8 (assuming one letter per byte is shown) - this is very strange.

No.  My terminal is set to UTF8.  So XPTitle is UTF8 since it displays properly.  The FileName is in some other encoding though.  Here is the binary dump of the output:

> exiftool ~/Desktop/ExifToolA\314\203\302\244A\314\203\302\266A\314\203\302\274mod.jpg -xptitle -json >t1

> hexdump t1
    0000: 5b 7b 0a 20 20 22 53 6f 75 72 63 65 46 69 6c 65 [[{.  "SourceFile]
    0010: 22 3a 20 22 2f 55 73 65 72 73 2f 70 68 69 6c 2f [": "/Users/phil/]
    0020: 44 65 73 6b 74 6f 70 2f 45 78 69 66 54 6f 6f 6c [Desktop/ExifTool]
    0030: 41 cc 83 c2 a4 41 cc 83 c2 b6 41 cc 83 c2 bc 6d [A....A....A....m]
    0040: 6f 64 2e 6a 70 67 22 2c 0a 20 20 22 58 50 54 69 [od.jpg",.  "XPTi]
    0050: 74 6c 65 22 3a 20 22 54 69 74 65 6c 20 c3 a4 c3 [tle": "Titel ...]
    0060: b6 c3 bc 20 54 69 74 65 6c 22 0a 7d 5d 0a       [... Titel".}].]
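
For readers who want to check the dump, the 15 bytes at offsets 0x30 to 0x3e can be unwound in Python. This is a sketch that assumes the double-conversion explanation quoted in this thread is correct; it is not something ExifTool does itself.

```python
import unicodedata

# The three mangled characters from offsets 0x30..0x3e of the dump.
mangled = bytes.fromhex("41cc83c2a4" "41cc83c2b6" "41cc83c2bc")

# The bytes are valid UTF-8, but they do not spell the original letters.
text = mangled.decode("utf-8")

# Undo the decomposition: "A" + COMBINING TILDE becomes A WITH TILDE again.
recomposed = unicodedata.normalize("NFC", text)

# Undo the mistaken ISO8859-1 interpretation, then decode the real UTF-8.
repaired = recomposed.encode("iso-8859-1").decode("utf-8")

print(repaired)
```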


Quote
In short: The filename was converted from ISO to UTF8 *twice* with some decomposing normalization in between.

Interesting.  I could see the forum messing with the file name like that.

Quote
You seem to use some kind of Unix and I don't know Unix too well, but I heard that current flavours of Unix use UTF8 as filename encoding, so it might at least be possible that a terminal program understands UTF8.

Exactly.

Quote
Could you check this, simply by redirecting the output of EXIFtool to a file and inspect that file with a hex viewer?

Done.

Quote
On the EXIFtool homepage you write "In Windows, ExifTool will not process files with Unicode characters in the file name. This is due to an underlying lack of support for Unicode filenames in the Windows standard C I/O libraries."
Thus you probably get filenames in the system encoding (similar to cp1252 on machines in western Europe and North America). It looks like you are writing these cp1252 filenames to the output byte by byte without converting them to UTF8 first.

Ah.  I see you found this.

- Phil

amenthes

I would like to revive this issue; it still exists.

With the recent (9.79, January 2015) addition of -charset FileName=cp1252 this issue could be fixed, right? At least as long as the user provides the info on the command line.

On a different note: ruby (from which i am planning to use exiftool) can actually determine the filesystem's encoding. Maybe their way is transferable to exiftool, then this could even be resolved automatically. In ruby, you can write Encoding.find("filesystem") which will return the correct value on my system.

Phil Harvey

#5
Quote from: amenthes on April 11, 2015, 07:48:12 AM
With the recent (9.79, January 2015) addition of -charset FileName=cp1252 this issue could be fixed, right?

I think to fix this issue you should use -charset filename=utf8 and specify the file name in UTF-8.

Quote
On a different note: ruby (from which i am planning to use exiftool) can actually determine the filesystem's encoding. Maybe their way is transferable to exiftool, then this could even be resolved automatically. In ruby, you can write Encoding.find("filesystem") which will return the correct value on my system.

Yes, I think I figured out how this could be done in Perl.  I thought about this but haven't implemented it because it would be an incredible amount of work to fix this properly.  There are just too many places where file names are used and displayed, and it is much easier for the user to do everything in UTF-8 if he cares about it.

- Phil

amenthes

Quote from: Phil Harvey on April 11, 2015, 09:21:46 AM
Quote from: amenthes on April 11, 2015, 07:48:12 AM
With the recent (9.79, January 2015) addition of -charset FileName=cp1252 this issue could be fixed, right?

I think to fix this issue you should use -charset filename=utf8 and specify the file name in UTF-8.

I'll give it a try. I'm not sure that i have control over encoding at that point, though. If i do, it is indeed a good solution.

amenthes

#7
Somewhere in between, something goes wrong.

The dirname looks like this: "2012-05-24 München", which is picked up by ruby's dir globbing and stored internally in a UTF-8 String:

00000000  32 30 31 32 2d 30 35 2d 32 34 20 4d c3 bc 6e 63  |2012-05-24 M..nc|
00000010  68 65 6e 2f                                      |hen/|


I then proceed to popen3 an exiftool process. I verified that all pipes (in/out/err) have external_encoding UTF-8 as well. The error message that comes back is this:

00000000  46 69 6c 65 20 6e 6f 74 20 66 6f 75 6e 64 3a 20  |File not found: |
00000010  32 30 31 32 2d 30 35 2d 32 34 20 4d fc 6e 63 68  |2012-05-24 M.nch|
00000020  65 6e 2f 0a                                      |en/.|


Unfortunately, I can't say with 100% certainty that Kernel.spawn() does not convert encodings somewhere along the way. I got the following out of API Monitor v2, but I'm not entirely sure how to interpret it; I'm not familiar with API Monitor and was basically searching for an equivalent of strace.

Ruby is doing a CreateProcessW call with this as second parameter (lpCommandLine):

0000  43 00 3a 00 5c 00 55 00 73 00 65 00 72 00 73 00  C.:.\.U.s.e.r.s.
0010  5c 00 61 00 6d 00 65 00 6e 00 74 00 68 00 65 00  \.a.m.e.n.t.h.e.
0020  73 00 5c 00 77 00 6f 00 72 00 6b 00 73 00 70 00  s.\.w.o.r.k.s.p.
0030  61 00 63 00 65 00 5c 00 66 00 6f 00 74 00 6f 00  a.c.e.\.f.o.t.o.
0040  65 00 78 00 70 00 6f 00 72 00 74 00 5c 00 65 00  e.x.p.o.r.t.\.e.
0050  78 00 69 00 66 00 74 00 6f 00 6f 00 6c 00 20 00  x.i.f.t.o.o.l. .
0060  2d 00 4a 00 20 00 2d 00 63 00 68 00 61 00 72 00  -.J. .-.c.h.a.r.
0070  73 00 65 00 74 00 20 00 46 00 69 00 6c 00 65 00  s.e.t. .F.i.l.e.
0080  4e 00 61 00 6d 00 65 00 3d 00 75 00 74 00 66 00  N.a.m.e.=.u.t.f.
0090  38 00 20 00 2d 00 63 00 68 00 61 00 72 00 73 00  8. .-.c.h.a.r.s.
00a0  65 00 74 00 20 00 75 00 74 00 66 00 38 00 20 00  e.t. .u.t.f.8. .
00b0  2d 00 73 00 75 00 62 00 6a 00 65 00 63 00 74 00  -.s.u.b.j.e.c.t.
00c0  20 00 22 00 32 00 30 00 31 00 32 00 2d 00 30 00   .".2.0.1.2.-.0.
00d0  35 00 2d 00 32 00 34 00 20 00 4d 00 fc 00 6e 00  5.-.2.4. .M...n.   <--- 0xFC would be "ü"
00e0  63 00 68 00 65 00 6e 00 2f 00 22 00 00 00        c.h.e.n./."...




At that moment, it looks like cp1252 or iso-8859-1. The 0xFC is the same "ü" in any of those.

After that, cmd.exe will call CreateProcessA with lpCommandLine set to this:

0000  43 3a 5c 55 73 65 72 73 5c 61 6d 65 6e 74 68 65  C:\Users\amenthe
0010  73 5c 77 6f 72 6b 73 70 61 63 65 5c 66 6f 74 6f  s\workspace\foto
0020  65 78 70 6f 72 74 5c 65 78 69 66 74 6f 6f 6c 20  export\exiftool
0030  2d 4a 20 2d 63 68 61 72 73 65 74 20 46 69 6c 65  -J -charset File
0040  4e 61 6d 65 3d 75 74 66 38 20 2d 63 68 61 72 73  Name=utf8 -chars
0050  65 74 20 75 74 66 38 20 2d 73 75 62 6a 65 63 74  et utf8 -subject
0060  20 32 30 31 32 2d 30 35 2d 32 34 20 4d fc 6e 63   2012-05-24 M.nc  <--- again: 0xFC
0070  68 65 6e 2f 00                                   hen/.


So, i guess i can't use ruby to call exiftool with utf-8 encoding after all. If this were utf-8, it should be [0xC3 0xBC].
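
To make the byte values in the dumps easier to compare, here is a short Python sketch (Python 3.8 or later for the hex separator) that encodes "ü" in the encodings discussed. One reading, offered as an assumption rather than a confirmed diagnosis: the FC 00 pair in the CreateProcessW dump is what UTF-16LE produces for "ü", while the single FC in the CreateProcessA dump matches cp1252 or ISO-8859-1.

```python
char = "ü"

# Encode the same character in each encoding mentioned in the post.
encodings = {
    codec: char.encode(codec)
    for codec in ("utf-16-le", "cp1252", "iso-8859-1", "utf-8")
}

for codec, raw in encodings.items():
    print(f"{codec:>10}: {raw.hex(' ')}")
```

Only the UTF-8 row contains the c3 bc pair that exiftool would need to see with -charset FileName=utf8.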

(note to self: which is weird, as this patch seems to be in ruby for over a year https://bugs.ruby-lang.org/projects/ruby-trunk/repository/revisions/41709 which is addressing something that sounds very similar https://bugs.ruby-lang.org/issues/1771 )

amenthes

Never mind, I just found a solution that does not butcher the charset:

I'm opening an exiftool process and passing all my arguments directly through stdin via the provided '-@ -'.

This seems to work out just fine!
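
For anyone doing the same from Python instead of ruby, a minimal sketch could look like the following. It assumes exiftool is on the PATH and relies on the documented argfile convention of one argument per line; the function names build_argfile and run_exiftool are invented for this example, and the option list is only an illustration.

```python
import subprocess


def build_argfile(args):
    """Serialize arguments for '-@ -': one per line, encoded as UTF-8."""
    for arg in args:
        if "\n" in arg or "\r" in arg:
            raise ValueError(f"argument contains a line break: {arg!r}")
    return "".join(arg + "\n" for arg in args).encode("utf-8")


def run_exiftool(args, exe="exiftool"):
    """Run exiftool with all arguments fed through stdin.

    Hypothetical wrapper; requires an exiftool executable to be installed.
    """
    result = subprocess.run(
        [exe, "-@", "-"],
        input=build_argfile(args),
        capture_output=True,
        check=False,
    )
    return result.stdout, result.stderr


# Options and their values go on separate lines in an argfile.
payload = build_argfile(["-j", "-charset", "filename=utf8", "2012-05-24 München/"])
print(payload)
```

Since only the fixed string "-@ -" travels over the command line, the directory name never passes through the ANSI conversion seen in the CreateProcessA dump.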

Phil Harvey

Excellent.  I'm glad you found the solution.

- Phil

Marsu42

Quote from: amenthes on April 12, 2015, 01:17:19 PM
I'm opening an exiftool process and pass all my arguments directly through stdin via the provided '-@ -'.

Btw, after going nearly crazy using exiftool with the windows command line, the -@ arg option is the one I resorted to, too. My problem is that I use utf-8 encoding for the tags, but one app I use positively insists on iptc being latin2 - argh.

It might be easier using a Unix shell, like on real Linux or at least cygwin, but anything other than UTF-16 on Windows is unreliable, and especially with UTF-8 you're asking for trouble.