Frustrated to learn that ExifTool doesn't support the cp936 charset

Started by HaujetZhao, October 23, 2021, 11:47:18 AM

HaujetZhao

In my region (mainland China), GBK (CP936) is the default Windows system encoding instead of UTF-8. It's a historical legacy that is hard to change, so we can only adapt to it.

ExifTool now can't return the correct user comment text written by digiKam or Windows Explorer, because the non-ASCII characters are in CP936 encoding.

Is that fixable in the foreseeable future?

Hayo Baan

I had a look into this, and it looks like this should be feasible for Phil to add to exiftool. cp936 (gbk) is a supported encoding in perl (at least on my system).

@Phil, would this just be a matter of extending %charsetName in ExifTool.pm or are there other places this must be dealt with?
Hayo Baan – Photography
Web: www.hayobaan.nl

Phil Harvey

I did a bit of checking, and Windows cp936 is a double-byte character set.  I don't even know how this would work for the console.  All character sets supported by ExifTool are single-byte character sets (I'm counting UTF-8 as a single-byte set even though it does support multibyte characters through special extension codes).

Could you try something like this and attach the output text file?:

echo "testing: Some_cp936_character_string" > out.txt

(leave the "testing:" at the start of the string so I have some characters that I should be able to recognize, but use Chinese characters for the rest.)

And try this to see how it comes back in your console:

type out.txt

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

HaujetZhao

Quote from: Phil Harvey on October 23, 2021, 07:58:42 PM
Could you try something like this and attach the output text file?:

echo "testing: Some_cp936_character_string" > out.txt

And try this to see how it comes back in your console:

type out.txt

type "D:\Users\Haujet\Desktop\String with cp936 encoding.txt"
Strings below is in cp936 encoding: 云蒸沧海,雨润桑田。阴阳世界,造化黎元。羲农开辟,轩昊承传。魃凌涿鹿,熊奋阪泉。四凶伏罪,群兽听宣。垂裳拱手,击壤欢颜。挽弓射日,采石补天。巢由小隐,稷契大贤。触峰贻患,治水移权。繇惟北面,舜竟南迁。洪荒待考,虚诞连篇。聊将俊杰,尽作神仙。


Phil Harvey

Thanks.

Very interesting.  This also seems like a variable-character-size encoding.  Regular ASCII characters are unchanged, but codes presumably over 0x7f are 2 bytes long.
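As a rough illustration of the variable-width behaviour described above, here is a minimal Python sketch. It assumes Python's built-in `gbk` codec is a close enough stand-in for Windows cp936 (the two are known to differ in a handful of code points), and the helper name `gbk_byte_widths` is made up for this example.

```python
def gbk_byte_widths(text: str) -> list[tuple[str, int]]:
    """Return (character, number of GBK bytes) pairs for each character."""
    return [(ch, len(ch.encode("gbk"))) for ch in text]


sample = "testing: 云蒸沧海"

# ASCII characters keep their single-byte codes;
# each Chinese character in this sample takes two bytes.
for ch, width in gbk_byte_widths(sample):
    print(repr(ch), width)

encoded = sample.encode("gbk")

# The ASCII prefix is byte-for-byte identical to plain ASCII.
print(encoded.startswith(b"testing: "))

# The first byte of a two-byte code is above 0x7f.
print(hex("云".encode("gbk")[0]))
```

Only the lead byte is guaranteed to be above 0x7f. In GBK the second byte of a pair can fall inside the ASCII range, which is one reason a byte-at-a-time lookup table is not enough for this encoding.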

I found a table that maps Windows cp936 to Unicode code points.  I should be able to support this.  Let me work on it.

One question:  What is the common name for this character set?  GBK?

- Phil

Phil Harvey

Wait.  Looking at your sample JPG file, ExifTool is already decoding this properly.  Here is what I get on a Mac (and presumably you would get the same thing in Windows if you have UTF-8 properly enabled in your console):

> exiftool -Description-zh-CN -S "A image with cp936 comment metadata inside.jpg"
Description-zh-CN: 云蒸沧海,雨润桑田。阴阳世界,造化黎元。羲农开辟,轩昊承传。魃凌涿鹿,熊奋阪泉。四凶伏罪,群兽听宣。垂裳拱手,击壤欢颜。挽弓射日,采石补天。巢由小隐,稷契大贤。触峰贻患,治水移权。繇惟北面,舜竟南迁。洪荒待考,虚诞连篇。聊将俊杰,尽作神仙。


What exactly do you mean when you say "ExifTool now can't return the correct user comment text"?  -- As far as I can tell, it does return the correct text.  Is the problem just that you want to be able to return it as cp936?  Is it not possible to change your console to cp65001?

- Phil

StarGeek

Setting Windows as shown in this StackOverflow answer also appears to extract it correctly.

C:\>exiftool -g1 -a -s -Description* "Y:\!temp\aa\c\A image with cp936 comment metadata inside.jpg"
---- XMP-dc ----
Description                     : Software: Snipaste
Description-zh-CN               : 云蒸沧海,雨润桑田。阴阳世界,造化黎元。羲农开辟,轩昊承传。魃凌涿鹿,熊奋阪泉。四凶伏罪,群兽听宣。垂裳拱手,击壤欢颜。挽弓射日,采石补天。巢由小隐,稷契大贤。触峰贻患,治水移权。繇惟北面,舜竟南迁。洪荒待考,虚诞连篇。聊将俊杰,尽作神仙。
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

HaujetZhao

Quote from: Phil Harvey on October 23, 2021, 11:12:48 PM
What exactly do you mean when you say "ExifTool now can't return the correct user comment text"?  -- As far as I can tell, it does return the correct text.  Is the problem just that you want to be able to return it as cp936?  Is it not possible to change your console to cp65001?

That's weird. This is the result from my Windows laptop:

PS D:\Users\Haujet\Desktop> exiftool -Description-zh-CN -S "A image with cp936 comment metadata inside.jpg"
Description-zh-CN: 浜戣捀娌ф捣锛岄洦娑︽鐢般€傞槾闃充笘鐣岋紝閫犲寲榛庡厓銆傜静鍐滃紑杈燂紝杞╂槉鎵夸紶銆傞瓋鍑屾犊楣匡
紝鐔婂闃硥銆傚洓鍑朵紡缃紝缇ゅ吔鍚銆傚瀭瑁虫嫳鎵嬶紝鍑诲¥娆㈤銆傛尳寮撳皠鏃ワ紝閲囩煶琛ュぉ銆傚发鐢卞皬闅愶紝绋峰澶ц搐銆傝Е宄拌椿鎮o紝娌绘按绉绘潈銆傜箛鎯熷寳闈紝鑸滅珶鍗楄縼銆傛椽鑽掑緟鑰冿紝铏氳癁杩炵瘒銆傝亰灏嗕繆鏉帮紝灏戒綔绁炰粰銆?


In my WSL Linux subsystem (UTF-8 system encoding), I did get the correct result:

user@Haujet-Matebook:/mnt/d/Users/Haujet/Desktop$ exiftool -Description-zh-CN -S "A image with cp936 comment metadata inside.jpg"
Description-zh-CN: 云蒸沧海,雨润桑田。阴阳世界,造化黎元。羲农开辟,轩昊承传。魃凌涿鹿,熊奋阪泉。四凶伏罪,群兽听宣。垂裳拱手,击壤欢颜。挽弓射日,采石补天。巢由小隐,稷契大贤。触峰贻患,治水移权。繇惟北面,舜竟南迁。洪荒待考,虚诞连篇。聊将俊杰, 尽作神仙。


Premise:

- My Windows system uses cp936 as its encoding
- Mac and Linux use UTF-8 as their encoding

So my guess is:

- ExifTool read the correct value, but wrote it to the Windows console (stdout) as UTF-8 encoded bytes
- Windows then decoded those bytes as cp936 and printed the result on the console

But I'm not so sure about the actual reason.
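
This guess can be checked without ExifTool. The following is a minimal Python sketch, assuming Python's `gbk` codec behaves like Windows cp936 for these characters: it encodes the first four characters of the sample text as UTF-8 (what the guess supposes ExifTool writes) and then decodes those bytes as GBK (what a cp936 console would do).

```python
original = "云蒸沧海"

# What the guess assumes ExifTool writes to stdout: UTF-8 encoded bytes.
utf8_bytes = original.encode("utf-8")

# What a cp936 console would do with those bytes: interpret them as GBK.
# errors="replace" keeps the sketch from raising if a trailing byte is left over.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled)

# The damage is reversible as long as no byte was lost along the way.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered == original)
```

If the guess is right, the first line printed should be the same kind of garbled text as the PowerShell output above, beginning with 浜. Twelve UTF-8 bytes are read as six two-byte GBK characters, which is why the garbled text is longer than the original.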

This is the history of cp936:

- In 1980, China introduced GB2312 to handle Chinese characters. It includes 7445 characters and is backward compatible with ASCII.
- In 1995, because of GB2312's limitations, it was extended to GBK 1.0, which includes 21886 characters and is backward compatible with GB2312.
- In 2000, GBK 1.0 was extended to GB18030, which added characters for some ethnic minority languages and became the official national standard.

It is said that when code pages were being assigned, IBM put GBK on page 936, which is why GBK is also called cp936.

Because each of ASCII, GB2312, GBK, and GB18030 is backward compatible with its predecessor, supporting GBK is sufficient for normal use.
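
The compatibility chain can be spot-checked with Python's built-in codecs, on the assumption that they follow the respective standards closely enough for this purpose. The character choices are illustrative: 云 is in GB2312, while the traditional form 體 is used as an example of a character that GBK covers but GB2312 does not. The helper `encodes_in` is a made-up name for this sketch.

```python
def encodes_in(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# ASCII text has the same bytes in every encoding of the family.
ascii_same = len({"testing".encode(c) for c in ("ascii", "gb2312", "gbk", "gb18030")}) == 1

# A GB2312 character keeps its two-byte code in GBK and GB18030.
gb2312_same = len({"云".encode(c) for c in ("gb2312", "gbk", "gb18030")}) == 1

print(ascii_same, gb2312_same)

# A character outside GB2312 that GBK can still represent.
print(encodes_in("體", "gb2312"), encodes_in("體", "gbk"))
```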


HaujetZhao

Additional:

After using

chcp 65001

the result in cmd is:

Active code page: 65001

D:\Users\Haujet\Desktop>exiftool -Description-zh-CN -S "A image with cp936 comment metadata inside.jpg"
Description-zh-CN: 云蒸沧海,雨润桑田。阴阳世界,造化黎元。羲农开辟,轩昊承传。魃凌涿鹿,熊奋阪泉。四凶伏罪,群兽听宣。
垂裳拱手,击壤欢颜。挽弓射日,采石补天。巢由小隐,稷契大贤。触峰贻患,治水移权。繇惟北面,舜竟南迁。洪荒待考,虚诞连篇。聊将俊杰,尽作神仙。

D:\Users\Haujet\Desktop>


The result in PowerShell is:

Active code page: 65001
PS D:\Users\Haujet\Desktop> exiftool -Description-zh-CN -S "A image with cp936 comment metadata inside.jpg"
Description-zh-CN: 浜戣捀娌ф捣锛岄洦娑︽鐢般€傞槾闃充笘鐣岋紝閫犲寲榛庡厓銆傜静鍐滃紑杈燂紝杞╂槉鎵夸紶銆傞瓋鍑屾犊楣匡
紝鐔婂闃硥銆傚洓鍑朵紡缃紝缇ゅ吔鍚銆傚瀭瑁虫嫳鎵嬶紝鍑诲¥娆㈤銆傛尳寮撳皠鏃ワ紝閲囩煶琛ュぉ銆傚发鐢卞皬闅愶紝绋峰澶ц搐銆傝Е宄拌椿鎮o紝娌绘按绉绘潈銆傜箛鎯熷寳闈紝鑸滅珶鍗楄縼銆傛椽鑽掑緟鑰冿紝铏氳癁杩炵瘒銆傝亰灏嗕繆鏉帮紝灏戒綔绁炰粰銆?

PS D:\Users\Haujet\Desktop>


So I think you can reproduce the wrong result in cmd by running `chcp 936`.

Phil Harvey

So it is working for you as it should in cmd.exe (a-la FAQ 18).

I can't speak for why it doesn't work in PowerShell, but there are other problems with PowerShell so I would recommend not using it.

- Phil

HaujetZhao

Thank you, Phil! I guess I should set aside some time to translate the FAQ before the next problem comes up. The FAQ is really a treasure. I'm not a native English speaker, so reading English articles is relatively hard for me; that's why I haven't finished reading it, but I'll get there.

Hayo Baan

Interesting. Since the text was in the XMP data, it should already be in UTF-8 encoding (iirc, this is the mandatory encoding). It would therefore display correctly if the code page is 65001.

To be able to display comments etc. encoded in cp936/GBK, however, a change to exiftool (perhaps simply like I suggested) would still be required. @Phil, would it really not be as simple as I said? Or do you do your own character encoding/decoding instead of using the Perl Encode module?

Cheers,
Hayo

Phil Harvey

Hi Hayo,

cp936 is more difficult than the other supported character sets because it is multi-byte.  It would require additional work and some dedicated code to be able to use this encoding to set new values when writing.  So I wouldn't want to add support for this unless there was a real demand.

- Phil

Hayo Baan

Quote from: Phil Harvey on October 31, 2021, 10:58:48 AM
cp936 is more difficult than the other supported character sets because it is multi-byte.  It would require additional work and some dedicated code to be able to use this encoding to set new values when writing.  So I wouldn't want to add support for this unless there was a real demand.

Right, I was already afraid it was more cumbersome than simply adding it to the list of supported encodings. Pity :(