Output recoding for Unicode file names happens under the wrong condition

Started by Yang-z, April 09, 2022, 08:27:28 AM


Yang-z

Hi,
I know the Windows command line doesn't support Unicode very well unless the system code page is set to 65001 (UTF-8).
However, I have found a bug that may be caused by ExifTool itself.

My current code page is 936 (Simplified Chinese). When I deal with file names containing Chinese characters, the output name is garbled.
So I ran a test:

1. Use command-line to pass the 'file' argument:
Quote
PS H:\Git\ExifToolGUI\.samples\1> exiftool -file:filename "中文.jpg"
File Name                       : 中文.jpg
This looks fine to me, since my code page is 936.

2. Use command-line to pass the 'file' argument, and output is set to json:
Quote
PS H:\Git\ExifToolGUI\.samples\1> exiftool -file:filename "中文.jpg" -j
[{
  "SourceFile": "????.jpg",
  "FileName": "????.jpg"
}]
Now the characters of the original file name turn into something the command line cannot display.

3. Use an arg file (test.txt, UTF-8 encoded, containing only the file name "中文.jpg") to pass the 'file' argument:
Quote
PS H:\Git\ExifToolGUI\.samples\1> exiftool -charset filename=utf8 -file:filename -@ "test.txt"
File Name                       : 涓枃.jpg

PS H:\Git\ExifToolGUI\.samples\1> exiftool -charset filename=utf8 -file:filename -@ "test.txt" -j
[{
  "SourceFile": "涓枃.jpg",
  "FileName": "涓枃.jpg"
}]
Now, the output file name (with -j or not) becomes "涓枃.jpg".

I used "> out.txt" to capture the result and a hex tool to see what happened.

I found that 涓枃.jpg (the output of the third test run) consists of the UTF-8 encoded bytes of 中文.jpg (the original file name).

I pass a file name from a UTF-8 encoded file to ExifTool, along with the parameter "-charset filename=utf8". When ExifTool passes it to the JSON or command-line output, the program seems to treat the UTF-8 file name as cp936 (my local system code page) and apply an extra recoding step (from cp936 to UTF-8).
(I tested this by using Windows Notepad to save out.txt with ANSI encoding (letting Notepad recode it from UTF-8 to ANSI cp936). When I open the new copy, out(ANSI).txt, it is recognised as UTF-8 and the result is correct.)

Coming back to the output of the 2nd test run: the file name is passed on the command line, so it should be 'recoded automatically to the system code page'. When ExifTool passes it to JSON, a cp936-to-UTF-8 recoding should take place. However, nothing happens: ExifTool passes the raw cp936 bytes to JSON, and they are treated as UTF-8 bytes later on.
(I tested this by using pyexiftool with the '-b' parameter to get the raw bytes of unrecognised tag values. Yes, the file name is now unrecognised and is returned in the form base64:...):
Quote
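    import base64
    lm_s: str = metadata[3]['SourceFile']      # value looks like "base64:..."
    lm_b: bytes = base64.b64decode(lm_s[7:])   # [7:] strips the "base64:" prefix
    out_ = lm_b.decode('cp936')
    # cp936 (GBK, a superset of gb2312) is my local code page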
the right file name comes back.
)
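
One plausible reading of the "????.jpg" result is that the cp936 bytes are simply not valid UTF-8, so whatever produces the JSON has to substitute something for them. The sketch below only illustrates that property with Python's own decoder; it makes no claim about how ExifTool handles the bytes internally, and `is_valid_utf8` is a hypothetical helper.

```python
# Sketch: cp936 bytes for a Chinese file name are not valid UTF-8.
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


cp936_name = "中文.jpg".encode("cp936")   # bytes as passed under code page 936
utf8_name = "中文.jpg".encode("utf-8")    # bytes as stored in a UTF-8 arg file

# A lossy decode swaps each invalid byte for U+FFFD; the original
# bytes cannot be recovered from the resulting string.
lossy = cp936_name.decode("utf-8", errors="replace")
```

Here `is_valid_utf8(cp936_name)` is false while `is_valid_utf8(utf8_name)` is true, and `lossy` holds four replacement characters followed by ".jpg", which lines up with the four question marks in the JSON output.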

So I guess the output recoding logic of ExifTool contains some... um... typing mistakes?
The recoding from the local code page to UTF-8 should happen in test run 2 rather than in test run 3.

If I change my system code page to 65001, everything is fine, because the local-to-UTF-8 recoding becomes UTF-8-to-UTF-8, and characters from any other language are supported too. If my system code page is a local one, local characters in file names could be supported once this bug is fixed.

I hope this post helps.


system: win11
exiftool ver: 12.4

StarGeek

Do you have the same problems if you use CMD? PowerShell can corrupt binary data and it might be interpreting the characters as such.

Yang-z

Quote from: StarGeek on April 09, 2022, 11:15:41 AM
Do you have the same problems if you use CMD? PowerShell can corrupt binary data and it might be interpreting the characters as such.

I just tried.
Yes, the same results:

Quote
H:\Git\ExifToolGUI\.samples\1>exiftool -file:filename "中文.jpg"
File Name                       : 中文.jpg

H:\Git\ExifToolGUI\.samples\1>exiftool -file:filename "中文.jpg" -j
[{
  "SourceFile": "????.jpg",
  "FileName": "????.jpg"
}]

H:\Git\ExifToolGUI\.samples\1>exiftool -file:filename -charset filename=utf8 -@ "test.txt"
File Name                       : 涓枃.jpg

H:\Git\ExifToolGUI\.samples\1>exiftool -file:filename -charset filename=utf8 -@ "test.txt" -j
[{
  "SourceFile": "涓枃.jpg",
  "FileName": "涓枃.jpg"
}]

Phil Harvey

If your console is set properly to UTF-8 then this should work.  Here it is on my Mac:

> exiftool 中文.jpg -filename -j
[{
  "SourceFile": "中文.jpg",
  "FileName": "中文.jpg"
}]


The file name shouldn't be recoded for the JSON output if it is valid UTF-8.  JSON requires valid UTF-8.

- Phil

Yang-z

Quote from: Phil Harvey on April 12, 2022, 07:42:59 AM
If your console is set properly to UTF-8 then this should work.

Hi Phil,
Thanks for your reply.
Yes, I know that if the console is set to UTF-8, everything will be OK.
On the Windows command line, if I set the system code page to 65001 (UTF-8), the output is right.

Quote
The file name shouldn't be recoded for the JSON output if it is valid UTF-8.  JSON requires valid UTF-8.
No, it shouldn't be.
However, in my 3rd test run above, the file name was valid UTF-8 when it was passed to ExifTool, but later on ExifTool treated it as cp936 and triggered a cp936-to-UTF-8 recoding before passing it to JSON. So JSON got a doubly encoded string.

Conversely, in my 2nd test the name is in cp936 and a cp936-to-UTF-8 recoding should be triggered before it is passed to JSON, but it wasn't.

So I guess ExifTool does try to take care of recoding, but some typing mistakes in the source code cause the recoding to be triggered wrongly under these two conditions.

Could you please look into the source code and find out why this happens?
It would be appreciated:)

Thanks a lot for your work.



Yang-z

Quote from: Phil Harvey on April 12, 2022, 07:42:59 AM
The file name shouldn't be recoded for the JSON output if it is valid UTF-8.  JSON requires valid UTF-8.

Today I did some extra tests and found the following.
The result of test 3 (涓枃.jpg) is caused by a wrong encoding parameter in a decoding step, something like:
decode(encoding='ansi')
Everything is fine until the UTF-8 encoded file name is passed to the Windows console (cmd or PowerShell alike). The Windows console then decodes the file name as 'ansi'.
So this problem is caused by the Windows terminal, not ExifTool.
Setting the current code page to UTF-8 fixes it:
chcp 65001
Alternatively, write the output to a file.
Or, for developers, this mis-decoding can be reversed without loss:

    # python ('ansi' is a Windows-only codec name for the system ANSI code page)
    s:str = "涓枃.jpg"
    b:bytes = s.encode(encoding='ansi')
    s8:str = b.decode(encoding='utf-8')
    # s8 == "中文.jpg"
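
The same reversal can be written so that it also runs outside Windows by naming the code page explicitly instead of relying on 'ansi'. This is a minimal sketch under the assumption that the garbled text came from UTF-8 bytes decoded as cp936; `undo_misdecoding` is a hypothetical helper, and it returns None when the text does not fit that assumption.

```python
# Sketch: undo "UTF-8 bytes decoded with the wrong code page".
# undo_misdecoding is a hypothetical helper name, not a library function.
from typing import Optional


def undo_misdecoding(garbled: str, wrong_encoding: str = "cp936") -> Optional[str]:
    """Re-encode with the code page that was wrongly used, then decode as UTF-8.

    Returns None if the text cannot be the product of that mistake.
    """
    try:
        return garbled.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
```

A caller would pass the garbled file name read from the console output. Text that was never mis-decoded, such as a correct "中文.jpg", yields None rather than a second layer of damage.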



As for the result of test 2 (????.jpg): ExifTool gets the file name encoded in the system code page, and somehow it becomes ????.jpg when JSON stores it. It seems that JSON takes the original file name as UTF-8 encoded bytes containing encoding errors and simply "fixes" them.
This time the result is damaged irreversibly, unless '-b' is used to protect the file name (encoded in the system code page).

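    # python
    import base64
    import locale
    # garbled is the "base64:..." value returned with '-b'; [7:] strips the prefix
    b: bytes = base64.b64decode(garbled[7:])
    local_encoding = locale.getpreferredencoding(False)
    filename: str = b.decode(local_encoding)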

What about setting the current code page to UTF-8? It doesn't help: the input file name will eventually be recoded to the system code page. And there is no way for developers to change the Windows system code page dynamically.


Overall, the output problem I reported is not caused by ExifTool itself. It is introduced by the non-UTF-8 Windows current and system code pages, and by the lack of recoding steps. Possible solutions:
1. Ask Windows users to change the system code page to UTF-8. (Adds support for all languages in file names.)
2. Or use files for input and output. (Adds support for the local language in file names only.)
3. Or use '-b' to protect the raw file name encoded in the system code page. (Adds support for the local language in file names only, but it will cause problems when '-U' is also specified.)
4. Or simply don't use '-j'.
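
For option 2, a script can keep the console out of the loop entirely by writing a UTF-8 arg file and reading ExifTool's output as raw bytes. The following is a rough sketch, assuming `exiftool` is on the PATH and using only the options shown in the tests above; `build_command` and `read_filenames` are hypothetical helper names, and which characters survive the round trip is not established here.

```python
# Sketch: pass file names through a UTF-8 arg file and read the JSON as bytes.
# Assumes an `exiftool` executable is on the PATH; helper names are hypothetical.
import json
import subprocess
import tempfile
from pathlib import Path


def build_command(argfile: str) -> list[str]:
    """Build the same command line as test 3, with JSON output enabled."""
    return ["exiftool", "-charset", "filename=utf8",
            "-file:filename", "-j", "-@", argfile]


def read_filenames(names: list[str]) -> list[dict]:
    """Write `names` to a temporary UTF-8 arg file and parse ExifTool's JSON."""
    with tempfile.TemporaryDirectory() as tmp:
        argfile = Path(tmp) / "args.txt"
        argfile.write_text("\n".join(names) + "\n", encoding="utf-8")
        # capture_output keeps stdout as bytes, so no console decoding happens
        result = subprocess.run(build_command(str(argfile)),
                                capture_output=True, check=True)
    return json.loads(result.stdout.decode("utf-8"))
```

Calling `read_filenames(["中文.jpg"])` from the directory that contains the image would return the parsed JSON list. The bytes are decoded as UTF-8 by the script itself rather than by cmd or PowerShell.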