FAQ: 18b. "I'm having problems with special characters on the Windows

Started by eed, September 04, 2020, 08:02:13 AM


eed

Windows 10 64-bit. At the command prompt (cmd), I first changed the code page to UTF-8:


C:\work>chcp 65001
Active code page: 65001


Now I'm trying to execute this command:

C:\work>ExifTool -charset filename=utf8 -@ D:\αβγ\test.txt
"Error opening arg file D:\???\test.txt"


Note that the arg file test.txt is in a folder with the Greek name "αβγ".

I've set the code page to UTF-8 and I'm using the "-charset filename=utf8" option.

What am I missing?

The Greek name is just an example.
If I have (on the same computer) folders named in different languages (Greek, Cyrillic, Turkish, etc.), how can I process an arg file in these folders?

Phil Harvey

I don't know if you are missing anything.  I can't test this right now, but special characters in Windows file names are a real problem.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

StarGeek

This is something I've always had a problem with.  I couldn't get these characters to work even with the new Windows Terminal.  I still haven't gotten around to installing the Windows Linux subsystem yet to see if that works.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

eed

Hi, Phil.

I did some tests and found out that the problem with special characters on the Windows command line is not related to ExifTool; it is a Windows issue.
I wrote a small, simple console program with one task: output a hex dump of the bytes of its command line.

To get the command line, two Windows API functions can be used: GetCommandLineA or GetCommandLineW.
The "-A" version (ANSI) works with code pages, while the "-W" version (wide) works with Unicode.

Results from tests:
With GetCommandLineW, the command line is in UTF-16 (UTF-16LE, little-endian byte order, to be more specific).
So the result is always UTF-16, regardless of the code page set with chcp. Even if the code page is set to UTF-8 (chcp 65001), the result is UTF-16.
The test app can be started from the command prompt, from Windows Explorer, or via another app with CreateProcess; in all cases we get UTF-16.
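
To make the UTF-16LE layout concrete, here is a minimal Python sketch of what such a hex dump looks like. It does not call the Windows API; it only encodes a sample string the way a wide command line is laid out in memory. The helper name hex_dump and the sample command line are made up for illustration.

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return data.hex(" ")


# Hypothetical command line containing the Greek folder name from this thread.
command_line = "test.exe D:\\αβγ\\test.txt"

# UTF-16LE stores each of these characters as two bytes, low byte first.
wide = command_line.encode("utf-16-le")
print(hex_dump(wide))

# The three Greek letters alone (U+03B1, U+03B2, U+03B3):
print(hex_dump("αβγ".encode("utf-16-le")))
```

ASCII characters show up as the character byte followed by 00, and the Greek letters as b1 03, b2 03, b3 03.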

With GetCommandLineA, the command line is in ANSI format.
The original UTF-16 command line is converted to ANSI with possible data loss, because not all Unicode characters can be mapped to ANSI.
And usually we do have data loss (unless we limit ourselves to English only).
Unfortunately, even if the code page in the console is set to UTF-8, the command line is NOT Unicode, but ANSI.
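
The data loss can be imitated in a few lines of Python. This is only a rough sketch: it assumes the ANSI code page is Windows-1252 (an assumption, yours may differ), and it uses Python's "replace" error handler as a stand-in for the Windows conversion, which may pick best-fit characters instead of "?" in some cases. The function name to_ansi is made up.

```python
def to_ansi(command_line: str, codepage: str = "cp1252") -> bytes:
    """Encode to an ANSI code page, replacing unmappable characters with '?'."""
    return command_line.encode(codepage, errors="replace")


path = "D:\\αβγ\\test.txt"

# Greek letters have no slot in Windows-1252, so they are lost.
ansi = to_ansi(path)
print(ansi.decode("cp1252"))

# A pure ASCII path survives unchanged.
print(to_ansi("D:\\work\\test.txt").decode("cp1252"))
```

Once the characters have been replaced there is no way to get the original folder name back, which matches the "D:\???\test.txt" in the error message above.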


Based on that, I have one suggestion for ExifTool (for Windows only):
Instead of using the standard input "STDIN", use GetCommandLineW to get the command line (in UTF-16LE).
If needed, it can be converted to UTF-8 with the Windows API function WideCharToMultiByte.

Benefits: The command line no longer depends on the code page that is set.
The same command line can contain any characters in any language, even in several different languages at once.
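
As a rough illustration of the idea (not ExifTool code, and not a claim about how it could be patched), here is a Python sketch. The function wide_command_line calls GetCommandLineW through ctypes and only works on Windows. The function utf16le_to_utf8 plays the role that WideCharToMultiByte with CP_UTF8 would play, using Python's own codecs instead of the API. Both function names and the sample path are made up.

```python
import ctypes
import sys


def utf16le_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16LE bytes as UTF-8 without going through a code page."""
    return raw.decode("utf-16-le").encode("utf-8")


def wide_command_line() -> str:
    """Return the full command line via GetCommandLineW (Windows only)."""
    get_cmdline = ctypes.windll.kernel32.GetCommandLineW
    get_cmdline.argtypes = []
    get_cmdline.restype = ctypes.c_wchar_p
    return get_cmdline()


# Hypothetical command line mixing Greek, Cyrillic and Turkish folder names.
mixed = "exiftool -@ D:\\αβγ\\где\\ığş\\test.txt"
utf8 = utf16le_to_utf8(mixed.encode("utf-16-le"))
print(utf8.decode("utf-8") == mixed)  # nothing is lost in the conversion

if sys.platform == "win32":
    # On Windows this shows the real command line of the running process.
    print(wide_command_line())
```

Because both UTF-16 and UTF-8 can represent every Unicode character, the conversion is lossless whatever chcp is set to.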


I'm not sure if this is a good idea or not. Just a suggestion.

Phil Harvey

Thanks for this suggestion.  Unfortunately, it would be some work to switch to using GetCommandLineW instead of the standard Perl argument handling.  I think this would be possible, but it would require me to spend far more time in my Windows virtual machine than I would like.  (My virtual machine is dead slow because I don't have enough RAM on that system.  Also, I would rather be dealing with metadata than adding system-specific patches.)

- Phil