Encoding issues on Windows

Started by alexs, February 24, 2020, 03:11:02 AM


alexs

I'm trying to automate photo processing using ExifTool and Python. In most cases everything works fine, but when the file path or tag values contain non-ASCII characters (Cyrillic, umlauts, or other characters), I get errors. I have read FAQ #10 and FAQ #18 and also some forum topics, but I'm still stuck.

My Python script invokes ExifTool via subprocess, and all communication is done by reading from and writing to ExifTool's stdin/stdout. Here is the command used to start ExifTool:
exiftool -stay_open True -charset filename=utf-8 -@ - -common_args -groupNames --printConv

The -charset filename=utf-8 option was set to handle filenames with non-ASCII characters correctly. The stdin and stdout "files" of the ExifTool process are opened using UTF-8 encoding; in my understanding this should allow any characters in file names and inside tags without problems. But it does not work as expected: when the file name contains Cyrillic characters and Windows is also Cyrillic, the file cannot be opened. The same happens when the file name contains Cyrillic characters while Windows uses cp1252 or another locale.

I tried running ExifTool directly from the Windows cmd.exe and don't understand the results. The first run uses the default command-line encoding for Cyrillic Windows, OEM 866. Calling ExifTool on a file with non-ASCII characters in its name almost works: only the FileName tag contains an unreadable value, and there is a warning:


C:\>chcp
Active code page: 866

C:\>c:\tools\exiftool.exe c:\test\тестове.jpg
ExifTool Version Number         : 11.88
File Name                       : ЄхёЄютх.jpg
Directory                       : c:/test
Warning                         : FileName encoding not specified
...


If I try to apply -charset filename=UTF8, then the file cannot be found:

C:\>chcp
Active code page: 866

C:\>exiftool.exe -charset filename=UTF8 c:\test\тестове.jpg
Invalid filename encoding for c:/test/ЄхёЄютх.jpg
Error opening directory c:/test/ЄхёЄютх.jpg


If I change console encoding to 65001 as suggested by FAQ #18, it does not help either

C:\>chcp 65001
Active code page: 65001

C:\>exiftool.exe  c:\test\тестове.jpg
ExifTool Version Number         : 11.88
File Name                       : .jpg
Directory                       : c:/test
Warning                         : FileName encoding not specified


Even if -charset filename=UTF8 is specified

C:\>chcp 65001
Active code page: 65001

C:\>exiftool.exe -charset filename=UTF8 c:\test\тестове.jpg
Invalid filename encoding for c:/test/.jpg
Error opening directory c:/test/.jpg


At the same time, on Linux and Mac everything works fine with non-ASCII characters both in tag values and file names (as I understand it, because they use UTF-8 everywhere).

Is it possible to make ExifTool accept non-ASCII characters in filenames and tags on Windows for any locale? I already tried opening ExifTool's stdin/stdout using the Windows system encoding (cp1251 or cp1252), but this seems to have no effect on the issue.

Phil Harvey

Your command-line test won't work because you never specified the system code page for -charset filename.  The "chcp" command sets only the console code page.  From FAQ 18:

Note that Windows will recode arguments on the command line from the current console code page to the system code page, so the ExifTool -charset should be set to the system code page for command-line arguments. However, this technique may yield unexpected results since not all characters may be represented using the system code page.

But FAQ 18 goes on to say:

To get around this limitation, arguments may be read from an ExifTool argument file using the -@ option. UTF‑8 encoding is recommended for the argument file, and with this you would also set -charset filename=utf8 if using special characters in filename arguments within the file.

Which is what you are doing with the -stay_open True -@ -, but I think the problem here is that you say the pipes are opened using UTF-8 encoding, which could easily mess things up.  The pipes should be binary so they don't try to mess with the character encoding, and you should encode things yourself in UTF-8 before passing them.
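A minimal Python sketch of the binary-pipe approach described above. The helper names (`encode_args`, `start_exiftool`, `run_batch`) are illustrative and not part of ExifTool. It assumes `exiftool` is on the PATH, that each batch is terminated with `-execute`, and that ExifTool answers each batch with a `{ready}` line and UTF-8 output.

```python
import subprocess


def encode_args(args):
    """Encode one batch of arguments as UTF-8 bytes, one argument per line,
    terminated by -execute so a -stay_open process runs the batch."""
    lines = list(args) + ["-execute"]
    return ("\n".join(lines) + "\n").encode("utf-8")


def start_exiftool(exe="exiftool"):
    """Start ExifTool with binary pipes: no text=True and no encoding=...,
    so Python does not recode anything on the way in or out."""
    cmd = [exe, "-stay_open", "True", "-charset", "filename=utf8", "-@", "-",
           "-common_args", "-groupNames", "--printConv"]
    return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)


def run_batch(proc, args):
    """Send one batch and collect raw output up to the {ready} marker."""
    proc.stdin.write(encode_args(args))
    proc.stdin.flush()
    chunks = []
    while True:
        line = proc.stdout.readline()
        if not line or line.strip() == b"{ready}":
            break
        chunks.append(line)
    # Assumption: output is UTF-8 (ExifTool's default output charset).
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With this layout the only place an encoding is chosen is `encode_args`, so the bytes ExifTool receives do not depend on the Windows locale or on Python's default text encoding.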

It may help to first try this using a UTF-8 argfile (with -@ argfile, and without the -stay_open option).
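The argfile test suggested above could be set up from Python roughly like this. It is only a sketch: the function names and the `args.txt` file name are made up for illustration, and it assumes one argument per line in the file, as with `-@ -`.

```python
from pathlib import Path


def argfile_bytes(args):
    """Return argfile content as UTF-8 bytes, one argument per line."""
    return "".join(arg + "\n" for arg in args).encode("utf-8")


def write_argfile(path, args):
    """Write the argfile in binary mode so no locale recoding can occur."""
    Path(path).write_bytes(argfile_bytes(args))


def argfile_command(argfile, exe="exiftool"):
    """Command line that reads arguments from the UTF-8 argfile.
    The command line itself stays pure ASCII."""
    return [exe, "-charset", "filename=utf8", "-@", str(argfile)]


# Hypothetical usage:
#   write_argfile("args.txt", ["c:/test/тестове.jpg"])
#   subprocess.run(argfile_command("args.txt"), capture_output=True)
```

Because the non-ASCII file name travels inside the file rather than on the command line, Windows never gets a chance to recode it to the system code page.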

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

alexs

Quote
Your command-line test won't work because you never specified the system code page for -charset filename.  The "chcp" command sets only the console code page.

Actually I do specify -charset filename in the command line test.

Quote
If I try to apply -charset filename=utf-8, then file can not be found

C:\>chcp
Active code page: 866

C:\>exiftool.exe -charset filename=UTF8 c:\test\тестове.jpg
Invalid filename encoding for c:/test/ЄхёЄютх.jpg
Error opening directory c:/test/ЄхёЄютх.jpg


and

Quote
Even if -charset filename=UTF8 is specified

C:\>chcp 65001
Active code page: 65001

C:\>exiftool.exe -charset filename=UTF8 c:\test\тестове.jpg
Invalid filename encoding for c:/test/.jpg
Error opening directory c:/test/.jpg


Quote
I think the problem here is that you say the pipes are opened using UTF-8 encoding

Thanks, I will try binary mode. But my assumption was that if the pipes are in UTF-8 and Unicode is passed to them, then no recoding is done.

Phil Harvey

Quote from: alexs on February 24, 2020, 11:42:18 AM
Actually I do specify -charset filename in the command line test.

Yes, but you specified UTF8, which is not the system code page.
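The output in the command-line tests is consistent with this explanation. The Python sketch below assumes the system code page on that machine is cp1251 (the thread only says "Cyrillic Windows") and uses made-up helper names. It shows that the file name, once recoded to cp1251, is not valid UTF-8, and that those same bytes displayed in a cp866 console give the unreadable name from the first test.

```python
def console_view(name, system_cp, console_cp):
    """Bytes of `name` in the system code page, shown as the console would
    display them if they were printed unchanged."""
    raw = name.encode(system_cp)
    return raw.decode(console_cp, errors="replace")


def is_valid_utf8(raw):
    """True if the byte string can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "тестове.jpg"
print(console_view(name, "cp1251", "cp866"))   # the garbled FileName
print(is_valid_utf8(name.encode("cp1251")))    # False
```

Under that assumption, the matching setting for arguments typed on the command line would be -charset filename=cp1251 rather than UTF8.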

alexs

Ah, now it is a bit clearer. So I need to specify the system code page in -charset filename even if the console uses UTF-8. Thanks!

May I ask what the best approach would be if the system code page is, for example, CP1252 and the filename or tag value contains characters that are not available in CP1252 (e.g. some Cyrillic letters or diacritic symbols)? I suspect that using the system code page won't work in this case.

Phil Harvey

Quote from: alexs on February 25, 2020, 06:10:38 AM
May I ask what the best approach would be if the system code page is, for example, CP1252 and the filename or tag value contains characters that are not available in CP1252 (e.g. some Cyrillic letters or diacritic symbols)? I suspect that using the system code page won't work in this case.

Correct.  As I quoted above from the documentation:

However, this technique may yield unexpected results since not all characters may be represented using the system code page.
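One way to detect this situation in advance from Python is to test whether a name survives encoding to the system code page. This is a sketch only; `representable` is a hypothetical helper, and the code page names are Python codec names, not ExifTool options.

```python
def representable(text, codepage):
    """True if every character of `text` exists in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# A Cyrillic name fits cp1251 but not cp1252; an umlaut fits cp1252.
print(representable("тестове.jpg", "cp1251"))  # True
print(representable("тестове.jpg", "cp1252"))  # False
print(representable("müller.jpg", "cp1252"))   # True
```

When the check fails, the name cannot be passed reliably on the command line, which leaves the UTF-8 argument file (or `-@ -` with binary pipes) as the route quoted from FAQ 18 earlier in the thread.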

- Phil