Proposal: Give Windows ExifTool access to full Unicode command-line arguments

Started by johnrellis, July 01, 2019, 05:13:14 PM


johnrellis

[I'm posting this separately from Oliver Betz's proposed new launcher, since that proposal should be considered independent of this.]

Problem

Currently, it isn't possible to pass command-line arguments containing arbitrary Unicode characters to Windows ExifTool. Internally, all Windows processes receive their command-line arguments as UTF-16. But Perl uses the Windows C library main(), rather than the native wmain(), and main() converts the contents of argv from UTF-16 to the current Windows system code page (a legacy encoding that covers only a tiny subset of Unicode). Any argument character not in that code page is converted to "?". This happens regardless of the current system and console code pages, and even when users set the console code page to UTF-8.
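For example, on a system whose code page is Windows-1252, a diagnostic one-liner along these lines (hypothetical; run from cmd.exe) shows the bytes Perl actually receives. The final ı of "kârlı", which Windows-1252 cannot represent, arrives as 3F, the code for "?":

perl -e "printf qq{%vX\n}, shift" kârlı
6B.E2.72.6C.3F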

Solution

The solution has two small, surgical parts:

1. Change Oliver Betz's "ppl.c" launcher to use wmain() to receive the command-line arguments in UTF-16, encode them as UTF-8, and pass them to ExifTool via RunPerl.

2. Change ExifTool to convert the command-line arguments from UTF-8 to the current -charset encoding (if it is something other than the default UTF-8).

Analysis

It's quite difficult to use Windows ExifTool with full Unicode. Perl delivers command-line arguments to ExifTool encoded in the current system code page, an 8-bit mapping that provides a tiny subset of all Unicode characters.  Even putting arguments in a "-@ argfile" can fail, e.g. if the user (or a client program) puts the argfile in %TEMP% and the current username has characters not in the current code page.

As Phil recently said, "Windows has been a thorn in my side, but it seems like a majority of ExifTool users run this platform". This proposal can pull one of those thorns out, with just a small, painful sting.

As background to this proposal, read these to refresh your understanding of how ExifTool handles character encodings:

https://exiftool.org/faq.html#Q10
https://exiftool.org/exiftool_pod.html#WINDOWS-UNICODE-FILE-NAMES

I've already tested Part 1 with a modified version of "ppl.c" -- see https://www.dropbox.com/s/prdk5yyqsb2j40t/ppl-2019-07-01.c?dl=0 . That alone is sufficient to allow ExifTool to access command-line arguments containing any Unicode character when -charset is set to the default UTF8. Other Perl programmers have used this approach of calling wmain() and converting the arguments to UTF-8; see for example https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html.
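As a sanity check, a small Perl test script along these lines (hypothetical, not part of ExifTool) can verify that the arguments really do arrive as valid UTF-8 and that no characters were replaced:

use strict;
use warnings;
use Encode qw(decode);

# With the modified launcher, every element of @ARGV should be valid UTF-8,
# so decoding must succeed and the original code points should be recoverable.
foreach my $i (0 .. $#ARGV) {
    my $chars = decode('UTF-8', $ARGV[$i], Encode::FB_CROAK);
    printf "arg %d: %d bytes, %d characters: %s\n",
        $i, length($ARGV[$i]), length($chars),
        join(' ', map { sprintf 'U+%04X', ord } split(//, $chars));
}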

To get (nearly) unfettered access to full Unicode with ExifTool, users would:

- Do "chcp 65001" to set the console code page to UTF-8.
- Use "-charset filename=utf8" to tell ExifTool to interpret filenames in UTF-8.

Part 2 is needed to preserve the current ExifTool semantics when users have used -charset to choose an encoding other than UTF-8. For example, a user may have their system and console code page set to Turkish and be using "-charset Turkish" with ExifTool. Currently, ExifTool assumes that Perl's main() will provide command-line arguments in that character encoding, and ExifTool will produce its output in that encoding. So to maintain those semantics, ExifTool will have to convert the UTF-8 arguments to Turkish.
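In code, the Part 2 conversion might look roughly like this (a sketch using the standard Encode module, not ExifTool's actual internals; the mapping from -charset names to Encode encoding names is only illustrative):

use Encode qw(decode encode);

# Illustrative mapping from ExifTool -charset names to Perl Encode names
my %encodeName = (UTF8 => 'UTF-8', Turkish => 'cp1254', Latin => 'cp1252');

sub convert_args {
    my ($charset, @args) = @_;          # $charset is the -charset setting
    return @args if $encodeName{$charset} eq 'UTF-8';   # already UTF-8, nothing to do
    # the arguments arrive from the launcher as UTF-8 bytes; re-encode them
    return map { encode($encodeName{$charset}, decode('UTF-8', $_)) } @args;
}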

I had hoped that this solution would also allow ExifTool itself to be installed in locations whose paths include arbitrary characters. But even though "ppl.c" could encode the path to "exiftool.pl" in argv[1] as UTF-8, that won't help, since Perl reads the script with standard C library i/o functions taking 8-bit filenames. Perl does provide the -Ci option for using the "wide character" (UTF-16) Windows i/o functions, but that doesn't affect how it reads scripts, unfortunately.



obetz

Quote from: johnrellis on July 01, 2019, 05:13:14 PM
[...]
2. Change ExifTool to convert the command-line arguments from UTF-8 to the current -charset encoding (if it is something other than the default UTF-8).
[...]
Part 2 is needed to preserve the current ExifTool semantics when users have used -charset to choose an encoding other than UTF-8. For example, a user may have their system and console code page set to Turkish and be using "-charset Turkish" with ExifTool. Currently, ExifTool assumes that Perl's main() will provide command-line arguments in that character encoding, and ExifTool will produce its output in that encoding. So to maintain those semantics, ExifTool will have to convert the UTF-8 arguments to Turkish.

What is the benefit of investing a lot of effort to provide a partially (!) UTF-8 argv to ExifTool if ExifTool then converts it back to the system CP?

Quote from: johnrellis on July 01, 2019, 05:13:14 PM
I had hoped that this solution would also allow ExifTool itself to be installed in locations whose paths include arbitrary characters. But even though "ppl.c" could encode the path to "exiftool.pl" in argv[1] as UTF-8, that won't help, since Perl reads the script with standard C library i/o functions taking 8-bit filenames.

Even if I repeat myself: program paths with non-ASCII characters are problematic. Does "Jürgen" feel so much better if his login name is not "Juergen"? I have to admit that I don't know how bad it is in other languages, but I would stick with a Latin nickname even then.

Cautious people use the "POSIX portable filename character set". Don't trust programmers to quote everything correctly.

Quote from: johnrellis on July 01, 2019, 05:13:14 PM
Perl does provide the -Ci option for using the "wide character" (UTF-16) Windows i/o functions

https://perldoc.perl.org/perlrun.html describes the -Ci option this way:  "UTF-8 is the default PerlIO layer for input streams".

This seems to be a completely different thing than 'using the "wide character" (UTF-16) Windows i/o functions'.

Over the last few days I have spent a lot of time investigating the character set situation in Windows and Perl (not yet ExifTool).

My findings so far: This looks like a can of worms!

Windows is poorly supported by Perl. I don't know whether the maintainers consider Windows an unworthy or underprivileged system, but that doesn't matter for the result: Perl is Linux-centric.

I subscribed to win32-vanilla@perl.org and sent a question regarding Unicode support, but the message wasn't even distributed to the list. Well, there were just 7 postings in 2018 according to an archive.

John, you linked to https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html. Did you read "I know why kârlı gets double UTF-8 encoded, I know what needs to be fixed, but, as the title says, this post is the first in a series, and I will discuss those issues and their fixes in follow-up posts"? There are no follow-up posts. Did A. Sinan Unur give up?

Hayo Baan

The problem imho is not so much that Perl is Unix centric, the problem is that Windows has implemented character encodings in such a non-standard way...

Moreover, even on a unix-like system, getting the UTF-8 encoding to work fully – and well – is quite difficult and complex. With the Perl utf8::all module (I am the current maintainer) and related modules, a lot has already been taken care of, but even with those you have to be careful (especially when converting to/from other encodings or when crossing "borders"). Doing character encodings properly is hard, and this is not helped by the fact that in Windows things are implemented so completely differently/incompatibly... Perhaps (hopefully) the new UNIX environment inside Windows will help here.
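For example, on a unix-like system a script can get reasonable UTF-8 behaviour for its source code, @ARGV, the standard handles and opened files with a single pragma (roughly; see the module documentation for the exact scope):

use utf8::all;     # source, @ARGV, STDIN/STDOUT/STDERR and opened files treated as UTF-8
print "kârlı\n";   # printed correctly, no "wide character" warnings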

If you look at other languages, support for UTF8 is generally even slimmer than with Perl, requiring much more programming to make things work correctly (e.g. interpreting multiple bytes as a single "character", etc.).
Hayo Baan – Photography
Web: www.hayobaan.nl

obetz

Quote from: Hayo Baan on July 02, 2019, 06:52:34 AM
The problem imho is not so much that Perl is Unix centric, the problem is that Windows has implemented character encodings in such a non-standard way...

* What do you consider the "standard"?
* How does Windows deviate from this standard?

Windows introduced Unicode with Windows NT back in the early nineties (!) and set (kind of) a standard. Initially it was UCS-2, but soon (IIRC starting with Windows 2000), they switched from UCS-2 to UTF-16.

To my knowledge, ECMAScript/JavaScript and Java also use UTF-16 internally. Python seems to prefer UTF-8.

Linux switched to UTF-8 after 2000. "The internet" started to use UTF-8 approx. 2006?

What's the standard used by Perl, and when was it available in a mature manner?

Don't misunderstand me: I'm no speci-alist *) in character encoding. My main profession is embedded hardware and software, the most complicated user interface I'm doing myself is 7 bit ASCII over TTY.

Oliver

*) without the hyphen, the word is mangled by the forum software because it looks like a drug

Hayo Baan

Quote from: obetz on July 02, 2019, 09:59:28 AM
Quote from: Hayo Baan on July 02, 2019, 06:52:34 AM
The problem imho is not so much that Perl is Unix centric, the problem is that Windows has implemented character encodings in such a non-standard way...

* What do you consider the "standard"?

Standard as in what "most" systems seem to use and the way things seem to be going (e.g. the internet really has standardised on UTF-8).

Quote from: obetz on July 02, 2019, 09:59:28 AM
* How does Windows deviate from this standard?
Well, for one, they still use the notion of codepages with all sorts of different interpretations of 8-bit-only characters. UTF-8 support this way seems a bit broken (codepage 65001 doesn't seem to be true UTF-8). I think the fact that they chose to (internally) store things as UTF-16 could have been a good and transparent solution IF the translations at the boundary were clean and automatic. However, that also doesn't seem to be the case :(. Perhaps if codepage 65001 were implemented as a proper UTF-8 codepage, most of the issues would not exist.

Quote from: obetz on July 02, 2019, 09:59:28 AM
Windows introduced Unicode with Windows NT back in the early nineties (!) and set (kind of) a standard. Initially it was UCS-2, but soon (IIRC starting with Windows 2000), they switched from UCS-2 to UTF-16.

To my knowledge, also ECMAScript/JavasScript and Java use UTF-16 internally. Python seems to prefer UTF-8.

Linux switched to UTF-8 after 2000. "The internet" started to use UTF-8 approx. 2006?

What's the standard used by Perl, and when was it available in a mature manner?
Perl uses a different encoding internally, one you won't normally come into contact with. I don't know exactly what it is either. At any rate, it is built in such a way that comparing code points can be done rather efficiently. Of all the languages I have worked with, character encoding support in Perl seems to be the best; at least in Perl these things are supported in the core. In other languages you still have to do lots of extra things even for something as simple as counting characters rather than bytes when determining the length of a string.
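A quick illustration of what I mean (plain Perl, nothing exotic):

use utf8;                        # string literals in this source are UTF-8
use Encode qw(encode);

my $word  = "kârlı";             # 5 characters
my $bytes = encode('UTF-8', $word);

print length($word),  "\n";      # 5 - characters, because $word is a decoded string
print length($bytes), "\n";      # 7 - bytes, because â and ı each take two bytes in UTF-8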

Quote from: obetz on July 02, 2019, 09:59:28 AM
Don't misunderstand me: I'm no speci-alist *) in character encoding. My main profession is embedded hardware and software, the most complicated user interface I'm doing myself is 7 bit ASCII over TTY.
Me neither! What I know comes mostly from what I've read in various forums and from my work on the utf8 Perl modules. I'm by no means a true expert on Unicode ;)

Quote from: obetz on July 02, 2019, 09:59:28 AM
*) without the hyphen, the word is mangled by the forum software because it looks like a drug
Ha ha, yeah sometimes the forbidden word list is a bit strange...
Hayo Baan – Photography
Web: www.hayobaan.nl

Phil Harvey

Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
Ha ha, yeah sometimes the forbidden word list is a bit strange...

The forbidden word is C_i_a_l_i_s because this forum has been hit with a spam-bot posting ads for this.  It seems that the word filter doesn't take into account word boundaries, so it censors that word in the middle of s_p_e_c_i_a_l_i_s_t   :(

- Phil

johnrellis

Quote
What is the benefit of investing a lot of effort to provide a partially (!) UTF-8 argv to ExifTool if ExifTool then converts it back to the system CP?

The precise proposal is a little different:
Quote
Change ExifTool to convert the command-line arguments from UTF-8 to the current -charset encoding (if it is something other than the default UTF-8).

If ExifTool -charset is set to UTF-8, then the command-line arguments are left as UTF-8. Only if -charset is set to some other encoding, e.g. Turkish, are the command-line arguments converted to that encoding.

As explained in the proposal, this implements the current -charset semantics, which users rely on. Consider a Turkish user who has set the Windows system code page and the console code page to Turkish and is using "-charset Turkish" with ExifTool. Currently, that user expects all command-line arguments to be converted to Turkish inside ExifTool, and she expects all output from ExifTool to be in the Turkish code page. She may well have scripts using ExifTool that depend on those semantics. This proposal maintains those semantics and won't break her existing usage of ExifTool.

johnrellis

Re: https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html:
Quote
I know why kârlı gets double UTF-8 encoded, I know what needs to be fixed, but, as the title says, this post is the first in a series, and I will discuss those issues and their fixes in follow-up posts
He is referring to this command:
$ ..\perl.exe -Mutf8 -Mopen=:std,:utf8 -E "say $ENV{iş}"
kârlı

I didn't find any follow-ups to this. However, it's not relevant to this proposal. ExifTool doesn't use Perl's Unicode string representation, nor does it invoke "use open ':std'" to handle input/output of UTF-8 strings. Internally, ExifTool represents all encoded strings as strings of 8-bit bytes.
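Because everything stays a byte string, converting between encodings never touches Perl's character semantics at all. A small illustration (not actual ExifTool code):

use Encode qw(from_to);

my $arg = "k\xc3\xa2rl\xc4\xb1";     # "kârlı" as raw UTF-8 bytes
from_to($arg, 'UTF-8', 'cp1254');    # convert in place to Windows-1254 (Turkish) bytes
printf "%vX\n", $arg;                # 6B.E2.72.6C.FD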

johnrellis

Quote
This looks like a can of worms!
I've made a precise proposal for passing Unicode command-line arguments. What in particular about this proposal won't work as expected?

obetz

Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
Standard as in what "most" systems seem to use and the way things seem to be going (e.g. the internet really has standardised on UTF-8).

the risk of early implementation... In the early nineties, UCS-2 seemed like a decent choice, and when they recognized that 65,536 code points are not enough, UTF-16 was an obvious "fix" introduced 20 years ago. UTF-8 only became popular later.

Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
[...] UTF-8 support this way seems a bit broken (codepage 65001 doesn't seem to be true UTF-8). I think the fact that they chose to (internally) store things as UTF-16 could have been a good and transparent solution IF the translations at the boundary were clean and automatic. However, that also doesn't seem to be the case :(. Perhaps if codepage 65001 were implemented as a proper UTF-8 codepage, most of the issues would not exist.

that's interesting. I just started to investigate the effects of codepage 65001 in the Windows console, so I don't know much about the errors.

Can you give me some hints on what I should test, IOW where the pitfalls and errors are?

Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
Of all the languages I have worked with, character encoding support in Perl seems to be the best; at least in Perl these things are supported in the core. In other languages you still have to do lots of extra things even for something as simple as counting characters rather than bytes when determining the length of a string.

Do you know whether this is also true for Python? I don't actively use it yet, but it seems to be widely used even in Windows applications (e.g. scientific and engineering tools I'm using), so I'm considering learning it.

Oliver

obetz

Quote from: johnrellis on July 03, 2019, 07:23:11 PM
Quote
This looks like a can of worms!
I've made a precise proposal for passing Unicode command-line arguments. What in particular about this proposal won't work as expected?

Part 1 of your proposal: "Change Oliver Betz's "ppl.c" launcher to use wmain()". I didn't check Part 2 yet.

The problem is to provide a working Perl environment. I'm not sure whether making a new perl.exe is even sufficient, or whether other parts of Perl need to be adapted as well.

I haven't been able to get a simple Perl test script to handle arguments, paths, environment, and stdout correctly. I didn't even try stdin yet.

As long as a test script doesn't work correctly, I consider running ExifTool in such an environment a risk since there can be many undetected problems.

I don't want to deliver things that I don't completely understand and haven't tested thoroughly.

Hayo Baan

Quote from: obetz on July 04, 2019, 02:44:13 AM
Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
Standard as in what "most" systems seem to use and the way things seem to be going (e.g. the internet really has standardised on UTF-8).

the risk of early implementation... In the early nineties, UCS-2 seemed like a decent choice, and when they recognized that 65,536 code points are not enough, UTF-16 was an obvious "fix" introduced 20 years ago. UTF-8 only became popular later.
Indeed, sometimes it's better to wait a while to see what the best solution (or standard) is going to be. UTF-16 sounded good, but has some disadvantages that UTF-8 doesn't have: space (especially for "western" languages where most characters can be stored using just a single byte) and the fact that it disallows some code points that both UTF-8 and UTF-32 do support.

Quote from: obetz on July 04, 2019, 02:44:13 AM
Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
[...] UTF-8 support this way seems a bit broken (codepage 65001 doesn't seem to be true UTF-8). I think the fact that they chose to (internally) store things as UTF-16 could have been a good and transparent solution IF the translations at the boundary were clean and automatic. However, that also doesn't seem to be the case :(. Perhaps if codepage 65001 were implemented as a proper UTF-8 codepage, most of the issues would not exist.

that's interesting. I just started to investigate the effects of codepage 65001 in the Windows console, so I don't know much about the errors.

Can you give me some hints on what I should test, IOW where the pitfalls and errors are?

I don't know the exact problems, but the fact that Microsoft mentioned improved UTF-8 support as part of one of their recent Windows 10 updates (while still not claiming full UTF-8 compatibility) is a strong suggestion that there are issues. It might just be missing code points in fonts, but I'm not sure.

If I recall correctly, Phil also mentioned incompatibilities with cp65001 and UTF-8, but I can't recall the specifics.

Quote from: obetz on July 04, 2019, 02:44:13 AM
Quote from: Hayo Baan on July 02, 2019, 02:56:51 PM
Of all the languages I have worked with, character encoding support in Perl seems to be the best; at least in Perl these things are supported in the core. In other languages you still have to do lots of extra things even for something as simple as counting characters rather than bytes when determining the length of a string.

Do you know whether this is also true for Python? I yet don't use it actively, but it seems to be widely used even in Windows applications (e.g. scientific and engineering tools I'm using), so I consider learning it.

The base Python should support UTF-8 just fine, just like e.g. C and C++ do. To do this more conveniently/automatically, I'm sure there are multiple Python modules around to help you here (like there are for almost anything, even more so than for Perl). While I have developed Python scripts, I never needed UTF-8 support, so I have not investigated this further :D
Hayo Baan – Photography
Web: www.hayobaan.nl

obetz

Quote from: Hayo Baan on July 04, 2019, 01:48:12 PM
The base Python should support UTF-8 just fine, just like e.g. C and C++ do. To do this more conveniently/automatically, I'm sure there are multiple Python modules around to help you here (like there are for almost anything, even more so than for Perl).

Out of curiosity, I briefly adapted the tests I ran with the modified perl.exe and found that "Windows console Unicode" simply works in Python.

But it was a long way to get there: https://bugs.python.org/issue1602 was opened in 2007-12, and Unicode support for the Windows console was released with Python 3.6 in 2016-12, nine years later.

Reading "Issue 1602" confirms my impression that it's not enough to change a few calls within perl.exe. It's too delicate for me, I don't dare tackle it.

Hayo Baan

Quote from: obetz on July 05, 2019, 02:48:09 AM
Quote from: Hayo Baan on July 04, 2019, 01:48:12 PM
The base Python should support UTF-8 just fine, just like e.g. C and C++ do. To do this more conveniently/automatically, I'm sure there are multiple Python modules around to help you here (like there are for almost anything, even more so than for Perl).

Out of curiosity, I briefly adapted the tests I ran with the modified perl.exe and found that "Windows console Unicode" simply works in Python.

But it was a long way to get there: https://bugs.python.org/issue1602 was opened in 2007-12, and Unicode support for the Windows console was released with Python 3.6 in 2016-12, nine years later.

Reading "Issue 1602" confirms my impression that it's not enough to change a few calls within perl.exe. It's too delicate for me, I don't dare tackle it.

WOW, that took them a very long time to get right in Python. I don't think you can expect to get a similar fix (though work-around is probably a better description, since most issues seemed to be related to how Microsoft handles things) for Perl, let alone be able to do it yourself. If it were easy, it would have been done already, I think ;)

BUT, if you output (UTF-8) to files instead of the console, does that not work with Perl just fine too? It really is the console, isn't it? (Don't try this in PowerShell though; Microsoft broke the (output) redirection so this won't work there at all).
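What I have in mind is something like this (a minimal sketch; I haven't verified it on Windows myself):

use strict;
use warnings;
use utf8;                                     # the string literal below is UTF-8

open my $fh, '>:encoding(UTF-8)', 'out.txt'
    or die "cannot open out.txt: $!";
print {$fh} "kârlı iş\n";                     # arbitrary Unicode text
close $fh;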
Hayo Baan – Photography
Web: www.hayobaan.nl

obetz

Quote from: Hayo Baan on July 05, 2019, 05:22:28 AM
I don't think you can expect to get a similar fix (though work-around is probably a better description, since most issues seemed to be related to how Microsoft handles things) for Perl, let alone be able to do it yourself.

that's what I wanted to express. I'm out.

Quote from: Hayo Baan on July 05, 2019, 05:22:28 AM
BUT, if you output (UTF-8) to files instead of the console, does that not work with Perl just fine too?

This question is to be directed to the OP.

I'm happy with ExifTool working through pipes and arg files. Mostly I use it with IMatch; there is no need to change the proven interface.

My concern was to eliminate the peculiarities of pp (running the unpacked exe in %temp%, spawn, parameter mangling and so on), and this seems to work pretty well now. In combination with an installer, it looks like a useful addition to the "single file" ExifTool.exe (both have their pros and cons).

Oliver