Unicode paths on win32

Started by jteb, March 01, 2011, 05:56:24 AM


jteb

Hi Phil,

Once again, a topic about the never ending story of unicode paths on win32.

Following up on some suggestions made in this forum, I've put together a fairly simple proof of concept to support unicode paths on Windows.

Basically I just combined the suggestions given in https://exiftool.org/forum/index.php/topic,2394.0.html

Before I go into the code, let me say this was the first time I wrote any perl code. Hence the code is probably very non-perlish and ugly besides  ;)

It works as follows. I wrote a simple module that overrides some CORE functions (attached). In the overrides, I check on paths being unicode and if so use one of the Win32API::File functions. If not, I just pass the function call through to the original CORE:: function. The module can be conditionally imported when the OS is win32.
I use exiftool from Java through the std* handles, which I set to UTF-8 in Java. Therefore, I need to set the modes of all pipes to utf8 in exiftool.


BEGIN {
    # $^O is "MSWin32" on native Windows Perl builds
    if ($^O =~ /win32/i) {

        # These need to be UTF8, conforming to the Java encoding we use on the std*
        binmode(STDIN, ":utf8");
        binmode(STDOUT, ":utf8");
        binmode(STDERR, ":utf8");

        # Load at run time so other platforms never need the module
        require Win32::UnicodeFunctions;
        Win32::UnicodeFunctions->import(qw(:CoreOverrides :File));
    }
}
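
A minimal sketch of the dispatch idea described above, not the attached module itself. It assumes the path is already a decoded Perl character string and treats "unicode" as "contains a character outside ASCII". The helper names path_needs_wide_api, to_wide_path and choose_open_strategy are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the path has any character outside ASCII.
# Assumes $path is a decoded character string, not raw bytes.
sub path_needs_wide_api {
    my ($path) = @_;
    return $path =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: encode a path for a Windows *W function,
# which expects NUL-terminated UTF-16LE.
sub to_wide_path {
    my ($path) = @_;
    return encode('UTF-16LE', $path . "\0");
}

# Hypothetical helper: decide whether an override should go through the
# wide-character API or simply pass the call on to the CORE:: function.
sub choose_open_strategy {
    my ($path) = @_;
    return 'wide' if $^O eq 'MSWin32' && path_needs_wide_api($path);
    return 'core';
}
```

An override would call choose_open_strategy() first and only pay for the encoding step when the answer is 'wide'.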


I changed some open(FILE, ..); $fh = \*FILE; calls to open(my $fh, ...); as well. I think that it should work with open(*FILE, ...); or something like that as well.

The file exists test (-e) I had to replace with my own fileExists() method. I'm pretty sure it's not possible to override that, cmiiw. Another version of fileExists needs to be loaded on non-win32 platforms obviously.
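
A rough sketch of what such a fileExists() could look like, assuming Win32API::File is installed on Windows and falling back to -e elsewhere. This is a guess at the approach, not the code from the attachment. It assumes CreateFileW returns a false value on failure and that the module's constants can be called fully qualified after a run-time require.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sketch only: existence test that works for Unicode paths on Windows.
sub fileExists {
    my ($path) = @_;

    if ($^O eq 'MSWin32') {
        require Win32API::File;
        # Access mode 0 asks for no read/write access, just a handle.
        # FILE_FLAG_BACKUP_SEMANTICS lets the call succeed on directories.
        my $handle = Win32API::File::CreateFileW(
            encode('UTF-16LE', $path . "\0"),
            0,
            Win32API::File::FILE_SHARE_READ() | Win32API::File::FILE_SHARE_WRITE(),
            [],
            Win32API::File::OPEN_EXISTING(),
            Win32API::File::FILE_FLAG_BACKUP_SEMANTICS(),
            [],
        );
        return 0 unless $handle;    # assumed: false means the open failed
        Win32API::File::CloseHandle($handle);
        return 1;
    }

    # Other platforms: the built-in test is enough.
    return -e $path ? 1 : 0;
}
```

Note that a failed open is not proof the file is missing (a permissions error fails too), so a real version would probably inspect the error code.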

I've tested this with reading/writing metadata on a daemonized exiftool instance, using stdin/stdout for communicating with Java (-stay_open True -@ -). It seems to work nicely for these two simple cases, on both non-unicode and unicode paths. As said, it's probably ugly and not quite safe, but we actually started using it in our testing branches and it's looking very promising.

Cheers,
Jan

Phil Harvey

Hi Jan,

Thanks for your work on this.  There are certainly a number of things that this change would break, but some may be fixed.  For example, I think that setting console input/output to ":utf8" is dangerous.  Also, you are using many features that are only available in later Perl versions so I would have to figure out when these features were introduced and check for the appropriate Perl version.

But it is a very useful proof of concept, and I will look into this in more detail when I get a chance.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

jteb

Thanks Phil.

For me it's quite safe to force the std* encodings to utf8, I do the same in Java. I can understand that's surely not the case for other solutions though.

>> Also, you are using many features that are only available in later Perl versions
Did I now?  :-[

If I can help in any way (testing stuff, finding out how to unbreak things, ...) I'd be happy to. You just need to ask.

Cheers,
Jan

Phil Harvey

Hi Jan,

I've looked at the code a bit more closely, and have a few comments:

1) Well done!  Brilliant actually, considering this is your first attempt at Perl.

2) It is difficult to determine when the 2-argument version of binmode came into existence, but I'm guessing with Perl 5.6.  However, you only use this for your Java piping, and it isn't used in your UnicodeFunctions.pm, so I maybe don't need to worry about this.  What exactly are you using the stdio streams for, and why do you need to set them to ":utf8"?  I'm not clear on this.

3) What about closing files?  When opening you do a CreateFileW() then OsFHandleOpenFd() then CORE::open().  What needs to be done to un-do all these calls to avoid memory/filehandle leaks?  (ExifTool is commonly used to run through 100's of thousands of images.)

4) I will also need to be able to re-code all of my stat() calls (-s -M, etc) to use filehandles rather than file names, which shouldn't be too difficult.  So this looks do-able.

- Phil

jteb

Hi Phil,

1. Thanks, I'm actually starting to like perl, it's very powerful indeed  :)

2. From Java I open a number of exiftool (-stay_open True -@ -) processes, putting each in a separate thread, which I put in a pool. For each process, I open an output stream to its stdin and an input stream to its stdout+stderr. When I receive a task from another part of the system, I get a process from the pool, write the commands to the output stream and finish with a -execute. The output is read from the input stream (the process's stdout) up to "{ready}" and the process is put back into the pool.
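
The exchange described above can be sketched independently of the client language. This is a minimal Perl illustration under the assumption that arguments are sent one per line and that the reply ends with a line containing only "{ready}"; build_command and read_until_ready are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: text to write to the exiftool process's stdin.
# With "-@ -" each argument goes on its own line; "-execute" ends the batch.
sub build_command {
    my (@args) = @_;
    return join("\n", @args, '-execute') . "\n";
}

# Hypothetical helper: collect output lines from $fh up to "{ready}".
# Returns an array reference, or nothing if the stream ends first.
sub read_until_ready {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;              # tolerate CRLF from Windows
        return \@lines if $line eq '{ready}';
        push @lines, $line;
    }
    return;    # process died or pipe closed before the marker
}
```

Returning nothing when the marker never arrives lets the pool discard a dead process instead of handing it out again.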

To make the communication over the pipe understand unicode characters (on Windows especially), I need to set the output/input streams to utf8; otherwise they'll be set to the platform encoding, effectively breaking the unicode support. If I don't set binmode(std*, :utf8), or even just binmode(std*), on the perl side, unicode is broken as well, even though I explicitly use encode('UTF-16LE', ...). Of course, on *NIX this is not a limitation.

Hope that makes sense.
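
To illustrate what the ":utf8" layer does on a handle, here is a small sketch that uses an in-memory filehandle in place of the real pipe. The file name is an arbitrary example, and the in-memory handle is only a stand-in for stdout.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A file name with non-ASCII characters, as a Perl character string.
my $name = "caf\x{e9}_\x{4e2d}.jpg";

# Stand-in for the pipe: an in-memory handle with the :utf8 layer,
# set the same way as binmode(STDOUT, ":utf8") in the snippet above.
my $bytes = '';
open(my $out, '>', \$bytes) or die "open failed: $!";
binmode($out, ':utf8');
print {$out} $name;
close($out);

# The reader on the other end of the pipe (Java here) must decode with
# the same encoding to get the original characters back.
my $roundtrip = decode('UTF-8', $bytes);
```

If either side falls back to the platform encoding, the bytes and the decoder no longer agree and the name is corrupted.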

3. As far as I understand, when CORE::open(*FH, ..) is called on the file descriptor that wraps a handle created by CreateFileW, "ownership" passes to FH, so calling CORE::close on FH effectively closes the underlying handle created by CreateFileW as well. If I read the documentation correctly, when FH goes out of scope, it is automatically closed (cmiiw).

At least, when I test my code, I see memory use grow during the execution of a task and shrink again when the task is ready. I didn't test this on hundreds of thousands of files though, so I can't rule out a small memory leak. I'll put that to the test in one of the coming days.
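
A sketch of the open/close chain under discussion, assuming Win32API::File is available on Windows. It is not the code from the attached module; open_for_read is a hypothetical name, and the error handling reflects my reading that OsFHandleOpenFd returns a false value on failure.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sketch only: open a file for reading via the wide-character API.
sub open_for_read {
    my ($path) = @_;
    my $fh;

    if ($^O eq 'MSWin32') {
        require Win32API::File;
        require POSIX;

        # 1) Native handle from CreateFileW.
        my $native = Win32API::File::CreateFileW(
            encode('UTF-16LE', $path . "\0"),
            Win32API::File::GENERIC_READ(),
            Win32API::File::FILE_SHARE_READ(),
            [], Win32API::File::OPEN_EXISTING(), 0, [],
        ) or return;

        # 2) C run-time descriptor that takes over the native handle.
        my $fd = Win32API::File::OsFHandleOpenFd($native, 0);
        unless ($fd) {
            Win32API::File::CloseHandle($native);   # nothing owns it yet
            return;
        }

        # 3) Perl filehandle on top of the descriptor.
        unless (open($fh, '<&=', 0 + $fd)) {
            POSIX::close(0 + $fd);                  # also frees the handle
            return;
        }
        return $fh;
    }

    # Other platforms: plain open.
    open($fh, '<', $path) or return;
    return $fh;
}
```

With this layering a single close($fh) should be enough: closing the Perl handle closes the descriptor, and the C run-time closes the native handle behind a descriptor created this way. The only places that need explicit cleanup are the failure paths between the steps.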

4. Nice, hopefully it is doable.

Thanks for looking into this.

Jan


Phil Harvey

I realized that one piece I am missing is the ability to list files in a directory.  The exiftool application requires this ability since it accepts file and/or directory names.  Currently, I use the perl opendir/readdir functions to do this, but there is a disconnect in the Windows Perl libraries because open() will fail with names returned by readdir() in Windows if they contain special characters.

I checked Win32API::File and can't find any directory listing functions.  Have you seen these in your travels?

- Phil

jteb

That's a tricky one. I'm able to open a valid native win32 filehandle to a directory using:

my $fh = CreateFileW(encode('UTF-16LE', $path."\0"), GENERIC_READ, FILE_SHARE_READ, [], OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, []);
(the magic is in the FILE_SHARE_READ and FILE_FLAG_BACKUP_SEMANTICS).

I'm pretty sure it's a valid handle, as the size returned by GetFileSize() is exactly the same as listed by ls -l (using msys).

However, opendir doesn't accept a file descriptor like open does.

Let's see if I can find a workaround somehow.

Phil Harvey

I've done a bit of googling, and it appears that I need some sort of a Perl interface to the Windows FindFirstFileW() and FindNextFileW() functions.
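
Whatever binding ends up making the calls, both functions fill a WIN32_FIND_DATAW structure that has to be unpacked on the Perl side. The sketch below covers only that decoding step, not the DLL call itself. It relies on the documented layout (44 bytes of fixed fields, then a 260-unit UTF-16 cFileName, 592 bytes in total); the function names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

use constant {
    FIND_DATA_SIZE => 592,     # sizeof(WIN32_FIND_DATAW)
    NAME_OFFSET    => 44,      # cFileName starts after the fixed fields
    NAME_UNITS     => 260,     # MAX_PATH UTF-16 code units
    ATTR_DIRECTORY => 0x10,    # FILE_ATTRIBUTE_DIRECTORY
};

# Hypothetical helper: file name from a raw WIN32_FIND_DATAW buffer.
sub find_data_name {
    my ($buf) = @_;
    return unless defined $buf && length($buf) >= FIND_DATA_SIZE;

    my @units = unpack('v' . NAME_UNITS, substr($buf, NAME_OFFSET, 2 * NAME_UNITS));
    my @kept;
    for my $unit (@units) {
        last if $unit == 0;    # stop at the NUL terminator
        push @kept, $unit;
    }
    return decode('UTF-16LE', pack('v*', @kept));
}

# Hypothetical helper: true if the entry is a directory.
sub find_data_is_dir {
    my ($buf) = @_;
    my ($attrs) = unpack('V', $buf);    # dwFileAttributes is the first DWORD
    return ($attrs & ATTR_DIRECTORY) ? 1 : 0;
}
```

The name comes back as a Perl character string, which is what the open() override would then re-encode for CreateFileW.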

- Phil

jteb

I came to the same conclusion, but wanted to avoid having a dependency on Win32::File.
It is however possible to import these functions through Win32API, straight from the kernel32.dll. I'll see if I can get that working, I've done things like that with JNI before and it's working fine, but this kind of trickery doesn't usually feel very comfortable ;)

That would still leave creating a directory. Apparently the docs for the Win32 module state that it should be possible using Win32::CreateDirectory, but it seems that's not so. Perhaps we can do something similar with another DLL import.

I'll be offline for a while, but will do some other tests.

Phil Harvey

Quote from: jteb on March 02, 2011, 11:11:17 AM
It is however possible to import these functions through Win32API, straight from the kernel32.dll. I'll see if I can get that working, I've done things like that with JNI before and it's working fine, but this kind of trickery doesn't usually feel very comfortable ;)

I agree.  I saw a technique to do this using Win32::API::Struct to generate the necessary structures, but also saw a reference that Win32::API::Struct was not reliable, which worried me.

Quote
That would still leave creating a directory. Apparently the docs for the Win32 module state that it should be possible using Win32::CreateDirectory, but it seems that's not so. Perhaps we can do something similar with another DLL import.

Ouch.  What is the problem with Win32::CreateDirectory() ?

- Phil

jteb

Well, it works for non-unicode pathnames. I couldn't make it work with unicode pathnames. Some googling and reading lots of posts from people with the same problem made me think it's not possible.

It could be I have to encode the path somehow, before passing it to Win32::CreateDirectory.

Phil Harvey

This is another one of my fears.  This post contains code from someone who apparently got Win32::CreateDirectory() to work with Unicode file names.  Similarly, I have run tests with Unicode filenames that work on my Windows XP system but fail for other people.  So some things may depend on the system settings.

- Phil

BogdanH

Hi,

I just jumped here and probably I'm missing something... but according to:
http://msdn.microsoft.com/en-us/library/aa363855%28v=vs.85%29.aspx
the CreateDirectoryW API call should be used to pass Unicode characters. But probably I'm totally off on this topic  :)

Bogdan

Phil Harvey

Hi Bogdan,

Yes, thanks.  The problem is that the Windows function you mentioned cannot be called directly from Perl, so we need to find a Perl library that provides an API we can use.

- Phil