ExifTool Forum

ExifTool => Bug Reports / Feature Requests => Topic started by: jteb on March 01, 2011, 05:56:24 AM

Title: Unicode paths on win32
Post by: jteb on March 01, 2011, 05:56:24 AM
Hi Phil,

Once again, a topic about the never ending story of unicode paths on win32.

Following up on some suggestions made in this forum, I've put together a fairly simple proof of concept to support unicode paths on Windows.

Basically I just combined the suggestions given in https://exiftool.org/forum/index.php/topic,2394.0.html

Before I go into the code, let me say this was the first time I wrote any perl code. Hence the code is probably very non-perlish and ugly besides  ;)

It works as follows. I wrote a simple module that overrides some CORE functions (attached). In the overrides, I check on paths being unicode and if so use one of the Win32API::File functions. If not, I just pass the function call through to the original CORE:: function. The module can be conditionally imported when the OS is win32.
I use exiftool from Java through the std* handles, which I set to UTF8 in Java. Therefore, I need to set the modes of all pipes to utf8 in exiftool.


BEGIN {
    # $^O is "MSWin32" on Windows
    if ($^O =~ /win32/i) {

        # These need to be UTF8, conforming to the Java encoding we use on the std*
        binmode(STDIN, ":utf8");
        binmode(STDOUT, ":utf8");
        binmode(STDERR, ":utf8");

        # Load the override module only on Windows
        require Win32::UnicodeFunctions;
        Win32::UnicodeFunctions->import(qw(:CoreOverrides :File));
    }
}
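
As a rough illustration of the "check whether the path is unicode, otherwise pass through" idea described above (this is not the attached module; the helper names path_needs_wide_api and to_wide_path are hypothetical), the check and the encoding step might look something like this:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the (already decoded) path contains any
# character outside 7-bit ASCII, in which case a wide-character call
# would be used instead of the original CORE:: function.
sub path_needs_wide_api {
    my ($path) = @_;
    return $path =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: encode a path as NUL-terminated UTF-16LE, the form
# passed to CreateFileW elsewhere in this thread.
sub to_wide_path {
    my ($path) = @_;
    return encode('UTF-16LE', $path . "\0");
}
```

An override would call path_needs_wide_api() first and only build the wide path when it returns true, so plain ASCII paths keep going through the unchanged CORE:: call.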


I changed some open(FILE, ..); $fh = \*FILE; calls to open(my $fh, ...); as well. I think that it should work with open(*FILE, ...); or something like that as well.
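
For reference, the two styles side by side, as a minimal sketch on a temporary file (the file and variable names are only for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file to read back.
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "hello\n";
close($tmp) or die "Cannot close $path: $!";

# Old style: bareword handle, then a reference to its glob.
open(FILE, '<', $path) or die "Cannot open $path: $!";
my $old_fh   = \*FILE;
my $old_line = <$old_fh>;
close($old_fh);

# New style: lexical handle created directly by open().
open(my $fh, '<', $path) or die "Cannot open $path: $!";
my $new_line = <$fh>;
close($fh);
```

Both handles read the same data; the lexical one is also closed automatically once $fh goes out of scope.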

The file exists test (-e) I had to replace with my own fileExists() method. I'm pretty sure it's not possible to override that, cmiiw. Another version of fileExists needs to be loaded on non-win32 platforms obviously.

I've tested this with reading/writing metadata on a daemonized exiftool instance, using stdin/stdout for communicating with Java (-stay_open -@ -). It seems to work nicely for these two simple cases, on both non-unicode and unicode paths. As said, it's probably ugly and not quite safe, but we actually started using it in our testing branches and it's looking very promising.

Cheers,
Jan
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 01, 2011, 07:33:09 AM
Hi Jan,

Thanks for your work on this.  There are certainly a number of things that this change would break, but some may be fixed.  For example, I think that setting console input/output to ":utf8" is dangerous.  Also, you are using many features that are only available in later Perl versions so I would have to figure out when these features were introduced and check for the appropriate Perl version.

But it is a very useful proof of concept, and I will look into this in more detail when I get a chance.

- Phil
Title: Re: Unicode paths on win32
Post by: jteb on March 01, 2011, 10:11:36 AM
Thanks Phil.

For me it's quite safe to force the std* encodings to utf8, I do the same in Java. I can understand that's surely not the case for other solutions though.

>> Also, you are using many features that are only available in later Perl versions
Did I now?  :-[

If I can help in any way (testing stuff, finding out how to unbreak things, ...) I'd be happy to. You just need to ask.

Cheers,
Jan
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 01, 2011, 02:40:24 PM
Hi Jan,

I've looked at the code a bit more closely, and have a few comments:

1) Well done!  Brilliant actually, considering this is your first attempt at Perl.

2) It is difficult to determine when the 2-argument version of binmode came into existence, but I'm guessing with Perl 5.6.  However, you only use this for your Java piping, and it isn't used in your UnicodeFunctions.pm, so I maybe don't need to worry about this.  What exactly are you using the stdio streams for, and why do you need to set them to ":utf8"?  I'm not clear on this.

3) What about closing files?  When opening you do a CreateFileW() then OsFHandleOpenFd() then CORE::open().  What needs to be done to un-do all these calls to avoid memory/filehandle leaks?  (ExifTool is commonly used to run through 100's of thousands of images.)

4) I will also need to be able to re-code all of my stat() calls (-s -M, etc) to use filehandles rather than file names, which shouldn't be too difficult.  So this looks do-able.
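
A minimal sketch of what point 4 means in practice, assuming the file has already been opened by some means (the helper name file_info_from_handle is hypothetical): the file tests accept a filehandle in place of a name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: report size and age from an open filehandle
# instead of from a file name.
sub file_info_from_handle {
    my ($fh) = @_;
    my $size = -s $fh;    # size in bytes
    my $age  = -M $fh;    # days since last modification, relative to script start
    return ($size, $age);
}

# Write five bytes to a scratch file, then reopen it for reading.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print $tmp "12345";
close($tmp) or die "Cannot close $path: $!";

open(my $fh, '<', $path) or die "Cannot open $path: $!";
my ($size, $age) = file_info_from_handle($fh);
close($fh);
```

Because only the handle is used, the same calls would work no matter how the file was opened.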

- Phil
Title: Re: Unicode paths on win32
Post by: jteb on March 02, 2011, 03:57:16 AM
Hi Phil,

1. Thanks, I'm actually starting to like perl, it's very powerful indeed  :)

2. From Java I open a number of exiftool (-stay_open True -@ -) processes, putting each in a separate thread, which I put in a pool. For each process, I open an output stream to its stdin and an input stream to its stdout+stderr. When I receive a task from another part of the system, I get a process from the pool, write the commands to the output stream and finish with a -execute. The output is read from the input stream (the process's stdout) up to "{ready}" and the process is put back into the pool.

To make the communication over the pipe understand unicode characters (on Windows especially), I need to set the output/input streams to utf8, otherwise they'll be set to the platform encoding, effectively breaking the unicode support. If I don't set binmode(std*, :utf8), or even just binmode(std*), on the perl side, unicode is broken as well, even though I explicitly use encode('UTF-16LE', ...). Of course, on *NIX this is not a limitation.

Hope that makes sense.

3. As far as I understand, when calling CORE::open(*FH, ..) on a filehandle created by CreateFileW, "ownership" is relinquished, and when CORE::close is called on FH, the underlying filehandle created by CreateFileW is effectively closed as well. If I read the documentation correctly, when FH goes out of scope, it is automatically closed (cmiiw).

At least, when I test my code, I see memory use grow during the execution of a task and shrink again when the task is ready. I didn't test this on hundreds of thousands of files though, so I'm not sure whether there's not even a small memory leak. I'll put that to the test in one of the coming days.

4. Nice, hopefully it is doable.
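
The exchange in point 2 is done from Java, but the same request/response pattern can be sketched in Perl. This is only an illustration with hypothetical helper names (build_request, read_until_ready), and an in-memory handle stands in for the exiftool process's stdout:

```perl
use strict;
use warnings;

# Hypothetical helper: format one task for "-stay_open True -@ -".
# Each argument goes on its own line, followed by -execute.
sub build_request {
    my (@args) = @_;
    return join('', map { "$_\n" } @args, '-execute');
}

# Hypothetical helper: collect output lines until the "{ready}" marker.
# Returns an array reference, or nothing if the stream ends first.
sub read_until_ready {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;    # tolerate CRLF line endings
        return \@lines if $line eq '{ready}';
        push @lines, $line;
    }
    return;
}

my $request = build_request('-FileName', 'test.jpg');

# Stand-in for the process's stdout.
my $fake_output = "File Name : test.jpg\r\n{ready}\r\n";
open(my $in, '<', \$fake_output) or die "Cannot open in-memory handle: $!";
my $result = read_until_ready($in);
close($in);
```

In the real setup the request would be written to the process's stdin and the reply read from its stdout, with both streams set to utf8 as described in point 2.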

Thanks for looking into this.

Jan

Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 02, 2011, 07:30:57 AM
I realized that one piece I am missing is the ability to list files in a directory.  The exiftool application requires this ability since it accepts file and/or directory names.  Currently, I use the perl opendir/readdir functions to do this, but there is a disconnect in the Windows Perl libraries because open() will fail with names returned by readdir() in Windows if they contain special characters.
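
For context, the opendir/readdir listing mentioned above looks roughly like the following sketch (the helper name list_dir is hypothetical, and this is not ExifTool's actual code). It is the names returned by readdir() here that may later fail in open() on Windows:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the sorted entry names in a directory,
# leaving out "." and "..".
sub list_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @names = sort grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return @names;
}

# Demonstration on a scratch directory with two empty files.
my $dir = tempdir(CLEANUP => 1);
for my $name ('a.jpg', 'b.jpg') {
    open(my $fh, '>', "$dir/$name") or die "Cannot create $dir/$name: $!";
    close($fh);
}
my @found = list_dir($dir);
```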

I checked Win32API::File and can't find any directory listing functions.  Have you seen these in your travels?

- Phil
Title: Re: Unicode paths on win32
Post by: jteb on March 02, 2011, 10:05:09 AM
That's a tricky one. I'm able to open a valid native win32 filehandle to a directory using:

my $fh = CreateFileW(encode('UTF-16LE', $path."\0"), GENERIC_READ, FILE_SHARE_READ, [], OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, []);
(the magic is in the FILE_SHARE_READ and FILE_FLAG_BACKUP_SEMANTICS).

I'm pretty sure it's a valid handle, as the size returned by GetFileSize() is exactly the same as listed by ls -l (using msys).

However, opendir doesn't accept a file descriptor like open does.

Let's see if I can find a workaround somehow.
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 02, 2011, 10:40:44 AM
I've done a bit of googling, and it appears that I need some sort of a Perl interface to the Windows FindFirstFileW() and FindNextFileW() functions.

- Phil
Title: Re: Unicode paths on win32
Post by: jteb on March 02, 2011, 11:11:17 AM
I came to the same conclusion, but wanted to avoid having a dependency on Win32::File.
It is however possible to import these functions through Win32API, straight from the kernel32.dll. I'll see if I can get that working, I've done things like that with JNI before and it's working fine, but this kind of trickery doesn't usually feel very comfortable ;)

That would still leave creating a directory. Apparently the docs for the Win32 module state that it should be possible using Win32::CreateDirectory, but it seems that's not so. Perhaps we can do something similar with another DLL import.

I'll be offline for a while, but will do some other tests.
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 02, 2011, 11:26:09 AM
Quote from: jteb on March 02, 2011, 11:11:17 AM
It is however possible to import these functions through Win32API, straight from the kernel32.dll. I'll see if I can get that working, I've done things like that with JNI before and it's working fine, but this kind of trickery doesn't usually feel very comfortable ;)

I agree.  I saw a technique to do this using Win32::API::Struct to generate the necessary structures, but also saw a reference that Win32::API::Struct was not reliable, which worried me.

Quote
That would still leave creating a directory. Apparently the docs for the Win32 module state that it should be possible using Win32::CreateDirectory, but it seems that's not so. Perhaps we can do something similar with another DLL import.

Ouch.  What is the problem with Win32::CreateDirectory() ?

- Phil
Title: Re: Unicode paths on win32
Post by: jteb on March 02, 2011, 11:35:41 AM
Well, it works for non-unicode pathnames. I couldn't make it work with unicode pathnames. Some googling, and reading lots of posts from people with the same problem, made me think it's not possible.

It could be I have to encode the path somehow, before passing it to Win32::CreateDirectory.
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 02, 2011, 11:50:21 AM
This is another one of my fears.  This post (http://stackoverflow.com/questions/2192053/how-do-i-check-if-a-unicode-directory-exists-on-windows-in-perl) contains code from someone who apparently got Win32::CreateDirectory() to work with Unicode file names.  Similarly, I have run tests with Unicode filenames that work on my Windows XP system but fail for other people.  So some things may depend on the system settings.

- Phil
Title: Re: Unicode paths on win32
Post by: BogdanH on March 02, 2011, 01:05:46 PM
Hi,

I just jumped here and probably I'm missing something... but according to:
http://msdn.microsoft.com/en-us/library/aa363855%28v=vs.85%29.aspx
the CreateDirectoryW API call should be used to pass Unicode characters. But probably I'm totally off on this topic  :)

Bogdan
Title: Re: Unicode paths on win32
Post by: Phil Harvey on March 02, 2011, 01:22:08 PM
Hi Bogdan,

Yes, thanks.  The problem is that the Windows function you mentioned cannot be called directly from Perl, so we need to find a Perl library that provides an API we can use.

- Phil