Files with unicode-characters in filename

Started by herb, April 03, 2011, 06:39:28 AM


herb

Hello,

The ExifTool homepage says, under the heading "Known Problems":
"In Windows, ExifTool will not process files with Unicode characters in the file name.
This is due to an underlying lack of support for Unicode filenames in the Windows standard C I/O libraries."

I ignored this warning and ran some tests using the Windows short filename (DOS 8.3 filename)
for files with Unicode-characters in the filename.

The file under discussion is <directory_path>\<filename>.
I tested with various ExifTool versions up to 8.50, on a Windows 2000 system and on a Windows XP system
(I had no other Windows systems available for testing).
The extension of <filename> always contained only ASCII-characters, e.g. *.jpg.


Terminology used below:
- An ASCII-character is a character with a code value < 128.
- An ANSI-character is a character with a code value < 256 that is also a valid character in the system codepage.
  E.g. on my machine with codepage 1252 the German umlauts Ä, Ö etc. are valid ANSI-characters.
- A Unicode-character is a character with a code value > 255, or one that is not a valid character in the system codepage.
  Chinese characters are of course Unicode-characters, because their code values are > 255.
  Cyrillic characters are also Unicode-characters on my system, because they are not valid in its codepage 1252.
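
As a rough illustration of these three categories, here is a minimal Python sketch. It assumes that "valid in the system codepage" can be approximated by "encodable with Python's cp1252 codec"; the function name classify_name is made up for this example and is not part of ExifTool.

```python
def classify_name(name: str, codepage: str = "cp1252") -> str:
    """Classify a file or directory name as 'ASCII', 'ANSI' or 'Unicode'.

    Assumption: a character counts as "valid in the codepage" if Python's
    codec for that codepage can encode it.
    """
    # Every character below 128 means the name is plain ASCII.
    if all(ord(ch) < 128 for ch in name):
        return "ASCII"
    try:
        # Encodable in the codepage: the name is ANSI in the sense used above.
        name.encode(codepage)
    except UnicodeEncodeError:
        # At least one character has no representation in the codepage.
        return "Unicode"
    return "ANSI"


if __name__ == "__main__":
    for example in ("testpicture.jpg", "Äpfel.jpg", "фото.jpg", "照片.jpg"):
        print(example, "->", classify_name(example))
```

On a system with a different codepage, the second argument would have to be changed accordingly (e.g. "cp1251" would make the Cyrillic name count as ANSI).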

For <directory_path> and <filename> I observed the following:
1) Both <directory_path> and <filename> contain only ASCII-characters.
   You all know what a great job ExifTool does.

2) <directory_path> and/or <filename> also contain some ANSI-characters.
   No difference from 1).

3) <filename> contains only ANSI-characters but <directory_path> contains at least 1 Unicode-character.
   I addressed the file as <short_name_of_directory_path>\<filename> (e.g.: D:\direct~1\testpicture.jpg),
   which contains only ANSI-characters, and I saw no restrictions when working with such files.

   For such files it is also possible to create e.g. a *.MIE file using <short_name_of_directory_path>/<filename.MIE>.

4) <filename> contains at least 1 Unicode-character
   I addressed the file as follows:
   a) <short_name_of_directory_path>\<short_filename>
      if <directory_path> also contains a Unicode-character (e.g.: D:\direct~1\filena~1.jpg)
   b) <directory_path>/<short_filename>
      if <directory_path> contains only ANSI-characters (e.g.: D:\testdirectory\filena~1.jpg).

   In combination with the option -overwrite_original_in_place, ExifTool opens the file as in case 1),
   so it is also possible e.g. to modify metadata tags.
   I saw no restrictions on the ExifTool side.

   If you modify a metadata tag WITHOUT the option -overwrite_original_in_place, you still get a file in the
   specified directory, but its <long_filename> is now the given <short_filename> (which contains only ASCII-characters).
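
For anyone who wants to script the approach from case 4), here is a hedged sketch of how the short name could be looked up and handed to ExifTool from Python. It assumes the Win32 function GetShortPathNameW (called through ctypes), an exiftool executable on the PATH, and a volume on which 8.3 names are enabled (if they are disabled, Windows simply returns the long name again). The helper names short_path and exiftool_args and the Artist value are hypothetical and only serve as an example.

```python
import ctypes
import sys


def short_path(long_path: str) -> str:
    """Return the 8.3 form of an existing path on Windows.

    Other platforms have no 8.3 names, so the path is returned unchanged.
    """
    if sys.platform != "win32":
        return long_path
    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_short.restype = ctypes.c_uint32
    # A first call without a buffer returns the required size (including the NUL).
    needed = get_short(long_path, None, 0)
    if needed == 0:
        raise ctypes.WinError()
    buf = ctypes.create_unicode_buffer(needed)
    if get_short(long_path, buf, needed) == 0:
        raise ctypes.WinError()
    return buf.value


def exiftool_args(long_path: str, artist: str) -> list:
    """Build (but do not run) an ExifTool command that edits the file in place."""
    return [
        "exiftool",
        "-overwrite_original_in_place",  # keeps the original long filename
        f"-Artist={artist}",
        short_path(long_path),
    ]
```

The resulting list could then be passed to subprocess.run(). Leaving out -overwrite_original_in_place would, as described above, produce a file whose long name is the short name.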

This behaviour of ExifTool (together with Perl) is wonderful for me.

Now I have the following feature request:
Please do NOT change the way ExifTool opens/accesses files.

Thanks in advance
Herb

Phil Harvey

Hi Herb,

Thanks!

This is very interesting and useful, and explains why I wasn't able to reproduce this problem when I was testing (I was probably using what you call ANSI characters in my tests).

Doesn't this boil down to the fact that you can always get exiftool to read/write a file by specifying the short directory and short filename?  I'm not exactly sure why you use long names at all.

But unfortunately I think there are still cases where exiftool won't do what it should:

1) When extracting information from multiple files by specifying a directory name, and one or more files in the directory contain Unicode characters in their name.

2) When writing output to a different directory with Unicode characters in the name.  Here, exiftool will create directories if they don't exist, but it won't be able to create directories with Unicode characters.

- Phil