Arg file whose path and contents contain non-ANSI characters?

Started by johnrellis, June 27, 2017, 09:00:00 PM

johnrellis

On Windows, after reading the FAQs and many threads, I still can't figure out how to have an arg file whose path contains non-ANSI characters and that contains filenames encoded in UTF-8.

My .bat file changes the code page to 65001 (#1), which allows UTF-8 file paths to be passed on the ExifTool command line (#2). 

The FAQ https://exiftool.org/exiftool_pod.html#WINDOWS-UNICODE-FILE-NAMES implies that the encoding of filenames within arg files is controlled by the system code page:
Quote: In Windows, by default, file and directory names are specified on the command line (or in arg files) using the system code page

But when the code page is 65001, ExifTool isn't able to read a UTF-8 filename in an arg file, whether or not the arg file's path contains non-ANSI characters (#3, #5).

Using "-charset filename=UTF8" works when the args file path contains only ANSI characters (#4). But when the args file path contains non-ANSI characters, ExifTool isn't able to open the arg file (#6).

Am I missing something, obvious or not?

Phil Harvey

The difference between #2 and #3 is telling.  The encoding of the command-line arguments must differ from the encoding in args.txt, because ExifTool treats both in exactly the same way.

If you could post args.txt as an attachment I will take a look at the encoding.  But from what you have posted it seems that args.txt is proper UTF-8, but your command-line arguments are not.

Try this command, and attach test.txt also:

echo "c:\Users\john\Documents\xÀ\a.jpg" > test.txt

Make sure you are using exactly the same settings as when you ran your tests.

It should be informative to compare this to args.txt.

- Phil

Edit:  Also, what version of ExifTool are you using?
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

johnrellis

Thanks very much for looking at this.

ExifTool version is 10.57.

Using "echo" to construct "test.txt" yields a file that is byte-for-byte identical with "args.txt", except for a trailing space added by "echo".  Using "test.txt" in place of "args.txt" with ExifTool yields the same results.

My test files are here: https://www.dropbox.com/sh/qhhgnb3vosd5rxi/AAA-DourWpi95jliV-7d3gKda?dl=0 . The folder includes "test.bat", which produces the output shown below, and "test.txt" and "testq.txt" (the output from "echo" with and without quotes around the path). 

Following the output of running "test.bat" is the output from running "od" on OS X to examine the byte contents of the files "test.bat", "args.txt", and "test.txt", which seems to indicate they are all encoded in UTF-8 (unless I'm staring past something).
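
For readers without "od" at hand, the same byte-level check can be sketched in Python. This is only an illustration: the helper name describe_bytes and the sample string are made up for this example, and code page 1252 is assumed as the non-UTF-8 alternative.

```python
def describe_bytes(data: bytes) -> str:
    """Return a hex dump of data plus a note on whether it is valid UTF-8."""
    hex_dump = " ".join(f"{byte:02x}" for byte in data)
    try:
        data.decode("utf-8")
        verdict = "valid UTF-8"
    except UnicodeDecodeError:
        verdict = "not valid UTF-8"
    return f"{hex_dump}  ({verdict})"


# Hypothetical stand-in for the non-ANSI part of the path used in this thread.
sample = "xÀ\\a.jpg"

print(describe_bytes(sample.encode("utf-8")))   # À becomes the two bytes c3 80
print(describe_bytes(sample.encode("cp1252")))  # À becomes the single byte c0
```

To check a real file, pass the result of reading it in binary mode (for example open("args.txt", "rb").read()) to the same helper.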

Phil Harvey

OK, thanks.  I'm going to have to try to reproduce this in Windows, but it may be a day or two before I can do this.  I don't understand how there could be a difference between the interpretation of command-line arguments vs. ones in the argfile.

- Phil

Phil Harvey

I've had a quick look at this.

Windows is doing something strange with the command-line arguments.  Even if the console is UTF-8, the command-line arguments aren't.  Somehow the built-in echo command converts to UTF-8 on output.  Note the difference between this:

exiftool -echo À

(which echoes the input argument directly to the console), and

echo À

:(

This is unfortunate, and I don't understand it completely, but getting UTF-8 encoded arguments on the command line is trickier than I thought.

I'll do more digging when I get a chance.
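
A rough way to picture why the two outputs differ is that the same character has different bytes in different charsets, so text written in one and read in another comes out wrong. The Python sketch below assumes code page 1252 as the system code page (the thread does not state it), and reinterpret is a hypothetical helper, not anything ExifTool or Windows provides.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one charset and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 bytes (c3 80) read as code page 1252 become two unrelated characters.
print(reinterpret("À", "utf-8", "cp1252"))

# The cp1252 byte (c0) is not valid UTF-8, so it decodes to U+FFFD.
print(reinterpret("À", "cp1252", "utf-8"))

# When both sides agree, the character survives.
print(reinterpret("À", "utf-8", "utf-8"))
```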

- Phil

Phil Harvey

Here's someone else with this problem.

I am surprised that I didn't know about this before (or had forgotten), but this would explain some of the problems that I've seen with special characters in Windows.  I'll have to see about adding this to FAQ 18.

- Phil

Phil Harvey


johnrellis

Thanks much for digging into this.

A couple of possible accommodations:

- ExifTool could interpret the contents of the arg file according to the current code page.  So if the code page is 65001 (UTF-8), ExifTool would interpret the arg file as being in UTF-8.  The documentation (http://www.exiftool.org/exiftool_pod.html#WINDOWS-UNICODE-FILE-NAMES) implies this is the current behavior, but my experiments above showed not.

- ExifTool could allow the -charset option to be specified in arg files. Currently, ExifTool doesn't allow it in arg files:

> chcp 437
Active code page: 437

> exiftool -format -@ c:\Users\john\Documents\xÀ\args-charset.txt
Warning: Tag 'charset' is not defined
File not found: C:/Users/john/Documents/xÀ/a.jpg

The documentation for -@ implies that this should be allowed, but it isn't.

My use case involves running ExifTool from a Lightroom plugin.  The plugin needs a place to write a temporary arg file, and the "approved" place to create that on Windows is in C:\Users\username\AppData\Local\Temp.  (The old-fashioned C:\Temp and C:\Windows\Temp have long since disappeared, I think.) But if username contains non-ANSI characters, the plugin needs an arg file whose path and contents both contain non-ANSI characters.

I think my workaround is to have a .bat file that cd's to the temp directory before invoking ExifTool with the -charset option. Then it can invoke the arg file using a relative path name containing no non-ANSI characters.

Phil Harvey

Quote from: johnrellis on June 28, 2017, 03:05:23 PM
- ExifTool could interpret the contents of the arg file according to the current code page.  So if the code page is 65001 (UTF-8), ExifTool would interpret the arg file as being in UTF-8.

I don't like this solution for a number of reasons.  This would turn ExifTool to the dark side -- it would be like the way that Microsoft is recoding command-line characters.

Quote: The documentation (http://www.exiftool.org/exiftool_pod.html#WINDOWS-UNICODE-FILE-NAMES) implies this is the current behavior

Where did you get that impression?  The documentation needs clarifying then.

Edit:  I see.  This line is over-simplified now that we know Windows recodes command-line arguments:

        In Windows, by default, file and directory names are specified on the
        command line (or in arg files) using the system code page [...]


How about this instead?:

        In Windows, command-line arguments are specified using the current code page
        and are recoded automatically to the system code page.  This recoding is not
        done for arguments in ExifTool arg files, so by default filenames in arg
        files use the system code page.


Quote: ExifTool could allow the -charset option to be specified in arg files. Currently, ExifTool doesn't allow it in arg files:

Yes it does, but each argument must be on a separate line.  And this does give you a workaround to the problem using this command:

exiftool -format -@ c:\Users\john\Documents\xÀ\args.txt

and this argfile:

-charset
filename=utf8
c:\Users\john\Documents\xÀ\a.jpg
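
For a program that generates the arg file (such as the plugin described earlier), writing this layout could look roughly like the Python sketch below. The function name write_arg_file and the temporary directory are illustrative assumptions; only the three-line layout comes from the post above.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_arg_file(arg_path: Path, image_paths: Iterable[str]) -> None:
    """Write a UTF-8 arg file that declares its own filename charset."""
    # Each argument goes on its own line, mirroring the arg file shown above.
    lines = ["-charset", "filename=utf8", *image_paths]
    with open(arg_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")


# Hypothetical usage: a temporary directory stands in for the real temp folder.
with tempfile.TemporaryDirectory() as temp_dir:
    arg_file = Path(temp_dir) / "args.txt"
    write_arg_file(arg_file, [r"c:\Users\john\Documents\xÀ\a.jpg"])
    print(arg_file.read_bytes())
```

Python's "utf-8" codec writes no byte order mark, so the file starts directly with "-charset".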


- Phil

johnrellis

Quote
Yes it does [allow -charset in arg files], but each argument must be on a separate line.
Duh! I knew that, having generated many arg files.  Coding blindness.  This is cleaner than cd'ing to the temp directory.

Thanks much.

johnrellis

Quote
In Windows, command-line arguments are specified using the current code page and are recoded automatically to the system code page.  This recoding is not done for arguments in ExifTool arg files, so by default filenames in arg files use the system code page.
That is precisely worded, but it doesn't appear to match what I'm observing.  When I change the system code page to 65001 (UTF-8), ExifTool is not treating the filenames in the arg file as using the system code page (UTF-8).  Here's an example using the files previously posted:

> chcp 65001
Active code page: 65001

> type args.txt
C:\Users\john\Documents\xÀ\a.jpg

> exiftool -format -@ args.txt
File not found: C:/Users/john/Documents/xÀ/a.jpg

> exiftool -format -charset filename=UTF8 -@ args.txt
Format                          : image/jpeg

Or am I still misinterpreting something?

Phil Harvey

Hi John,

Maybe my terminology is wrong, but I would say that the chcp command changes the active console code page.  The system code page is set in your system settings and isn't changed by chcp.  This is my understanding anyway.  Does this make sense?  If so, can you suggest a way to improve the docs?
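
One way to see the two kinds of code page side by side is to ask Windows directly. The sketch below is an assumption-laden illustration rather than anything ExifTool does: it uses Python's ctypes to call the Win32 functions GetACP, GetConsoleCP and GetConsoleOutputCP, and simply returns None on other platforms. The function name and dictionary keys are made up for this example.

```python
import ctypes
import sys


def windows_code_pages():
    """Return the system (ANSI) and console code pages, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "system_ansi": kernel32.GetACP(),                # set in system settings
        "console_input": kernel32.GetConsoleCP(),        # changed by chcp
        "console_output": kernel32.GetConsoleOutputCP(), # changed by chcp
    }


print(windows_code_pages())
```

Running it before and after "chcp 65001" in the same console should show the console values changing while the system value stays put, if the understanding described above is right.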

- Phil

johnrellis

I've been reading up on Windows Unicode handling. Microsoft's terminology varies, but here are the labels I've found for the two concepts:

"System code page", "operating system code page", "active code page", "system active code page" are terms for the operating system's current code page.  "A Windows operating system always has one currently active Windows code page.".  See this article.

"Console input code page", "Console output code page", "console code page" are terms for a console window's current input and output code pages.  Each console window has its own current code pages. See this article.

In a while I'll suggest some edits to the documentation for ExifTool on the Windows command line.

johnrellis

Here's my attempt at editing the portions of the documentation dealing with ExifTool and the Windows command line based on the Windows pain I went through in the last several days and others' pain that I read about in the fora.  Originally I intended just a few surgical edits, but I kept pulling the thread...

My draft would replace FAQ 18 with a new FAQ, giving recipes for how to work with Windows and then following up with the gory technical dirt. The draft also includes an edited "Windows Unicode Filenames".

-----------------------------------------------------------------------
Using ExifTool with the Windows Command Line

Though Windows fully supports Unicode, the Windows command console ("command prompt", cmd.exe) has a legacy approach to international character sets that is incompatible with the industry standard technology inside ExifTool. You can't provide command-line arguments to ExifTool containing arbitrary Unicode characters. Such arguments can only contain characters from the Windows operating system code page, which corresponds to the region and language set in the Windows Control Panel. For example, in computers configured for English (United States), arguments can only contain ANSI Latin 1 characters. By default, this restriction also applies to arguments in ExifTool arg files.

Though there are a number of ways to handle the issue, the most general method is to use arg files encoded in UTF-8 and to change the console's character set (its "code page") to UTF-8:

1. Put any arguments that may contain arbitrary Unicode characters in an arg file encoded in UTF-8, using a UTF-8-aware text editor or program. If the arguments include filenames, use the -charset filename=UTF8 option before the arg file on the command line:

    exiftool -charset filename=UTF8 -@ args.txt

If the path to the arg file contains characters not in the operating system code page, you likely won't be able to pass it on the command line. Instead, "cd" to the directory containing the arg file and pass the filename without the directory.

2. To view output containing arbitrary Unicode characters in the console, change its current code page to UTF-8 with this command:

    chcp 65001

You can automatically run "chcp 65001" every time "cmd.exe" is launched by changing the Windows Registry for the Command Processor: Run "regedit" and put "chcp 65001" into the Data field of the "AutoRun" value under "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor".

Note that when the console's code page is UTF-8, you can type or paste Unicode characters on the command line, but most such characters won't be passed properly to ExifTool. Even if you complete a filename containing special characters by typing tab, that filename may not be passed properly to ExifTool.  Arguments containing such characters must be passed via an arg file.

3. If some characters aren't displaying properly, try a different console font. Click on the icon in the upper-left corner of the window, select Properties and then Font.  "Consolas" (the default for Windows 10 in English) includes Latin, Baltic, Greek, Turkish, and Cyrillic characters but no Asian characters.
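
When ExifTool is launched from another program rather than typed at the console, the advice in step 1 (run from the arg file's directory and pass only its bare name) might be sketched in Python as follows. The helper build_exiftool_call is hypothetical, and the sketch assumes the arg file's own name contains only characters from the operating system code page.

```python
from pathlib import PureWindowsPath


def build_exiftool_call(arg_file: str, exiftool: str = "exiftool"):
    """Return (command, working_directory) for running ExifTool on an arg file."""
    path = PureWindowsPath(arg_file)
    # -charset filename=UTF8 comes before -@, and the arg file is passed by
    # bare name so its directory never appears on the command line.
    command = [exiftool, "-charset", "filename=UTF8", "-@", path.name]
    return command, str(path.parent)


command, working_directory = build_exiftool_call(r"C:\Users\john\Documents\xÀ\args.txt")
print(command)
print(working_directory)

# On a machine with ExifTool installed, the call itself would look like this:
# import subprocess
# subprocess.run(command, cwd=working_directory, check=True)
```

Setting the working directory through the launcher plays the same role as "cd" in a batch file.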

A simpler but more limited method is to extract from or write to separate text files and use a UTF-8-aware text editor to edit the files. For example:

    # extracting...
    exiftool image.jpg > out.txt

    # writing...
    exiftool "-subject<=subject.txt" image.jpg

Yet another approach is to set the console's code page to correspond with the region and language set for the operating system in the Windows Control Panel, and then use ExifTool's -charset option. See Windows code page identifiers. For example, if the operating system is set to Arabic:

1. Do "chcp 1256" to set the console's code page to ANSI Arabic.

2. Specify "-charset Arabic" on the ExifTool command line.

The downside of this approach is that you can only enter and display Unicode characters in the chosen code page.

Technical details:

The Windows operating system code page defines the character set used by non-Unicode programs on the computer, based on the region and language set in the Windows Control Panel. The Windows command console also has a current code page defining its input and output character set.  On US English systems, the console's initial code page is 437, the legacy MS-DOS character set.

When you enter a command line from the console or a batch file, the console translates the characters from the current code page to Unicode.  But when ExifTool requests the command line via the standard C library, those Unicode characters get translated to the operating system's code page. Any Unicode characters not in that code page will get mangled. This occurs even when the console's code page is set to UTF-8.
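
The effect of that translation can be approximated in Python by round-tripping text through a code page. This is a simplified model under stated assumptions: Python's "replace" error handler substitutes "?" for unmappable characters, whereas Windows may apply its own substitutions, and through_code_page is a made-up helper name.

```python
def through_code_page(text: str, code_page: str) -> str:
    """Round-trip text through a code page, replacing unmappable characters."""
    return text.encode(code_page, errors="replace").decode(code_page)


print(through_code_page("xÀ", "cp1252"))    # À exists in cp1252, so it survives
print(through_code_page("写真", "cp1252"))   # no Japanese in cp1252: "??"
print(through_code_page("صورة", "cp1256"))  # Arabic letters exist in cp1256
```

Characters that survive the round trip for a given code page are the ones that can be expected to reach ExifTool intact on a system using that code page.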

The console interprets ExifTool's output using its current code page.  Since ExifTool by default uses the UTF-8 encoding, setting the console's code page to UTF-8 ensures that ExifTool's Unicode output will be properly interpreted by the console.  (But the console will only display those Unicode characters that are in its current font.)

-----------------------------------------------------------------------
Windows Unicode Filenames

When using ExifTool on Windows, in general it is impossible to provide filename arguments containing arbitrary Unicode characters on the command line. Such filenames can only contain characters from the Windows operating system code page, which corresponds to the region and language set in the Windows Control Panel. For example, in computers configured for English (United States), filenames can only contain ANSI Latin 1 characters. By default, this restriction also applies to filenames in ExifTool arg files.

To provide filename arguments containing arbitrary Unicode characters, place them in an ExifTool arg file encoded in UTF-8, and use the -charset filename=UTF8 option to specify their encoding:

    exiftool -charset filename=UTF8 -@ args.txt

See the FAQ "Using ExifTool with the Windows Command Line" for more details on using ExifTool on Windows.

A warning is issued if a specified filename contains special characters and the filename character set was not provided. However, the warning may be disabled by setting -charset filename="", and ExifTool may still function correctly if the operating system code page matches the character set used for the file names.

When a directory name is provided, the filename encoding need not be specified (unless the directory name contains special characters), and ExifTool will correctly handle the Unicode filenames in the directory.

The filename character set applies to the FILE arguments as well as filename arguments of -@, -geotag, -o, -p, -srcfile, -tagsFromFile, -csv=, -j= and -TAG<=. However, it does not apply to the -config filename, which always uses the system character set. The -charset filename= option must come before the -@ option to be effective, but the order doesn't matter with respect to other options.

Notes:

1) FileName and Directory tag values still use the same encoding as other tag values, and are converted to/from the filename character set when writing/reading if specified.

2) Unicode support is not yet implemented for other Windows-based systems like Cygwin.

3) See "WRITING READ-ONLY FILES" below for a note about editing read-only files with Unicode names.

johnrellis

Also, here's an edited bullet item for the "Known Problems" on http://www.exiftool.org/index.html#problems:

In Windows, you cannot provide arguments on the command line containing arbitrary Unicode characters. For details and workarounds, see WINDOWS UNICODE FILENAMES and FAQ 18.