Accented characters, OEM Code Pages, Command Prompt and finally Powershell

Started by twoj, January 28, 2025, 05:18:17 PM

Previous topic - Next topic

twoj

I think I've learned more about Code Pages than I ever wanted to!
This is just some information about what I learned about writing and reading accented characters in Windows through the Command Prompt (cmd.exe) or PowerShell with exiftool.
To start, Windows uses two code pages: the GUI side (the ANSI code page) uses Unicode and represents things normally in most GUI applications, while the terminal apps, ie Command Prompt and PowerShell (all versions), use the historical OEM code pages.
Living in Canada, I have my Windows set to Locale: English (Canada) (CP 850) [Control Panel -> Region -> Administrative -> Language for non-Unicode programs]; this is a typical code page for Europe as well. The US defaults to CP 437 (even on Windows 11).
If you have read up on this issue, you'll know there is a setting in there: "Beta: Use Unicode UTF-8 for worldwide language support". I suspect that enabling it changes your default OEM code page to 65001, aka UTF-8. You may be able to get away with this if you don't use other command-line tools or certain applications that still need to run command-line arguments. I had turned this on and I think things were working well with exiftool, but then I ran into an issue with another application, so I ended up turning it off.
The first thing I had to understand is that there are actually two processes when using exiftool in a terminal: 1) reading and 2) writing. Since exiftool uses Unicode natively, it needs to convert the UTF-8 metadata to the code page of the terminal when reading, and conversely take the characters in the terminal's code page and convert them to UTF-8 when writing. I was using Exif Pilot, a Unicode Windows GUI app, as a reference to check which characters were actually written to the metadata.
My testing was limited to reading and writing the accented character é, which is fairly common in French, but it should apply to other accented characters. I also noticed there is a difference in the way the Windows Command Prompt handles the reading of this character versus how PowerShell displays it. I prefer using PowerShell since it allows more control, and usually my scripts use PowerShell.
1 - Writing
 1.1 Command Prompt: using the command exiftool -location="abé" FILE, or exiftool -location="abé" -charset cp850 FILE, both entered the wrong accented character into the metadata. What did work was to use the Latin charset, ie: exiftool -location="abé" -L FILE
Section 18 of the ExifTool FAQ discusses accented characters; the -L option is equivalent to -charset cp1252.
Now, I'm not sure why this works with cp1252 and not cp850, since cp850 has the same character, but at least it got things working. This was true whether I had the Command Prompt set to its default code page of 850 or to 65001. To make sure it is written correctly, use the Exif Pilot program and see below about reading the metadata.
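One possible explanation (my guess, not something from the ExifTool docs): "é" maps to a different byte value in the OEM code page than in the ANSI code page, and the console may be handing exiftool the ANSI (cp1252) byte even when CHCP reports 850. A quick Python check of the byte values:

```python
# "é" is a different byte in the OEM code page vs the ANSI code page
print('é'.encode('cp850'))    # OEM CP850 byte: b'\x82'
print('é'.encode('cp1252'))   # ANSI/Latin CP1252 byte: b'\xe9'
```

If the console delivers the cp1252 byte E9, then only -L (-charset cp1252) interprets it correctly, which would match the behavior above.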
 1.2 PowerShell: PowerShell behaves the same as the Command Prompt for writing. To properly record the "é" I had to use the command: exiftool -location="abé" -L FILE

2 - Reading
 2.1 Command Prompt: if you try to read the metadata after properly writing the accented character using the normal command 'exiftool FILE', you will probably get some unknown character, ie Location=ab? (again, verify in the Exif Pilot GUI what character is actually written). Assuming the proper character is now in the metadata, I could keep my current code page (850) and run the command: exiftool -charset cp850 FILE. Again, I don't know why I have to explicitly specify the code page the terminal is already in, but it does properly show the accented character. Alternatively, I could switch the code page of the Command Prompt to 65001. You can check the current code page in the Command Prompt or PowerShell by issuing the command: CHCP
To change the code page, the command would be: CHCP 65001, or back to CP850 with: CHCP 850
So if you set the Command Prompt to 65001 and then read normally, it shows properly, ie Location=abé
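The garbage output in CP850 can be reproduced outside the console: the metadata holds the UTF-8 bytes C3 A9 for "é", and a CP850 terminal draws those two bytes as two unrelated characters. A small sketch in Python:

```python
utf8_bytes = 'é'.encode('utf-8')    # b'\xc3\xa9' - what is stored in the metadata
print(utf8_bytes.decode('cp850'))   # what a CP850 console draws: ├®
```

Switching the console to 65001 works because the terminal then interprets those same bytes as UTF-8, so no conversion is needed.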
 2.2 PowerShell: here is the difference I found: even if I change the code page of PowerShell to 65001, it has no effect on displaying the character properly. If I ran 'exiftool FILE' in CP850 or CP65001, it displayed incorrectly (as opposed to the Command Prompt). I had to define the code page in the command, ie: exiftool -charset cp850 FILE, and then it would display properly.

Apologies for the rambling, but hopefully there are some nuggets that might help others dealing with this issue. This is clearly an issue with Windows keeping legacy behavior that should have been dropped long ago, but it looks like it's not changing soon, and I need to work with my system the way it is.
Thanks again to Phil and Stargeek for all their help.

twoj

Just another comment:

I have been outputting the exiftool info to text files, usually in PowerShell: exiftool -a -G1 FILE >out1.txt
I do this to compare modifications when making changes. As mentioned above, I can write an accented character with the proper encoding to the metadata, and I can also read it back properly by specifying the proper code page.
There does seem to be a difference between the actual file name encoding and the metadata encoding, and specifying the encoding of the metadata information can incorrectly display the name of the file in the metadata output.

Specifically, I have a file named:
abcdèf.tif
with Location = abcdé


When I run exiftool -a -G1 abcdèf.tif in the command window (CP=850 default), I'll get:
exiftool -a -G1 abcdèf.tif
[ExifTool]      ExifTool Version Number         : 13.04
[ExifTool]      Warning                         : FileName encoding must be specified
[System]        File Name                       : abcdÞf.tif

[XMP-iptcCore]  Location                        : abcd├®
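The Þ in the file name also falls out of the code-page arithmetic (my interpretation): "è" is byte E8 in the ANSI code page, and CP850 renders byte E8 as Þ. For example:

```python
ansi_byte = 'è'.encode('cp1252')   # b'\xe8' - the ANSI byte for the filename character
print(ansi_byte.decode('cp850'))   # what a CP850 console shows instead: Þ
```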


If I then switch to CP=65001, I get the incorrect filename:
exiftool -a -G1 abcdèf.tif
[ExifTool]      ExifTool Version Number         : 13.04
[ExifTool]      Warning                         : FileName encoding must be specified
[System]        File Name                       : abcdf.tif

but the correct metadata (as explained above):
[XMP-iptcCore]  Location                        : abcdé

If I then output this to a text file:
I think what happens is that when Notepad or Notepad++ opens it, it sees the filename first and decides the encoding is 1252 (Latin), so it defaults the encoding of the text file to ANSI, at which point it correctly displays the filename:

<ANSI>
[ExifTool]      ExifTool Version Number         : 13.04
[ExifTool]      Warning                         : FileName encoding must be specified
[System]        File Name                       : abcdèf.tif

but the location is incorrect because the encoding from exiftool is UTF-8, not ANSI, so:
Location                        : abcdÃ©
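That "Ã©"-style garbage is the classic UTF-8-read-as-ANSI pattern: the UTF-8 bytes C3 A9 for "é" decode in cp1252 as two separate characters. Reproduced in Python:

```python
utf8_bytes = 'é'.encode('utf-8')     # b'\xc3\xa9' as written by exiftool
print(utf8_bytes.decode('cp1252'))   # Notepad's ANSI reading: Ã©
```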

However, since Notepad++ makes changing the encoding simple, when read as UTF-8:
<UTF-8>
[ExifTool]      ExifTool Version Number         : 13.04
[ExifTool]      Warning                         : FileName encoding must be specified
[System]        File Name                       : abcd覮tif (it actually shows abcd\xE8f.tif)

but the location is correct:
[XMP-iptcCore]  Location                        : abcdé


So in conclusion, I think Windows has made a royal mess in order to maintain backwards compatibility. I have tossed around several ideas to try to get a system that just works properly. I've thought of tagging in a Linux VM, since Linux defaults to Unicode, or possibly a Windows VM defaulted to UTF-8, but that still might have the issue with text files in 1252. I think for the moment I'll skip the accented characters in the scripting and possibly tag them in a Windows Unicode application, since there seems to be no clean way to get around accented characters on the command line.


StarGeek

Quote from: twoj on February 18, 2025, 04:41:22 PMI have been outputting the exiftool info to text files, usually in PowerShell: exiftool -a -G1 FILE >out1.txt

One thing to take note of is that when redirecting to a file (or a pipe), PS will force the output to be UTF-16 with a Little Endian BOM. Additionally, it will corrupt binary data in the process by assuming the data is text and changing it to UTF-16. It will also insert a carriage return (0x0D) in front of every line feed (0x0A). See this post and its links.

Exiftool won't read the resulting UTF-16 file. For example, if you used PS to output a CSV file with the -csv option, exiftool won't be able to re-import the data unless the CSV file is changed to UTF-8.
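For reference, the UTF-16 LE BOM that PowerShell prepends is the two bytes FF FE, and every character widens to two bytes, which is why tools expecting UTF-8 choke on the file. A quick illustration (Python here just to show the bytes; this is not exiftool-specific):

```python
import codecs

print(codecs.BOM_UTF16_LE)           # b'\xff\xfe' - written at the start of the file
print('é\r\n'.encode('utf-16-le'))   # each character becomes two bytes
```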

QuoteThere does seem to be a difference between the actual file name encoding and the metadata encoding, and specifying the encoding of the metadata information can incorrectly display the name of the file in the metadata output.

This might be covered in Windows Unicode File Names.

Personally, I simply use CMD and the Beta: Use unicode UTF-8 option. While that can affect some older GUIs, I've had no problem with accented and other complex characters on the command line.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype