No support for unicode surrogates | emoji

Anonan · January 10, 2019, 06:07:31 AM

A character encoded as a surrogate pair is an ordinary Unicode character, except that its code point lies in the range U+010000 to U+10FFFF, so UTF-16 needs two 16-bit code units to represent it.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
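
As a rough illustration of that rule, here is a minimal Python sketch that splits a code point above U+FFFF into its two UTF-16 code units. The function name `to_surrogate_pair` and the sample code point U+1F5BC are chosen only for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+010000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+010000..U+10FFFF")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F5BC)
print(f"{high:04X} {low:04X}")  # D83D DDBC
```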

Such a character can also be represented in UTF-8.
It is the same character, only with a different byte representation in UTF-8 and in UTF-16.

https://unicodebook.readthedocs.io/definitions.html#character-string
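
To make the difference concrete, a small Python sketch can print both byte representations of one character. The sample character U+1F4C1 is an arbitrary pick for illustration.

```python
ch = "\U0001F4C1"  # sample character outside the BMP (picked for illustration)

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark

print(utf8_bytes.hex(" ").upper())   # F0 9F 93 81
print(utf16_bytes.hex(" ").upper())  # D8 3D DC C1
```

Both byte strings decode back to the same single character; only the encoding differs.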

Phil Harvey · January 10, 2019, 12:38:08 PM

Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with the ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

Anonan · January 10, 2019, 03:55:07 PM

Quote from: Phil Harvey on January 10, 2019, 12:38:08 PM
Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with the ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

It doesn't work.
Windows uses UTF-16 for file names, but the program reports Invalid Charset for utf16 (and likewise for UTF-16, UTF16 and utf-16).
With any other charset that the program accepts (utf8, cp1251) I get Error: File not found

ёшэшщ is mojibake. It should be "синий".

The stderr text is encoded with the ANSI code page (cp1251 in my case).
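
This particular mojibake can be reproduced in a few lines of Python. The sketch assumes that the console interprets the bytes as cp866 (the usual OEM code page on Russian Windows); that code page is an assumption and is not stated anywhere in this thread.

```python
original = "синий"

# The program writes the text as cp1251 (ANSI) bytes...
raw = original.encode("cp1251")

# ...and the console is assumed here to interpret those bytes as cp866 (OEM).
garbled = raw.decode("cp866")

print(garbled)  # ёшэшщ
```

Reversing the two steps (encode as cp866, decode as cp1251) recovers the original word, which is a quick way to check a guess about which code pages are involved.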

Anonan · January 10, 2019, 04:28:20 PM

I have updated the bug description.

Bug:
If a folder contains a file whose name includes a surrogate pair, the output lines that contain a file name, for all files in this folder ~~all other files with non-ASCII name~~, will be encoded with the ANSI* encoding (other data is encoded with UTF-8 by default).
And if you use -json, the characters in this ANSI string that are not in the ASCII charset will simply be replaced by ??.??.?.

*ANSI is cp1251 in my case.

fix: only lines with file name

The folder structure:

Anonan · January 10, 2019, 04:52:49 PM

A surrogate pair within a metadata tag is processed correctly: in result.txt I get a valid UTF-8 character.

I can copy and paste these 6 bytes, and the character is displayed correctly.

But with -json this data is lost.

One more example:

Anonan · January 16, 2019, 07:26:02 PM

A PowerShell script to find all files whose names contain a surrogate pair:

Get-ChildItem -Recurse -Force | Where-Object -FilterScript {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
or
ls -r -fo | where {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}

This is the output in Notepad++ and in Windows Notepad.
I can change the encoding to UTF-8 with Windows Notepad. After that, Notepad++ displays the \u{XXXXX} characters correctly.

(It was strange to me that Notepad++ does not support UTF-16, only UCS-2.)

This is the same file, saved as UTF-8 and opened in Notepad++.
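
The PowerShell search above has a rough Python counterpart, sketched here for anyone who prefers a script. Python strings hold whole code points rather than UTF-16 units, so the check becomes "any character above U+FFFF". The function names are made up for this example.

```python
import os


def has_supplementary_char(name):
    """True if the name has a character above U+FFFF (a surrogate pair in UTF-16)."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_names_with_surrogate_pairs(root):
    """Yield paths under root whose file or folder name needs a surrogate pair."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_supplementary_char(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in find_names_with_surrogate_pairs("."):
        # ascii() keeps the output printable even if the console cannot show the name
        print(ascii(path))
```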

Anonan · June 16, 2021, 01:01:14 AM

Okay. The workaround that solves this problem is to set the default Windows code page to 65001 (UTF-8) in the Windows Region settings.

You need to enable the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Windows Settings -> Time & Language -> Region -> [See the screenshot]
[In the screenshot] -> Additional date, time & regional settings -> Change date, time, or number formats -> Administrative -> Change system locale... -> Beta: Use Unicode UTF-8 for worldwide language support -> OK

For more info check the answer here: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096

-----

This fully fixes all Unicode problems for this program (and for other similar ones) on Windows.

You can also experiment with console fonts; the default one does not display some characters correctly.
But CMD and PowerShell can't display emoji correctly with any font. Git-Bash, for example, can, even with the default font.
Piping the output to a file works fine.

-----

Anonan · June 16, 2021, 01:22:39 AM

The only remaining problem is with "XP Keywords".

Check the image in the attachment.

It is displayed incorrectly both in the console and in the file (after piping the output).

I just added two properties (Tags, Comments) to the file with File Explorer (see the screenshot).

Notepad++

Windows Notepad

Anonan · June 16, 2021, 02:43:28 AM

In fact, these bytes (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid UTF-8 bytes of the emoji.

But why are they not displayed correctly in the text editors and the consoles?
The same emoji are displayed correctly in the "Keywords", "Last Keyword XMP", "Last Keyword IPTC" and "Subject" properties, but not in the "XP Keywords" property.

hexed.it

---

The interesting thing is that a valid UTF-8 text uses a UTF-16 surrogate pair for the emoji, but these valid UTF-8 bytes of the emoji do not work correctly in most (all?) programs.

---

So, is it a bug? Or is the "XP Keywords" property a special case? To me it looks like a bug.

Anonan · June 16, 2021, 04:18:11 AM

Yeah, technically each 3-byte group here is a single UTF-16 surrogate (U+D800 to U+DFFF) written out in the UTF-8 bit pattern. But RFC 3629 (2003) forbids encoding surrogates in UTF-8; a character above U+FFFF must be a single 4-byte sequence.
So that explains why no program can display the "XP Keywords" property properly if it contains emoji.

The same problem occurs with "XP Comment", "XP Author" and "XP Keywords".
It looks like all "XP" properties use improper UTF-8 encoding.
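
The byte sequences from the earlier post can be checked with a short Python sketch. It shows that a strict UTF-8 decoder rejects them, and that they turn into the expected emoji once each 3-byte group is read as a UTF-16 surrogate and the two halves are paired up. The helper name `decode_surrogate_bytes` is made up for this example.

```python
def decode_surrogate_bytes(data):
    """Decode bytes in which each UTF-16 surrogate was written as its own 3-byte sequence."""
    # "surrogatepass" lets the decoder accept the otherwise forbidden U+D800..U+DFFF.
    halves = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each high/low surrogate into one character.
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


sample = bytes.fromhex("ED A0 BD ED B6 BC")

try:
    sample.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)                                     # False
print(f"U+{ord(decode_surrogate_bytes(sample)):X}")  # U+1F5BC
```

For comparison, the regular UTF-8 encoding of U+1F5BC is the 4-byte sequence F0 9F 96 BC.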

StarGeek · June 16, 2021, 10:11:29 AM

Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is setting default Windows code page to 65001 (UTF-8) in the Windows' Region settings.

You need enable "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution. It may display strange characters in some programs, especially older programs. It's only visual though.

For example Ditto clipboard manager. There's supposed to be two leading spaces here

StarGeek · June 16, 2021, 10:25:37 AM

Quote from: Anonan on June 16, 2021, 02:43:28 AM
In fact it (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid utf8 bytes of the emoji.

But why they are not display correctly in the text editors and the consoles?
While the same emoji display well in "Keywords", "Last Keyword XMP", "Last Keyword IPTC", "Subject" properties, but not in "XP Keywords" property.
...
So, is it a bug? Or is "XP Keywords" property such special one? For me it looks like a bug.

Quote from: Anonan on June 16, 2021, 04:18:11 AM
The same problem is with "XP Comment", "XP Author", "XP Keywords".
It looks all "XP" properties use not proper UTF-8 encoding.

It may have to do with the last line in the EXIF section in FAQ #10, especially if the rest of the EXIF data is big-endian. Just a guess.

The EXIF "XP" tags (XPTitle, XPComment, etc) are always stored internally as little-endian Unicode (UCS‑2), and are read and written using the specified external character set.
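
One way to see how bytes like ED A0 BD ED B6 BC could arise, sketched in Python: if a converter treats the stored text as UCS-2 and turns every 16-bit unit into UTF-8 on its own, a surrogate pair comes out as two 3-byte sequences instead of one 4-byte sequence. This is only an illustration of that idea; it is not taken from ExifTool's source, and the function name `ucs2_units_to_utf8` is hypothetical.

```python
import struct


def ucs2_units_to_utf8(utf16le_bytes):
    """Convert each 16-bit unit to UTF-8 independently (UCS-2 style, no pairing)."""
    count = len(utf16le_bytes) // 2
    units = struct.unpack(f"<{count}H", utf16le_bytes)
    # "surrogatepass" is needed because lone surrogates are not valid in strict UTF-8.
    return b"".join(chr(u).encode("utf-8", errors="surrogatepass") for u in units)


stored = "\U0001F5BC".encode("utf-16-le")             # code units D83D DDBC, little-endian
print(ucs2_units_to_utf8(stored).hex(" ").upper())    # ED A0 BD ED B6 BC
print("\U0001F5BC".encode("utf-8").hex(" ").upper())  # F0 9F 96 BC
```

For text that stays within the BMP, the per-unit conversion and the regular UTF-8 encoding give identical bytes, which would fit the observation that only emoji are affected.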

Anonan · June 17, 2021, 02:55:29 AM

The default character set is UTF-8. And it returns "not valid" (per RFC 3629) UTF-8 for characters with code points of 0x10000 and above.
I think I am just the first person who, purely as a test, wrote to an XP tag a character that is encoded with a UTF-16 surrogate pair.
99.9+ % of people do not face this problem.

Martin Z · May 13, 2023, 09:47:17 AM

Quote from: StarGeek on June 16, 2021, 10:11:29 AM
Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is setting default Windows code page to 65001 (UTF-8) in the Windows' Region settings.

You need enable "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution. It may display strange characters in some programs, especially older programs. It's only visual though.

For example Ditto clipboard manager. There's supposed to be two leading spaces here

I encountered this issue today (EXIFtool 12.62)...
- O/S: Windows 11
- Windows Unicode UTF-8 beta feature enabled: Yes
- Codepage: 65001

I tried with and without the ' -charset filename=UTF-8' parameter, but in both cases the output from EXIFtool was...

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";" 
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate" 
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original *

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";" 
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate" 
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original -charset filename=UTF-8 *

-------------------------------------------------------------

Error: [Win32::FindFile] No support for unicode surrogates - *
No matching files

Any help would be greatly appreciated!
-- Thanks, Martin

StarGeek · May 13, 2023, 11:19:12 AM

What is the name of the problem file?
