No support for unicode surrogates | emoji

Started by Anonan, January 01, 2019, 01:58:36 PM


Anonan

A Unicode surrogate pair represents an ordinary Unicode character, except that its code point lies in the range U+010000 to U+10FFFF, which requires two 16-bit code units in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

Such a character can also be represented in UTF-8.
It is the same character, but with a different byte representation in UTF-8 and UTF-16.

https://unicodebook.readthedocs.io/definitions.html#character-string
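
To make the difference between the two representations concrete, here is a minimal Python sketch. The character U+1F5BC is chosen only for illustration, and the helper name surrogate_pair is hypothetical. It applies the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


ch = "\U0001F5BC"                      # illustrative character above U+FFFF
utf8_bytes = ch.encode("utf-8")        # one 4-byte sequence
utf16_bytes = ch.encode("utf-16-le")   # two 16-bit code units, little-endian
high, low = surrogate_pair(ord(ch))

print(utf8_bytes.hex(" "))             # f0 9f 96 bc
print(utf16_bytes.hex(" "))            # 3d d8 bc dd
print(f"{high:04X} {low:04X}")         # D83D DDBC
```

The same character therefore takes 4 bytes in either encoding, but the byte values differ completely.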

Phil Harvey

Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Anonan

#32
Quote from: Phil Harvey on January 10, 2019, 12:38:08 PM
Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

It doesn't work.
Windows uses UTF-16 for file names. The program reports Invalid Charset utf16 (the same for UTF-16, UTF16, utf-16).
With any other charset that the program accepts (utf8, cp1251) I get Error: File not found

ёшэшщ is mojibake. It should be "синий".



The stderr's text is encoded with ANSI (in my case ANSI is cp1251).
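
One plausible way this particular mojibake arises can be reproduced in a few lines of Python. This is a sketch under an assumption that is not stated in the thread: the console interpreted the cp1251 bytes as OEM code page 866.

```python
original = "синий"

# Bytes as a program writing ANSI (cp1251) text would emit them.
ansi_bytes = original.encode("cp1251")

# The same bytes interpreted as OEM code page 866 (assumed console code page).
misread = ansi_bytes.decode("cp866")

# Reversing the two steps recovers the original text.
repaired = misread.encode("cp866").decode("cp1251")

print(misread == "ёшэшщ")     # True
print(repaired == original)   # True
```

If the console used a different code page, the garbled text would look different, so the pair cp1251/cp866 is only an inference from the two strings above.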

Anonan

#33
I have updated the bug description.

Bug:
If a folder contains a file whose name includes a surrogate pair, the output lines that contain a file name will be encoded with ANSI* encoding for all files in this folder (other data is encoded with UTF-8 by default).
And if you use -json, the characters in this ANSI string that are not in the ASCII charset will simply be replaced by ??.??.?.

*ANSI is cp1251 in my case.


Correction: only the lines that contain a file name are affected.




The folder structure:

Anonan

#34
A surrogate pair within a metadata tag is processed correctly: I get a valid UTF-8 character in result.txt.

I can copy and paste these 6 bytes, and the character is displayed correctly.

But with -json this data is lost.



One more example:

Anonan

#35
A PowerShell script to find all files whose names contain a surrogate pair:

Get-ChildItem -Recurse -Force | Where-Object -FilterScript {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
or
ls -r -fo | where {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
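
For anyone without PowerShell, roughly the same search can be sketched in Python. The function names below are hypothetical. Python strings hold whole code points rather than UTF-16 code units, so the check is for code points above U+FFFF instead of a surrogate regex.

```python
import os


def has_supplementary_char(name):
    """True if the name has a code point above U+FFFF (a surrogate pair in UTF-16)."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_names_with_surrogate_pairs(root="."):
    """Yield paths under root whose file or folder name needs a surrogate pair."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_supplementary_char(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in find_names_with_surrogate_pairs("."):
        # Escape non-ASCII characters so printing cannot fail on a legacy console.
        print(path.encode("unicode_escape").decode("ascii"))
```

The escaped output avoids the console encoding problems described in this thread. Whether hidden and system files are visited depends on the platform, so this is not an exact equivalent of -Force.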




This is the output in Notepad++ and in Windows Notepad.
And I can change the encoding to UTF-8 via Windows Notepad. After this, Notepad++ displays the \u{XXXXX} characters correctly.

(It was weird to me that Notepad++ does not support UTF-16, only UCS-2.)


It's the same file, but in UTF-8, opened in Notepad++.

Anonan

#36
Okay. The workaround that solves this problem is to set the default Windows code page to 65001 (UTF-8) in the Windows Region settings.

You need to enable the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Windows Settings -> Time & Language -> Region -> [See the screenshot]
[In the screenshot] -> Additional date, time & regional settings -> Change date, time, or number formats -> Administrative -> Change system locale... -> Beta: Use Unicode UTF-8 for worldwide language support -> OK

For more info check the answer here: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096

-----


This fully fixes all Unicode problems for this program (and for other similar ones) in Windows.

You can also experiment with console fonts; the default one does not display some characters correctly.
But CMD and PowerShell can't display emoji correctly with any font. Git Bash, for example, can, even with the default font.
Piping to a file works fine.

-----




Anonan

The only remaining problem is with "XP Keywords".

Check the image from the attachment.

It is displayed incorrectly both in the console and in the file (after piping the output).

I just added two properties (Tags, Comments) to the file with File Explorer (see the screenshot).




Notepad++


Windows Notepad

Anonan

#38
In fact, these bytes (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid UTF-8 bytes of the emoji.

But why are they not displayed correctly in text editors and consoles?
The same emoji are displayed well in the "Keywords", "Last Keyword XMP", "Last Keyword IPTC", and "Subject" properties, but not in the "XP Keywords" property.

hexed.it





---

The interesting point is that supposedly valid UTF-8 text uses UTF-16 surrogate pairs for emoji, but these UTF-8 bytes of the emoji do not work correctly in most (all?) programs.

---

So, is it a bug? Or is the "XP Keywords" property a special case? To me it looks like a bug.

Anonan

#39
Yeah, technically you can use 5th and 6th bytes in UTF-8. But for compatibility reasons RFC 3629 (2003) forbids doing this.
So it explains why no program can display the "XP Keywords" property properly if it contains emoji.

The same problem occurs with "XP Comment", "XP Author", and "XP Keywords".
It looks like all "XP" properties use improper UTF-8 encoding.
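
A closer look at the bytes quoted earlier (ED A0 BD ED B6 BC) suggests they are two separate 3-byte sequences, each encoding one UTF-16 surrogate, rather than a single 6-byte sequence. RFC 3629 disallows encoding surrogates this way; the form is often called CESU-8. The following Python sketch, with the hypothetical helper name repair_surrogate_utf8, shows one way such bytes could be converted to standard UTF-8. It is an illustration, not a description of what ExifTool does.

```python
def repair_surrogate_utf8(data):
    """Decode bytes in which each UTF-16 surrogate was encoded as its own 3-byte sequence."""
    # "surrogatepass" lets the decoder accept the code points U+D800..U+DFFF.
    text_with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 so each high/low pair is joined into one code point.
    utf16 = text_with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


raw = bytes.fromhex("ED A0 BD ED B6 BC")   # bytes quoted in this thread
fixed = repair_surrogate_utf8(raw)

print(f"U+{ord(fixed):04X}")               # U+1F5BC
print(fixed.encode("utf-8").hex(" "))      # f0 9f 96 bc
```

A strict UTF-8 decoder rejects the original bytes, which is consistent with editors and consoles failing to display them.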

StarGeek

Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is to set the default Windows code page to 65001 (UTF-8) in the Windows Region settings.

You need to enable the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution.  It may display strange characters in some programs, especially older programs.  It's only visual though.

For example Ditto clipboard manager.  There's supposed to be two leading spaces here
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

StarGeek

Quote from: Anonan on June 16, 2021, 02:43:28 AM
In fact, these bytes (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid UTF-8 bytes of the emoji.

But why are they not displayed correctly in text editors and consoles?
The same emoji are displayed well in the "Keywords", "Last Keyword XMP", "Last Keyword IPTC", and "Subject" properties, but not in the "XP Keywords" property.
...
So, is it a bug? Or is the "XP Keywords" property a special case? To me it looks like a bug.

Quote from: Anonan on June 16, 2021, 04:18:11 AM
The same problem occurs with "XP Comment", "XP Author", and "XP Keywords".
It looks like all "XP" properties use improper UTF-8 encoding.

It may have to do with the last line in the EXIF section in FAQ #10, especially if the rest of the EXIF data is big-endian.  Just a guess.

     The EXIF "XP" tags (XPTitle, XPComment, etc) are always stored internally as little-endian Unicode (UCS‑2), and are read and written using the specified external character set.
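
To illustrate why UCS-2 handling could matter here, this Python sketch converts little-endian 16-bit units to UTF-8 one unit at a time, as a converter that knows only UCS-2 might do. The helper name ucs2_units_to_utf8 is hypothetical and the sketch is a guess at the mechanism, not ExifTool's actual code.

```python
import struct


def ucs2_units_to_utf8(utf16le):
    """Encode every 16-bit unit separately, ignoring surrogate pairing."""
    units = struct.unpack(f"<{len(utf16le) // 2}H", utf16le)
    # "surrogatepass" forces each lone surrogate into its own 3-byte sequence.
    return b"".join(chr(u).encode("utf-8", errors="surrogatepass") for u in units)


stored = "\U0001F5BC".encode("utf-16-le")           # 3d d8 bc dd, as in an XP tag
naive = ucs2_units_to_utf8(stored)                  # per-unit conversion
proper = stored.decode("utf-16-le").encode("utf-8")  # pair-aware conversion

print(naive.hex(" "))    # ed a0 bd ed b6 bc
print(proper.hex(" "))   # f0 9f 96 bc
```

The per-unit result matches the 6 bytes reported for "XP Keywords" in this thread, while the pair-aware result is the standard 4-byte UTF-8 form.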

Anonan

The default character set is UTF-8. And it returns "not valid" (based on RFC 3629) UTF-8 for characters with code points of 0x10000 and above.
I think I am just the first person who, purely as a test, wrote to an XP tag a character that is encoded with a UTF-16 surrogate pair.
More than 99.9% of people do not face this problem.

Martin Z

Quote from: StarGeek on June 16, 2021, 10:11:29 AM
Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is to set the default Windows code page to 65001 (UTF-8) in the Windows Region settings.

You need to enable the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution.  It may display strange characters in some programs, especially older programs.  It's only visual though.

For example Ditto clipboard manager.  There's supposed to be two leading spaces here




I encountered this issue today (EXIFtool 12.62)...
- O/S: Windows 11
- Windows Unicode UTF-8 beta feature enabled: Yes
- Codepage: 65001

I tried with and without the ' -charset filename=UTF-8' parameter, but in both cases the output from EXIFtool was...

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";"
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate"
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original *

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";"
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate"
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original -charset filename=UTF-8 *

-------------------------------------------------------------

Error: [Win32::FindFile] No support for unicode surrogates - *
No matching files

Any help would be greatly appreciated!
-- Thanks, Martin

StarGeek

* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).