ExifTool Forum

ExifTool => Bug Reports / Feature Requests => Topic started by: Anonan on January 01, 2019, 01:58:36 PM

Title: No support for unicode surrogates | emoji
Post by: Anonan on January 01, 2019, 01:58:36 PM
The program throws the exception "No support for unicode surrogates at script/exiftool line 3553." when it is run on files whose names contain emoji.

Example file names: see the attachment.
This forum does not support emoji either, so I can't post the file names here.


And yes, I don't like emoji either. I don't use them, but other people do, so support for them is needed.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 01, 2019, 02:22:19 PM
Windows special characters are really a pain.  (I'm assuming you are on Windows.)

What version of ExifTool are you using?

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 01, 2019, 02:31:57 PM
11.2.2.0 and 11.2.3.0 (I have tested this version right now. The result is the same). Yes, I use Windows 10.

I have also tried both cmd.exe and Git Bash.
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 01, 2019, 03:03:00 PM
It also does not support symbols like https://en.wiktionary.org/wiki/º (https://en.wiktionary.org/wiki/%C2%BA) (Do not confuse with https://en.wikipedia.org/wiki/Degree_symbol (https://en.wikipedia.org/wiki/Degree_symbol), ExifTool sees ° normally.)
Example of file name: "360º Test.mp4"
In this case the program just writes "No matching files".
Title: Re: No support for unicode surrogates | emoji
Post by: StarGeek on January 01, 2019, 03:17:28 PM
Quote from: Anonan on January 01, 2019, 03:03:00 PM
It also does not support symbols like https://en.wiktionary.org/wiki/º (https://en.wiktionary.org/wiki/%C2%BA) (Do not confuse with https://en.wikipedia.org/wiki/Degree_symbol (https://en.wikipedia.org/wiki/Degree_symbol), ExifTool sees ° normally.)
Example of file name: "360º Test.mp4"
In this case the program just writes "No matching files".

This would seem to be a FAQ #18 (https://exiftool.org/faq.html#Q18) answer, as when I change the code page to 65001, it works fine.

C:\>exiftool -g1 -a -s -PNG:all "Y:\!temp\bb\360º Test.png"
---- PNG ----
ImageWidth                      : 336
ImageHeight                     : 509
BitDepth                        : 8
ColorType                       : Grayscale with Alpha
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
WhitePointX                     : 0.3127
WhitePointY                     : 0.329
RedX                            : 0.64
RedY                            : 0.33
GreenX                          : 0.3
GreenY                          : 0.6
BlueX                           : 0.15
BlueY                           : 0.06
BackgroundColor                 : 255
Label                           : FinalDesignArt
ModifyDate                      : 2018:11:15 11:02:46
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 01, 2019, 05:15:31 PM
I can't figure out that line number.  Line 3553 of exiftool version 11.22 doesn't do anything that could possibly generate a warning like that. :/

I guess I'll have to try this myself when I can.

What was the exact command you used?  (Maybe do a screen grab of the command and the warning you get.)

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 06:13:46 AM
It's strange, but today I get the exception on line 3547. (The result is the same for both 11.2.2 and 11.2.3; Windows 10, Russian; "chcp 65001" has no effect on the result.)

I run "exiftool.exe *" in a folder that contains one or more files with emoji in their names.
File names: https://pastebin.com/gtNj96mg (I cannot post them here; otherwise I get the forum error "The message body was left empty.")
Finally I get:
"No support for unicode surrogates at script/exiftool line 3547."
Nothing else is printed to the console.



> Maybe do a screen grab of the command and the warning you get.
Ok, I will do this later.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 07:15:27 AM
OK.  Line 3547 would be an error in the Win32::FindFile package.  There isn't much I can do about this.

Try not using wildcards when you specify file names on the command line.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 07:39:44 AM
Oh, wait. The error on line 3553 occurs when I just use "exiftool.exe FILENAME".
Wildcards work fine when there are no files with such names.

Look at the attachment.
(Mirror: https://i.imgur.com/opg7Rj9.png)

CMD displays emoji incorrectly, but handles them correctly.
I can even copy these ⍰⍰ (https://unicode-table.com/en/2370/) and paste them into a text editor that can display Unicode surrogates, and see the correct "icon".

Or I can concatenate all files into one with "copy /b *.txt concated.txt". This command works fine even if file names contain Unicode surrogates (CMD just displays them as ⍰⍰).
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 08:47:20 AM
OK.  The underlying problem is that Win32::FindFile does not support these surrogate codes.  The reason I'm using Win32::FindFile in the first place is because of the lack of built-in support in ActivePerl for Windows Unicode file names.  The situation is unfortunate, but one possible work-around could be to create a hard link with a plain ASCII name to the file with the surrogate characters, then run exiftool on the hard link.

- Phil
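The hard-link workaround can be scripted. The following is a minimal Python sketch, not part of ExifTool: the helper names `ascii_safe_name` and `link_with_ascii_name` are hypothetical, and it assumes that the file system supports hard links (NTFS does) and that the link directory is on the same volume as the source file.

```python
import os


def ascii_safe_name(name: str) -> str:
    """Replace every non-ASCII character in a file name with '_'."""
    return "".join(ch if ord(ch) < 128 else "_" for ch in name)


def link_with_ascii_name(src_path: str, link_dir: str) -> str:
    """Create a hard link to src_path in link_dir under an ASCII-only name.

    Returns the path of the new link. os.link raises FileExistsError if a
    file with the ASCII-only name already exists in link_dir.
    """
    link_name = ascii_safe_name(os.path.basename(src_path))
    link_path = os.path.join(link_dir, link_name)
    os.link(src_path, link_path)
    return link_path
```

ExifTool can then be pointed at the link. Because a hard link refers to the same file data, reading metadata through the link reads the original file.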
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 09:24:52 AM
Can the program just skip the files with Unicode surrogates in their names without stopping?
And at the end, list the names of the skipped files so that I can process them manually.

I need to get metadata from a lot of files, and only a few of them have Unicode surrogates in their names, but in this case the program does not work at all.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 09:30:14 AM
I'll see what I can do.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 10:39:07 AM
I've managed to reproduce this.  (The hardest part was figuring out how to create a file with a surrogate character in its name.  I finally did it by creating the file on a Mac then sending it to the Windows machine.)

I will patch ExifTool 11.24 to catch this error from Win32::FindFile and issue a warning or error instead.

Thanks for this report.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 11:11:56 AM
> And at the end write the names of the files that were skipped to be processed manually by me.
It would probably be better to also show them at the start (in the "err" stream), so that I can stop the program, fix the names and restart it, rather than running it twice.
The program can take several minutes when there are several gigabytes of data.


> The hardest part was figuring out how to create a file with a surrogate character in its name.
For example, right-click a text input in Chrome/Opera and choose the first option in the context menu.

(https://i.imgur.com/8bIWEXs.png)

Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 12:27:44 PM
Quote from: Anonan on January 02, 2019, 11:11:56 AM
It would probably be better to also show them at the start (in the "err" stream), so that I can stop the program, fix the names and restart it.

This is problematic.  For one, there will likely be a problem interpreting the file name(s) in the ExifTool stderr messages due to character set problems.  I'll be outputting these messages in UTF-8.  The other thing is that it would be very hard for me to find these files beforehand.  So you will unfortunately be stuck trying to process them in a second pass.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 01:04:58 PM
Currently I get an error if there is at least one file with such a name (I use a command like "exiftool * > out.txt"), and no useful work is done.
It looks like the program gets the names of all files first and parses them. After this, if there was no exception, the program works with the parsed file names (and with the real files). Is that right?

If so, when it throws an exception it would be enough to write the file name to stderr and continue parsing the remaining file names.
(Useful output would start only after all file names are parsed, so all "broken" file names would already be listed in stderr before the first metadata is written to stdout.)

That's my guess.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 02, 2019, 01:41:13 PM
Quote from: Anonan on January 02, 2019, 01:04:58 PM
It looks like the program gets the names of all files first and parses them.

I think this is true for each FILE argument on the command line.  But if you specify multiple FILE arguments then the files are processed before considering the next argument.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 02, 2019, 04:50:40 PM
I have tested the program some more.
I used this simple command:
exiftool -r *

And I have the following folder structure:

[folder1 (run the program here)]
     pic1.jpg
     pic2.jpg
     pic3.jpg
     [sub_folder2]
         pic3.jpg
         pic4.jpg


At the moment the program works like this:
It parses all names (folders as well as files) in folder1.
If the names are "clean" (contain no Unicode surrogate pair), it processes the files in this folder (outputs their metadata).
If the name of any file or subfolder contains a Unicode surrogate pair, it throws the exception, and the other files are not processed at all.
Then it goes into the subfolders and repeats this algorithm.


So, without a change to the program logic, ExifTool cannot provide a full list of files whose names contain a Unicode surrogate pair before it processes some of the files, if there is a subfolder in the source folder.


Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 03, 2019, 12:21:48 PM
Quote from: ... on January 02, 2019, 04:50:40 PM
So, in this way ExifTool theoretically (without changing the program logic) can not provide a full list with files contain unicode surrogate pair before processes the files (a part of), if there is a subfolder in the source folder.
But I think it would not be hard to add a new option (something like -fulltreebypassfirst, which is not a great name) that tells the program to walk the full file tree first.
In this context, that would make it possible to get the full list of files whose names contain a Unicode surrogate pair before any file is processed (to extract metadata), and then continue in the normal way.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 03, 2019, 01:14:59 PM
This is actually similar to what the -progress option does.  But I have to find some time to look into this in more detail.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 08, 2019, 11:55:11 AM
I have updated to version 11.24.
Now the program behaves very strangely when a file name contains a surrogate pair.

I have created some folders for testing; check the attachment.


I used CMD and Git Bash on Windows 10 (Russian).
The console output of the .bat and .sh files is also in the corresponding folder.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 08, 2019, 12:06:19 PM
What are you calling strange?  The file that ExifTool reads as APPLE_~1.JPG in example2?  This is the behaviour of the standard library, which I am using if Win32::FindFile fails.  Apparently the standard library falls back to using Windows short filenames for the files with Unicode characters, but I think this depends on your system settings.  I agree that this is strange.  From this post by StarGeek (https://exiftool.org/forum/index.php/topic,8166.msg41817.html#msg41817) it seems you can see these 8.3 filenames with dir /x.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 08, 2019, 01:13:14 PM
For example:

I use exiftool.exe -FileName * -r

1.
> ex1cmd.png and ex1cmd_cp866
https://imgur.com/a/LH4dG0p (https://imgur.com/a/LH4dG0p)

After processing a file with a surr name (a name that contains a surrogate pair), the program can't work with a file in the same folder whose name contains Cyrillic characters.
It writes "Invalid filename encoding" and "Error opening directory".

2.
> ex2cmd.png and ex2cmd_cp866
> ex2bash.png
https://imgur.com/a/NTETU5x (https://imgur.com/a/NTETU5x)

With CMD, the program does not work at all if a file with a surr name exists in the root folder. It handles "*" incorrectly in this case.
But with Bash, this file is skipped with an Error, and the program continues to work.


3.
> ex2bash.png
https://i.imgur.com/Qwt9r48.png
(https://i.imgur.com/Qwt9r48.png)
The Warning contains only the folder name without the file name. (The Error looks normal; it shows the file name as "apple_??_apple.jpg".)


4.
> ex0bash.png, ex1bash.png, ex2bash.png
https://imgur.com/a/eMze8B1 (https://imgur.com/a/eMze8B1)

When I use Git Bash, the charset of Cyrillic file names in the program's output varies.
Some file names are output in a charset that displays them correctly, others in a charset that displays them incorrectly.


5. The ex*bash_ls.png and ex*cmd_dir.png pictures demonstrate that both consoles can display the file names correctly by themselves.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 08, 2019, 01:42:45 PM
OK.  So apparently the 11.24 patch wasn't much help.  Was the previous behaviour better?

The globbing of filenames with wildcards in Windows is a problem that I may not be able to solve.  I have always recommended avoiding the use of wildcards in file names.  Does the situation improve if you do this instead?:

exiftool.exe -filename -ext "*" -r .

- Phil
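When ExifTool is driven from a script, the same idea applies: pass the directory and let ExifTool do the recursion instead of expanding a wildcard. A minimal Python sketch, assuming `exiftool` is on the PATH and that its output is UTF-8; `build_exiftool_args` and `run_exiftool` are hypothetical helper names, and only the options from the command above are used.

```python
import subprocess


def build_exiftool_args(directory: str = ".") -> list:
    """Argument list equivalent to: exiftool -filename -ext "*" -r DIRECTORY.

    Passing a list (no shell) means the "*" reaches ExifTool literally as the
    argument of -ext, so nothing expands it into file names beforehand.
    """
    return ["exiftool", "-filename", "-ext", "*", "-r", directory]


def run_exiftool(directory: str = ".") -> str:
    """Run ExifTool on a directory and return its standard output as text."""
    result = subprocess.run(
        build_exiftool_args(directory),
        capture_output=True,
        encoding="utf-8",   # assumption: output is UTF-8
        errors="replace",   # do not fail on bytes in another encoding
    )
    return result.stdout
```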
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 09, 2019, 08:35:50 AM
Well, the dot at the end of the command does the trick.
exiftool.exe -FileName -r -ext * .
I didn't even notice it at first.
Now CMD and Git Bash behave almost the same when a file with a surr name exists in the root folder, except that Git Bash has the problem with Cyrillic names (see my previous post).
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 09, 2019, 08:51:58 AM
I went to add a note about this to the common mistakes documentation, and discovered it was already there (common mistake 2f (https://exiftool.org/mistakes.html#M2)).  I had forgotten about this, but the problem was worse with surrogates in the name.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 09, 2019, 01:00:11 PM
So.
I have the following structure (where % is the apple emoji):

root_apple_%_apple.jpg
root_green.jpg
root_синий.jpg
folder1
    folder1_green.jpg
    folder1_синий.jpg
folder2
    sub_apple_%_apple.jpg
    sub_blue.jpg
    sub_синий.jpg
    sub_%.jpg


I run
on CMD: exiftool.exe -FileName -s -r -progress -j -ext * . > o.json
on Bash: exiftool.exe -FileName -s -r -progress -j * > o2.json


CMD shows me:
Warning: [Win32::FindFile] No support for unicode surrogates - .
Warning: [Win32::FindFile] No support for unicode surrogates - ./folder2

Bash shows me:
Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg
Warning: [Win32::FindFile] No support for unicode surrogates - folder2



CMD trimmed output:

  "SourceFile": "./folder1/folder1_green.jpg",
  "SourceFile": "./folder1/folder1_синий.jpg",
  "SourceFile": "./folder2/SUB_AP~1.JPG",    // !1 - sub_apple_??_apple.jpg
  "SourceFile": "./folder2/sub_blue.jpg",
  "SourceFile": "./folder2/sub_??^??.jpg",   // !2 - sub_синий.jpg (I have switched one "?" with "^")
  "SourceFile": "./folder2/SUB_~2.JPG",      // !1 - sub_??.jpg
  "SourceFile": "./ROOT_A~1.JPG",            // !1 - root_apple_??_apple.jpg
  "SourceFile": "./root_green.jpg",
  "SourceFile": "./root_??^??.jpg",          // !2 - root_синий.jpg | In the console I see "root_ёшэшщ.jpg" for cp866 or "root_.jpg" for cp65001


Bash trimmed output:

  "SourceFile": "folder1/folder1_green.jpg",
  "SourceFile": "folder1/folder1_синий.jpg",
  "SourceFile": "folder2/SUB_AP~1.JPG",   // !1
  "SourceFile": "folder2/sub_blue.jpg",
  "SourceFile": "folder2/sub_??^??.jpg",  // !2
  "SourceFile": "folder2/SUB_~2.JPG",     // !1
                                          // !3
  "SourceFile": "root_green.jpg",
  "SourceFile": "root_??^??.jpg"          // !2


If I run Bash with exiftool.exe -FileName -s -r -progress -j -ext "*" . > o2.json the result is similar to the CMD result.


If you want to test this yourself, download the attachment.

I will add my comments on this later.

Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 09, 2019, 03:50:24 PM
Additional bug.

This is about Git Bash on Windows (UTF-16):

This command works fine:
exiftool.exe -FileName -r -ext "*" .

But this command does not:
exiftool.exe -FileName -r  *
Non-ASCII file names in the root folder are printed with an incorrect encoding (ANSI, but it should be UTF-8 like the other data).

(https://i.imgur.com/vMsppML.png)

And if I use -json, all data* in ANSI encoding** is lost; it is simply replaced by five "?" (??_?? (I can't post this; the forum inserts a smiley automatically)).
(https://i.imgur.com/Vj9wXOs.png)


*In my case – Cyrillic characters.
**Except characters that are contained in ASCII charset, f.e. Latin characters.
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 09, 2019, 08:13:48 PM
Now let us return to my last big post.

Every ??.?.?? (five ? in a row) is data lost when an ANSI-encoded string is written to the JSON file.

Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names are encoded with ANSI encoding (other data is encoded with UTF-8). This leads to mojibake.
https://unicodebook.readthedocs.io/definitions.html#mojibake (https://unicodebook.readthedocs.io/definitions.html#mojibake)

And if you use -json, the characters in this ANSI string that are not in the ASCII charset are simply replaced by ??.??.?.
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 09, 2019, 08:31:35 PM
The error message "Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg" contains a readable file name; I think the warning should show the same information.
Currently a warning shows only the folder that contains one or more files with surr names. That's not good. What if I have only one such file among a thousand files? It will be hard to find. [1]

Is there a way to avoid converting a file name from sub_apple_??_apple.jpg to SUB_AP~1.JPG?
If the program listed all these files [see 1], it would be possible to fix their names manually (for example, to remove the surrogate pair).
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 10, 2019, 06:07:31 AM
A Unicode surrogate pair represents an ordinary Unicode character; the only difference is that its code point lies in the range U+010000 to U+10FFFF, which requires two 16-bit code units in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

Such a character can also be represented in UTF-8.
It is the same character, but with different byte representations in UTF-8 and UTF-16.

https://unicodebook.readthedocs.io/definitions.html#character-string
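To make the relationship concrete, here is a small Python sketch of how a code point above U+FFFF is split into a UTF-16 surrogate pair, and how the same character looks in UTF-8. The function name `to_surrogate_pair` is hypothetical, and U+1F34E (RED APPLE) is used only as an example; the thread does not say which apple emoji was in the file names.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


apple = "\U0001F34E"                     # example character, U+1F34E
utf8_bytes = apple.encode("utf-8")       # one 4-byte sequence
utf16_bytes = apple.encode("utf-16-le")  # two 16-bit code units
```

For U+1F34E the pair is D83C DF4E, while the UTF-8 form is the single four-byte sequence F0 9F 8D 8E.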
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on January 10, 2019, 12:38:08 PM
Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a file with name containing surrogate pair is contained in a folder, all other files with non-ASCII name will be encoded with ANSI encoding (Other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 10, 2019, 03:55:07 PM
Quote from: Phil Harvey on January 10, 2019, 12:38:08 PM
Quote from: Anonan on January 09, 2019, 08:13:48 PM
Bug:
If a file with name containing surrogate pair is contained in a folder, all other files with non-ASCII name will be encoded with ANSI encoding (Other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

It doesn't work.
Windows uses UTF-16 for file names. The program says "Invalid Charset utf16" (the same for UTF-16, UTF16, utf-16).
With any other charset that the program accepts (utf8, cp1251) I get "Error: File not found".

ёшэшщ is mojibake. It should be "синий".
(https://i.imgur.com/p2IKxde.png)


The stderr's text is encoded with ANSI (in my case ANSI is cp1251).
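The garbled name is consistent with text written as cp1251 bytes being displayed by a console that uses cp866 (the code page mentioned earlier in the thread). A minimal Python sketch that reproduces the effect; `misread` is a hypothetical helper, and the two code pages are assumptions taken from the posts above.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# Cyrillic text written as cp1251 bytes but shown by a cp866 console.
garbled = misread("синий", "cp1251", "cp866")
```

Here `garbled` is "ёшэшщ", the same string that appears in the console output.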
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 10, 2019, 04:28:20 PM
I have updated the bug description.

Bug:
If a folder contains a file whose name includes a surrogate pair, the output lines that contain a file name are encoded with ANSI* for all files in this folder (other data is encoded with UTF-8 by default).
And if you use -json, the characters in this ANSI string that are not in the ASCII charset are simply replaced by ??.??.?.

*ANSI is cp1251 in my case.

(https://i.imgur.com/ByRTEdk.png)
Fix: only the lines with a file name are affected.


(https://i.imgur.com/DJoNSGm.png)

The folder structure:
(https://i.imgur.com/BZVOwlL.png)
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 10, 2019, 04:52:49 PM
A surrogate pair within a metadata tag is processed well; in result.txt I get a valid UTF-8 character.
(https://i.imgur.com/dg1kpmr.png)
I can copy and paste these 6 bytes, and the character is displayed correctly.

But with -json this data is lost.
(https://i.imgur.com/wxxuFCX.png)


One more example:
(https://i.imgur.com/jD4vMsc.png)
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on January 16, 2019, 07:26:02 PM
A PowerShell script that finds all files whose names contain a surrogate pair:

Get-ChildItem -Recurse -Force | Where-Object -FilterScript {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
or
ls -r -fo | where {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
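For comparison, a rough Python equivalent of the PowerShell filter. This is a sketch with hypothetical function names: it relies on Python strings holding whole code points, so a name that needs a surrogate pair in UTF-16 shows up as a character above U+FFFF.

```python
import os


def has_supplementary_char(name: str) -> bool:
    """True if name has a character above U+FFFF (a surrogate pair in UTF-16)."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_surrogate_names(root: str = "."):
    """Yield paths under root whose file or folder name needs a surrogate pair."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_supplementary_char(name):
                yield os.path.join(dirpath, name)
```

Calling `list(find_surrogate_names("."))` gives the paths to rename or to process separately.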

(https://i.imgur.com/FQ2Imaf.png)


This is the output in Notepad++ and in Windows Notepad.
I can change the encoding to UTF-8 in Windows Notepad. After this, Notepad++ displays the \u{XXXXX} characters correctly.

(It was weird to me that Notepad++ supports only UCS-2, not UTF-16.)
(https://i.imgur.com/U1kGdGN.png)

This is the same file, saved as UTF-8 and opened in Notepad++:
(https://i.imgur.com/PFmYGVz.png)
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is to set the default Windows code page to 65001 (UTF-8) in the Windows Region settings.

You need to enable the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Windows Settings -> Time & Language ->Region -> [See the screenshot]
[In the screenshot] -> Additional date, time & regional settings -> Change date, time, or number formats -> Administrative -> Change system locale... -> Beta: Use Unicode UTF-8 for worldwide language support -> OK

For more info check the answer here: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096

-----


For me this fixes the Unicode file name problems with this program (and with other similar ones) on Windows.

You can also experiment with console fonts; the default one does not display some characters correctly.
But CMD and PowerShell can't display emoji correctly with any font. Git Bash, for example, can, even with the default font.
Piping the output to a file works fine.

-----

(https://i.imgur.com/oux6bGw.png)
(https://i.imgur.com/HMgWMao.png)
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on June 16, 2021, 01:22:39 AM
The only remaining problem is with "XP Keywords".

Check the image from the attachment.

It is displayed wrongly both in the console and in the file (after piping the output).

I just added two properties (Tags, Comments) to the file with File Explorer (see the screenshot).


(https://i.imgur.com/T9XzzeM.png)

Notepad++
(https://i.imgur.com/cvTFXPk.png)

Windows Notepad
(https://i.imgur.com/vLOCOsG.png)
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on June 16, 2021, 02:43:28 AM
In fact, they (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid utf8 bytes of the emoji.

But why are they not displayed correctly in text editors and consoles?
The same emoji display well in the "Keywords", "Last Keyword XMP", "Last Keyword IPTC" and "Subject" properties, but not in the "XP Keywords" property.

hexed.it
(https://i.imgur.com/u2B8mLl.png)

(https://i.imgur.com/2qutzKN.png)


---

The interesting point is that this seemingly valid UTF-8 text uses a UTF-16 surrogate pair for each emoji, but these bytes do not display correctly in most (all?) programs.

---

So, is it a bug? Or is the "XP Keywords" property a special case? To me it looks like a bug.
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on June 16, 2021, 04:18:11 AM
Yeah, technically you can use 5th and 6th bytes in UTF-8, but for compatibility reasons RFC 3629 (2003) forbids this.
That explains why no program can display the "XP Keywords" property properly if it contains emoji.

The same problem exists with "XP Comment", "XP Author" and "XP Keywords".
It looks like none of the "XP" properties use proper UTF-8 encoding.
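A closer look at the bytes: ED A0 BD ED B6 BC is two three-byte sequences, each encoding one UTF-16 surrogate (U+D83D and U+DDBC) on its own. This form is often called CESU-8, and RFC 3629 does not allow surrogate code points in UTF-8, which is why strict decoders reject it. The Python sketch below shows one way to recover the intended characters from such output; `repair_surrogate_utf8` is a hypothetical helper and is not something ExifTool provides.

```python
def repair_surrogate_utf8(raw: bytes) -> str:
    """Decode bytes in which each UTF-16 surrogate was encoded as its own
    3-byte sequence, and join the surrogate halves into real characters."""
    # 'surrogatepass' lets the UTF-8 codec accept surrogate code points.
    halves = raw.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each high/low pair into one character.
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


first = repair_surrogate_utf8(bytes.fromhex("EDA0BDEDB6BC"))
second = repair_surrogate_utf8(bytes.fromhex("EDA0BDEDB381"))
```

The two results are U+1F5BC and U+1F4C1, whose standard UTF-8 forms are F0 9F 96 BC and F0 9F 93 81.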
Title: Re: No support for unicode surrogates | emoji
Post by: StarGeek on June 16, 2021, 10:11:29 AM
Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is setting default Windows code page to 65001 (UTF-8) in the Windows' Region settings.

You need enable "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution.  It may display strange characters in some programs, especially older programs.  It's only visual though.

For example Ditto clipboard manager (https://ditto-cp.sourceforge.io).  There's supposed to be two leading spaces here
(https://i.imgur.com/tFjJIfh.png)
Title: Re: No support for unicode surrogates | emoji
Post by: StarGeek on June 16, 2021, 10:25:37 AM
Quote from: Anonan on June 16, 2021, 02:43:28 AM
In fact it (ED A0 BD ED B6 BC; ED A0 BD ED B3 81) are valid utf8 bytes of the emoji.

But why they are not display correctly in the text editors and the consoles?
While the same emoji display well in "Keywords", "Last Keyword XMP", "Last Keyword IPTC", "Subject" properties, but not in "XP Keywords" property.
...
So, is it a bug? Or is "XP Keywords" property such special one? For me it looks like a bug.

Quote from: Anonan on June 16, 2021, 04:18:11 AM
The same problem is with "XP Comment", "XP Author", "XP Keywords".
It looks all "XP" properties use not proper UTF-8 encoding.

It may have to do with the last line in the EXIF section in FAQ #10 (https://exiftool.org/faq.html#Q10), especially if the rest of the EXIF data is big-endian.  Just a guess.

     The EXIF "XP" tags (XPTitle, XPComment, etc) are always stored internally as little-endian Unicode (UCS‑2), and are read and written using the specified external character set.
Title: Re: No support for unicode surrogates | emoji
Post by: Anonan on June 17, 2021, 02:55:29 AM
The default character set is UTF-8, and the program returns UTF-8 that is "not valid" (according to RFC 3629) for characters with code points of 0x10000 and above.
I think I am just the first person who, purely as a test, wrote a character that is encoded with a UTF-16 surrogate pair to an XP tag.
99.9+% of people do not face this problem.
Title: Re: No support for unicode surrogates | emoji
Post by: Martin Z on May 13, 2023, 09:47:17 AM
Quote from: StarGeek on June 16, 2021, 10:11:29 AM
Quote from: Anonan on June 16, 2021, 01:01:14 AM
Okay. The workaround that solves this problem is setting default Windows code page to 65001 (UTF-8) in the Windows' Region settings.

You need enable "Beta: Use Unicode UTF-8 for worldwide language support" checkbox and reboot the PC.

Yep, I regularly mention this in these forums as a solution.  It may display strange characters in some programs, especially older programs.  It's only visual though.

For example Ditto clipboard manager (https://ditto-cp.sourceforge.io/).  There's supposed to be two leading spaces here
(https://i.imgur.com/tFjJIfh.png)



I encountered this issue today (EXIFtool 12.62)...
- O/S: Windows 11
- Windows Unicode UTF-8 beta feature enabled: Yes
- Codepage: 65001

I tried with and without the ' -charset filename=UTF-8' parameter, but in both cases the output from EXIFtool was...

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";"
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate"
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original *

> EXIFTool -csv="D:\EXIFMetadata.csv" -e -d "%d/%m/%Y %H:%M:%S" -sep ";"
  "-AllDates<CreateDate" "-FileModifyDate<CreateDate" "-FileCreateDate<CreateDate"
  -progress:"Writing metadata: %p%  [%f]" -overwrite_original -charset filename=UTF-8 *

-------------------------------------------------------------

Error: [Win32::FindFile] No support for unicode surrogates - *
No matching files

Any help would be greatly appreciated!
-- Thanks, Martin
Title: Re: No support for unicode surrogates | emoji
Post by: StarGeek on May 13, 2023, 11:19:12 AM
What is the name of the problem file?
Title: Re: No support for unicode surrogates | emoji
Post by: Martin Z on May 16, 2023, 02:09:25 PM
I'm not sure exactly... there are 1,000+ images in the folder (and the CSV) but the only output from EXIFtool was literally just those 2 lines.

Most filenames are plain ASCII (or at least nothing notably unusual) but some of the files have emoji in their names. Is there a way I can get more info from EXIFtool?... Get it to tell me which file(s) specifically it can't process?
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on May 16, 2023, 02:48:55 PM
It looks like the error is inside the Win32::FindFile library when trying to expand "*" into a list of file names.  There is no way to get more information about this from ExifTool.  I suggest a binary search to find a file and/or directory which generates this problem.  It is unlikely that I will be able to find a solution, but we may be able to come up with some work-around.

- Phil
Title: Re: No support for unicode surrogates | emoji
Post by: Martin Z on May 16, 2023, 04:54:43 PM
Quote from: Phil Harvey on May 16, 2023, 02:48:55 PM
I suggest a binary search to find a file and/or directory which generates this problem.
Thanks Phil.

When you say a binary search, I'm not quite sure what you mean I'm afraid (I have a few ideas but omitted for brevity)... In the interim, I was thinking I can split the files into sub-groups, e.g. put all the plain ASCII files in folder A, those with emoji in folder B, test folder B... then sub-divide further, etc -- assuming that there is a specific character that is causing the issue, this would find it through repeated trial and error... That approach is obviously a bit time-consuming / laborious but can try to do that at some point (just not sure about ETA).

EDIT: Actually, I'll bulk-write, individual file tests (using a template) - that should narrow it down faster (again, assuming it's just a small number of files causing the error)
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on May 16, 2023, 08:32:37 PM
Binary search:

Move half of the files into another folder.  Find the folder that still has the problem.  Repeat for this folder.

You'll find a specific file that causes the problem after 10 iterations for 1000+ files.

- Phil
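The halving procedure can be written down as a short Python sketch. The names `find_failing_file` and `batch_fails` are hypothetical; `batch_fails` stands in for whatever check you use, for example moving a batch into a folder and seeing whether ExifTool reports the error there. The sketch assumes that a batch fails whenever it contains at least one problem file.

```python
def find_failing_file(files, batch_fails):
    """Return one file that makes batch_fails(batch) true, by repeated halving.

    Assumes batch_fails(files) is true for the full list and that a batch
    fails whenever it contains at least one problem file.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still shows the problem.
        candidates = half if batch_fails(half) else candidates[len(half):]
    return candidates[0]
```

Each pass halves the candidate list, so about 1000 files need roughly ten checks. If several files are affected, this finds one of them; fix or move it and repeat.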
Title: Re: No support for unicode surrogates | emoji
Post by: Martin Z on May 22, 2023, 04:46:33 PM
Quote from: Phil Harvey on May 16, 2023, 08:32:37 PM
Binary search: Move half of the files into another folder. Find the folder that still has the problem. Repeat for this folder.
- Phil
Hi Phil, apologies been busy the past few days...

Lol, coming back to this on a fresh (and perhaps clearer) mind, what you said makes perfect sense!

TBH, that is what I thought you meant, but I remember thinking at the time "he probably means X, but he might mean Y or Z" -- except now I can't think what Y or Z could have been, lol -- will just put it down to an extra glass of wine that night or something!



Update on unicode surrogate / emoji issue
Quote from: Phil Harvey on May 16, 2023, 02:48:55 PM
It looks like the error is inside the Win32::FindFile library when trying to expand "*" into a list of file names
- Yeah, I think this is almost certainly the issue...
- As I mentioned in the edit to my earlier post, I decided to generate 1,000+ commands (1 per file) and run those in a batch to hopefully give me the answer a lot sooner, i.e. identify the individual filenames (and so characters) that were failing...
- However, EXIFtool processed EVERY file fine, when the filename was fed directly into the command
- As a cross-check I put a single emoji-named file in a folder, tried to run EXIFtool with * and it failed on the single file

Possible (deeper) root cause
- Assuming Windows beta UTF8 system-wide support is enabled...
- Does EXIFtool support emoji characters?
- I have a feeling that some emoji characters are UTF16 (ones that were extended to cover multiple skin-tones and genders) but could be wrong!... Are UTF16 emoji a thing?
- If so, could this be what is causing problems?



PS #1 -- I know you know this already and have even commented on it before, so mainly for the benefit of other readers, be advised that running EXIFtool in a '1 cmd per file' way is *MUCH* slower (probably took >1 hour to go through the 1,000+ files).

PS #2 -- Considering the time spent above, and that this actually got all files processed, I didn't end up doing the binary search. If this would be helpful, please let me know.
Title: Re: No support for unicode surrogates | emoji
Post by: Phil Harvey on May 24, 2023, 10:28:01 AM
Quote from: Martin Z on May 22, 2023, 04:46:33 PM
- Does EXIFtool support emoji characters?

Not fully.

Quote
- I have a feeling that some emoji characters are UTF16 (ones that were extended to cover multiple skin-tones and genders) but could be wrong!... Are UTF16 emoji a thing?
- If so, could this be what is causing problems?

No idea.

- Phil