No support for unicode surrogates | emoji

Started by Anonan, January 01, 2019, 01:58:36 PM

Anonan

Currently I get an error if there is at least one file with such a name (I use a command like "exiftool * > out.txt"), and no useful work is done.
It looks like the program first collects the names of all files and parses them, and only then, if there was no exception, works with the parsed file names (and the real files). Is that right?

If so, when an exception is thrown it would be enough to write the file name to stderr and continue parsing the remaining file names.
(Useful output would start after all file names have been parsed, so all "broken" file names would already be listed on stderr before the first metadata is written to stdout.)

That's my guess.

Phil Harvey

Quote from: Anonan on January 02, 2019, 01:04:58 PM
It looks like the program gets the names of all files first and parses them.

I think this is true for each FILE argument on the command line.  But if you specify multiple FILE arguments then the files are processed before considering the next argument.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Anonan

#17
I have tested the program some more.
I used this simple command:
exiftool -r *

And I have the following folder structure:

[folder1 (run the program here)]
     pic1.jpg
     pic2.jpg
     pic3.jpg
     [sub_folder2]
         pic3.jpg
         pic4.jpg


At the moment the program works like this:
Parse all names (not only files, folders too) in folder1.
If the names are "clean" (they contain no Unicode surrogate pair), process the files in this folder (output their metadata).
If the name of any file or subfolder contains a Unicode surrogate pair, throw an exception; the other files are not processed at all.
Then go into the subfolders and repeat this algorithm.


So, without changing the program logic, ExifTool cannot provide a full list of the files whose names contain a Unicode surrogate pair before it starts processing some of the files, if there is a subfolder in the source folder.



Anonan

Quote from: ... on January 02, 2019, 04:50:40 PM
So, in this way ExifTool theoretically (without changing the program logic) can not provide a full list with files contain unicode surrogate pair before processes the files (a part of), if there is a subfolder in the source folder.

But I think it would not be hard to do with a new option (something like -fulltreebypassfirst, not a great name) that tells the program to traverse the full file tree first
(in this context, that gives the full list of files whose names contain a Unicode surrogate pair before any file is processed to extract metadata) and then work in the normal way.
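
The "traverse the full tree first" idea can be approximated outside ExifTool. The following is a minimal Python sketch, not ExifTool's implementation; the function names needs_surrogate_pair and scan_tree are hypothetical. It assumes that a name "contains a surrogate pair" exactly when it has a character above U+FFFF, since such characters take two UTF-16 code units.

```python
import os


def needs_surrogate_pair(name: str) -> bool:
    """True if any character in name lies outside the Basic Multilingual Plane.

    Such characters (most emoji, for example) are stored as a surrogate
    pair in UTF-16, the encoding Windows uses for file names.
    """
    return any(ord(ch) > 0xFFFF for ch in name)


def scan_tree(root: str) -> list[str]:
    """Walk the whole tree under root and return the affected paths.

    Both file and folder names are checked; no file is opened or read.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if needs_surrogate_pair(name):
                hits.append(os.path.join(dirpath, name))
    return hits
```

Calling scan_tree(".") in the source folder returns every affected path before any metadata is extracted. Printing each result with ascii(path) avoids encoding errors on consoles that cannot display the emoji.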

Phil Harvey

This is actually similar to what the -progress option does.  But I have to find some time to look into this in more detail.

- Phil

Anonan

I have updated to version 11.24.
And now the program behaves very strangely (when a file name contains a surrogate pair).

I have created some folders for testing; see the attachment.


I used CMD and Git Bash on Windows 10 (Russian).
My console output (from the .bat and .sh files) is also in the corresponding folder.

Phil Harvey

What are you calling strange?  The file that ExifTool reads as APPLE_~1.JPG in example2?  This is the behaviour of the standard library, which I am using if Win32::FindFile fails.  Apparently the standard library falls back to using Windows short filenames for the files with Unicode characters, but I think this depends on your system settings.  I agree that this is strange.  From this post by StarGeek it seems you can see these 8.3 filenames with dir /x.

- Phil

Anonan

For example:

I use exiftool.exe -FileName * -r

1.
> ex1cmd.png and ex1cmd_cp866
https://imgur.com/a/LH4dG0p

After processing the file with a surr name (a name that contains a surrogate pair), the program can't work with a file in the same folder whose name contains Cyrillic characters.
It writes "Invalid filename encoding" and "Error opening directory".

2.
> ex2cmd.png and ex2cmd_cp866
> ex2bash.png
https://imgur.com/a/NTETU5x

The program does not work at all if a file with a surr name exists in the root folder and I use CMD; it handles "*" incorrectly in this case.
But if I use Bash, this file is skipped with an Error, and the program continues to work.


3.
> ex2bash.png
https://i.imgur.com/Qwt9r48.png

The Warning contains only the folder name, without the file name. (The Error looks normal: it shows the file name, like "apple_??_apple.jpg".)


4.
> ex0bash.png, ex1bash.png, ex2bash.png
https://imgur.com/a/eMze8B1

The charset used for Cyrillic file names in the program's output can vary when I use Git Bash.
Some file names are output in a charset that displays them correctly; others are output in a charset that displays them incorrectly.


5. The ex*bash_ls.png and ex*cmd_dir.png pictures show that both consoles can display the file names correctly by themselves.

Phil Harvey

OK.  So apparently the 11.24 patch wasn't much help.  Was the previous behaviour better?

The globbing of filenames with wildcards in Windows is a problem that I may not be able to solve.  I have always recommended avoiding the use of wildcards in file names.  Does the situation improve if you do this instead?:

exiftool.exe -filename -ext "*" -r .

- Phil

Anonan

#24
Well, the dot at the end of the command does the trick.
exiftool.exe -FileName -r -ext * .
I didn't even notice it at first.
Now CMD and Git Bash behave almost the same when a file with a surr name exists in the root folder, except that Git Bash has the problem with Cyrillic names (see my previous post).

Phil Harvey

I went to add a note about this to the common mistakes documentation, and discovered it was already there (common mistake 2f).  I had forgotten about this, but the problem was worse with surrogates in the name.

- Phil

Anonan

So.
I have the following structure (where % stands for the apple emoji):

root_apple_%_apple.jpg
root_green.jpg
root_синий.jpg
folder1
    folder1_green.jpg
    folder1_синий.jpg
folder2
    sub_apple_%_apple.jpg
    sub_blue.jpg
    sub_синий.jpg
    sub_%.jpg


I run
on CMD: exiftool.exe -FileName -s -r -progress -j -ext * . > o.json
on Bash: exiftool.exe -FileName -s -r -progress -j * > o2.json


CMD shows me:
Warning: [Win32::FindFile] No support for unicode surrogates - .
Warning: [Win32::FindFile] No support for unicode surrogates - ./folder2

Bash shows me:
Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg
Warning: [Win32::FindFile] No support for unicode surrogates - folder2



CMD trimmed output:

  "SourceFile": "./folder1/folder1_green.jpg",
  "SourceFile": "./folder1/folder1_синий.jpg",
  "SourceFile": "./folder2/SUB_AP~1.JPG",    // !1 - sub_apple_??_apple.jpg
  "SourceFile": "./folder2/sub_blue.jpg",
  "SourceFile": "./folder2/sub_??^??.jpg",   // !2 - sub_синий.jpg (I have replaced one "?" with "^")
  "SourceFile": "./folder2/SUB_~2.JPG",      // !1 - sub_??.jpg
  "SourceFile": "./ROOT_A~1.JPG",            // !1 - root_apple_??_apple.jpg
  "SourceFile": "./root_green.jpg",
  "SourceFile": "./root_??^??.jpg",          // !2 - root_синий.jpg | In the console I see "root_ёшэшщ.jpg" for cp866 or "root_.jpg" for cp65001


Bash trimmed output:

  "SourceFile": "folder1/folder1_green.jpg",
  "SourceFile": "folder1/folder1_синий.jpg",
  "SourceFile": "folder2/SUB_AP~1.JPG",   // !1
  "SourceFile": "folder2/sub_blue.jpg",
  "SourceFile": "folder2/sub_??^??.jpg",  // !2
  "SourceFile": "folder2/SUB_~2.JPG",     // !1
                                          // !3
  "SourceFile": "root_green.jpg",
  "SourceFile": "root_??^??.jpg"          // !2


If I run Bash with exiftool.exe -FileName -s -r -progress -j -ext "*" . > o2.json, the result is similar to the CMD result.


If you want to test it yourself, download the attachment.

I will add my comments on this later.


Anonan

#27
Additional bug.

This concerns Git Bash on Windows (UTF-16):

This command works fine:
exiftool.exe -FileName -r -ext "*" .

But this command does not:
exiftool.exe -FileName -r *
Non-ASCII file names in the root folder are printed with an incorrect encoding (ANSI, although it should be UTF-8 like the other data).



And if I use -json, all data* in ANSI encoding** is lost; it is just replaced by five "?" (??_?? (I can't post this as it is because smileys get inserted automatically)).



*In my case, Cyrillic characters.
**Except characters that are in the ASCII charset, e.g. Latin characters.

Anonan

#28
Now let us return to my last big post.

All ??.?.?? (five ? in a row) are data that was lost when an ANSI-encoded string was written to the JSON file.

Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names in it will be encoded with ANSI encoding (other data is encoded as UTF-8). This leads to mojibake.
https://unicodebook.readthedocs.io/definitions.html#mojibake

And if you use -json, the characters in this ANSI string that are not in the ASCII charset are just replaced by ??.??.?.
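
The mojibake described here can be reproduced in a few lines of Python. This is a sketch under two assumptions that the thread does not confirm: that the "ANSI" code page on this Russian system is cp1251, and that the "?" characters come from replacing every non-ASCII byte. Under the first assumption, reading the cp1251 bytes of "синий" as cp866 yields exactly the "ёшэшщ" reported for the cp866 console.

```python
name = "синий"

# Assumption: the system ANSI code page is cp1251 (Windows Cyrillic).
ansi_bytes = name.encode("cp1251")   # 5 bytes, one per letter
utf8_bytes = name.encode("utf-8")    # 10 bytes, two per letter

# Reading the ANSI bytes with the wrong code page gives mojibake.
as_cp866 = ansi_bytes.decode("cp866")   # "ёшэшщ"

# Assumption: every byte that is not ASCII is replaced by "?".
as_question_marks = "".join(
    chr(b) if b < 0x80 else "?" for b in ansi_bytes
)   # "?????"
```

The five question marks match the five Cyrillic letters of the name, which is consistent with a one-byte-per-letter ANSI string rather than UTF-8.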

Anonan

#29
The error message "Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg" contains a readable file name; I think the warning should show the same information.
Currently a warning shows only the folder that contains one or more files with a surr name. That's not good. What if I have only one such file among a thousand files? It will be hard to find. [1]

Is there a way to avoid converting a file name from sub_apple_??_apple.jpg to SUB_AP~1.JPG?
But if the program listed all these files [see 1], it would be possible to fix their names manually (e.g. to remove the surrogate pair).
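
Until the warning names the affected files, one possible workaround is to look for 8.3-style short names in the JSON output, since entries such as SUB_AP~1.JPG appear where the long name could not be read. The following Python sketch is a heuristic, not an ExifTool feature: looks_like_short_name and short_name_entries are hypothetical helpers, and the pattern also matches a file that really is named like NAME~1.JPG. It assumes the output is a JSON array of objects with a "SourceFile" key, as in the trimmed listings earlier in the thread.

```python
import json
import re

# Up to 6 base characters, "~", digits, then an optional extension of
# up to 3 characters; no lowercase letters anywhere.
SHORT_NAME = re.compile(r"^[^a-z./\\~]{1,6}~\d+(\.[^a-z./\\]{0,3})?$")


def looks_like_short_name(path: str) -> bool:
    """Heuristically detect a Windows 8.3 short name in the last path part."""
    base = path.replace("\\", "/").rsplit("/", 1)[-1]
    return SHORT_NAME.match(base) is not None


def short_name_entries(json_text: str) -> list[str]:
    """Return the SourceFile values that look like 8.3 short names."""
    return [
        entry["SourceFile"]
        for entry in json.loads(json_text)
        if looks_like_short_name(entry.get("SourceFile", ""))
    ]
```

The returned paths can then be matched to their long names with dir /x in the corresponding folder.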