No support for unicode surrogates | emoji

Started by Anonan, January 01, 2019, 01:58:36 PM


Martin Z

#45
I'm not sure exactly... there are 1,000+ images in the folder (and in the CSV), but the only output from ExifTool was literally just those 2 lines.

Most filenames are plain ASCII (or at least nothing notably unusual), but some of the files have emoji in their names. Is there a way I can get more info from ExifTool? Can I get it to tell me which specific file(s) it can't process?

Phil Harvey

It looks like the error is inside the Win32::FindFile library when trying to expand "*" into a list of file names.  There is no way to get more information about this from ExifTool.  I suggest a binary search to find a file and/or directory which generates this problem.  It is unlikely that I will be able to find a solution, but we may be able to come up with some work-around.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Martin Z

Quote from: Phil Harvey on May 16, 2023, 02:48:55 PM
I suggest a binary search to find a file and/or directory which generates this problem.

Thanks Phil.

When you say a binary search, I'm afraid I'm not quite sure what you mean (I have a few ideas, omitted for brevity)... In the interim, I was thinking I could split the files into sub-groups, e.g. put all the plain ASCII files in folder A and the emoji-named files in folder B, test folder B, then sub-divide further, etc. Assuming there is a specific character causing the issue, this would find it through repeated trial and error... That approach is obviously a bit time-consuming / laborious, but I can try it at some point (just not sure about the ETA).

EDIT: Actually, I'll bulk-generate individual per-file tests (using a template). That should narrow it down faster (again, assuming it's just a small number of files causing the error).

Phil Harvey

Binary search:

Move half of the files into another folder.  Find the folder that still has the problem.  Repeat for this folder.

You'll find a specific file that causes the problem after about 10 iterations for 1000+ files (each step halves the set, and 2^10 = 1024).
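
For readers who would rather script the halving than move files by hand, here is a minimal Python sketch of the same idea. It is not part of ExifTool: `find_failing_file` and the `fails` callback are hypothetical names. `fails(subset)` stands in for "put these files in a folder, run the command, and report whether the error appears". The sketch assumes at least one file triggers the problem; if several do, it finds only one of them.

```python
from typing import Callable, List, Sequence, Tuple


def find_failing_file(
    files: Sequence[str],
    fails: Callable[[List[str]], bool],
) -> Tuple[str, int]:
    """Halve `files` repeatedly until one problem file is left.

    `fails(subset)` is a caller-supplied (hypothetical) check that returns
    True when processing `subset` reproduces the error.
    Returns the culprit and the number of halving steps used.
    """
    candidates = list(files)
    if not candidates:
        raise ValueError("no files to search")
    steps = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Keep whichever half still shows the problem.
        candidates = first_half if fails(first_half) else candidates[mid:]
        steps += 1
    # Confirm the survivor really fails on its own.
    if not fails(candidates):
        raise ValueError("no single file reproduces the problem")
    return candidates[0], steps


# Stand-in demo: pretend one file out of 1000 triggers the error.
names = [f"img{i:04d}.jpg" for i in range(1000)]
culprit, steps = find_failing_file(names, lambda subset: "img0637.jpg" in subset)
print(culprit, steps)
```

With 1000 names the loop needs at most 10 halving steps, which matches the estimate above.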

- Phil

Martin Z

Quote from: Phil Harvey on May 16, 2023, 08:32:37 PM
Binary search: Move half of the files into another folder. Find the folder that still has the problem. Repeat for this folder.
- Phil

Hi Phil, apologies, I've been busy the past few days...

Lol, coming back to this with a fresh (and perhaps clearer) mind, what you said makes perfect sense!

TBH, that is what I thought you meant, but I remember thinking at the time "he probably means X, but he might mean Y or Z" -- except now I can't think what Y or Z could have been, lol -- will just put it down to an extra glass of wine that night or something!



Update on unicode surrogate / emoji issue
Quote from: Phil Harvey on May 16, 2023, 02:48:55 PM
It looks like the error is inside the Win32::FindFile library when trying to expand "*" into a list of file names
- Yeah, I think this is almost certainly the issue...
- As I mentioned in my later post, I decided to generate 1,000+ commands (1 per file) and run those in a batch to hopefully give me the answer a lot sooner, i.e. identify the individual filenames (and so the characters) that were failing...
- However, ExifTool processed EVERY file fine when the filename was fed directly into the command
- As a cross-check, I put a single emoji-named file in a folder and tried to run ExifTool with *, and it failed on that single file
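
The "one command per file" approach described above could be sketched in Python roughly as follows. This is an illustration, not the template actually used in this thread: `per_file_commands` and `failed_files` are hypothetical helper names, the bare `exiftool FILE` invocation is a placeholder for whatever arguments the real command needs, and treating a non-zero exit status as "failed" is an assumption. Because Python lists the directory itself, the `*` expansion is never used.

```python
import subprocess
from pathlib import Path


def per_file_commands(folder, tool="exiftool"):
    """Return one argument list per regular file in `folder`, sorted by name.

    The directory is listed by Python, so no wildcard is passed to the tool.
    """
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    return [[tool, str(p)] for p in files]


def failed_files(commands, run=subprocess.run):
    """Run each command and collect the file names whose run reported an error.

    Assumption: a non-zero exit status means the file could not be processed.
    """
    failures = []
    for cmd in commands:
        result = run(cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(cmd[-1])
    return failures
```

A call such as `failed_files(per_file_commands("DIR"))` would then list the files that failed, where DIR is the image folder. Starting the tool once per file is slow, so this is a diagnostic rather than a workflow.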

Possible (deeper) root cause
- Assuming the Windows beta system-wide UTF-8 support is enabled...
- Does ExifTool support emoji characters?
- I have a feeling that some emoji characters are UTF-16 (the ones that were extended to cover multiple skin tones and genders), but I could be wrong!... Are UTF-16 emoji a thing?
- If so, could this be what is causing the problems?
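
On the UTF-16 question: most emoji have code points above U+FFFF, and UTF-16 (which Windows uses for file names) stores each such character as two 16-bit units, a surrogate pair. The short Python sketch below only illustrates that encoding fact. It does not show that surrogates are what trips up the wildcard expansion, and `utf16_units` and `has_astral_chars` are hypothetical helper names.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def has_astral_chars(name):
    """True if `name` contains a character that needs a surrogate pair in UTF-16."""
    return any(ord(ch) > 0xFFFF for ch in name)


# U+1F600 (grinning face) becomes the surrogate pair D83D DE00.
print([hex(u) for u in utf16_units("\U0001F600")])  # ['0xd83d', '0xde00']

# A thumbs-up plus a skin-tone modifier is two code points, so four units.
print(len(utf16_units("\U0001F44D\U0001F3FD")))  # 4

# An accented Latin letter fits in a single unit.
print([hex(u) for u in utf16_units("\u00e9")])  # ['0xe9']
```

Running `has_astral_chars` over the file names in a folder would be one way to separate the candidates for the split-and-test approach discussed earlier.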



PS #1 -- I know you know this already and have even commented on it before, so mainly for the benefit of other readers, be advised that running EXIFtool in a '1 cmd per file' way is *MUCH* slower (probably took >1 hour to go through the 1,000+ files).

PS #2 -- Considering the time spent above, and that this actually got all the files processed, I didn't end up doing the binary search. If it would still be helpful, please let me know.

Phil Harvey

Quote from: Martin Z on May 22, 2023, 04:46:33 PM
- Does EXIFtool support emoji characters?

Not fully.

Quote
- I have a feeling that some emoji characters are UTF-16 (the ones that were extended to cover multiple skin tones and genders), but I could be wrong!... Are UTF-16 emoji a thing?
- If so, could this be what is causing the problems?

No idea.

- Phil