JSON export not supporting non ASCII filenames

Started by karlgustavv, May 28, 2014, 06:30:20 AM

karlgustavv

Windows 7 64 Bit
ExifTool version: 9.62


I am using ExifTool to export keywords which get processed by another program and are then written back.

Problems occur when a filename contains non-ASCII characters.

Example:
filename: "Café.jpg", Keyword: "Café"

exiftool -json -keywords *.jpg -charset UTF8 > keywords.json

produces:

[{
  "SourceFile": "Caf?.jpg",
  "Keywords": "Café"
}]


The filename "Caf?.jpg" is not UTF8 formatted.

If I want to import the JSON file back with

exiftool -json=keywords.json *.jpg

it says:

No SourceFile 'CafÚ.jpg' in imported JSON database
(full path: 'D:/_div/excluded/tags_temp/CafÚ.jpg')
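The "Ú" in that message is consistent with a code page mismatch: the byte that cp1252 uses for "é" (0xE9) is "Ú" in cp850. The short Python sketch below only illustrates that byte-level effect. That the console involved uses cp850 is an assumption, since the code page is not stated in this thread.

```python
# "é" is the single byte 0xE9 in cp1252.
raw = "Caf\u00e9.jpg".encode("cp1252")

# The same bytes interpreted under two different code pages.
as_cp1252 = raw.decode("cp1252")  # the intended name
as_cp850 = raw.decode("cp850")    # assumed console code page (hypothetical)

print(raw)
print(as_cp1252)
print(as_cp850)
```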


An older ExifTool version (9.45) says:

Caf?.jpg: No such file or directory at script/exiftool line 1497



Here are some other strange inconsistencies:


  • exiftool -json -keywords *.jpg -charset cp1252 > keywords.json

[{
  "SourceFile": "Café.jpg",
  "Keywords": "Café"
}]


"Café.jpg" is cp1252 formatted. Keywords are UTF8 formatted.



  • exiftool -csv -keywords *.jpg -charset cp1252 > keywords.csv

SourceFile,Keywords
Café.jpg,Café


Filename and keywords are cp1252 formatted, as expected.



  • exiftool -csv -keywords *.jpg -charset UTF8 > keywords.csv

SourceFile,Keywords
Café.jpg,Café


"Café.jpg" is cp1252 formatted. Keywords are UTF8 formatted.




Phil Harvey

The Windows file name issue is a known problem.

The difference between JSON and CSV is that JSON strings must be valid UTF-8, so the -charset option is ignored for JSON output (as per the documentation).

- Phil

Edit:  Hmmm.  I was just playing with this.  Apparently the -charset option disables the UTF-8 validity check when the -json option is used.  I will fix this.  It should have no effect on the JSON output.  This is why the invalid SourceFile UTF-8 is sneaking through when you use the -charset option.

The Windows file name is a problem.  I don't know its character encoding so all I can do is assume UTF-8.  So don't expect the SourceFile to be valid for file names containing non-UTF-8 special characters in Windows.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).
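To illustrate what a "UTF-8 validity check" means for the file name in this thread, here is a minimal Python sketch. It is not ExifTool's code (ExifTool is written in Perl), and the helper name is_valid_utf8 is made up for this example. It only shows that the cp1252 bytes of "Café.jpg" are not valid UTF-8, while the UTF-8 bytes are.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


cp1252_name = "Caf\u00e9.jpg".encode("cp1252")  # b'Caf\xe9.jpg'
utf8_name = "Caf\u00e9.jpg".encode("utf-8")     # b'Caf\xc3\xa9.jpg'

print(is_valid_utf8(cp1252_name))  # False: 0xE9 starts a multi-byte sequence that never completes
print(is_valid_utf8(utf8_name))    # True
```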

karlgustavv

Thank you for your fast answer.

So why can't I just specify the needed character encoding?

Like:

exiftool -json -keywords *.jpg -charset UTF8 -windowsCharset cp1252 > keywords.json

and ExifTool will read "Café.jpg" correctly. It then encodes it internally as UTF-8.

When one needs to write tags back:

exiftool -json=keywords.json -windowsCharset cp1252 *.jpg

This way it would be possible to use non-UTF-8 special characters in Windows.

Phil Harvey

The problem is that the JSON specification mandates that the strings all be valid UTF-8.

- Phil

karlgustavv

Sorry that I wasn't able to make it clear.

exiftool -json -keywords *.jpg -charset UTF8 -windowsCharset cp1252 > keywords.json

means that ExifTool reads filenames from the Windows filesystem as cp1252 (e.g. "Café.jpg"), then converts the string to UTF-8 ("Café.jpg") and stores it in the JSON file.

If you want to read it back via

exiftool -json=keywords.json -windowsCharset cp1252 *.jpg

ExifTool reads UTF8 "Café.jpg" from the JSON file and internally transforms it back to cp1252 "Café.jpg" and modifies this file.
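The round trip proposed here can be sketched in a few lines of Python. This only illustrates the idea and is not how ExifTool works internally. The function names and the fs_charset parameter are hypothetical, and -windowsCharset is the option being proposed in this thread, not an existing one.

```python
def filename_to_json(raw: bytes, fs_charset: str = "cp1252") -> bytes:
    """Export: decode a filesystem name and re-encode it as UTF-8 for the JSON file."""
    return raw.decode(fs_charset).encode("utf-8")


def filename_from_json(utf8: bytes, fs_charset: str = "cp1252") -> bytes:
    """Import: decode the UTF-8 name from the JSON file back to the filesystem charset."""
    return utf8.decode("utf-8").encode(fs_charset)


on_disk = b"Caf\xe9.jpg"             # "Café.jpg" as cp1252 bytes
in_json = filename_to_json(on_disk)  # "Café.jpg" as UTF-8 bytes
restored = filename_from_json(in_json)

print(in_json)
print(restored == on_disk)
```

The conversion is lossless as long as every character in the name exists in the chosen filesystem charset.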

Phil Harvey

Sorry, I didn't read carefully enough.  You're suggesting adding a new -windowsCharset option.  Just to get the FileName encoded correctly?  Yuck.

I would be more enthusiastic about this if it wasn't for the general lack of support for special characters in Windows file names.

- Phil

karlgustavv

I know that this would be somewhat strange. But it would allow users to use JSON export and import with filenames like "Café.jpg". I thought that this might be worth it.

At the moment a

exiftool -json -keywords *.jpg > keywords.json

for a folder that contains files with names like "Café.jpg" would result in a JSON file that contains "?" for each non ASCII character in filenames.

therefore an import via

exiftool -json=keywords.json *.jpg

does not work because ExifTool does not find the file "Caf?.jpg". That will make JSON useless if you have files with non ASCII characters in their names. This is even stranger because (as you said) "JSON specification mandates that the strings all be valid UTF-8", so JSON would be able to store any filename correctly.

If you don't want to "surrender" to Windows in this respect, you could maybe allow ExifTool to "find" "Caf?.jpg" by interpreting the "?" as a wildcard (this would be a bit risky, as it would write to both "Café.jpg" and "Cafe.jpg").
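The risk with the wildcard idea can be shown with Python's standard fnmatch module, where "?" matches exactly one character. This is only an illustration of the ambiguity, not a suggestion of how ExifTool would implement it, and the list of names is invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory contents.
names = ["Caf\u00e9.jpg", "Cafe.jpg", "Cafes.jpg"]

# "?" matches any single character, so both "Café.jpg" and "Cafe.jpg" qualify.
matches = [n for n in names if fnmatchcase(n, "Caf?.jpg")]

print(matches)
```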

karlgustavv

You don't need a new -windowsCharset option. That was my fault.

JSON has to be UTF-8. So if someone specifies a -charset it should imply that this refers to the charset that is used by the operating system.

Therefore

exiftool -json -keywords *.jpg -charset cp1252 > keywords.json

should be enough for ExifTool to know how "Café.jpg" is encoded (as cp1252). It could then write "Café.jpg" to the JSON file.

If you want to read it back via

exiftool -json=keywords.json -charset cp1252 *.jpg

ExifTool reads UTF-8 "Café.jpg" from the JSON file and internally transforms it back to cp1252 "Café.jpg" and modifies this file.

Phil Harvey

Quote from: karlgustavv on May 28, 2014, 02:38:07 PM
therefore an import via

exiftool -json=keywords.json *.jpg

does not work because ExifTool does not find the file "Caf?.jpg".

On Windows, it can't necessarily even find "Café.jpg" either.  Whether or not this works depends on some system settings that I don't understand.

Quote: That will make JSON useless if you have files with non ASCII characters in their names.

In general, ExifTool is useless on Windows with non-ASCII characters.  (Did you read the Known Problems?)  I have tried to find a solution for this, but so far I have only found partial solutions.

One work-around is to use the 8.3 names.

For more reading:

https://exiftool.org/forum/index.php/topic,3155.0.html
https://exiftool.org/forum/index.php/topic,5618.0.html
https://exiftool.org/forum/index.php/topic,2394.0.html
https://exiftool.org/forum/index.php/topic,3224.0.html
https://exiftool.org/forum/index.php/topic,4029.0.html
https://exiftool.org/forum/index.php/topic,3565.0.html
https://exiftool.org/forum/index.php/topic,4649.0.html
https://exiftool.org/forum/index.php/topic,2677.0.html
https://exiftool.org/forum/index.php/topic,5193.0.html
http://stackoverflow.com/questions/4232397/perl-managing-path-encoding-on-windows

- Phil

karlgustavv

Thank you for being patient with me.

My solution is to use a "bug" of ExifTool:

exiftool -json -keywords *.jpg -charset cp1252 > keywords.json

produces

[{
  "SourceFile": "Café.jpg",
  "Keywords": "Café"
}]


This is not a valid JSON file ("Café.jpg" is NOT UTF-8), but it stores the filename the way I need it and all other tags as UTF-8.

To apply it back

exiftool -json=keywords.json *.jpg

works without specifying a charset.

Now I hope that you will keep this bug ;)
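For anyone whose in-between program has to read such a mixed-encoding file, one possible approach is to decode it line by line, trying UTF-8 first and falling back to cp1252. The Python sketch below assumes each tag sits on its own line, as in the output above, and that no single line mixes both encodings. The helper name decode_mixed is made up for this example.

```python
import json


def decode_mixed(line: bytes, fallback: str = "cp1252") -> str:
    """Decode one line as UTF-8, falling back to another charset if that fails."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        return line.decode(fallback)


# The file contents from above: SourceFile in cp1252, Keywords in UTF-8.
raw = (
    b'[{\n'
    b'  "SourceFile": "Caf\xe9.jpg",\n'
    b'  "Keywords": "Caf\xc3\xa9"\n'
    b'}]\n'
)

text = "".join(decode_mixed(line) for line in raw.splitlines(keepends=True))
records = json.loads(text)

print(records[0]["SourceFile"], records[0]["Keywords"])
```

This is a heuristic: a cp1252 line that happens to be valid UTF-8 would be decoded as UTF-8, so it is only safe when the two encodings can be told apart, as they can for "é".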

Phil Harvey

I will have to think about this.  I had already patched this bug.

- Phil

Phil Harvey

OK.  I'll leave the behaviour the way it was, and change the documentation accordingly:

            The JSON output is UTF-8 regardless of any -L or -charset
            option setting, but the UTF-8 validation is disabled if a
            character set other than UTF-8 is specified.


- Phil

karlgustavv

That's great news.

"It's not a bug it's a feature."

Thank you
