ExifTool Forum

ExifTool => Archives => Topic started by: Archive on May 12, 2010, 08:54:21 AM

Title: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by robome on 2008-06-20 13:33:24-07]

Hi,

I'm trying to write a string containing an umlaut into the XMP data of an image on Windows.
I'm using

-xmp:CreatorContactInfoCiAdrCity=München

From a UTF-8 encoded batch file, the umlaut ends up as two plus signs (0x2b) in the XMP data. From an ISO-8859-1 encoded batch file, the umlaut becomes 0xb3 in the XMP data. And calling it directly from the command line results in an ISO-8859-1 encoded umlaut (0xfc) in the XMP data.

What else can I do to get characters > 127 written correctly (that is, as UTF-8 according to the spec) in XMP data?

Robert
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by exiftool on 2008-06-20 13:48:19-07]

This works fine for me.  In my console, ü is encoded
in UTF-8 as \303\274 (although I can paste the ü from
here into my console without trouble).  When I write this
to xmp:CreatorContactInfoCiAdrCity it works fine.
Are you sure your UTF-8 encoded file contains the sequence
\303 \274 (0xc3 0xbc)?  If so, it should work.  I can't comment
on the special character handling in a Windows shell, but
it is natively Windows Latin1, so if you can generate a
ü in this shell, you should be able to write it OK
assuming you use the exiftool -L option.

ExifTool will pass the text straight through to XMP without
translation unless you use -L, so if it isn't right,
you aren't passing the correct thing to exiftool.

- Phil
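
The conversion described above can be sketched in a few lines of Python. This is only a model of the behaviour as stated in this thread (pass-through by default, Latin1 to UTF-8 with -L), not ExifTool's actual code; the function names are made up for illustration.

```python
# Sketch of the two behaviours described in the post above.
# Assumption: -L means "treat the input as Latin1 and store it as UTF-8".
# The function names are hypothetical, not part of ExifTool.

def pass_through(arg_bytes: bytes) -> bytes:
    """Without -L: the bytes are written to XMP unchanged."""
    return arg_bytes

def latin1_to_utf8(arg_bytes: bytes) -> bytes:
    """With -L: decode as Latin1, re-encode as UTF-8."""
    return arg_bytes.decode("latin-1").encode("utf-8")

u_latin1 = b"\xfc"                  # ü in Latin1 / ISO-8859-1
u_utf8 = latin1_to_utf8(u_latin1)

print(u_utf8)                       # b'\xc3\xbc'
print([oct(b) for b in u_utf8])     # ['0o303', '0o274']
print(pass_through(u_latin1))       # b'\xfc' (not valid UTF-8 on its own)
```

So a console that delivers 0xfc needs -L, while a console that already delivers 0xc3 0xbc does not.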
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by robome on 2008-06-21 08:59:27-07]

Ah yes, the -L switch. I missed that one.

OK, when I type ü directly on the console and use -L, it's written as UTF-8 to the file. That's a good step forward.

But still, I'd like to get that into a batch file since I need it - together with other fields - to go into every file I give away. A batch file with ISO coded ü (0xfc) and -L generates 0xc2 0xb3, huh?

And yes, verified with a hex editor: my UTF-8 batch umlaut really is 0xc3 0xbc but generates pluses (it also does with -L, but that option shouldn't help here anyway).

So the batch thing must be some Windows issue then. :-/

Thanks,

Robert
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by exiftool on 2008-06-21 12:19:45-07]

Very odd.  It sounds like Windows may be doing some translation
of characters when parsing batch files.  I'm afraid I can't help
with this.  Try taking exiftool out of the equation, and just use
a batch file with the echo command to print a ü to the console.
If you can do this, then you should be able to pass it to exiftool
with the -L option (assuming the console character set is Latin1).

- Phil
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by robome on 2008-06-22 10:59:26-07]

Oh, oh, batch files are ancient stuff. I just discovered that this command line, even on Windows XP, is very DOSish. It expects characters to be in OEM codepage 437 encoding, and while this is true for everything you type on the command line, it normally isn't for files created with Windows editors.
Interestingly, there seems to be an implicit OEM->ISO-8859-1 conversion when the command line (or a batch file) calls an application (here a Perl script).

So the ü has to be 0x81 in the batch file; it is converted to 0xfc when handed over to Perl, where exiftool with -L produces the correct UTF-8 bytes for the file.

Unfortunately the reverse conversion doesn't happen when a Perl script outputs to the console, so "München" is always "M³nchen". Until today I never investigated the reason for that.

I'm somewhat glad Unix command lines nowadays know how to handle modern encodings including UTF-8.

Robert
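
The byte values reported in this thread can be reproduced with a small Python model of the OEM-to-ANSI step. It is a sketch under assumptions: the OEM code page is taken to be cp850 (the reported 0xb3 and "³" match cp850; in cp437 the byte 0xfc would be "ⁿ"), the ANSI code page is taken to be cp1252, and the file name image.jpg is a placeholder. Python substitutes "?" for characters the target code page lacks, whereas the thread reports "+", presumably a Windows best-fit substitution.

```python
# Model of: batch file bytes -> (OEM to ANSI) -> exiftool -L -> XMP bytes.
# Assumptions: OEM = cp850, ANSI = cp1252; helper names are hypothetical.
OEM = "cp850"
ANSI = "cp1252"

def oem_to_ansi(batch_bytes: bytes) -> bytes:
    """Implicit conversion when the shell hands arguments to a program."""
    # errors="replace" yields "?" where Windows reportedly produced "+".
    return batch_bytes.decode(OEM).encode(ANSI, errors="replace")

def apply_latin1_option(arg_bytes: bytes) -> bytes:
    """Effect of -L as described in the thread: Latin1 in, UTF-8 out."""
    return arg_bytes.decode("latin-1").encode("utf-8")

def trace(batch_bytes: bytes):
    """Return (bytes seen by the program, bytes written with -L)."""
    arg = oem_to_ansi(batch_bytes)
    return arg, apply_latin1_option(arg)

oem_case = trace(b"\x81")        # ü saved in the OEM code page
iso_case = trace(b"\xfc")        # ü saved as ISO-8859-1
utf8_case = trace(b"\xc3\xbc")   # ü saved as UTF-8

print(oem_case)      # (b'\xfc', b'\xc3\xbc')  correct UTF-8 in the file
print(iso_case)      # (b'\xb3', b'\xc2\xb3')  a "³" instead of "ü"
print(utf8_case[0])  # b'??'  two box-drawing characters, not mappable

def batch_line_bytes(city: str) -> bytes:
    """Encode a batch line in the OEM code page so the shell reads it correctly."""
    line = f"exiftool -L -xmp:CreatorContactInfoCiAdrCity={city} image.jpg\r\n"
    return line.encode(OEM)

print(batch_line_bytes("München"))
```

Writing the batch file from a script with an explicit OEM encoding, as in `batch_line_bytes`, avoids depending on what a Windows editor chooses when saving.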
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by exiftool on 2008-06-22 17:26:41-07]

Hi Robert,  I'm glad you figured it out. Did
you use -L when extracting the value to the DOS
console?
Title: Re: Writing UTF-8 values via the command line
Post by: Archive on May 12, 2010, 08:54:21 AM
[Originally posted by robome on 2008-06-22 17:42:57-07]

Yes. Though without it a special char for each byte of the UTF-8 sequence is shown and with -L only one special char is shown. So it doesn't really help.

Robert
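
What the console shows in both cases follows from it interpreting the output bytes in its OEM code page. A minimal Python sketch, again assuming the OEM code page is cp850 (an assumption; the thread itself names 437, but the reported "³" matches cp850):

```python
# Model of how an OEM console renders bytes printed by a program.
# Assumption: the console code page is cp850; the helper name is hypothetical.
OEM = "cp850"

def shown_on_console(output_bytes: bytes) -> str:
    """Interpret program output the way an OEM console would."""
    return output_bytes.decode(OEM)

utf8_out = "München".encode("utf-8")      # output without -L
latin1_out = "München".encode("latin-1")  # output with -L

without_L = shown_on_console(utf8_out)    # 'M├╝nchen': one char per UTF-8 byte
with_L = shown_on_console(latin1_out)     # 'M³nchen': a single wrong char

# Bytes the console would need in order to display the umlaut correctly:
needed = "München".encode(OEM)
print(needed)                             # b'M\x81nchen'
```

Neither output encoding matches the console, which is why -L reduces the garbage to one character without fixing it.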