Warning: Malformed UTF-8 character(s)

Started by Archive, May 12, 2010, 08:54:04 AM


Archive

[Originally posted by pdi on 2007-05-15 07:41:47-07]

This is possibly an odd case, but I would certainly appreciate any help.

I prepare an ARGFILE, in utf-8, with XMP tags like -xmp-dc:title=. Exiftool writes the data correctly to jpegs.

I then (1) change the tags to the corresponding IPTC ones, e.g. -xmp-dc:title= to -iptc:ObjectName=, and (2) use iconv to convert the ARGFILE encoding from utf-8 to iso-8859-7 (Greek), as most programs do not read utf-8 IPTC. The resulting file is read correctly by text editors.

However, when exiftool tries to write the data to jpegs it returns "Warning: Malformed UTF-8 character(s)". The cause seems to be the Greek characters.

Both exiftool and iconv are well established, so perhaps I am doing something wrong. But if not, is there a way to make exiftool accept the iconv output? Or is there another standard encoding conversion tool whose output exiftool is happy with?

Many thanks in advance,

pdi

Archive

[Originally posted by pdi on 2007-05-15 10:53:20-07]

After some trials I found to my surprise that, irrespective of iconv, the same error occurred both with entirely new iso-8859-7 txt files and with old ARGFILEs in the same encoding from about a year ago, which worked perfectly at the time.

Preliminary findings point to an important change in exiftool in ver. 6.70 in the treatment of encoded characters. I am still trying to understand its implications. From a first reading it seems to cater for either utf-8 or cp1252. What about cp1253 (iso-8859-7)?

Regards,

pdi

Archive

[Originally posted by exiftool on 2007-05-15 11:31:15-07]

Yes, ExifTool now translates coded characters for IPTC.
See FAQ #10
for details.

You can use the -L option when writing IPTC if you want to disable translation
of special characters.

- Phil

Archive

[Originally posted by pdi on 2007-05-15 12:04:31-07]

Phil,

Thank you for your reply. I was confused by the mention only of cp1252, but when I tried the -L option the result was correct. I'm not sure I understand it, but I'm glad it works.

Regards,

pdi

Archive

[Originally posted by exiftool on 2007-05-15 12:16:10-07]

This works because 1) ExifTool assumes IPTC in the file is coded in
Latin1 unless the recorded CodedCharacterSet is "ESC % G" (UTF8),
and 2) the -L option specifies the external character set as Latin1.

When the recorded character set is the same as the external character
set, no translation is performed.

I hope this makes a bit more sense now. :-)

- Phil

Archive

[Originally posted by pdi on 2007-05-15 12:46:37-07]

Phil,

I'm afraid I was not very clear about what I don't understand. Encodings and translations are terrain only partly familiar to me. So I wonder how it all works when -L declares the txt file's character set to be Latin1 (cp1252) while the file's actual character set is Greek (cp1253). To be more exact, various text editors recognize the file as ANSI, but the underlying code page in Windows for Greek is cp1253. So exiftool is told to write cp1252 and writes in fluent cp1253 :-) It suits me fine, but I'd rather understand it than not :-)

Regards,

pdi

Archive

[Originally posted by exiftool on 2007-05-15 13:06:24-07]

I understood your confusion, but I guess you didn't understand my
explanation.

It is really fairly simple.  You give ExifTool a string of bytes and tell it
what character encoding was used.  As long as ExifTool thinks that
the internal and external character sets are the same, then no translation
is performed and the bytes are passed through unchanged.  (This is the
behaviour of older ExifTool versions for IPTC information.)

As long as ExifTool is not translating the text, it is totally irrelevant
what character set is actually used since the bytes are passed
through unchanged.  So as long as ExifTool believes there is no need
to translate the text, you are free to use whatever character set you
like.
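To illustrate the point, here is a minimal Python sketch of the behaviour described above. It is only a model, not ExifTool's actual code: the function name `transfer` and the sample title are made up for the example, and Python codec names stand in for the character sets.

```python
def transfer(data: bytes, external: str, internal: str) -> bytes:
    """Hypothetical model of the rule above (not ExifTool code)."""
    if external == internal:
        # Same declared character set on both sides: no translation,
        # the bytes are passed through unchanged.
        return data
    # Different declared character sets: decode, then re-encode.
    return data.decode(external).encode(internal)


# A Greek title stored as ISO-8859-7 bytes (sample text, assumed).
greek = "Τίτλος".encode("iso-8859-7")

# Both sides declared as Latin1, as with -L: the Greek bytes survive.
stored = transfer(greek, "latin-1", "latin-1")
print(stored == greek)
print(stored.decode("iso-8859-7"))

# External declared as UTF-8: the same bytes are not valid UTF-8,
# which is the situation behind the "Malformed UTF-8" warning.
try:
    transfer(greek, "utf-8", "latin-1")
except UnicodeDecodeError as err:
    print("malformed UTF-8:", err.reason)
```

The declared character set only matters when a translation actually happens. With matching declarations the label can be "wrong" and the data still arrives intact.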

I can see how this could be confusing.

If possible, it is best to use UTF8 to avoid this confusion.

- Phil

Archive

[Originally posted by pdi on 2007-05-15 16:45:16-07]

Phil,

I appreciate your patience with my dim wits :-) All is much clearer now.

Code:
As long as ExifTool is not translating the text, it is totally irrelevant what character set is actually used since the bytes are passed through unchanged.

Perhaps you might include some similar note in FAQ #10, to make it clearer we are not limited only to cp1252.

I am writing IPTC data to a jpg which has no previous IPTC data, only XMP; so my guess is that ExifTool handles the case of no existing IPTC data the same as if such data existed and were in the same character set as the external one.

Unfortunately, many IPTC tools cannot handle the notorious "ESC % G" sequence and fail to display utf-8 properly. I was very surprised to see the change in the default behaviour of ExifTool, but I am sure you had very sound reasons for it. It must be that the tide is turning :-)

Regards,

pdi

Archive

[Originally posted by exiftool on 2007-05-15 17:43:00-07]

I'm glad it makes a bit more sense now.

When writing information, ExifTool uses the value of CodedCharacterSet to
determine how to encode the text.  If CodedCharacterSet is being written at
the same time as text, the new character set is used.  If no CodedCharacterSet
exists and none is written, then Latin1 is assumed.
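The selection rule in the paragraph above can be sketched in Python as follows. This is an illustrative model only: the function `iptc_text_encoding` and the constant `UTF8_MARKER` are hypothetical names, and the sketch treats anything other than the "ESC % G" sequence as Latin1, as described earlier in this thread.

```python
from typing import Optional

# "ESC % G", the CodedCharacterSet value that marks UTF-8.
UTF8_MARKER = b"\x1b%G"


def iptc_text_encoding(existing: Optional[bytes],
                       new: Optional[bytes] = None) -> str:
    """Pick the encoding for IPTC text (hypothetical model, not ExifTool code).

    existing: CodedCharacterSet already in the file, or None if absent.
    new:      CodedCharacterSet written in the same operation, or None.
    """
    # A value written at the same time takes precedence over the stored one.
    marker = new if new is not None else existing
    # No marker at all, or anything other than ESC % G: assume Latin1.
    return "utf-8" if marker == UTF8_MARKER else "latin-1"


print(iptc_text_encoding(None))               # nothing stored, nothing written
print(iptc_text_encoding(None, UTF8_MARKER))  # UTF-8 marker written with the text
print(iptc_text_encoding(UTF8_MARKER))        # UTF-8 marker already in the file
```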

The special character handling in IPTC is a real mess.  The way ExifTool
originally handled it (by never translating) was simplest, but it seems that
other applications most commonly assume Latin1 characters (contrary to
the actual IPTC specification) so ExifTool was displaying special characters
written by these applications incorrectly.  This is the reason for the change.

If enough people have problems with this, I am open to changing it back
again.

It is a pity that not many applications support UTF8 in IPTC, because this
is the best solution.  The original IPTC specification used ISO 2022, which
is a real can of worms and hence isn't well supported either, but UTF8
support was added as a revision to the IPTC specification (I believe),
and is a much better solution.

- Phil

Archive

[Originally posted by exiftool on 2007-05-25 19:18:08-07]

For reference, here
is the thread
which prompted the change in handling of special
characters in IPTC.

- Phil