Non-Printable ASCII Chars in XMP

Started by Archive, May 12, 2010, 08:53:59 AM

Archive

[Originally posted by themonk on 2007-01-10 17:20:32-08]

Hi Phil

Happy New Year

What do you make of this ?

I have been duplicating IPTC blocks in XMP using -tagsFromFile ......

We have been getting spurious Photoshop CS2 errors about unreadable data being ignored.

It turns out that XMP only supports printable ASCII chars (0-127); the character that was causing the problem was a £ (pound sign).

If you run "exiftool -XMP:Description='£' testfile.jpg" and then load testfile.jpg in PS, you should see a simpler version of the problem.

-tagsFromFile is a handy solution, and I do not want to have to limit myself to extracting the IPTC values, re-encoding and then re-inserting.

Can you suggest anything ?

Mark Tate

Archive

[Originally posted by exiftool on 2007-01-10 22:38:47-08]

Hi Mark,

XMP supports special characters beyond standard ASCII by using
UTF-8 encoding.  The problem is on the IPTC side:  IPTC character
encoding is fantastically obscure, and not well implemented by
other software.  Even Photoshop does not adhere
to the IPTC specification, and will write Latin1 characters ad hoc
in IPTC without properly setting the CodedCharacterSet
tag.

For this reason it is very difficult to properly handle special characters
in IPTC.
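
To illustrate the mismatch described above, here is a minimal Python sketch (not ExifTool code). It only shows that the pound sign stored as a single Latin1 byte is not valid UTF-8, which is presumably why copying the raw IPTC bytes into XMP produces unreadable data. The helper name is_valid_utf8 is made up for this example.

```python
# The pound sign from the original report.
pound = "\u00a3"

latin1_bytes = pound.encode("latin-1")  # b'\xa3'      (one byte)
utf8_bytes = pound.encode("utf-8")      # b'\xc2\xa3'  (two bytes)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xA3 byte is not a legal UTF-8 sequence, so a UTF-8 reader rejects it.
print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))

# Decoding with the right source encoding first, then re-encoding, fixes it.
repaired = latin1_bytes.decode("latin-1").encode("utf-8")
```

The last line is the whole fix in miniature: the conversion only works if the reader knows (or guesses correctly) what encoding the IPTC bytes were written in.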

Also, I don't have a very good test set of IPTC containing special
characters from other applications, so it is difficult for me to know
what the best way to handle this is.  Can you tell me what encoding
is used in your IPTC samples that contain special characters, and
what the CodedCharacterSet tag is set to?

According to this source, it may be sufficient in most cases to just assume
Latin-1 encoding if not specified.  If this is true, I could add an option
which would force ExifTool to assume Latin1 encoding and convert appropriately.

If anyone has any ideas on this matter, I'd love to hear them.

- Phil

Archive

[Originally posted by exiftool on 2007-01-11 19:01:47-08]

Hi Mark,

Thanks for the sample via email.

I think the thing to do is to assume Latin1 coding unless otherwise specified.
This should fix the problem with your sample image at least.  It is a rather significant
change to start translating IPTC text, but I hope I have done it in a way that won't
break things for too many people, and hopefully it will solve more problems than
it creates.

The strategy now is to convert IPTC text if the CodedCharacterSet is recognized,
and to assume Latin1 if the CodedCharacterSet tag doesn't exist.  The ISO 2022
escape sequences used to switch between different codings are not yet supported,
and the text is assumed to be all in a single character set.  Also, when creating
a new IPTC record from scratch, a CodedCharacterSet value of "UTF8" is written by
default.

The new version will require a lot of testing since this is a fairly significant change.
It would be great if you could help with this effort.  I have uploaded a 6.70 pre-release
here for you to play with.

 
Note that the translations are only performed if the coding is Latin1 or UTF8.  Otherwise
no translation is done.  This will all be spelled out in the new FAQ #10, which
will read:

IPTC: IPTC text is converted only for recognized values of
the IPTC:CodedCharacterSet tag.  Currently recognized encodings are UTF-8
("UTF8" or "ESC % G") and Latin1/ISO-8859-1 ("Latin" or "ESC . A").  "Latin"
is assumed if the CodedCharacterSet tag is missing.  No translation is performed
for all other values of CodedCharacterSet.  When reading, text is translated to
UTF-8 by default, or Latin1 with the -L option.  When writing, the inverse
translation is performed.  When creating a new IPTC record, ExifTool
automatically sets CodedCharacterSet to "UTF8" unless otherwise specified.
This causes all text strings to be stored in UTF-8, which is the preferred
encoding.
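
The read-side rules in this FAQ draft could be sketched roughly as follows. This is a Python illustration only, not the actual ExifTool (Perl) implementation. The function name translate_on_read and the external parameter (standing in for the default UTF-8 output versus the -L option) are assumptions made for the example, and CodedCharacterSet is compared in its printed form as quoted above.

```python
from typing import Optional

# Printed CodedCharacterSet values as quoted in the FAQ draft.
UTF8_VALUES = {"UTF8", "ESC % G"}
LATIN_VALUES = {"Latin", "ESC . A"}


def translate_on_read(raw: bytes,
                      coded_character_set: Optional[str],
                      external: str = "utf-8") -> bytes:
    """Hypothetical helper: convert stored IPTC bytes to the output encoding.

    external="utf-8" models the default; external="latin-1" models -L.
    """
    if coded_character_set is None or coded_character_set in LATIN_VALUES:
        internal = "latin-1"   # "Latin" is assumed when the tag is missing
    elif coded_character_set in UTF8_VALUES:
        internal = "utf-8"
    else:
        return raw             # unrecognized value: no translation

    if internal == external:
        return raw             # nothing to convert
    text = raw.decode(internal, errors="replace")
    return text.encode(external, errors="replace")


# Mark's case: a Latin1 pound sign with no CodedCharacterSet tag.
print(translate_on_read(b"\xa3", None))
```

Writing would apply the same mapping in the opposite direction (external to internal).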

- Phil

Archive

[Originally posted by themonk on 2007-01-17 16:43:06-08]

Thanks Phil ...

I've put a dozen images through which previously had problems and they open in Photoshop with the XMP intact...

I will continue to test as and when I come across images but so far so good...

Let me know if anyone spots any issues....

Mark

Archive

[Originally posted by exiftool on 2007-01-17 17:29:38-08]

Hi Mark,

I've been reading more about the ISO 2022 specification that is used
in IPTC, and there are a couple of things I'm thinking about changing.
The first issue won't affect you because you're not writing IPTC, but the
second may help if you have translation problems with images where
CodedCharacterSet has been set to an unrecognized value.

1) I think I'll change the default behaviour of setting
CodedCharacterSet to UTF8 when creating a new IPTC record
because it seems there isn't good support for this in other
applications.

2) I may try applying the Latin conversion even for unrecognized
CodedCharacterSets provided no alternate ISO 2022 character
sets have been invoked in the text.

- Phil

Archive

[Originally posted by exiftool on 2007-01-19 14:00:24-08]

I've released version 6.70 officially now.  This version implements the
changes that I mentioned in my last post.  So here is the updated
FAQ #10 text for IPTC character coding:

IPTC: The value of the IPTC:CodedCharacterSet tag determines
how the internal IPTC string values are interpreted. If CodedCharacterSet
exists and has a value of "UTF8" (or "ESC % G") then string values
are assumed to be stored as UTF-8, otherwise Latin1 (cp1252) coding
is assumed. When reading, these strings are translated to UTF-8 by
default, or Latin1 with the -L option. When writing, the inverse
translation is performed. No translation is done if the internal
(IPTC) and external (ExifTool) character sets are the same. Note
that ISO 2022 character set shifting is not supported. Instead, a
warning is issued and the string is not translated if an ISO 2022
shift code is found. See the IPTC specification for more information
about IPTC character coding.
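
For anyone who wants to see the released rule spelled out, here is a rough Python sketch of the read side. It is an illustration under stated assumptions, not the actual ExifTool code: the function name read_iptc_string is hypothetical, the set of bytes treated as an ISO 2022 shift code (ESC, SO, SI) is a guess for the example, and checking for matching character sets before checking for shift codes is also an assumption.

```python
from typing import Optional, Tuple

# Printed CodedCharacterSet values that mean UTF-8, as quoted in the FAQ.
UTF8_VALUES = {"UTF8", "ESC % G"}

# Assumption: ESC, SO and SI are taken as signs of ISO 2022 shifting.
SHIFT_BYTES = {0x1B, 0x0E, 0x0F}


def read_iptc_string(raw: bytes,
                     coded_character_set: Optional[str],
                     external: str = "utf-8") -> Tuple[bytes, Optional[str]]:
    """Hypothetical helper: return (output bytes, warning or None).

    external="utf-8" models the default; external="cp1252" models -L.
    """
    # UTF-8 only if explicitly declared; everything else is treated as Latin1.
    internal = "utf-8" if coded_character_set in UTF8_VALUES else "cp1252"

    if internal == external:
        return raw, None       # same character set: no translation

    if any(b in SHIFT_BYTES for b in raw):
        # Shifting is not supported: warn and leave the string untouched.
        return raw, "ISO 2022 shift code found; string not translated"

    text = raw.decode(internal, errors="replace")
    return text.encode(external, errors="replace"), None


# An unrecognized CodedCharacterSet now falls back to Latin1 (cp1252).
print(read_iptc_string(b"\xa3", "ESC - A"))
```

Compared with the pre-release behaviour, the only branch that leaves text untranslated is the shift-code case (or when no conversion is needed).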

- Phil

Archive

[Originally posted by themonk on 2007-01-19 17:56:35-08]

I will let you know of any issues..

Excellent response as usual..

Thanks Phil.....