charset to extract German text from diverse JPGs

Started by Archive, May 12, 2010, 08:54:39 AM


Archive

[Originally posted by 190002 on 2009-08-24 23:41:32-07]

Hi all,

I operate a photo-sharing site and we accept JPG uploads from 2000 different photographers worldwide.  Our site automatically extracts file keywords from XMP-Subject (falling back to IPTC-Keywords if XMP-Subject is blank).  We extract caption from XMP-Description, falling back to IPTC-Caption-Abstract if XMP-Description is blank.

Periodically we get complaints from photographers whose metadata is not extracted in the proper charset.  Typically this happens for non-English European photographers whose words contain umlauts.  These umlauts end up being extracted as non-printable characters that display as boxes.

When we initialize the IPTC Perl code, by default we use:

$exifTool->Options(
    'Unknown' => 1,
);

According to the docs, this should cause the default UTF8 charset to be used.

Since this problem was first reported, I have added a photographer-level control to our web site whereby the photographer can opt for the Latin charset.  If he chooses this option, then we initialize the Image-ExifTool code with:

$exifTool->Options(
    'Unknown' => 1,
    'Charset' => 'Latin',
);

For every photographer of ours who has complained about the problem (about 3-4 of 2000), switching to 'Charset' => 'Latin' has solved the issue.

However, I continue to get complaints from photographers and my management about this issue.  They feel that asking the photographer to understand charsets and find/test a deeply buried setting is not good enough (and I would agree, if there were any good options available).

So:

* Is there a way to auto-detect the encoding of the metadata?  Would it work to extract as both Latin and UTF8 and then compare the number of non-printable characters that each produces?

* Is there a standard "best practices" approach of defaulting to Latin versus UTF8?

* Any suggestions on the best way to deal with this issue?
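
A rough sketch of the comparison idea in the first question, written against Perl's core Encode module rather than ExifTool itself. It assumes the tag value is available as raw, untranslated bytes, and it relies on the observation that Latin text containing umlauts is rarely valid UTF-8. The helper name decode_guess is hypothetical, and the fallback to cp1252 (Windows Latin1) is an assumption, not a real detection:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the encoding of a raw metadata byte string.
# Returns the decoded character string and the name of the charset used.
sub decode_guess {
    my ($bytes) = @_;

    # decode() with a CHECK value may modify its input, so work on a copy.
    my $copy = $bytes;

    # Strict UTF-8 decoding dies on any malformed sequence.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ($text, 'UTF-8') if defined $text;

    # Not valid UTF-8: assume Windows Latin1 (an assumption, see above).
    return (decode('cp1252', $bytes), 'cp1252');
}
```

Pure ASCII values pass the UTF-8 check unchanged, so only values with high bytes are affected. The heuristic cannot distinguish Latin1 from other single-byte charsets (Cyrillic, Greek, and so on), so a per-photographer override would still be needed for those.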

Our platform is Windows 2003 Server, ActiveState Perl 5.10.0 build 1004, Image-ExifTool 7.67.

Thanks,

James

Archive

[Originally posted by exiftool on 2009-08-25 00:23:23-07]

Interesting, but I think there is some misunderstanding:

You wrote:
Code:
$exifTool->Options(
    'Unknown' => 1,
);

According to the docs, this should cause the default UTF8 charset to be used.

I wonder how you got this impression.  The docs state:
Code:
   Unknown
        Flag to get the values of unknown tags.  If set to 1, unknown
        tags are extracted from EXIF (or other tagged-format)
        directories.

The setting of the Charset option is entirely application dependent.
If your application interprets the tag values as UTF8, then it should
be set to the default "UTF8".  But if you want special characters
translated to Windows Latin1, then set it to "Latin".

But this all assumes that the IPTC is properly encoded to begin
with (which is unlikely).  Historically, applications have written
IPTC using whatever local character set the computer was using,
and there is no way to tell what this character set was.  Blame
Adobe -- they are responsible for this mess because Photoshop
set the standard.

For these historic encodings, setting the ExifTool Charset to
"Latin" effectively disables translation of IPTC and the characters
are passed without translation.  This may be what you want,
I don't know.  You should read FAQ number 10 for more details
about the character handling.

The only real solution for this is to allow the user to specify
which character set to use if not specified, then do the translations
for the specific character set.  But ExifTool will not do these
translations for you (there are just too many character sets,
and I don't want to implement them all).  So this solution is
not easy.

- Phil
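
As an illustration of the "let the user specify the character set, then translate" approach described above, here is a minimal sketch using Perl's core Encode module. It assumes the tag values reach the application as untranslated bytes and that the site stores a charset name per photographer. The helper name to_utf8 and the idea of a per-photographer charset setting are hypothetical, not part of ExifTool:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: convert raw tag bytes from a photographer-chosen
# charset into UTF-8 bytes for storage or display.
sub to_utf8 {
    my ($bytes, $charset) = @_;

    # find_encoding() returns undef for names that Encode does not know.
    my $enc = find_encoding($charset)
        or die "Unknown charset '$charset'\n";

    # Decode the bytes to characters, then re-encode them as UTF-8.
    return encode('UTF-8', $enc->decode($bytes));
}

# Example: "Gruesse" with umlaut and sharp s, stored as Windows Latin1 bytes.
my $utf8 = to_utf8("Gr\xFC\xDFe", 'cp1252');
```

Encode ships with many single-byte charsets (cp1250, cp1251, iso-8859-x and others), so the translation tables do not need to be written by hand. Choosing the right charset name for a given photographer remains the hard part.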