ExifTool Forum

ExifTool => Newbies => Topic started by: xalex on August 01, 2013, 08:48:18 AM

Title: Problems with charset and german umlaut
Post by: xalex on August 01, 2013, 08:48:18 AM
Hi,

I have read FAQs 10 and 18 and many posts on how to extract and display IPTC data correctly when it contains German umlauts, but I am now facing a situation where I need your help. The situation is as follows:
1. I have an image (a.jpg) prepared with Photoshop CS6, with IPTC data containing umlauts (this string: ä -> ae ö -> oe ü -> ue Ä -> Ae Ö -> Oe Ü -> Ue ß -> ss).
2. This image is imported into an image database, and the data is shown correctly there.
3. When exported from the database (a1.jpg), the data is still correct. This file can be imported into the database again without problems, but:
4. When the exported image (a1.jpg) is modified with Photoshop CS6, the result (aps.jpg) is "damaged": when this file is imported into the database, the metadata is shown incorrectly (e.g. �> ae � oe �e �-> Ae �-> Oe �-> Ue �-> ss).

And now the question for ExifTool: which value do I need to pass to -charset to display the data correctly? Is the file really damaged, or can it be read with ExifTool?
The 3 files are attached. I would also like to know whether there is a switch that automatically sets the "correct" charset.

Thanks very much!
Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 01, 2013, 09:11:12 AM
Hi Alex,

What a mess.  It is great that you read the FAQs, but they won't give all the answers in this case.

1) The IPTC CodedCharacterSet is duplicated (4 times!) in a.jpg and a1.jpg.  There should be only one of these.

2) The IPTC in a1.jpg is encoded in Windows Latin1 even though the CodedCharacterSet specifies UTF-8 (4 times!).  The XMP in a1.jpg is correct.  ExifTool will read the IPTC from this file correctly if the CodedCharacterSet tags are deleted.

3) I have never seen this, but it looks like the individual special characters in aps.jpg have been double-UTF8 encoded (in both IPTC and XMP).  Also, there is an additional IPTC CodedCharacterSet tag (so 5 now!).  I have seen the entire XMP double-UTF8 encoded, and ExifTool actually handles this case, but I have never seen individual characters erroneously re-encoded like this.  Also, I have never before seen double-UTF8 in IPTC.  ExifTool cannot currently decode this mess (neither IPTC nor XMP).

- Phil
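
To make the two faults described above concrete at the byte level, here is a minimal Python sketch. It does not touch ExifTool or the attached files. The sample string and the helper names (double_utf8, undo_double_utf8) are made up for illustration, and Python's cp1252 codec is assumed to stand in for "Windows Latin1".

```python
# Illustration only: helper names and sample text are assumptions,
# and cp1252 is assumed to correspond to "Windows Latin1".
SAMPLE = "ä ö ü Ä Ö Ü ß"


def double_utf8(text: str) -> bytes:
    """Encode to UTF-8, misread those bytes as cp1252, encode to UTF-8 again."""
    once = text.encode("utf-8")
    misread = once.decode("cp1252")
    return misread.encode("utf-8")


def undo_double_utf8(data: bytes) -> str:
    """Reverse double_utf8 for text whose bytes survive the cp1252 round trip."""
    return data.decode("utf-8").encode("cp1252").decode("utf-8")


# Point 2 in the post: bytes written as Windows Latin1 but labelled as UTF-8.
latin1_bytes = SAMPLE.encode("cp1252")
try:
    latin1_bytes.decode("utf-8")
    labelled_ok = True
except UnicodeDecodeError:
    labelled_ok = False  # a strict UTF-8 reader rejects these bytes

# Point 3 in the post: special characters encoded to UTF-8 twice.
doubled = double_utf8(SAMPLE)

print(latin1_bytes.hex())         # one byte per umlaut
print(doubled.hex())              # four bytes per umlaut
print(undo_double_utf8(doubled))  # original text recovered
```

The reversal only works when every intermediate byte is defined in cp1252, so it is a diagnostic aid rather than a general repair tool.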
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 01, 2013, 09:42:15 AM
Hi Phil,

thank you very much for the quick reply! This is really great support :-)
I cannot see this mess myself, as I can only view the data through applications.
But perhaps you can help me work out which software vendor to complain to.
a1 is generated by Photoshop - do we have the mess here already?
a2 is generated by the database tool - is the mess increasing?
aps is again produced by Adobe Photoshop.

The fact that aps.jpg is double-encoded seems to be a Photoshop error, right? And this double encoding counts as damage, which is why ExifTool does not decode it, right?

Thanks a lot!
Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 01, 2013, 10:17:47 AM
Hi Alex,

Quote from: xalex on August 01, 2013, 09:42:15 AM
I cannot see this mess myself, as I can only view the data through applications.

The ExifTool -v3 option will show the binary data of the IPTC tags.  Not the XMP though.  For the XMP  I used -xmp -b > out.xmp, and looked at the actual XMP with a hex dump.

Quote
a1 is generated by Photoshop - do we have the mess here already?

The IPTC is not coded correctly, but the XMP is OK.

Quote
a2 is generated by the database tool - is the mess increasing?

You didn't send a2.jpg

Quote
The fact that aps.jpg is double-encoded seems to be a Photoshop error, right?

My guess is that Photoshop is reading the incorrectly encoded IPTC and propagating this to the XMP.  If you feed Photoshop correctly-encoded metadata it shouldn't cause a problem like this.

- Phil
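
The inspection described above (dump the raw XMP, then look at the bytes) can also be done with a few lines of Python instead of a hex dump tool. This is only a sketch: the function names and the detection heuristic are assumptions, not something ExifTool provides, and the file name out.xmp is taken from the post.

```python
import re

# Runs of bytes >= 0x80, i.e. everything that is not plain ASCII.
NON_ASCII_RUN = re.compile(rb"[\x80-\xff]+")


def non_ascii_runs(data: bytes):
    """Return a list of (offset, hex string) for each run of non-ASCII bytes."""
    return [(m.start(), m.group().hex()) for m in NON_ASCII_RUN.finditer(data)]


def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: c3 83 (UTF-8 for 'Ã') followed by another multi-byte lead byte."""
    return re.search(rb"\xc3\x83[\xc2-\xef]", data) is not None


# With a real dump one would read the file instead:
# data = open("out.xmp", "rb").read()
clean = "<dc:subject>ä</dc:subject>".encode("utf-8")
broken = clean.decode("cp1252").encode("utf-8")  # simulate the double encoding

print(non_ascii_runs(clean))
print(non_ascii_runs(broken))
print(looks_double_encoded(clean), looks_double_encoded(broken))
```

The heuristic can give false positives on text that legitimately contains "Ã" followed by another non-ASCII character, so a hit is a reason to look closer, not proof.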
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 01, 2013, 10:57:39 AM
Hi Phil,

just to avoid confusion with the attached images:
the names are a.jpg, a1.jpg and aps.jpg.

a.jpg    is generated by Photoshop - do we have the mess here already?
a1.jpg   is generated by the databasetool - is the mess increasing?
aps.jpg  is again produced by Adobe Photoshop.

-- Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 01, 2013, 11:03:50 AM
Hi Alex,

Quote from: xalex on August 01, 2013, 10:57:39 AM
a.jpg    is generated by Photoshop - do we have the mess here already?

No, this one is OK.

Quote
a1.jpg   is generated by the databasetool - is the mess increasing?

The IPTC is not coded correctly, but the XMP is OK.

- Phil
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 02, 2013, 08:02:06 AM
Hi Phil,

I had a closer look at the files and have an additional question:
The first file (a.jpg, from Photoshop) already declares the UTF8 character set 4 times. I think this is already wrong and the source of the whole mess.
Or did you find additional problems in the IPTC data written to a1.jpg (by the database tool)? If so, what? The 4 character set declarations are simply left the same as in the source file a.jpg.

-- Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 02, 2013, 08:21:43 AM
Quote from: xalex on August 02, 2013, 08:02:06 AM
The first file (a.jpg, from Photoshop) already declares the UTF8 character set 4 times. I think this is already wrong and the source of the whole mess.

Right. Forgot to mention that again in my last post.  The encoding is correct though.

Quote
Or did you find additional problems in the IPTC data written to a1.jpg (by the database tool)? If so, what? The 4 character set declarations are simply left the same as in the source file a.jpg.

For the fourth time:  The IPTC is encoded improperly in this file.  I even told you how to fix this in my first reply.

- Phil
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 11, 2013, 07:36:00 AM
Hi Phil,

thank you again for the explanation, and sorry for my misunderstandings!

I tried more things to get correctly encoded metadata. To find the correct encoding for my images, I ran exiftool 19 times, each run with a different -charset setting (-charset UTF-8, -charset Latin, ...). In every run I displayed the metadata tag of interest. That way I found the encodings with which the tags are displayed correctly; see the lines marked in yellow in et-xmp-supplementarycategories.jpg.

With these tests I was able to adjust the parameters to read the IPTC and XMP data correctly, but the -charset parameter has no effect on the EXIF tag ImageDescription; see et-imagedescription.jpg.
Is there another special setting for this?

-- Alex
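
The trial-and-error approach described above (try each character set, keep the ones that display the known text correctly) can be sketched in Python on raw bytes. This is a hypothetical helper, not an ExifTool feature: the candidate list uses Python codec names rather than ExifTool -charset values, and the expected text is assumed to be known in advance.

```python
# Hypothetical helper: the codec list and names are assumptions for illustration.
CANDIDATES = ["utf-8", "cp1252", "cp1250", "mac_roman", "cp437"]


def matching_codecs(raw: bytes, expected: str, candidates=CANDIDATES):
    """Return the codecs under which `raw` decodes to exactly `expected`."""
    hits = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                hits.append(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this codec
    return hits


expected = "ä -> ae ö -> oe ü -> ue"
from_utf8 = matching_codecs(expected.encode("utf-8"), expected)
from_latin = matching_codecs(expected.encode("cp1252"), expected)

print(from_utf8)
print(from_latin)
```

More than one codec can match the same bytes, because several single-byte code pages place the German umlauts at the same positions. That is consistent with finding several "possible" encodings for one tag.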
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 11, 2013, 07:36:02 PM
Hi Alex,

I'm glad you are understanding things better now.

Quote from: xalex on August 11, 2013, 07:36:00 AM
With these tests I was able to adjust the parameters to read the IPTC and XMP data correctly, but the -charset parameter has no effect on the EXIF tag ImageDescription; see et-imagedescription.jpg.

I don't understand this.  I get this (in a Mac UTF-8 terminal):

> exiftool a.jpg -imagedescription
Image Description               : © Phil

> exiftool a.jpg -imagedescription -charset exif=latin
Image Description               : Â© Phil


- Phil
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 12, 2013, 09:02:12 AM
Hi Phil,

In my tests I found correlations when combining the -charset CHARSET and -charset exif=CHARSET options. When extracting the EXIF tag ImageDescription, the -charset and -charset exif= settings have no effect; see the attachment.
Do you know what I am doing wrong?

-- Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 12, 2013, 09:18:36 AM
Hi Alex,

How did you run the ExifTool command?  Did you pipe the output to a file, or view it in the terminal?  What system are you using?  What were your terminal character settings?

Working with the "a.jpg" file from your first post, in a UTF-8 terminal, and ExifTool 9.34, I get this:

> exiftool ~/Desktop/images/a.jpg -imagedescription
Image Description               : ä -> ae ö -> oe ü -> ue Ä -> Ae Ö -> Oe Ü -> Ue ß -> ss

> exiftool ~/Desktop/images/a.jpg -imagedescription -charset exif=latin
Image Description               : Ã¤ -> ae Ã¶ -> oe Ã¼ -> ue Ã„ -> Ae Ã– -> Oe Ãœ -> Ue ÃŸ -> ss

> exiftool ~/Desktop/images/a.jpg -imagedescription -charset exif=latin2
Image Description               : Ă¤ -> ae Ă¶ -> oe ĂĽ -> ue Ă„ -> Ae Ă– -> Oe Ăś -> Ue Ăź -> ss

> exiftool ~/Desktop/images/a.jpg -imagedescription -charset exif=latin2 -charset latin
Image Description               : ?? -> ae ?? -> oe ?? -> ue ?? -> Ae ?? -> Oe ?? -> Ue ?? -> ss


- Phil
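
The garbled characters in the outputs above follow mechanically from reading UTF-8 bytes with a single-byte code page. A minimal Python sketch, assuming ExifTool's Latin and Latin2 correspond to Python's cp1252 and cp1250 codecs and that the stored EXIF bytes are UTF-8; the variable and function names are made up for illustration:

```python
# Assumptions: "Latin" ~ cp1252, "Latin2" ~ cp1250, stored bytes are UTF-8.
STORED = "ä ö ü Ä Ö Ü ß".encode("utf-8")


def misread(raw: bytes, assumed: str) -> str:
    """Decode UTF-8 bytes as if they were in the `assumed` single-byte codec."""
    return raw.decode(assumed)


as_latin = misread(STORED, "cp1252")
as_latin2 = misread(STORED, "cp1250")

print(as_latin)
print(as_latin2)

# Sending the Latin2 misreading to a Latin (cp1252) output: characters that
# do not exist in cp1252 are replaced by "?".
to_latin_console = as_latin2.encode("cp1252", errors="replace")
print(to_latin_console)
```

Each umlaut is two bytes in UTF-8, so each one turns into two wrong characters, and the exact pair depends on which code page was assumed.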
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 12, 2013, 10:23:22 AM
Hi Phil,

I am using ExifTool version 9.05 on Windows XP.
I ran the command from the Windows command shell and piped the output to a file: exiftool xxx > out.txt
The problem is the combination of -charset AND -charset exif=

The interesting thing is that everything works fine with exif and xmp ...

-- Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 12, 2013, 10:31:23 AM
Hi Alex,

Just to be sure, try updating to the current ExifTool version.

Piping to file should be the best way to do it.

Note that the -charset XXX option will have no effect on EXIF unless -charset exif=YYY is used.  But I agree, the problem is that your output is not changing as it should when both of these are used.

- Phil

Edit: Oh, wait.  I just noticed you're always setting them to the same value.  No translation is done if the internal and external character sets are the same.  So your results are to be expected.
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 12, 2013, 11:17:24 AM
In response to a private message...

OK.  I think maybe you don't understand what is going on.

The -charset XXX option tells exiftool what external character set you want to use for input/output.

The -charset TYPE=YYY option tells exiftool what internal character set should be used for a specific metadata type.

If XXX and YYY are the same, then no recoding is done, and the text is passed straight through.

Additionally, for EXIF no recoding is ever done unless -charset exif=YYY is specified.  Also, a -charset iptc=YYY is ignored if the CodedCharacterSet tag is set to "UTF8" in the IPTC.  This is all explained in FAQ 10 (https://exiftool.org/faq.html#Q10)

You may specify the internal character sets for different types of information in a single command:

exiftool -charset exif=utf8 -charset iptc=latin2 ...

Note that in your PDF document, for Keywords (an IPTC tag), the -charset exif=YYY setting will have no effect since it is an IPTC tag.

I hope this helps.

- Phil
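
The rules in the post above can be written down as a small Python model. This is a simplified sketch of the described behaviour, not ExifTool's actual code: the recode function is hypothetical, the codec names are Python names, and the IPTC CodedCharacterSet exception is left out.

```python
from typing import Optional


def recode(raw: bytes, internal: Optional[str], external: str = "utf-8") -> bytes:
    """Translate stored bytes for output (simplified model).

    internal: character set the stored bytes are assumed to be in;
              None models EXIF when no -charset exif=YYY is given.
    external: character set wanted for output; models -charset XXX.
    """
    if internal is None or internal == external:
        return raw  # passed straight through, no recoding
    text = raw.decode(internal, errors="replace")
    return text.encode(external, errors="replace")


stored = "ä".encode("utf-8")  # c3 a4, as stored in the file

same = recode(stored, "cp1252", "cp1252")   # internal == external
unset = recode(stored, None, "cp1252")      # no internal set given
fixed = recode(stored, "utf-8", "cp1252")   # internal declared as UTF-8

print(same.hex(), unset.hex(), fixed.hex())
```

In this model the output only changes when an internal character set is given and it differs from the external one, which is why setting both options to the same value shows no effect.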
Title: Re: Problems with charset and german umlaut
Post by: xalex on August 12, 2013, 04:22:25 PM
Hi Phil,

now I have found the solution. It is strange, but it works: if I always set -charset exif=UTF8 for each external charset, it works fine.
The confusion was that ExifTool skips the recoding if the internal and external encodings are set to the same value.
And if you do not set any value for -charset exif=, EXIF is not recoded either. Perhaps it would be a good idea to have exif=UTF8 as the default value?

But as I said now it works fine for me.

Thanks for helping,
Alex
Title: Re: Problems with charset and german umlaut
Post by: Phil Harvey on August 12, 2013, 08:30:09 PM
Hi Alex,

Yes, I think you have it now.

I can't make UTF-8 the default for EXIF because this (relatively recent) change would not be backward compatible to earlier versions of ExifTool, and could break tools that people have developed based on ExifTool.  Backward compatibility is very important to me (unlike, it seems, all other software vendors).

- Phil