ExifTool Forum

ExifTool => The "exiftool" Application => Topic started by: Kugelblitz on November 13, 2018, 09:29:39 AM

Title: Replace characters in all fields and set the Charset to UTF-8
Post by: Kugelblitz on November 13, 2018, 09:29:39 AM
Hello,

I noticed that some German names appear twice in many fields, with the duplicate being written with the wrong characters.

(https://exiftool.org/forum/index.php?action=dlattach;topic=9664.0;attach=2827)

For example, the city "Cologne" is written "Köln" in German and appears twice in the metadata of some files.
Likewise, Müllenbach is displayed as M¸llenbach or even MÃ¼llenbach, and Köln as Kˆln or even KÃ¶ln.

I would like to replace all of the following characters in all fields, because they appear in so many different fields, such as Keywords, geo names and Copyright:

‰ -> ä
ˆ -> ö
¸ -> ü
ƒ -> Ä
÷ -> Ö
‹ -> Ü
ﬂ -> ß

and also

Ã¤ -> ä
Ã¶ -> ö
Ã¼ -> ü
Ã„ -> Ä
Ã– -> Ö
Ãœ -> Ü
ÃŸ -> ß

And set the encoding to UTF-8 for all fields too.
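
Both lists above look like the result of bytes being written in one encoding and read back in another. The following is only a minimal Python sketch of that idea, assuming the first list comes from Latin-1 bytes read as Mac Roman and the second from UTF-8 bytes read as Windows-1252. The helper name `misread` is made up for illustration.

```python
# Assumption: list 1 = Latin-1 bytes read as Mac Roman,
#             list 2 = UTF-8 bytes read as Windows-1252.
GERMAN = "äöüÄÖÜß"

def misread(text, written_as, read_as):
    """Encode text in one charset, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)

for ch in GERMAN:
    first = misread(ch, "latin-1", "mac_roman")
    second = misread(ch, "utf-8", "cp1252")
    print(f"{first} / {second} -> {ch}")
```

If the assumption holds, each printed line pairs one entry from the first list with the matching entry from the second.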

How can I check which charset has been used in the Files?

When I drag and drop files onto exiftool, this is the output:
---- ExifTool ----
ExifToolVersion                 : 11.17
---- XMP ----
XMPToolkit                      : Image::ExifTool 10.96
CountryCode                     : DEU
Location                        : N├╝rburg
Subject                         : Deutschland, geotagged, M├╝llenbach, Rheinland-Pfalz, DEU, Deutschland, N├╝rburg, Rheinland-Pfalz
Country                         : Deutschland
State                           : Rheinland-Pfalz
CreatorTool                     : 1.01
---- IPTC ----
ApplicationRecordVersion        : 4
Keywords                        : Deutschland, geotagged, M├â┬╝llenbach, Rheinland-Pfalz, DEU, Deuts
City                            : N├╝rburg
Sub-location                    : N├╝rburg
Province-State                  : Rheinland-Pfalz
Country-PrimaryLocationCode     : DEU
Country-PrimaryLocationName     : Deutschland


or

---- ExifTool ----
ExifToolVersion                 : 11.17
---- XMP ----
XMPToolkit                      : Image::ExifTool 10.96
CountryCode                     : DEU
Location                        : M├╝llenbach
Rights                          : 2009 mARTin Bierschenk
Subject                         : Deutschland, geotagged, M┬©llenbach, M├╝llenbach, Rheinland-Pfalz
Country                         : Deutschland
State                           : Rheinland-Pfalz
CreatorTool                     : 1.01
---- IPTC ----
EnvelopeRecordVersion           : 4
CodedCharacterSet               : UTF8
ApplicationRecordVersion        : 4
Keywords                        : Deutschland, geotagged, M┬©llenbach, M├╝llenbach, Rheinland-Pfal
City                            : M┬©llenbach
Sub-location                    : M┬©llenbach
Province-State                  : Rheinland-Pfalz
Country-PrimaryLocationCode     : DEU
Country-PrimaryLocationName     : Deutschland
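
The box-drawing characters in the listings above are probably a display effect on top of the stored data. Here is a rough Python sketch, assuming the Windows console used code page 850 (an assumption, chosen because it is the code page in which the bytes C2 B8 show up as "┬©"). The function name `undo_console` is hypothetical.

```python
def undo_console(shown, console_codepage="cp850"):
    """Turn what the console displayed back into the UTF-8 text that was stored.

    Assumes the stored bytes were UTF-8 and the console rendered them
    byte by byte in console_codepage.
    """
    return shown.encode(console_codepage).decode("utf-8")

for shown in ("N├╝rburg", "M┬©llenbach", "M├â┬╝llenbach"):
    print(shown, "=>", undo_console(shown))
```

Under that assumption the first value is stored correctly, the second one really contains the wrong character "¸", and the third one contains UTF-8 that was encoded a second time.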

Title: Re: Replace characters in all fields and set the Charset to UTF-8
Post by: Phil Harvey on November 13, 2018, 10:07:43 AM
The first step is to sort out your IPTC character coding problem.  See the IPTC section of FAQ 10 (https://exiftool.org/faq.html#Q10) for help here.

It looks like your XMP has got invalid characters because you have copied them from IPTC without using the proper encoding.

I would suggest these steps to fix the problem:

1. Delete the IPTC entries from XMP (using the same incorrect encoding that they were added with)

2. Solve your IPTC encoding problems

3. Re-insert the IPTC back into XMP

- Phil
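
Before changing any files for step 2, it may help to confirm on a single string which round trip restores the text. This is a string-level Python sketch only, not ExifTool's own procedure. The two round trips it tries are assumptions based on the characters reported in this thread, and `candidate_repairs` is a made-up name.

```python
# (encoding the text was wrongly read as, encoding it was really written in)
ROUND_TRIPS = [
    ("cp1252", "utf-8"),       # UTF-8 bytes that were read as Windows-1252
    ("mac_roman", "latin-1"),  # Latin-1 bytes that were read as Mac Roman
]

def candidate_repairs(garbled):
    """Return (wrong, right, fixed) tuples for every round trip that works."""
    results = []
    for wrong, right in ROUND_TRIPS:
        try:
            fixed = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this round trip is impossible for the given text
        # Skip no-ops and results that contain control characters.
        if fixed != garbled and fixed.isprintable():
            results.append((wrong, right, fixed))
    return results

print(candidate_repairs("MÃ¼llenbach"))
print(candidate_repairs("M¸llenbach"))
```

The results are candidates for a human to review, not a guaranteed fix.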
Title: Re: Replace characters in all fields and set the Charset to UTF-8
Post by: Kugelblitz on November 13, 2018, 05:31:14 PM
Hello Phil, thank you very much for your reply.
It sounds like this is more of a manual task than something that can be automated.

How can I list all the files that contain any of these "wrong" characters, so that I know which files to work on?

ˆ
¸
ƒ
÷



Cheers
Title: Re: Replace characters in all fields and set the Charset to UTF-8
Post by: Phil Harvey on November 13, 2018, 07:23:52 PM
Very good question.  The character encoding is system dependent (à la FAQ 10), so your mileage may vary, but this works for me on the Mac:

> exiftool a.jpg b.jpg -filename -subject -if '$subject =~ /[‰ˆ¸ƒ÷‹ﬂ]/'
======== a.jpg
File Name                       : a.jpg
Subject                         : ƒ
    1 files failed condition


- Phil
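
The same check can be sketched in Python for values that have already been extracted, for example from ExifTool's JSON output (`-j`). This is only an illustration: the input dictionary and the function name `files_with_suspect_chars` are hypothetical, and reading the metadata itself is left out.

```python
import re

# The characters from the first list in this thread: ‰ ˆ ¸ ƒ ÷ ‹ ﬂ
SUSPECT = re.compile("[\u2030\u02c6\u00b8\u0192\u00f7\u2039\ufb02]")

def files_with_suspect_chars(subjects):
    """subjects maps a file name to its Subject string (hypothetical input)."""
    return [name for name, value in subjects.items() if SUSPECT.search(value)]

example = {
    "a.jpg": "\u0192",                        # "ƒ", a misread "Ä"
    "b.jpg": "Deutschland, Rheinland-Pfalz",  # plain "f" and "l" do not match
}
print(files_with_suspect_chars(example))
```

Writing the characters as escapes keeps the pattern independent of the encoding of the script file.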
Title: Re: Replace characters in all fields and set the Charset to UTF-8
Post by: Kugelblitz on November 18, 2018, 07:31:07 AM
I'd just like to add a quite helpful table here.


Table for Debugging Common UTF-8 Character Encoding Problems.

Unicode Win1252 Expected Actual UTF-8 Bytes | Unicode Win1252 Expected Actual UTF-8 Bytes
U+20AC 0x80 € â‚¬ %E2 %82 %AC | U+00C0 0xC0 À Ã€ %C3 %80
0x81 | U+00C1 0xC1 Á Ã %C3 %81
U+201A 0x82 ‚ â€š %E2 %80 %9A | U+00C2 0xC2 Â Ã‚ %C3 %82
U+0192 0x83 ƒ Æ’ %C6 %92 | U+00C3 0xC3 Ã Ãƒ %C3 %83
U+201E 0x84 „ â€ž %E2 %80 %9E | U+00C4 0xC4 Ä Ã„ %C3 %84
U+2026 0x85 … â€¦ %E2 %80 %A6 | U+00C5 0xC5 Å Ã… %C3 %85
U+2020 0x86 † â€  %E2 %80 %A0 | U+00C6 0xC6 Æ Ã† %C3 %86
U+2021 0x87 ‡ â€¡ %E2 %80 %A1 | U+00C7 0xC7 Ç Ã‡ %C3 %87
U+02C6 0x88 ˆ Ë† %CB %86 | U+00C8 0xC8 È Ãˆ %C3 %88
U+2030 0x89 ‰ â€° %E2 %80 %B0 | U+00C9 0xC9 É Ã‰ %C3 %89
U+0160 0x8A Š Å  %C5 %A0 | U+00CA 0xCA Ê ÃŠ %C3 %8A
U+2039 0x8B ‹ â€¹ %E2 %80 %B9 | U+00CB 0xCB Ë Ã‹ %C3 %8B
U+0152 0x8C Œ Å’ %C5 %92 | U+00CC 0xCC Ì ÃŒ %C3 %8C
0x8D | U+00CD 0xCD Í Ã %C3 %8D
U+017D 0x8E Ž Å½ %C5 %BD | U+00CE 0xCE Î ÃŽ %C3 %8E
0x8F | U+00CF 0xCF Ï Ã %C3 %8F
0x90 | U+00D0 0xD0 Ð Ã %C3 %90
U+2018 0x91 ‘ â€˜ %E2 %80 %98 | U+00D1 0xD1 Ñ Ã‘ %C3 %91
U+2019 0x92 ’ â€™ %E2 %80 %99 | U+00D2 0xD2 Ò Ã’ %C3 %92
U+201C 0x93 “ â€œ %E2 %80 %9C | U+00D3 0xD3 Ó Ã“ %C3 %93
U+201D 0x94 ” â€ %E2 %80 %9D | U+00D4 0xD4 Ô Ã” %C3 %94
U+2022 0x95 • â€¢ %E2 %80 %A2 | U+00D5 0xD5 Õ Ã• %C3 %95
U+2013 0x96 – â€“ %E2 %80 %93 | U+00D6 0xD6 Ö Ã– %C3 %96
U+2014 0x97 — â€” %E2 %80 %94 | U+00D7 0xD7 × Ã— %C3 %97
U+02DC 0x98 ˜ Ëœ %CB %9C | U+00D8 0xD8 Ø Ã˜ %C3 %98
U+2122 0x99 ™ â„¢ %E2 %84 %A2 | U+00D9 0xD9 Ù Ã™ %C3 %99
U+0161 0x9A š Å¡ %C5 %A1 | U+00DA 0xDA Ú Ãš %C3 %9A
U+203A 0x9B › â€º %E2 %80 %BA | U+00DB 0xDB Û Ã› %C3 %9B
U+0153 0x9C œ Å“ %C5 %93 | U+00DC 0xDC Ü Ãœ %C3 %9C
0x9D | U+00DD 0xDD Ý Ã %C3 %9D
U+017E 0x9E ž Å¾ %C5 %BE | U+00DE 0xDE Þ Ãž %C3 %9E
U+0178 0x9F Ÿ Å¸ %C5 %B8 | U+00DF 0xDF ß ÃŸ %C3 %9F
U+00A0 0xA0   Â  %C2 %A0 | U+00E0 0xE0 à Ã  %C3 %A0
U+00A1 0xA1 ¡ Â¡ %C2 %A1 | U+00E1 0xE1 á Ã¡ %C3 %A1
U+00A2 0xA2 ¢ Â¢ %C2 %A2 | U+00E2 0xE2 â Ã¢ %C3 %A2
U+00A3 0xA3 £ Â£ %C2 %A3 | U+00E3 0xE3 ã Ã£ %C3 %A3
U+00A4 0xA4 ¤ Â¤ %C2 %A4 | U+00E4 0xE4 ä Ã¤ %C3 %A4
U+00A5 0xA5 ¥ Â¥ %C2 %A5 | U+00E5 0xE5 å Ã¥ %C3 %A5
U+00A6 0xA6 ¦ Â¦ %C2 %A6 | U+00E6 0xE6 æ Ã¦ %C3 %A6
U+00A7 0xA7 § Â§ %C2 %A7 | U+00E7 0xE7 ç Ã§ %C3 %A7
U+00A8 0xA8 ¨ Â¨ %C2 %A8 | U+00E8 0xE8 è Ã¨ %C3 %A8
U+00A9 0xA9 © Â© %C2 %A9 | U+00E9 0xE9 é Ã© %C3 %A9
U+00AA 0xAA ª Âª %C2 %AA | U+00EA 0xEA ê Ãª %C3 %AA
U+00AB 0xAB « Â« %C2 %AB | U+00EB 0xEB ë Ã« %C3 %AB
U+00AC 0xAC ¬ Â¬ %C2 %AC | U+00EC 0xEC ì Ã¬ %C3 %AC
U+00AD 0xAD ­ Â­ %C2 %AD | U+00ED 0xED í Ã­ %C3 %AD
U+00AE 0xAE ® Â® %C2 %AE | U+00EE 0xEE î Ã® %C3 %AE
U+00AF 0xAF ¯ Â¯ %C2 %AF | U+00EF 0xEF ï Ã¯ %C3 %AF
U+00B0 0xB0 ° Â° %C2 %B0 | U+00F0 0xF0 ð Ã° %C3 %B0
U+00B1 0xB1 ± Â± %C2 %B1 | U+00F1 0xF1 ñ Ã± %C3 %B1
U+00B2 0xB2 ² Â² %C2 %B2 | U+00F2 0xF2 ò Ã² %C3 %B2
U+00B3 0xB3 ³ Â³ %C2 %B3 | U+00F3 0xF3 ó Ã³ %C3 %B3
U+00B4 0xB4 ´ Â´ %C2 %B4 | U+00F4 0xF4 ô Ã´ %C3 %B4
U+00B5 0xB5 µ Âµ %C2 %B5 | U+00F5 0xF5 õ Ãµ %C3 %B5
U+00B6 0xB6 ¶ Â¶ %C2 %B6 | U+00F6 0xF6 ö Ã¶ %C3 %B6
U+00B7 0xB7 · Â· %C2 %B7 | U+00F7 0xF7 ÷ Ã· %C3 %B7
U+00B8 0xB8 ¸ Â¸ %C2 %B8 | U+00F8 0xF8 ø Ã¸ %C3 %B8
U+00B9 0xB9 ¹ Â¹ %C2 %B9 | U+00F9 0xF9 ù Ã¹ %C3 %B9
U+00BA 0xBA º Âº %C2 %BA | U+00FA 0xFA ú Ãº %C3 %BA
U+00BB 0xBB » Â» %C2 %BB | U+00FB 0xFB û Ã» %C3 %BB
U+00BC 0xBC ¼ Â¼ %C2 %BC | U+00FC 0xFC ü Ã¼ %C3 %BC
U+00BD 0xBD ½ Â½ %C2 %BD | U+00FD 0xFD ý Ã½ %C3 %BD
U+00BE 0xBE ¾ Â¾ %C2 %BE | U+00FE 0xFE þ Ã¾ %C3 %BE
U+00BF 0xBF ¿ Â¿ %C2 %BF | U+00FF 0xFF ÿ Ã¿ %C3 %BF


Source: https://www.i18nqa.com/debug/utf8-debug.html
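
The rows of such a table can be recomputed rather than looked up. Below is a minimal Python sketch, assuming Python's own `cp1252` codec as the definition of Windows-1252 (it leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned). The function name `table_row` is made up for illustration.

```python
def table_row(byte_value):
    """Return (Unicode, Win1252, Expected, Actual, UTF-8 bytes) for one byte.

    Returns None for bytes that Python's cp1252 codec leaves unassigned.
    """
    try:
        expected = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None
    utf8 = expected.encode("utf-8")
    # "Actual" is what appears when the UTF-8 bytes are read as Windows-1252.
    # Unassigned bytes become U+FFFD here instead of raising an error.
    actual = utf8.decode("cp1252", errors="replace")
    return (
        f"U+{ord(expected):04X}",
        f"0x{byte_value:02X}",
        expected,
        actual,
        " ".join(f"%{b:02X}" for b in utf8),
    )

for value in (0x80, 0xDF, 0xFC):
    print(table_row(value))
```

Looping over `range(0x80, 0x100)` would produce every row of the table.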