Replace characters in all fields and set the Charset to UTF-8

Started by Kugelblitz, November 13, 2018, 09:29:39 AM


Kugelblitz

Hello,

I noticed that some German names appear twice in many fields, with the duplicate written using the wrong characters.



For example, the city Cologne is written "Köln" in German and appears twice in the metadata of some files.
Müllenbach is displayed as M¸llenbach or even MÃ¼llenbach, and Köln as Kˆln or even KÃ¶ln.

I would like to replace all of the following characters in all fields, because they appear in so many different fields, such as Keywords, Geo Names and Copyright:

‰ -> ä
ˆ -> ö
¸ -> ü
ƒ -> Ä
÷ -> Ö
‹ -> Ü
ﬂ -> ß

and also

Ã¤ -> ä
Ã¶ -> ö
Ã¼ -> ü
Ã„ -> Ä
Ã– -> Ö
Ãœ -> Ü
ÃŸ -> ß

And set the encoding to UTF-8 for all fields too.
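
As a side note for readers, the two lists above look like two different mis-decodings: Latin-1 bytes shown as Mac Roman (first list) and UTF-8 bytes shown as Windows-1252 (second list). The following is a minimal Python sketch under that assumption; the names `latin1_as_macroman`, `utf8_as_cp1252` and `repair` are hypothetical, and it only works on strings, it does not touch any metadata.

```python
GERMAN = "äöüÄÖÜß"

# Assumption: Latin-1 bytes displayed as Mac Roman produce the first list.
latin1_as_macroman = {c.encode("latin-1").decode("mac_roman"): c for c in GERMAN}

# Assumption: UTF-8 bytes displayed as Windows-1252 produce the second list.
utf8_as_cp1252 = {c.encode("utf-8").decode("cp1252"): c for c in GERMAN}


def repair(text):
    """Replace the known wrong sequences; leave everything else unchanged."""
    # Two-character sequences first, then the single wrong characters.
    for bad, good in utf8_as_cp1252.items():
        text = text.replace(bad, good)
    return text.translate(str.maketrans(latin1_as_macroman))


print(repair("M¸llenbach"), repair("KÃ¶ln"), repair("Kˆln"))
# Müllenbach Köln Köln
```

Note that a blind replacement like this would also change a legitimate ÷ or ‰, so it is only safe on fields where those characters are not expected.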

How can I check which charset has been used in the files?

When I drag and drop files onto exiftool, this is the output:
---- ExifTool ----
ExifToolVersion                 : 11.17
---- XMP ----
XMPToolkit                      : Image::ExifTool 10.96
CountryCode                     : DEU
Location                        : N├╝rburg
Subject                         : Deutschland, geotagged, M├╝llenbach, Rheinland-Pfalz, DEU, Deutschland, N├╝rburg, Rheinland-Pfalz
Country                         : Deutschland
State                           : Rheinland-Pfalz
CreatorTool                     : 1.01
---- IPTC ----
ApplicationRecordVersion        : 4
Keywords                        : Deutschland, geotagged, M├â┬╝llenbach, Rheinland-Pfalz, DEU, Deuts
City                            : N├╝rburg
Sub-location                    : N├╝rburg
Province-State                  : Rheinland-Pfalz
Country-PrimaryLocationCode     : DEU
Country-PrimaryLocationName     : Deutschland


or

---- ExifTool ----
ExifToolVersion                 : 11.17
---- XMP ----
XMPToolkit                      : Image::ExifTool 10.96
CountryCode                     : DEU
Location                        : M├╝llenbach
Rights                          : 2009 mARTin Bierschenk
Subject                         : Deutschland, geotagged, M┬©llenbach, M├╝llenbach, Rheinland-Pfalz
Country                         : Deutschland
State                           : Rheinland-Pfalz
CreatorTool                     : 1.01
---- IPTC ----
EnvelopeRecordVersion           : 4
CodedCharacterSet               : UTF8
ApplicationRecordVersion        : 4
Keywords                        : Deutschland, geotagged, M┬©llenbach, M├╝llenbach, Rheinland-Pfal
City                            : M┬©llenbach
Sub-location                    : M┬©llenbach
Province-State                  : Rheinland-Pfalz
Country-PrimaryLocationCode     : DEU
Country-PrimaryLocationName     : Deutschland
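
The box-drawing characters in the output above may come from the console rather than from the files. Assuming the console used code page 850 while exiftool printed UTF-8 (a guess based on the characters shown, not something stated in the thread), this small Python sketch re-reads three of the displayed values. The helper name `reread` is hypothetical.

```python
def reread(text, shown_as, stored_as):
    """Undo one wrong decoding: turn `text` back into bytes with the codec
    that displayed it, then decode those bytes with the codec they were in."""
    return text.encode(shown_as).decode(stored_as)


# Assumption: console = code page 850, exiftool output = UTF-8.
location = reread("N├╝rburg", "cp850", "utf-8")
keyword = reread("M├â┬╝llenbach", "cp850", "utf-8")
city = reread("M┬©llenbach", "cp850", "utf-8")

print(location)  # Nürburg
print(keyword)   # MÃ¼llenbach
print(city)      # M¸llenbach

# The two stored values that are still wrong match the two replacement lists.
print(reread(keyword, "cp1252", "utf-8"))     # Müllenbach
print(reread(city, "mac_roman", "latin-1"))   # Müllenbach
```

If this reading is right, the XMP Location in the first file is stored correctly and only looks wrong in the console, while the values containing Ã¼ or ¸ really are stored with wrong characters.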


Phil Harvey

The first step is to sort out your IPTC character coding problem.  See the IPTC section of FAQ 10 for help here.

It looks like your XMP has got invalid characters because you have copied them from IPTC without using the proper encoding.

I would suggest these steps to fix the problem:

1. Delete the IPTC entries from XMP (using the same incorrect encoding that they were added with)

2. Solve your IPTC encoding problems

3. Re-insert the IPTC back into XMP

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Kugelblitz

Hello Phil, thank you very much for your reply.
That sounds like more of a manual task than an automated one.

How can I list all the files that contain any of these "wrong" characters, so that I know which files to work on?

ˆ
¸
ƒ
÷



Cheers

Phil Harvey

Very good question.  The character encoding is system dependent (à la FAQ 10), so your mileage may vary, but this works for me on the Mac:

> exiftool a.jpg b.jpg -filename -subject -if '$subject =~ /[‰ˆ¸ƒ÷‹ﬂ]/'
======== a.jpg
File Name                       : a.jpg
Subject                         : ƒ
    1 files failed condition


- Phil
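
Where the shell quoting or console encoding gets in the way, the same check could be run on exiftool's JSON output instead (the -j option). The following Python sketch assumes that output has been read into a string; the function name `files_with_suspect_text` and the sample data are hypothetical and only illustrate the shape of the check.

```python
import json
import re

# The wrong characters from the first replacement list in this thread.
SUSPECT = re.compile("[‰ˆ¸ƒ÷‹ﬂ]")


def files_with_suspect_text(exiftool_json, tag="Subject"):
    """Return the SourceFile names whose `tag` value has a suspect character."""
    hits = []
    for entry in json.loads(exiftool_json):
        value = entry.get(tag, "")
        # A list tag may arrive as an array or as a single plain value.
        items = value if isinstance(value, list) else [value]
        if any(SUSPECT.search(str(item)) for item in items):
            hits.append(entry["SourceFile"])
    return hits


# Hypothetical sample in the shape of `exiftool -j -subject` output.
sample = json.dumps([
    {"SourceFile": "a.jpg", "Subject": ["Deutschland", "M¸llenbach"]},
    {"SourceFile": "b.jpg", "Subject": "Müllenbach"},
])
print(files_with_suspect_text(sample))  # ['a.jpg']
```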

Kugelblitz

I'd just like to add a quite helpful table here.


Table for Debugging Common UTF-8 Character Encoding Problems.

Unicode  Win1252  Expected  Actual  UTF-8 Bytes | Unicode  Win1252  Expected  Actual  UTF-8 Bytes
U+20AC   0x80     €         â‚¬     %E2 %82 %AC | U+00C0   0xC0     À         Ã€      %C3 %80
         0x81                                   | U+00C1   0xC1     Á         Ã       %C3 %81
U+201A   0x82     ‚         â€š     %E2 %80 %9A | U+00C2   0xC2     Â         Ã‚      %C3 %82
U+0192   0x83     ƒ         Æ’      %C6 %92     | U+00C3   0xC3     Ã         Ãƒ      %C3 %83
U+201E   0x84     „         â€ž     %E2 %80 %9E | U+00C4   0xC4     Ä         Ã„      %C3 %84
U+2026   0x85     …         â€¦     %E2 %80 %A6 | U+00C5   0xC5     Å         Ã…      %C3 %85
U+2020   0x86     †         â€      %E2 %80 %A0 | U+00C6   0xC6     Æ         Ã†      %C3 %86
U+2021   0x87     ‡         â€¡     %E2 %80 %A1 | U+00C7   0xC7     Ç         Ã‡      %C3 %87
U+02C6   0x88     ˆ         Ë†      %CB %86     | U+00C8   0xC8     È         Ãˆ      %C3 %88
U+2030   0x89     ‰         â€°     %E2 %80 %B0 | U+00C9   0xC9     É         Ã‰      %C3 %89
U+0160   0x8A     Š         Å       %C5 %A0     | U+00CA   0xCA     Ê         ÃŠ      %C3 %8A
U+2039   0x8B     ‹         â€¹     %E2 %80 %B9 | U+00CB   0xCB     Ë         Ã‹      %C3 %8B
U+0152   0x8C     Œ         Å’      %C5 %92     | U+00CC   0xCC     Ì         ÃŒ      %C3 %8C
         0x8D                                   | U+00CD   0xCD     Í         Ã       %C3 %8D
U+017D   0x8E     Ž         Å½      %C5 %BD     | U+00CE   0xCE     Î         ÃŽ      %C3 %8E
         0x8F                                   | U+00CF   0xCF     Ï         Ã       %C3 %8F
         0x90                                   | U+00D0   0xD0     Ð         Ã       %C3 %90
U+2018   0x91     ‘         â€˜     %E2 %80 %98 | U+00D1   0xD1     Ñ         Ã‘      %C3 %91
U+2019   0x92     ’         â€™     %E2 %80 %99 | U+00D2   0xD2     Ò         Ã’      %C3 %92
U+201C   0x93     “         â€œ     %E2 %80 %9C | U+00D3   0xD3     Ó         Ã“      %C3 %93
U+201D   0x94     ”         â€      %E2 %80 %9D | U+00D4   0xD4     Ô         Ã”      %C3 %94
U+2022   0x95     •         â€¢     %E2 %80 %A2 | U+00D5   0xD5     Õ         Ã•      %C3 %95
U+2013   0x96     –         â€“     %E2 %80 %93 | U+00D6   0xD6     Ö         Ã–      %C3 %96
U+2014   0x97     —         â€”     %E2 %80 %94 | U+00D7   0xD7     ×         Ã—      %C3 %97
U+02DC   0x98     ˜         Ëœ      %CB %9C     | U+00D8   0xD8     Ø         Ã˜      %C3 %98
U+2122   0x99     ™         â„¢     %E2 %84 %A2 | U+00D9   0xD9     Ù         Ã™      %C3 %99
U+0161   0x9A     š         Å¡      %C5 %A1     | U+00DA   0xDA     Ú         Ãš      %C3 %9A
U+203A   0x9B     ›         â€º     %E2 %80 %BA | U+00DB   0xDB     Û         Ã›      %C3 %9B
U+0153   0x9C     œ         Å“      %C5 %93     | U+00DC   0xDC     Ü         Ãœ      %C3 %9C
         0x9D                                   | U+00DD   0xDD     Ý         Ã       %C3 %9D
U+017E   0x9E     ž         Å¾      %C5 %BE     | U+00DE   0xDE     Þ         Ãž      %C3 %9E
U+0178   0x9F     Ÿ         Å¸      %C5 %B8     | U+00DF   0xDF     ß         ÃŸ      %C3 %9F
U+00A0   0xA0               Â       %C2 %A0     | U+00E0   0xE0     à         Ã       %C3 %A0
U+00A1   0xA1     ¡         Â¡      %C2 %A1     | U+00E1   0xE1     á         Ã¡      %C3 %A1
U+00A2   0xA2     ¢         Â¢      %C2 %A2     | U+00E2   0xE2     â         Ã¢      %C3 %A2
U+00A3   0xA3     £         Â£      %C2 %A3     | U+00E3   0xE3     ã         Ã£      %C3 %A3
U+00A4   0xA4     ¤         Â¤      %C2 %A4     | U+00E4   0xE4     ä         Ã¤      %C3 %A4
U+00A5   0xA5     ¥         Â¥      %C2 %A5     | U+00E5   0xE5     å         Ã¥      %C3 %A5
U+00A6   0xA6     ¦         Â¦      %C2 %A6     | U+00E6   0xE6     æ         Ã¦      %C3 %A6
U+00A7   0xA7     §         Â§      %C2 %A7     | U+00E7   0xE7     ç         Ã§      %C3 %A7
U+00A8   0xA8     ¨         Â¨      %C2 %A8     | U+00E8   0xE8     è         Ã¨      %C3 %A8
U+00A9   0xA9     ©         Â©      %C2 %A9     | U+00E9   0xE9     é         Ã©      %C3 %A9
U+00AA   0xAA     ª         Âª      %C2 %AA     | U+00EA   0xEA     ê         Ãª      %C3 %AA
U+00AB   0xAB     «         Â«      %C2 %AB     | U+00EB   0xEB     ë         Ã«      %C3 %AB
U+00AC   0xAC     ¬         Â¬      %C2 %AC     | U+00EC   0xEC     ì         Ã¬      %C3 %AC
U+00AD   0xAD     ­         Â­      %C2 %AD     | U+00ED   0xED     í         Ã­      %C3 %AD
U+00AE   0xAE     ®         Â®      %C2 %AE     | U+00EE   0xEE     î         Ã®      %C3 %AE
U+00AF   0xAF     ¯         Â¯      %C2 %AF     | U+00EF   0xEF     ï         Ã¯      %C3 %AF
U+00B0   0xB0     °         Â°      %C2 %B0     | U+00F0   0xF0     ð         Ã°      %C3 %B0
U+00B1   0xB1     ±         Â±      %C2 %B1     | U+00F1   0xF1     ñ         Ã±      %C3 %B1
U+00B2   0xB2     ²         Â²      %C2 %B2     | U+00F2   0xF2     ò         Ã²      %C3 %B2
U+00B3   0xB3     ³         Â³      %C2 %B3     | U+00F3   0xF3     ó         Ã³      %C3 %B3
U+00B4   0xB4     ´         Â´      %C2 %B4     | U+00F4   0xF4     ô         Ã´      %C3 %B4
U+00B5   0xB5     µ         Âµ      %C2 %B5     | U+00F5   0xF5     õ         Ãµ      %C3 %B5
U+00B6   0xB6     ¶         Â¶      %C2 %B6     | U+00F6   0xF6     ö         Ã¶      %C3 %B6
U+00B7   0xB7     ·         Â·      %C2 %B7     | U+00F7   0xF7     ÷         Ã·      %C3 %B7
U+00B8   0xB8     ¸         Â¸      %C2 %B8     | U+00F8   0xF8     ø         Ã¸      %C3 %B8
U+00B9   0xB9     ¹         Â¹      %C2 %B9     | U+00F9   0xF9     ù         Ã¹      %C3 %B9
U+00BA   0xBA     º         Âº      %C2 %BA     | U+00FA   0xFA     ú         Ãº      %C3 %BA
U+00BB   0xBB     »         Â»      %C2 %BB     | U+00FB   0xFB     û         Ã»      %C3 %BB
U+00BC   0xBC     ¼         Â¼      %C2 %BC     | U+00FC   0xFC     ü         Ã¼      %C3 %BC
U+00BD   0xBD     ½         Â½      %C2 %BD     | U+00FD   0xFD     ý         Ã½      %C3 %BD
U+00BE   0xBE     ¾         Â¾      %C2 %BE     | U+00FE   0xFE     þ         Ã¾      %C3 %BE
U+00BF   0xBF     ¿         Â¿      %C2 %BF     | U+00FF   0xFF     ÿ         Ã¿      %C3 %BF


Source: https://www.i18nqa.com/debug/utf8-debug.html
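
The "Actual" column appears to be what you get when a character's UTF-8 bytes are decoded as Windows-1252. A minimal Python sketch under that assumption follows; the function name `debug_row` is hypothetical, and dropping undefined bytes with errors="ignore" is a choice made here to mimic rows where only one character is visible.

```python
def debug_row(code_point):
    """Return (Unicode, Expected, Actual, UTF-8 Bytes) for one character."""
    char = chr(code_point)
    raw = char.encode("utf-8")
    # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's
    # cp1252 codec; errors="ignore" drops them instead of raising.
    actual = raw.decode("cp1252", errors="ignore")
    utf8_bytes = " ".join(f"%{b:02X}" for b in raw)
    return f"U+{code_point:04X}", char, actual, utf8_bytes


for cp in (0x20AC, 0x00C4, 0x00DC, 0x00FC):
    print(*debug_row(cp))
# U+20AC € â‚¬ %E2 %82 %AC
# U+00C4 Ä Ã„ %C3 %84
# U+00DC Ü Ãœ %C3 %9C
# U+00FC ü Ã¼ %C3 %BC
```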