Repair metadata errors and bad chars

Started by Lanthony, March 29, 2021, 07:12:33 AM


Lanthony

PRELUDE:

Over the last twenty-plus years I have been digitising photos and slides using various scanners and whatever application software came with them. This was done mainly with Windows XP and later Windows 7.

Eventually digital cameras and now smartphones have relegated all that to the scrap heap.

Currently the main (new) computer runs Windows 10 and holds all the digitised media (with multiple backups).

Note: I also have Linux Mint on the old computer for experimenting with ExifTool.

While using ffmpeg with some of the photos (and videos) I found the results OK but not exactly as expected (still learning to use that as well).

PROBLEM:

Long story short, it seems the scanner software was very recalcitrant in what it did, especially when it allowed adding extra details like comments.

All the supplied software and its behaviours were left behind when I migrated to Windows 7. Until now, that is, with ExifTool exposing them all.

exiftool -r -s -G -csv -sort DIR > all.csv

or, since the majority are JPGs:

exiftool -r -s -G -csv -sort -ext jpg DIR > alljpg.csv

I opened the CSV file in LibreOffice Calc and could immediately see odd characters in many places.
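In hindsight, a quicker way to locate the rows with invalid bytes would have been to check on Linux first (a sketch, assuming GNU grep and iconv are available):

grep -naxv '.*' alljpg.csv                       # print lines that are not valid UTF-8
iconv -f UTF-8 -t UTF-8 alljpg.csv > /dev/null   # report the offset of the first bad byte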

1. Some bad chars were in the rows associated with an image, and in many cases they can be linked to the software used for scanning and subsequent editing. But not all of them can be. Some turned up even with modern software on both Windows 10 and Linux. Then again, they might have been there in the first place (maybe; still looking).

2. Some bad characters were found scattered in the first column (SourceFile). Maybe they leaked from the previous image, but it was easy enough to delete those rows.

I opened the CSV file in the standard Linux Mint text editor, and it made very sure I knew that the file had bad chars and was not to be used or saved.

I tracked the bad chars to various columns, whose content could then be cleared (in Calc), leaving only the header, which satisfied the text editor.

My thought experiment was to use said CSV file to update the images after deleting everything:
exiftool -r -ext jpg -all= DIR
then
exiftool -r -ext jpg -csv=alljpg.csv DIR
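A slightly safer variant of the same experiment might be (a sketch; I worked on copies, since -all= discards anything the CSV cannot restore, e.g. maker notes):

exiftool -r -ext jpg -all= --icc_profile DIR     # wipe metadata but keep the colour profile
exiftool -r -ext jpg -csv=alljpg.csv DIR         # restore tags; rows are matched by the SourceFile column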

That way it would repair any images. But no.

exiftool -validate -error -warning -a DIR

did not report any errors, only warnings flagged as [minor].
The result was the same before and after the commands above.
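To narrow down which files actually carry the warnings, something like this should work (a sketch):

exiftool -r -ext jpg -validate -if '$warning' -filename -warning -a DIR   # list only files with warnings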


QUESTION:

Is it possible for ExifTool to repair, or even delete, the errors (bad chars)? Or anything else that was not done right?

exiftool -r -all= -tagsfromfile @ -all:all -unsafe -icc_profile DIR

It still left bad chars in the CSV file.
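If I understand correctly, the rebuild copies tag values verbatim, so garbage inside a value survives it. Should a single tag turn out to hold the garbage, deleting just that tag may be enough (a sketch; XPComment is only a hypothetical example):

exiftool -r -ext jpg -XPComment= DIR             # delete one offending tag throughout the tree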


END:

The most important fields are:

DateTimeOriginal (if any)
comments within old photos (if any)

Make and Model would be advantageous, to help identify who took a photo (by the camera used), the event location, etc.

And of course all location data from GPS-enabled devices.

If all goes well, I will then add (change) other fields in the CSV file to update each image, e.g. dates to reflect when a photo was taken rather than when it was scanned.
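For that step, if I understand the -csv option correctly, the edited CSV should keep only the SourceFile column plus the columns to be changed, since every column present gets written (alljpg_fixed.csv is a hypothetical name):

exiftool -r -ext jpg -csv=alljpg_fixed.csv DIR   # writes only the tags present as columns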


StarGeek

What are these "bad characters"?  This is very vague and gives us no information to work with.  Can you provide an example image with this problem?

Also, have you made adjustments for the command line code page?  See FAQ #18.  Windows has extremely poor support for non-ASCII characters, and UTF-8 characters can show up as weird even though the actual data in the file is correct.  This is because Windows will display a two-byte, one-character symbol as two characters, something like Â© instead of just © for the copyright symbol.
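Something like this on the Windows side (a minimal sketch, assuming a console font with Unicode coverage):

rem switch the console code page to UTF-8 before running exiftool
chcp 65001
exiftool file.jpg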

Try checking one of these files on the Linux machine to see if they're actually bad.  Also check with a program that has good metadata support, such as Adobe Bridge or digiKam, both of which are free.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

Lanthony

As mentioned, I use Linux.

I use Linux with ExifTool installed, so as to learn what it can and cannot do.

I use Linux with a copy of the photos, so any mistakes (by me) do not cause any long-term grief.

Previously all my workflow was with a progression of Windows versions, and currently all photos are stored on the latest Windows machine.

I am aware of Windows behaviours when using the CLI, but if you mean that using and storing on a Windows machine is the problem, well, anything is possible in the digital age.

digiKam:
I have used it in the past, and it was one of those apps that tries to be everything for everybody, and very annoying at that. I eventually had to remove it due to behaviour that could not be tamed.

I will have to resurrect it now to see if it can recognise my problems and help with answers.

BTW: did I mention that I use Linux?

quickshot

Quote from: StarGeek on March 29, 2021, 11:20:30 AM
What are these "bad characters"?
I'm not the original poster, but I realized that Adobe Photoshop Lightroom seems to write invalid character encodings in RDF, too.
For example, exiftool outputs "LÃ¶ffelstÃ¶r" for the title when it should read "Löffelstör" ("ö" is code point $00F6 / \u00F6 / UTF-8 \xC3\xB6, and "Ã" is \xC3 and "¶" is \xB6), so it seems UTF-8 was written, but not read as UTF-8.
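The double encoding is easy to reproduce with iconv (a sketch; assumes a UTF-8 terminal):

printf 'Löffelstör' | iconv -f LATIN1 -t UTF-8   # prints "LÃ¶ffelstÃ¶r": UTF-8 bytes re-read as Latin-1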

The original looks like this:
  <dc:title>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>LÃ¶ffelstÃ¶r</rdf:li>
   </rdf:Alt>
  </dc:title>

(https://www.w3.org/TR/rdf12-concepts/ states "RDF uses Unicode [Unicode] as the fundamental representation for string values. Within this, and related specifications, the term string, or RDF string, is used to describe an ordered sequence of zero or more Unicode code points which are Unicode scalar values. Unicode scalar values do not include the surrogate code points. Note that most concrete RDF syntaxes require the use of the UTF-8 character encoding [RFC3629], and use the \u0000 or \U00000000 forms to express certain non-character values.")

But exiftool has different defaults, it seems:
             TYPE       Description                                  Default
             ---------  -------------------------------------------  -------
             EXIF       Internal encoding of EXIF "ASCII" strings    (none)
             ID3        Internal encoding of ID3v1 information       Latin
             IPTC       Internal IPTC encoding to assume when        Latin
                         IPTC:CodedCharacterSet is not defined
             Photoshop  Internal encoding of Photoshop IRB strings   Latin
             QuickTime  Internal encoding of QuickTime strings       MacRoman
             RIFF       Internal encoding of RIFF strings            0
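If the bytes are actually UTF-8, overriding that default at read time should recover the text; for the IPTC case, for example (a sketch):

exiftool -charset iptc=utf8 -IPTC:Caption-Abstract file.jpg   # assume UTF-8 despite the missing CodedCharacterSet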

StarGeek

Quote from: quickshot on January 29, 2025, 05:54:37 PM
Quote from: StarGeek on March 29, 2021, 11:20:30 AM
What are these "bad characters"?
I'm not the original poster, but I realized that Adobe Photoshop Lightroom seems to write invalid character encodings in RDF, too.
...
The original looks like this:
  <dc:title>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>LÃ¶ffelstÃ¶r</rdf:li>
   </rdf:Alt>
  </dc:title>

Based upon this listing and that you are using LightRoom, I'm guessing that you are looking at the "Raw Data"? I don't use LR, so I don't know exactly what you are looking at, but I'm assuming it's similar to what Adobe Bridge displays.


The thing to understand about the "Raw Data" is that you are not seeing the data that is actually in the file, but LightRoom's interpretation of the data as it would be written to an XMP sidecar. There might not be any actual XMP data in the file, as is the case with the above example.

In this file, I have written "Löffelstör" to the IPTC:Caption-Abstract. There is no XMP data in the file:
C:\>exiftool -G1 -a -s -IPTC:All -XMP:All y:\!temp\Test4.jpg
[IPTC]          Caption-Abstract                : Löffelstör
[IPTC]          ApplicationRecordVersion        : 4

So LightRoom is reading the Caption-Abstract and placing the value in the corresponding XMP location XMP-dc:Description in the Raw Data display.

As you can see in the exiftool output, the IPTC:CodedCharacterSet tag has not been written to the file. This means that the data is not supposed to be interpreted as UTF8 data. Instead, according to the specs, it is to be interpreted as ISO 646 IRV (7 bits) or ISO 4873 DV (8 bits). See this post where I went digging around in the original IPTC IIM specs.

From what I could tell, Latin encoding seems to be the closest to those ISO specs. So technically, LightRoom is reading the data correctly. The data has probably been written incorrectly, having been written to the IPTC IIM locations without using the proper CodedCharacterSet.
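If that diagnosis is right, one possible repair is to re-read the IPTC as UTF-8 and rewrite it with the proper CodedCharacterSet (a sketch only; test on a copy first):

exiftool -charset iptc=utf8 -tagsfromfile @ -iptc:all -IPTC:CodedCharacterSet=UTF8 file.jpg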

Double-check your example file with:
exiftool -G1 -a -s -IPTC:All -XMP:All -IPTCDigest -Warning file.jpg

I've included IPTCDigest and Warning in the command because if the IPTCDigest is not current, exiftool will give a warning that says "IPTCDigest is not current". When that is the case, LR will ignore any XMP data and only read the IPTC data. That's a whole different issue, and you can search these forums for my comments on the IPTCDigest.
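And if the digest turns out to be stale while the IPTC data is the part you trust, it can be brought up to date by writing the special value "new" (a sketch; check the IPTCDigest documentation first):

exiftool -IPTCDigest=new file.jpg                # recompute the digest from the current IPTC data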
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype