Verifying metadata before operations, repair options

Started by Jabber, February 26, 2025, 01:19:14 AM

Jabber

Currently using exiftool to automate keyword and caption generation using local AI models. Nothing but praise for it.

Context:

To mitigate potential issues with reading images, I have set exiftool to run validation while extracting the metadata. If non-minor errors are found, the image is skipped from processing, based on a check of the ExifTool:Validate entry.

However, it seems various image storage providers export images en masse with corrupt or invalid metadata, which causes validation to fail. Research has led me to:

exiftool -all= -tagsfromfile @ -all:all -unsafe bad.jpg
Questions:

  • Is this recommended as a method to naively repair invalid metadata?
  • Is this only applicable to jpegs?
  • Can one do this via "-ext jpg jpeg -r ."?
  • Any recommendations or thoughts?

I appreciate the help.

Phil Harvey

Quote from: Jabber on February 26, 2025, 01:19:14 AM
1. Is this recommended as a method to naively repair invalid metadata?

No.  This method will drop any unknown metadata.  It is better taken on a case-by-case basis.  If the errors are in the EXIF, then rebuilding only the EXIF makes sense, but will still lose unknown/proprietary EXIF metadata.

Quote
2. Is this only applicable to jpegs?

Mainly.  It should also work for HEIC files, but I haven't had occasion to try it on one yet.

Quote
3. Can one do this via "-ext jpg jpeg -r ."?

Yes.

Quote
4. Any recommendations or thoughts?

You should get a feeling for what command is best for the type of damage you typically see.  Try comparing a file before and after repair using this command to see if the result is acceptable:

exiftool -u test.jpg -diff test.jpg_original

- Phil

StarGeek

Quote from: Jabber on February 26, 2025, 01:19:14 AM
Is this recommended as a method to naively repair invalid metadata?

It should never be used on TIFF or TIFF based RAW files (such as NEF, CR2, etc). See the note under FAQ #20.

Quote
Can one do this via "-ext jpg jpeg -r ."?

Not quite. You have to repeat the -ext (-extension) option for every extension:
-ext jpg -ext jpeg
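
For anyone driving exiftool from a script, the repeated -ext form can be generated from a list of extensions. This is only a minimal Python sketch: build_rebuild_command and TIFF_BASED are hypothetical names, and the TIFF_BASED set is deliberately incomplete, covering only the file types named in this thread. It builds the argument list and does not run anything.

```python
# Extensions this thread says the rebuild must never be run on (TIFF and
# TIFF-based RAW). Assumption: this is NOT an exhaustive list.
TIFF_BASED = {"tif", "tiff", "nef", "cr2"}


def build_rebuild_command(extensions, directory="."):
    """Return the exiftool argument list for a recursive metadata rebuild."""
    cmd = ["exiftool", "-all=", "-tagsfromfile", "@", "-all:all", "-unsafe"]
    for ext in extensions:
        ext = ext.lower().lstrip(".")
        if ext in TIFF_BASED:
            raise ValueError(f"refusing to rebuild TIFF-based type: {ext}")
        # -ext has to be given once per extension
        cmd += ["-ext", ext]
    cmd += ["-r", directory]
    return cmd


print(" ".join(build_rebuild_command(["jpg", "jpeg"])))
```

Keeping the command as a list (rather than one string) means it can be handed to subprocess.run without any shell quoting concerns.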

Quote from: Jabber on February 26, 2025, 01:19:14 AM
Any recommendations or thoughts?

I still would like an option to append the AI description to the Description tag. My workflow already adds some basic data to the Description and this gets overwritten. The AI description will never be the final result for me, but it will give me some useful information initially so I can still search the images until I find time to write the final Description.

Also, I use Tesseract to embed OCR'd text. But the files (game screenshots) also include images, and the Tesseract OCR text gets clobbered. An example image can be seen in this post. The AI description would be useful, but I would also need the OCR text.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

Jabber

Currently the AI will add a brief caption and keywords in one step or a detailed caption and keywords in two steps to XMP:Description. It currently will overwrite the entry. Do you want to keep the old entry and add another one? I suppose an option could be added for that.

MiniCPM-V 2.6 is pretty good at OCR. The models are getting better at it all the time. There is a major problem though with having a vision capable language model do OCR work: they tend to fill in spaces with data that makes sense but isn't there if they can't read something.

Here is a little script I made that can help test the capability.

Here is what MiniCPM-V 2.6 got from your image:



**Title:**
Witch's Wits

**Body Text:**
Without a warning, an old, withered woman has appeared at the gate. Nobody saw her arrive, but some explain that there are rumors about an eccentric, old lady living in the woods.

The woman cackles. "Interested in riddles? Answer this, [player name], arrive unasked, unseen, and when I do ask early, I bring death with me. What am I?"

**Options:**
1. Frost
2. Poison
3. Bad luck
4. A snake

**Highlighted Answer:**
2. Poison

StarGeek

Quote from: Jabber on February 26, 2025, 01:24:37 PM
Do you want to keep the old entry and add another one? I suppose an option could be added for that.

It would just append the AI description to the current one, with at least a couple of line feeds to separate them. On the command line it would be:
exiftool -escapeC "-Description<${Description}\n\nAI Generated Description" /path/to/files/
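
Inside a tagging script, the same append could be done on the strings before anything is written back. The following is a minimal Python sketch, not the way LLavaImageTagger does it; append_description is a hypothetical helper, and the duplicate check is an added assumption so that re-running the tagger does not stack the same text twice.

```python
def append_description(existing, ai_text, separator="\n\n"):
    """Append ai_text to an existing description instead of overwriting it."""
    existing = (existing or "").rstrip()
    ai_text = ai_text.strip()
    if not existing:
        # Nothing to preserve, so the AI text becomes the description.
        return ai_text
    if ai_text in existing:
        # Already appended on an earlier run; leave the description alone.
        return existing
    return existing + separator + ai_text


print(append_description("Screenshot, chapter 3", "An old woman stands at a gate."))
```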

Quote
MiniCPM-V 2.6 is pretty good at OCR. The models are getting better at it all the time. There is a major problem though with having a vision capable language model do OCR work: they tend to fill in spaces with data that makes sense but isn't there if they can't read something.

I figured as much, but I would think that tesseract would be faster. I just tested the tesseract config on 1,500 files and it processed about 2 files/second.

Jabber

Let me see what I can do regarding your feature request. I am in the middle of refactoring it to not use a JSON datastore for file status, so I think I can put that in.

For reference, with the refactor it currently takes an average of 1.39s per image to generate a short caption and keywords (150 tokens total). Of course this is highly variable depending on the images, the model, and whether it gets a good generation on the first try, but the speed on a 5-year-old video card is pretty impressive. This is on Windows with an NVIDIA 3080 and an Intel i3-12100F CPU.

StarGeek

One other thing that you might take into consideration. The Validate tag (-api Validate option) is a bit overly sensitive for casual use. It will list warnings for items that won't impact the actual viewing/processing of an image, nor will it affect writing data to the file.

For example
C:\>exiftool -g1 -a -s -warning -validate Y:\!temp\x\y\z\Test4.jpg
---- ExifTool ----
Warning                         : Wrong IFD for 0x9003 DateTimeOriginal (should be ExifIFD not IFD0)
Warning                         : Missing required JPEG ExifIFD tag 0x9101 ComponentsConfiguration
Warning                         : Missing required JPEG ExifIFD tag 0xa000 FlashpixVersion
Warning                         : Missing required JPEG ExifIFD tag 0xa002 ExifImageWidth
Warning                         : Missing required JPEG ExifIFD tag 0xa003 ExifImageHeight
Warning                         : [minor] IFD0 tag 0x0100 ImageWidth is not allowed in JPEG
Warning                         : [minor] IFD0 tag 0x0101 ImageHeight is not allowed in JPEG
Warning                         : [minor] Missing required JPEG IFD0 tag 0x0213 YCbCrPositioning
Validate                        : 8 Warnings (3 minor)

None of these would affect the viewing or processing of this image. The first is that a tag is in the wrong location. This may or may not affect the reading of this data, depending upon the program, but wouldn't affect any other operation.

The "Missing required" warnings are tags that the JPEG specs require, but are otherwise ignored by every program I've used.

And the "not allowed" warnings are tags that aren't supposed to be there, but often are created by some programs during conversions, i.e. when a program might convert a TIFF file (which requires these tags) into a JPEG.
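
The three kinds of warning described above can be told apart by their wording. As a rough Python sketch (categorize and is_minor are hypothetical helpers, and the string matching assumes warnings are phrased exactly like the sample output in this post), the eight warnings break down like this:

```python
from collections import Counter

# Warning strings copied from the exiftool output quoted above.
SAMPLE = [
    "Wrong IFD for 0x9003 DateTimeOriginal (should be ExifIFD not IFD0)",
    "Missing required JPEG ExifIFD tag 0x9101 ComponentsConfiguration",
    "Missing required JPEG ExifIFD tag 0xa000 FlashpixVersion",
    "Missing required JPEG ExifIFD tag 0xa002 ExifImageWidth",
    "Missing required JPEG ExifIFD tag 0xa003 ExifImageHeight",
    "[minor] IFD0 tag 0x0100 ImageWidth is not allowed in JPEG",
    "[minor] IFD0 tag 0x0101 ImageHeight is not allowed in JPEG",
    "[minor] Missing required JPEG IFD0 tag 0x0213 YCbCrPositioning",
]

MINOR = "[minor]"


def is_minor(text):
    """True if exiftool flagged this warning as minor."""
    return text.startswith(MINOR)


def categorize(text):
    """Map a warning string to one of the kinds discussed in this post."""
    body = text[len(MINOR):].strip() if is_minor(text) else text
    if body.startswith("Wrong IFD"):
        return "wrong location"
    if body.startswith("Missing required"):
        return "missing required"
    if "is not allowed" in body:
        return "not allowed"
    return "other"


counts = Counter(categorize(w) for w in SAMPLE)
minor_count = sum(is_minor(w) for w in SAMPLE)
print(dict(counts), "minor:", minor_count)
```

The minor count of 3 agrees with the "8 Warnings (3 minor)" summary in the Validate line.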

Instead of checking for Warning with the Validate, you might check for Error. Warnings usually won't cause problems with writing, at least nothing I can think of offhand, but Errors will.
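
The suggestion to skip on Error rather than on Warning might look like the Python sketch below. It assumes the metadata was read as JSON (for example with -j -validate -warning -error) and parsed into a dict; has_error and read_tags are hypothetical names, and read_tags needs exiftool on the PATH, so it is not called here.

```python
import json
import subprocess


def has_error(tags):
    """True if the tag dict has an Error entry, with or without a group prefix."""
    return any(key.split(":")[-1] == "Error" for key in tags)


def read_tags(path):
    """Return one file's tags as a dict (assumes exiftool is installed)."""
    result = subprocess.run(
        ["exiftool", "-j", "-validate", "-warning", "-error", path],
        capture_output=True, text=True, check=False,
    )
    # exiftool -j prints a JSON array with one object per file.
    return json.loads(result.stdout)[0]


# Warnings alone do not trigger a skip; an Error does.
warnings_only = {"SourceFile": "Test4.jpg", "Validate": "8 Warnings (3 minor)",
                 "Warning": "Wrong IFD for 0x9003 DateTimeOriginal"}
print(has_error(warnings_only))               # False
print(has_error({"ExifTool:Error": "demo"}))  # True
```

The group-prefix handling is there because the key is "Error" in plain -j output but becomes "ExifTool:Error" when group names are requested with -G.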

Jabber

I took your advice on the errors. I also added the feature you requested. Feedback is valued and appreciated!

* https://github.com/jabberjabberjabber/LLavaImageTagger/