Automated metadata generation via local AI

Started by Jabber, July 29, 2024, 03:55:16 AM

Jabber

I am working on a script that generates metadata by querying a local API hosting an open-weights AI vision model. It should give decent output for filling in subjective metadata fields like keywords, titles, description, subject, and caption. I know there is a lot of AI hype and much of it is hot air, but this is actually a use case which is perfectly served by a small, open source, locally running model. No data is sent out anywhere; it is all done on the PC running it. Of course this means speed depends entirely on the hardware running it: Nvidia card users get the best performance, followed by Mac users with unified memory, and everyone else depends mostly on their memory bandwidth.
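
As a rough illustration of what "querying a local API" can look like, here is a minimal sketch in Python. It is not taken from the script itself. The address, the endpoint path, and the field names (`prompt`, `max_length`, `images`) are assumptions based on KoboldCPP's usual defaults and should be checked against the running server; `build_payload` and `describe` are hypothetical helper names.

```python
import base64
import json

# Assumption: KoboldCPP listening on its usual local port and generate endpoint.
API_URL = "http://localhost:5001/api/v1/generate"


def build_payload(image_bytes, instruction, max_length=250):
    """Return a JSON-serializable request body for one image (hypothetical helper)."""
    return {
        "prompt": instruction,
        "max_length": max_length,
        # The image travels as base64 text inside the JSON body.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }


def describe(image_path, instruction):
    """POST one image to the local API and return the decoded JSON reply."""
    import urllib.request  # imported here so build_payload works without networking

    with open(image_path, "rb") as handle:
        body = json.dumps(build_payload(handle.read(), instruction)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Nothing here leaves the machine as long as the URL points at localhost, which is the privacy argument made above.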

Would love some feedback as I am new to working with metadata, and not the best programmer.


LLavaImageTagger

Example:



======== image2.jpg
ExifTool Version Number         : 12.85
File Name                       : image2.jpg
Directory                       : .
File Size                       : 105 kB
File Modification Date/Time     : 2024:07:29 03:42:09-04:00
File Access Date/Time           : 2024:07:29 03:44:36-04:00
File Creation Date/Time         : 2024:07:08 18:47:29-04:00
File Permissions                : -rw-rw-rw-
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
Current IPTC Digest             : 5ca225237efbf6c6b634a9a643f72966
Keywords                        : man, dog, snowy road, winter, clothing, tattoo, sunglasses, leas
Application Record Version      : 4
XMP Toolkit                     : Image::ExifTool 12.85
Caption                         : In the image, a man and his dog are the main subjects. The man is standing on a snowy road, dressed in a black t-shirt, a black beanie, a black and red plaid jacket, and black pants. He has a beard, a tattoo on his left arm, and is wearing sunglasses. In his right hand, he holds a red leash attached to a brown dog. The dog, sitting on the snow, is wearing a red collar. The background features a snowy road with trees on either side. The man and the dog are the only human and animal figures in the scene, making them the focal point of the image. There are no discernible texts or numbers in the image.
Description                     : A man and his dog are the focal point of this scene, standing on a snowy road surrounded by trees. The man is dressed in black and red, with a beard, a tattoo, and sunglasses. The dog sits patiently on the snow, wearing a red collar. A snowy landscape is the backdrop.
Subject                         : Outdoor scene, Winter, Human-Animal interaction, Man, Dog, Clothing, Accessory (leash, collar)
Title                           : Snowy Road with Man and Dog
Image Width                     : 720
Image Height                    : 750
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Image Size                      : 720x750
Megapixels                      : 0.540
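
A note on the Keywords line in the output above: it stops at exactly 64 characters ("...sunglasses, leas"), which is the maximum length of a single IPTC keyword. That suggests the whole comma-separated string was written as one keyword and then truncated. A possible fix, sketched here under that assumption, is to pass ExifTool one `-IPTC:Keywords=` argument per keyword. The helper name `keyword_args` is hypothetical.

```python
def keyword_args(keywords_text):
    """Turn 'man, dog, snowy road' into one ExifTool argument per keyword."""
    args = []
    seen = set()
    for raw in keywords_text.split(","):
        keyword = raw.strip()
        # Skip empty entries and case-insensitive repeats.
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            args.append(f"-IPTC:Keywords={keyword}")
    return args


# The resulting list can be handed to ExifTool, for example:
#   subprocess.run(["exiftool", *keyword_args(text), "image2.jpg"], check=True)
```

Each keyword then becomes its own list item, so the 64-character limit applies per keyword rather than to the whole list.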


greybeard

Interesting (although I don't have a PC to test out your script).

I do think that local processing is the answer for privacy.

It will be interesting to see how effective Apple is going to be later this year (although they seem to be moving towards a hybrid approach combining local processing with secure cloud-based Private Cloud Compute (PCC) servers and some integration with ChatGPT).

Martin B.

I think this is really cool! And I agree with greybeard about the advantage of local processing.

I haven't tried it, but I have a couple of questions:

1. Browsing the code, I think this is limited to files with jpg, jpeg, png, gif, and tiff extensions (no raw files). Am I correct? Is this a limitation of the AI processor? (It would be useful to mention this in the documentation.)

2. This modifies the image files to add the metadata (hence the requirement for ExifTool), right? Is there a way to store the metadata elsewhere? Would it be in the "local TinyDB database for easy querying" mentioned in the documentation? I use Lightroom to manage my photos, and that's where the metadata is stored (it avoids modifying the original raw files). Getting the metadata from your database into Lightroom would require separate software, using either the Lightroom API (if it supports modifying the metadata), or through XMP files and then reading those from Lightroom, but it's doable.

For what it's worth, digiKam uses local processing for Automatic Metadata (face recognition and image quality) and Find by Sketch. I haven't tried digiKam, but it seems to do things that Lightroom cannot, which is interesting.

Phil Harvey

This looks very interesting.  You mention MacOS in your post, but the README mentions only Windows.  What are the system requirements?  I'll add this to the ExifTool home page, but I need to know the systems it can run on.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

StarGeek

Quote from: Phil Harvey on July 29, 2024, 08:43:16 AM
What are the system requirements?

It looks like it's mostly Python, though it's using a precompiled Windows binary for KoboldCPP (the AI model used). Though I can't quite see where it gets downloaded. There are precompiled Linux binaries (that sounds wrong :D) but a Mac has to compile the binaries from source.

I would think this could also be Dockerized.

So overall, it could technically be run on anything, with a little extra work.

KoboldCPP has been something on my list of things to take a closer look at... if I ever find the time.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

Jabber

Quote from: Martin B. on July 29, 2024, 07:20:09 AM
I think this is really cool! And I agree with greybeard about the advantage of local processing.

I haven't tried it, but I have a couple of questions:

1. Browsing the code, I think this is limited to files with jpg, jpeg, png, gif, and tiff extensions (no raw files). Am I correct? Is this a limitation of the AI processor? (It would be useful to mention this in the documentation.)

2. This modifies the image files to add the metadata (hence the requirement for ExifTool), right? Is there a way to store the metadata elsewhere? Would it be in the "local TinyDB database for easy querying" mentioned in the documentation? I use Lightroom to manage my photos, and that's where the metadata is stored (it avoids modifying the original raw files). Getting the metadata from your database into Lightroom would require separate software, using either the Lightroom API (if it supports modifying the metadata), or through XMP files and then reading those from Lightroom, but it's doable.

For what it's worth, digiKam uses local processing for Automatic Metadata (face recognition and image quality) and Find by Sketch. I haven't tried digiKam, but it seems to do things that Lightroom cannot, which is interesting.


1. CLIP uses RGB pixel data, but KoboldCPP is able to turn JPEGs and PNGs into normalized RGB for it. I am sure different image encodings can be added, but I am working on one thing at a time right now, and I personally dislike WebP, so adding it wasn't a priority. I think the easiest way might just be to re-encode the images into JPEG, send that to the model, and discard it afterwards. llava 1.5 downsizes everything to 336 pixels square anyway (and splits it into a grid of 576 patches), so the resolution doesn't matter.

2. It stores everything in UTF-8 text as a series of JSON objects, like so:

        "2": {
            "absolute_path": "C:\\Users\\User\\Desktop\\image.jpg",
            "created": "Thu Jul 25 23:27:07 2024",
            "exif_metadata": {
                "Composite:ImageSize": "1920 1520",
                "Composite:Megapixels": 2.9184,
                "ExifTool:ExifToolVersion": 12.85,
                "File:BitsPerSample": 8,
                "File:ColorComponents": 3,
                "File:CurrentIPTCDigest": "69448c104cc383fe55650e72d76709b1",
                "File:Directory": "C:/Users/User/Desktop",
                "File:EncodingProcess": 0,
                "File:FileAccessDate": "2024:07:26 00:38:41-04:00",
                "File:FileCreateDate": "2024:07:25 23:27:07-04:00",
                "File:FileModifyDate": "2024:07:26 00:32:26-04:00",
                "File:FileName": "image.jpg",
                "File:FilePermissions": 100666,
                "File:FileSize": 1395551,
                "File:FileType": "JPEG",
                "File:FileTypeExtension": "JPG",
                "File:ImageHeight": 1520,
                "File:ImageWidth": 1920,
                "File:MIMEType": "image/jpeg",
                "File:YCbCrSubSampling": "2 2",
                "IPTC:ApplicationRecordVersion": 4,
                "IPTC:Keywords": "Astronomical, Night Sky, Oil Painting, Tower, Cityscape, Moon, S",
                "SourceFile": "C:/Users/User/Desktop/image.jpg",
                "XMP:Caption": "A captivating oil painting depicting an astronomical scene with a deep blue night sky, white stars, a bright yellow moon, and a mesmerizing blue and white spiral. The foreground features a tall black tower surrounded by a cityscape set against a mountain backdrop. Framed by a white border, the painting draws attention to its central focus.",
                "XMP:Description": "A captivating oil painting depicting an astronomical scene with a deep blue night sky, white stars, a bright yellow moon, and a mesmerizing blue and white spiral. The foreground features a tall black tower surrounded by a cityscape set against a mountain backdrop. Framed by a white border, the painting draws attention to its central focus.",
                "XMP:Subject": "Art, Astronomy, Landscape, Cityscape, Tower",
                "XMP:Title": "Astronomical Night Sky with Tower and Cityscape",
                "XMP:XMPToolkit": "Image::ExifTool 12.85"
            },
            "extension": ".jpg",
            "file_hash": "b72c7db508a57421",
            "filename": "image.jpg",
            "llm_metadata": {
                "Keywords": [
                    "Astronomical",
                    "Night Sky",
                    "Oil Painting",
                    "Tower",
                    "Cityscape",
                    "Moon",
                    "Stars"
                ],
                "Subject": "Art, Astronomy, Landscape, Cityscape, Tower",
                "Summary": "A captivating oil painting depicting an astronomical scene with a deep blue night sky, white stars, a bright yellow moon, and a mesmerizing blue and white spiral. The foreground features a tall black tower surrounded by a cityscape set against a mountain backdrop. Framed by a white border, the painting draws attention to its central focus.",
                "Title": "Astronomical Night Sky with Tower and Cityscape"
            },
            "modified": "Fri Jul 26 00:32:26 2024",
            "relative_path": "image.jpg",
            "size": 1395551
        }

These are all in the filedata.json left in the root directory. You can do whatever you want with it, since it is super easy to parse. Nothing gets written to the images unless the boxes are checked in the UI or the flags are given in the CLI.
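
To show how little is needed to read the file, here is a minimal sketch that pulls the generated keywords out of filedata.json. It assumes the field names shown in the sample record above (`absolute_path`, `filename`, `llm_metadata`, `Keywords`). It also assumes the records may sit under a TinyDB table name such as `_default`, so it walks down until it finds them. The function names are hypothetical.

```python
import json


def iter_records(data):
    """Yield every per-file record, whether or not it is nested under a table name."""
    for value in data.values():
        if isinstance(value, dict) and "filename" in value:
            yield value
        elif isinstance(value, dict):
            # Not a record itself, so look one level deeper (e.g. a table name).
            yield from iter_records(value)


def keywords_by_file(data):
    """Map each image's absolute path to its list of generated keywords."""
    return {
        record["absolute_path"]: record.get("llm_metadata", {}).get("Keywords", [])
        for record in iter_records(data)
    }


# Typical use:
#   with open("filedata.json", encoding="utf-8") as handle:
#       print(keywords_by_file(json.load(handle)))
```

From a mapping like this, writing XMP sidecars for Lightroom or feeding another catalog is a separate, small step.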

FYI: The way that digiKam and other AI image indexing/searching tools work is fundamentally different from this. They use the embeddings for comparison to other images or text, but aren't able to 'describe' what the image is because they aren't plugged into a language model. The LLM (large language model) acts as the 'translator' with the image projector. The image projector is basically CLIP, which takes an image and maps it to a vector embedding: a long list of numbers marking a point in a many-dimensional space in which different 'ideas' group together. The language model has language mapped to these numbers while the projector has images, and it can then explain them, if that makes sense.
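
The "comparison" that embedding-based search tools perform is usually a similarity score between two vectors. A minimal sketch with cosine similarity follows. The four-dimensional vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions and come from the model, not from hand-typed numbers.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: similar 'ideas' point in similar directions.
dog_photo = [0.9, 0.1, 0.0, 0.2]
dog_text = [0.8, 0.2, 0.1, 0.1]
tower_painting = [0.0, 0.1, 0.9, 0.3]

print(cosine_similarity(dog_photo, dog_text) > cosine_similarity(dog_photo, tower_painting))
```

A score like this can rank images against a text query, but it cannot produce a sentence. That is the part the language model adds.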

Jabber

Quote from: Phil Harvey on July 29, 2024, 08:43:16 AM
This looks very interesting.  You mention MacOS in your post, but the README mentions only Windows.  What are the system requirements?  I'll add this to the ExifTool home page, but I need to know the systems it can run on.

- Phil

Sorry, I don't have a Mac so I can't write instructions, but the only things that need to work are KoboldCPP (which is cross-compiled for Windows and Linux), exiftool, and Python. Unfortunately it looks like Mac users have to compile it themselves:

* https://github.com/LostRuins/koboldcpp/wiki#how-do-i-get-started-with-koboldcpp-what-do-i-need-how-do-i-compile-koboldcpp-from-source-code

Though I don't see why a forked version compiled for Mac couldn't be hosted as long as the source is published since KoboldCPP is licensed under AGPL.

Running powerful language models locally is kind of the razor's edge of bleeding edge right now. Just this week we have had a release of 3 huge foundational models (Llama 3.1, Mistral Large, Mistral Nemo) and just a few weeks previous we had Gemma. Each new model release has to be engineered into the inference engines and the dev time tends to be a few days, all open sourced, which is remarkable. Keeping up is tough! Luckily the stuff this script does is not reliant on any SOTA tech, unless you count 6 months old as state-of-the-art, but in terms of language models it is practically geriatric.

Thanks for the interest. I am really looking for feedback from people who are able to give it a go, more than I am for visibility. I'd like to make it at least somewhat usable by people who aren't savvy with this particular niche of tech, and whether I have managed that is tough for me to judge.

Phil Harvey

OK, thanks.  I hope some people here will have the time to try this out.

- Phil

Jabber

I updated it to handle RAW files. It now checks the MIME type, and if the file is an image but not a JPEG or PNG, it uses exiftool to look for an embedded JPEG and sends that to the API. If there isn't one, it does a quick JPEG encode, sends that to the API, and then discards it, keeping it in memory to avoid writing to the disk excessively.

Is this an acceptable way to handle RAW files? I only have Nikon NEF files but they seem to work.
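
The steps described above can be sketched roughly as follows. This is an illustration, not the script's actual code. The function names are hypothetical, the MIME check here goes by file extension only, and the fallback encoder assumes Pillow is installed and can decode the format, which is not true for most RAW files. `-b`, `-JpgFromRaw`, and `-PreviewImage` are standard ExifTool options for extracting embedded images.

```python
import io
import mimetypes
import subprocess

PASS_THROUGH = {"image/jpeg", "image/png"}


def needs_conversion(path):
    """True for images that are neither JPEG nor PNG, judged by file extension."""
    mime, _ = mimetypes.guess_type(path)
    return bool(mime) and mime.startswith("image/") and mime not in PASS_THROUGH


def extract_embedded_jpeg(path):
    """Ask ExifTool for an embedded JPEG; return b'' if none is found."""
    for tag in ("-JpgFromRaw", "-PreviewImage"):
        result = subprocess.run(
            ["exiftool", "-b", tag, path], capture_output=True, check=False
        )
        if result.stdout.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
            return result.stdout
    return b""


def encode_in_memory(path):
    """Fallback: re-encode to JPEG without touching the disk (assumes Pillow)."""
    from PIL import Image

    buffer = io.BytesIO()
    with Image.open(path) as image:
        image.convert("RGB").save(buffer, format="JPEG", quality=85)
    return buffer.getvalue()


def jpeg_for_api(path, extract=extract_embedded_jpeg, encode=encode_in_memory):
    """Prefer the embedded preview; otherwise encode a temporary JPEG in memory."""
    return extract(path) or encode(path)
```

Using the embedded preview is cheap and avoids needing a RAW decoder at all, which fits the point that the model downsizes the image anyway.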



I appreciate the suggestion. Always open to more if anyone has them.

Jabber

Quote from: StarGeek on July 29, 2024, 09:46:13 AM
Quote from: Phil Harvey on July 29, 2024, 08:43:16 AM
What are the system requirements?

It looks like it's mostly Python, though it's using a precompiled Windows binary for KoboldCPP (the AI model used). Though I can't quite see where it gets downloaded. There are precompiled Linux binaries (that sounds wrong :D) but a Mac has to compile the binaries from source.

I would think this could also be Dockerized.

So overall, it could technically be run on anything, with a little extra work.

KoboldCPP has been something on my list of things to take a closer look at... if I ever find the time.

Sorry I missed the part about the models being downloaded:

KoboldCPP is a front end for Llama.cpp, which is an inference engine. The models themselves ('weights') are downloaded via Kobold using a curl wrapper. It gets the weights from Hugging Face, which is basically GitHub for AI dev. You can see the links in the kcpp file:

   
    "mmproj": "https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-mmproj-f16.gguf",
    "model": "",
    "model_param": "https://huggingface.co/qwp4w3hyb/SFR-Iterative-DPO-LLaMA-3-8B-R-iMat-GGUF/resolve/main/sfr-iterative-dpo-llama-3-8b-r-imat-Q6_K.gguf",
     

The mmproj is a projector file. It is a CLIP vision encoder trained to work in the same embedding space as a language model. In this case that projector will work with almost all Llama-3 8b models.

GGUF is a container for model weights which allows quantization. This means you take the 32-bit floats (one for each of the 8 billion parameters in Llama-3 8b) and turn them into 6-bit ints (the Q6 in the file name), losing a bit of precision but also saving 26 bits * 8 billion of the memory needed to run the model, which would otherwise have to sit in RAM. It also makes it a good deal faster.
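
The arithmetic behind that saving, worked through as a short sketch. It uses the round numbers from the paragraph above (8 billion weights, 32 bits before, 6 bits after) and ignores everything else that occupies memory at run time. Real Q6 files store somewhat more than 6 bits per weight because of scaling data, so actual file sizes come out a little larger.

```python
PARAMS = 8_000_000_000  # weights in an "8b" model, rounded


def weights_gb(bits_per_weight, params=PARAMS):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


full = weights_gb(32)   # 32-bit floats
quant = weights_gb(6)   # 6-bit integers
print(full, quant, full - quant)  # 32.0 6.0 26.0
```

So the weights shrink from about 32 GB to about 6 GB, which is the difference between needing a workstation and fitting on a mid-range graphics card.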

Phil Harvey

Jabber

Quote from: Phil Harvey on July 29, 2024, 08:43:16 AM
This looks very interesting.  You mention MacOS in your post, but the README mentions only Windows.  What are the system requirements?  I'll add this to the ExifTool home page, but I need to know the systems it can run on.

- Phil

I am probably mid-way through the process of getting it as functional as it should be.

I put up instructions for Mac users and Linux users and made a shell script which should be identical in effect to the one for Windows.

A user currently has the option of generating Keywords (MWG:Keywords) and Summary/Caption (MWG:Description), with a toggle for adding new keywords to the existing ones. If keywords exist and the user chooses not to clear them, the model is given the existing keywords along with the metadata, which can be used during generation. The number of keywords to generate is also user definable -- though it is beholden to the whims of the model whether to produce that number (and may cause it to just repeat the same words over and over if you press it hard enough).
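
One way the "add to existing" toggle could behave is sketched below. This is an assumption about the logic, not the script's actual code, and `merge_keywords` is a hypothetical name. It also drops case-insensitive repeats, which guards against the model repeating the same words.

```python
def merge_keywords(existing, generated, clear_existing=False):
    """Combine existing and generated keywords, keeping first-seen order."""
    merged = [] if clear_existing else list(existing)
    seen = {keyword.casefold() for keyword in merged}
    for keyword in generated:
        keyword = keyword.strip()
        # Skip blanks and anything already present, ignoring case.
        if keyword and keyword.casefold() not in seen:
            seen.add(keyword.casefold())
            merged.append(keyword)
    return merged


print(merge_keywords(["Dog", "Winter"], ["dog", "Snow", "Snow"]))
# ['Dog', 'Winter', 'Snow']
```

Keeping the existing spelling when a generated keyword differs only by case avoids churn in catalogs that treat keywords as case-sensitive.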
 
I spoke with the KoboldCPP devs and they got a working binary with Metal acceleration built this week, though it is not yet officially distributed. They gave it to me, and it is available on my GitHub repo as a Mac ARM64 binary. As of 2024/08/10 it is current with 1.72, the last main release, from about 5 days ago. It was given to me with the knowledge that I would make it available to others, but with the understanding that it is not supported by them.

Additional caveats:

  • I can't run it myself but am assured that it works by two trusted, separate and independent sources
  • I cannot think of a way to prove provenance from my end, so 'caveat emptor'
  • It is not to my knowledge signed because Apple requires a paid dev account to sign it and the Kobold devs are the type to refuse to do something as a statement of principle

If that didn't discourage every rational person from even looking in its general direction, then whoever is left, if you decide to run it, please let me know.

Be aware that my script is still a work in progress, and I am not an experienced developer -- I make things when I need them for myself, though in the case where there appears to be a use for others I am happy to spend time to polish and add so that it creates a value for them as well. I just can't confidently say that it will be anything beyond functional because unless someone with skill and experience takes over or substantially contributes, then I am fundamentally limited by my lack of talent in this particular field. Apologies.

I am of course very amenable to any good faith criticism and feedback. If it seems I am barking up the wrong tree in this venue I will move on, but I don't get that impression, so if that is the case you may have to be direct about it.

LlavaImageTagger