Proposal: creation of a knowledge-base ready for ai-scraping

Started by audiogalaxy, November 01, 2024, 12:29:46 AM


audiogalaxy

If you are not opposed to generative AI, you know how useful it can be for code generation.
ExifTool is no exception. Few of us are ready to learn a complex syntax that we will then use once every three years, and the conversational style of generative AI is a great help here. Unfortunately, ChatGPT and the like do not always work well when they have been trained on a less than excellent foundation. I have, however, personally seen the most important internal medicine textbook loaded into the so-called context, questions asked, and the answers checked by physicians: with that context uploaded, the answers were flawless. Nothing replaces expertise in complex analysis, of course, but an excellent knowledge base helps.
Unfortunately, there seems to be no clean, useful knowledge base whose syntax is always checked (Windows / *nix) and which is very "dialogic", that is, one that states explicitly "don't write it this way, write it this way instead" (for people, not machines).

Perhaps we could start with the forums: cleaning them up and putting all possible case histories into one file for AI scraping would help a lot of people. The more literal and repetitive the examples (accompanied by explicit explanations) and the more explicit the solutions to problems, the more useful the file will be to everyone.

The syntax of ExifTool is not easy for everyone. Even less so when one has to delve into combined Perl syntax and take into account single and double quotation marks, operating systems, and what is written in the FAQ, all at once. AIs are here to stay. Where there are no ethical, moral or commercial reasons against it, the smartest thing to do, in my opinion, is to expose as much useful, correct and explicit material to these crawlers as possible.

Specially made pages (explicitly not excluded in robots.txt and the like) and especially an up-to-date downloadable document (TXT or PDF) to be uploaded as context augmentation could help everyone a lot. (As for so-called RAG, I don't know how it works; I'll leave that to those who know more.) To date, ChatGPT does not always give good results for either the Windows CMD command line or ExifTool. You have to correct a lot, and know a lot in order to correct it: you get by in the end, and if it goes well you get some useful suggestions. Improving all this could be useful: what do you think? It's a lot of work, but maybe there are super-experts on the forum who ... well, are getting older and want more free time, for example? :-)

They could be the first to say, "okay, let's write up properly the syntax that solved this problem: question, solution, recommendation on what not to do, syntax, result" ... and to update the file or page. Of course, if someone ran "ExifTool courses" I would understand that keeping the knowledge and leaving the effort to the individual is more appropriate. But if there is no such situation ... maybe we can get AI to help us in a better and more effective way :)

What do you think? Especially the bosses here! :-D
--
Sorry for my halting English: I'm not a natural English speaker.
On a PC / windows commandline

Phil Harvey

Are you suggesting writing a comprehensive document containing a wide range of ExifTool examples and explanations for training an AI engine?

I see 2 problems here:

1. I've tried and failed to write a comprehensive (possibly interactive) document like this for ExifTool users.  The main issue is the time to accomplish such a task.  It is much more involved than you think, and one person can not accomplish this.  If we had the input of all ExifTool users, then this would be possible.  But in fact we do.  And we do have such a document.  It is called the ExifTool forum. :)

2. If we did write a stand-alone document for AI training it would be a drop in the ocean of information used to train the AI model.  I don't know of any way to give it the priority it would need to be useful.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

audiogalaxy

Quote from: Phil Harvey on November 01, 2024, 09:24:16 AM
[...] It is much more involved than you think, and one person can not accomplish this.  If we had the input of all ExifTool users, then this would be possible.  But in fact we do.  And we do have such a document.  It is called the ExifTool forum. :)
[...]
- Phil

Maybe some kind of more linear extraction of the content? A text-only export?
Something that can be automated systematically and reissued every month, for example. Just-in-time browsing of the whole forum certainly doesn't work for this, whereas a single page or a single text document can be taken into account, and I assure you it gives great results. By "I assure you" I mean the test of uploading the so-called "Harrison" and asking it questions, under medical review: no drop in the bucket at all, but accurate, correct, precise answers, no hallucinations, sound reasoning.
As you rightly point out, producing the Harrison is the job of Mr. Harrison and co., for money. What you can perhaps consider doing with the forum, through automation, is to extract value from the existing content in a form that is easily readable by crawlers OR that can be loaded into an AI on the fly. If you feel it is worthwhile, you might consider thinking about it. If, on the other hand, good old elbow grease and sweat of the brow are preferable, even "politically", I completely understand.
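
To make the "text-only export" idea concrete, here is a minimal sketch in Python. It assumes the threads have already been pulled out of the forum somehow (that part is not shown and depends on the forum software); the `Post` and `Thread` structures, the function names and the sample thread are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    author: str
    body: str


@dataclass
class Thread:
    title: str
    posts: List[Post]


def render_thread(thread: Thread) -> str:
    """Flatten one thread into plain text: a title line, then one line per post."""
    lines = [f"=== {thread.title} ==="]
    for post in thread.posts:
        # Collapse runs of whitespace and line breaks into single spaces.
        body = " ".join(post.body.split())
        lines.append(f"[{post.author}] {body}")
    return "\n".join(lines)


def build_knowledge_base(threads: List[Thread]) -> str:
    """Join all rendered threads into one document, separated by blank lines."""
    return "\n\n".join(render_thread(t) for t in threads)


# Hypothetical sample data standing in for a real forum export.
threads = [
    Thread(
        title="Example thread title",
        posts=[
            Post("user_a", "How do I   do X?\nIt fails on Windows."),
            Post("user_b", "Placeholder answer text."),
        ],
    ),
]

knowledge_base = build_knowledge_base(threads)
print(knowledge_base)
```

A monthly job could write `knowledge_base` to a single TXT file. Cleaning up the threads (dropping signatures, dead ends and wrong answers) is the hard part and is not addressed by this sketch.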


StarGeek

Quote from: Phil Harvey on November 01, 2024, 09:24:16 AM
2. If we did write a stand-alone document for AI training it would be a drop in the ocean of information used to train the AI model.  I don't know of any way to give it the priority it would need to be useful.

It is possible to add a specifically trained layer on top of the normal ChatGPT model.  For example, about a year ago I read about someone who trained a model on the FFmpeg documentation. At the time, it was using ChatGPT 4, which was only available as a paid option. It appears to be freely available now, though I don't know how accurate it is.

It appears to be called fine-tuning and it costs money to use it. You have to scroll down to the "Fine-tuning models" section as the anchor links change each time you load the page :(

Overall, it would take a lot of work to create the documentation to fine-tune the model on. Then someone has to pay real money to run the training.

There is also the option of training locally if you have a powerful graphics card with a lot of VRAM and then making the model available on HuggingFace. There's also self-hosting using LocalLLaMA or Koboldcpp, but that wouldn't really be something the casual user would use.
"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

Phil Harvey

One other thing that we aren't discussing...  It won't be long before people become dependent on these AI resources, and at that time they will start charging subscription fees to everyone, and we would have effectively paid to improve the AI model that they are now charging us to use.  Or am I wrong?

- Phil

StarGeek

"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

audiogalaxy

My English is poor, so I tried to craft a clearer explanation using AI. Here it is:

Clarifying My Proposal on Creating a Knowledge Base for AI Context Augmentation

I realize there may have been some misunderstandings about my suggestion, so I'd like to clarify.

I'm proposing that "we" (LOL! not me! I totally can't) create an automated, regularly updated, text-only extract of the forum content: a consolidated document that includes well-structured examples, syntax explanations, case studies, and common solutions related to ExifTool. This document wouldn't require us to write new content from scratch; rather, it would systematically compile the valuable knowledge we've already shared in the forum.

The key benefits are:

    Context Augmentation for AI Models: Users can upload this document as additional context when interacting with AI language models like GPT-4. This doesn't require fine-tuning the AI or incurring any training costs. By providing the AI with this context, it can generate more accurate and helpful responses regarding ExifTool usage.

    Improved AI Assistance: With a well-structured knowledge base, AI tools can better assist users in crafting complex ExifTool commands, understanding syntax nuances, and avoiding common pitfalls.

    Automated and Sustainable: The extraction and consolidation process can be automated and scheduled to run monthly, ensuring the knowledge base remains up-to-date without requiring continuous manual effort.

    Accessible Resource: This consolidated document would be a valuable resource not just for AI interaction but also for users who prefer a linear, searchable text for learning and reference.
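
As a rough illustration of what "context augmentation" could look like in practice, the Python sketch below picks the sections of a consolidated document that share the most words with a question and pastes them in front of it. This is only one simple way to do it, assumed here for illustration: the function names, the word-overlap scoring, the prompt wording and the placeholder document are all hypothetical, and no particular AI service or API is implied.

```python
import re
from typing import List, Set


def split_into_chunks(document: str) -> List[str]:
    """Split the knowledge base on blank lines (one chunk per section)."""
    return [c.strip() for c in re.split(r"\n\s*\n", document) if c.strip()]


def words(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(question: str, chunks: List[str], limit: int = 2) -> List[str]:
    """Keep the chunks that share the most words with the question."""
    q = words(question)
    scored = [(len(q & words(c)), c) for c in chunks]
    scored = [s for s in scored if s[0] > 0]  # drop chunks with no overlap
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:limit]]


def build_prompt(question: str, document: str, limit: int = 2) -> str:
    """Put the selected reference notes in front of the user's question."""
    context = "\n\n".join(select_chunks(question, split_into_chunks(document), limit))
    return (
        "Use only the reference notes below to answer.\n\n"
        f"REFERENCE NOTES:\n{context}\n\n"
        f"QUESTION:\n{question}\n"
    )


# Placeholder document; a real one would be the consolidated forum extract.
document = (
    "Quoting on Windows: placeholder note about double quotes.\n\n"
    "Renaming files by date: placeholder note.\n\n"
    "Unrelated topic: placeholder."
)

prompt = build_prompt("Which quotes should I use on Windows?", document)
print(prompt)
```

The resulting `prompt` text is what a user would paste or upload alongside the question. If the whole document fits in the model's context, the selection step can be skipped and the full file uploaded as is.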

To address concerns raised:

    No Need for Fine-Tuning or Costs: This approach doesn't involve fine-tuning AI models or any associated costs. It's about enhancing the AI's immediate context with existing information.

    Not a Drop in the Ocean: While the AI model is trained on vast data, providing specific, relevant context can significantly improve its output for specialized topics like ExifTool.

    Ethical and Open Access: By making this document available and ensuring it's not blocked by robots.txt, we support open knowledge sharing without any ethical or commercial conflicts.
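
Whether a page is actually open to crawlers can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the rules, the `example.org` URLs and the `ExampleBot` user agent are assumptions for illustration and do not describe the real forum's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not the forum's actual file.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_crawlable(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(is_crawlable("https://example.org/knowledge-base.txt"))
print(is_crawlable("https://example.org/private/notes.txt"))
```

With these sample rules the knowledge-base page is allowed and the page under `/private/` is not. Note that robots.txt is only a request to crawlers; it does not control whether or how a given AI company uses the content.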

I believe this approach can greatly benefit both new and experienced users, making it easier to leverage AI tools for ExifTool-related tasks. If there's interest, I'd be happy to help set up the automation process for extracting and updating the knowledge base.