Automated metatagging for images containing text

Started by jbionic, September 21, 2023, 05:29:39 PM


jbionic

While ImageNet (image-net.org) pursues the broader goal of large-scale visual recognition across a very diverse set of images containing complex objects, OCR tools are more straightforward and mainly deal with recognizing the characters of text content.

Annotation is the final stage of any recognition process and can include metatagging. If an image contains only plain text or annotated figures, each recognized word must be checked against an external dictionary to ensure validity, and each combination of words must be checked for consistency (e.g. by a search query in Google, or in wordnet.princeton.edu to establish the semantic relations among words). Tags can then be assigned automatically from the list of recognized words.

I wonder whether there are any free solutions or web services that do both things in a fully automated, consecutive manner for images with text content:
1) OCR and
2) exiftool metatagging?

StarGeek

There are plenty of websites that will do OCR for you; just search for "free OCR". But if the data is personal, I'd advise against using them.

Tesseract is an open source OCR program and is used as the back end of many other apps. For example, OCRmyPDF and Paperless-ngx use it, and most of the free OCR websites probably use it as well.
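If you want to try it locally, here is a minimal sketch of calling Tesseract from Python. It assumes the tesseract binary is installed and on your PATH; the function names tesseract_command and ocr_image are just illustrative, not part of any library.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line that prints the recognized text to stdout."""
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Everything stays on your own machine this way, which avoids the privacy problem with the online services.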

jbionic

Thanks, StarGeek. I agree, there are many OCR projects:
https://github.com/kba/awesome-ocr

But I was thinking about something bigger than just an OCR engine. I need to
1) Recognize all words in an image via OCR
2) Check all words for spelling mistakes against dictionaries
3) Check that phrases are consistent and occur many times in other searchable web pages on the Internet
4) Assign exiftool tags for the recognized and corrected words

5) (Ideally) reduce the number of exiftool tags for an image by checking the database of all tags previously assigned manually by the user
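Steps 2 and 4 could look roughly like the sketch below. It is only an illustration under my own assumptions: the dictionary is a plain word list, near misses are corrected with Python's difflib, and the tags go into XMP-dc:Subject (a different tag could be chosen). The names correct_words and exiftool_args are made up for the example, and step 3 (the web check) is left out.

```python
import difflib
import re


def correct_words(ocr_text, dictionary):
    """Keep words found in the dictionary; replace near misses with the closest entry."""
    lookup = {w.lower(): w for w in dictionary}
    corrected = []
    for token in re.findall(r"[A-Za-z]+", ocr_text):
        key = token.lower()
        if key in lookup:
            word = lookup[key]
        else:
            # Assumption: a similarity cutoff of 0.8 counts as a spelling mistake.
            close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=0.8)
            if not close:
                continue  # no dictionary word is close enough, drop the token
            word = lookup[close[0]]
        if word not in corrected:
            corrected.append(word)
    return corrected


def exiftool_args(image_path, tags):
    """Build an exiftool command that appends each tag to XMP-dc:Subject."""
    args = ["exiftool"]
    for tag in tags:
        args.append(f"-XMP-dc:Subject+={tag}")
    args.append(image_path)
    return args
```

The resulting list can be passed to subprocess.run to do the actual write; exiftool keeps a backup of the original file by default.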

I think all these steps should be fully automated and offered to users as a separate web service, a kind of harvester.

Let me give you an example.
Here is an image:


I would assign the image the following exiftool tags:
FT; figure; 2000s; Canada; France; Germany; Italy; Japan; Britain; USA; foreign; ownership; sovereign; public debt; share; private sector;

But I do that manually. Sometimes the image is a page from a publication, like the following one:


I would assign it the following tags:
JRFM; fiscal consolidation; NPL; impact; paper; abstract;

So out of the whole bunch of words on the page, I've selected only the words that are of some interest to me. This is what I call reduction. The reduction could be automated by searching my tag base and by using machine-learning algorithms, but I don't expect the engine to be that advanced yet.
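Without any machine learning, the simplest form of this reduction is a lookup against the tags I have already used. A rough sketch, assuming the tag base is just a list of previously assigned tags (the function name reduce_tags and the max_tags limit are made up for illustration):

```python
from collections import Counter


def reduce_tags(candidates, tag_history, max_tags=10):
    """Keep only candidates already used as tags, most frequently used first."""
    usage = Counter(t.lower() for t in tag_history)
    seen = set()
    kept = []
    for word in candidates:
        key = word.lower()
        if key in usage and key not in seen:
            seen.add(key)
            kept.append(word)
    # The sort is stable, so equally frequent tags keep their order on the page.
    kept.sort(key=lambda w: -usage[w.lower()])
    return kept[:max_tags]
```

This only ever proposes tags I have used before, so new tags would still have to be added by hand.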

jbionic

One more example. I've been reading a book by Dominic Lieven on my tablet. Here is a page:


It contains something that I find curious, so I took a screenshot and tagged it with the following words:
Lieven; Italy; Russia; education; professor; 1900s; per capita;

But it all comes down to how I get the result, i.e. how much time I spend processing a page: manually or automatically? Ideally, tag selection would be automated and performed by AI based on my individual tagging history. For me, the selection of tags is normally related to statistical facts with numbers, so there could be a lot of text on a page, but I would skip all that bullsh*tting and only check the bits where numbers are present, which I suppose can be automated. My tag base already contains words such as Russia, Italy, education and teacher (a synonym for professor) for cross-checks and consistency.
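The "only check the bits with numbers" idea can be sketched in a few lines. This is just an illustration under my own assumptions: sentences are split on end punctuation, the synonym table is a hand-made dict, and the names sentences_with_numbers and suggest_tags are invented for the example.

```python
import re


def sentences_with_numbers(text):
    """Return only the sentences that contain at least one digit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if re.search(r"\d", s)]


def suggest_tags(text, tag_base, synonyms=None):
    """Suggest tags from the tag base that occur in number-bearing sentences."""
    synonyms = synonyms or {}
    known = {t.lower(): t for t in tag_base}
    suggested = []
    for sentence in sentences_with_numbers(text):
        for token in re.findall(r"[A-Za-z]+", sentence):
            key = token.lower()
            key = synonyms.get(key, key)  # e.g. map "professor" to "teacher"
            tag = known.get(key)
            if tag and tag not in suggested:
                suggested.append(tag)
    return suggested
```

With a tag base of Russia, Italy, education, teacher and the synonym professor -> teacher, a sentence without numbers is ignored entirely, while a sentence with a number that mentions Italy and a professor would yield Italy and teacher.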


jbionic

It seems someone is already working on this task :)
https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517