A recent Reddit post (https://redd.it/1bk5erd) about OCRing an image and saving the text into the file interested me enough to figure it out and even make a Windows BAT file to do it in batch.
I knew about tesseract (https://github.com/tesseract-ocr/tesseract) but had never looked into it. Turns out it was Super Easy (insert Ryan George GIF here). The OCR part is simply the command below, where the trailing "-" sends the text to STDOUT instead of a file:
tesseract file.jpg -
From there, it's simple enough to pipe tesseract's output into exiftool and use the -TAG<=DATFILE option (https://exiftool.org/exiftool_pod.html#TAG-DATFILE-or--TAG-FMT), with "-" as the DATFILE so it reads STDIN, to save the text into Description:
tesseract file.jpg - |exiftool "-Description<=-" file.jpg
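To check what was written, the tag can be read back. This is a minimal sketch, assuming the same file.jpg as above. The -b option prints the raw value, so the line breaks in the OCR text are kept; as far as I know, the default output shows them as dots instead.

```batch
rem Read back the Description that was just written
exiftool -b -Description file.jpg
```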
The resulting BAT file:
@echo off
setlocal
rem OCR_and_embed.bat
rem OCR the JPEGs in each given directory and its subdirectories, then embed the text in each image
REM Loop through all directories specified as arguments
for %%a in (%*) do (
REM %%~a strips any surrounding quotes so quoted arguments are not double-quoted
echo "%%~a"
pushd "%%~a"
REM Loop through all jpg files in the current directory and its subdirectories
for /r %%b in (*.jpg) do (
REM Process the jpegs
echo Processing "%%b" in "%%~dpb"
tesseract "%%b" - |exiftool -P -overwrite_original "-Description<=-" "%%b"
)
popd
)
endlocal
The only thing I didn't like is that it launches exiftool once per image, but I couldn't figure out another way to do it. I could have looped only tesseract, writing a text file to match each image, and then run exiftool once, but I wanted to avoid writing temp files. I also figured that tesseract would be a bigger bottleneck than exiftool's startup time, though I haven't tested that. On the simple images I was using, and with my CPU, tesseract processed the files very quickly.
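For comparison, the two-pass alternative mentioned above could look roughly like this. It is an untested sketch, not what I actually ran: the file name OCR_then_embed_once.bat is made up, and it assumes it is fine to leave a NAME.txt next to each NAME.jpg (overwriting any existing text file of that name). It relies on tesseract appending .txt to the output base it is given, and on the FMT form of the same exiftool option, where %d and %f stand for the image's directory and base name. The percent signs are doubled because this is a BAT file.

```batch
@echo off
setlocal
rem OCR_then_embed_once.bat (hypothetical name, untested sketch)
rem Pass 1: tesseract writes NAME.txt next to each NAME.jpg
rem Pass 2: one exiftool run per directory argument embeds all the text files
for %%a in (%*) do (
    pushd "%%~a"
    for /r %%b in (*.jpg) do (
        rem "%%~dpnb" is the full image path without its extension; tesseract adds .txt
        tesseract "%%b" "%%~dpnb"
    )
    rem %%d = directory of the image, %%f = file name without extension
    exiftool -P -overwrite_original -r -ext jpg "-Description<=%%d%%f.txt" .
    popd
)
endlocal
```

This starts exiftool once per directory argument instead of once per image, at the cost of the leftover text files that I wanted to avoid. Whether it is actually faster would depend on how much of the total time is exiftool's startup, which I haven't measured.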
The looping code was created by ChatGPT for a different BAT file and I simply replaced the command.
I'm now planning on running this on a bunch of video game screenshots to save the dialog and info into the files, which I'll then be able to search through in IMatch.