Experimental Perceptual Image Hash user defined tag

Started by StarGeek, July 29, 2018, 06:39:07 PM


StarGeek

Another experimental user-defined tag.  This one will create a perceptual hash for an image using the Perl Image::Hash module.  In theory, it could be used to find similar and duplicate images, though in practice it would require a bit of external coding to figure out how different any two image hashes are (the Hamming distance between them, maybe?).
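
For example, the number of differing bits (the Hamming distance) between two of these 16-character hex hashes could be counted with a few lines of Perl.  This is just a sketch, not part of the config; the two hash values are taken from the sample output further down.

#!/usr/bin/perl
# Sketch: Hamming distance between two hex hash strings (lower = more similar)
use strict;
use warnings;

sub hamming_distance {
    my ($h1, $h2) = @_;
    my $bits = 0;
    # compare nibble by nibble so it also works on 32-bit Perl builds
    for my $i (0 .. length($h1) - 1) {
        my $x = hex(substr($h1, $i, 1)) ^ hex(substr($h2, $i, 1));
        while ($x) { $bits += $x & 1; $x >>= 1; }
    }
    return $bits;
}

# PerceptualHash of Test.jpg vs Test_Completely_Different.jpg (from the output below)
print hamming_distance('BA3E808800040000', 'E1CC400800800000'), "\n";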

It comes with a lot of caveats.  First and foremost, it will NOT run with the Windows executable, only with the Perl version of exiftool.  This is because it requires certain Perl .dlls that aren't packed into the Windows executable version.  After installing Perl (it was tested with Strawberry Perl on Windows), the Image::Hash module will need to be installed (cpan Image::Hash on the command line, I believe).  It also needs File::Slurp, which is included with Strawberry Perl.
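
Once Perl is set up, a quick sanity check (not part of the config, just a way to confirm both modules load) is something like:

perl -MImage::Hash -MFile::Slurp -e "print qq(modules found\n)"

If either module is missing, Perl will report that it can't locate it.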

Second, it will crash badly with any file greater than 8 bits/pixel, so deep-color files such as raw files or 16-bit PNGs or TIFFs will break it.  At the moment it will only accept JPEG files, until such time as I get some deep-color PNGs and TIFFs to look at and can add some code to prevent problems.  Also, I need to figure out which other file types would have to be tested for (can BMPs be 16 bits/pixel, for example?).
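
One possible way to prevent that (untested, just a sketch) would be to add BitsPerSample to the tag's Desire list and bail out on anything over 8 bits.  The $val index used here is hypothetical and depends on how the actual config lists its Require/Desire tags.

    # sketch of a deep-color guard at the top of the tag's ValueConv (untested)
    # assumes BitsPerSample was added to the Desire list as $val[2] (hypothetical index)
    if (defined $val[2]) {
        # BitsPerSample may come through as '16 16 16' or as an array reference
        my $bps = ref $val[2] eq 'ARRAY' ? join(' ', @{$val[2]}) : $val[2];
        foreach my $bits (split ' ', $bps) {
            return undef if $bits > 8;    # skip deep-color files
        }
    }
    # ... the existing hashing code would follow here ...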

Example output.  Test.jpg is the original, Test.png is a PNG file saved from the original (made only for this test), Test_Grey.jpg is a greyscale version created with JpegTran, Test_HalfSize.jpg is the original scaled to 50%, and Test_Progressive.jpg is a progressive JPEG converted from the original using JpegTran.  The actual image of Test_Progressive.jpg is exactly the same as Test.jpg, just encoded differently.  Test_Completely_Different.jpg is an image that is completely different from Test.jpg.
C:\>exiftool.pl -config ImageHash.config  -AverageHash -DynamicHash -PerceptualHash Y:\!temp\HashTest
======== Y:/!temp/HashTest/Test.jpg
Average Hash                    : 9F8F87979F3D7DFF
Dynamic Hash                    : 5159555951218174
Perceptual Hash                 : BA3E808800040000
File Size                       : 542 kB
======== Y:/!temp/HashTest/Test.png
Average Hash                    : 9F8F87979D3D7DFF
Dynamic Hash                    : 51595D5951A181B0
Perceptual Hash                 : BA3E808800000000
File Size                       : 2.0 MB
======== Y:/!temp/HashTest/Test_Completely_Different.jpg
Average Hash                    : F0E0EFCF9A434508
Dynamic Hash                    : 444658545A55D58D
Perceptual Hash                 : E1CC400800800000
File Size                       : 220 kB
======== Y:/!temp/HashTest/Test_Grey.jpg
Average Hash                    : 9F8F87879D3D7DFF
Dynamic Hash                    : 5159555951A1C1F0
Perceptual Hash                 : BA3E808800000000
File Size                       : 370 kB
======== Y:/!temp/HashTest/Test_HalfSize.jpg
Average Hash                    : 9F8F87879D3D7DFF
Dynamic Hash                    : 51595D495121C1AC
Perceptual Hash                 : B236808800000000
File Size                       : 179 kB
======== Y:/!temp/HashTest/Test_Progressive.jpg
Average Hash                    : 9F8F87979F3D7DFF
Dynamic Hash                    : 5159555951218174
Perceptual Hash                 : BA3E808800040000
File Size                       : 524 kB
    1 directories scanned
    6 image files read


All the hash values for Test.jpg and Test_Progressive.jpg are exactly the same, as they are exactly the same image, just encoded differently.  The Perceptual Hash of Test.jpg is close to those of the PNG and the greyscale version, only 1 hex character off, while the half-size image is 3 characters different.  Test_Completely_Different.jpg has hashes that are very different from the other test images.

One possible idea I had for this was to use the -csv option and then sort by hash.  Then maybe a formula to compare each hash to the one above or below and see the difference.  Unfortunately, I'm not good enough with spreadsheets to figure this out and I couldn't find an example to copy/paste.
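
Skipping the spreadsheet, the same comparison could be done in a few lines of Perl.  The sketch below is untested and makes some assumptions: the CSV was written with something like exiftool -csv -PerceptualHash, the file name hashes.csv is hypothetical, and the file names contain no commas (otherwise a real CSV parser such as Text::CSV would be needed).  It prints the Hamming distance for every pair of files, so near-duplicates show up with small numbers.

#!/usr/bin/perl
# Sketch: read exiftool -csv output and print pairwise Hamming distances
use strict;
use warnings;

my $csv = 'hashes.csv';            # hypothetical file name
open my $fh, '<', $csv or die "Can't open $csv: $!";

# find the PerceptualHash column from the header line
my @header = split /,/, scalar <$fh>;
chomp @header;
my ($col) = grep { $header[$_] eq 'PerceptualHash' } 0 .. $#header;
defined $col or die "No PerceptualHash column found\n";

# collect (SourceFile, hash) pairs
my @rows;
while (<$fh>) {
    chomp;
    my @f = split /,/;
    push @rows, [ $f[0], $f[$col] ] if defined $f[$col] and length $f[$col];
}
close $fh;

# nibble-by-nibble Hamming distance between two hex strings
sub hamming {
    my ($h1, $h2) = @_;
    my $bits = 0;
    for my $i (0 .. length($h1) - 1) {
        my $x = hex(substr($h1, $i, 1)) ^ hex(substr($h2, $i, 1));
        while ($x) { $bits += $x & 1; $x >>= 1; }
    }
    return $bits;
}

# compare every pair; small distances mean similar images
for my $i (0 .. $#rows - 1) {
    for my $j ($i + 1 .. $#rows) {
        printf "%2d  %s  %s\n", hamming($rows[$i][1], $rows[$j][1]),
            $rows[$i][0], $rows[$j][0];
    }
}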

See the "Looks Like It" article for more info on the subject.

Comments, ideas, etc., are welcome.

Phil Harvey

Very interesting.

If you want to calculate all 3 hashes for each image, then you could speed things up by storing the Image::Hash object in the ExifTool object so you only have to read the file once.  For example:

ValueConv => q{
    # acceptable filetypes, may expand later
    my @Filetypes = qw( JPEG );
    # Check if acceptable filetype
    grep(/$val[1]/, @Filetypes) or return undef;
    unless ($$self{MyImageHash}) {
        # read image
        my $image = read_file( $val[0], binmode => ':raw' ) or return undef;
        # create hash object
        $$self{MyImageHash} = Image::Hash->new($image) or return undef;
    }
    return ($$self{MyImageHash}->ahash());
}


Now, this does rely on an undocumented but useful feature of ExifTool:  All Image::ExifTool member variables with a lower-case letter in their name are deleted before processing each file.  So you don't have to worry about using an older Image::Hash object for a new file.

- Phil