This is Part II of my post on image similarity in Python with perceptual hashing. In this post, we will use Spotify's Annoy library to perform nearest neighbors search on a collection of images to find similar images to a query image.
Read More
I've had several questions about my previous post, Determining how similar two images are with Python + Perceptual Hashing.
I mentioned using a BallTree data structure to compare a lot of images, and will go over that in this post.
I demonstrated how to hash an image using a perceptual hashing algorithm, which you can then use to compare the similarity of two images.
I used the imagehash library for this.
The hamming distance between two hashes is used to calculate similarity by comparing them at each index, and increasing by one for each index where they are different.
So a hamming distance of zero means that they are the same.
If you want to compare a lot of images, it is not efficient to compare each image to all of the others, every time you want to do something.
Around 2015 I had an app idea.
Users would:
I had a collection of images of clothing items, and the idea was that the app could help users identify an item of clothing(the brand, etc) that they were interested in, or maybe identify the item and give them other similar items at various price points.
At the center of that idea was the need to frequently compare a lot of images to each other and rank how similar they are.
A BallTree data structure, like this one from Scikit-learn, allows for efficient nearest neighbors search, and this is what I attemped to use.
However, there was one problem with using it that I had forgotten about until I revisited my code from then as I was trying to write up this post.
The input data gets converted into floats.
The hamming distance is calculated by iterating through two bit sequences of the same length and comparing them at each index.
Since the data gets converted into floats once you pass it to the BallTree object, the distance measures will be wrong.
I will detail this a bit more at the end of the post, but for now I am just going to get into what I DID do that worked!
What I did was switch to another library!
Annoy - Approximate Nearest Neighbors Oh Yeah is used at Spotify for their music recommendations.
You can read more about it in the docs, but one thing it does is build a forest of trees to store your data.
It was pretty straightforward to get up and running.
Import all of the necessary packages.
import os import random import imagehash # pip install imagehash import numpy as np # pip install numpy from PIL import Image #pip install pillow from annoy import AnnoyIndex #pip install annoy
You can install all of these packages with pip.
First, here's how you can convert the perceptual hash returned from the imagehash
library into a bit array, which we will use with Annoy.
image_one = 'example_image_file.jpg' img = Image.open(image_one) img_hash = imagehash.whash(img) #img_hash.hash returns a boolean array, which we will convert into 0s and 1s hash_array = img_hash.hash.astype('int').flatten()
It looks like this:
hash_array array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0])
Now for the real example.
I have a directory called images
, with around 900 images of handbags, that I'm going to use here.
images_dir = "images" images_list = [img for img in os.listdir(images_dir)]
So I just used a list comprehension to put them all in a list.
As mentioned, we need an integer id for each image, along with its corresponding vector of hash data.
vector_length = 0 id_to_vec = {} for count,f in enumerate(images_list): img = Image.open(''.join([images_dir,'/',f])) img_hash = imagehash.whash(img) hash_array = img_hash.hash.astype('int').flatten(); vector_length = hash_array.shape[0] id_to_vec[count] = hash_array
Now we've got a dictionary id_to_vec
with integer ids for the data, from the count, as well as the vector of hash data in 0s and 1s.
The next step is to set up the AnnoyIndex
and add this data to it. You could do this all in one step, but I wanted to separate it out for this post.
f = vector_length dist_function = "hamming" t = AnnoyIndex(f, dist_function) for key,value in id_to_vec.items(): t.add_item(key,value)
So we iterated through the id_to_vec
dictionary and added the keys and values to t
.
t.add_item(key,value)
Annoy is a C++ library with Python bindings, and the vector length tells it how much memory to allocate for each vector.
Now we can build the trees. I just picked 200 for the number of trees as I was playing around with different results.
num_trees = 200 t.build(num_trees)
Now the forest is ready, and we can query it.
I'm going to find the nearest neighbors of this image, img-24.jpg from my images_list
.
And since the input data maps an integer id to the image vector data, we need to query with the id of the query image.
In this case, I just incremented the id for each iteration, so it is the same as the index of the image in the original images_list
.
query_index = images_list.index('img-24.jpg') num_neighbors = 9 neighbors = t.get_nns_by_item(query_index,num_neighbors,include_distances=True)
So here I've queried the tree with the method .get_nns_by_item()
and passed:
including_distances
- setting this to true will return the distances of each of these neighbors to the query image.In the code above, as well as in the previous post, I used the wavelet hash
img_hash = imagehash.whash(img)
And one thing it does is convert the image to grayscale, so the color data is not preserved in the hash.
Now there is also a colorhash option, which I think is relatively new.
img_hash = imagehash.colorhash(img)
I will show the nearest neighbors using both of these hashes(separately) below.
Here are the 9 nearest neighbors of the query image, when using imagehash.whash
.
Here is a random sample of 81 images from the dataset overall for comparison.
There are 900+ images in the directory.
And here are the 9 nearest neighbors using imagehash.colorhash
.
And there are a few other hashing options in the imagehash
library as well, so I would definitely recommend trying them all out to see the different results.
Just a few comments on what I tried with this.
I am not really sure when the hamming distance metric could be used out of the box with nearest neighbor search in Scikit-learn, because all of the inputs seem to get converted into floats, and there is nothing you can do about it.
Also, with the way that the BallTree works, even writing your own custom metric(instead of using the built-in hamming distance) doesn't really work, because the metric is not only run against the input data.
The BallTree also creates its own tree node bounds and calculates the metric between those nodes and your data as well.
So I couldn't find a meaningful way to represent the image hash data that didn't have problems when calculating the distance between the node bounds.
If anyone reading this has any insight on use cases of scikit-learn's BallTree with the hamming distance, I would love to hear it!
This is one solution for how compare a large dataset of images for similarity and retrieve the most similar images for a particular query image.
I really liked the Annoy library and appreciate the ease of using the hamming distance with bit arrays.
Hopefully this helped some of you.
Let me know what kind of image search projects you're working on!
This is Part II of my post on image similarity in Python with perceptual hashing. In this post, we will use Spotify's Annoy library to perform nearest neighbors search on a collection of images to find similar images to a query image.
Read MoreKruskal's algorithm finds a minimum spanning tree in an undirected, connected and weighted graph. We will use a union-find algorithm to do this, and generate a random maze from a grid of points.
Read MoreJust in time for Valentine's day, create a puckering lips animation in D3 from an SVG path, using interpolations and .attrTween(). We will go through the steps from generating points from an SVG path, to interpolating lines in D3 to animate them.
Read More