Image Recommendation with Approximate KNN using Collaborative Filtering with ANNoy!
I was recently working on an image recommendation engine which was to be based on a dataset that contains details of interaction of users with images i.e. which users “liked” a particular image. I have described about a Vanilla approach to this in this post. In this article, I am going to discuss about how Approximate KNN works with ANNoy.
ANNoy is a python library created by folks at Spotify that provides abstraction over core ANN implementation. The basic idea behind ANNoy or any other Approximate Nearest Neighbors is to index the training data points in a form that they are efficiently searchable. For example, in case of ANNoy makes use of random projections to create hyperplanes which are used to determine a region where one or more points exist. Search only needs to be performed within this region. To understand more about how ANN works, follow this link.
Coming to the implementation, the first step was to re-arrange our data in a form that was easy to process.
In the above code snippet, the input file is first read and the image identifiers are collected by grouping based on user_id. This prepares data in the following form:
In addition to this, we have collected a couple more variables:
- item_id_set: This is a set of all the unique item_id values. We will use this to prepare vectors in the next step
- backup_df: As the name suggests, this is supposed to act as a backup of our source data
It’s also noteworthy that we have limited the number of users-item combination to 2000 to keep dimensionality in check.
In the next step, we convert the above form of data into vector form:
The above code snippet converts the data set into the following vector form:
choice_vector contains 1s for the item_id values that the user “liked” and 0s for the item_id values, that the user didn’t react to.
The choice vector here, has over 10000 dimensions which makes it very impractical to work with. Hence, the next step would be to reduce the dimensionality of these vectors. There are multiple ways of achieving this, but since our vectors are highly sparse, SVD or Singular Value Decomposition seems to be the best fit. To understand more about what SVD is and how it works, please follow this link.
The above code snippet reduces the number of dimensions from over 10000 to 40, and leaves us with something like below:
Once, we have the dimensionality in check, let’s start with initialising an ANNoy index:
The above code snippet does the following things:
- Initialises an AnnoyIndex with 40 dimensions i.e. the number of dimensions in our reduced vectors
- Adds combination of all user_id and vectors to the index
- Builds a forest of 20 trees to facilitate ANN
- Saves the index to a file named ‘behance.ann’
The next step would be to get the n nearest neighbors:
The above method needs to be called to identify the k nearest neighbors. Once we have the user_id of k nearest neighbors, we can do the following:
- Go to our backup data frame that we created in the first step
- Get a list of items that the nearest neighbors liked
- Remove any item_id that the current user has already liked
- Return the list of item_id recommendations
Compared to the vanilla approach discussed earlier, this is a significantly efficient and practical way to identify nearest neighbors which can then be extended to use cases like collaborative filtering, content filtering and many other classification systems.