A vanilla implementation of Collaborative Filtering for Image Recommendation System
Recently, I came across a problem where I had to design an image recommendation system. I had a dataset with 1 million user interaction points, each in the form <user_id, picture_id>, representing that the user with the given user_id “liked” the picture with the given picture_id.
This was a classic use case for collaborative filtering: for a given user, find the users with the most similar tastes in pictures and recommend what they “liked”. To understand more about collaborative filtering, see this link.
Coming to the implementation, the first step was to preprocess the data so that computing commonly “liked” pictures becomes simpler. One approach that I found effective was to create a dictionary with user_id as the key and a list of picture_ids representing the pictures that user liked.
Given dataset was in the following form:
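Using hypothetical IDs for illustration, the raw interactions look like:

```
(user_1, picture_1)
(user_1, picture_2)
(user_2, picture_1)
(user_3, picture_2)
```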
I had to get it into the following form:
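Using hypothetical IDs for illustration, the target structure looks like:

```
{
    user_1: [picture_1, picture_2],
    user_2: [picture_1],
    user_3: [picture_2]
}
```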
To achieve this, I read the tuples from a data frame and created a dictionary as follows:
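A minimal sketch of that step, assuming the interactions live in a pandas DataFrame with columns `user_id` and `picture_id` (the column names and sample rows are my assumptions, standing in for the 1M-row dataset):

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data standing in for the real interactions.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u3"],
        "picture_id": ["p1", "p2", "p1", "p2"],
    }
)

# Map each user_id to the list of picture_ids that user liked.
likes = defaultdict(list)
for user_id, picture_id in zip(df["user_id"], df["picture_id"]):
    likes[user_id].append(picture_id)
```

With lists keyed by user, checking the overlap between any two users' liked pictures becomes a simple set operation.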
Once I had the data in the form mentioned above, all that was needed was to compare similarities between users. I found K-Nearest Neighbours (KNN) to be a suitable algorithm for this problem. To keep things simple, my measure of similarity between two users was the ratio of the number of pictures liked by both users to the total number of unique pictures that either of them had interacted with (i.e. the Jaccard index of their liked-picture sets).
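As a concrete illustration of that similarity measure (the two sets of liked pictures are invented for the example):

```python
# Pictures liked by two hypothetical users.
a = {"p1", "p2", "p3"}
b = {"p2", "p3", "p4"}

# Ratio of commonly liked pictures to the total unique pictures
# either user interacted with: |a ∩ b| / |a ∪ b|.
similarity = len(a & b) / len(a | b)  # 2 common out of 4 unique -> 0.5
```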
The following method was created to identify and store the k users closest to a given user.
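A sketch of what that method could look like, using the similarity measure described above; the function and variable names are mine, and `likes` is the user-to-pictures dictionary built earlier:

```python
def k_nearest_users(likes, target_user, k):
    """Return the k users most similar to target_user.

    Similarity is the ratio of commonly liked pictures to the
    total unique pictures both users interacted with.
    """
    target_set = set(likes[target_user])
    scores = []
    for user, pictures in likes.items():
        if user == target_user:
            continue  # never compare a user with themselves
        other_set = set(pictures)
        union = target_set | other_set
        similarity = len(target_set & other_set) / len(union) if union else 0.0
        scores.append((similarity, user))
    scores.sort(reverse=True)  # highest similarity first
    return [user for _, user in scores[:k]]

# Hypothetical data for illustration.
likes = {
    "u1": ["p1", "p2", "p3"],
    "u2": ["p1", "p2"],
    "u3": ["p4"],
}
```

For example, `k_nearest_users(likes, "u1", 1)` returns `["u2"]`, since u2 shares two of u1's three liked pictures while u3 shares none.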
Once the list of the k most similar users was available, I just needed to fetch the pictures those users had liked, again from the dictionary mentioned above. Of course, I had to exclude the pictures the given user had already liked, so as not to recommend duplicates.
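That final step can be sketched as follows; the names are mine, and `neighbours` is assumed to be the list returned by the KNN step:

```python
def recommend(likes, target_user, neighbours):
    """Collect pictures liked by the neighbours that the target
    user has not already liked, preserving neighbour order."""
    already_liked = set(likes[target_user])
    recommendations = []
    for neighbour in neighbours:
        for picture in likes[neighbour]:
            if picture not in already_liked and picture not in recommendations:
                recommendations.append(picture)
    return recommendations

# Hypothetical data for illustration.
likes = {
    "u1": ["p1", "p2"],
    "u2": ["p1", "p3"],
    "u3": ["p2", "p4"],
}
```

Here `recommend(likes, "u1", ["u2", "u3"])` yields `["p3", "p4"]`: p1 and p2 are dropped because u1 already liked them.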
Points to note:
- The above explanation is not meant to describe a real world recommendation system. The objective was to understand and present the concepts of Collaborative Filtering and KNN.
- KNN, while very effective, isn’t very scalable. To classify or score a given input, the distance to every other data point has to be calculated, which becomes time- and compute-intensive as the data grows in volume.
- The above model is intuitive and works well because our dataset has only 2 dimensions. As the number of dimensions grows, visualisation becomes challenging, and hence we seek help from mathematics rather than intuition.