Recommendation System and Big Data Mining
Content
- Content
- Recommendation System
- Content Based System
- How to build a Recommendation System
- How to evaluate a recommender system?
- How to fetch similar items (pictures, documents) when the size of the item set is in the billions?
- How to cluster N data points where each data point has D dimensions
Recommendation System
- For more recommendation system content, click here
Recommendation systems use a number of different technologies. We can classify these systems into two broad groups.
- Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the “cowboy” genre.
- Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.
Content Based System
Our ultimate goal for content-based recommendation is to create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the utility matrix.
- Create Item Profile: using the features of the item
  - $movie: [actor, director, genre, imdb_{rating}, \dots]$
  - documents: remove stop words, compute the tf-idf score for each word, and pick the top $k$ words to represent the document
  - $document: [word_1, word_2, \dots]$
- Measure similarity between two items using an appropriate distance-based method (e.g., cosine distance).
Recommending Items to Users Based on Content
Given a user to whom we want to recommend some items, we can apply the same two techniques (random hyperplanes and LSH) to determine in which buckets we must look for items that might have a small cosine distance from the user's profile vector.
How to build a Recommendation System
There are basically three types of recommender systems:
Demographic Filtering - They offer generalized recommendations to every user, based on movie popularity and/or genre. The system recommends the same movies to users with similar demographic features. Since each user is different, this approach is considered too simple. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.
Content Based Filtering - They suggest similar items based on a particular item. This system uses item metadata, such as genre, director, description, actors, etc. for movies, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
For example, when designing a movie recommendation system using content-based filtering, the movie description can be used as the notion of content. Say we have descriptions for K movies:
- Convert each description into a TF-IDF vector of dimension D, i.e., we get a matrix M of size K x D.
- The dot product M.dot(M.transpose()) gives a K x K matrix, where each row holds the similarity scores of the movie indexed at that row against all the other movies indexed at the columns of that row.
- Pick the top N movies from that row's ranking based on their similarity (a short sketch follows this list).
- Idea
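As a concrete illustration, here is a minimal sketch of this pipeline using scikit-learn. The titles and descriptions below are made-up placeholders; since TF-IDF rows are L2-normalised by default, the dot product (`linear_kernel`) is the cosine similarity.

```python
# Minimal content-based recommender: TF-IDF over descriptions, then cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy catalogue: K movies with short descriptions.
titles = ["Space Saga", "Desert Gunslinger", "Galaxy Wars", "Cowboy Noon"]
descriptions = [
    "A crew travels across the galaxy to fight an empire.",
    "A lone gunslinger rides into a desert town seeking revenge.",
    "Rebels battle an evil empire among the stars.",
    "A sheriff faces outlaws at noon in a dusty cowboy town.",
]

# K x D TF-IDF matrix (rows are L2-normalised by default).
vectorizer = TfidfVectorizer(stop_words="english")
M = vectorizer.fit_transform(descriptions)

# K x K similarity matrix; with normalised rows the dot product is cosine similarity.
sim = linear_kernel(M, M)

def recommend(title, n=2):
    """Return the top-n most similar movies to the given title."""
    idx = titles.index(title)
    ranked = sim[idx].argsort()[::-1]          # indices sorted by decreasing similarity
    return [titles[j] for j in ranked if j != idx][:n]

print(recommend("Galaxy Wars"))  # likely to surface the other space movie first
```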
Collaborative Filtering - This system matches people with similar interests and provides recommendations based on this matching. Unlike their content-based counterparts, collaborative filters do not require item metadata.
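For contrast, here is a tiny user-based collaborative filtering sketch on a made-up utility matrix; the ratings and the simple similarity-weighted average are illustrative assumptions, not a production recipe.

```python
# User-based collaborative filtering sketch on a toy utility matrix.
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (made-up ratings).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def predict(user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated the item."""
    raters = [u for u in range(R.shape[0]) if u != user and R[u, item] > 0]
    sims = sorted(((cosine_sim(R[user], R[u]), u) for u in raters), reverse=True)[:k]
    num = sum(s * R[u, item] for s, u in sims)
    den = sum(abs(s) for s, _ in sims)
    return num / den if den else 0.0

print(round(predict(user=0, item=2), 2))  # low: user 0 resembles user 1, not users 2-3
```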
Resource:
- Kaggle: Recommender systems in python INTERMEDIATE
- Kaggle: Movie recommendation system EASY
- Kaggle: Neural network embedding recommendation system
- Kaggle: How to recommend anything deep recommender
- Kaggle: Film recommendation engine
- movie-recommender-systemsx
How to evaluate a recommender system?
Precision and Recall are useful for set-based results. For rank-based results we instead measure Precision@K and Recall@K, whose formulas are slightly different.
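For reference, the usual definitions (assuming binary relevance) are:

$$
\text{Precision@}K = \frac{\left|\{\text{relevant items}\} \cap \{\text{top-}K\ \text{recommendations}\}\right|}{K},
\qquad
\text{Recall@}K = \frac{\left|\{\text{relevant items}\} \cap \{\text{top-}K\ \text{recommendations}\}\right|}{\left|\{\text{relevant items}\}\right|}
$$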
The two most popular ranking metrics are MAP and NDCG
- MAP: Mean Average Precision
- NDCG: Normalized Discounted Cumulative Gain
The main difference between the two is that MAP assumes binary relevance (an item is either of interest or not), while NDCG allows relevance scores in the form of real numbers. The relationship is analogous to that between classification and regression.
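A small sketch of both ideas on a single ranked list; the relevance labels are invented, and MAP would simply be the mean of these AP values over all users or queries.

```python
import numpy as np

def average_precision(rel):
    """AP for one ranked list; rel[i] = 1 if the item at rank i+1 is relevant (binary)."""
    rel = np.asarray(rel)
    if rel.sum() == 0:
        return 0.0
    hits = np.where(rel == 1)[0]
    precisions = [rel[: k + 1].sum() / (k + 1) for k in hits]  # Precision@k at each hit
    return float(np.mean(precisions))

def ndcg_at_k(gains, k):
    """NDCG@k; gains are graded (real-valued) relevance scores in ranked order."""
    def dcg(g):
        g = np.asarray(g, dtype=float)[:k]
        return float(np.sum(g / np.log2(np.arange(2, g.size + 2))))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(average_precision([1, 0, 1, 0, 0]))        # 0.833...
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))
```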
Resource:
- evaluating-recommender-systems
- An Introduction to Information Retrieval: Manning, Chapter 8
- IMP Slide: cs276, Chapter 8
How to fetch similar items (pictures, documents) when the size of the item set is in the billions?
Question: Given high-dimensional data points $x_1, x_2, \dots$ (e.g., images) and some distance metric $d(\cdot, \cdot)$, find all pairs of data points $(x_i, x_j)$ that are within a distance threshold, i.e., $d(x_i, x_j) \leq s$.
- Shingles: A document is a string of characters. Define a k-shingle for a document to be any substring of length k found within the document.
  - Example: Suppose our document D is the string $abcdabd$, and we pick $k = 2$. Then the set of 2-shingles for D is $\{ab, bc, cd, da, bd\}$.
- Min-Hashing: Compress large documents into small signatures that preserve the expected similarity of any pair of documents. The signatures we construct for sets are composed of the results of a large number of calculations, say several hundred, each of which is a minhash. There is a remarkable connection between minhashing and the Jaccard similarity of the sets that are minhashed.
  - Minhash vs Jaccard similarity: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
- Locality-Sensitive Hashing (LSH), also known as Nearest Neighbour Search (NNS): If there are $N$ documents and we want to compute pairwise similarity, then there are $N \choose 2$ pairs, i.e., $O(N^2)$ work, and if $N \sim 1M$ the computation time is huge. So the idea is: don't compare every pair; only compare pairs that are likely to be similar. This technique is called Locality-Sensitive Hashing or Nearest Neighbour Search.
- One general approach to LSH is to “hash” items several times, in such a way (create a band of size $r$) that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair. We check only the candidate pairs for similarity. The hope is that most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked.
- If we have minhash signatures for the items, an effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each. For each band, there is a hash function that takes vectors of r integers (the portion of one column within that band) and hashes them to some large number of buckets.
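With $b$ bands of $r$ rows each, a standard result (stated here for reference, not spelled out above) is that two documents with Jaccard similarity $s$ become a candidate pair with probability

$$
P(\text{candidate pair}) = 1 - \left(1 - s^{r}\right)^{b}
$$

which gives the familiar S-curve; the effective similarity threshold sits roughly at $s \approx (1/b)^{1/r}$.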
Summary:
- For each document, first apply k-shingling.
- Apply minhashing on the k-shingles to compress each large document into a small signature.
- Apply LSH or NNS on those minhash signatures to get the buckets containing the items that are likely to be similar (a minimal end-to-end sketch follows this list).
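A minimal end-to-end sketch of the three steps; the documents are toy strings, and seeded calls to Python's built-in `hash` stand in for real hash functions (an illustrative assumption, not tuned parameters).

```python
# k-shingling -> minhash signatures -> LSH banding, on toy documents.
import random
from collections import defaultdict

def shingles(doc, k=2):
    """Set of all k-character substrings (k-shingles) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def minhash_signature(shingle_set, hash_seeds, n_buckets=2**31 - 1):
    """One minhash per seeded hash function: the minimum hash value over the set."""
    return [min(hash((seed, s)) % n_buckets for s in shingle_set)
            for seed in hash_seeds]

def lsh_candidates(signatures, bands, rows):
    """Group items whose signatures agree on all rows of at least one band."""
    assert bands * rows == len(next(iter(signatures.values())))
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    candidates.add((members[i], members[j]))
    return candidates

random.seed(0)
docs = {"d1": "abcdabd", "d2": "abcdabe", "d3": "zzzzyyyy"}   # toy documents
seeds = [random.randrange(2**32) for _ in range(20)]          # 20 "hash functions"
sigs = {d: minhash_signature(shingles(text), seeds) for d, text in docs.items()}
print(lsh_candidates(sigs, bands=10, rows=2))  # d1/d2 are likely candidates, d3 is not
```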
Similar question:
- How to fetch similar items (pictures, documents) when the size of the item set is in the billions?
- Document de-duplication
Resource:
- Stanford cs246: Mining Massive Datasets
- Youtube: Lecture 12: MinHashing
- Youtube: Lecture 13: LSH
- Pydata - Talk on LSH
How to cluster N data points where each data point has D dimensions
Case 1: If N and D are small enough for the data to fit into memory (RAM), use an in-memory clustering algorithm (a quick sketch follows this list):
- Hierarchical Agglomerative Clustering (bottom-up)
- K-Means
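A quick sketch of Case 1 with scikit-learn on toy data (the two Gaussian blobs below are invented for illustration):

```python
# In-memory clustering on a small N x D matrix (toy data) with scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5))])  # N=200, D=5

hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)  # bottom-up merging
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(hier_labels), np.bincount(km_labels))  # roughly 100 points per cluster
```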
Case 2: If N and D are so large that the data cannot fit into memory and must live on disk, use one of the methods below:
- BFR [Bradley-Fayyad-Reina] (1-pass algorithm)
- CURE [Clustering Using REpresentatives] (2-pass algorithm)
In my opinion, go with CURE; it is easy to understand.
Key Sauce for this kind of problem:
- Remember to apply sampling: sample points so that they fit into memory and an in-memory algorithm like Hierarchical Agglomerative Clustering or K-Means can be applied to form the initial clusters.
- Then load data from disk one block at a time (a block that fits into memory), update the initial clusters, and proceed (a sketch follows this list).
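One way to realise this sample-then-stream idea with off-the-shelf tools is scikit-learn's MiniBatchKMeans, updated one block at a time via partial_fit; this approximates the idea above and is not the exact BFR or CURE algorithm.

```python
# Sample-then-stream clustering sketch: seed clusters on an in-memory sample,
# then refine them one disk-sized block at a time with MiniBatchKMeans.
# (Approximates the idea; not the exact BFR or CURE algorithm.)
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

def blocks_from_disk(n_blocks=20, block_size=10_000, d=16):
    """Stand-in for reading blocks of points from disk, one block at a time."""
    centers = rng.normal(0, 10, (3, d))
    for _ in range(n_blocks):
        labels = rng.integers(0, 3, block_size)
        yield centers[labels] + rng.normal(0, 1, (block_size, d))

model = MiniBatchKMeans(n_clusters=3, random_state=0)

# Step 1: fit on a sample small enough for memory.
sample = next(blocks_from_disk(n_blocks=1))
model.partial_fit(sample)

# Step 2: stream the remaining blocks and update the centroids incrementally.
for block in blocks_from_disk():
    model.partial_fit(block)

print(model.cluster_centers_.shape)  # (3, 16)
```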
Also remember that in both K-Means and BFR, each cluster has a single representative point, namely the centroid. In the case of the CURE algorithm, by contrast, each cluster is represented by a set of K well-scattered representative points.
Resource: