Posts

Showing posts from April, 2017

Recommendation system approaches with pyspark

I am currently exploring how to build recommendation systems with PySpark. Spark is an in-memory, distributed processing engine that also supports near-real-time (streaming) workloads.

Different techniques for building a recommendation system:

Similarity Approach
    Cosine Similarity: Calculate the cosine of the angle between two feature vectors (such as the number of times each of two users listened to each song).
    Jaccard Similarity / Index: Compute the intersection over the union: wiki

Collaborative Filtering Approach
    Matrix Factorization: Find two dense matrices (which represent latent features) that, when multiplied together, approximately recreate the original sparse matrix.
    K Nearest Neighbors: Find the k nearest vector points and use them to classify the new observation.

Setup Spark:

    import pyspark
    from pyspark.context import SparkContext
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer
    from pyspark.sql.types import *
    from pyspark.sql import Row
    from sklearn.metrics.pairwise import cosine_simil...
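The two similarity measures listed above can be sketched in plain Python, without Spark, to make the arithmetic concrete. This is only an illustration under stated assumptions: the function names, the play counts, and the song identifiers are all hypothetical and do not come from the post's own code, and returning 0.0 for an all-zero vector or an empty union is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat similarity with an all-zero vector as 0.
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical play counts for the same four songs, one vector per user.
alice_plays = [5, 0, 3, 1]
bob_plays = [4, 0, 0, 2]

# Hypothetical sets of songs each user has listened to at least once.
alice_songs = {"song_a", "song_c", "song_d"}
bob_songs = {"song_a", "song_d"}

print("cosine:", cosine_similarity(alice_plays, bob_plays))
print("jaccard:", jaccard_similarity(alice_songs, bob_songs))
```

Cosine similarity uses the play counts themselves, so heavy and light listeners with the same taste profile still score as similar. Jaccard only looks at whether a song was played at all, which suits implicit yes/no data.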