Using neural embeddings for search ranking and recommendations

This is a summary of a project I worked on as a data scientist / software engineer at Peerspace.

Peerspace is a two-sided marketplace for renting unique spaces for meetings, events, off-sites and things like that.  It's basically like Airbnb, but leaning more towards commercial use cases and hourly rentals.

A screenshot of what the Peerspace search results page looks like


While I was at Peerspace, I had the opportunity to work on a lot of pretty interesting things.  This included learning Clojure, a dynamically typed Lisp dialect that runs on the JVM.  In addition, I got to do a lot of work with search ranking and search infrastructure.  One of the search-related projects I worked on was exploring and applying deep learning to improve our search ranking and recommendations.

One algorithm I explored was called product2vec, which essentially uses deep learning to learn a neural "embedding" for each of the products in a product catalog.  Embeddings are vectorized representations of the products themselves, and they live in the weights of the penultimate (second-to-last) layer of the neural network, right before it makes its prediction.  The algorithm is trained on historical engagement data, such as the clickstream of products that customers click on.  Once the embeddings have been learned, you can use them to find products that are similar to each other.
 
If you were shopping for cell phones right now, an iPhone and a Samsung Galaxy would probably be some of the products you looked at during a typical search session.  By training on customer clickstreams, we can learn these relationships between products.


Overview of product2vec algorithm:

The idea of product2vec comes from natural language processing (NLP).  NLP is basically the domain of figuring out how to programmatically work with words.  One challenge in NLP is learning a representation of words that is actually useful.  For example, words that are synonyms of each other (e.g. smart, intelligent, precocious) should not be treated as independent, orthogonal entities, but rather as things that are closely related to each other.  But how would you "learn" these relationships?

An algorithm called word2vec learns from sentences to find words that are related to each other.  It does so by mapping sparse, high-dimensional word vectors into a denser, lower-dimensional space, and in the process, similar words get grouped together.

Learning the relationship between words:

To get a better understanding of how this works, let's first take a step back.

The English language has approximately 100,000 words.  One way of programmatically representing these words is as a sparse vector with 100,000 dimensions.

Each word in the English language can be represented as a vector of 99,999 zeroes and
a single one:

For example:

bear = [0, 0, 0, 0, 1, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0]
cat   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 1, 0, 0, 0]

etc etc..

However, if you were to represent the words this way, you would just have 100,000 orthogonal vectors.  Imagine a 100,000-dimensional space, and in this space 100,000 vectors, each at 90 degrees to the others (i.e. "orthogonal").  There would be no way of knowing which word vectors are similar to each other (synonyms), which are opposites (antonyms), and by what degree.
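To make the orthogonality point concrete, here is a minimal sketch in Python (the word indices are made up):

import numpy as np

VOCAB_SIZE = 100_000

def one_hot(index, size=VOCAB_SIZE):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

bear = one_hot(4)       # hypothetical index for "bear"
cat = one_hot(97_123)   # hypothetical index for "cat"

# Any two distinct one-hot vectors are orthogonal: their dot product is 0,
# so this representation says nothing about how similar two words are.
print(np.dot(bear, cat))   # 0.0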

Ideally, you would want to map all the words into a lower dimension and group the related words together.  An example would be something similar to what you see in the diagram below.
In a nutshell, we would like to "collapse" this 100,000-dimensional space filled with orthogonal word vectors into a lower dimension, while learning the relationships between the words.  The end result would be that related words are grouped together in the vector space:

- smart, intelligent, precocious, wise (synonyms)
- January, February, March (months of the year)
- tree, arbor, plant, bush (all plant related)
- currency, money, cash, stock (finance related)

The Word2vec Skip-Gram algorithm

So how would we learn this lower dimensional space?  In simple terms, we would do so by training a neural network to output the probability of seeing nearby "context" words in sentences.

For example, say one of the sentences from our word corpus was:

"The quick brown fox jumps over the lazy dog"

Now, given the word "fox", what are some of the words that would appear next to it in this sentence?
They include: "quick", "brown", "jumps", "lazy".  We'd want to train the neural network to be able to predict this.


We generate a training set by creating tuples of the target word and a context word that is near it in a sentence.  See the diagram above.

Target word + context word pairs are used as positive training examples, whereas target word + randomly selected word pairs are used as negative examples.
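Here is a minimal sketch of how these training pairs could be generated (the window size and number of negative samples below are arbitrary choices for illustration):

import random

def skipgram_pairs(sentence, window=2, vocab=None, num_negative=2):
    """Generate (target, context, label) triples from a tokenized sentence.

    Positive pairs (label=1) come from words within `window` positions of the
    target; negative pairs (label=0) pair the target with random words.  In
    practice the negatives would be drawn from the full corpus vocabulary.
    """
    vocab = vocab if vocab is not None else list(set(sentence))
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, sentence[j], 1))                 # positive example
            for _ in range(num_negative):
                pairs.append((target, random.choice(vocab), 0))    # negative example
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:4])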

The following would be the architecture of the neural network:



The training set for this neural network model would be target word + context word pairs generated from a large corpus of sentences (say, from Wikipedia, or from the comments on YouTube or IMDB).  The input of the neural network would be a one-hot encoded "sparse" word vector, and the output would be a probability distribution across all the words that could appear near it in a sentence.

Word2Vec Keras - negative sampling architecture
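A minimal sketch of what such a negative-sampling architecture could look like in Keras (the vocabulary size and embedding dimension below are placeholder values, not the ones used in any real model):

from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10_000      # placeholder vocabulary size
embedding_dim = 128      # placeholder embedding dimension

target_in = keras.Input(shape=(1,), dtype="int32", name="target_word")
context_in = keras.Input(shape=(1,), dtype="int32", name="context_word")

# A single shared embedding layer: its weight matrix is the table of
# word embeddings we ultimately want to extract.
embedding = layers.Embedding(vocab_size, embedding_dim, name="word_embedding")

target_vec = layers.Flatten()(embedding(target_in))
context_vec = layers.Flatten()(embedding(context_in))

# Dot product of the two embeddings, squashed to a probability that the pair
# is a real (target, context) pair rather than a random negative sample.
score = layers.Dot(axes=1)([target_vec, context_vec])
output = layers.Dense(1, activation="sigmoid")(score)

model = keras.Model([target_in, context_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy")

After training on the positive/negative pairs, the learned embeddings are simply the rows of the word_embedding layer's weight matrix (e.g. model.get_layer("word_embedding").get_weights()[0]).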

Now, if we were to train the neural network over a large training set, i.e. over millions of sentences, you would see that, on average, words that are similar to each other have similar context words!  For example, words such as "smart" and "intelligent" are essentially interchangeable in sentences, so it makes sense that they have similar "context" words surrounding them.  Words such as "January" and "March" would also have similar context words, since months of the year can be swapped in a sentence without changing its meaning.

Where do embeddings come from?

After the training process is complete, what would you see if you were to examine the weights of the second-to-last layer (i.e. the embedding layer) of the neural network used to make the prediction?  You would see that words that are similar to each other have very similar weights!  The reason they have similar weights is that they are predicting similar probability distributions of context words.



If you were to visualize the weights of this second-to-last layer by plotting them as points or vectors on a graph, you would see that similar words are grouped together.  If you were to measure the angle between two similar word vectors (for example, via the normalized dot product, i.e. cosine similarity), you would find that the angle between them is very small.  On the other hand, two word vectors with opposite meanings would sit at close to 180 degrees from each other.
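As a toy illustration (the three-dimensional "embeddings" below are made up, not learned):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings; in practice these would be rows of the trained
# embedding layer's weight matrix.
smart       = np.array([0.8, 0.1, 0.4])
intelligent = np.array([0.7, 0.2, 0.5])
january     = np.array([-0.6, 0.9, -0.2])

print(cosine_similarity(smart, intelligent))   # close to 1  -> small angle
print(cosine_similarity(smart, january))       # much lower  -> unrelated words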




Now what does all of this have to do with search ranking and product recommendations?

Well here is the key insight and analogy.

words = products
sentences = a series of products engaged with in a typical search session.

Imagine you were someone shopping for a cell phone on Amazon.  During a typical search session, you might click on an iPhone, a Samsung Galaxy, and a Huawei.  Thus, all of these products would be contextually related to each other, and those relationships would be captured in the neural embeddings during the training process!



This is essentially the product2vec algorithm.  It is the word2vec skip-gram algorithm, but creatively applied to user behavior data.
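One way to sketch this idea (not the exact implementation we used at Peerspace) is to feed sessions of product IDs into an off-the-shelf word2vec implementation such as gensim's; the session data below is made up:

from gensim.models import Word2Vec

# Each "sentence" is one user's search session: the ordered list of product
# IDs they clicked on.  (Hypothetical example sessions.)
sessions = [
    ["iphone_13", "galaxy_s22", "huawei_p50"],
    ["galaxy_s22", "pixel_6", "iphone_13"],
    ["coffee_maker", "kettle", "french_press"],
]

# Skip-gram (sg=1) with negative sampling, exactly as word2vec is trained on
# text -- except the "words" here are product IDs.
model = Word2Vec(
    sentences=sessions,
    vector_size=64,   # embedding dimension
    window=5,         # context window within a session
    min_count=1,
    sg=1,
    negative=5,
)

# Products that co-occur in sessions end up with similar embeddings.
print(model.wv.most_similar("iphone_13", topn=3))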

I was able to utilize this algorithm with great success at Peerspace.  In the diagrams below, the left listing/product is the "target" product and the right nine listings/products are the "context" products (i.e. the recommendations).  You can see that the recommendations are extremely relevant, and they were a drastic improvement over the existing recommendations.

"photo shoot" spaces



"dinner party" spaces



Keep in mind that the product2vec algorithm does not require any "domain knowledge" to work.  In a nutshell, you learn all of these product relationships by training a neural network on the clickstream data.



Some applications of product2vec:

Recommendations:
- find the top k most similar embeddings (by computing the dot product)

Personalized search ranking:
- compute the average embedding of the last k products a user engaged with, then boost similar products in that user's search ranking (both applications are sketched in the code below)
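A minimal sketch of both applications, assuming a hypothetical embedding matrix with one row per product:

import numpy as np

def top_k_similar(query_vec, embeddings, k=9):
    """Indices of the k products whose embeddings score highest against the
    query vector (dot product over L2-normalized vectors, i.e. cosine similarity)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = normed @ query
    return np.argsort(-scores)[:k]

# Hypothetical embedding matrix: 1,000 products, 64-dimensional embeddings.
product_embeddings = np.random.rand(1000, 64)

# Recommendations: products most similar to a given target product.
recs = top_k_similar(product_embeddings[42], product_embeddings, k=9)

# Personalized ranking: average the embeddings of the last k products a user
# engaged with, then boost products similar to that "user vector".
last_clicked = product_embeddings[[3, 17, 256]]
user_vector = last_clicked.mean(axis=0)
boosted = top_k_similar(user_vector, product_embeddings, k=9)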


Additional Resources:
To be continued...
