Feeding Word2vec with tens of billions of items, what could possibly go wrong?

Scale
06/12/2017 - 14:00 to 14:40
Moon Lounge
long talk (40 min)
Intermediate

Session abstract: 

Word2vec is an unsupervised algorithm that represents words as vectors, the so-called word embeddings. It computes them so that words with similar meanings end up close together in the vector space. Originally developed for NLP applications, it has recently proven to be extremely relevant in numerous other contexts, such as biology, speech recognition, and recommendation.
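
To make the geometric intuition concrete, here is a minimal sketch (not from the talk; the vectors and words are made up) of how closeness in the embedding space is typically measured, using cosine similarity:

```scala
object CosineSimilarity {
  // Cosine similarity: close to 1.0 for vectors pointing the same way,
  // lower for unrelated ones. Word2vec places related words at high cosine.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    // Toy 4-dimensional embeddings; real word2vec vectors typically
    // have 100-300 dimensions.
    val paris  = Array(0.9, 0.1, 0.3, 0.2)
    val london = Array(0.8, 0.2, 0.4, 0.1)
    val banana = Array(0.1, 0.9, 0.0, 0.7)
    println(f"cos(paris, london) = ${cosine(paris, london)}%.3f") // high
    println(f"cos(paris, banana) = ${cosine(paris, banana)}%.3f") // lower
  }
}
```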

At Criteo, an ad retargeting company, we use word2vec to compute product embeddings. This allows us to compute similarities between products and consequently improve the relevance of the products we recommend to our users.
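
The abstract does not spell out how products become "words"; a common approach, sketched below under that assumption (all names are hypothetical), is to treat each user's chronological sequence of viewed products as a word2vec "sentence" of product IDs:

```scala
// Hypothetical sketch: user browsing histories as word2vec "sentences",
// with product IDs playing the role of words.
case class Event(userId: String, productId: String, timestamp: Long)

object SessionsAsSentences {
  def toSentences(events: Seq[Event]): Seq[Seq[String]] =
    events
      .groupBy(_.userId)                           // one "sentence" per user
      .values
      .map(_.sortBy(_.timestamp).map(_.productId)) // chronological order
      .toSeq

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("u1", "p42", 1L), Event("u1", "p7", 2L),
      Event("u2", "p7", 1L),  Event("u2", "p13", 2L)
    )
    toSentences(events).foreach(println)
    // Products that co-occur in many sessions end up close together
    // in the embedding space, which is what makes them "similar".
  }
}
```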

Spark MLlib provides a distributed implementation of Word2vec. However, running it daily on tens of billions of items, as we do, raises challenges in terms of scalability and quality. In this talk I will share with you how we moved from a proof-of-concept implementation to an A/B test. I will tell you about the numerous things we broke along the way and how we fixed them. Finally, I will explain how we plan to improve our product embeddings and use them in different ways.
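
As an illustration only (column names and parameter values are my assumptions, not the production settings discussed in the talk), training product embeddings with Spark MLlib's DataFrame-based Word2Vec looks roughly like this:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object ProductEmbeddings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("product-word2vec").getOrCreate()
    import spark.implicits._

    // Each row is one user's sequence of viewed product IDs.
    val sessions = Seq(
      Seq("p42", "p7", "p13"),
      Seq("p7", "p13", "p99")
    ).toDF("products")

    val word2vec = new Word2Vec()
      .setInputCol("products")
      .setOutputCol("embedding")
      .setVectorSize(100)  // embedding dimensionality
      .setMinCount(1)      // raise this to drop rare products at scale
      .setNumPartitions(8) // parallelism: a key scalability knob

    val model = word2vec.fit(sessions)

    // Nearest neighbours in the embedding space, i.e. similar products.
    model.findSynonyms("p7", 2).show()

    spark.stop()
  }
}
```

At tens of billions of items, the interesting knobs are exactly the ones hinted at above: vocabulary pruning via minCount and partitioning of the training data, which is where the scalability and quality trade-offs the talk covers come in.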
