Just recently, the plista Machine Learning Team attended the highly-coveted Neural Information Processing Systems (NIPS) 2016 Conference in Barcelona in quest of the latest technologies and projects within machine learning and computational neuroscience. This 30th annual conference with more than 5000 attendees included invited talks, symposia and presentations from a selection of sought-after papers from machine learning experts.

Semantic recommendations and Word Mover Distance

At plista, we want to give you the best, that is, the most relevant recommendations for you. Let’s say, you have an article on a popular news site and want to generate recommendations on what the user could read next. What other articles do you want to recommend? One possibility is to look for articles that are similar in their content: they may be about the same (or a similar) subject, or they may be written in a similar style or even occur in the same region. If you can define a similarity metric based on the articles you compare, you can choose the one with the highest similarity and recommend it; this is what is called semantic recommendations.

Choosing such a similarity, however, is no trivial thing to do and is still an open problem in research. It is not only the words in the text and their frequency that is important, but also their inherent meaning that each of us attribute to as well as the socio-cultural influence on the word itself. Words (and languages as a whole) are weird and tricky if you think about it because a word itself carries little information. Its meaning is determined by what people associate it with, and that may also change over time.

How do we teach a computer the meaning of a word?

This problem begs the question: How do we teach a computer the meaning of a word? There are several ways, actually. We can, for example, create a so-called concept graph in which we create a network of words and (semantic) concepts to link the words, thus getting an understanding of a word’s meaning by its connections. Microsoft has released such a tool a short while ago (https://concept.research.microsoft.com/Home/Introduction), and it does look interesting, although I personally haven’t tried it.


Text embeddings: Shamelessly ripped from www.tensorflow.org

Word embeddings

Another approach, one that we will talk about today, are so-called word embeddings, which are vector representations of words with a fixed size. For these word embeddings, we do not try to learn an explicit representation of the word in the vector (i.e. we don’t want to do a one-hot encoding or something similar). Instead, we want the distance between a pair of vectors to reflect a meaningful property between the corresponding words. This way, we get vector representations for words that, for themselves, carry no special meaning, but as a pair encode a relationship between underlying words. The vector for “man” and “women” may be completely different, but their distance is similar to the distance of the vectors for the word “king” and “queen”. It is also possible to add or subtract vectors and get to a new result vector: Vec(“Angela Merkel”) – Vec(“Deutschland”) + Vec(“Russland”) = Vec(“Putin”). This approach is called Word2Vec (https://arxiv.org/abs/1301.3781) and works astonishingly well to encode information of words.

Now, the open problem is how to get from the similarity of the word-level to the similarity on a paragraph or document level. Taking the mean of all word vectors may not accurately represent the underlying semantic meaning of the text, nor is it enough to just concatenate all the word vectors into one big document vector and compute the similarity from it, because you end up with vectors of different sizes that are not really comparable (plus you don’t do the intricacies and additional information imposed by the text structure and grammar justice).


Word embeddings in a vector space

There are variants where people try to learn vector embeddings for paragraphs

(https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44894.pdf) or even go to full document vectors (https://openreview.net/pdf?id=B1Igu2ogg), but instead we want to focus on a more traditional metric for two texts that gets an even better performance.

Word Movers Distance (WMD)

One of the metrics to compare documents on a word-vector level is the Word Movers Distance (WMD) (http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf). The WMD works by summing up pairwise distances from words of one text to the nearest-neighbor word of the second text. In other words: WMD computes the pairwise (euclidean) distance from one word in the first text to each word in the second text, selects the word with the lowest distance, then takes the next word from the first text and computes the pairwise distances and so on, until it finally sums up all the computed shortest distances to get a document distance. I don’t want to get into detail on how to compute the WMD – the avid reader may follow the link to the original paper provided above.

The image below is taken from the original author’s paper and shows the relative error of WMD compared with other methods when trying to do text classification using KNN. These errors are reported as relative to the Bag Of Words (BOW) approach, and you can see that WMD gives a way lower error.

Figure 4 of the original authors paper:


Centroid-based metric (WCD)

Having a good measure is one thing, but in reality you always need another thing: scaling. The authors note that their algorithm is the most accurate but also the slowest, having an algorithmic runtime of O(p³log(p)), where p is the number of unique words in the article. Also, comparing all the documents you have is of order O(n²) where n is the number of documents that you compare. Now, go on and imagine a company like plista with a couple hundreds of thousands of documents it has to compare, each with a lot of unique words and a tight time constraint. The authors address the runtime problem and give a few ways of optimizing it. Instead of naively computing the WMD they compute a centroid-based metric (WCD) which has a quadratic time complexity and is used to rank the existing documents. Only near-ranked documents are then compared with the actual WMD metric. This gives a huge speed boost, provided that you are only interested in comparing the top-k texts.

What comes after WMD?

I am not sure. Even though WMD is a good measure of how similar texts are and are a good way to classify them in different categories, it still does not provide a deeper understanding of the text, in my opinion. Spacial arrangement of words, for example, has a high influence in what we want to express with a text. How do we deal with recursions, word duplications, etc.? To get really good semantic recommendations, you need a system that is able to fully capture all the intricacies of the human language, which we don’t have. Yet.