Calculating word-word similarity in a topic model

User 1933 | 11/20/2015, 3:32:15 AM

My overall goal is this: after learning a topic model using topic_model.create(...), I want to be able to calculate similarity between arbitrary words from the corpus on which I trained the model. The problem is that the word-topic matrix learned in an LDA model (let's assume one row per unique term and one column per learned topic) is made up of probability distributions over words for each topic. If we take the row of this matrix corresponding to some word, it is not a probability distribution, meaning we can't just use it as a feature vector that we feed to a standard similarity measure (e.g. cosine similarity or Euclidean distance). I have a proposed solution, but I'm curious whether it seems reasonable to you guys, and/or whether you can propose something more reasonable.

My basic idea is this:

1 - Train the topic model:

m = gl.topic_model.create(docs, ...)

2 - Get a document-topic matrix as a numpy array, using m.predict, and then scale the values by the number of tokens in each document. This gives us an estimate of the actual number of occurrences of each topic within each document (this of course means we're drinking the kool-aid on the generative model, where each individual token has an individual topic assignment):

doc_topic_mat = np.array(m.predict(docs,output_type='probabilities'))
token_counts = np.array(docs.apply(lambda row: sum(row.values())))[:,np.newaxis]
doc_topic_mat_freqs = doc_topic_mat * token_counts

3 - Now, take the sum of each column in the scaled document-topic matrix from step 2. This should give us the total number of occurrences of each topic, across all documents (more kool-aid, of course):

topic_counts = doc_topic_mat_freqs.sum(0)

4 - Get the word-topic matrix (wherein each column is a probability distribution over words for a given topic), and scale each probability by the total number of occurrences of the corresponding topic. This leaves us with an estimate of the number of times each word occurs in each topic across the corpus:

word_topic_mat = np.array(m['topics']['topic_probabilities'])
word_topic_mat_freq = word_topic_mat*topic_counts

5 - Finally, now that everything has been converted to pseudo-frequencies, we can divide each row in the word-topic matrix by its sum, giving a probability distribution over topics for each word:

word_topic_probs = word_topic_mat_freq / word_topic_mat_freq.sum(1,keepdims=True)
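
Steps 2 to 5 can be checked on toy data without a trained model. The sketch below uses made-up matrices in place of the output of m.predict and m['topics'] (all numbers are assumptions for illustration only). It also shows that the result matches what Bayes' rule gives, P(topic | word) proportional to P(word | topic) * P(topic), with P(topic) estimated from the topic pseudo-counts:

```python
import numpy as np

# Toy stand-ins for the model outputs (made-up numbers, not from a real model).
# doc_topic_mat: 3 documents x 2 topics, each row sums to 1.
doc_topic_mat = np.array([[0.9, 0.1],
                          [0.5, 0.5],
                          [0.2, 0.8]])
# token_counts: number of tokens per document, as a column vector.
token_counts = np.array([[10.0], [20.0], [30.0]])
# word_topic_mat: 4 words x 2 topics, each column sums to 1, i.e. P(word | topic).
word_topic_mat = np.array([[0.4, 0.1],
                           [0.3, 0.1],
                           [0.2, 0.3],
                           [0.1, 0.5]])

# Step 2: expected number of tokens assigned to each topic in each document.
doc_topic_mat_freqs = doc_topic_mat * token_counts
# Step 3: expected number of tokens per topic over the whole corpus.
topic_counts = doc_topic_mat_freqs.sum(0)
# Step 4: expected number of times each word is generated by each topic.
word_topic_mat_freq = word_topic_mat * topic_counts
# Step 5: normalize each row to get a distribution over topics per word.
word_topic_probs = word_topic_mat_freq / word_topic_mat_freq.sum(1, keepdims=True)

# Cross-check: the same quantity computed via Bayes' rule.
topic_prior = topic_counts / topic_counts.sum()
bayes = word_topic_mat * topic_prior
bayes = bayes / bayes.sum(1, keepdims=True)
```

In this toy case the topic pseudo-counts add up to the total number of tokens (60), and every row of word_topic_probs sums to 1.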

Now, if my logic isn't completely off, we can treat each row of word_topic_probs as a probability distribution over topics for a given word, which means we can effectively use it as a feature array for the purposes of word-word similarity calculations. So for instance, to compare word 0 and word 1 under the model, we could do something like the following (euclidean returns a distance, so smaller values mean more similar words):

from scipy.spatial.distance import euclidean
dist = euclidean(word_topic_probs[0], word_topic_probs[1])

And of course, to simplify lookup of actual words we could build a dictionary first (or something equivalent):

term_dict = {a:i for i,a in enumerate(m['vocabulary'])}
dist_cat_dog = euclidean(word_topic_probs[term_dict['dog']], word_topic_probs[term_dict['cat']])
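
As a rough sketch of how the lookup and the choice of measure could be wrapped together, here is a small helper using the two measures mentioned at the top (Euclidean distance and cosine similarity). The helper name word_distance, the three-word vocabulary and the probabilities are all hypothetical and stand in for the real word_topic_probs and term_dict. Note that both scipy functions return distances, so smaller means more similar:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Toy P(topic | word) rows (made-up numbers), standing in for word_topic_probs.
word_topic_probs = np.array([[0.8, 0.2],   # 'dog'
                             [0.7, 0.3],   # 'cat'
                             [0.1, 0.9]])  # 'stock'
# Hypothetical vocabulary lookup, standing in for term_dict.
term_dict = {'dog': 0, 'cat': 1, 'stock': 2}

def word_distance(word_a, word_b, metric=euclidean):
    """Distance between the topic distributions of two words (0 = identical)."""
    return metric(word_topic_probs[term_dict[word_a]],
                  word_topic_probs[term_dict[word_b]])

dist_dog_cat = word_distance('dog', 'cat')
dist_dog_stock = word_distance('dog', 'stock')
# scipy's cosine() is a distance as well; cosine similarity is 1 minus it.
cos_sim_dog_cat = 1.0 - word_distance('dog', 'cat', metric=cosine)
```

With these toy numbers, 'dog' ends up closer to 'cat' than to 'stock', which is the behaviour we would hope for from a real model.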

What do you think? Lots of assumptions here, but it's the most reasonable approach I've been able to think up so far.


User 1190 | 11/26/2015, 8:54:22 PM


Thanks for the detailed discussion. In general, this is a common question we encounter in feature engineering: how to normalize or standardize the feature vector. The answer always depends on what you do with the feature vector.

In this case, the original feature vector for each word is: f(w) = [P(w | topic 1), P(w | topic 2), ...]

The transformed vector becomes: g(w) = [P(w, topic 1), P(w, topic 2), ...], where each dimension is scaled by the probability of the topic itself.

Using f(w) as the feature vector ignores how often each topic occurs, so every dimension is treated equally. Using g(w), however, takes topic occurrence into account: the dimensions of rare topics are weighted less than those of popular topics.
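
To make the difference concrete, here is a small sketch with made-up numbers (the word vectors and topic probabilities are assumptions, not output from a real model). Two words differ only on a rare topic; under f(w) that difference counts in full, while under g(w) it is shrunk by the topic's probability:

```python
import numpy as np

# Made-up P(w | topic) entries for two words over 2 topics.
f_a = np.array([0.30, 0.10])
f_b = np.array([0.30, 0.40])   # differs from f_a only on topic 2
# Assumed topic probabilities: topic 2 is rare.
topic_probs = np.array([0.95, 0.05])

# g(w): each dimension of f(w) scaled by the probability of its topic.
g_a = f_a * topic_probs
g_b = f_b * topic_probs

# Euclidean distance between the two words under each construction.
dist_f = np.linalg.norm(f_a - f_b)
dist_g = np.linalg.norm(g_a - g_b)
```

Here dist_f is 0.3 while dist_g is 0.3 * 0.05 = 0.015, so the disagreement on the rare topic matters much less under g(w).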

Both feature constructions are valid; it just depends on which one fits your case better :)