Topic modelling and cosine similarity

User 5214 | 5/20/2016, 10:33:51 AM

Hi,

I am using GraphLab for topic modelling and followed this tutorial from the user guide. Now, for a new document A, I want to calculate the cosine similarity between A and the documents in the topic that A belongs to. Please help me with this. Here is the code I've written for topic modelling:

import graphlab as gl

docs = load()
d = gl.text_analytics.tf_idf(docs['Text'])
model = gl.topic_model.create(d)
docs['Topic'] = model.predict(d)

test_docs = loadTest()
test_d = gl.text_analytics.tf_idf(test_docs['Text'])
prediction = model.predict(test_d)
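# Assuming test_docs holds only the new document A, keep the training
# documents that were assigned to the same topic as A.
same_topic = docs[docs['Topic'] == prediction[0]]

# Here I want to calculate cosine similarity to rank the documents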

Thank you in advance.

Comments

User 1207 | 5/20/2016, 7:19:47 PM

Hello @aqib,

You can probably get something close to what you want fairly easily using the composite distances in the nearest_neighbors module; see https://dato.com/products/create/docs/generated/graphlab.nearestneighbors.create.html#graphlab.nearest_neighbors.create and the final example there. A composite distance lets you weight the topic distance heavily and the document similarity less so, so the results favor documents in the same topic over documents in different ones.
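
To make the weighting idea concrete, here is a minimal sketch in plain Python. It does not use the GraphLab API at all. It assumes each document is a dict with hypothetical 'id', 'topic' and 'tfidf' keys, where 'tfidf' maps words to weights, and the weights 10.0 and 1.0 are arbitrary illustration values rather than recommended settings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse
    vectors stored as {word: weight} dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def rank_documents(query, docs, topic_weight=10.0, text_weight=1.0):
    """Return document ids ordered by a weighted sum of topic mismatch
    (0 if same topic, 1 otherwise) and cosine distance on the text."""
    scored = []
    for doc in docs:
        topic_dist = 0.0 if doc["topic"] == query["topic"] else 1.0
        text_dist = cosine_distance(query["tfidf"], doc["tfidf"])
        total = topic_weight * topic_dist + text_weight * text_dist
        scored.append((total, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored)]


# Toy data (made up for illustration only).
docs = [
    {"id": "d1", "topic": 0, "tfidf": {"graph": 1.0, "model": 1.0}},
    {"id": "d2", "topic": 0, "tfidf": {"topic": 1.0, "model": 1.0}},
    {"id": "d3", "topic": 1, "tfidf": {"graph": 1.0, "model": 1.0}},
]
query = {"id": "A", "topic": 0, "tfidf": {"graph": 1.0, "model": 1.0}}

ranking = rank_documents(query, docs)
print(ranking)
```

With a large topic weight, documents from other topics end up at the bottom of the ranking even when their text matches closely, and documents within the same topic are ordered by cosine distance. If you only ever want documents from the same topic, filtering first and ranking by cosine distance alone gives the same order.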

Hope that helps! -- Hoyt