Topic modeling, words on topic id per document

User 5327 | 6/24/2016, 1:50:44 PM

On a large sample (300,000+) of documents, graphlab.topic_model.TopicModel.get_topics() returns an SFrame containing a list of words for each topic and a score indicating how highly each word ranks for that topic. Next, using model.predict(), I get an SArray containing the most probable topic id per document. For each document, however, I would like to know which of the words in that document belong to its predicted topic id. Is there a way to do this? Thanks so much in advance for your reply.

Comments

User 940 | 6/24/2016, 8:22:00 PM

Hi @accounting,

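Could you add the .predict() result as a topic id column on your documents SFrame, then do an SFrame join with .get_topics() on the topic id (https://dato.com/products/create/docs/generated/graphlab.SFrame.join.html)? After that, tokenize each document and keep only the tokens that appear in the word list of its predicted topic.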
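To make the filtering step concrete, here is a rough sketch in plain Python. It does not call GraphLab at all: the dict `topic_words` is a stand-in for the word lists you would collect from .get_topics(), the list `predicted_topics` is a stand-in for the .predict() output, and the function name `words_on_predicted_topic` and the regex tokenizer are my own assumptions, not part of the library. Treat it as an illustration of the logic rather than the recommended implementation.

```python
import re

def words_on_predicted_topic(docs, predicted_topics, topic_words):
    """For each document, return the tokens found in its predicted topic's word list.

    docs             -- list of raw document strings
    predicted_topics -- list of topic ids, one per document (stand-in for predict())
    topic_words      -- dict: topic id -> set of words (stand-in for get_topics())
    """
    result = []
    for text, topic_id in zip(docs, predicted_topics):
        vocab = topic_words.get(topic_id, set())
        # Very simple tokenizer (an assumption): lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        seen = set()
        matched = []
        for token in tokens:
            # Keep each matching word once, in order of first appearance.
            if token in vocab and token not in seen:
                seen.add(token)
                matched.append(token)
        result.append(matched)
    return result

# Toy data, made up purely for illustration.
topic_words = {
    0: {"invoice", "payment", "tax"},
    1: {"match", "goal", "team"},
}
docs = [
    "The invoice and the payment are due.",
    "Our team scored a late goal!",
]
predicted_topics = [0, 1]

matches = words_on_predicted_topic(docs, predicted_topics, topic_words)
print(matches)
```

Note that .get_topics() only returns the top-ranked words per topic, so a document can come back with few or no matches unless you ask for more words per topic. The tokenizer should also match whatever preprocessing was used to build the model's vocabulary.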

Let us know if this helps.

Cheers! -Piotr