Best way to do phrase (collocation) detection for topic modelling preprocessing

User 2007 | 7/8/2015, 9:00:24 PM

What is the recommended approach in GLC for doing phrase/collocation detection and extraction as preprocessing for a topic model?

Up until now I've used gensim's Phrases class for this task. I've taken a look at GLC's count_ngrams function, and it seems suited to inspecting an entire corpus at once. So you could concat all the documents, run count_ngrams, and find the set of ngrams that meet a certain threshold. You would then have to run count_ngrams on each document individually and filter out all of the ngrams that didn't meet the global threshold. I can think of a few other ways of doing it, but none of them seem much better than using gensim, so I'm curious to hear what others have done.
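For concreteness, here's a minimal pure-Python sketch of that concat/count/threshold/filter pipeline, using `collections.Counter` in place of GLC's count_ngrams (the function name and threshold are illustrative, not part of any library's API):

```python
from collections import Counter

def bigrams(tokens):
    # Pair each token with its successor: ["a", "b", "c"] -> [("a", "b"), ("b", "c")]
    return list(zip(tokens, tokens[1:]))

def phrase_filter(docs, min_count=2):
    # 1. Count bigrams over the whole corpus (the "concat + count_ngrams" step).
    global_counts = Counter(bg for doc in docs for bg in bigrams(doc))
    # 2. Keep only bigrams that meet the global threshold.
    keep = {bg for bg, c in global_counts.items() if c >= min_count}
    # 3. Re-count per document and drop anything below the global threshold
    #    (the per-document count_ngrams + filter step).
    return [{bg: c for bg, c in Counter(bigrams(doc)).items() if bg in keep}
            for doc in docs]

docs = [["new", "york", "city"],
        ["new", "york", "state"],
        ["old", "york"]]
# ("new", "york") occurs twice globally and survives; all other bigrams are dropped.
print(phrase_filter(docs))
```

The same shape works for higher-order ngrams by generalizing `bigrams`; the awkward part, as noted, is that the global counting pass and the per-document pass are two separate traversals of the corpus.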


User 19 | 7/9/2015, 3:55:30 AM

Hi Ben,

This is definitely on our roadmap. We'd love to hear about your typical use cases, especially in conjunction with topic modeling.

I agree that, at the moment, it's a bit awkward to concat all documents, run count_ngrams, sort, threshold, and then dict_trim_by_keys. For now, using gensim -- perhaps combined with sf.apply -- is a good workaround.
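For reference, the collocation score that gensim's Phrases applies (the formula from the original word2vec paper) can be sketched in plain Python. This is an illustrative stdlib reimplementation, not gensim's actual code; the function name and cutoff handling are assumptions:

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    # word2vec-style collocation score, as used by default in gensim's Phrases:
    #   score(a, b) = (count(ab) - min_count) * |V| / (count(a) * count(b))
    word_counts = Counter(w for doc in docs for w in doc)
    bigram_counts = Counter(bg for doc in docs for bg in zip(doc, doc[1:]))
    vocab_size = len(word_counts)
    return {
        (a, b): (c - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        for (a, b), c in bigram_counts.items()
        if c > min_count  # bigrams at or below min_count score <= 0
    }

docs = [["machine", "learning", "rocks"],
        ["machine", "learning", "is", "fun"],
        ["learning", "to", "cook"]]
# "machine" always co-occurs with "learning", so that pair dominates.
print(score_bigrams(docs))
```

Gensim then joins any pair whose score clears a user-set threshold into a single token (e.g. `machine_learning`), which can be done per-row here via sf.apply before handing documents to the topic model.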