User 2007 | 7/8/2015, 9:00:24 PM
What is the recommended approach in GLC for phrase/collocation detection and extraction as a preprocessing step for a topic model?
Up until now I've used gensim's Phrases class for this task. I've taken a look at GLC's count_ngrams function, and it seems to be suited to inspecting an entire corpus at once. So you could concatenate all the documents, run count_ngrams, and then find the set of ngrams that meet a certain threshold. You would then have to run count_ngrams on each document individually and filter out all of the ngrams that didn't meet the global threshold. I can think of a few other ways of doing it, but none of them seem much better than using gensim, so I'm curious to hear what others have done.
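
To make the two-pass idea above concrete, here is a minimal sketch in plain Python. It does not use GLC's count_ngrams or gensim; the helper names (bigrams, find_phrases, merge_phrases), the min_count parameter, and the toy documents are all made up for illustration, and it assumes each document is already tokenized into a list of words.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs in one tokenized document."""
    return list(zip(tokens, tokens[1:]))


def find_phrases(docs, min_count=2):
    """Pass 1: count bigrams over the whole corpus and keep the frequent ones."""
    counts = Counter()
    for tokens in docs:
        counts.update(bigrams(tokens))
    return {bg for bg, c in counts.items() if c >= min_count}


def merge_phrases(tokens, phrases, sep="_"):
    """Pass 2: rewrite one document, joining any bigram that passed the global threshold."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical toy corpus, already tokenized.
docs = [
    "new york is a big city".split(),
    "i love new york".split(),
    "a big apple".split(),
]

phrases = find_phrases(docs, min_count=2)
merged = [merge_phrases(d, phrases) for d in docs]
```

Note that a raw count threshold also keeps uninteresting pairs such as "a big" here, which is one reason a scored approach like gensim's Phrases can still be preferable.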