User 1933 | 11/18/2015, 3:01:20 AM
So I've been working on evaluating some topic models as I vary the number of topics, and ran into a problem. I start by generating train and test data like so (using test as my hold-out set):

train, test = gl.SFrame(docs).random_split(0.9, seed=99)
Then, for varying k, I run:

topic_model = gl.topic_model.create(train, num_topics=k, num_iterations=50, method='cgs')
perplexity = topic_model.evaluate(train, test)['perplexity']
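For reference, here's the whole sweep in one piece (a minimal sketch of what I'm running; I'm assuming docs is an SFrame with a bag-of-words dictionary column, and the results dict is just how I collect the numbers):

import numpy as np
import graphlab as gl

# 90/10 document-level split; test is my hold-out set
train, test = gl.SFrame(docs).random_split(0.9, seed=99)

# Sweep over the number of topics and record held-out perplexity for each k
results = {}
for k in np.arange(10, 101, 5):
    topic_model = gl.topic_model.create(train, num_topics=k,
                                        num_iterations=50, method='cgs')
    results[k] = topic_model.evaluate(train, test)['perplexity']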
What I've run into is monotonically decreasing perplexity as I increase the number of topics, which seems wildly implausible with the data I have (I'm running with k in np.arange(10, 101, 5)). Trying to figure out what's happening, I dug into the documentation for TopicModel.evaluate and found this:
The provided `train_data` and `test_data` must have the same length, i.e., both data sets must have the same number of documents; the model will use train_data to estimate which topic the document belongs to, and this is used to estimate the model's performance at predicting the unseen words in the test data.
Now, if I interpret train_data to be the data I originally trained the model on, and test_data to be my hold-out set, this can't work (they are obviously not the same size). Yet despite the same-size restriction, the call doesn't fail, so I suspect I'm doing something wrong. Most of the examples describe splitting each document in half (i.e., training on the first half of a document's words and testing the likelihood of the second half), but most of what I've seen in the literature, and what I want to accomplish, is to calculate the perplexity of the model on a held-out set of documents.
For my use case, should I just be passing test as a single argument? i.e.

perplexity = topic_model.evaluate(test)['perplexity']
That would seem to imply that evaluate infers topics for the new, unseen documents given the trained model, and then calculates the likelihood of those same documents under the model. Am I on the right track?
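In other words, I'm imagining evaluate(test) is roughly equivalent to something like the following (purely a guess at the internals; predict with output_type='probability', the standalone gl.topic_model.perplexity function, and the model fields I index into are all my assumptions about the API):

# My mental model of evaluate(test): infer per-document topic
# proportions for the unseen docs, then score those same docs
doc_topic_probs = topic_model.predict(test, output_type='probability')
perplexity = gl.topic_model.perplexity(test, doc_topic_probs,
                                       topic_model['topics']['topic_probabilities'],
                                       topic_model['vocabulary'])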