Calculating perplexity on held-out documents

User 1933 | 11/18/2015, 3:01:20 AM

So I've been working on evaluating some topic models as I vary the number of topics, and ran into a problem. I start by generating train and test data like so (using test as my hold out set):

train, test = gl.SFrame(docs).random_split(0.9, seed=99)

Then, for varying K, I run:

topic_model = gl.topic_model.create(train, num_topics=k, num_iterations=50, method='cgs')
perplexity = topic_model.evaluate(train, test)['perplexity']
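
Spelled out, the whole sweep is roughly this (a sketch of my setup; results is just a dict I collect the scores in):

    import numpy as np

    results = {}
    for k in np.arange(10, 101, 5):
        topic_model = gl.topic_model.create(train, num_topics=k,
                                            num_iterations=50, method='cgs')
        # train is what the model was fit on, test is my held-out 10%
        results[k] = topic_model.evaluate(train, test)['perplexity']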

What I've run into is monotonically decreasing perplexity as I increase the number of topics, which seems wildly implausible with the data I have (I'm running with k in np.arange(10,101,5)). Trying to figure out what's happening, I dug into the documentation for TopicModel.evaluate and found this:

The provided `train_data` and `test_data` must have the same length,
i.e., both data sets must have the same number of documents; the model
will use train_data to estimate which topic the document belongs to, and
this is used to estimate the model's performance at predicting the
unseen words in the test data.

If I interpret train_data to be the data I originally trained the model on, and test_data to be my hold-out set, this can't work: they are obviously not the same size. Despite that size restriction, the call doesn't fail, but I suspect I'm doing something wrong. Most of the examples describe splitting each document in half (i.e., training on the first half of a document and testing the likelihood of the second half), but most of what I've seen in the literature, and what I want to accomplish, is to calculate the perplexity of the model on a held-out set of documents.

For my use case, should I just be calling evaluate with test_data as a single argument? i.e.

perplexity = topic_model.evaluate(test)['perplexity']

That would seem to imply that evaluate predicts topics for the new unseen documents, given the trained model, and then calculates the likelihood of those same documents under the model? Am I on the right track?

Comments

User 19 | 11/18/2015, 10:25:09 PM

Hi jlorince,

Evaluating topic models is a bit different from evaluating other kinds of machine learning models. A common way of doing the evaluation is to let the model see a small portion of a document and then ask it to predict the rest. To facilitate this, we have a function that creates a random split of each document, producing a training set and a test set that are both of dictionary type, where the counts in the two splits sum to the counts in the original dictionary: graphlab.text_analytics.random_split

So changing your first line to the following should fix things: train, test = gl.text_analytics.random_split(docs, 0.8)
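
Putting that together with your evaluation call, it would look roughly like this (a sketch, assuming docs is your SArray of bag-of-words dictionaries):

    # Split *within* each document: roughly 80% of each document's word counts
    # go to train and the rest to test, so both splits have one row per document.
    train, test = gl.text_analytics.random_split(docs, 0.8)

    topic_model = gl.topic_model.create(train, num_topics=k,
                                        num_iterations=50, method='cgs')

    # train and test now have the same number of documents, which is what
    # evaluate() expects for its train_data/test_data pair.
    perplexity = topic_model.evaluate(train, test)['perplexity']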

Let me know if that helps, Chris


User 1933 | 11/18/2015, 10:40:46 PM

I understand that, but as far as I can tell it's more common (and better) to predict unseen documents held out from the corpus. So my question remains: does running perplexity = topic_model.evaluate(test)['perplexity'], where test is an SArray of the held-out documents, accomplish this?


User 19 | 11/18/2015, 11:07:57 PM

No, topic_model.evaluate(test) does not do that.

Predicting unseen documents from the corpus with a topic model does not work very well because you have no information to infer the topics; hence a topic model gives you little benefit over a simple frequency-based model of word counts.


User 1933 | 11/18/2015, 11:20:59 PM

How do you mean? The term frequencies in the unseen document, coupled with the word-topic distribution learned by the model, are what you use to make predictions for new documents. Predicting unseen documents is a pretty fundamental use case for a topic model; that's exactly what [TopicModel.predict](https://dato.com/products/create/docs/generated/graphlab.topic_model.TopicModel.predict.html) does, as far as I can tell. And even if you go back to the original LDA paper, they evaluate models by holding out 10% of the documents in the corpus and then calculating perplexity on the held-out set, given the trained model. This is all very confusing...


User 19 | 11/19/2015, 12:06:59 AM

Not quite. TopicModel.predict will return a probability distribution over the learned topics given the text of the document.
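
For example (a sketch; I'm assuming your version of predict accepts output_type='probability', so check the docstring):

    # One probability distribution over the K learned topics per document,
    # rather than a single hard topic assignment.
    topic_probs = topic_model.predict(test, output_type='probability')
    # topic_probs[i] is a length-K list of probabilities summing to 1 for document i.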

For another example of evaluating topic models in this way, see http://www.sravi.org/pubs/fastlda-kdd2014.pdf.

For more in depth reading on this, see http://www.arbylon.net/publications/text-est2.pdf.

There are more accurate ways of evaluating these models, but they are complicated and time consuming, e.g., http://dirichlet.net/pdf/wallach09evaluation.pdf and http://cseweb.ucsd.edu/~jfoulds/Foulds2014Annealing.pdf.

To be honest, we provide this method so that you can have some quantitative evidence to assess training progress. However, you should always confirm that the learned topics are useful for your particular end task:

  • If it's a classification task, then check the classifier's accuracy.
  • If it's interpretability, then you should read http://www.ics.uci.edu/~newman/pubs/Newman-ADCS-2009.pdf and http://www.cs.columbia.edu/~blei/papers/ChangBoyd-GraberWangGerrishBlei2009a.pdf.

User 1933 | 11/19/2015, 1:26:51 AM

Sorry to beat a dead horse, but the moral seems to be that what I'm describing does have precedent. I skimmed through the two papers you linked to, and one of the methods described amounts to just what I propose: train a topic model on, say, 90% of the corpus; predict topic distributions for the training set using the model; and then calculate the perplexity of the test set.

I know there are lots of ways to evaluate a topic model, with various costs and benefits... I just want to make sure I'm accomplishing what I think I am here.


User 1933 | 11/19/2015, 1:37:43 AM

(Oh, and btw, I hope I haven't come off as combative here; I'm still a bit of a novice with topic modeling and trying to wrap my head around all this.)


User 19 | 11/19/2015, 7:34:17 PM

Hi jlorince,

No worries. The distinction lies in the definition of "90% of the corpus". Those two papers do a 90/10 split within each document, rather than train on 90% of the documents and test on the remaining 10%. The former is sometimes called a "document completion" approach, and this is the way model.evaluate() works. See the section just under Eqn 13 here: http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.
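
Roughly, and in my own notation rather than the paper's, the document-completion perplexity is

    \mathrm{perplexity} = \exp\left( - \frac{\sum_d \sum_w n^{\mathrm{test}}_{dw} \log \sum_k \hat{\theta}_{dk} \, \hat{\phi}_{kw}}{\sum_d \sum_w n^{\mathrm{test}}_{dw}} \right)

where \hat{\theta}_{dk} is the topic proportion inferred from the observed (train) half of document d, \hat{\phi}_{kw} is the learned probability of word w under topic k, and n^{\mathrm{test}}_{dw} counts word w in the held-out half of document d.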

I agree there are ways of averaging over the topic probabilities for entirely held-out documents; you're correct that the current model.evaluate() doesn't do this operation.

Happy to continue the discussion, Chris


User 1933 | 11/19/2015, 7:59:06 PM

Are you quite sure? Quoting the first paper you linked (emphasis mine): "We use the standard held-out method [10] to evaluate test perplexity, in which a small set of test documents originating from the same collection is set to query the model being trained" (sec. 5.2). That sure sounds like holding out documents, not splitting within documents. The second paper is, as far as I can tell, ambiguous as to which method they use (section 7.1): "A common criterion of clustering quality that does not require a priori categorisations is the likelihood of held-out data under the trained model... i.e., the ability of a model to generalise to the unseen data... The common method to evaluate perplexity in topic models is to hold out test data from the corpus to be trained and then test the estimated model on the held-out data". Blei's original paper from 2003 is also somewhat ambiguous, but to me it reads like they do the 90/10 split over documents.

In any case, I'm not really clear on why model.evaluate doesn't do what I want here. As far as I can tell from looking through the documentation, when I call this method with a single argument (my held-out documents), it (a) calculates topic distributions for each document in the test set, and then (b) calculates the perplexity of this test data, given the model.

If not this, what is happening?


User 1933 | 11/19/2015, 8:05:48 PM

Oh, and while we're discussing these issues anyway, do you know whether GraphLab can properly handle TF-IDF-normalized data in a topic model? In my tests it has generally worked alright (though I get errors for small numbers of topics, for some reason), but I don't know how it will handle the document splitting when we have non-integer term frequencies...


User 19 | 11/19/2015, 9:24:33 PM

Hi jlorince,

On a second read, I agree that the first paper seems to be holding out documents rather than doing document completion. In their case it sounds like they use a held-out set to estimate the topic proportions in the test documents. The Teh 2013 paper would have been a better example of document completion as an evaluation metric. Also see the paragraph before Eqn 7 in http://www.jmlr.org/papers/volume10/newman09a/newman09a.pdf.

As to why: We simply haven't implemented the held-out document case, where you integrate over the predicted topic distribution using some other estimate of topic probabilities. You could do this yourself by using model.predict(held_out_docs), averaging the topic probabilities, then using model.predict(test_set) and model['topics'] to compute the perplexity as described in the Li 2014 paper.
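
As a rough illustration (just a sketch, not exactly the averaging recipe above): one way to get a held-out-document perplexity is to take per-document topic proportions from predict() and combine them with the topic-word probabilities in model['topics']. Here held_out_docs is a placeholder for your SArray of held-out bag-of-words dicts, and I'm assuming predict accepts output_type='probability' and that model['topics'] has 'topic_probabilities' and 'vocabulary' columns; check the docs for your version.

    import numpy as np

    # Per-document topic proportions for the held-out documents, shape (D, K).
    theta = np.array(list(model.predict(held_out_docs, output_type='probability')))

    # Topic-word probabilities: one row per vocabulary word, one column per topic.
    topics_sf = model['topics']
    phi = np.array(list(topics_sf['topic_probabilities']))        # shape (V, K)
    vocab = {w: i for i, w in enumerate(topics_sf['vocabulary'])}

    # Perplexity of the held-out documents under the model.
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(held_out_docs):
        for word, count in doc.items():
            if word not in vocab:
                continue                                 # skip out-of-vocabulary words
            p_w = np.dot(phi[vocab[word]], theta[d])     # p(w | d) = sum_k theta_dk * phi_kw
            log_lik += count * np.log(p_w)
            n_words += count
    perplexity = np.exp(-log_lik / n_words)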

If just one SFrame is provided, we estimate document topic proportions and then compute perplexity with that one data set.

Regarding non-integer input: the values are floored to integers.

Hope that helps, Chris


User 1933 | 11/20/2015, 1:01:36 AM

Thanks for talking through all this with me. The ironic thing is that, after all this hand-wringing, I'm getting damn near the same perplexity measurements with the 90/10 within-document split (using text_analytics.random_split) as I was when I was (erroneously, it would seem) using the method I described at the beginning of this thread. Oh well... at least this method has precedent, and I'm sure the code is doing what I actually think it's doing. One small feature request you might pass along, though: allow text_analytics.random_split to accept a random seed argument in the same way that SFrame.random_split does. I found that calling np.random.seed before text_analytics.random_split achieves the desired result, but the discrepancy between the two APIs is a bit strange.
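
For the record, the workaround looks like this (a sketch; it gave me reproducible splits in my tests, though an explicit seed argument would obviously be cleaner):

    import numpy as np
    import graphlab as gl

    # Workaround: seed numpy's RNG before calling random_split to get a reproducible split.
    np.random.seed(99)
    train, test = gl.text_analytics.random_split(docs, 0.8)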

In any case, I took to heart your earlier comment about letting the ultimate use case of the model dictate the evaluation metric. My end goal here is actually two-fold: topic exploration/interpretation, but also (and more importantly) a means of calculating word-word similarities, so I'm working on implementing that now.

Of course, word-word similarity isn't exactly straightforward in a GraphLab model either, but perhaps that conversation should go in another thread... (While I may be a thorn in your side at this point, I hope other people come across this conversation and find it useful.)


User 19 | 11/20/2015, 1:49:53 AM

I absolutely agree that text_analytics.random_split should have a random seed. I have added an issue to our internal bug tracker.

Your word-word similarity task sounds interesting. I agree it's probably worth another thread; I look forward to hearing more!

No worries: this conversation has been useful to several people already. Thanks for taking the time to dig into the papers and discuss it!


User 1933 | 11/20/2015, 3:33:11 AM

Great! This has been a great discussion. New topic posted here: http://forum.dato.com/discussion/1465/calculating-word-word-similarity-in-a-topic-model