Est. Perplexity of topic model training

User 1221 | 1/22/2015, 5:19:27 AM

When I played with the topic model, the progress log showed Est. Perplexity reaching 0 from the very first iteration, as shown below. But when I then used model.evaluate(train, train) to compute perplexity, the output was fairly large: 2100.80224488. Is that a bug? I tried the data from the graphlab.topic_model.create example and the same thing happened.

Another source of confusion: I evaluated models with different numbers of topics, [50, 100, 200, 300]. Perplexity is supposed to go down as the number of topics increases, but the differences turned out to be quite small and show an increasing trend: [2227.567034643555, 2313.198493959926, 2608.0940621333843, 2954.710286365573]. Is that possible, or is there a problem with my data?

<pre class="CodeBlock"><code>PROGRESS: Learning a topic model PROGRESS: Number of documents 128726 PROGRESS: Vocabulary size 107421 PROGRESS: Running collapsed Gibbs sampling PROGRESS: +-----------+---------------+----------------+-----------------+ PROGRESS: | Iteration | Elapsed Time | Tokens/Second | Est. Perplexity | PROGRESS: +-----------+---------------+----------------+-----------------+ PROGRESS: | 1 | 3.33s | 2.35357e+06 | 0 | PROGRESS: | 2 | 5.02s | 2.21958e+06 | 0 | PROGRESS: | 3 | 6.71s | 2.22786e+06 | 0 | PROGRESS: | 4 | 9.33s | 1.44075e+06 | 0 | PROGRESS: | 5 | 11.97s | 1.42824e+06 | 0 |</code></pre>

Comments

User 19 | 1/22/2015, 5:35:14 AM

The "Est. Perplexity" prints 0 whenever you do not provide a validation set. I agree this is confusing... This will be changed in the future to instead print perplexity estimates regardless of whether a separate validation set is provided.

Yes, a value of 2100 is reasonable.

For your second question: yes, it is possible to observe those values. With a large K, more iterations may be required for convergence, and until it converges you may see worse perplexity estimates simply because the algorithm has not yet found a comparable set of parameters.

Another recommendation: use a held out validation set, e.g. m.evaluate(train, valid). It should give you a better sense of model quality.
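
For concreteness, here is a minimal sketch of that workflow; the SFrame `sf` with a 'text' column, the 'bow' column name, and the split proportions are illustrative, not from this thread:

<pre class="CodeBlock"><code>import graphlab as gl

# Hypothetical input: an SFrame `sf` with a 'text' column of raw documents.
# Convert each document to a bag-of-words dictionary for the topic model.
docs = gl.SFrame({'bow': gl.text_analytics.count_words(sf['text'])})

# Hold out a small fraction of documents for validation.
train, valid = docs.random_split(0.95, seed=1)

m = gl.topic_model.create(train, num_topics=50)

# Held-out perplexity gives a better sense of model quality than
# evaluating on the training set itself.
print(m.evaluate(train, valid)['perplexity'])</code></pre>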

If you have any more questions, please do not hesitate to ask.


User 1221 | 1/22/2015, 5:42:47 AM

Thanks for quick response. Very helpful.

Another question: if I don't set a fixed number of iterations, how does the training process terminate? Is there a default number?


User 19 | 1/22/2015, 6:03:48 AM

Yes, the default number of iterations is currently 10.
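
For larger numbers of topics you will usually want more sweeps than that. A small sketch, assuming the num_iterations keyword of graphlab.topic_model.create and the `train` SFrame from above (check the signature for your version):

<pre class="CodeBlock"><code># The default of 10 iterations is often too few for large K;
# num_iterations raises the number of collapsed Gibbs sampling sweeps.
m = gl.topic_model.create(train, num_topics=200, num_iterations=50)</code></pre>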


User 2032 | 6/10/2015, 8:45:04 AM

I wanted to add to this thread:

When using the topic model without a validation set, it runs very fast and prints a perplexity of 0; when using it with a validation set of 1.5% of the documents, it becomes prohibitively slow.

Does the topic model even work without the validation set?


User 19 | 6/10/2015, 5:57:10 PM

Hi Johnny,

Can you clarify? You said that the topic model works without a validation set, but then you ask whether the "topic model even works without the validation set".

Computing perplexity on a validation set is an expensive operation, since you must first estimate topic proportions for those documents. You can make it cheaper by using fewer documents in your validation set.

Hope that helps! Chris


User 2032 | 6/15/2015, 9:34:04 AM

Hi Chris,

To clarify:

  • Without a validation set, the model runs its iterations but reports 0 perplexity, which is impossible; this raises the question of whether it computes the model correctly without a validation set. In your previous post you explained that it does work without one.
  • With a validation set, it becomes prohibitively slow.

Questions that arise:

If validation-set computation is expensive, what is the right number or percentage of documents to pass via validation_set? This should be in the documentation.

Should a custom validation_set even be possible if it is this prohibitively slow even with a small number of documents? In my case it was roughly 10,000 documents.

How often is the validation evaluated? Every iteration? Multiple times within an iteration? Maybe it should be evaluated only on printed iterations, since iteration history is not available and the non-printed evaluations go to waste.

Suggestions:

  • I would separate Est. Perplexity and validation perplexity into separate columns in the progress report.
  • I would let the user specify at what interval the validation perplexity is evaluated.
  • I would make it 100% clear in the documentation that this operation is expensive.

Kind regards, Jan


User 2032 | 6/15/2015, 11:00:54 AM

To give this some more weight, here is an example with a large vocabulary:

  • ca. 14,000 documents
  • ca. 1,350,000 words
  • topics: 100

Adding a validation set of just 200 documents changed the iteration time from 27 seconds to 36 HOURS.

Imagine the scenario where someone tries topic_model with a validation set on their very first run...

  • Bonus:

m.evaluate(train, valid) when run without validation_set returns:

{'perplexity': nan}


User 19 | 6/15/2015, 8:26:32 PM

Hi Jan,

The topic model currently reports 0 perplexity even when it is not computing perplexity. This should be fixed in an upcoming release. I also agree with each of your suggestions: users should be aware that this is expensive and should have the ability to limit how often it occurs. I've added these items as feature requests.

In your case, I would reduce the size of the validation set to just a few hundred documents (at least to begin with). This can help you understand how many iterations are required for convergence. In practice you are likely interested in some other downstream task, so I would experiment to find a validation set size where perplexity is (ideally) correlated with your desired outcome.
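
As a rough sketch of that setup, assuming `docs` is your bag-of-words SFrame (the sizes, seed, and variable names are illustrative):

<pre class="CodeBlock"><code># Keep only a few hundred held-out documents so the per-iteration
# perplexity estimate stays affordable while still tracking convergence.
train, valid = docs.random_split(0.98, seed=1)
small_valid = valid.head(300)

m = gl.topic_model.create(train, num_topics=100,
                          num_iterations=50,
                          validation_set=small_valid)</code></pre>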

Chris


User 2032 | 6/16/2015, 9:09:47 AM

Hi Chris,

I'm glad you guys are onto this problem.

Your suggestion to reduce the validation_set does not apply in my case: my validation set was already only 200 documents for a 14,000-document model, and the associated cost was a slowdown of three orders of magnitude. From my perspective it is cheaper to run the algorithm multiple times with different numbers of iterations and call evaluate() after each run.

How often is the validation performed, anyway? To me it seems to happen more than once per iteration. Or maybe there is a problem with the lazy evaluation of the validation_set SFrame? (This is something suggested by Rajat.)

Jan


User 19 | 6/16/2015, 5:47:53 PM

Interesting. We will look into this. The evaluation on the validation set should happen only once per iteration. There shouldn't be an issue with lazy evaluation, but it doesn't hurt to make sure the SFrame is fully materialized ahead of time, e.g. by calling something like sf.tail(). (I'd be interested to hear whether this helps, by the way. :smile: )
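
A tiny sketch of that, assuming `valid` is your validation SFrame; the explicit materialize() call may not be available in every release:

<pre class="CodeBlock"><code># Force evaluation of the lazily-built validation SFrame before training,
# so deferred computation is not mistaken for per-iteration validation cost.
valid.tail()            # forces the lazy pipeline to evaluate, as suggested above
# valid.materialize()   # explicit alternative, if your version provides it</code></pre>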