Improving LDA topic model performance

User 1933 | 11/17/2015, 5:21:32 PM

Hey gang - I'm running some topic models in GraphLab on Google Cloud, using a node with 32 cores and 120GB of RAM. What tips can you offer for maximizing performance?

So far I've set:

import graphlab as gl

# Use all 32 cores for lambda workers
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 32)
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS', 32)
# Raise the file I/O cache limits to ~100GB
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 100000000000)
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 100000000000)

But I still only see ~15GB of RAM usage while the model is running. Anything else I can do to speed the models up? My corpus has ~150k documents, vocab size of ~112k, and ~4 billion tokens.

Of course, the biggest lever is the number of model iterations, but I don't know what the best practice there is. I've run one test model, and I don't see big changes in perplexity on a hold-out set as I vary the number of iterations (I've tried 10, 20, 30, 40, and 50). Generally speaking, am I safe to use the default of 10 iterations?

Comments

User 19 | 11/17/2015, 5:49:25 PM

Hi jlorince,

The amount of RAM is governed by the vocabulary size and the number of topics you are using, so it's reasonable that only 15GB are being used if you don't change those numbers. The runtime configs you've modified look reasonable.

Since the operation scales with the number of tokens, you can speed things up by cleaning and pruning your original documents so that the vocabulary is smaller, e.g., removing stop words.
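For example, a minimal pruning pass might look like this (a sketch: docs stands in for your bag-of-words SArray, and the rare-term threshold is illustrative):

import graphlab as gl

# Remove common English stop words from each bag-of-words document
stop_words = gl.text_analytics.stopwords(lang='en')
docs = docs.dict_trim_by_keys(stop_words, exclude=True)

# Drop very rare terms, which inflate the vocabulary without adding much
docs = docs.dict_trim_by_values(lower=3)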

I'm a bit surprised test perplexity isn't changing with the number of iterations. If that's the case, depending on your final goal, you might be able to get away with fewer iterations or fewer topics.

I'd also recommend trying GraphLab Create 1.7 when it comes out; you should see some speed improvements during training. We'd be interested to hear how it works for your use case.

Cheers, Chris


User 1933 | 11/17/2015, 5:54:59 PM

Thanks for the input, but can you clarify a bit on your point regarding perplexity? My question remains: what is (or how do I determine) a good number of iterations to run? I presume the default of 10 iterations that's implemented in the model must have come from somewhere...

And can you say a few words on the costs/benefits of using collapsed Gibbs sampling vs. aliasLDA as the sampling method?

My basic problem here is that I'm trying to do a whole bunch of model runs to explore the parameter space for alpha, beta, and num_topics, so I'm trying to maximize performance while still getting good results however I can.
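For reference, my sweep is essentially a nested loop like this (a rough sketch; the grids are placeholders, and train_docs/test_docs are assumed to be already split):

import itertools
import graphlab as gl

# Illustrative parameter grids
alphas = [0.1, 0.5, 1.0]
betas = [0.01, 0.1]
topic_counts = [50, 100, 200]

results = []
for alpha, beta, k in itertools.product(alphas, betas, topic_counts):
    model = gl.topic_model.create(train_docs, num_topics=k,
                                  alpha=alpha, beta=beta,
                                  num_iterations=10)
    # Record heldout perplexity for this parameter combination
    perp = model.evaluate(train_docs, test_docs)['perplexity']
    results.append((alpha, beta, k, perp))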

Thanks!


User 1933 | 11/17/2015, 6:11:37 PM

(oh, and regarding perplexity, I've actually noticed in some cases that perplexity goes up as I increase the number of iterations... what's up with that?)


User 19 | 11/17/2015, 6:37:30 PM

Collapsed Gibbs sampling is an MCMC method, and you typically run such algorithms until you conclude that they have converged. In the case of topic models, this usually amounts to monitoring the heldout perplexity. Unfortunately, this is very dataset dependent, so you will have to experimentally find a good number of iterations that tends to produce useful models for your problem.
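As a concrete way to monitor this, you can hold out a fraction of the word counts in each document and track heldout perplexity as you increase the iteration count (a sketch; the split fraction and iteration grid are illustrative):

import graphlab as gl

# Hold out ~20% of each document's word counts for evaluation
train_docs, test_docs = gl.text_analytics.random_split(docs, 0.8)

for n_iter in [10, 20, 40, 80]:
    model = gl.topic_model.create(train_docs, num_topics=100,
                                  num_iterations=n_iter)
    perp = model.evaluate(train_docs, test_docs)['perplexity']
    # Look for the point where perplexity stops improving
    print(n_iter, perp)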

The aliasLDA variant can have runtime benefits when you're using a large number of topics, e.g. greater than 100.
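If you want to try it, it's selected via the method argument to create (sketch):

# method='alias' selects aliasLDA; method='cgs' is collapsed Gibbs sampling
model = gl.topic_model.create(docs, num_topics=200, method='alias')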

Perplexity going up as you increase the number of iterations? That's somewhat surprising, but it can happen: collapsed Gibbs sampling is stochastic, so heldout perplexity can fluctuate between runs, and it can also creep upward if the model starts overfitting the training documents.