Single row (no SFrame) version of TopicModel

User 2032 | 8/18/2015, 6:19:10 PM

Hi guys,

Some of you may be familiar with my converter for RandomForest: it works on plain dicts that resemble rows from an SFrame, does not require SFrame initialization, and is therefore much faster (roughly 200x) when there are a lot of features.

Now I need to do the same thing for TopicModel. Could you please give me directions on how to get the model variables out of a TopicModel, and which evaluation function I should use (pseudo code welcome) to replicate predict( ... , output_type='probability')?
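For concreteness, this is the kind of interface I am after (a rough sketch only; predict_single_row and model_variables are hypothetical names, not an existing API):

# Hypothetical target interface -- names are illustrative only.
# A "row" is a plain bag-of-words dict, like one entry of an SFrame 'bow' column:
row = {'economy': 3, 'election': 1, 'markets': 2}

# Goal: reproduce model.predict(..., output_type='probability')
# from raw model variables, without constructing an SFrame:
probs = predict_single_row(model_variables, row)  # list of topic probabilities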

Quite urgent:(

Kind regards, Jan

Comments

User 19 | 8/19/2015, 1:02:44 AM

Hi Johnny,

I wrote up an initial Python-only implementation here: https://github.com/dato-code/how-to/blob/master/predicttopicmodel.py

If your goal is speed, I suggest using this as a starting point; feel free to cut corners and check whether you sacrifice any prediction quality, e.g., with num_burnin=1.

One note: some topic model implementations use a single latent topic assignment (z_ij) for each document i and token j. Here I have j index only over unique words. This approximation speeds up prediction, but can sometimes slow down convergence.
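Roughly, the per-document sampling loop looks like the following simplified sketch (this is not the code in the script above; word_topic, alpha, and the function name are illustrative, with word_topic standing in for the trained model's topic-word counts):

import numpy as np

def predict_topic_probabilities(doc, word_topic, alpha, num_burnin=5):
    # doc: dict of word index -> count (unique words only)
    # word_topic: (vocab_size, num_topics) array of topic-word counts
    # alpha: symmetric Dirichlet prior on the document-topic distribution
    num_topics = word_topic.shape[1]
    words = list(doc.keys())
    counts = np.array([doc[w] for w in words], dtype=float)

    # phi[w, t] = p(word w | topic t), normalized per topic
    phi = word_topic / word_topic.sum(axis=0)

    # One latent assignment z_j per *unique* word (the approximation above)
    z = np.random.randint(num_topics, size=len(words))
    topic_counts = np.zeros(num_topics)
    for j in xrange(len(words)):
        topic_counts[z[j]] += counts[j]

    for _ in xrange(num_burnin):
        for j, w in enumerate(words):
            topic_counts[z[j]] -= counts[j]      # take word j out
            p = (topic_counts + alpha) * phi[w]  # CGS conditional for z_j
            z[j] = np.random.choice(num_topics, p=p / p.sum())
            topic_counts[z[j]] += counts[j]      # put it back with its new topic

    # Posterior mean of the document-topic distribution
    return (topic_counts + alpha) / (topic_counts.sum() + num_topics * alpha)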

Hope this helps, and please let me know if you have any questions. Chris


User 2032 | 9/7/2015, 11:28:04 AM

A delayed "Thank you"!

It seems to work, and with a low burn-in it is close to acceptable for real-time classification of smaller documents (though I would love to be able to do burnin=10 in under 5 ms). However, I have found CGS to be quite unstable (even the main topic can change from run to run at high (>10) burn-ins), which makes it hard to use the topic probabilities for further processing. Would you expect the alias method to be more stable in this regard? It would also be useful to have some sort of metric for prediction stability, since this is a probabilistic algorithm; can you suggest something from the literature?

My intuition is to run the algorithm x times per sample and then apply some function to the probability of the mode at each rank of the argmax over topic probabilities.

Here is some example code I came up with:

import graphlab as gl
import numpy as np
from functools import partial

def doc_stability(k, trials):
    # trials: list of topic-probability vectors from repeated predict() calls
    # on the same document. Returns per-rank stability of the top-k topics.
    n = float(len(trials))
    num_topics = len(trials[0])
    # topic_counts[rank][topic]: how many trials put `topic` at `rank`
    topic_counts = np.zeros((k, num_topics))
    for t in trials:
        for rank, topic in enumerate((-np.array(t)).argsort()[:k]):
            topic_counts[rank][topic] += 1
    # Fraction of trials that agree on the modal topic at each rank
    probability_of_modes = topic_counts.max(1) / n
    cum_stabilities = {
        ki - 1: probability_of_modes[0:ki].sum() / ki
        for ki in xrange(1, k + 1)
    }
    stabilities = dict(enumerate(probability_of_modes))

    return {'stability': stabilities, 'cumulative_stability': cum_stabilities}

def stability(model, documents, number_of_trials_per_document=100, max_k=5):
    tf = gl.SFrame()
    tf['bow'] = documents
    # Predict every document number_of_trials_per_document times
    for i in xrange(number_of_trials_per_document):
        tf[str(i)] = model.predict(tf['bow'], output_type='probability')

    tf = tf.pack_columns(map(str, xrange(number_of_trials_per_document)),
                         new_column_name='trials')
    tf['trials'] = tf['trials'].apply(partial(doc_stability, max_k))
    tf = tf.unpack('trials', '')
    tf = tf.unpack('stability', 'stability@')
    tf = tf.unpack('cumulative_stability', 'cumulative_stability@')
    # Average each per-rank stability column over all documents
    return {c: gl.Sketch(tf[c]).mean()
            for c in tf.column_names() if 'stability' in c}
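For reference, this is how I call it (the two toy documents below are made-up illustrations; documents is an SArray of bag-of-words dicts, the same format predict() expects):

docs = gl.SArray([
    {'economy': 3, 'election': 1},
    {'football': 2, 'league': 4, 'goal': 1},
])
model = gl.topic_model.create(docs, num_topics=20)
print stability(model, docs, number_of_trials_per_document=100, max_k=5)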

Unfortunately, the results of this evaluation worry me a lot. For a very simple topic model trained on the article titles and Facebook descriptions of a major news website, I get:

{'cumulative_stability@.0': 0.39036219676549866,
 'cumulative_stability@.1': 0.26541133198562444,
 'cumulative_stability@.2': 0.21079570979335122,
 'cumulative_stability@.3': 0.17972807165318958,
 'cumulative_stability@.4': 0.1601310646900269,
 'stability@.0': 0.39036219676549866,
 'stability@.1': 0.14046046720575026,
 'stability@.2': 0.10156446540880504,
 'stability@.3': 0.08652515723270443,
 'stability@.4': 0.08174303683737646}

This means there is only a 40% chance that the predicted topic will be the most frequently predicted topic over 100 trial runs. This is at burnin=10.

Any advice on how to: a) increase stability b) measure it differently c) increase stability @.1 @.2 and so on? - I would like my cgs to return at least 3 dominating topics in each document.