Topic model predict: how to return top k most likely topics?

User 1221 | 1/24/2015, 12:12:37 AM

After creating a topic model, I want to assign topics to documents. The function <code class="CodeInline">graphlab.topic_model.TopicModel.predict</code> returns either the most likely topic or a vector of probabilities over all topics, but I want the top k topics for each document. I tried using <code class="CodeInline">SFrame.topk</code> on the output of TopicModel.predict along with topic_id, but it seems too slow. Is there an alternative that avoids a full sort on the SFrame?


User 19 | 1/24/2015, 1:43:19 AM

At the moment, the strategy you suggest is the right one.

We would like to also expose m.predict_topk, similar to what is available for multilabel classification. I will add this to our feature request list.

If you have any more questions or suggestions, please let us know!

User 1221 | 1/24/2015, 7:06:00 AM

Thanks. I found that SFrame.append gets increasingly slow, since the data has to be copied on every loop iteration. I then switched to a plain list and a general-purpose sort to speed things up, and it works well so far. I'm just curious why SFrame.append needs to copy every time in a loop, and why SArray does too.
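For readers following along, the list-based workaround described above might look roughly like the sketch below. The `predictions` list is a hypothetical stand-in for the output of `m.predict` with probability output, and the column names are illustrative assumptions, not part of the GraphLab API.

```python
# Hypothetical stand-in for per-document topic probability vectors.
predictions = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
]

k = 2

# Plain Python lists: appending to them does not copy the existing data.
columns = {'doc_id': [], 'topic_id': [], 'probability': []}

for doc_id, probs in enumerate(predictions):
    # General-purpose sort of (topic_id, probability) pairs, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    for topic_id, prob in ranked[:k]:
        columns['doc_id'].append(doc_id)
        columns['topic_id'].append(topic_id)
        columns['probability'].append(prob)
```

The finished `columns` dictionary could then be handed to the SFrame constructor once, instead of calling SFrame.append inside the loop.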

User 91 | 1/24/2015, 7:14:04 PM

The sort and append can be very expensive operations if you have a lot of documents. What we really need here is a group-by top-k, which can be achieved as follows (without a full sort). It uses the stack and unstack operations for list types.

<pre class="CodeBlock"><code>import graphlab as gl

# Build a topic model
docs = gl.SArray('')
m = gl.topic_model.create(docs)

# Make all predictions
predictions = m.predict(docs, 'probability')

# Stack these predictions and add a row number.
sf = gl.SFrame({'predictions': predictions}).add_row_number()

def top2_dict(lst):
    """
    Return the top 2 elements of a list along with their indices,
    converted into a dictionary. For example:

        input  : [10, 5, 8, 20]
        output : {3: 20, 0: 10}
    """
    sorted_lst = sorted(enumerate(lst), key=lambda x: x[1], reverse=True)
    # Return the top 2
    return dict(sorted_lst[:2])

# Top-2 predictions
sf['predictions'] = sf['predictions'].apply(top2_dict)

# Stack the output into a readable format
sf = sf.stack('predictions', ['topic', 'probability'])</code></pre>
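Since the original question asked about avoiding a full sort, one possible variation on `top2_dict` is a helper that takes k as a parameter and uses the standard library's `heapq.nlargest`, which only tracks k candidates at a time. The name `topk_dict` is hypothetical and this is only a sketch; it has not been run against GraphLab here.

```python
import heapq


def topk_dict(probs, k=2):
    """Return {topic_index: probability} for the k largest entries of probs."""
    # nlargest avoids sorting the whole vector: roughly O(n log k)
    # instead of O(n log n) for sorted().
    top = heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
    return dict(top)


example = topk_dict([10, 5, 8, 20], k=2)
```

In the snippet above it would presumably take the place of `top2_dict` in the apply call, for instance wrapped as `lambda p: topk_dict(p, k=3)`. For small topic counts the difference from `sorted` is likely negligible; the main gain is that k becomes configurable.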