Is there a simpler way to get probability predictions for multi-class classification problems?

User 2568 | 1/6/2016, 4:59:52 AM

Background

I'm working on the Telstra Network Disruptions competition at Kaggle: https://www.kaggle.com/c/telstra-recruiting-network/data

This is a multi-class classification problem. Submissions take the form of a single row per event, with a probability column for each class, i.e.,

    id     predict_0         predict_1       predict_2
    15376  0.00287862461605  0.13514404829   0.861977327094
    2427   0.00528076891796  0.335160223404  0.659559007678
    7899   0.00112992846288  0.331249324329  0.667620747208

Let's call this the "wide" format.

Even though I'm a noob, Dato made working on this problem quick and easy, with the exception of getting the predictions in the "wide" format.

I started by using gl.classifier.create to create the model. I then expected it to be a simple matter of using gl.classifier.predict with output_type='probability' to get the table of probabilities in the "wide" format.

However, output_type='probability' is not supported for multi-class models, so I had to use predict_topk with k=3 instead. This yields a "tall" table like this:

    row_id  class      probability
    0       predict_0  0.506812495659
    0       predict_1  0.327388975331
    0       predict_2  0.165798529009
    1       predict_0  0.448240375335
    1       predict_1  0.348303754599
    1       predict_2  0.203455870065
    2       predict_0  0.581301979379
    2       predict_2  0.225835109983
    2       predict_1  0.192862910638
    3       predict_0  0.633464550761

Questions

I have two questions:

1. Is there a simple way to transform from the "tall" format to the "wide" format? My code is below and it seems overly complex.
2. Would you consider extending gl.classifier.predict to work this way for multi-class problems? It seems a reasonable way to show the data and the natural extension of the output for the binary case.

My transformation

    # Add row numbers so we can match the test data to the output of predict_topk.
    d = test_data.add_row_number(column_name='row_id')

    # Create the predictions in "tall" format: one row per (row, class) pair.
    p = model.predict_topk(d, k=3)
    p.rename({'id': 'row_id'})  # to match the row_id column added to the test data

    # Convert from the "tall" format to the "wide" format.
    # This seems like a hack, but I could not find a simpler way.
    p = p.unstack(column=['class', 'probability'])
    p = p.unpack('Dict of class_probability', column_name_prefix='')

    # Join the test data and the predictions on 'row_id' to get the 'id' into the predictions.
    p = d.join(p, on='row_id')
    p.remove_column('row_id')
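
For readers without GraphLab Create at hand, the tall-to-wide reshaping can be sketched in plain Python. This only illustrates the idea behind the unstack and unpack steps; it is not the library's implementation. The function name `tall_to_wide` and the sample rows (rounded from the predict_topk output shown earlier) are made up for the example.

```python
from collections import OrderedDict


def tall_to_wide(rows, classes):
    """Pivot (row_id, class, probability) tuples into one dict per row_id.

    `rows` and `classes` are hypothetical inputs for this sketch.
    """
    grouped = OrderedDict()
    for row_id, cls, prob in rows:
        # Step 1 (like unstack): collect a {class: probability} dict per row_id.
        grouped.setdefault(row_id, {})[cls] = prob

    wide = []
    for row_id, probs in grouped.items():
        # Step 2 (like unpack): spread the dict into one column per class.
        record = {'row_id': row_id}
        for cls in classes:
            record[cls] = probs.get(cls)  # None if a class is missing
        wide.append(record)
    return wide


# Illustrative values only.
tall = [
    (0, 'predict_0', 0.5068), (0, 'predict_1', 0.3274), (0, 'predict_2', 0.1658),
    (1, 'predict_0', 0.4482), (1, 'predict_1', 0.3483), (1, 'predict_2', 0.2035),
]
wide = tall_to_wide(tall, ['predict_0', 'predict_1', 'predict_2'])
```

Each element of `wide` is one row of the "wide" table, keyed by column name.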

Comments

User 940 | 1/6/2016, 6:13:19 PM

Hi @Kevin_McIsaac ,

What type of classifier are you using? Most of our classifiers support a 'probability_vector' output type. So the code would look a bit like this:

p = model.predict(d, output_type='probability_vector')

Does this help? Let me know if you have any more questions.

Cheers! -Piotr


User 2568 | 1/7/2016, 6:22:35 AM

Hi @piotr, thanks for the reply. I'm using a BoostedTreesClassifier and the target has 3 classes: "predict_0", "predict_1", "predict_2".

Yes this looks close! I tried

model.predict(test_data[0:3], output_type='probability_vector')

And got this SArray

[array('d', [0.7197003767620572, 0.2474543673404403, 0.03284525589750258]),
 array('d', [0.4049101500423102, 0.3933889438829593, 0.20170090607473046]), 
 array('d', [0.4041601113664395, 0.4349432249349724, 0.16089666369858802])]

So... to get the format I need, I then tried

test_data['predict'] = p
test_data['id', 'predict']

which returns

+------+-------------------------------+
|  id  |            predict            |
+------+-------------------------------+
| 6597 | [0.719700376762, 0.2474543... |
| 2597 | [0.404910150042, 0.3933889... |
| 5022 | [0.404160111366, 0.4349432... |
+------+-------------------------------+

and so I thought I could then get in the format I need with

test_data['id', 'p'].unpack['p']

but I got this error

TypeError: 'instancemethod' object has no attribute '__getitem__'

So I'm not sure how to unpack this into the form I need to write out a CSV file.


User 940 | 1/7/2016, 6:59:27 PM

Hi @Kevin_McIsaac ,

Ah, I think you're close! The trick is that unpack is a method, so you call it with parentheses and pass 'p' as an argument; it is not an object that you index with square brackets.

So to unpack, this should work:

test_data['id', 'p'].unpack('p')

instead of

test_data['id', 'p'].unpack['p']
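
The difference is ordinary Python behaviour rather than anything specific to SFrame, as this small sketch suggests. The `Table` class is a made-up stand-in, not a GraphLab type. On Python 3 the error message is worded differently from the Python 2 message quoted in the question, but it is still a TypeError.

```python
class Table(object):
    """Hypothetical stand-in for an object with an unpack method."""

    def unpack(self, column):
        return 'unpacked ' + column


t = Table()

# Calling the method with parentheses works.
result = t.unpack('p')

# Subscripting the method with square brackets raises TypeError,
# because a bound method does not support indexing.
try:
    t.unpack['p']
    error = None
except TypeError as exc:
    error = exc
```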

Let me know if this works!

Cheers! -Piotr


User 2568 | 1/7/2016, 9:57:57 PM

Agggh... got it, () not []. Going back through this, my code now looks like this:

p = model.predict(test_data, output_type='probability_vector').unpack()
p.rename({'X.0': 'predict_0', 'X.1': 'predict_1', 'X.2': 'predict_2'})  # only necessary because I need "_" as the separator, not "."
test_data.add_columns(p)
test_data['id', 'predict_0', 'predict_1', 'predict_2']
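
If the goal is just the submission file, the same reshaping can also be sketched with the standard library alone. This assumes the probability vectors follow the order predict_0, predict_1, predict_2, which is worth checking against the model's classes. `write_submission` is a hypothetical helper, and the ids and rounded values are taken from the earlier output purely for illustration.

```python
import csv
import io


def write_submission(ids, vectors, class_names, out):
    """Write one CSV row per id with one probability column per class.

    Assumes each vector is ordered like `class_names` (an assumption of
    this sketch, not something the thread confirms).
    """
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['id'] + list(class_names))
    for row_id, vector in zip(ids, vectors):
        writer.writerow([row_id] + list(vector))


ids = [6597, 2597, 5022]
vectors = [
    [0.7197, 0.2475, 0.0328],
    [0.4049, 0.3934, 0.2017],
    [0.4042, 0.4349, 0.1609],
]
buffer = io.StringIO()  # stands in for an open file
write_submission(ids, vectors, ['predict_0', 'predict_1', 'predict_2'], buffer)
csv_text = buffer.getvalue()
```

Passing an open file object instead of the `io.StringIO` buffer would write the same rows to disk.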

Thanks for your help and encouragement.