boosted_trees_regression: how to implement runtime prediction in other languages (e.g. C#)

User 2955 | 1/5/2016, 7:11:56 AM

Hi,

I have a client I'm building an app for. I created a linear_regression model using the Dato tools via an IPython Notebook. Implementing runtime prediction was pretty straightforward: I exported the coefficients via model['coefficients'].
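(For anyone reading later: runtime prediction from exported linear coefficients is just the intercept plus a dot product. A minimal sketch, assuming the export has been reduced to feature-name/weight pairs; the names here are illustrative, not the Dato API:)

```python
def linear_predict(intercept, weights, features):
    # weights: {feature_name: weight}, features: {feature_name: value}
    return intercept + sum(w * features[name] for name, w in weights.items())

# e.g. linear_predict(1.5, {'sqft': 0.8, 'rooms': 2.1}, {'sqft': 1200, 'rooms': 3})
```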

boosted_trees_regression is performing much better than linear_regression, so I'm exploring implementing that, but I don't fully understand how to export the model and predict at run time in other languages (i.e. away from Dato or a real-time service).

How would I go about doing this?

Comments

User 940 | 1/5/2016, 9:02:06 PM

Hi @"Joe Booth" ,

There are two ways to predict at run time in other languages. One is to use our product, Predictive Services, where you can access your trained models via a REST API. We also have clients for several languages.

Alternatively, the trees are encoded as JSON within the model; you can access them via model['trees_json'].
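For example, a minimal sketch of pulling the trees out for use outside Dato, assuming trees_json comes back as a list of per-tree JSON strings (check your model's actual output):

```python
import json

trees_json = model['trees_json']             # assumed: one JSON string per tree
trees = [json.loads(t) for t in trees_json]  # nested dicts/lists, easy to map to C# objects
```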

I hope this helps! Cheers! -Piotr


User 2955 | 1/6/2016, 12:49:35 AM

@piotr thank you... I need this to run in real time within the app, without a service call; performance is key.

Where do I find the algorithm / example code for how to predict at run time?

Also, where can I find comparative benchmarks of prediction time across different algorithms / ML approaches? I'm trying to compare them and struggling to find articles on this (I'm just looking for directional information; obviously there are many factors that will influence it).

Many thanks


User 940 | 1/6/2016, 7:56:56 PM

@"Joe Booth" ,

I'll put together some sample code for you and post it by tomorrow.

As for your second question, I'll try to answer it in very general terms.

Non-parametric models like nearest neighbors can be very slow at predict time: they compute pairwise distances between the query and most of the points in the training set, and this can get very expensive. Most parametric models, by contrast, are simply matrix multiplications, so the cost depends on how big and how many the multiplications are. Linear models, like logistic regression, are very fast. OTOH, neural networks depend tremendously on depth and size.
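To make that cost difference concrete, here is an illustrative sketch (plain Python, not the Dato API) of why nearest-neighbor prediction scales so differently from a parametric model:

```python
def nearest_neighbor_predict(train_X, train_y, x):
    # O(n * d) per prediction: one distance per training point,
    # versus O(d) for a linear model's single dot product.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]
```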

I hope this helps!

Cheers! -Piotr


User 2955 | 1/7/2016, 6:00:37 AM

Great, thank you on both points! Look forward to the code!!


User 940 | 1/7/2016, 7:38:59 PM

Hi @"Joe Booth" ,

Here's a pseudo-code sample in Python. You could use any other language once you've pulled out the JSON-encoded trees.

```python
import graphlab
from math import exp

def predict_single_tree(tree, input):
    # Get the margin for a single tree:
    # traverse the tree to a leaf node and return the leaf value.
    pass

# First, get the trees as JSON.
model = graphlab.random_forest_classifier.create(training,
                                                 features=['feature list'],
                                                 target='valid')
trees_json = model.get('trees_json')

# trees_json returns the trees as a list with K trees per iteration,
# where K is the number of classes in the classifier. Interpret the
# JSON (turn it into tree-like structures). So with 3 iterations and
# 3 classes, you will have 9 trees, organized as follows:
#
#   0 1 2  <- iteration 1: class 0, class 1, class 2
#   3 4 5  <- iteration 2: class 0, class 1, class 2
#   6 7 8  <- iteration 3: class 0, class 1, class 2

def multiclass_predict(trees, classes, input):
    '''
    trees is the full list of trees (the 9 trees in the example above)
    classes is the list of classes: [0, 1, 2] in the example above
    input is the input to predict

    output is a list of probabilities by class; it should sum to 1
    '''
    k = len(classes)
    margin = [0.0] * k  # could be a numpy array instead
    for c in classes:
        # visits trees 0, 3, 6 for class 0; trees 1, 4, 7 for class 1; etc.
        for i in range(c, len(trees), k):
            margin[c] += predict_single_tree(trees[i], input)

    # normalize with softmax
    soft_max = [exp(m) for m in margin]  # exp(margin) for each margin
    soft_max_sum = sum(soft_max)
    prob = [s / soft_max_sum for s in soft_max]

    return prob
```

Let me know if this is what you were looking for.
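(predict_single_tree is left as a stub above. A minimal traversal sketch follows; the node field names, 'split_feature', 'threshold', 'left', 'right', 'leaf_value', are hypothetical, so inspect your actual trees_json output and map its schema onto this shape:)

```python
def predict_single_tree(tree, input):
    # Walk from the root to a leaf, branching on each split,
    # and return the leaf's margin contribution.
    node = tree
    while 'leaf_value' not in node:          # internal node: keep descending
        if input[node['split_feature']] < node['threshold']:
            node = node['left']
        else:
            node = node['right']
    return node['leaf_value']
```

The same loop ports directly to C#: a while loop over node objects deserialized from the JSON.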

Cheers! -Piotr


User 2955 | 1/16/2016, 4:51:57 AM

@piotr - this is perfect, thank you!!!