How to calculate predictions from exported trees?

User 2448 | 12/22/2015, 3:47:25 PM

I'm struggling to understand the relationship between the trees shown for random forest regression and boosted trees regression, and the score returned by .predict().

I've made this ultra-simple example with two features, and one tree, to demonstrate:

import pandas as pd
import graphlab as gl
df = pd.DataFrame({'one': [1,3,2,4,3,5,4,6,5,7,6,8,7,9,10], 
                   'two': ['a', 'a', 'a', 'b', 'a', 'a', 'b', 
                           'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b'],
                   'target': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]})
sf = gl.SFrame(df)
model = gl.random_forest_regression.create(sf, 
                                           target='target', 
                                           features=['one', 'two'],
                                           num_trees = 1)

model.predict({'one': 1, 'two': 'a'})
# returns 2.75

model.show(view="Tree", tree_id=0)

Why does model.predict() return 2.75 when the tree does not contain a node with this value?

Comments

User 2448 | 12/22/2015, 4:07:58 PM

Update: it appears the only two possible outputs of model.predict() are 2.75 and 8.5625, both of which are 0.5 higher than the values of the tree leaves. Where does this 0.5 come from?
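
The arithmetic behind this observation can be checked directly. The sketch below uses only the two prediction values quoted above. Treating the 0.5 as a constant offset added to every leaf value (similar in spirit to the base_score parameter in XGBoost) is an assumption, not something confirmed in this thread.

```python
# Predictions reported by model.predict() in this thread
predictions = [2.75, 8.5625]

# Observed gap between predictions and leaf values.
# Assumption: this is a constant offset applied to every prediction.
offset = 0.5

# Leaf values implied by the observation above
leaf_values = [p - offset for p in predictions]
print(leaf_values)  # [2.25, 8.0625]
```

If the offset really is constant, every prediction from a single tree would equal the value of the leaf that the row lands in plus 0.5.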


User 2156 | 12/22/2015, 8:37:51 PM

Hi,

Each leaf node is a regressor. For a given test tuple, such as the one in model.predict({'one': 1, 'two': 'a'}), the score comes from the leaf that the tuple falls into, and that leaf's regressor is trained on the training tuples that reached the same branch.

The score on a leaf node is computed from all the training tuples within that branch (the average of the target, I believe), so it can differ from any individual target value unless the branch contains only one training tuple.
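
As a rough illustration of the averaging idea, here is a minimal sketch in plain Python using the training data from the original post. The helper leaf_means and the split threshold 4.5 are hypothetical and were not read from the exported tree. A random forest may also train each tree on a sample of the rows, so these numbers are not expected to match the leaf values shown by model.show().

```python
# Training data from the original post (feature 'one' and the target)
one = [1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9, 10]
target = list(range(15))


def leaf_means(feature, target, threshold):
    """Mean target on each side of a single split `feature < threshold`."""
    left = [t for x, t in zip(feature, target) if x < threshold]
    right = [t for x, t in zip(feature, target) if x >= threshold]
    return sum(left) / len(left), sum(right) / len(right)


# Hypothetical threshold, chosen only for illustration
left_mean, right_mean = leaf_means(one, target, 4.5)
print(left_mean, right_mean)
```

Each mean is taken over several training rows, which is why a leaf score generally does not coincide with the target of any single row.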