In random_forest_classifier.create(), is num_iterations wrongly named as num_trees?

User 2724 | 2/19/2016, 8:34:51 AM

I was exploring the random forest classifier and couldn't find num_iterations in the create function, as there is for boosted trees. Then I saw num_trees, so I thought I'd give it a try. After running the modelling with different values, I got the feeling that something was wrong. When I ran with the default num_trees, the model was created in a short span of time. However, as soon as I changed its value, the time taken increased quite dramatically. After a bit of exploration, I felt it might have been missed out, so I thought of clarifying it on this forum.

https://dato.com/products/create/docs/generated/graphlab.randomforestclassifier.create.html?highlight=ensemble

model = gl.random_forest_classifier.create(train, features=['features'], target='label', max_depth=30, num_trees=100)

a) print model
Class                         : RandomForestClassifier

Schema
------
...

Settings
--------
Number of trees               : 2400
Max tree depth                : 30

b) print model.get_current_options()
{'class_weights': None,
 'column_subsample': 0.8,
 'max_depth': 30,
 'min_child_weight': 0.1,
 'min_loss_reduction': 0.0,
 'num_trees': 100,
 'random_seed': None,
 'row_subsample': 0.8,
 'step_size': 1.0}

If you compare Number of trees in (a) with num_trees in (b), they are different. Please correct my understanding if it is wrong.

Vaibhav

Comments

User 12 | 2/19/2016, 7:35:16 PM

Hi Vaibhav, I can't seem to reproduce the issue you're seeing. With GL Create version 1.8.2 on Linux Mint 17, I tried the following:

>>> url = 'http://s3.amazonaws.com/gl-testdata/xgboost/mushroom.csv'
>>> data = graphlab.SFrame.read_csv(url)
>>> train, test = data.random_split(0.8)
>>> model = graphlab.random_forest_classifier.create(train, target='label', num_trees=100)
>>> model.summary()
Class                         : RandomForestClassifier

Schema
------
Number of examples            : 6179
Number of classes             : 2
Number of feature columns     : 22
Number of unpacked features   : 22

Settings
--------
Number of trees               : 100
Max tree depth                : 6
Train accuracy                : 0.9994
Validation accuracy           : 1.0
Training time (sec)           : 0.4851

>>> model.get_current_options()
{'class_weights': None,
 'column_subsample': 0.8,
 'max_depth': 6,
 'min_child_weight': 0.1,
 'min_loss_reduction': 0.0,
 'num_trees': 100,
 'random_seed': None,
 'row_subsample': 0.8,
 'step_size': 1.0}

The num_trees argument matches in both cases. What version of GLC are you running? It could be an old bug that's been fixed already.

Thanks, Brian


User 2724 | 2/22/2016, 3:58:59 AM

Hi Brian,

I am currently using 1.8.1. I will update GLC at my end and will check.

Thanks, Vaibhav


User 2724 | 2/23/2016, 4:52:39 AM

Hi Brian,

I tested with your dataset and it worked fine. The number of trees matched in both cases.

However, when I tried the same on my own data set (which contains images, with features extracted from an ImageNet model), the number of trees did not match the modelling parameters.

1) model = gl.random_forest_classifier.create(training_images, features=['features'], target='label', max_depth=30, num_trees=100)

print model
Class                         : RandomForestClassifier

Schema
------
Number of examples            : 14231
Number of classes             : 23
Number of feature columns     : 1
Number of unpacked features   : 4096

Settings
--------
Number of trees               : 2300
Max tree depth                : 30
Train accuracy                : 0.9937
Validation accuracy           : 0.8869
Training time (sec)           : 8394.2116

2) model = gl.random_forest_classifier.create(training_images, features=['features'], target='label', max_depth=30, num_trees=500)

print model
Class                         : RandomForestClassifier

Schema
------
Number of examples            : 14171
Number of classes             : 23
Number of feature columns     : 1
Number of unpacked features   : 4096

Settings
--------
Number of trees               : 11500
Max tree depth                : 30
Train accuracy                : 0.9944
Validation accuracy           : 0.8948
Training time (sec)           : 36241.1966


User 2593 | 2/23/2016, 6:20:09 PM

@goelsvaibhav,

The number of trees in a model will end up being num_trees x number of classes. So the final number of trees in both 1) and 2) seems to be correct (100 x 23 = 2300 in 1, and 500 x 23 = 11500 in 2).
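
For illustration, here is a quick sanity check of that arithmetic (a minimal sketch using the class and tree counts from the two runs above; the variable names are just for the example and not part of the GLC API):

num_classes = 23
for num_trees in (100, 500):
    # For multiclass problems, one tree is built per class per iteration,
    # so the summary reports num_trees * num_classes.
    total_trees = num_trees * num_classes
    print 'num_trees = %d -> Number of trees = %d' % (num_trees, total_trees)

# num_trees = 100 -> Number of trees = 2300
# num_trees = 500 -> Number of trees = 11500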

Let me know if this clarifies the problem.

Thanks!

Charlie


User 2724 | 2/24/2016, 4:18:17 AM

@cloofa

Thanks for clarifying the calculation behind Number of trees. If Number of trees = num_trees x number of classes, then in the example shared by Brian, Number of trees should have been 200 (100 x 2) instead of 100. Also, both Number of trees in model.summary() and num_trees in model.get_current_options() match for Brian's data set.

However, for my data set, Number of trees is 2300 or 11500 in model.summary() for the respective models, while in model.get_current_options(), num_trees is 100 or 500 respectively.

Thanks, Vaibhav


User 2593 | 2/24/2016, 4:35:55 PM

Hi @goelsvaibhav,

When there are only 2 classes, you need just 1 tree (per iteration) to give you a class prediction. When you have more than 2 classes, a separate tree is built for each class vs. all other classes, so you end up with num_classes * num_trees trees in total.
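
To make that concrete, here is a minimal sketch of the expected totals for both data sets in this thread (the helper function is purely illustrative and not part of the GLC API):

def expected_total_trees(num_classes, num_trees):
    # Binary classification needs only one tree per iteration;
    # with more than 2 classes, one one-vs-rest tree is built per class per iteration.
    trees_per_iteration = 1 if num_classes == 2 else num_classes
    return trees_per_iteration * num_trees

print expected_total_trees(2, 100)    # Brian's mushroom data:      100
print expected_total_trees(23, 100)   # Vaibhav's image data (1):  2300
print expected_total_trees(23, 500)   # Vaibhav's image data (2): 11500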

Let me know if this clarifies the issue.

Charlie