Reproducibility of models

User 3247 | 2/25/2016, 10:39:28 PM

I am using a random seed to reproduce results from a GraphLab Create model.

Here is my function

def myTrainingFunc(train_data, test_data, inp_features, vbs=False):
    myModelBDT = gl.boosted_trees_classifier.create(
        train_data,
        target='result',
        features=inp_features,
        random_seed=1,
        max_iterations=20,
        max_depth=10,
        verbose=vbs)
    cfm = gl.evaluation.confusion_matrix(test_data['result'],
                                         myModelBDT.predict(test_data))
    return cfm.sort(['target_label', 'predicted_label'])

When I call it TWICE on the same (train_data, test_data), I get a different confusion matrix each time. I am probably missing something here. Why is my result not reproduced on the second call of the same function?

Any input is appreciated.

Thanks.

Comments

User 3247 | 2/26/2016, 12:42:22 AM

Pictures may be better.

Here is the function.

First Run

myTrainingFunc(train_data, test_data, ['loc', 'counts'], vbs=False)

Second Run.

myTrainingFunc(train_data, test_data, ['loc', 'counts'], vbs=False)

I think I am using the same seed, since random_seed = 1 is hard-coded in the function. Should I get different outputs? If not, then any clue?

Thanks.


User 19 | 2/26/2016, 4:50:38 PM

Hi vikasapr,

By default, many GraphLab Create models perform an internal train/test split so that you can monitor performance during training. The random_seed argument controls the training procedure, but it does not control the randomization of that internal split (though it should). If you set validation_set=test_data, you will get reproducible results, and you will also be able to monitor the accuracy metrics on your test data as training progresses.
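For example, a minimal sketch of the function from the original post with validation_set passed explicitly (same names and parameters as above; only the validation_set argument is new):

import graphlab as gl

def myTrainingFunc(train_data, test_data, inp_features, vbs=False):
    # Supplying the validation set explicitly avoids the randomized
    # internal train/test split, so repeated calls with the same
    # random_seed produce the same model and confusion matrix.
    myModelBDT = gl.boosted_trees_classifier.create(
        train_data,
        target='result',
        features=inp_features,
        validation_set=test_data,
        random_seed=1,
        max_iterations=20,
        max_depth=10,
        verbose=vbs)
    cfm = gl.evaluation.confusion_matrix(test_data['result'],
                                         myModelBDT.predict(test_data))
    return cfm.sort(['target_label', 'predicted_label'])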

Thanks for reporting the issue! Chris


User 3247 | 2/27/2016, 12:51:35 AM

Thanks Chris. I understand.