boosted trees parameter search doesn't like validation set

User 1201 | 1/14/2015, 10:57:14 PM

I'm using this code to run a parameter search.

try:
    env = gl.deploy.environment.Local('local env')
except:
    pass   # ignore the error if the environment already exists

train, test = data.random_split(.9)

train_file_path = '/Users/.../train_file.sf'   # full path specification
test_file_path  = '/Users/.../test_file.sf'    # full path specification
save_file_path  = '/Users/.../save_file.sf'    # full path specification
test.save(test_file_path)
train.save(train_file_path)

# features must be defined before static_params references it
features = ['unigram_binary', 'bigram_binary', 'user_type']
static_params = {'target': 'leads', 'features': features, 'max_iterations': 10}
search_params = {
                 'max_depth': [4, 6, 10, 25],
                 'step_size': [.01, .05, .1, .3, .5, 1],
                 'row_subsample': [.8, 1],
                 'column_subsample': [.3, 1.0]
                 }

job = gl.toolkits.model_parameter_search(gl.boosted_trees_classifier.create,
                                         train_file_path,
                                         save_path=save_file_path,
                                         test_set_path=test_file_path,
                                         standard_model_params=static_params,
                                         hyper_params=search_params,
                                         max_num_models=5,
                                         environment=env)

print 'Done!'

result['model_details'][0] shows this:

{'class_weights': {0: 1.0, 1: 1.0},
 'classes': array('d', [0.0, 1.0]),
 'column_subsample': 1.0,
 'features': ['unigram_binary', 'bigram_binary', 'user_type'],
 'max_depth': 4,
 'max_iterations': 10,
 'min_child_weight': 0.1,
 'min_loss_reduction': 0.0,
 'num_classes': 2,
 'num_examples': 1992,
 'num_examples_per_class': {0: 1748, 1: 244},
 'num_features': 3,
 'num_trees': 10,
 'num_unpacked_features': 2777,
 'num_validation_examples': 0,   <<<<<<<<<------------- No validation examples
 'row_subsample': 0.1,
 'step_size': 1.0,
 'target': 'leads',
 'training_accuracy': 0.9954819277108434,
 'training_time': 0.222652,
 'trees_json': ...}

The job shows this, with no sign of validation in the PROGRESS output:

PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 1992
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 2777
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter   Accuracy    Elapsed time
PROGRESS:      0   9.905e-01   0.05s
PROGRESS:      1   9.950e-01   0.07s
PROGRESS:      2   9.970e-01   0.09s
PROGRESS:      3   9.950e-01   0.11s
PROGRESS:      4   9.955e-01   0.12s
PROGRESS:      5   9.940e-01   0.14s
PROGRESS:      6   9.945e-01   0.16s
PROGRESS:      7   9.950e-01   0.18s
PROGRESS:      8   9.960e-01   0.20s
PROGRESS:      9   9.955e-01   0.21s

The Job object seems to show something:

Job: Model-Parameter-Search-1421271929.34

Tasks: ['Model-Parameter-Search-1421271929.34-Pipeline']

Metrics:

{ 'task.Model Train Test 1421271929.34-0-0.lastrun': '2015-01-14 13:45:29.349656',
  'task.Model Train Test 1421271929.34-0-0.metrics': { },
  'task.Model Train Test 1421271929.34-0-0.runtime': 0.5408799648284912,
  'task.Model Train Test 1421271929.34-1-0.lastrun': '2015-01-14 13:45:29.929706',
  'task.Model Train Test 1421271929.34-1-0.metrics': { },
  'task.Model Train Test 1421271929.34-1-0.runtime': 0.5672690868377686,
  'task.Model Train Test 1421271929.34-2-0.lastrun': '2015-01-14 13:45:30.535593',
  'task.Model Train Test 1421271929.34-2-0.metrics': { },
  'task.Model Train Test 1421271929.34-2-0.runtime': 0.6107909679412842,
  'task.Model Train Test 1421271929.34-3-0.lastrun': '2015-01-14 13:4 ...


Comments

User 1190 | 1/15/2015, 6:30:06 PM

Hi,

"validationset" is used for monitoring the loss on a validation dataset during training. For example, <pre><code>m = gl.boostedtreesclassifier.create(dataset=..., validationset=...)</code></pre>, you will see both training loss and validation loss for each iteration as progress. Typically, at the end you will have test_set, and you can evaluate the error on the test set using <pre><code>m.evaluate</code></pre>

When using model_parameter_search, we currently do not support the "validation_set" option. But if you set "test_set_path", the error on the test data will be shown in the "test_metrics" column of the result SFrame. Please see the bottom of the API doc for more details: https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.html#graphlab.toolkits.model_parameter_search
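And a hedged sketch of reading the search output once the job finishes, assuming the result SFrame is the one written to save_path (the 'model_details' column is the one referenced in the question; 'test_metrics' is per the doc above):

<pre><code># load the result SFrame written by the parameter search
result = gl.load_sframe(save_file_path)   # save_file_path from the question's code

# one row per trained model; 'test_metrics' holds the held-out metrics
print result.column_names()
print result[['model_details', 'test_metrics']]</code></pre>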

jay