Parameter search job.get_status() crashes the notebook kernel and the search is effectively lost

User 2568 | 3/16/2016, 10:23:15 PM

This notebook explains the problem and shows the steps to reproduce the bug. All the necessary files are in this repo.

This is a very painful bug as my parameter searches run for about 45 on a large EC2 server.

This started as this discussion, before I realised what the problem really was.


User 2568 | 3/17/2016, 5:08:23 AM

I ran a few more jobs and noticed that some of them failed before completing. In the "metric" file of one of the jobs I found the following. I suspect it's related to this problem I reported a couple of days ago.

It looks like GraphLab does not release all of its memory between job executions and eventually runs out of memory.

{"status": "Failed", "exception": "MemoryError", "exception_traceback": "Traceback (most recent call last):\n  File \"/home/ec2-user/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/deploy/\", line 241, in _run_task\n    result = code(**inputs)\n  File \"/home/ec2-user/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/toolkits/model_parameter_search/\", line 328, in _train_test_model\n    model = model_factory(training_set, **model_parameters)\n  File \"/home/ec2-user/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/toolkits/classifier/\", line 668, in create\n    **kwargs)\n  File \"/home/ec2-user/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/toolkits/\", line 453, in create\n    options, verbose)\n  File \"/home/ec2-user/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/toolkits/\", line 60, in run\n    (success, message, params) = unity.run_toolkit(toolkit_name, options)\n  File \"graphlab/cython/cy_unity.pyx\", line 81, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit\n  File \"graphlab/cython/cy_unity.pyx\", line 86, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit\nMemoryError: std::bad_alloc\n", "output_path": "/home/ec2-user/.graphlab/artifacts/results/job-results-82c9113f-8caa-4e42-b8e9-fb370979f83f/output/", "run_time": null, "task_name": "_train_test_model-0-0", "start_time": 1458187589, "exception_message": "std::bad_alloc"},
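
For anyone who wants to check their own jobs for the same failure, here is a minimal sketch of how the failed tasks could be pulled out of records like the one above. It uses only the standard `json` module and assumes each record is a JSON object with the `status`, `exception`, `task_name` and `exception_message` fields shown in my output. The helper name `failed_tasks`, the trimmed sample, and the second "Completed" record are my own illustration, not part of GraphLab.

```python
import json


def failed_tasks(records):
    """Return (task_name, exception, exception_message) for every failed record."""
    return [
        (r.get("task_name"), r.get("exception"), r.get("exception_message"))
        for r in records
        if r.get("status") == "Failed"
    ]


# Trimmed sample: the first record keeps only a few fields from the real output
# above; the second record is hypothetical and only there for contrast.
sample = """[
  {"status": "Failed", "exception": "MemoryError",
   "task_name": "_train_test_model-0-0", "exception_message": "std::bad_alloc"},
  {"status": "Completed", "exception": null,
   "task_name": "hypothetical-task", "exception_message": null}
]"""

result = failed_tasks(json.loads(sample))
for name, exc, msg in result:
    print("%s failed with %s: %s" % (name, exc, msg))
```

If several tasks in one search show `MemoryError`, that would be consistent with memory not being released between tasks, although this check alone does not prove it.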

User 1190 | 3/17/2016, 11:30:54 PM

Thanks for creating this bug report. Please see my responses to and