Is the ability to process large data (which can't fit in memory) available for boosted_trees?

User 690 | 12/31/2014, 3:42:05 AM

Hi everybody, is the ability to process (train with) large data (which can't fit in memory) available for boosted_trees? (I have been able to, and was told I could, train LogisticRegression with large data.) I get a malloc error. I have pasted the stack trace below.

Thanks, Sunil.

features : ('CreativeId', 'PublisherId')
PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 356914653
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS: Iter Accuracy Elapsed time
Traceback (most recent call last):
  File "../ctr.py", line 118, in <module>
    models = {tuple(cat_features): create_model(data, cat_features, cont_features, 'Click') for cat_features in categorical_feature_preference}
  File "../ctr.py", line 118, in <dictcomp>
    models = {tuple(cat_features): create_model(data, cat_features, cont_features, 'Click') for cat_features in categorical_feature_preference}
  File "../ctr.py", line 42, in create_model
    model = trainer_method(model_data, target=target, features=list(features), max_iterations=12)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/classifier/boosted_trees_classifier.py", line 612, in create
    verbose = verbose, **kwargs)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/supervised_learning.py", line 335, in create
    ret = graphlab.toolkits.main.run("supervised_learning_train", options, verbose=verbose)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/main.py", line 57, in run
    (success, message, params) = unity.run_toolkit(toolkit_name, options)
  File "cy_unity.pyx", line 70, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
  File "cy_unity.pyx", line 74, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
MemoryError: std::bad_alloc
[INFO] Stopping the server connection.

Comments

User 690 | 12/31/2014, 3:45:29 AM

The unity_server log is here https://gist.github.com/c3c0e36e65822a77753c


User 14 | 12/31/2014, 6:12:39 AM

From the log it seems you have created a boosted_trees_classifier multiple times with data of similar size. Does it only fail on the last call? What does your memory usage look like for the successful calls? My guess is that you might have hit a bug rather than actually running out of memory.

If it is truly memory intensive, you can try the "row_subsample" option, which uses a subset of rows for tree construction but all rows for leaf value estimation.

http://graphlab.com/products/create/docs/generated/graphlab.boostedtreesclassifier.create.html#graphlab.boostedtreesclassifier.create
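To illustrate the idea behind row_subsample described above (this is a plain-Python sketch of the concept, not the GraphLab API; `build_stump` is a hypothetical name): the split point is chosen from a sample of the rows, but the leaf values are then averaged over all rows.

```python
import random

def build_stump(xs, ys, subsample=0.1, seed=0):
    """Fit a one-split regression stump: choose the split threshold on a
    row subsample, but estimate leaf values on the full data set."""
    rng = random.Random(seed)
    n = len(xs)
    # Tree construction sees only a fraction of the rows.
    idx = rng.sample(range(n), max(1, int(n * subsample)))
    sample = sorted(xs[i] for i in idx)
    threshold = sample[len(sample) // 2]  # split at the sample median

    # Leaf value estimation uses *all* rows, as row_subsample does.
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    mean = lambda vals: sum(vals) / len(vals) if vals else 0.0
    return threshold, mean(left), mean(right)

xs = list(range(100))
ys = [0.0] * 50 + [1.0] * 50  # step function at x = 50
threshold, left_val, right_val = build_stump(xs, ys, subsample=0.1)
```

Because only the sampled rows drive split selection, the working set per tree shrinks roughly in proportion to the subsample rate, while the full-data leaf averages keep the predictions unbiased.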


User 690 | 12/31/2014, 8:22:27 AM

Yes Jay, I am training multiple trees. However, I did find pieces of my code that could use SFrame instead of Python data structures. I will change that and see if things get better. Thanks for the info about row_subsample; I think that can be useful.
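The fix described here, keeping data in SFrame rather than materializing Python structures, matters because a Python list holds every element in RAM at once, while an on-disk SFrame streams rows. A generic plain-Python illustration of the same principle (no GraphLab involved):

```python
import sys

# Materializing all rows as a list keeps every element in memory at once.
rows_list = [i * 2 for i in range(100000)]

# A generator yields rows one at a time, so its own size stays constant
# no matter how many rows it will produce -- the same streaming idea
# that lets an on-disk SFrame scale past available RAM.
rows_gen = (i * 2 for i in range(100000))

list_size = sys.getsizeof(rows_list)  # grows with the number of rows
gen_size = sys.getsizeof(rows_gen)    # small and constant
```

(Note that `sys.getsizeof` reports only the list's pointer array, not the integers it references, so the true footprint of the list is even larger.)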


User 1375 | 5/22/2015, 12:48:23 AM

Just bumped into this beauty or one of its relatives. Please advise.

PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 15191045
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 62
PROGRESS: Number of unpacked features : 62
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS: Iter Accuracy Elapsed time
PROGRESS:      (training) (validation)
Traceback (most recent call last):
  File "src/non_blank_trainer.py", line 185, in <module>
    main()
  File "src/non_blank_trainer.py", line 151, in main
    gbt = train_boosted_trees_classifier(train_valid, test_allfeats)
  File "/root/src/utils.py", line 41, in timed
    result = func(*args, **kw)
  File "/root/src/base_trainer.py", line 162, in train_boosted_trees_classifier
    verbose=True)
  File "/root/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/toolkits/classifier/boosted_trees_classifier.py", line 636, in create
    verbose = verbose, **kwargs)
  File "/root/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/toolkits/_supervised_learning.py", line 394, in create
    ret = _graphlab.toolkits._main.run("supervised_learning_train", options, verbose=verbose)
  File "/root/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/toolkits/_main.py", line 64, in run
    (success, message, params) = unity.run_toolkit(toolkit_name, options)
  File "cy_unity.pyx", line 70, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
  File "cy_unity.pyx", line 74, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
MemoryError: std::bad_alloc


User 91 | 5/22/2015, 1:32:23 AM

Can you give us some more info?

  • What are the specs of the machine?
  • What are the types of the columns? Are they categorical variables with a lot of categories?
  • When the SFrame is saved, how large is it?

The boosted tree is definitely a bit of a memory hog and requires at least a small subset of the columns to be in memory. Depending on your situation, there are a few possible workarounds.
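The question above about categorical columns with many categories matters because an unpacked categorical column can occupy on the order of rows × categories cells. A rough back-of-envelope estimate, assuming 8 bytes per cell and a fully dense expansion (both assumptions for illustration, not GraphLab's actual storage layout):

```python
def dense_unpack_bytes(n_rows, n_categories, bytes_per_cell=8):
    """Worst-case memory if a categorical column were expanded into one
    dense 8-byte column per category (illustrative estimate only)."""
    return n_rows * n_categories * bytes_per_cell

# 15,191,045 rows (as in the trace above) with a hypothetical
# 10,000-category column:
est = dense_unpack_bytes(15191045, 10000)
est_gb = est / 1e9  # over a terabyte for a single dense expansion
```

Sparse representations avoid most of this cost, which is why knowing the column types and category counts helps narrow down whether the bad_alloc is expected or a bug.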