memory error in running models?

User 2785 | 4/1/2016, 6:14:26 AM

Hiya!

I'm trying to run a couple models on some aggregated data and am running into this error (starting out with logistic regression):

```
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set validation_set=None to disable validation tracking.

WARNING: The number of feature dimensions in this problem is very large in comparison with
         the number of examples. Unless an appropriate regularization value is set, this model
         may not provide accurate predictions for a validation/test set.
WARNING: Detected extremely low variance for feature(s) 'untilcutoff32', 'untilcutoff64',
         'untilcutoff128' because all entries are nearly the same. Proceeding with model
         training using all features. If the model does not provide results of adequate
         quality, exclude the above mentioned feature(s) from the input dataset.

Logistic regression:
Number of examples          : 21650497
Number of classes           : 2
Number of feature columns   : 16
Number of unpacked features : 16
Number of coefficients      : 42598744
Starting L-BFGS

+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
Traceback (most recent call last):
  File "processactivitywindowdata.py", line 38, in <module>
    m = gl.logistic_classifier.create(train, 'target', features=features)
  File "/home/lisafeets/dandelion/churn27/lib/python2.7/site-packages/graphlab/toolkits/classifier/logistic_classifier.py", line 308, in create
    class_weights = class_weights)
  File "/home/lisafeets/dandelion/churn27/lib/python2.7/site-packages/graphlab/toolkits/supervised_learning.py", line 453, in create
    options, verbose)
  File "/home/lisafeets/dandelion/churn27/lib/python2.7/site-packages/graphlab/toolkits/main.py", line 60, in run
    (success, message, params) = unity.run_toolkit(toolkit_name, options)
  File "graphlab/cython/cy_unity.pyx", line 81, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
  File "graphlab/cython/cy_unity.pyx", line 86, in graphlab.cython.cy_unity.UnityGlobalProxy.run_toolkit
MemoryError: std::bad_alloc
```

Any idea why I'm having issues? The data set is approx. 4 GB in size.
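For reference, here is a rough sketch of the call that triggers this, based on line 38 in the traceback. The `validation_set` and `l2_penalty` arguments shown are the options the progress/warning messages refer to, with placeholder values rather than what the script actually uses:

```python
import graphlab as gl

# Rough sketch of the failing call from the traceback above.
# 'aggregated_data.csv' is a placeholder path; `features` is the list of
# the 16 feature column names. validation_set=None disables the automatic
# 5% validation split, and l2_penalty sets the regularization value the
# warning refers to (0.01 is only an example).
train = gl.SFrame.read_csv('aggregated_data.csv')
features = [c for c in train.column_names() if c != 'target']

m = gl.logistic_classifier.create(train, 'target',
                                  features=features,
                                  validation_set=None,
                                  l2_penalty=0.01)
```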

Comments

User 19 | 4/1/2016, 2:44:12 PM

Hi wallawall,

While you only have 21M examples and 16 columns, it looks like the model requires 42.6M coefficients. This number grows when you have a lot of unique categorical values, e.g. when one or more of your columns is string-typed and has many unique values, since each unique value gets its own coefficient.

In order to train and predict quickly, these variables typically need to fit in RAM. If I recall correctly, we store each variable as a double internally (8 bytes), so 42.6M coefficients works out to roughly 340 MB for the model itself, and training needs several copies of that vector. Very long string values can also be problematic: we have to store the mapping between each string value and its internal identifier, which adds further memory overhead.
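To see where the extra coefficients come from, it can help to count the distinct values in each feature column. A minimal sketch, assuming `train` is the training SFrame and `features` is the feature list from the script in the original post:

```python
# Count distinct values per feature column to find the high-cardinality
# columns that expand into many model coefficients.
for col in features:
    n_unique = len(train[col].unique())
    print("%-24s %d unique values" % (col, n_unique))
```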

How much RAM do you have on that machine? For this model, I think 4 GB should be sufficient, but I can't be entirely sure without looking at your actual data. Having 8 GB or 16 GB would be a safer bet.

Please let us know a bit more about the nature of these columns and the size of your machine. That will help us narrow down the possibilities.

Chris


User 2785 | 4/1/2016, 5:05:51 PM

Hi Chris!

Thank you for the insights! It turns out that two of my features were datetimes (redundant, since I already have that information encoded in other features), so they were being treated as a huge number of unique categorical values. I've removed those fields and am now able to run the models. Thanks!!
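For anyone hitting the same thing, the fix amounts to dropping the datetime columns from the feature list before training. A minimal sketch (the datetime column names here are hypothetical placeholders, not the actual column names):

```python
# Exclude the redundant datetime columns so they are not treated as
# high-cardinality categorical features (one coefficient per unique value).
# 'signup_time' and 'last_event_time' are hypothetical placeholder names.
datetime_cols = ['signup_time', 'last_event_time']
features = [c for c in features if c not in datetime_cols]

m = gl.logistic_classifier.create(train, 'target', features=features)
```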


User 19 | 4/1/2016, 5:09:10 PM

Ah, that would definitely explain it! Glad that fixed things.

Let us know if you run into any other issues; we're happy to help!