Number of features supported for classification

User 117 | 10/21/2014, 4:40:53 PM

Hello,

I asked before in this forum how many features we can use for classification and was told that GL should be able to handle thousands of features.

I'm now trying to run an experiment where I'm training a classifier with ~9000 features, but having access to only ~500 training samples.

A few seconds after training starts, I lose the connection to the local GL server and get an error. This happens with both the SVM and logistic regression classifiers.

I'm trying to run the training locally on a machine with 8GB of RAM, using the default parameters for the models.

Any idea what might be causing the connection loss?

I can provide the data if requested.

Comments

User 91 | 10/22/2014, 1:35:34 AM

We have run GLC on datasets with millions of features.

Can you provide the data? We can look into the issue.


User 117 | 10/22/2014, 8:20:56 AM

Hello,

you can find the data here: https://www.dropbox.com/s/lybbdprwyl8otz1/complete.tar.gz?dl=0

The archive contains a binary SFrame created with sf.save().

The target column is named 'class', all the other columns are features.

I have tried different settings for L1 regularization in order to reduce the number of features used but I am still getting the same problem.

Do you have any guidance on how much memory is needed with respect to the number of samples and features? In the past I have seen GL algorithms fail in the same way on smaller machines but complete successfully on machines with lots of memory.


User 117 | 10/22/2014, 11:40:34 AM

I should also note that I'm trying to run this in a virtualenv.


User 91 | 10/22/2014, 7:55:03 PM

The issue is not the memory but the number of files that your system can open. Another user had a similar issue (http://forum.graphlab.com/discussion/comment/946/#Comment_946). This can happen if you use SFrames with a very large number of columns. Setting the file limit to a larger value using "ulimit" can help. One reason things may have worked on your server is that it may have had a larger default file limit set.
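As an example, here is a minimal sketch of raising the limit from inside the Python process before loading the SFrame. It assumes a Unix-like system and uses the standard-library resource module, which is the in-process equivalent of the shell's "ulimit -n":

import resource

# Check the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('current file limits: soft=%d, hard=%d' % (soft, hard))

# Raise the soft limit up to the hard limit (no special privileges needed).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))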

Instead, I would recommend using dictionaries and lists for encoding your features. (See this picture for an illustration: http://graphlab.com/images/userguide/supervised-learning-list-variables.png). They keep the SFrames compact and ensure high throughput for model training.

The code snippet that I used to get the model to train using your data is as follows:

import graphlab as gl
import array

# Load the SFrame
sf = gl.SFrame('complete')

# Columns to pack into lists
cols = sf.column_names()
cols.remove('class')

# Pack the columns into arrays.
sf = sf.pack_columns(cols, dtype=array.array)

# Train the model using the arrays as features.
model = gl.logistic_classifier.create(sf, target = 'class')
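If your data is mostly zeros, packing into a dictionary column instead of an array column should also work, and the classifier can take the dict column as a feature directly. A rough, untested variant of the same snippet (the column name 'features' is just an illustration):

# Pack the feature columns into a single dict column
# (keys are the original column names, values are the feature values).
sf = gl.SFrame('complete')
cols = sf.column_names()
cols.remove('class')
sf = sf.pack_columns(cols, dtype=dict, new_column_name='features')

# The classifier accepts dict-typed feature columns directly.
model = gl.logistic_classifier.create(sf, target='class', features=['features'])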


User 117 | 10/23/2014, 8:08:04 AM

Thanks, I remember having a similar problem in the past with factorization of sparse data.

The solution there was to use dicts to store the sparse data, which, like this solution, feels a bit unnatural and breaks the natural ETL flow within GL.

Hopefully it will get fixed in a future release.


User 91 | 10/23/2014, 3:37:57 PM

There are several dictionary and list utilities on the SArray that make it easy to work with sparse data (i.e. dictionaries) or dense data (i.e. lists). Check out pack_columns, unpack, stack, unstack, dict_trim_by_keys, slice, etc. Do let us know if these help with your ETL workflow.

We encourage you to use dictionaries when your data is sparse. It works out to be more efficient in terms of storage and computational cost.
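For instance, here is a small, illustrative sketch of a few of those utilities on a toy dict column (the column and key names are made up):

import graphlab as gl

# A toy SFrame with a sparse dict column: keys are feature names, values are counts.
sf = gl.SFrame({'id': [1, 2],
                'features': [{'foo': 1, 'bar': 2}, {'bar': 3}]})

# stack: expand the dict column into (key, value) rows, one row per entry.
stacked = sf.stack('features', new_column_name=['feature', 'value'])

# unstack: collapse the (key, value) rows back into a dict column.
restacked = stacked.unstack(['feature', 'value'], new_column_name='features')

# dict_trim_by_keys: drop selected keys from every dictionary in the column.
trimmed = sf['features'].dict_trim_by_keys(['bar'], exclude=True)

# unpack: expand the dict column into one dense column per key.
dense = sf['features'].unpack(column_name_prefix='f')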