Subset data without creating new SFrame

User 984 | 11/27/2014, 4:31:29 AM

My data has two categorical classes. I would like to split it by class and then keep only a subset of one class to address a class imbalance.

I'm currently doing:

<pre class="CodeBlock"><code>success = train.filterby([1], "feature")
failure = train.filterby([0], "feature")
no_skew = success.append(failure.sample(0.15))</code></pre>

Is there a more idiomatic way to accomplish this?

Comments

User 18 | 11/27/2014, 5:18:01 AM

Hmm, I'm not aware of a way that avoids constructing a new SFrame, but you can probably do it faster (and preserve the original row ordering) by constructing a new column of random selectors and then doing a logical filter:

<pre>
import graphlab
import numpy as np

# First, generate a random float in [0, 1) for each row in the dataset
train['selector'] = graphlab.SArray(np.random.random_sample(train.num_rows()))

# Now select rows that either belong to class 1 or are picked by the selector
no_skew = train[(train['feature'] == 1) | (train['selector'] < 0.15)]
</pre>
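
To see the selection logic without GraphLab installed, here is a minimal pure-Python sketch of the same idea: one random draw per row, keep every row of class 1, and keep a class 0 row only when its draw falls below the threshold. The function name `downsample_majority` and its parameters are made up for illustration and are not part of any library.

```python
import random


def downsample_majority(labels, keep_fraction, minority_label=1, seed=0):
    """Return the indices of the rows to keep, in their original order.

    Hypothetical helper: every row with `minority_label` is kept, and each
    other row is kept with probability `keep_fraction`.
    """
    rng = random.Random(seed)
    kept = []
    for index, label in enumerate(labels):
        # One random draw per row, playing the role of the 'selector' column.
        selector = rng.random()
        if label == minority_label or selector < keep_fraction:
            kept.append(index)
    return kept


# Toy data: 300 rows of class 1 and 700 rows of class 0, interleaved.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0] * 100
kept = downsample_majority(labels, keep_fraction=0.15)
kept_majority = [i for i in kept if labels[i] == 0]
```

Because rows are visited in order and only filtered, `kept` stays sorted, which is the row-ordering property mentioned above. The number of class 0 rows kept is random, so it will only be approximately 15% of the class 0 total.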


User 91 | 11/27/2014, 8:57:38 PM

GraphLab 1.1 (released yesterday) now comes with class weighting for classification. It should be able to handle imbalanced data out of the box. Do check it out and let us know what you think!


User 984 | 11/28/2014, 10:40:20 PM

@srikris, I noticed that the 1.1 release notes say the weighted class feature only applies to the SVM and logistic classifiers. I can't use an SVM in this case because I need to predict class probabilities, and the logistic model underperforms the boosted trees classifier.


User 91 | 11/28/2014, 10:42:16 PM

That is right. The option has not yet been added to the boosted trees classifier. We are on it!

Meanwhile, the user guide has an example of sub-sampling for imbalanced datasets (in the exercises).
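
When sub-sampling by hand, the fraction of the larger class to keep can be derived from the class counts instead of being hard-coded. The following is a rough sketch of that arithmetic in plain Python. It is not the user guide's example, and `balancing_fraction` and `target_ratio` are hypothetical names introduced here.

```python
from collections import Counter


def balancing_fraction(labels, target_ratio=1.0):
    """Return (majority_label, fraction_of_majority_to_keep).

    Hypothetical helper: the fraction is chosen so that, on average,
    kept_majority_rows == target_ratio * minority_rows.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common sorts by count, largest first.
    (majority, n_major), (_minority, n_minor) = counts.most_common(2)
    # Never ask for more than 100% of the majority class.
    fraction = min(1.0, target_ratio * n_minor / n_major)
    return majority, fraction


# Toy data: 850 rows of class 0 and 150 rows of class 1.
labels = [0] * 850 + [1] * 150
majority, fraction = balancing_fraction(labels)
```

The resulting `fraction` could then serve as the sampling fraction in place of a fixed value such as 0.15, for instance in the `failure.sample(...)` call from the original question.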