Removing redundant columns in dataset, i.e., constant columns or linear combinations

User 2568 | 3/14/2016, 4:57:59 AM

The data for the Santander Kaggle competition has many redundant columns, i.e., some are constant and some pairs of columns are linear transforms of each other. The raw data set has 370 columns, and 64 are either constant or equal to another column.

When preparing the data for a classifier, especially boosted_trees_classifier.create, is it worth removing these redundant columns?

To identify a constant column, I use

[col for col in train_data.column_names() if train_data[col].var() == 0]

and to identify linear transforms I use:

[(col2, col1) for col1, col2 in pairs if abs(pearsonr(train_data[col1], train_data[col2])[0]) > 0.99]

where pairs iterates over candidate column pairs (e.g. itertools.combinations(train_data.column_names(), 2)) and pearsonr comes from scipy.stats.

For very large datasets this is slow and I wondered if there were standard tools for this with faster methods?
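Outside of SFrame, the same two checks can be vectorized with NumPy. A minimal sketch, assuming the data fits in a dense 2-D array (find_redundant_columns is a hypothetical helper name, not part of GraphLab):

```python
import numpy as np

def find_redundant_columns(X, names, corr_threshold=0.99):
    """Return (constant column names, highly correlated column-name pairs).

    X is a dense 2-D float array; names labels its columns. A hypothetical
    helper -- just the two checks above, vectorized."""
    n_cols = X.shape[1]
    # A column is constant iff its peak-to-peak range is zero.
    const = [names[j] for j in range(n_cols) if np.ptp(X[:, j]) == 0]
    # Correlate only the non-constant columns (zero variance gives NaN).
    keep = [j for j in range(n_cols) if np.ptp(X[:, j]) > 0]
    pairs = []
    if len(keep) >= 2:
        C = np.corrcoef(X[:, keep], rowvar=False)
        pairs = [(names[keep[a]], names[keep[b]])
                 for a in range(len(keep))
                 for b in range(a + 1, len(keep))
                 if abs(C[a, b]) > corr_threshold]
    return const, pairs
```

On data of this shape (370 columns), the single correlation matrix replaces a Python loop over all pairs with one vectorized pass.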

For columns that aren't constant, or pairs that are not linear combinations, it's likely that only a small number of rows need to be compared, so conceptually a loop like this should be quicker:

    def is_const(sa):
        c = sa[0]
        for value in sa:
            if value != c:
                return False
        return True

    drop_col = [col for col in train_data.column_names() if is_const(train_data[col])]

However, on my data set it was about 4x slower. I then wondered if I could exploit lazy evaluation and wrote:

[ col for col in train_data.column_names() if all(train_data[col] - train_data[col][0] == 0)]

This was about 7x slower.
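The early-exit idea itself is sound. In plain NumPy it can be approximated by scanning in fixed-size chunks, so a non-constant column is rejected after the first differing chunk rather than a full pass (a sketch with a hypothetical helper name, not an SFrame operation):

```python
import numpy as np

def is_const_chunked(a, chunk=65536):
    """Early-exit constant check: compare fixed-size chunks against the
    first element and stop at the first chunk containing a mismatch."""
    first = a[0]
    for start in range(0, len(a), chunk):
        # Each chunk comparison is vectorized; the loop gives early exit.
        if not np.all(a[start:start + chunk] == first):
            return False
    return True
```

This keeps the per-element work vectorized while still stopping early on non-constant columns, which is where a row-at-a-time Python loop loses badly.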


User 1207 | 3/14/2016, 6:20:29 PM

Hello @Kevin_McIsaac:

In this case, the key is that some of the operations -- e.g. all or pearsonr -- cause a cast to a Python iterator and then back to an SFrame. If you use either apply(...) or an operation implemented natively on the SFrame, things will be much faster, as those all run in parallel and without converting everything to Python first.

One thing you could try is the following:

gl.regression.linear_regression.create(X[[col1, col2]], target = col2, validation_set = None, l2_penalty=0).get("training_rmse")

If this is very close to zero, then one column can be predicted linearly from the other. This would also work for multiple column groups.
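Outside GraphLab, the same test can be sketched with an ordinary least-squares fit: if the RMSE of the best affine fit of one column on another is near zero, the pair is redundant. A NumPy analogue of the training_rmse check, with a hypothetical helper name:

```python
import numpy as np

def linear_rmse(x, y):
    """RMSE of the best affine fit y ~ a*x + b; near zero means y is a
    linear transform of x. A NumPy analogue of the training_rmse check,
    not a GraphLab call."""
    # Design matrix with an intercept column.
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))
```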

For your last statement, all(...) is a Python builtin that takes any iterable, which means Python does all the evaluation. In your case, it would be much faster to do (abs(train_data[col] - train_data[col][0]).sum() == 0), which is done entirely in the C++ lazy-evaluation framework.

Hope that helps! -- Hoyt

User 2568 | 3/14/2016, 8:06:52 PM

hoytak thanks for the reply. I tried your solution of (abs(train_data[col] - train_data[col][0]).sum() == 0) vs the others I'd thought of, and here are the results:

1. train_data[col].var() == 0: 1.02 s
2. abs(train_data[col] - train_data[col][0]).sum() == 0: 4.21 s
3. all(train_data[col] - train_data[col][0] == 0): 6.61 s

so calculating the variance is the quickest.

However, my point is that all these solutions must process the entire column of data and don't stop early. That is, the lazy evaluation does not help, as the operation must still be performed on all the elements.

It seems that GraphLab does NOT have a language element that suits this kind of computation. That is, both constant and linear-combination detection would greatly benefit from an SFrame-native operation that terminates early once the operation on a specific row is false.

The solution could be an SFrame-native version of ALL (and ANY) that works with lazy evaluation and terminates as soon as it encounters a False (True) value, e.g.:

gl.ALL(train_data[col] - train_data[col][0] == 0)

So, if the second element differs from the first, it stops after the second evaluation!
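For comparison, Python's built-in all() already short-circuits when fed a generator; the cost in the examples above comes from materializing the SFrame expression first. A small sketch with a hypothetical helper that counts how many elements all() actually consumes:

```python
def count_until_false(iterable):
    """Run Python's built-in all() over a counting generator, returning
    (result, number of elements actually consumed) to show that all()
    short-circuits on the first falsy value."""
    consumed = [0]

    def counting():
        for v in iterable:
            consumed[0] += 1
            yield v

    result = all(counting())
    return result, consumed[0]
```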

Another approach would be an SFrame-native version of reduce with an early-stopping condition, say like this:

gl.reduce(and_, lambda x: x - train_data[col][0] == 0, stop_when=lambda acc: not acc)
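In plain Python, the proposed semantics might look like the following sketch (reduce_early and its stop_when parameter are hypothetical, not a GraphLab API):

```python
def reduce_early(fn, iterable, initial, stop_when):
    """Reduce with an early-stopping predicate: fold fn over the values,
    but break as soon as stop_when(accumulator) is true."""
    acc = initial
    for x in iterable:
        acc = fn(acc, x)
        if stop_when(acc):
            break
    return acc
```

A constant check that stops at the first mismatch would then be reduce_early(lambda acc, v: acc and v == first, column, True, lambda acc: not acc).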

User 2568 | 3/14/2016, 8:08:24 PM

Can you comment on the first part of my question too? When preparing data for a classifier, specifically boosted_trees_classifier.create, is it worth removing these redundant columns?

User 1207 | 3/14/2016, 9:12:03 PM

Ah, I see. And... I realized I didn't explain that aspect clearly. We do implement a .all() method that does this, as well as a .any() method that does the complement. I used .sum() thinking of a threshold first (e.g. `(...).sum() < 1e-4`), then changed it to your approach.

As for the classifier, sorry, I didn't realize that statement was part of your question. Yes, it can help speed up the training to remove these columns, but it shouldn't affect the accuracy of the boosted tree model. Constant-valued columns are effectively ignored even if they are indexed, and it would be essentially random which of two identical columns would be picked. The exception is if column_subsample is on, in which case one of the duplicated columns would be more likely to show up in the subsampled column list.

-- Hoyt

User 2568 | 3/15/2016, 1:45:57 AM

So I thought I should be able to write

not (train_data[col] - train_data[col][0] != 0).any()

and if this does lazy evaluation and early termination, this should be faster than

 train_data[col].var() == 0

if the array is not constant. My experiment below shows it's slower. Any thoughts?

1. Have I written this to take advantage of lazy evaluation?
2. Does any() terminate early?

sa_const = gl.SArray(data = [1 for _ in xrange(100000000)])
sa_rand  = gl.SArray(data = [i for i in xrange(100000000)])

# time how long it takes to determine this is a constant
%timeit sa_const.var() == 0
%timeit not ((sa_const - sa_const[0]) != 0).any()

# time how long it takes to determine this is NOT constant.
# If any() does early termination and lazy evaluation, this will be much quicker.
%timeit sa_rand.var() == 0
%timeit not ((sa_rand - sa_rand[0]) != 0).any()

1 loops, best of 3: 441 ms per loop
1 loops, best of 3: 1.23 s per loop

1 loops, best of 3: 478 ms per loop
1 loops, best of 3: 1.26 s per loop

User 1207 | 3/17/2016, 7:26:40 PM

Hey Kevin,

Indeed, in this case, we've noted that it doesn't do early stopping due to a quirk in our lazy evaluation framework that we just uncovered. However, it should, and we'll be looking into fixing it. I've captured the issue here:

Thanks! -- Hoyt