User 2568 | 3/14/2016, 4:57:59 AM
The data for the Santander Kaggle competition has many redundant columns: some are constant and some pairs of columns are linear transforms of each other. The raw data set has 370 columns, and 64 are either constant or members of pairs that are equal.
When preparing the data for a classifier, especially for boosted_trees_classifier.create, it is worth removing these redundant columns.
To identify a constant column, I use:

    [col for col in train_data.column_names() if train_data[col].var() == 0]
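As a point of comparison outside GraphLab Create, the same variance-equals-zero check can be sketched in pandas (the frame and column names below are made up for illustration):

```python
import pandas as pd

# Toy frame (made-up data): 'b' is constant, the others vary.
train_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0.0, 1.0, 0.0, 1.0],
})

# A column is constant iff its variance is zero (same idea as SArray.var()).
const_cols = [col for col in train_df.columns if train_df[col].var() == 0]
```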
and to identify linear transforms I use:

    [(col1, col2) for col1, col2 in pairs
     if abs(pearsonr(train_data[col1], train_data[col2])[0]) > 0.99]

(Note that pearsonr returns a (correlation, p-value) tuple, so the coefficient is element [0].)
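One faster alternative I can suggest (my own sketch, not a standard tool) is to compute the whole correlation matrix in a single vectorized call instead of calling pearsonr once per pair. With NumPy and made-up toy data, where column 2 is an exact linear transform of column 0:

```python
import numpy as np

# Toy matrix (made-up data): column 2 is 2 * column 0.
X = np.array([
    [1.0, 5.0, 2.0],
    [2.0, 3.0, 4.0],
    [3.0, 8.0, 6.0],
    [4.0, 1.0, 8.0],
])

# Correlation matrix of all columns in one C-level call.
corr = np.corrcoef(X, rowvar=False)

# Upper-triangle pairs whose |r| exceeds the threshold.
n = corr.shape[1]
near_dupes = [(i, j) for i in range(n) for j in range(i + 1, n)
              if abs(corr[i, j]) > 0.99]
```

This does O(n²) work in compiled code rather than an O(n²) Python loop, which is usually a large constant-factor win.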
For very large datasets this is slow, and I wondered whether there are standard tools with faster methods for this?
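One off-the-shelf option I am aware of (offered as a suggestion, with a made-up toy frame) is pandas, where transposing and calling drop_duplicates removes exactly-duplicated columns in one step:

```python
import pandas as pd

# Toy frame (made-up data): 'b' duplicates 'a' exactly.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [4, 5, 6],
    "d": [9, 9, 9],
})

# Transpose so duplicate columns become duplicate rows, drop them,
# then transpose back; the first of each duplicate group is kept.
deduped = df.T.drop_duplicates().T
kept = list(deduped.columns)
```

This only catches exact duplicates, not general linear transforms, and the double transpose copies the data, so it may not suit very large frames.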
For columns that aren't constant, or pairs that aren't linear transforms, only a small number of rows likely need to be compared, so conceptually an early-exit loop like this should be quicker:

    def is_const(sa):
        c = sa[0]
        for value in sa:
            if value != c:
                return False
        return True

    drop_col = [col for col in train_data.column_names() if is_const(train_data[col])]
However, on my data set it was about 4x slower. I then wondered if I could exploit lazy evaluation and wrote:

    [col for col in train_data.column_names()
     if all(train_data[col] - train_data[col][0] == 0)]

This was about 7x slower.
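A likely explanation for both slowdowns is that the per-element work moves from compiled code into the Python interpreter. A middle ground I would try (my own sketch, with a hypothetical helper name) is to materialize the column once and keep the comparison in C with NumPy:

```python
import numpy as np

def is_const_np(values):
    """Vectorized constant check: compare every element to the first.

    The elementwise comparison and the reduction both run in C, so this
    typically beats a Python-level loop even though it scans all rows.
    """
    arr = np.asarray(values)
    if arr.size == 0:
        return True  # treat an empty column as constant
    return bool((arr == arr[0]).all())
```

Unlike the early-exit loop this always scans the full array, but the constant factor is small enough that it usually wins in practice.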