How to remove duplicate columns in SFrame?

User 5143 | 4/24/2016, 1:50:06 PM

Please provide efficient equivalent code in GraphLab.

    # remove duplicate columns
    colsToRemove = []
    columns = trainDataFrame.columns
    for i in range(len(columns)-1):
        v = trainDataFrame[columns[i]].values
        for j in range(i+1,len(columns)):
            if np.array_equal(v,trainDataFrame[columns[j]].values):
                colsToRemove.append(columns[j])

    trainDataFrame.drop(colsToRemove, axis=1, inplace=True)

Comments

User 16 | 4/25/2016, 8:40:28 PM

Hi Sevla,

I'd recommend you start by taking a look at https://dato.com/learn/translator/

The corresponding SFrame code is quite similar, with only a few differences. To get the column names, call column_names() rather than accessing the columns attribute. You don't call values to get the values from a Series/SArray. To remove columns, call remove_columns, which works in place by default.

Here is how I would rewrite the code to use SFrame:

    colsToRemove = []
    columns = trainDataFrame.column_names()
    for i in range(len(columns)-1):
        v = trainDataFrame[columns[i]]
        for j in range(i+1,len(columns)):
            if np.array_equal(v,trainDataFrame[columns[j]]):
                colsToRemove.append(columns[j])
    trainDataFrame.remove_columns(colsToRemove)
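The nested loops above compare every pair of columns, so the number of comparisons grows quadratically with the number of columns. A possible alternative, sketched here under some assumptions, is to read each column once and remember its contents in a set, so a duplicate is detected by a lookup instead of a pairwise comparison. The helper name find_duplicate_columns and the (name, values) pair input are hypothetical and not part of GraphLab or pandas. The sketch also assumes the column values are hashable (numbers or strings) and that holding the distinct columns in memory is acceptable.

```python
def find_duplicate_columns(named_columns):
    """Return the names of columns whose values repeat an earlier column.

    named_columns: iterable of (name, values) pairs, in column order.
    This is a hypothetical helper; values must be hashable items.
    """
    seen = set()        # contents of every distinct column seen so far
    duplicates = []     # names of columns that repeat an earlier one
    for name, values in named_columns:
        key = tuple(values)         # hashable snapshot of the column
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Small in-memory example: "c" and "e" repeat "a", "d" repeats "b".
example = [
    ("a", [1, 2, 3]),
    ("b", [4, 5, 6]),
    ("c", [1, 2, 3]),
    ("d", [4, 5, 6]),
    ("e", [1, 2, 3]),
]
print(find_duplicate_columns(example))  # prints ['c', 'd', 'e']

# Assumed usage with an SFrame (not run here):
# pairs = [(name, trainDataFrame[name]) for name in trainDataFrame.column_names()]
# trainDataFrame.remove_columns(find_duplicate_columns(pairs))
```

Each duplicate is listed only once, even when three or more columns are identical. Whether this is actually faster on an SFrame depends on how quickly the columns can be iterated, so it is worth timing on your own data.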


User 5143 | 4/26/2016, 12:16:08 PM

Hi Toby, I tried the same code. For 7000 x 371 data it is very slow.


User 16 | 4/27/2016, 12:53:42 AM

How big is the dataset, in terms of megabytes or gigabytes? How much slower is it than pandas?