feature_engineering.create does not work when using long chains of transforms.

User 2568 | 2/14/2016, 5:02:38 AM

I have a TRAINING data set with 7381 rows and a TESTING set with 11171 rows. Both have 14 features, most of which are dicts.

I want to use QuadraticFeatures on 12 of these features. If I create the transformation in the usual way, it takes a second or so, i.e.:

from graphlab.toolkits.feature_engineering import *

chain = QuadraticFeatures(features=all_features)
quadratic = gl.feature_engineering.create(new_train_data,chain)
new_train_data = quadratic.transform(new_train_data)
new_test_data = quadratic.transform(new_test_data)
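As I understand it, the transformer packs every pairwise product into a single dict per row, roughly like this pure-Python sketch (the key naming here is illustrative, not GraphLab's exact scheme):

```python
import itertools

# One row of numeric features; QuadraticFeatures also handles dict features.
row = {'a': 2.0, 'b': 3.0, 'c': 5.0}

# All pairwise products, packed into a single dict per row.
quadratic = {f1 + ', ' + f2: row[f1] * row[f2]
             for f1, f2 in itertools.combinations(sorted(row), 2)}

print(quadratic)  # {'a, b': 6.0, 'a, c': 10.0, 'b, c': 15.0}
```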

However, the problem is that all the new features end up in a single column, which means I must use all of them or none in my models. I thought I could chain the transformations pair-wise so each pair gets its own column, which I can then choose to use or not. I wrote:

import itertools
from graphlab.toolkits.feature_engineering import *

new_train_data, new_test_data = initialise.load_data(reload_data=False)

chain = [QuadraticFeatures(features=pair, output_column_name=",".join(pair)) 
             for pair in itertools.combinations(all_features, 2)]
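As a sanity check on the pair count, plain itertools confirms that 12 features yield 66 pairs (pure Python, no GraphLab needed; the feature names are stand-ins):

```python
import itertools

# Stand-ins for the 12 feature names.
all_features = ['f%d' % i for i in range(12)]

pairs = list(itertools.combinations(all_features, 2))
print(len(pairs))  # 66, i.e. 12 choose 2
```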

This quickly creates 66 QuadraticFeatures transformers. But when I then try to create the transformer:

quadratic = gl.feature_engineering.create(new_train_data, chain)

my server runs at 100% CPU, and after 5 minutes I give up and restart the kernel.


User 91 | 2/15/2016, 10:17:50 PM

User 2568 | 2/16/2016, 2:11:17 AM

A tar archive of the SFrame binary save is on S3 here:

s3://kmc-data-science/Telstra Network/bug_sample.tar

The data was created from the Telstra Network Disruptions Kaggle competition: https://www.kaggle.com/c/telstra-recruiting-network

Commands to reproduce the bug are:

import graphlab as gl
from graphlab.toolkits.feature_engineering import *
import itertools

new_train_data = gl.SFrame('bug_sample')

all_features = set(new_train_data.column_names()) - set(['fault_severity', 'fault_severity#', 'id'])

# This creates 91 transformations.
chain = [QuadraticFeatures(features=pair, output_column_name=",".join(pair)) 
             for pair in itertools.combinations(all_features, 2)]

# Creating the transformation loops indefinitely:
quadratic = gl.feature_engineering.create(new_train_data,chain)

User 2568 | 2/20/2016, 12:00:05 AM

I've been experimenting with this a little more. I created each transformation and fit them one after the other. There is a sudden slowdown once I get to about 35+ columns in the SFrame.
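A timing loop like the following is one way to see where the slowdown starts; this sketch uses a dummy stand-in for the transformer's fit, since the real loop would call the GraphLab transformer on the SFrame:

```python
import time

def fit_one(data, n_columns):
    # Hypothetical stand-in for QuadraticFeatures(...).fit(data);
    # in the real experiment this is the GraphLab transformer fit.
    time.sleep(0.001)

data = object()  # placeholder for the SFrame
timings = []
for n_columns in range(1, 6):
    start = time.time()
    fit_one(data, n_columns)
    timings.append((n_columns, time.time() - start))

for n, t in timings:
    print('%d columns: %.4f s' % (n, t))
```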

User 940 | 3/3/2016, 7:00:20 PM

Hi @Kevin_McIsaac ,

Sorry for the delay, we're looking into this now and will keep you posted.

Thank you for your patience! -Piotr

User 940 | 3/3/2016, 11:40:41 PM

Hi @Kevin_McIsaac ,

I'm trying to download the data you provided with aws cli.

aws s3 cp 's3://kmc-data-science/Telstra Network/bug_sample.tar' .

A client error (NoSuchKey) occurred when calling the HeadObject operation: Key "Telstra Network/bug_sample.tar" does not exist

Is the file still present and public?

Thanks for your patience!

Cheers! -Piotr

User 940 | 3/3/2016, 11:48:59 PM

Oh, I see. It's not a .tar but a directory/prefix.

Thanks, I'll continue looking into this!

User 2568 | 3/4/2016, 12:22:40 AM

Oops, I must have deleted it. Try this https://s3-ap-southeast-2.amazonaws.com/kmc-data-science/Telstra+Network/bug_sample.tar

User 940 | 3/5/2016, 3:43:35 AM

Hi @Kevin_McIsaac ,

We've confirmed the bug and are investigating. There is a workaround, though: calculate ALL the interactions, then filter out the ones you don't want with a stack and unstack operation. Here's an example that keeps only the interactions involving the log_feature:volume feature.

```python
import graphlab as gl
from graphlab.toolkits.feature_engineering import *

new_train_data = gl.SFrame('bug_sample')

all_features = set(new_train_data.column_names()) - set(['fault_severity', 'fault_severity#', 'id'])
all_features = list(all_features)

transformer = gl.feature_engineering.QuadraticFeatures(features=all_features)
new_train_data = transformer.fit_transform(new_train_data)

# Stack features to grab all unique interactions
stacked = new_train_data.stack('quadratic_features', ['feature', 'value'])
unique_features = stacked['feature'].unique()

# Only care about interactions with 'log_feature:volume', so filter
filtered_features = unique_features.filter(lambda x: "log_feature:volume" in x)
stacked = stacked.filter_by(filtered_features, 'feature')

# Unstack into original sparse form
quadratic_subset_sf = stacked[['id', 'feature', 'value']].unstack(
    ['feature', 'value'], new_column_name='quadratic_feature_subset')
```
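The stack/filter/unstack logic above, expressed in plain Python on a list of dict rows just to show the shape of the data at each step (the feature names and values here are made up; SFrame does the same thing at scale):

```python
# Rows with all quadratic interactions packed into one dict column.
rows = [
    {'id': 1, 'quadratic_features': {'log_feature:volume, severity_type': 6.0,
                                     'event_type, severity_type': 2.0}},
    {'id': 2, 'quadratic_features': {'log_feature:volume, severity_type': 15.0,
                                     'event_type, severity_type': 4.0}},
]

# "Stack": one (id, feature, value) record per dict entry.
stacked = [{'id': r['id'], 'feature': f, 'value': v}
           for r in rows for f, v in r['quadratic_features'].items()]

# Filter to interactions involving log_feature:volume.
kept = [s for s in stacked if 'log_feature:volume' in s['feature']]

# "Unstack": regroup the surviving entries into a dict per id.
subset = {}
for s in kept:
    subset.setdefault(s['id'], {})[s['feature']] = s['value']

print(subset)
# {1: {'log_feature:volume, severity_type': 6.0},
#  2: {'log_feature:volume, severity_type': 15.0}}
```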

I hope this gets you unblocked! Let me know if you need anything else.

Cheers! -Piotr