Fast operations on dictionary-type SFrame columns

User 761 | 11/6/2014, 6:39:56 PM

Hello (again)!

I have a highly sparse dataset, around 200k samples and 30k features. If I store it densely it takes up a lot of space on disk; I don't have exact numbers, but at one point the unity server's disk usage reached 30GB and my meagre free disk was full. So I'm storing the dataset in a dictionary-type SFrame column: one column, with each row being a dictionary. Element-wise transformations (log, sqrt, etc.) on the column are spectacularly fast using the apply-lambda method. However, when I want to do operations on a per-feature level, I'm not sure how to go about it.

E.g., say I want to do 0-1 scaling for each feature. The fastest implementation I could manage was the following:

dmax = {}  # dictionaries to store max and min of each feature
dmin = {}

for row in sf['X1']:  # X1 is the column of type dictionary
    for k in row:
        try:
            # update max and min for each feature
            dmax[k] = row[k] if row[k] > dmax[k] else dmax[k]
            dmin[k] = row[k] if row[k] < dmin[k] else dmin[k]
        except KeyError:
            # initialise max and min when a previously unseen feature is encountered
            dmax[k] = row[k]
            dmin[k] = row[k]

X1_scaled = sf['X1'].apply(lambda x: {k: (v - dmin[k]) / (dmax[k] - dmin[k])
                                      for k, v in x.iteritems()
                                      if dmax[k] != dmin[k]})
sf_scaled = SFrame({'X1': X1_scaled})

This entire process takes about 140 seconds, which is fine but once the data becomes much larger it may be a concern.
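For reference, the approach above boils down to one pass to collect per-feature extrema, then a per-row rescale. A plain-Python sketch on toy data (the `rows` list and values here are illustrative, not from the actual dataset):

```python
# Toy sparse data: each row is a dict of feature -> value.
rows = [{'a': 1.0, 'b': 10.0}, {'a': 3.0}, {'b': 20.0, 'c': 5.0}]

# One pass over all rows to collect per-feature max and min.
dmax, dmin = {}, {}
for row in rows:
    for k, v in row.items():
        dmax[k] = max(v, dmax.get(k, v))
        dmin[k] = min(v, dmin.get(k, v))

# 0-1 scale each row, dropping constant features (max == min).
scaled = [{k: (v - dmin[k]) / (dmax[k] - dmin[k])
           for k, v in row.items() if dmax[k] != dmin[k]}
          for row in rows]
```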

Are there better alternative ways to do this?



User 91 | 11/6/2014, 7:03:10 PM

I believe there is another way to do it, using the specialized SFrame and SArray operations stack and unstack. Do check out stack, unstack, pack_columns, and unpack. They are some of the most amazing features in the SFrame.

Let us start off with a sample SFrame

sf = gl.SFrame({'a': [{'key%s' % i:i, 'key%s' % (1+i): i + 101} for i in range(100)]})

First perform a stack operation

sf2 = sf.stack('a', new_column_name=['key', 'value'])

Now perform a group-by on the dictionary key with an aggregate of MAX on the dictionary value

temp_output = sf2.groupby('key', {'scale': gl.aggregate.MAX('value')}) \
                 .unstack(['key', 'scale'], new_column_name='max')
max_column_dict = temp_output['max'][0]

Do the same with min

temp_output = sf2.groupby('key', {'scale': gl.aggregate.MIN('value')}) \
                 .unstack(['key', 'scale'], new_column_name='min')
min_column_dict = temp_output['min'][0]

Now you can do the scaling with your original code, using these dictionaries in place of dmax and dmin.
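Conceptually, the stack + groupby + unstack chain just computes a per-key max across all of the row dictionaries. A plain-Python sketch of the equivalent computation (toy data assumed, mirroring the sample SFrame above):

```python
# Toy rows like the sample SFrame's dict column 'a'.
rows = [{'key0': 0, 'key1': 101}, {'key1': 1, 'key2': 102}]

# "stack": flatten every (key, value) pair into one long list.
pairs = [(k, v) for row in rows for k, v in row.items()]

# "groupby + MAX" then "unstack": reduce the pairs back into one dict
# holding the max value seen for each key.
max_column_dict = {}
for k, v in pairs:
    max_column_dict[k] = max(v, max_column_dict.get(k, v))
```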

I understand that there are some operations being chained here, but once you get a grip on stack, unstack, pack_columns, and unpack for dictionary operations, you will love them!

User 18 | 11/6/2014, 8:01:39 PM

Following up on what Krishna said, stack/unstack/pack/unpack are some of the most powerful and yet most mysterious features in SFrame/SArray. They can take a little getting used to. But once you understand what they do, it becomes easier and easier to envision what else they might be useful for. For example, I use a combination of stack and groupby to calculate document frequency for words. (How-to example forthcoming!)

One thing to note is that unpack should be used very carefully, as it essentially expands out a sparse dictionary into a dense representation (one column per feature) and SFrame doesn't deal well with large numbers of columns (>= 5K).
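A rough back-of-the-envelope for the dataset in this thread (200k rows, 30k features) shows why the dense expansion hurts. The 8 bytes per cell and ~50 non-zeros per row are illustrative assumptions, not figures from the thread:

```python
n_rows, n_features = 200000, 30000
bytes_per_cell = 8  # assume 8-byte floats

# Dense: one cell per (row, feature), mostly zeros.
dense_bytes = n_rows * n_features * bytes_per_cell
dense_gb = dense_bytes / float(2 ** 30)  # roughly 45 GB

# Sparse dict column: only the non-zero entries are stored.
nnz_per_row = 50  # illustrative assumption
sparse_bytes = n_rows * nnz_per_row * bytes_per_cell
```

Even ignoring per-key overhead in the dict representation, the dense form here is hundreds of times larger, on top of the sheer column count that unpack would create.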

User 761 | 11/7/2014, 11:44:09 AM

Works great. Got a 6x speedup. Man, I should've thought of it myself. stack/unstack makes a lot of things easier.

" SFrame doesn't deal well with large numbers of columns" @alicez: Will keep this in mind.

Thanks srikris and alicez :)