DataFrame.cumsum equivalent with groupby support

User 2032 | 6/10/2015, 10:31:40 AM

The pandas cumsum feature is very useful for creating "running snapshots", for example user profiles >>at the time<< of an event that we want to use as a learning example for prediction. It would be amazing if GraphLab could support it. Even better if it supported arbitrary accumulation functions, for example accumulating the ids of visited items into a list. It would also be important for it to work with groupby, so one could cumsum consecutive items for each key (e.g. user) separately.
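For reference, a minimal sketch of the pandas behavior being referred to, using the same columns as the example below:

	import pandas as pd

	df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
	                   'a': ['t', 't', 's', 's', 's'],
	                   'b': [5, 6, 1, 4, 4]})

	# Running sum of 'b' within each group of 'a', preserving row order.
	df['cumsum'] = df.groupby('a')['b'].cumsum()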

Example

	id;a;b
	1; t; 5
	2; t; 6
	3; s; 1
	4; s; 4
	5; s; 4

	gf = sf.groupby('a', operations={
	    'cumsum': gl.aggregate.CUMSUM('b')
	})

would yield:

	id;a;cumsum
	1; t; 5
	2; t; 11
	3; s; 1
	4; s; 5
	5; s; 9

and

	gf = sf.groupby('a', operations={
	    # note: acc + [x] rather than acc.append(x), since list.append returns None
	    'cumsum': gl.aggregate.CUMSUM('b', initial_acc=[], operation=lambda acc, x: acc + [x])
	})

would yield:

	id;a;cumsum
	1; t; [5]
	2; t; [5, 6]
	3; s; [1]
	4; s; [1, 4]
	5; s; [1, 4, 4]

Comments

User 19 | 6/10/2015, 5:53:46 PM

Thanks for the feature request! Your suggested API is clean and very intuitive.

We'd also be interested in hearing more about your use case. It sounds like you are looking to predict future user behavior from past behavior via some online counts? In my experience, I sometimes get better results from partitioning time and doing feature engineering with respect to the previous window, e.g. window t, user u, some_feature_count_from_{t-1}, etc. It's easy to analyze and interpret, but I agree it doesn't have the same cleanliness as rolling counts (which cumsum would provide).
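For concreteness, a rough pandas sketch of that windowed approach (column and variable names are made up for illustration):

	import pandas as pd

	events = pd.DataFrame({'user': [1, 1, 1, 2, 2],
	                       'window': [0, 0, 1, 0, 1],
	                       'clicks': [3, 2, 4, 1, 5]})

	# Aggregate each (user, window), then shift the window index forward by
	# one so that window t's features come from window t-1.
	prev = events.groupby(['user', 'window'], as_index=False)['clicks'].sum()
	prev['window'] += 1
	prev = prev.rename(columns={'clicks': 'clicks_prev_window'})

	features = events.merge(prev, on=['user', 'window'], how='left')
	features['clicks_prev_window'] = features['clicks_prev_window'].fillna(0)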

We'll keep you posted, and keep the feature requests coming!


User 2032 | 6/15/2015, 1:46:24 PM

Hi Chris,

I'm happy that my request has found fertile ground.

My key argument is that users have very different activity time spans and frequencies, so one user can perform 20 actions in an hour and another in a month. What would then be a good t-1 for the whole population of users? Hard to define. Cumsums guarantee the state of the user profile as seen by the prediction algorithm at time t.

In my opinion this is a killer feature, especially with custom lambda functions.

Small tip regarding aggregations: you should extend the documentation/tutorials for CONCAT, as it is arguably the most powerful operator, letting people create arbitrary groupby functions. This was not 100% clear to me from day 1 and was a kind of discovery and "aha" moment.
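For anyone who hasn't tried it, a minimal sketch of what I mean: CONCAT gathers each group's values into a list, and an arbitrary per-group "aggregation" is then just a function applied to that list (the example function here is made up):

	import graphlab as gl

	sf = gl.SFrame({'user': [1, 1, 1, 2], 'item': ['a', 'b', 'a', 'c']})

	# CONCAT collects each group's items into a list column...
	per_user = sf.groupby('user', {'items': gl.aggregate.CONCAT('item')})
	# ...and any custom aggregation becomes an apply over that list.
	per_user['n_unique_items'] = per_user['items'].apply(lambda xs: len(set(xs)))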


User 2032 | 6/15/2015, 1:50:40 PM

An alternative lambda API (more in line with what you have right now) would be:

	gf = sf.groupby('a', operations={
	    'cumsum': gl.aggregate.CUMSUM(initial_acc=[], operation=lambda acc, row: acc + [row['b']])
	})

and arguably it would be more powerful and provide more flexibility.


User 2032 | 8/3/2015, 3:55:14 PM

"+1" on this again.

All my distributed versions (using aggregate.CONCAT and then flat_map or map) run into OOM on a 50-CPU, 60 GB RAM server if the aggregating column has more than 1000 entries per key and you try to cumsum lists. Within the framework provided by GLC I couldn't think of any version that would not use O(OMP_THREADS_NUM * K ** 2) memory, where K is the number of entries per grouping key, because flat_map needs to materialize the list and is not equipped to handle generators.

The dataset is ca. 100M rows.
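To make the memory argument concrete, here is a rough sketch of the CONCAT + flat_map pattern (a reconstruction for illustration, not the gist linked below): flat_map must return a fully materialized list of output rows, so when the accumulator is itself a list, a group of K entries yields K prefix lists of average length K/2, i.e. O(K ** 2) cells held at once.

	import graphlab as gl

	sf = gl.SFrame({'a': ['t', 't', 's', 's', 's'],
	                'b': [5, 6, 1, 4, 4]})

	# Step 1: CONCAT materializes each group's values as one list per key
	# (assuming CONCAT preserves the desired order, which is not guaranteed).
	grouped = sf.groupby('a', {'values': gl.aggregate.CONCAT('b')})

	# Step 2: expand each group into one output row per prefix. flat_map
	# requires the full list of output rows up front (no generators).
	def prefixes(row):
	    out, acc = [], []
	    for v in row['values']:
	        acc = acc + [v]            # a fresh copy per step: the K ** 2
	        out.append([row['a'], acc])
	    return out

	result = grouped.flat_map(['a', 'cumsum'], prefixes,
	                          column_types=[str, list])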

Here is a snippet that works fine on my machine for < 500 entries per key. It would work fine for a much larger number of entries per key if flat_map accepted generators.

https://gist.github.com/jmizgajski/760386ced2ea43d67929