User 2568 | 7/8/2016, 3:03:36 AM
I'm predicting store inventory demand from historical store sales. This is an interesting Kaggle competition as the training set is 74M rows and the test set is 7M rows, which means that designing for performance is key. My first thought is to predict demand based on the average of the last 4 weeks. To do this I:
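Roughly, the baseline looks like this (a minimal sketch against the GraphLab Create SFrame API; the column names 'week', 'store_id', 'product_id', and 'demand' are placeholders for the competition's actual schema):

```python
import graphlab as gl
import graphlab.aggregate as agg

train = gl.SFrame.read_csv('train.csv')  # ~74M rows

# Keep only the most recent 4 weeks of sales.
last_week = train['week'].max()
recent = train[train['week'] > last_week - 4]

# Average demand per (store, product) pair over those weeks.
prediction = recent.groupby(
    key_columns=['store_id', 'product_id'],
    operations={'predicted_demand': agg.MEAN('demand')})
```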
My questions are:
I noticed a similar solution written in R. To improve performance they used data.table's 'setkey' on the table. My understanding is that this sorts the table by the key columns, marks it as sorted, and then uses that ordering to speed up aggregation and joins.
Would sorting the SFrame by these columns increase performance of aggregation and joins?
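One way to check this empirically would be to time the same join with and without pre-sorting both tables on the join keys, along these lines (a sketch only; 'train' is the SFrame from the snippet above and 'reference' is a hypothetical smaller lookup table):

```python
import time

keys = ['store_id', 'product_id']

# Baseline: join the two tables as-is.
t0 = time.time()
result = train.join(reference, on=keys, how='left')
result.num_rows()  # force any pending lazy evaluation before stopping the clock
print('unsorted join: %.1fs' % (time.time() - t0))

# Pre-sort both tables on the join keys, then repeat the join.
train_sorted = train.sort(keys)
reference_sorted = reference.sort(keys)
t0 = time.time()
result = train_sorted.join(reference_sorted, on=keys, how='left')
result.num_rows()
print('sorted join: %.1fs' % (time.time() - t0))
```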
For very large tables, what is the fastest way to do a lookup in a reference table? Is it
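For illustration, the two lookup patterns I'm comparing look roughly like this (same placeholder names; 'reference' is much smaller than 'train' and keyed by 'product_id'):

```python
# (a) Left join: attach the reference columns to every row of the big table.
enriched = train.join(reference, on='product_id', how='left')

# (b) filter_by: keep only the rows whose key appears in the reference
#     table; useful when the lookup is a membership test rather than a
#     value retrieval.
known = train.filter_by(reference['product_id'], 'product_id')
```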
I've done some performance testing which indicates that joins and aggregates scale linearly or better, so this seems to be the right approach, but I wanted to check.
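The kind of test I mean looks roughly like this: run the same aggregation on random samples of increasing size and watch how runtime grows (a sketch, not the exact harness):

```python
import time
import graphlab.aggregate as agg

# Roughly linear growth in runtime across sample sizes suggests the
# approach will hold at the full 74M rows.
for fraction in (0.1, 0.25, 0.5, 1.0):
    sample = train.sample(fraction, seed=42)
    t0 = time.time()
    sample.groupby(['store_id', 'product_id'],
                   {'mean_demand': agg.MEAN('demand')})
    print('%9d rows: %.1fs' % (sample.num_rows(), time.time() - t0))
```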