SFrame aggregation and join performance. Is this the best approach?

User 2568 | 7/8/2016, 3:03:36 AM

I'm predicting store inventory demand from historical store sales. This is an interesting Kaggle competition, as the training set is 74M rows and the test set is 7M rows, which means that designing for performance is key. My first thought is to predict demand based on the average for the last 4 weeks. To do this I (see the sketch after this list):

  1. Calculate product average sales per store for the last 4 weeks using an aggregation
  2. Join this to the test data to create the prediction.
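A minimal sketch of those two steps, assuming this is the Grupo Bimbo data with a 'Semana' week column and the 'Demanda_uni_equil' demand column; the paths and key columns are hypothetical:

    import graphlab as gl

    train = gl.SFrame.read_csv('train.csv')  # hypothetical paths
    test = gl.SFrame.read_csv('test.csv')
    features = ['Cliente_ID', 'Producto_ID']  # hypothetical key columns

    # Keep only the last 4 weeks of training data
    recent = train[train['Semana'] > train['Semana'].max() - 4]

    # Step 1: average demand per key over those weeks
    forecast = recent.groupby(features,
                              {'Pred': gl.aggregate.MEAN('Demanda_uni_equil')})

    # Step 2: left-join the averages onto the test rows as the prediction
    test = test.join(forecast, on=features, how='left')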

My questions are:

  1. I noticed a similar solution written in R. To improve performance it used 'setkey' on the data frame. My understanding is that this sorts the data frame by these keys, marks them as sorted, then uses this to speed up aggregation and joins.

    • Does GraphLab have a similar optimisation? My impression is that this is done automatically, but I wanted to confirm.
    • Would sorting the SFrame by these columns increase the performance of aggregation and joins? (A quick timing sketch follows this list.)

  2. For very large tables, what is the fastest way to do a lookup in a reference table? Is it

    query.join(reference, how='left')
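For reference, a quick way to test whether pre-sorting helps is to time the join both ways. A minimal sketch, assuming 'query' and 'reference' are existing SFrames sharing hypothetical key columns in 'features':

    import time

    features = ['Cliente_ID', 'Producto_ID']  # hypothetical key columns

    # Time the join on the unsorted SFrames; materialize() forces the
    # lazily evaluated result so the timing is meaningful
    start = time.time()
    result = query.join(reference, on=features, how='left')
    result.materialize()
    print 'unsorted join: {:.1f}s'.format(time.time() - start)

    # Repeat after sorting both sides by the key columns
    query_sorted = query.sort(features)
    reference_sorted = reference.sort(features)
    start = time.time()
    result = query_sorted.join(reference_sorted, on=features, how='left')
    result.materialize()
    print 'sorted join: {:.1f}s'.format(time.time() - start)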

I've done some performance testing, which seems to indicate that the joins and aggregates scale linearly or better, so this seems to be the right approach, but I wanted to check.

Comments

User 91 | 7/8/2016, 3:34:41 PM

Currently, we don't have a sort-merge join implemented, so sorting by the keys may or may not help significantly. Other than that, it seems like a good approach. If you do run into performance issues, send us a code snippet and we can see if there is a faster way to do the same thing.


User 2568 | 7/10/2016, 2:18:32 AM

The core of the code is:

        forecast = train.groupby(features, {"Pred2": gl.aggregate.MEAN('Demanda_uni_equil')})
        test = test.join(forecast, on=features, how='left')
        test['Pred'] = gl.SArray.where(test['Pred'] == None, test['Pred2'], test['Pred'])

I'm running on an 8-core machine, and I notice that for a while it runs on all cores at 100%, then on one core at 100%, then back to all cores again.

If you want to run it, the data is on Kaggle here. The full code is:

import sys
from math import log1p, expm1

import graphlab as gl

def median(sa):
    # Approximate median via the SArray sketch summary
    return sa.sketch_summary().quantile(0.5)

def logrmse(pred, actual):
    # RMSE in log space (assumed helper, reconstructed to match the in-loop metric)
    return gl.evaluation.rmse(pred.apply(log1p), actual.apply(log1p))

def predict_v3(train, test, lst, verbose=False):

    train = train.copy()
    test  = test.copy()

    # When scoring a hold-out set, keep the true demand as 'Actual'
    if 'Demanda_uni_equil' in test.column_names():
        test.rename({'Demanda_uni_equil': 'Actual'})
        test['logActual'] = test['Actual'].apply(log1p)

    # Work in log space so the mean aggregate approximates the RMSLE objective
    train['Demanda_uni_equil'] = train['Demanda_uni_equil'].apply(log1p)
    test['Pred'] = None

    # Fill predictions from the most to the least specific feature set,
    # touching only rows that are still missing a prediction
    for features in lst:
        forecast = train.groupby(features, {"Pred2": gl.aggregate.MEAN('Demanda_uni_equil')})
        test = test.join(forecast, on=features, how='left') # Warning - join does not return the dataframe in the original column order of the left table!
        test['Pred'] = gl.SArray.where(test['Pred'] == None, test['Pred2'], test['Pred'])
        test.remove_column('Pred2')
        if verbose:
            if 'Actual' in test.column_names():
                print "logrmse {:6.4f}".format(gl.evaluation.rmse(test['Pred'], test['logActual'])),
            print "{:8,} ({:4.2%}) missing predictions after forecasting by {}".format(test['Pred'].num_missing(), test['Pred'].num_missing() / float(len(test)), features)
            sys.stdout.flush()

    # As a last resort use the overall median
    forecast = median(train['Demanda_uni_equil'])
    test['Pred'] = gl.SArray.where(test['Pred'] == None, forecast, test['Pred'])

    # Transform back out of log space and round to integer demand
    test['Pred'] = test['Pred'].apply(lambda l: round(expm1(l)), dtype=int)
    if verbose and 'Actual' in test.column_names():
        print "logrmse {:6.4f}".format(logrmse(test['Pred'], test['Actual']))

    test.rename({'Pred': 'Demanda_uni_equil'})
    return test
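For context, a hypothetical call might look like the following, with lst ordered from the most to the least specific feature set (the column names and output file are assumptions, not from the original post):

    lst = [
        ['Cliente_ID', 'Producto_ID'],  # most specific
        ['Producto_ID'],                # broadest fallback
    ]
    submission = predict_v3(train, test, lst, verbose=True)
    submission[['id', 'Demanda_uni_equil']].save('submission.csv', format='csv')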