Suggestion for extending aggregate.CONCAT

User 2568 | 3/2/2016, 5:22:49 AM

I've been using GraphLab Create for the Kaggle "Telstra Network Disruption" competition and found that its various features and functions made this quite simple. From my experience in this competition, I have a simple suggestion: an extension to CONCAT that would have made the coding simpler.

Background

Most of the data in this competition are categorical features, which GraphLab's features and create methods make easy to deal with.

The data was provided as a table of observations (train.csv and test.csv) with a unique identifier for each observation (id). These observations have to be combined with three different feature tables, each of which also includes 'id'.

A complicating factor is that the feature tables are many-to-one with train and test, so they can't just be joined to train/test. Instead they need to be grouped by 'id' using CONCAT aggregation of the feature. GraphLab made this quite easy; however, there are a couple of wrinkles that could be smoothed out.

The log_feature table is of the form 'id', 'feature', 'volume', so:

groupby('id', {'log_features': gl.aggregate.CONCAT('feature', 'volume')})

creates a table of 'id', 'log_features', which is one-to-one with test/train. The 'log_features' column is a dictionary (i.e., {feature: value, feature 2: value, ...}), which is a simple, sparse representation of "n in k encoding". When joined with the train observations on 'id', 'log_features' is properly handled by the create methods. So far so good.
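To make the shape of that result concrete, here is a minimal plain-Python sketch of what grouping by 'id' with a key/value CONCAT produces. It does not use GraphLab at all; the helper name concat_key_value and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Toy rows shaped like the log_feature table: (id, feature, volume).
# The values are invented for illustration only.
rows = [
    (1, "feature 68", 6),
    (1, "feature 82", 2),
    (2, "feature 170", 1),
]

def concat_key_value(rows):
    """Hypothetical stand-in for grouping by id with CONCAT(key, value)."""
    grouped = defaultdict(dict)
    for row_id, feature, volume in rows:
        # Each id collects a sparse {feature: volume} dictionary.
        grouped[row_id][feature] = volume
    return dict(grouped)

log_features = concat_key_value(rows)
```

Each id ends up with exactly one dictionary, which is why the result is one-to-one with train/test.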

The two other feature tables did not have a value associated with the feature. For example, event_type is just 'id', 'event_type', so I used

groupby('id', {'event_types': gl.aggregate.CONCAT('event_type')})

This creates a one-to-one table with train/test, where 'event_types' is a list of features. The problem is that this is not compatible with the create methods and requires another step to turn the list into a dictionary. This is not difficult, but is not as clean as 'log_features'.
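The extra step amounts to mapping every item of the list to 1. A minimal plain-Python sketch, with a hypothetical helper name and made-up event types:

```python
def list_to_indicator_dict(items):
    """Map each item in the list to 1, giving a sparse indicator dictionary."""
    return {item: 1 for item in items}

# Made-up list of event types for a single id.
event_types = ["event_type 11", "event_type 15"]
encoded = list_to_indicator_dict(event_types)
```

In GraphLab the same function body would be passed to .apply() on the list column.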

The other issue is that I want a second log_feature column using log1p(volume) and a third using volume**2. To do this I first needed to add one new column = log1p(volume) and another = volume**2; then I could use CONCAT as above. Again not hard, but more complex than might be needed, and it takes attention away from the feature creation.
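As a rough illustration of that preparatory step, the sketch below adds the two derived values to each row before any grouping happens. It is plain Python over a list of dictionaries rather than an SFrame, and the column names log1p_volume and volume_squared are made up.

```python
from math import log1p

# Toy rows; the values are invented for illustration only.
rows = [
    {"id": 1, "feature": "feature 68", "volume": 6},
    {"id": 1, "feature": "feature 82", "volume": 2},
]

# Add the derived columns up front, because CONCAT can only read
# values that already exist as columns.
for row in rows:
    row["log1p_volume"] = log1p(row["volume"])
    row["volume_squared"] = row["volume"] ** 2
```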

Suggestion

It struck me that extending CONCAT to accept a function to be applied to the row being grouped would unify and simplify these three cases.

The first groupby stays as is, and the second and third simplify to

groupby('id', {'event_types': gl.aggregate.CONCAT('event_type', lambda row: 1)})
groupby('id', {'log_feature_log_volume': gl.aggregate.CONCAT('log_feature', lambda row: log1p(row['volume'])),
               'log_feature_volume**2': gl.aggregate.CONCAT('log_feature', lambda row: row['volume']**2)})
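The intended semantics can be sketched in plain Python. This is only an emulation of the proposal, not GraphLab code: concat_with is a hypothetical helper, and the rows are made up.

```python
from collections import defaultdict
from math import log1p

def concat_with(rows, group_col, key_col, value_fn):
    """Hypothetical helper: group rows by group_col and build, per group,
    a dictionary mapping row[key_col] to value_fn(row)."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[group_col]][row[key_col]] = value_fn(row)
    return dict(grouped)

# Toy rows; the values are invented for illustration only.
rows = [
    {"id": 1, "log_feature": "feature 68", "volume": 6},
    {"id": 1, "log_feature": "feature 82", "volume": 2},
    {"id": 2, "log_feature": "feature 170", "volume": 1},
]

# The three cases differ only in the function applied to each row.
indicator = concat_with(rows, "id", "log_feature", lambda row: 1)
log_volume = concat_with(rows, "id", "log_feature", lambda row: log1p(row["volume"]))
volume_squared = concat_with(rows, "id", "log_feature", lambda row: row["volume"] ** 2)
```

One function parameter covers the indicator case, the transformed-value case, and (with lambda row: row['volume']) the existing key/value case.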

I think this might be a simple, logical and powerful extension of CONCAT that is worth considering.

Comments

User 1359 | 3/2/2016, 6:31:10 PM

Thanks for the very nice suggestion, Kevin.

We do, in fact, have new aggregator functions coming in the next release: frequency count and distinct item.

For technical reasons, it is unlikely there will be a general callback function parameter; however, you can use .apply() to implement any custom logic on the values.

Thanks again!


User 2568 | 3/2/2016, 8:12:46 PM

Dick, thanks for the reply. Frequency count would be most useful in this competition, which I'd raised earlier. Do you have documentation for this? I'd be interested to see what you are proposing.

You are quite right that it is possible to use .apply(); however, this is less expressive. For the competition I have to write:

    import graphlab as gl
    from math import log1p

    # log_feature.csv has the columns 'id', 'log_feature', 'volume'.
    features = gl.SFrame.read_csv('data/log_feature.csv', verbose=False)

    # These two lines become unnecessary with the new aggregate FREQUENCY feature.
    freq = features.groupby('log_feature', {'freq': gl.aggregate.COUNT()})
    features = features.join(freq, on='log_feature', how='left')

    # This line is necessary because CONCAT does not take a function.
    features['log(volume)'] = features['volume'].apply(log1p)

    features = features.groupby('id',
                 {"log_feature#number":      gl.aggregate.COUNT('id'),
                  "log_feature:":            gl.aggregate.CONCAT("log_feature"),
                  "log_feature:volume":      gl.aggregate.CONCAT("log_feature", 'volume'),
                  "log_feature:log(volume)": gl.aggregate.CONCAT("log_feature", 'log(volume)'),
                  "log_feature:freq":        gl.aggregate.CONCAT("log_feature", 'freq'),
                  "log_feature#volume_sum":  gl.aggregate.SUM('volume'),
                 })

    # If classify.create() accepted list features or CONCAT took a function, this would not be needed.
    features['log_feature:'] = features['log_feature:'].apply(lambda lst: {k: 1 for k in lst})

which is less clear than the proposed version:

    features = gl.SFrame.read_csv('data/log_feature.csv', verbose=False)

    features = features.groupby('id',
                 {"log_feature#number":      gl.aggregate.COUNT('id'),
                  "log_feature:":            gl.aggregate.CONCAT("log_feature"),
                  "log_feature:volume":      gl.aggregate.CONCAT("log_feature", 'volume'),
                  "log_feature:log(volume)": gl.aggregate.CONCAT("log_feature", lambda r: log1p(r['volume'])),
                  "log_feature:freq":        gl.aggregate.FREQUENCY("log_feature"),
                  "log_feature#volume_sum":  gl.aggregate.SUM('volume')})