Stratified split/KFolds

User 1319 | 9/20/2015, 3:21:19 AM

Hi, my understanding is that GraphLab Create currently does not support stratified splits or stratified K-folds. Do you have any plans to provide these features in the near future? I am used to doing all my predictive modeling with stratified sampling (via the R caret package), so I've implemented two simple functions, stratified_split and stratified_Kfolds, which use sklearn.cross_validation's StratifiedShuffleSplit and StratifiedKFold. I've only recently started using Python (I mainly used R), so do you have any suggestions for improvements?

Thanks Tarek

My functions:

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import StratifiedKFold

def stratified_split(sf, target, train_size=0.8, seed=None):
    """
    Stratified random (train, test) split for SFrames, using sklearn's
    StratifiedShuffleSplit.

    :param sf: data SFrame
    :param target: str, name of the target column containing the classes
    :param train_size: float between 0.0 and 1.0, the proportion of the
           dataset to include in the train split [default = 0.8]
    :param seed: int, pseudo-random number generator state used for
           random sampling [default = None]
    :return: a list of SFrames [train, test]
    :rtype: list
    """
    # Add a row_no column; it is returned with the splits in case the
    # samples need to be identified in the original SFrame.
    sf = sf.add_row_number('row_no')
    index = StratifiedShuffleSplit(sf[target], n_iter=1,
                                   test_size=1 - train_size, random_state=seed)
    split = []
    for train_index, test_index in index:
        split.append(sf[sf.apply(lambda x: x['row_no'] in list(train_index))])
        split.append(sf[sf.apply(lambda x: x['row_no'] in list(test_index))])
    return split
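For intuition, here is a minimal stdlib-only sketch of what a stratified shuffle split does under the hood (a toy helper, not part of the SFrame code above): shuffle each class's row indices separately, then take the same proportion from every class, so the class balance is preserved by construction.

```python
import random
from collections import defaultdict

def toy_stratified_split(labels, train_size=0.8, seed=None):
    """Return (train_idx, test_idx) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)          # group row indices by class
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)               # shuffle within each class only
        cut = int(round(train_size * len(idx)))
        train_idx.extend(idx[:cut])    # same proportion taken per class
        test_idx.extend(idx[cut:])
    return train_idx, test_idx

# 90/10 imbalanced toy labels
labels = [0] * 90 + [1] * 10
train_idx, test_idx = toy_stratified_split(labels, train_size=0.8, seed=42)
print(len(train_idx), len(test_idx))              # 80 20
print(sum(labels[i] for i in train_idx))          # 8 minority samples in train
print(sum(labels[i] for i in test_idx))           # 2 minority samples in test
```

Whatever the seed, the minority class contributes exactly its share to both halves, which is the guarantee StratifiedShuffleSplit provides.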

def stratified_Kfolds(sf, target, folds=5, seed=None):
    """
    Stratified K-folds for SFrames, using sklearn's StratifiedKFold.

    :param sf: data SFrame
    :param target: str, name of the target column containing the classes
    :param folds: int, number of folds [default = 5]
    :param seed: int, pseudo-random number generator state used for
           shuffling before folding [default = None]
    :return: a list of SFrames [fold1_train, fold1_test, ....]
    :rtype: list
    """
    # Add a row_no column; it is returned with the folds in case the
    # samples need to be identified in the original SFrame.
    sf = sf.add_row_number('row_no')
    # Note: StratifiedKFold only uses random_state when shuffle=True;
    # without it the seed would silently have no effect.
    folds_idx = StratifiedKFold(sf[target], n_folds=folds,
                                shuffle=(seed is not None), random_state=seed)
    Kfolds = []
    for train_index, test_index in folds_idx:
        Kfolds.append(sf[sf.apply(lambda x: x['row_no'] in list(train_index))])
        Kfolds.append(sf[sf.apply(lambda x: x['row_no'] in list(test_index))])
    return Kfolds
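Again for intuition only, a stdlib sketch of the index bookkeeping that stratified K-folding performs (a toy helper, not the SFrame code): deal each class's indices round-robin across the folds, so every fold's test set keeps the overall class proportions.

```python
from collections import defaultdict

def toy_stratified_kfold(labels, folds=5):
    """Yield (train_idx, test_idx) per fold; each class is dealt round-robin."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)                  # group row indices by class
    fold_bins = [[] for _ in range(folds)]
    for idx in by_class.values():
        for j, i in enumerate(idx):
            fold_bins[j % folds].append(i)     # deal this class across folds
    for k in range(folds):
        test_idx = fold_bins[k]
        train_idx = [i for j, b in enumerate(fold_bins) if j != k for i in b]
        yield train_idx, test_idx

labels = [0] * 90 + [1] * 10
for train_idx, test_idx in toy_stratified_kfold(labels, folds=5):
    print(len(test_idx), sum(labels[i] for i in test_idx))  # 20 2, every fold
```

Every test fold gets exactly 2 of the 10 minority samples, whereas an unstratified K-fold on sorted labels could concentrate all of them in one fold.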

Download data

import graphlab as gl
s3_path = "http://s3.amazonaws.com/dato-datasets/freddie_mac/"
sf = gl.load_sframe(s3_path + 'train_set')

Stratified Split

train, test = stratified_split(sf, target='in_default', train_size = 0.75, seed=1261)

Testing the function for stratified random train, test splits

print "train percentage: " , train.num_rows()/float(sf.num_rows())
print "test percentage: " , test.num_rows()/float(sf.num_rows())
#Class distribution
train_class = train.groupby(['in_default'], {'class_count': gl.aggregate.COUNT()})
train_class['class_percentage']= train_class.apply(lambda x: (x['class_count']/float(train.num_rows())))
print "\ntrain class Distribution: \n" , train_class
test_class = test.groupby(['in_default'], {'class_count': gl.aggregate.COUNT()})
test_class['class_percentage']= test_class.apply(lambda x: (x['class_count']/float(test.num_rows())))
print "\ntest class Distribution: \n" , test_class

Stratified KFolds


Comments

User 940 | 9/23/2015, 6:24:54 PM

Hi @tabdunab ,

I do not believe we have stratified sampling on our road map, but we're always open to suggestions. What is your particular use case for stratified sampling, if you don't mind me asking?

As for your code, it looks reasonable. The only thing is, I might suggest using filter_by instead of lambdas in the following code snippet. It should be faster that way :)

Kfolds.append(sf[sf.apply(lambda x: x['row_no'] in list(train_index))])
Kfolds.append(sf[sf.apply(lambda x: x['row_no'] in list(test_index))])

Cheers! -Piotr
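As an aside on why the lambda version is slow: `row_no in list(train_index)` is a linear scan repeated for every row, while filter_by does a hash-based membership filter. The same principle can be sketched in plain Python with a set in place of a list (toy data standing in for the SFrame):

```python
# Toy rows standing in for the SFrame: (row_no, value) pairs.
rows = [(i, i * i) for i in range(1000)]
train_index = range(0, 1000, 2)  # keep even row numbers

# O(n) membership per row, as in the original lambda version
slow = [r for r in rows if r[0] in list(train_index)]

# O(1) membership per row, the idea behind a hash-based filter
train_set = set(train_index)
fast = [r for r in rows if r[0] in train_set]

print(slow == fast)  # True: same rows selected, far fewer comparisons
```

In SFrame terms the equivalent one-liner would pass the index list and the 'row_no' column name to filter_by, avoiding the per-row Python lambda entirely.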


User 1319 | 9/25/2015, 12:41:01 AM

Thank you @piotr for your reply.

As you know, stratified sampling is used to keep the same class distribution in the training and testing datasets. This is particularly important for imbalanced datasets: with a simple random split (one that ignores the response), there is a high chance that the class distributions in the training and testing datasets will differ, and for an imbalanced dataset the minority class(es) may be missing entirely. The goal of stratified sampling is therefore to reduce the variability in a model's predictive performance when using a train/test split. It also matters when using CV for model selection: all folds should have the same class distribution, which stratified K-folds guarantees.
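The "missing minority class" failure mode is easy to demonstrate with a stdlib sketch (toy labels, hypothetical numbers): with a 3% minority class and a plain 75/25 random split, a substantial fraction of splits leave the test set with no minority examples at all, which a stratified split rules out by construction.

```python
import random

labels = [1] * 3 + [0] * 97          # 3% minority class
rng = random.Random(7)
trials = 2000
misses = 0
for _ in range(trials):
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    test_idx = idx[75:]              # plain 75/25 random split, ignoring labels
    if not any(labels[i] for i in test_idx):
        misses += 1                  # this test set saw no minority sample

print(misses > 0)                    # random splits regularly miss the minority class
```

Analytically the miss probability here is (75/100)(74/99)(73/98), roughly 42%, so a non-trivial share of the 2000 trials produce a test set that cannot even measure minority-class performance.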

Thanks Tarek


User 19 | 9/25/2015, 2:27:11 AM

We agree that it's very useful to have stratified KFolds for multi-class situations, especially when things are imbalanced. It is indeed on our roadmap!

Thanks, Chris


User 1319 | 9/25/2015, 4:08:40 AM

Thanks @ChrisDuBois for your reply. Tarek