Sample SFrames according to a distribution

User 2356 | 10/13/2015, 6:44:01 AM

How can I sample an SFrame based on a class-label distribution? For example, I want to select rows so that each class label is drawn equally often, giving every class a similar frequency. Even better would be to draw samples according to whatever class distribution I specify.


User 91 | 10/16/2015, 5:56:06 PM

Currently, there isn't an easy way to do that in a single line. You would need to select each class with a logical filter, e.g. sf[sf['class'] == 0], randomly sample each filtered subset, and then combine the results.
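The filter, sample, combine steps described above could be sketched roughly as follows. This is only a sketch: `sample_per_class` and its parameters are hypothetical names, and it assumes `sf` is an SFrame whose `sample` method takes a fraction (not a row count) and which supports `num_rows()` and `append()`. Because sampling is by fraction, the number of rows returned per class is approximate.

```python
def sample_per_class(sf, label_col, n_per_class, seed=None):
    """Filter each class, sample it, and append the pieces together.

    sf          : an SFrame (assumed to offer sample(fraction, seed=...),
                  num_rows() and append())
    label_col   : name of the class-label column
    n_per_class : dict mapping class label -> approximate rows wanted
    """
    combined = None
    for label, n_wanted in n_per_class.items():
        subset = sf[sf[label_col] == label]   # logical filter for one class
        n_rows = subset.num_rows()
        if n_rows == 0 or n_wanted <= 0:
            continue
        # sample() works on a fraction, so convert the wanted count and
        # cap it at 1.0 when the class has fewer rows than requested.
        fraction = min(1.0, float(n_wanted) / n_rows)
        part = subset.sample(fraction, seed=seed)
        combined = part if combined is None else combined.append(part)
    return combined
```

For example, `sample_per_class(sf, 'class', {0: 100, 1: 100})` would return roughly 100 rows of each class, or all rows of a class that has fewer than 100.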

If you are using a classifier, you can instead pass class_weights='auto' to re-weight the data points.
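To give a feel for what re-weighting does, here is a small sketch that computes weights inversely proportional to class frequency. The exact formula the toolkit uses for 'auto' is not stated in this thread, so the normalization below is an assumption, and `inverse_frequency_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a dict mapping each class label to a weight that is
    inversely proportional to how often the label occurs."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # Assumed normalization: weight = total / (n_classes * count), so
    # every class contributes the same total weight and rare classes
    # get larger per-row weights.
    return {label: total / (n_classes * count)
            for label, count in counts.items()}

weights = inverse_frequency_weights(['A', 'B', 'C', 'C', 'A', 'B', 'B'])
```

With these labels, class B (three rows) ends up with a smaller per-row weight than A or C (two rows each), which has a similar effect to balancing the classes by resampling.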

User 2356 | 10/20/2015, 8:56:04 AM

@srikris using pandas this is possible:

import pandas as pd

data = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5], 'col2': [45, 66, 6, 6, 1, 432, 3], 'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Number of rows to extract for each class.
freq = pd.DataFrame({'class': ['A', 'B', 'C'], 'nostoextract': [2, 2, 2]})

def bootstrap(data, freq):
    # Index the desired counts by class label for direct lookup.
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        # Never ask for more rows than the class actually has.
        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples
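
To sample according to a desired class distribution rather than fixed counts, the per-class numbers in `freq` can be derived from target proportions first. The helper below is a minimal sketch in plain Python; `counts_from_distribution` is a hypothetical name, and the largest-remainder rounding is just one reasonable choice.

```python
def counts_from_distribution(proportions, total):
    """Turn target class proportions into integer per-class counts
    that add up to `total`."""
    norm = float(sum(proportions.values()))
    raw = {c: total * p / norm for c, p in proportions.items()}
    counts = {c: int(r) for c, r in raw.items()}   # round down first
    # Hand the rows lost to rounding to the classes with the largest
    # fractional remainders.
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts

counts = counts_from_distribution({'A': 2, 'B': 1, 'C': 1}, total=10)
```

The result can then be turned into the `freq` frame, e.g. pd.DataFrame({'class': list(counts), 'nostoextract': list(counts.values())}). Note that the function above still caps each class at the number of rows it actually has, so the final distribution can deviate from the target when a class is too small.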