Generalisation of SArray.filter to SFrames

User 2568 | 3/14/2016, 12:02:00 AM

I wanted to exclude all rows from an SFrame where a specific condition (lambda x: x >= 0) was met on a specific column "var15".

I'm aware of:
1. SArray.filter, i.e., sa = sa.filter(lambda x: x >= 0.0), but this only works on an SArray.
2. SFrame.filter_by, but this only takes a list of values.

But I could not find the equivalent of filter for SFrame.

I could achieve this by creating a new Boolean column and then using filter_by on it, i.e.,

sf['FILTER'] = sf["var15"].apply(lambda x: x >= 0.0)
sf = sf.filter_by(1, "FILTER")

but that seems clunky. How about extending filter_by to take a function, not just a list?


User 15 | 3/14/2016, 2:26:25 AM

Hi @Kevin_McIsaac

This has been talked about for a long time, but no one felt a strong enough need to implement it I guess. Good thing SFrame is open source now, maybe someone from the community will do it! :)

The way people normally do this is essentially what you describe, but there's a shorthand for it: <pre> sf[sf['var15'] >= 0.0] </pre>

This way will actually be much faster for a few reasons:
1. Using the comparison operators on SArrays invokes native C++ code, so it avoids the overhead of invoking and running Python to filter.
2. It does a logical filter instead of the heavyweight filter_by. The filter_by function is simply a wrapper around our join function, which implements a hash join. That's why it takes a list of values.
3. The chained operations of creating a boolean SArray and then filtering allow for query optimization by the engine.
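To make the difference in point 2 concrete, here is a toy model in plain Python lists. It is only an illustration of the two ideas (boolean mask versus hash join against a set of values), not the SFrame engine's actual code; the function names logical_filter and hash_join_filter are made up for this sketch.

```python
# Toy model only: plain Python lists standing in for an SFrame.
# Neither function is part of the SFrame API.

def logical_filter(rows, mask):
    """Keep the rows whose mask entry is truthy (single pass, no lookup table)."""
    return [row for row, keep in zip(rows, mask) if keep]


def hash_join_filter(rows, column, values):
    """Keep the rows whose `column` value appears in `values`, via a hash lookup."""
    wanted = set(values)  # the "build" side of the join
    return [row for row in rows if row[column] in wanted]


rows = [{"var15": -1.0}, {"var15": 0.0}, {"var15": 2.5}]

# A predicate such as x >= 0.0 maps naturally onto a boolean mask...
mask = [row["var15"] >= 0.0 for row in rows]
kept = logical_filter(rows, mask)

# ...whereas a join needs concrete values to match against.
matched = hash_join_filter(rows, "var15", [2.5])
```

A join can only match against values it has been given up front, which is why a predicate like x >= 0.0 fits the mask approach and not filter_by.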

So if someone were to implement filter for SFrame to match the SArray implementation, it should essentially be a wrapper around this code.
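A minimal sketch of what such a wrapper might look like, assuming only what this thread already relies on: sf[column] returns an SArray, SArray.apply accepts a Python function, and indexing an SFrame with a boolean SArray keeps the matching rows. The name filter_rows is hypothetical and is not an existing SFrame method.

```python
def filter_rows(sf, column, predicate):
    """Hypothetical helper: keep the rows of `sf` where predicate(value) is true.

    `sf` is assumed to behave like an SFrame: sf[column] yields a column
    with an .apply(fn) method, and sf[mask] does a logical filter.
    """
    mask = sf[column].apply(predicate)  # runs the Python function per element
    return sf[mask]                     # logical filter on the boolean column


# Intended usage (with an existing SFrame `sf`):
# non_negative = filter_rows(sf, "var15", lambda x: x >= 0.0)
```

Because the predicate runs in Python, this would still be slower than the comparison-operator form for simple conditions; it is mainly a convenience for conditions that cannot be written with the built-in operators.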

Hope this helps!


User 2568 | 3/14/2016, 3:02:24 AM

Evan, thank you for explaining this and how it's implemented.

User 1207 | 3/14/2016, 7:44:05 PM

One more note to add to this: X[X["a"].apply(lambda x: x >= 0) ] also works (and the lambda can be any boolean function).

-- Hoyt