SFrame and moving averages over different time lengths

User 958 | 11/19/2014, 11:57:59 AM

How can SFrame handle feature creation such as time-lagged calculations over different window lengths (e.g. current value + moving average over the last 6 hours + min and max over the last month + ...) without losing the advantage of SFrame (processing data as a stream rather than loading it all into memory)? A solution based on SArray would force me to load everything into memory :(.



User 958 | 11/24/2014, 8:16:31 PM

Any ideas?

May I rephrase: can SFrame handle sliding-window calculations?

For example, for each row of a big dataset: time + current temperature + average temperature over the past hour + min/max temperature over the past hour + average temperature over the past 24 hours + min/max temperature over the past 24 hours + trend over the past 24 hours + average temperature over the past 30 days + ...

Thanks a lot

User 91 | 11/24/2014, 8:47:01 PM

To clarify, SArray operations do not always load the data into memory; data gets flushed to disk when required. So there should be no need to hold everything in memory, even when working with SArrays.

For a moving-window calculation, you have two options.

(a) Using a combination of group-by and join operators, you should be able to achieve what you need. The functions that you might be interested in are:

(i) gl.SArray.split_datetime: Split a datetime into hour, month, day, etc.

(ii) gl.SFrame.groupby: This lets you compute min, max, and averages at an hour/day/month level.

(iii) gl.SFrame.join: Join the results from the group-by operation to your original table to get what you need.

In addition, SFrame provides some useful operations (pack_columns, unpack) that let you move between composite types (dictionary/list) and simple types. Using those tools, I am sure you will be able to solve your problem.
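To make the three steps of option (a) concrete, here is a minimal sketch in plain Python. It deliberately does not use the SFrame API: the sample rows and the helper names (hour_key, hourly_stats, join_stats) are made up for illustration, and each helper only mirrors the role of split_datetime, groupby, and join respectively.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (timestamp, temperature).
rows = [
    (datetime(2014, 11, 19, 10, 5), 10.0),
    (datetime(2014, 11, 19, 10, 40), 14.0),
    (datetime(2014, 11, 19, 11, 15), 20.0),
]

def hour_key(ts):
    # Step (i): split the datetime into the parts that identify an hour bucket.
    return (ts.year, ts.month, ts.day, ts.hour)

def hourly_stats(rows):
    # Step (ii): group by hour bucket and aggregate min, max and average.
    buckets = defaultdict(list)
    for ts, temp in rows:
        buckets[hour_key(ts)].append(temp)
    return {
        key: {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }

def join_stats(rows, stats):
    # Step (iii): join the aggregates back onto each original row.
    return [(ts, temp, stats[hour_key(ts)]) for ts, temp in rows]

stats = hourly_stats(rows)
joined = join_stats(rows, stats)
```

Note that this produces statistics per calendar hour (10:00 to 10:59, 11:00 to 11:59, ...), not over a window that slides with each row, so it only approximates a true moving window.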

(b) The GraphLab Create SDK is coming out very soon. It gives complete access to all the underlying data structures, which should let you write anything you need if you aren't satisfied with a solution based on (a).

User 958 | 11/27/2014, 7:28:57 PM

Thanks a lot.

It seems that all of those options would recompute the function for each new row over the full moving-window timeframe. It is far more efficient to compute rolling values incrementally: apply only the delta from the new data entering the window and the old data leaving it.
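As an illustration of that delta approach, here is a minimal sketch of a rolling mean in plain Python (not SFrame code). It assumes samples arrive sorted by time as (seconds, value) pairs; the function name rolling_mean and the sample data are hypothetical.

```python
from collections import deque

def rolling_mean(samples, window_seconds):
    """Yield (t, value, mean) where mean covers the window (t - window_seconds, t].

    Assumes samples are sorted by time and window_seconds is positive.
    """
    window = deque()  # samples currently inside the window
    total = 0.0       # running sum of the values in the window
    for t, value in samples:
        # New data enters the window: add its contribution.
        window.append((t, value))
        total += value
        # Old data leaves the window: subtract its contribution.
        while window[0][0] <= t - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield t, value, total / len(window)

# Hypothetical samples: (seconds since start, temperature).
samples = [(0, 10.0), (1800, 20.0), (3600, 30.0), (7200, 50.0)]
result = list(rolling_mean(samples, 3600))
```

Each row is added once and removed once, so the cost per row does not grow with the window length, and only the rows inside the window are kept in memory. Over very long streams the running sum can accumulate floating-point rounding error.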

If I compute those rolling functions myself row by row, would it be faster and more memory-efficient to create a new SFrame on the fly or to modify the original SFrame?

User 18 | 11/29/2014, 1:58:33 AM

Hi @doxav,

We are planning to support rolling time-window feature computation, exactly as you described. In the meantime, the SDK would be your best bet. It is not yet released but should be out soon.

SFrames allow only a very limited amount of mutability. You can add columns (SArrays), append to existing SArrays, or select columns and/or rows. You can't modify individual elements of an SArray in place. So when doing this on your own, you'll have to create new SArray(s) to hold the features you compute.
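In that spirit, a row-by-row computation would write each feature into fresh output sequences rather than touching the input. The sketch below is plain Python, not SFrame or SDK code; the function name rolling_min_max and the sample data are hypothetical, and it assumes rows are sorted by time. It computes a rolling min and max in one pass and returns them as two new lists that could then be attached as new columns.

```python
from collections import deque

def rolling_min_max(times, values, window_seconds):
    """Return two new lists: rolling min and max over (t - window_seconds, t].

    Assumes times are sorted ascending and window_seconds is positive.
    The inputs are only read, never modified.
    """
    min_q, max_q = deque(), deque()  # (time, value); front holds current min / max
    mins, maxs = [], []
    for t, v in zip(times, values):
        # Drop candidates that can never be the min (or max) again.
        while min_q and min_q[-1][1] >= v:
            min_q.pop()
        min_q.append((t, v))
        while max_q and max_q[-1][1] <= v:
            max_q.pop()
        max_q.append((t, v))
        # Evict candidates that have left the time window.
        cutoff = t - window_seconds
        while min_q[0][0] <= cutoff:
            min_q.popleft()
        while max_q[0][0] <= cutoff:
            max_q.popleft()
        mins.append(min_q[0][1])
        maxs.append(max_q[0][1])
    return mins, maxs

# Hypothetical input columns.
times = [0, 1800, 3600, 7200]
temps = [10.0, 20.0, 5.0, 8.0]
temp_min_1h, temp_max_1h = rolling_min_max(times, temps, 3600)
```

The original columns stay untouched, and the two output lists have one entry per input row, so they line up with the source rows.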