Small SFrame takes 100% of CPU and RAM resources

User 4707 | 4/13/2016, 2:13:38 PM

Hi, I have a performance problem.

I have an SFrame with 5000 rows and 50 columns. Each cell contains a number in the range -127 to 128. It's a really small array of data, approx. 250 kB.

But when I try to save it to disk (via the .save method), python.exe takes all CPU and memory resources, and my only option is a hard reset.

The same thing happens when I try to convert this SFrame to a numpy array via the .to_numpy() method.

What am I doing wrong?

PS: I have 16 GB of RAM and a Core i5-3570.

PPS: I use just the sframe module rather than the whole of GraphLab.


User 4 | 4/13/2016, 6:49:59 PM

Hi @archon, is it possible that what is happening here is a series of computationally intensive operations that are being deferred until save? By default, SFrame uses a lazy evaluation strategy to defer execution of operations until the result is needed. On save, an entire queue of operations might be performed (which could be quite CPU-intensive, even on a small dataset, if it's a lot of operations). To know if that is happening here, I would need to see the whole script (and not just the save method).

If that's not the case, it sounds like we might have a bug. If you can provide your dataset (or a representative dataset with this issue) I would be happy to try to reproduce it so we can fix it.
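To illustrate the deferred-execution behaviour described above, here is a minimal pure-Python sketch. The `LazyColumn` class and its `apply`/`materialize` methods are hypothetical stand-ins invented for this example; they are not SFrame's actual implementation, only a toy showing why queued operations cost nothing until a result is requested.

```python
class LazyColumn:
    """Toy lazily evaluated column (hypothetical; not SFrame's real code)."""

    def __init__(self, values, pending=None):
        self._values = values
        self._pending = list(pending or [])  # queued functions, not yet run

    def apply(self, fn):
        # Returns immediately: the function is only recorded, not executed.
        return LazyColumn(self._values, self._pending + [fn])

    def materialize(self):
        # All queued work happens here, e.g. when saving or exporting.
        out = []
        for v in self._values:
            for fn in self._pending:
                v = fn(v)
            out.append(v)
        return out


calls = []

def expensive(x):
    calls.append(x)  # record that real work happened
    return x * 2

col = LazyColumn([1, 2, 3]).apply(expensive).apply(lambda x: x + 1)
calls_before = len(calls)   # nothing has run yet
result = col.materialize()  # the whole queue runs now
print(calls_before, len(calls), result)
```

In this toy, building the pipeline is instant and all the cost shows up in `materialize()`, which mirrors the symptom of a fast transformation followed by a slow save.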

User 4707 | 4/13/2016, 8:00:54 PM

"SFrame uses a lazy evaluation strategy"

I think that's the answer, and not a bug in the SFrame code. I started with 2.1 GB of data saved on disk. It looks like this:

That dataset is much bigger, because each cell in the "tracedata" column contains an array with 400002 elements. These are my features for future machine learning, but I need to reduce them. I do that with the following command:

Here "bestindices" is a list of 50 indices that I want to extract from each array. That line of code executes quite quickly, and afterwards I have a new column with the data from my first post (where I had just unpacked the arrays), along with the performance trouble. I can't export this data to a numpy array for later use in sklearn because there is not enough memory.
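A rough size estimate may help explain the memory pressure. The sketch below assumes the original dataset has the same 5000 rows as the reduced one and that array elements might be held as 8-byte values once loaded; the thread does not confirm either point, so treat the numbers as back-of-the-envelope figures only.

```python
n_rows = 5000           # from the first post; assumed unchanged by the filtering
elems_per_row = 400002  # array length in each "tracedata" cell
kept_per_row = 50       # number of indices kept per row

bytes_wide = 8          # assumption: 8 bytes per element in memory
bytes_narrow = 1        # 1 byte per element, as the 250 kB estimate implies

full_wide = n_rows * elems_per_row * bytes_wide
full_narrow = n_rows * elems_per_row * bytes_narrow
reduced_narrow = n_rows * kept_per_row * bytes_narrow

print(full_wide / 1e9, "GB if every element takes 8 bytes")
print(full_narrow / 1e9, "GB if every element takes 1 byte")
print(reduced_narrow / 1e3, "kB for the reduced 5000 x 50 table")
```

Under these assumptions the full data would need about 16 GB at 8 bytes per element, which is the machine's entire RAM, while the reduced table is only 250 kB.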

Maybe there is some trick to export this data into a numpy array?

User 4 | 4/13/2016, 8:26:08 PM

Hi @archon, this makes sense -- if you are starting with lots of data and filtering or aggregating, the operations will be lazily evaluated and you'll only see a large amount of CPU usage upon materializing those values (on save, or on training a model).

If you are using GraphLab Create, the to_numpy function will actually return a scalable implementation of numpy, which can handle datasets larger than RAM (and these can be used directly in scikit-learn). This is a feature of GraphLab Create and is currently not available in the open source SFrame package. Note that because of the lazy evaluation strategy mentioned above, this operation may take a while (since it's actually computing -- "materializing" -- the operations specified before the conversion).
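For the open source package, one possible workaround (not confirmed anywhere in this thread) is to fill a small preallocated numpy array one row at a time, so that only a single full-length array is in memory at once. The sketch below uses plain numpy; `rows_to_matrix` and `fake_rows` are hypothetical helpers, and the generator merely stands in for iterating over the large column. Whether row-wise iteration over an SFrame column actually avoids the memory spike would need to be tested.

```python
import numpy as np

def rows_to_matrix(rows, best_indices, n_rows, dtype=np.int16):
    """Copy only the selected elements of each row into a preallocated matrix.

    int16 is used because it safely covers the stated -127 to 128 range
    (int8 tops out at 127).
    """
    idx = np.asarray(best_indices, dtype=np.intp)
    out = np.empty((n_rows, len(idx)), dtype=dtype)
    for i, row in enumerate(rows):
        # Only one full-length row is held in memory at a time.
        out[i] = np.asarray(row)[idx]
    return out

def fake_rows(n_rows, row_len):
    """Hypothetical stand-in for iterating over the large column."""
    rng = np.random.default_rng(0)
    for _ in range(n_rows):
        yield rng.integers(-127, 129, size=row_len)  # upper bound is exclusive

best_indices = [0, 5, 9]  # placeholder for the real list of 50 indices
features = rows_to_matrix(fake_rows(4, 10), best_indices, n_rows=4)
print(features.shape, features.dtype)
```

The resulting array is an ordinary in-memory numpy array, so it could be passed to scikit-learn as usual.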