Memory explosion at SFrame.random_split

User 2032 | 8/10/2015, 4:01:30 PM

Hi guys,

I have a problem using random_split with a large SFrame (300k rows, 4k columns). The SFrame, generated by one very complex routine, is safely stored on disk. I want to use it for classification, so I called random_split to get train and test frames. The routine takes ages (I'm still waiting for it to finish) and memory explodes - to over 150GB and still counting. This does not seem very out-of-core (not to mention lazy) to me.

Any idea what might be causing this? Random split seems to me like it should just create a virtual index and not depend on the size of the SFrame at all - for 300k rows it should take milliseconds. Is this caused by materialising the split frames on disk?


User 1359 | 8/10/2015, 9:30:33 PM

Hi Johnny,

When you say the memory increases to 150GB, can you be more specific about what type of memory, or where you are measuring it? It would be normal for the virtual memory to grow to a very large size; however, that is not an accurate measure of the amount of RAM (or any other resource) actually in use.

random_split does not create a virtual index of the parent SFrame, as SFrame is not optimized for random access on the data structure, but rather for sequential access. Separate structures are created for each of the two returned SFrames.

After splitting the data, it would be best to save the train and test SFrames for future use.

Let me know if I can be of more help, Dick

User 2032 | 8/11/2015, 7:45:16 AM

My bad for not clarifying - the server has 100GB of RAM, 60GB of swap, and 50 CPU cores. The random_split did finish, but it consumed all RAM and most of the swap along the way and took considerable time (> 20 min for 300k rows). I am worried about what this would mean for k-fold validation.
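(For reference, generating the fold assignments themselves should be cheap - a plain-Python sketch of assigning 300k row indices to folds, just to illustrate the cost of the indexing step; this is my own stand-in, not GraphLab code:)

```python
import random

def kfold_indices(n_rows, k, seed=0):
    """Randomly assign each of n_rows row indices to one of k folds."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    # Round-robin over the shuffled indices -> k near-equal folds
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(300000, 10)
assert sum(len(f) for f in folds) == 300000  # every row in exactly one fold
```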

User 2032 | 8/11/2015, 10:09:42 AM

And to touch on random access: shouldn't SFrame have different implementations for spinning disks (SATA) and SSDs? As far as I am aware, on SSDs random access is not much more expensive than sequential access.

User 1189 | 8/11/2015, 5:44:58 PM


Also, I am assuming each row is quite large?

We made the memory caching thresholds a lot more aggressive in this version, and that might be causing some issues. Alternatively, the large number of CPU cores might also be causing memory issues (with regard to per-thread memory usage).

Can I take a look at the output of gl.get_runtime_config()? Specifically, the entries GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY and GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE.

A workaround might be to decrease these two values to something much smaller (say just a few GB?), modifiable with:

    mem_limit = [some value in bytes]
    gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', mem_limit)
    gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', mem_limit)

You can also change the effective number of threads by running this before starting python:

    export OMP_NUM_THREADS=16  # or any other value
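The same variable can also be set from inside Python, as long as it happens before the library is first imported (a sketch; whether your setup picks it up this way is an assumption worth verifying):

```python
import os

# Must run before `import graphlab`, i.e. before worker threads are spawned.
os.environ['OMP_NUM_THREADS'] = '16'
```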

User 2032 | 8/12/2015, 9:03:55 AM

Hi Youcheng,

I might have been too liberal with the settings you provided, giving them 100GB and 50GB respectively. I have tuned the settings to:

    gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 50)
    gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 2147483648 * 35)  # 70GB
    gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 134217728 * 500)  # ~62.5GB
    gl.set_runtime_config('GRAPHLAB_ODBC_BUFFER_MAX_ROWS', 10000)
    gl.set_runtime_config('GRAPHLAB_ODBC_BUFFER_SIZE', 2147483648 * 5)  # 10GB
    gl.set_runtime_config('GRAPHLAB_SFRAME_JOIN_BUFFER_NUM_CELLS', 52428800 / 100)
    gl.set_runtime_config('GRAPHLAB_SFRAME_SORT_PIVOT_ESTIMATION_SAMPLE_SIZE', 2000000 / 100)
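(As an aside, those raw byte multiplications are easy to get wrong; a tiny helper of my own - `gib` is not a GraphLab function - makes the intent explicit:)

```python
def gib(n):
    """Convert gibibytes to bytes (1 GiB = 2**30 bytes)."""
    return n * 1024 ** 3

# The cache-capacity values above, expressed directly:
assert gib(70) == 2147483648 * 35   # 70 GiB
assert gib(10) == 2147483648 * 5    # 10 GiB
```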

but I still see graphlab hitting swap.

Limiting the number of threads is not really something I am keen on, and I would treat it as a last resort - some routines work fine and utilise all 50 cores for long periods, and then there are others that hit memory problems. I would rather work out a solution that lets me use all the cores - 2GB per core is quite a lot, after all...

When I was reporting the problem, each row had ca. 3500 columns (but only primitive types in them, no strings).

User 1189 | 8/12/2015, 4:39:10 PM

Firstly, thanks for raising this performance issue: I should make the random_split lazy. That should improve performance.

I am unable to reproduce the problem effectively. There are some possibilities, though; I suspect memory fragmentation. How large is the SFrame on disk? For now, try this workaround:

Instead of

    a, b = sf.random_split(percent)

let's write it as

    import random
    p = gl.SArray([random.random() < percent for i in range(len(sf))])
    a = sf[p]
    b = sf[1 - p]

(Note the `<`: a row should go into `a` with probability `percent`, matching random_split's semantics.)
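To sanity-check the masking idea, here it is in plain Python, with lists standing in for the SFrame (my own stand-in, not GraphLab code), showing that every row lands on exactly one side:

```python
import random

random.seed(0)
rows = list(range(1000))               # stand-in for the SFrame's rows
percent = 0.8

# One Bernoulli draw per row: True -> row goes to `a`, False -> to `b`
mask = [random.random() < percent for _ in rows]
a = [r for r, keep in zip(rows, mask) if keep]
b = [r for r, keep in zip(rows, mask) if not keep]

assert len(a) + len(b) == len(rows)    # the two sides partition the rows
```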

User 2032 | 8/12/2015, 6:12:19 PM


the SFrame was 35GB on disk when saved. You should be able to reproduce it by creating an SFrame with 300k rows and 3k columns.

In the meantime I managed to get around it by reducing the number of columns to 200 (losing some information that might have been overkill anyway). With 200 columns everything works like a charm.

User 1189 | 8/12/2015, 6:35:23 PM

I tried it with 4K columns of integers but I couldn't quite reproduce it. Are all the columns numbers? Maybe I will try again, simulating 50 cores.

Does the random split alternative I suggested work?

User 2032 | 8/13/2015, 10:19:02 AM

Hi Youcheng,

I did not need to test the alternative because limiting the number of columns gave satisfactory results - I'm on a tight schedule.

I believe I had one large dict column at the time, with ca. 100 entries per row. Simulating 50 cores would be useful in tracing the problem.