User 2032 | 8/10/2015, 4:01:30 PM
I have a problem using `random_split` on a large SFrame (300k rows, 4k columns). The SFrame is generated by one very complex routine and safely stored on disk. I want to use it for classification, so I called `random_split` to get train and test frames. The call takes ages (I'm still waiting for it to finish) and memory explodes: over 150GB and still counting. This does not seem very out of core (not to mention lazy) to me.
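For reference, here is roughly what I'm running (the path, split fraction, and seed are placeholders, not my actual values):

```python
import graphlab as gl

# Load the SFrame I saved earlier (~300k rows, 4k columns); path is a placeholder
sf = gl.SFrame('path/to/saved_sframe')

# This is the call that runs for ages and blows up memory
train, test = sf.random_split(0.8, seed=42)
```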
Any idea what might be causing this? To me, a random split looks like it should just build a virtual index over the rows, so its cost should not depend on the width of the SFrame at all; for 300k rows it ought to take milliseconds. Is this caused by materialising the split frames on disk?
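To illustrate my mental model (continuing from the snippet above): I imagined something like one boolean per row plus a lazy logical filter, whose cost depends only on the row count and never on the 4k columns. This is just a sketch of what I expected, not a claim about how `random_split` is actually implemented:

```python
import random

# One random boolean per row, independent of column count
mask = gl.SArray([1 if random.random() < 0.8 else 0
                  for _ in range(sf.num_rows())])

train = sf[mask]      # logical filter; I'd expect this to stay lazy
test = sf[1 - mask]   # complement of the mask
```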