Slow SFrame loading with sparse data

User 4898 | 4/16/2016, 4:37:41 PM

Hello, I am having a difficult time loading a sparse matrix. I have written the code below:

    import pandas as pd
    import graphlab as gl

    mat = sparse_mat.toarray()   # sparse_mat is a scipy.sparse matrix
    df = pd.DataFrame(mat)
    sf = gl.SFrame(df)

The last step is the bottleneck. The previous steps took ~1 minute, but the last step has been running for over an hour and is still going. I am running this on an m4.10xlarge AWS instance that has 160 GiB of RAM. The data frame has the following dimensions: rows = 12821, cols = 1517490. Any ideas?

Comments

User 4 | 4/18/2016, 1:28:11 AM

Hi @CrystalHumphries, I think the underlying issue here may be that SFrame is not optimized for this many columns. SFrame is a column-oriented format, so its Achilles' heel is row-oriented access and write patterns; it is optimized for reading and writing one column at a time. There may be some room for improvement here, however (there may be performance bugs we haven't encountered yet). I'll bring this up with the team and see if there is anything more we can do here.
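To illustrate the access pattern SFrame favors (a minimal sketch; the column names here are hypothetical):

    import graphlab as gl

    sf = gl.SFrame({'x': [1, 2, 3]})

    # Fast: column-at-a-time reads and writes, which the
    # column-oriented storage is designed for.
    col = sf['x']        # read one whole column
    sf['y'] = col * 2    # write one new column

    # Constructing an SFrame with ~1.5 million columns forces the
    # opposite pattern: every column must be touched for every row.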


User 1189 | 4/18/2016, 5:34:02 PM

Hi,

toarray() turns it into a dense matrix of 12821 * 1517490 elements. A quick back-of-the-envelope calculation puts that at about 145 GiB of RAM if represented completely (assuming 8-byte float64 elements). Linux copy-on-write means the dense conversion may still succeed, since most of the memory is zero (assuming your matrix is sufficiently sparse) and zero pages need not be physically allocated.
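For reference, the arithmetic behind that estimate (a quick sketch, assuming float64 storage):

    rows, cols = 12821, 1517490
    # 8 bytes per float64 element in the dense representation
    dense_bytes = rows * cols * 8
    print(dense_bytes / 2**30)   # ~144.96, i.e. roughly 145 GiB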

So the SFrame conversion will then try to serialize the entire dense matrix and send it to SFrame for conversion. This is going to be really slow, since the full representation is ~145 GiB, and it is likely to fail before it finishes.

If your data is sparse to begin with, you do not want to convert it to a dense representation; you want to maintain sparsity. SFrame has a dict column type for exactly that purpose: each row can store only its nonzero entries as a {column_index: value} dictionary, as in the sketch below.
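A minimal sketch of that conversion, assuming sparse_mat is a scipy.sparse matrix as in the original post (the column name 'features' and the tiny demo matrix are just for illustration):

    import graphlab as gl
    from scipy import sparse

    # Small stand-in for sparse_mat from the original post.
    sparse_mat = sparse.csr_matrix([[0.0, 1.5, 0.0],
                                    [0.0, 0.0, 2.0]])

    csr = sparse_mat.tocsr()  # row-major layout for cheap per-row slicing
    rows = []
    for i in range(csr.shape[0]):
        start, end = csr.indptr[i], csr.indptr[i + 1]
        # Keep only row i's nonzero entries as {column_index: value}.
        rows.append(dict(zip(csr.indices[start:end].tolist(),
                             csr.data[start:end].tolist())))

    sf = gl.SFrame({'features': rows})

This way both the conversion and the resulting SFrame scale with the number of nonzeros rather than with rows * cols, and the whole thing stays in a single dict-typed column, which matches SFrame's column-at-a-time strengths.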

Yucheng