How to improve np.array and SArray interoperability

User 1933 | 6/26/2015, 5:48:58 PM

This is partly related to another question (http://forum.dato.com/discussion/1092/numpy-style-array-indexing-of-sarray), but covers some other issues.

Here's a simplified version of what I'm trying to accomplish. I'm running a Monte Carlo sampling algorithm using SFrame.apply(). Leaving the details aside, one apply function generates an array for each row in the input frame, and the result is represented as an SArray:

S = frame_1.apply(s_sampler)

And S ends up something like this:

dtype: array
Rows: 3401
[array('d', [1.02022062491536, 3.0147649122050932, 1.0500130204085698, 2.1777397625595167, 3.3712806239872464, 1.1700452954567813, 2.2128725061551973, 2.1496240573205694, 0.5718586621263493, 0.9764561976230315]), array('d', [2.031935481993924, 0.7936121753778989, 2.6800444317340153, 1.9886172229772074, 2.976646416375046, 1.6785833362083178, 0.28737094622472703, -0.6851623647457843, 0.8573206535415058, 2.276469051715521]), array('d', [-0.03293520982946774, 0.4812990606673828, 1.1408418306672157, 0.6854475487916877, 1.041087464234364, 2.1177105583320297, 0.643928572804609, 1.624788167942495, 2.2383750718336572, 2.84988423023362]), ... ]
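For reference, a toy stand-in for s_sampler that produces output of this shape would look something like this (the real sampler is more involved, and the parameters here are just placeholders):

import numpy as np

def s_sampler(row):
    # SFrame.apply passes each row in as a dict; returning a list of floats
    # gives an SArray of dtype array, one length-10 array per row
    return list(np.random.normal(size=10))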

That's fine, but I then need to use S in another apply function, where I do some linear algebra. There's more going on, but here's the relevant bit:

 def alpha_sampler(row):
     users = row['users']  # this is a list of row indices
     # some other stuff...
     result = ... + np.inner(Z[j], S[users]) + ...  # Z[j] is a standard numpy array
     # and some more stuff...

So, there are two issues here. First, I can't index S with a list, because it is an SArray. @punit informed me that I could generate a boolean mask to accomplish the indexing, but that's actually pretty computationally expensive (I think), especially in cases where len(users) is large (it can be in the thousands). But even assuming I had the indexing ironed out, the fact that I need to do an inner product means that S[users] has to get converted to a standard numpy array anyway, which is itself pretty expensive (keeping in mind that all this sampling needs to happen many times).
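To make the comparison concrete, here's roughly what the two indexing styles look like with toy sizes (the boolean-mask workaround has to touch every row of S on every lookup, which is why I think it gets expensive):

import numpy as np

users = [3, 17, 42]                        # list of row indices for one sample

# numpy: direct fancy indexing, O(len(users))
S_np = np.random.rand(3401, 10)
rows = S_np[users]                         # shape (len(users), 10)
Z_j = np.random.rand(10)                   # stand-in for Z[j]
result = np.inner(Z_j, rows)               # shape (len(users),)

# SArray: the boolean-mask workaround, O(len(S)) per lookup
# user_set = set(users)
# mask = graphlab.SArray([i in user_set for i in range(len(S))])
# rows = np.array(list(S[mask]))           # still needs a numpy conversion for np.inner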

My current solution is to convert S to an np.array at the point that it's generated, like this:

S = np.array([row for row in frame_1.apply(s_sampler)])

This makes the indexing and inner product in alpha_sampler fairly fast, but the actual conversion to the np.array is painfully slow (again, since it has to be done many times). I have analogous problems throughout the model I'm building and need to optimize this. Any ideas?

Comments

User 1178 | 7/8/2015, 6:15:02 PM

Hello,

SArray currently does not support list indexing, and any workaround that simulates it is not going to perform very well.

If you need to do a lot of numpy operations, the best approach is probably to convert the SArray to a numpy array first and then do all of your operations on the numpy array.
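For example, something along these lines (a rough sketch reusing the names from your post; the placeholder values and the exact algebra are up to you):

import numpy as np

# one up-front conversion: shape (n_rows, 10)
S_np = np.array(list(frame_1.apply(s_sampler)))

Z = np.random.rand(5, 10)                  # placeholder for your real Z
j = 0                                      # placeholder index

def alpha_sampler(row):
    users = row['users']                   # list of row indices
    # ... other computation ...
    return np.inner(Z[j], S_np[users])     # stays entirely in numpy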

Thanks!

Ping