Create generator to loop through SArray

User 2167 | 8/14/2015, 8:39:11 PM

I'm trying to create a Python generator to loop through all the values in an SArray.

However, if I use:

for x in sarray:
    yield x

My memory (24 GB) fills up very quickly and the process crashes.

If I use:

for x in range(0, len(sarray)):
    yield sarray[x]

The loop finishes without failing, but it is extremely slow.

Is there any other way to loop faster through all the values in the SArray without filling the memory?

Cheers!

Comments

User 1592 | 8/15/2015, 7:26:38 AM

Hi Alan, what is the purpose of iterating over the SArray values? If you want to compute some transformation over the SArray values, a better approach would be to use our lambda function functionality. For example:

sarray = sarray.apply(lambda x: round(x,3))

This will run in parallel over all values in the SArray and round them to 3 decimal digits.

On the other hand, if you are looping over all values to compute some aggregate, look at our aggregate documentation.


User 1262 | 8/18/2015, 9:48:43 PM

Hi Danny, what I'm trying to do is not a transformation. I need to create an object that can iterate through all the values in my SArray. I use the object as an argument for a third-party Python library.

I tried building this object like:

def __iter__(self):
    for x in self.sarray:
        yield x

But as mentioned, my RAM fills up very quickly.

I ended up using:

def __iter__(self):
    for x in range(0, len(self.sarray)):
        yield self.sarray[x]

This one works fine but it's very slow.

Just wanted to see if there was another option? (it would be nice if the SArray had a function that returns a Python Generator object for the values so RAM doesn't fill while you iterate through them)


User 954 | 8/18/2015, 10:56:30 PM

Hi Alan,

Can't you pass "SArray.__iter__()" as an argument for the third-party Python lib?

<pre>
import graphlab as gl

sa = gl.SArray([1,2,3])
it = sa.__iter__()
print it.next()
</pre>


User 1262 | 8/19/2015, 11:29:15 PM

I get an error: "You can't pass a generator as the sentences argument. Try an iterator."
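The wording of that error suggests the library wants something it can iterate over more than once, which a generator object cannot offer because it is exhausted after a single pass. Below is a minimal sketch of the difference in plain Python, using a list as a stand-in for the SArray. The names `values_generator` and `ReiterableValues` are made up for illustration. Note that this only addresses the error message; it does not change how much memory each pass consumes.

```python
def values_generator(values):
    """Plain generator function: the object it returns can be consumed once."""
    for v in values:
        yield v


class ReiterableValues(object):
    """Hypothetical wrapper: every call to __iter__ starts a fresh pass."""

    def __init__(self, values):
        self.values = values

    def __iter__(self):
        # Ask the underlying container for a new iterator each time.
        return iter(self.values)


data = [1, 2, 3]  # stand-in for the SArray

gen = values_generator(data)
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted here

wrapped = ReiterableValues(data)
pass_a = list(wrapped)
pass_b = list(wrapped)  # a second pass works because __iter__ restarts
```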


User 1262 | 8/19/2015, 11:54:43 PM

It works if I only pass the SArray, but I get the same problem I was having originally, where my memory fills. It seems the only feasible option is still:

def __iter__(self):
    for x in range(0, len(self.sarray)):
        yield self.sarray[x]

User 954 | 8/20/2015, 1:45:06 AM

gl.SArray.__iter__() returns a Python generator object. What does your third-party Python library expect to receive? The following code would be slow because each iteration does a random access (self.sarray[x]) into the SArray.

<pre>
def __iter__(self):
    for x in range(0, len(self.sarray)):
        yield self.sarray[x]
</pre>


User 1262 | 8/20/2015, 1:49:01 PM

It expects an iterable (that is why passing the SArray by itself works fine), but it eventually crashes when my 24 GB of RAM fills up very quickly (which doesn't happen with the random access alternative).


User 15 | 8/20/2015, 5:27:40 PM

Hey Alan,

That definitely looks like a bug. The generator we use pulls 262,144 elements at a time. Is it possible that 262,144 elements of your SArray would fill up your memory? If not, then we must not be freeing that memory (or not freeing it fast enough).

Evan


User 1262 | 8/20/2015, 7:12:45 PM

It might; each element of the column is a big list of strings. Is there any way I can set that limit to be less than 262144?
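One rough way to check whether a single block of 262,144 rows could plausibly exhaust memory is to measure a small sample of rows and scale up. The sketch below is only an approximation: `estimate_block_bytes` is a hypothetical helper, the sample data is invented, and `sys.getsizeof` reports the size of Python objects only, not whatever GraphLab allocates internally.

```python
import sys


def estimate_block_bytes(sample_rows, block_size=262144):
    """Scale the average in-memory size of sample rows up to one block.

    Each row is assumed to be a list of strings, as described in the thread.
    """
    if not sample_rows:
        return 0

    def row_bytes(row):
        # Size of the list object itself plus every string it references.
        return sys.getsizeof(row) + sum(sys.getsizeof(s) for s in row)

    average = sum(row_bytes(r) for r in sample_rows) / float(len(sample_rows))
    return int(average * block_size)


# Invented sample: ten rows of 300 short strings each.
sample = [["some", "tokens", "here"] * 100 for _ in range(10)]
estimate = estimate_block_bytes(sample)
print("approx. %.2f GiB per block" % (estimate / float(2 ** 30)))
```

If the figure is close to the available RAM, a smaller block size is likely to help.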


User 15 | 8/20/2015, 8:49:49 PM

Unfortunately, the only way to set it is to modify the Python code in place in your GraphLab Create installation (at least it's not compiled into a binary!). Open (YOURPYTHONINSTALLATION)/lib/(possibly "python2.7", depending on your system, otherwise nothing here)/site-packages/graphlab/data_structures/sarray.py. It is line 585 in my source, so it's probably around there. Just grep the code for "262144" and change it to whatever you want.

In case the thing about the paths is confusing, here would be the correct path on my Linux machine: /home/evan/anaconda/envs/evan-default/lib/python2.7/site-packages/graphlab/data_structures/sarray.py

and here would be the correct path on my Windows machine: C:\Users\Evan\Anaconda\Lib\site-packages\graphlab\data_structures\sarray.py
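If grep is not available (on Windows, for example), the same search can be sketched in a few lines of Python. `find_literal` is a hypothetical helper, and the demo below runs against a temporary stand-in file whose contents are invented; point it at your own sarray.py path instead.

```python
import os
import tempfile


def find_literal(path, needle="262144"):
    """Return (line_number, line_text) pairs for lines containing needle."""
    hits = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((lineno, line.rstrip()))
    return hits


# Demo on a throwaway file; the content is a placeholder, not the real source.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as demo:
    demo.write("placeholder = 1\nblock = 262144\n")

hits = find_literal(demo_path)
os.remove(demo_path)
print(hits)
```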

Hope that helps!

Evan
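A possible middle ground that avoids editing the installed file is to read the data in slices of a chosen size, so that neither a full 262,144-element block nor one random access per element is needed. This is a sketch under assumptions that are not verified in this thread: that the SArray supports slice indexing (`sarray[start:stop]`) and that iterating a small slice only materializes that slice. `ChunkedIterable` and `chunk_size` are illustrative names, and a plain list stands in for the SArray.

```python
class ChunkedIterable(object):
    """Hypothetical re-iterable wrapper that reads values slice by slice."""

    def __init__(self, values, chunk_size=1000):
        if chunk_size < 1:
            raise ValueError("chunk_size must be at least 1")
        self.values = values
        self.chunk_size = chunk_size

    def __iter__(self):
        total = len(self.values)
        for start in range(0, total, self.chunk_size):
            # One slice per chunk instead of one lookup per element.
            for value in self.values[start:start + self.chunk_size]:
                yield value


items = list(range(10))  # stand-in for the SArray
wrapped = ChunkedIterable(items, chunk_size=3)
first = list(wrapped)
second = list(wrapped)  # __iter__ starts over, so a second pass works
```

Because `__iter__` is defined on the class, the object can be passed to a library that needs to iterate more than once.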