SArray.apply() calls my function only 100 times

User 5167 | 5/1/2016, 8:52:42 PM

Hi,

I've been battling for hours to use the .apply() method with a predefined function on an SArray that contains more than 180,000 rows, and it seems to call my function only 100 times. Can anyone help me with this? I've attached the actual code I'm using.

This is the function definition to check whether a word exists in a dictionary. If so, it appends the value stored for that word; otherwise, it appends 0 to a given list:

    def dictwordcount(dDict, stWord, lstList):
        # first, define a function for a single dict element
        if stWord in dDict:
            lstList.append(dDict[stWord])
        else:
            lstList.append(0)

Now, use the above function in the .apply() method for the SA column (create a resulting list):

    sa = products['wordcount']  # 'products' is an SFrame; 'wordcount' is a dictionary column
    list_awesome = []  # create an empty list
    sa.apply(lambda x: dictwordcount(x, 'awesome', list_awesome))
    products.add_column(graphlab.SArray(list_awesome), name='awesome2')

Thanks,

Comments

User 16 | 5/1/2016, 11:49:19 PM

Hi @AtsmonY -

Sorry to hear you're having this issue. The first 100 rows of an apply are evaluated in Python (in order to determine type information); the rest of the rows are evaluated in C++ by our lambda workers. It sounds like the lambda workers are failing. Please check out this post: http://forum.dato.com/discussion/1946/how-to-generate-lambda-worker-test-report/p1


User 1189 | 5/2/2016, 7:01:46 PM

Hi,

Lambdas cannot capture mutable state this way. The lambdas are executed in parallel in subprocesses, so they cannot modify the "list_awesome" variable you are trying to capture.
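
A minimal sketch of the usual way around this, assuming the goal is one count per row: have the function return its result instead of appending to an outer list, and use the SArray that .apply() returns. The helper name dict_word_value and the sample rows are made up for illustration, and the GraphLab lines are left as comments so the sketch runs without GraphLab installed.

```python
def dict_word_value(word_counts, word):
    """Return the count stored for `word`, or 0 if the word is absent."""
    return word_counts.get(word, 0)


# Stand-in for the 'wordcount' column: a plain list of dicts (hypothetical data).
rows = [{'awesome': 2, 'great': 1}, {'bad': 3}, {}]

# Each call returns a value; nothing outside the function is mutated.
awesome_counts = [dict_word_value(row, 'awesome') for row in rows]
print(awesome_counts)  # [2, 0, 0]

# With GraphLab Create, the same function would be applied per row and the
# returned SArray assigned directly as a new column (not run here):
#
#   products['awesome2'] = products['wordcount'].apply(
#       lambda d: dict_word_value(d, 'awesome'), dtype=int)
```

Because the result comes back as the return value of .apply(), it does not matter which subprocess evaluated which row.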

Yucheng