Modify global variable in SFrame.apply?

User 1933 | 7/1/2015, 11:53:01 PM

Given some global variable (in my case an M x N numpy array stored in memory) and some arbitrary SFrame, is there any way to modify the globally defined array within an SFrame.apply operation? I.e.

myArray = np.zeros((M, N))

def myApplyFunction(frame_row):
    x = <<result of some operation on this row of the SFrame (integer)>>
    myArray[some_row_idx, some_col_idx] += x

mySFrame.apply(myApplyFunction)

I've tried this, and some values in the array get updated, but certainly not all those that should. This may very well be impossible, but hopefully there's a way to accomplish this. If not, here's a more detailed explanation of what I'm trying to accomplish:

I'm working on an LDA-esque model, and as part of it I need to calculate two matrices: a word-topic matrix (N_unique_words x N_topics) and an item-topic matrix (N_documents x N_topics). I need fast indexing along the rows and columns of these matrices, so I need to store them in memory as numpy arrays.

Now, I need to derive these from an SFrame that looks like this:

print grp_item[['url_idx','term_topic']]
+---------+-------------------------------+
| url_idx |           term_topic          |
+---------+-------------------------------+
|   5302  | [[14579.0, 15.0], [14579.0... |
|   232   | [[298.0, 14.0], [298.0, 7.... |
|   2238  | [[6469.0, 14.0], [6469.0, ... |
|   738   | [[6034.0, 10.0], [6034.0, ... |
|   1860  | [[3252.0, 7.0], [3252.0, 7... |
|   2623  | [[208.0, 14.0], [208.0, 17... |
|   4127  | [[80.0, 12.0], [80.0, 9.0]... |
|   926   | [[1249.0, 9.0], [1249.0, 9... |
|   4392  | [[10867.0, 18.0], [10867.0... |
|   4100  | [[8496.0, 19.0], [2708.0, ... |
+---------+-------------------------------+
[5344 rows x 2 columns]

The url_idx column is just a document ID, and the term_topic column is a list of all the words in the document and the topics they've been assigned to, stored as [word_id, topic_id] pairs. My current approach is this:

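word_topic = np.zeros((vocab_size, K))
item_topic = np.zeros((n_items, K))
for row in grp_item:
    j = item_idx[row['url_idx']]  # row of this document in item_topic
    for term, topic in row['term_topic']:
        # IDs are stored as floats in the SFrame, so cast before indexing
        term, topic = int(term), int(topic)
        word_topic[term, topic] += 1
        item_topic[j, topic] += 1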

This is simple and works, but it is too slow, so I'd ideally like to parallelize the process via an apply on the SFrame. So far I haven't had any luck. My big hope was to accomplish this more or less as described at the beginning of the question, but that doesn't seem to work. An alternative is to define new matrices within the apply function and then sum them all up, but this ends up using too much memory (i.e. storing as many instances of the matrices as we have lambda workers isn't feasible). Ideas?

Comments

User 1178 | 7/6/2015, 4:39:03 PM

Hi,

SFrame lambda operations can access global variables but are not designed to modify them. The reason is that each lambda worker takes a copy of the global variables and uses it in its own process, so a modification made in a worker is not propagated back to the main process. The reason you sometimes see the array modified is that we try running the lambda function in the main process to determine the output type.
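
To illustrate the copy behaviour in general terms, here is a minimal sketch using plain os.fork (POSIX only). It is only an analogy for process isolation, not how the lambda workers are actually launched; the names counts and parent_value_after_child_update are made up for the example.

```python
import os
import numpy as np

# Global array, playing the role of myArray in the question (hypothetical name).
counts = np.zeros(3)

def parent_value_after_child_update():
    """Let a child process modify `counts`, then report what the parent sees."""
    pid = os.fork()              # POSIX only; the child gets its own copy of memory
    if pid == 0:
        try:
            counts[0] += 1       # changes the child's copy only
        finally:
            os._exit(0)          # leave the child without running parent cleanup
    os.waitpid(pid, 0)           # wait until the child has finished
    return counts[0]             # the parent's array is untouched
```

The write in the child is never seen by the parent, which is the same reason updates made inside apply do not come back.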

There is no simple way to do this directly on the Python side, but you may be able to achieve it with our SDK. With the SDK, you program against our C++ interface directly, so you can iterate through SFrames, run parallel operations, record global variables, etc. The documentation is here.
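
If you stay in Python, one thing that might help is to keep the loop in the main process but let numpy do the per-row counting. This is only a sketch under assumptions: count_matrices is a hypothetical helper, rows is any iterable of dict-like rows with the url_idx and term_topic fields from your printout, and item_idx is your existing mapping from url_idx to a row number. Whether it is actually faster depends on how long the term_topic lists are.

```python
import numpy as np

def count_matrices(rows, item_idx, vocab_size, n_items, K):
    """Build word-topic and item-topic count matrices in the main process."""
    word_topic = np.zeros((vocab_size, K))
    item_topic = np.zeros((n_items, K))
    for row in rows:
        # One (n_pairs, 2) integer array per document: [word_id, topic_id]
        pairs = np.asarray(row["term_topic"], dtype=np.int64).reshape(-1, 2)
        terms, topics = pairs[:, 0], pairs[:, 1]
        j = item_idx[row["url_idx"]]
        # np.add.at accumulates correctly when the same (term, topic) repeats;
        # plain fancy-index += would count a repeated pair only once.
        np.add.at(word_topic, (terms, topics), 1)
        # All pairs of a document share row j, so a histogram of topics suffices.
        item_topic[j] += np.bincount(topics, minlength=K)
    return word_topic, item_topic
```

Usage would look like word_topic, item_topic = count_matrices(grp_item, item_idx, vocab_size, n_items, K), with only one copy of each matrix held in memory.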

Thanks!

Ping