SFrame and multiprocessing

User 1912 | 5/12/2015, 12:59:42 PM

I have essentially the same question and the same problem as the one raised in this post. The answers to that post went in another direction, so I am posting this new question.
I am currently sending the path of a saved SFrame to all my workers, and they read it back. But this causes my program to hang: no error, no output. Also, htop shows CPU usage at 0%. Any idea how to fix it?

Comments

User 91 | 5/12/2015, 3:15:10 PM

Without seeing the code, I can't exactly say what the issue is. There are many possibilities.

Our backend is C++ and we burn all the cores that we have access to, so using multiprocessing is not necessarily going to help you. In fact, it might make your code run slower. Could you elaborate on the use case?


User 1912 | 5/12/2015, 4:27:47 PM

I trained an ItemSimilarityRecommender and I want to simulate users' behaviour when receiving a recommendation. They can accept or reject a recommendation based on some criteria, so I need to iterate through users one by one and check my scenarios. Since the number of users is large, I want to split the users into subsets and have workers produce the recommendations for each subset in parallel. Results are stored in a multiprocessing.Manager().dict() object where the key is the user_id and the value is a list of recommended item_ids. I hope my code here clears things up.
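
The setup described above (split the users into subsets, let one worker handle each subset, collect a user_id to item list mapping) can be sketched in plain Python 3 as follows. This is only an illustration under assumptions: `chunk`, `recommend_for_users` and `run_parallel` are hypothetical names, the worker is a stand-in for the real per-user simulation, and results are returned from the workers rather than written to a Manager dict. It does not use GraphLab and does not diagnose the hang.

```python
import multiprocessing as mp


def chunk(items, n_chunks):
    """Split a list into n_chunks nearly equal, contiguous parts."""
    size, extra = divmod(len(items), n_chunks)
    parts, start = [], 0
    for i in range(n_chunks):
        # The first `extra` parts each take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def recommend_for_users(user_ids):
    """Hypothetical worker: returns {user_id: [item_id, ...]} for one subset.

    A real worker would run the simulation here; this stand-in only
    produces placeholder item ids so the sketch is self-contained.
    """
    return {uid: [uid * 10 + k for k in range(3)] for uid in user_ids}


def run_parallel(user_ids, n_workers=2):
    """Run one task per subset and merge the per-subset dictionaries."""
    results = {}
    with mp.Pool(processes=n_workers) as pool:
        for partial in pool.map(recommend_for_users, chunk(user_ids, n_workers)):
            results.update(partial)
    return results


# Usage (call from a script guarded by `if __name__ == "__main__":`):
#     results = run_parallel([1, 2, 3, 4, 5])
```

Returning a plain dictionary from each worker and merging in the parent keeps shared state out of the workers, which tends to be easier to debug than a shared Manager dict.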


User 91 | 5/12/2015, 5:45:43 PM

The filter_by operator that you use will itself use all cores, and that could potentially hurt you when it is called from several processes. I believe that the various batch operators of the SFrame can be used to express the same code. Here is a sample:

```python
import graphlab as gl
import numpy as np
import random as random

# Train a model
sf = gl.SFrame.read_csv('ratings.small')
model = gl.item_similarity_recommender.create(sf, 'user_id', 'movie_id', 'rating')

# Get recs
recs = model.recommend()

# Let us assume that you have a list of editors.
editors = [101076, 103306, 105395, 108027, 109383, 105361]

# The recommended list for the editors.
editor_recom_list = recs.filter_by(editors, 'user_id')

# Say your simulation generates this
num_editor_recs = editor_recom_list.num_rows()
editor_recom_list['should_accept'] = gl.SArray(np.random.rand(num_editor_recs, 1)) > 0.5
editor_recom_list['random_movie'] = [random.choice(recs['movie_id']) for i in range(100)]

editor_recom_list['items_recommended'] = editor_recom_list.apply(
    lambda x: x['movie_id'] if x['should_accept'] == 1 else x['random_movie'])
```

At this point you have all the items recommended for each of the editors. You can then (optionally) view them as a list using the unstack operator:

```
editor_recom_list[['user_id', 'items_recommended']].unstack(
    'items_recommended', new_column_name='list_of_items')

Data:
Columns:
    user_id int
    list_of_items list

Rows: 10

Data:
+---------+-------------------------------+
| user_id |         list_of_items         |
+---------+-------------------------------+
|  101076 | [8116, 7158, 8497, 15307, ... |
|  103306 | [15307, 16272, 12508, 3463... |
|  105395 | [7158, 8707, 1719, 1902, 4... |
|  108027 | [14403, 9149, 17085, 15803... |
|  109383 | [13378, 13207, 15662, 1639... |
|  105361 | [11775, 17112, 5296, 11164... |
|  108317 | [5421, 3521, 15689, 10778,... |
|  103719 | [9628, 2564, 6424, 10464, ... |
|  101815 | [14218, 9584, 11512, 3376,... |
|  106952 | [13835, 10088, 13584, 1588... |
+---------+-------------------------------+
[10 rows x 2 columns]
```

The operators that are available with the SFrame are very extensive, and if your goal is to make sure things run fast (in parallel) and scale easily, it would be best to express your code using them. It takes a bit of time, but once you get there, it's really amazing!
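
To make the unstack step concrete, here is a plain-Python sketch of what it computes: all values that share a user_id are gathered into one list per user. The helper name `unstack_to_lists` and the sample rows are made up for illustration; the sketch only mirrors the idea and says nothing about how the SFrame implements it.

```python
from collections import defaultdict


def unstack_to_lists(rows, key, value):
    """Group `value` entries by `key`, producing one list per distinct key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return dict(grouped)


# Made-up sample rows: one row per (user, recommended item) pair.
rows = [
    {"user_id": 101076, "items_recommended": 8116},
    {"user_id": 103306, "items_recommended": 15307},
    {"user_id": 101076, "items_recommended": 7158},
]

lists_per_user = unstack_to_lists(rows, "user_id", "items_recommended")
```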


User 1912 | 5/13/2015, 9:19:14 AM

Thanks for the explanation, but I am not sure I can follow one step. The line

editor_recom_list['random_movie'] = [random.choice(recs['movie_id']) for i in range(100)]

produces 100 random movies from the set of all available movies. But why 100? I imagine it should also be num_editor_recs.

Something else I cannot see: when you assign a random_movie to a user, you should check whether he/she has already rated or liked that movie, the same behaviour as exclude_known. Can you tell me which line takes care of that? The above line of code should produce random movies for each user based on his/her own list of recommended movies, not on the whole set of recommended items.
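
The requirement described here can be sketched in plain Python: for each user, draw the random item from that user's own recommended items, skipping anything the user has already rated. The function and variable names are hypothetical and the data is made up; this is not GraphLab code, only a statement of the intended logic.

```python
import random


def pick_random_unseen(candidates, already_rated, rng=random):
    """Pick a random candidate the user has not rated; None if none is left."""
    unseen = [item for item in candidates if item not in already_rated]
    return rng.choice(unseen) if unseen else None


# Made-up example: per-user recommended items and already-rated items.
recommended = {101076: [8116, 7158, 8497], 103306: [15307, 16272]}
rated = {101076: {7158}, 103306: {15307, 16272}}

rng = random.Random(0)  # seeded only to make the sketch repeatable
random_movie = {
    user: pick_random_unseen(items, rated.get(user, set()), rng)
    for user, items in recommended.items()
}
```

A user whose candidates have all been rated gets None here; a real simulation would need to decide what to do in that case.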


User 91 | 5/13/2015, 2:52:29 PM

The 100 was a typo. I have corrected it. As you said, it should be 'num_editor_recs'.

I suppose I misunderstood your intentions. You should be able to modify the code to do the same. Do let me know if you have any issues!

Thanks.


User 1912 | 5/15/2015, 9:50:20 AM

With this I come back to the main problem, which is producing recommendations one user at a time, since I first need to extract the available items for each user and then assign a random item from this set. I cannot see any other option here. I would appreciate it if you could share some ideas.


User 1912 | 5/15/2015, 11:59:30 AM

this is how I manage to do it so far:

# get the list of available items for each user
user_items = num_editor_recs[['user_id', 'item_id']].unstack('item_id', new_column_name='list_of_items')

# for each user, choose a random item from his available items
num_editor_recs['random_movie'] = [random.choice(user_items.filter_by(record['user_id'], 'user_id')['list_of_items'][0]) for record in num_editor_recs]

doesn't look like the fastest way


User 91 | 5/15/2015, 6:42:25 PM

The following line is going to be very slow because it uses a for loop that moves data to and from Python:

num_editor_recs['random_movie'] = [random.choice(user_items.filter_by(record['user_id'], 'user_id')['list_of_items'][0]) for record in num_editor_recs]

You can use an apply function instead (I could not quite follow what your code is trying to do, so this is just an example):

num_editor_recs['random_movie'] = num_editor_recs.apply(lambda x: random.choice(x['list_of_items']))


User 1912 | 5/18/2015, 6:40:05 AM

The line you suggested is different from what I intended to do. My code filters one SFrame (user_items) based on the entries of another SFrame (num_editor_recs).

I'm afraid what I am trying to achieve is not possible with built-in GraphLab functions. So I should either stick to processing my data serially or implement my own parallelization. I am still not sure how to use an SFrame inside my worker function.
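
Filtering one table by the entries of another, as described above, amounts conceptually to a join on user_id: once each user's item list is attached to the rows, the random choice can run as a single row-wise pass instead of one filter per record. The sketch below shows that shape with plain dictionaries and made-up data. `attach_item_lists` is a hypothetical helper, and whether the same join-then-apply shape is fast enough on SFrames is an assumption that would need to be verified.

```python
import random


def attach_item_lists(records, user_items):
    """Add each user's item list to every record (a join on user_id)."""
    return [dict(rec, list_of_items=user_items[rec["user_id"]]) for rec in records]


# Made-up data: one record per recommendation, plus per-user item lists.
records = [
    {"user_id": 1, "item_id": 10},
    {"user_id": 1, "item_id": 11},
    {"user_id": 2, "item_id": 20},
]
user_items = {1: [10, 11], 2: [20]}

rng = random.Random(0)  # seeded only to make the sketch repeatable
joined = attach_item_lists(records, user_items)

# Single row-wise pass, analogous to an apply over the joined rows.
for row in joined:
    row["random_movie"] = rng.choice(row["list_of_items"])
```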