Speeding up recommendations

User 3218 | 6/29/2016, 8:05:21 PM

I am working with a data set where users rate other users. This data is used to recommend users to each other. There are around 10M users and a billion ratings among the users (yes, some of our users are highly engaged!).

Right now, creating a factorization recommender to produce just 20 recommendations per user takes 18+ hours on a C3.xlarge instance. We can bump up the cores with bigger C3 instances, but the time taken stays within the same order of magnitude. The code is textbook:

import graphlab as gl

data = gl.SFrame.read_csv(local_ratings_file, header=False, column_type_hints=int, verbose=False)
model = gl.recommender.create(data, user_id='X1', item_id='X2', target='X3', verbose=True)
results = model.recommend(users=None, k=total_recs)

Is there a way to speed things up?

Comments

User 19 | 6/29/2016, 8:39:05 PM

Hi Delip,

It sounds like you're in a batch setting where you want to make recommendations for all your users at once. This operation will greatly depend on the number of unique items in the dataset and the number of factors in the model. You may want to check that all of these items are worth recommending. You can easily recommend a subset of them via the items argument.
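As a rough sketch of what narrowing the candidate items could look like: the filter below keeps only items that received at least a minimum number of ratings. That criterion, the helper name `select_candidate_items`, and the sample data are assumptions made up for illustration; the right notion of "worth recommending" depends on your data.

```python
from collections import Counter

def select_candidate_items(ratings, min_ratings=5):
    """Return sorted item ids that received at least min_ratings ratings.

    ratings: iterable of (user_id, item_id, rating) tuples.
    The min_ratings threshold is a hypothetical criterion, not a GraphLab feature.
    """
    counts = Counter(item_id for _user_id, item_id, _rating in ratings)
    return sorted(item for item, n in counts.items() if n >= min_ratings)

# Tiny made-up data set, only to show the shape of the input.
sample_ratings = [(1, 10, 5), (2, 10, 4), (3, 10, 3),
                  (1, 20, 2), (2, 30, 5), (3, 30, 1)]
candidates = select_candidate_items(sample_ratings, min_ratings=2)

# With a trained model, the candidates could then be passed to recommend(),
# assuming its items argument accepts an SArray of item ids:
# results = model.recommend(users=None, k=20, items=gl.SArray(candidates))
```

Fewer candidate items means less scoring work per user, which is where the time should go down.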

Otherwise, I would recommend simply splitting your users into groups and parallelizing across the groups of users. If you're using Dato Distributed we have tools to make this a bit easier; otherwise I would copy this model onto multiple machines, make recommendations for each batch, and then combine them. I'd recommend using multiple machines because our recommend function uses as many cores as possible.
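A minimal sketch of the manual splitting step, in plain Python. The helper name `split_into_batches` and the batch count are hypothetical; how you ship each batch to a machine and collect the results is up to your setup.

```python
def split_into_batches(user_ids, num_batches):
    """Split user_ids into num_batches contiguous, near-equal chunks."""
    user_ids = list(user_ids)
    base, extra = divmod(len(user_ids), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches take one additional user each.
        size = base + (1 if i < extra else 0)
        batches.append(user_ids[start:start + size])
        start += size
    return batches

# Example: 10 users spread over 3 machines.
batches = split_into_batches(range(10), 3)

# On each machine, roughly (assuming the model was saved and copied over first):
# model = gl.load_model(model_path)
# part = model.recommend(users=batch, k=20)
# The per-batch results can then be appended together into one SFrame.
```

Because every user lands in exactly one batch, the combined output should match what a single recommend call over all users would return.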

Please let us know if you have any additional questions!

Chris


User 3218 | 7/8/2016, 11:39:03 PM

Chris, thanks for the reply. There's no straightforward way to tell whether items are worth recommending or not, except that we don't want to recommend previously recommended items. I think GraphLab already takes care of that. I cannot seem to find an items argument in the create() API. Would you mind posting a link to the relevant documentation?

Is there an example of a GraphLab distributed recommender? The user guide makes no mention of it:
<pre>
The toolkits currently supported to run in a distributed execution environment are:

Linear regression
Logistic classifier
SVM classifier
Boosted trees classifier
Boosted trees regression
Random forest classifier
Random forest regression
Pagerank
Label propagation
</pre>

https://turi.com/learn/userguide/deployment/pipeline-dml.html