get_similar_users() - how does it work?

User 4685 | 4/13/2016, 7:51:48 AM

Hello, I am having a little trouble finding a way to use get_similar_users(). I am not sure whether I can actually use it the way I want for my scenario, and I couldn't find much in the GraphLab documentation or in any other helpful links. Given data that has userIds, items, ratings, and loccoordinates, I want to find similar users based on a filtering criterion, say by loccoordinates. How can I use get_similar_users() in this scenario? Also, does this method/function only work with recommendation methods (as I noticed in the GraphLab links)?


User 4 | 4/13/2016, 6:30:54 PM

Hi @sha3, I think you could try using a nearest neighbors model to find the most similar users in your dataset. With this model, you can choose a set of columns and a distance function, which lets you express what "similarity" means with respect to your data.

See the user guide chapter on nearest neighbors for more information on how to do this. This type of model can be used to make recommendations (effectively "similarity" recommendations) but has lots of other applications as well -- anywhere a notion of distance between items is useful.
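To make the "choose columns and a distance function" idea concrete, here is a minimal library-free sketch in plain Python. It is not the GraphLab API: the sample users, the `euclidean` helper, and `nearest_users` are all hypothetical names made up for illustration, and straight-line distance on latitude/longitude pairs is only a rough stand-in for real geographic distance.

```python
import math

# Hypothetical sample data: one row per user, with a (lat, lon) feature pair.
users = {
    "u1": (47.61, -122.33),
    "u2": (47.62, -122.35),
    "u3": (40.71, -74.01),
    "u4": (47.25, -122.44),
}


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_users(query_id, rows, k=2, distance=euclidean):
    """Return the k users closest to query_id as (user_id, distance) pairs."""
    query = rows[query_id]
    scored = [
        (uid, distance(query, feats))
        for uid, feats in rows.items()
        if uid != query_id  # a user is trivially closest to itself, so skip it
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


neighbors = nearest_users("u1", users, k=2)
for uid, dist in neighbors:
    print(uid, round(dist, 4))
```

Swapping in a different `distance` function changes what "similar" means without touching the search itself, which is the same separation the nearest neighbors model gives you.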

One caveat is that the nearest neighbors model will expect distance to be computable with respect to all the columns, and will expect userId to be unique within the dataset, so you will need to do some pre-processing if you have multiple (userId -> itemId) rows and want to take itemId into account. If you can reduce the multiple rows of (userId -> itemId) to a single row of (userId -> itemSet) where the itemSet distance can be computed for two users, then the model should give the results you want. You could try some code like this to get an itemSet for each user (if there are other columns that represent user metadata, you could use a similar approach to aggregate them, or if they are consistent across all the rows for the same user, drop them and join them back in after aggregating items):


import graphlab as gl  # assumes GraphLab Create is installed and `sf` is your SFrame

# Transform multiple rows of (userId -> itemId) into single rows of (userId -> itemIds).
sf2 = sf.groupby('userId', gl.aggregate.DISTINCT('item'))

# Transform the itemIds (list) into a dictionary representation (better understood by nearest neighbors).
sf2['itemSet'] = sf2['Distinct of item'].apply(lambda row: {item: 1 for item in row})
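If it helps to see what the aggregated itemSet buys you, here is a rough plain-Python sketch of the same idea without GraphLab. The sample rows and the helper names (`build_item_sets`, `jaccard_distance`, `most_similar`) are hypothetical, and Jaccard distance is just one reasonable choice of set distance, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical (userId, item) rows, several per user, including a duplicate.
rows = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "b"), ("u2", "c"),
    ("u3", "x"), ("u3", "a"),
    ("u1", "a"),
]


def build_item_sets(pairs):
    """Collapse (userId, item) rows into one {item: 1} dict per user."""
    grouped = defaultdict(dict)
    for user_id, item in pairs:
        grouped[user_id][item] = 1  # duplicates collapse onto the same key
    return dict(grouped)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the keys of two item dicts."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(keys_a & keys_b) / len(union)


def most_similar(query_id, item_sets):
    """Return the other user whose item set is closest to query_id's."""
    candidates = [uid for uid in item_sets if uid != query_id]
    return min(
        candidates,
        key=lambda uid: jaccard_distance(item_sets[query_id], item_sets[uid]),
    )


item_sets = build_item_sets(rows)
print(most_similar("u1", item_sets))
```

The brute-force `most_similar` loop is only there to show the distance at work; on a real dataset the nearest neighbors model would take its place.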