Recommender with side data

User 332 | 6/1/2014, 6:59:00 PM

Which models support the inclusion of user and/or item side data? The recommender tutorials do not show this aspect, and I would like to incorporate item data in my model(s).

"item_data : SFrame Side information for the items. This SFrame must have a column named with the item column name given above. It provides any amount of additional item-specific information."

Can you provide guidance?

Comments

User 18 | 6/2/2014, 5:22:23 AM

Hi,

It's great to see people trying the new side data feature in our recommender toolkit! Our new 0.3 release last week contained a lot of added features, and we'll be putting out a bunch of new IPython notebooks to illustrate how to use the new functionality. Please bear with us while we catch up on documentation.

Meanwhile, here's a quick code sample to help get you started:

import graphlab as gl

# observation data; the column names are: user_id, item_id, target

sf_obs = gl.SFrame.read_csv('./recommender_side_data.csv', column_type_hints=[str, str, float])

# user data; the column names are: user_id, gender, age

user_data = gl.SFrame.read_csv('./user_data.csv', column_type_hints=[str, str, int])

# item data; column names: item_id, genre

item_data = gl.SFrame.read_csv('./item_data.csv')

# create a training/testing split by randomly picking at most 1000 users and selecting 30% of their observations for testing

train_set, test_set = gl.recommender.random_split_by_user(sf_obs, 'user_id', 'item_id', max_num_users=1000, item_test_proportion=0.3)

# train a factorization machine model with default hyperparameter settings

m = gl.recommender.create(train_set, 'user_id', 'item_id', 'target', user_data=user_data, item_data=item_data, method='factorization_model')

# take a look at the model

m

# score the training and test sets

scores = m.score(test_set)
scores = m.score(train_set)

# now create a new fake user_id 6041 and give it some ratings information

new_data = gl.SFrame({'user_id': ['6041', '6041', '6041'], 'item_id': ['200', '535', '400'], 'target': [3.0, 2.0, 4.0]})

# append it to the test set

new_test_set = test_set.append(new_data)

# the model can score new users on the designated items (the target column in the input is ignored)

scores = m.score(new_test_set)

# we can also check the RMSE on the test set, this time using the input target column as ground truth

eval_results = m.evaluate(new_test_set)

# give the new user some side info

new_user_data = gl.SFrame({'user_id': ['6041'], 'gender': ['F'], 'age': [21]})

# a quick fix to get the columns in the right order

new_user_data2 = gl.SFrame()
new_user_data2['user_id'] = new_user_data['user_id']
new_user_data2['gender'] = new_user_data['gender']
new_user_data2['age'] = new_user_data['age']

# make recommendations for new and existing users alike

recs = m.recommend(['6041', '1', '533'], k=20, new_observation_data=new_data, new_user_data=new_user_data2)

# create a new item

new_item_data = gl.SFrame({'item_id': ['3953'], 'genre': ['Action|Comedy']})

# include this item in our tests

new_test_set2 = new_test_set.append(gl.SFrame({'user_id': ['6041'], 'item_id': ['3953'], 'target': [5.0]}))
new_data2 = new_data.append(gl.SFrame({'user_id': ['6041'], 'item_id': ['3953'], 'target': [5.0]}))
recs2 = m.recommend(['6041', '1', '533'], k=20, new_observation_data=new_data2, new_user_data=new_user_data2, new_item_data=new_item_data)

# exercise some additional options in recommend(): make recommendations for only the first 2000 items, excluding the user-item pairs in the test set

recs = m.recommend(['6041', '1', '533'], k=20, items=gl.SArray(range(2000)).astype(str), new_observation_data=new_data, new_user_data=new_user_data2, exclude=test_set)


User 332 | 6/2/2014, 6:39:08 AM

Thank you for the prompt and detailed response. I look forward to employing the new GL features in my projects!


User 318 | 6/2/2014, 12:47:29 PM

Thanks, @alicez!

I would like to learn more about how the recommendations are made for new and existing users.


User 18 | 6/2/2014, 5:59:24 PM

Sure. Just to be clear, I think there are two questions here: how are new users handled, and how are side features handled. For both questions, the answer is: each model has its own ways of dealing with new users and side features.

Side features: Linear models, matrix factorization, and factorization machines have a regression model for side features as well as observation data. Essentially, they learn coefficients for each side feature. The predicted item rating score is a weighted combination of side features and user-item interaction terms.
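
Roughly speaking (this glosses over the differences between the three models), the predicted score for user u on item i with side-feature vector x looks something like

\hat{r}_{ui} \approx w_0 + w_u + w_i + \sum_k \beta_k x_k + \langle a_u, b_i \rangle

where w_0 is a global bias, w_u and w_i are user and item biases, the \beta_k are the learned coefficients for the side features x_k, and a_u, b_i are the latent factor vectors whose inner product captures the user-item interaction.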

Popularity and item means are the simplest models; they do not take side features into account when scoring items.

The Item Similarity model could take side information into account through similarity measures. But this is not yet implemented in GraphLab Create. It will be included in the next release (due out in a couple of months).

New users: Item Similarity handles new users by measuring the similarity between existing items and whatever is known about the new user's preferred items (passed to recommend() through new_observation_data). Mathematically, this is equivalent to taking the vector-matrix product between the new user's preference vector and the pairwise item similarity matrix obtained during training.
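
As a toy illustration of that vector-matrix product (plain numpy with made-up numbers, not the actual GraphLab internals):

import numpy as np

# toy pairwise item-item similarity matrix learned during training
S = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.5],
              [0.1, 0.5, 1.0]])

# the new user's known preferences over the same three items
# (this is the kind of information passed in via new_observation_data)
p = np.array([4.0, 0.0, 2.0])

# candidate scores for every item: preference vector times similarity matrix
scores = p.dot(S)
print(scores)  # items similar to what the user already likes score highest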

Linear models, matrix factorization, and factorization machines revert to a "background model" of essentially just the constant item bias (or offset) terms and the global bias. If the new user has side info, then that will be incorporated into the prediction as well. They do not take new_observation_data into account. In other words, recommend() does not update the existing model with new_observation_data; the user and item latent factors do not change, nor do the user and item biases or side feature coefficients.
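
In equation form, the prediction for a brand-new user u* roughly reduces to

\hat{r}_{u^{*}i} \approx w_0 + w_i + \sum_k \beta_k x_k

i.e. the global bias, the item bias, and linear terms for whatever side features are known for u* and i.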

It's hard to be fully precise without resorting to a lot more LaTeX. I hope this helps!

Alice


User 318 | 6/3/2014, 5:51:19 AM

Thank you, Alice! I love this and can't wait to see more new IPython notebooks and get to know these new features more deeply!


User 332 | 6/6/2014, 4:34:59 AM

Alice, for MF and FMs, are categorical variables automatically transformed to binary, or are they required to be entered as such?


User 18 | 6/6/2014, 4:37:03 AM

Categorical variables are automatically translated to a one-hot encoding, i.e., if there are five categories, then they are represented as 5 binary variables, only one of which can be "on" at a time.
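
For example (purely illustrative, not the internal code), a gender column with categories F and M effectively becomes two indicator columns:

import graphlab as gl

user_data = gl.SFrame({'user_id': ['1', '2', '3'],
                       'gender': ['F', 'M', 'F']})

# conceptually, the model sees something like:
#
#   user_id   gender=F   gender=M
#     1          1          0
#     2          0          1
#     3          1          0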


User 117 | 6/19/2014, 3:05:52 PM

I have two questions about using FMs for a context-aware application.

For the examples you have listed we can give more information for either the items or the users, but not for user-item pairs. A typical example of a context-aware recommendation would be having an extra dimension in the data that describes the context in which a user interacted with an item, i.e. a triple of (user, item, context). An example would be (user_id, item_id, "weekend"). Is there a way to model such data in the current implementation? As I understood it, we can only provide static variables for users and items, such as demographic data for users and descriptive non-changing features for items.

My second question regards the applicability of the method to implicit interaction data: is there support for such data currently, or should I just transform implicit data into explicit "ratings" in order to test the method?


User 18 | 6/20/2014, 1:47:22 AM

Hi guys,

rock24, sorry about the delay in responding. Your question somehow slipped off of our radar. Here are the answers.

For MF and FMs, are categorical variables automatically transformed to binary, or are they required to be entered as such?

Categorical variables are automatically encoded into one-hot binary form. Right now, strings are automatically treated as categorical, and floats and ints as numeric.
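
So the column types you pass to read_csv decide how each side column is treated; for instance, in the earlier sample:

# gender is read as str -> treated as categorical (one-hot encoded)
# age is read as int    -> treated as a numeric feature
user_data = gl.SFrame.read_csv('./user_data.csv', column_type_hints=[str, str, int])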

side features for user-item observation pairs

Yes, we do support side features for observations. Any extra columns in the input training SFrame are automatically considered side features. Suppose your SFrame training_data contains the following columns:

user_id, item_id, rating, weekend

then the weekend column will be treated as a side feature for each observation. The trained model can then score test data that has the same schema:

m = gl.recommender.create(training_data, 'user_id', 'item_id', 'rating')
m.score(test_data)  # test_data must contain user_id, item_id, and weekend columns
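
For concreteness, such a training SFrame could be built like this (made-up values):

training_data = gl.SFrame({'user_id': ['1', '1', '2'],
                           'item_id': ['200', '535', '200'],
                           'rating': [3.0, 5.0, 4.0],
                           'weekend': [1, 0, 1]})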

Note that m.recommend() right now does not like models trained with side features for observations. So if you were to do

m.recommend(['Alice'])

the model would be very very unhappy. This is a small glitch in the current version that will be fixed in our next release. (The problem is that recommend() can take new data (new users, new items, or new observations) and we needed some extra checks to ensure that new observation data conform to the same schema as that used for training.)

handling implicit interaction data

Yes, we do handle implicit data. If you call create() without any target column, then it trains an ItemSimilarityModel with Jaccard similarity, which handles implicit interactions. It assumes that all the user-item pairs given in the training dataset denote that the user "likes" that item.
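
In code, that is just a create() call without a target column (toy data below):

import graphlab as gl

# observed interactions only -- there is no target column
sf_implicit = gl.SFrame({'user_id': ['1', '1', '2', '3'],
                         'item_id': ['200', '535', '200', '400']})

# with no target given, this trains an item similarity model using Jaccard similarity
m_implicit = gl.recommender.create(sf_implicit, 'user_id', 'item_id')
recs = m_implicit.recommend(['1'], k=3)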

Hope that helps!

Alice


User 117 | 6/20/2014, 10:04:22 AM

Hello Alice,

Thanks for the quick reply. Is it possible to use the FM model with implicit data if I do some preprocessing, like mapping interactions to estimated ratings?

Are there any plans to support implicit data for algorithms other than item similarity and popularity?


User 18 | 6/21/2014, 7:13:31 AM

Yes on both questions. You can convert implicit to explicit data and then use FM. Just make sure you include 0's as well as 1's in the observed ratings.
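
Here's a rough sketch of that conversion (the file name and column names are placeholders, and the negative sampling strategy is just one simple option):

import random
import graphlab as gl

# observed implicit interactions, e.g. clicks
implicit = gl.SFrame.read_csv('./clicks.csv', column_type_hints=[str, str])

# every observed user-item pair becomes a positive example with rating 1
positives = implicit[['user_id', 'item_id']]
positives['rating'] = gl.SArray([1.0] * positives.num_rows())

# sample an equal number of unobserved user-item pairs as 0-rated negatives
users = list(positives['user_id'].unique())
items = list(positives['item_id'].unique())
observed = set(zip(positives['user_id'], positives['item_id']))
neg_users, neg_items = [], []
while len(neg_users) < positives.num_rows():
    u, i = random.choice(users), random.choice(items)
    if (u, i) not in observed:
        neg_users.append(u)
        neg_items.append(i)
negatives = gl.SFrame({'user_id': neg_users,
                       'item_id': neg_items,
                       'rating': [0.0] * len(neg_users)})

# train a factorization machine on the combined 0/1 "ratings"
train = positives.append(negatives[['user_id', 'item_id', 'rating']])
m = gl.recommender.create(train, 'user_id', 'item_id', 'rating', method='factorization_model')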

We are planning to release another ranking model in the upcoming release. It will be useful for implicit data as well as applications that care about precision-recall as the principal evaluation metric.


User 1349 | 3/4/2015, 6:32:30 AM

Does the item_data SFrame support data in the format [item_id, feature_name, feature_value]? For example: [1, color, 'red'], [1, article_name, 'shirt'], ... This is needed when the feature space is large and every item_id might have a large number of features.


User 1349 | 3/5/2015, 6:44:42 AM

Thanks a lot, Brian, that's great!


User 1349 | 3/6/2015, 6:45:38 AM

I am unable to save the model (though I am getting recommendations from the model) while using side-information from items. Here is the trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ashay.tamhane/anaconda/lib/python2.7/site-packages/graphlab/toolkits/model.py", line 513, in save
    return glconnect.get_unity().save_model(self, _make_internal_url(location))
  File "cy_unity.pyx", line 78, in graphlab.cython.cy_unity.UnityGlobalProxy.save_model
  File "cy_unity.pyx", line 92, in graphlab.cython.cy_unity.UnityGlobalProxy.save_model
IOError: Unable to save model to /Users/ashay.tamhane/CollaborativeFiltering/xyz: Cannot open /Users/ashay.tamhane/CollaborativeFiltering/xyz/m_8cf830e06b6c04be.0249 for write. std::exception

I am able to save models that don't use side information.


User 2263 | 10/26/2015, 8:28:08 PM

Hi Nice People,

I would like to know if you can provide a formal mathematical expression for how GraphLab's Factorization Recommender deals with side data.

So far, the API documentation page establishes that

" ... when side data is not present, the predicted score for user i on item j is given by ..."

However, there is no information about how your algorithm predicts a score when side data is used in a model.

It would be great to have this information in order to get deeper insights from the data.

Thanks in advance!


User 2696 | 11/30/2015, 7:00:01 AM

I am also facing vigliensoni's problem. There is no information about how the score is computed when side data is present.


User 2263 | 12/1/2015, 1:07:20 AM

Hey @李雷

It seems (Rendle, 2010) is what GraphLab is using:

http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

Not sure about the difference between using side_data_factorization=False and just using user and item side data.

Hope it helps.


User 1207 | 12/2/2015, 1:50:55 AM

Hello,

Sorry to be away from this discussion -- we must have missed the post from earlier. Yes, the way we are computing the side factors for the recommender is as given in the Rendle paper, http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf. Specifically, the score computed is equation 1 of that paper. If side_data_factorization=False, then all the side terms only get linear weights, not factors that interact with each other.
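
For reference, equation 1 of that paper is

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

where, roughly, the feature vector x concatenates the one-hot user indicator, the one-hot item indicator, and any side-data columns. With side_data_factorization=False, only the user-item pair keeps its factorized \langle v_i, v_j \rangle interaction term; the side columns contribute only through their linear weights w_i.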

Hope that helps. I'll monitor this thread in case you have any further questions, and we'll work on updating the relevant parts of the documentation as well.

Thanks! -- Hoyt


User 2570 | 12/4/2015, 12:51:28 PM

What happens if your data set doesn't have item/user side information?


User 1207 | 12/5/2015, 5:55:12 PM

Hi @erigits,

If there is observation side information -- information besides the user/item/target that is particular to each observation -- then that is handled the same way. Just user/item information, or user/item/target information, is handled using a straight matrix factorization.
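
In that plain user/item/target case, the score for user i on item j roughly reduces to the standard matrix factorization form

\text{score}(i, j) \approx w_0 + w_i + w_j + \langle u_i, v_j \rangle

i.e. a global bias, per-user and per-item biases, and the inner product of the latent factor vectors.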

-- Hoyt


User 2570 | 12/6/2015, 9:19:59 AM

Thank you @hoytak for the good explanation