side feature for factorization_recommender

User 1068 | 12/16/2014, 12:46:32 AM

Hi, I ran into a problem when trying to add side features to the factorization_recommender. Say the data format is "rating, user, item, [a list of implicit data]": how do I input the list of implicit data? Its length varies from case to case; for example, it could be all the implicit data recorded before this rating. I'd really appreciate your help.

Comments

User 6 | 12/16/2014, 6:33:31 AM

Hi, by implicit ratings we mean user actions without a score, for example whether an item was viewed or not. Explicit ratings are scores, for example a user liked a movie and rated it 3 stars out of 5.

Side features can be fed into our factorization recommenders in two ways:

1) Additional information about the user or item. For that you will need an SFrame which contains the user id and a list of fields that characterize the user, such as age, zip code, number of kids, expected income, etc. The same goes for information about the item, such as shape, color, size, brand, price, etc. See the example <a href="http://graphlab.com/products/create/docs/generated/graphlab.recommender.factorizationrecommender.create.html">here</a>.

2) Additional information about the rating, for example: time, IP address of the user, type of mobile phone, transaction amount, method of payment, etc.

For the side features to be considered, you simply need to store them in the SFrame you feed into the algorithm. Note that an SFrame column can contain a sparse vector or a Python dictionary, so it is possible to squeeze all the side information into a single column. See an example <a href="https://github.com/graphlab-code/how-to/blob/master/sframe_pack.py">here</a>.


User 1068 | 12/17/2014, 6:33:57 AM

Hi, thanks for your quick reply. I tried the methods you mentioned, but the results differed quite a lot depending on the input format. The model was the same in every case, factorization_recommender. The side feature I used is user interests.

The original rating data looks like this:
<pre>
| rating | userid | itemid  |
| 1      | 43751  | 6874199 |
| 1      | 30947  | 418829  |
</pre>
With this input the validation RMSE is 0.61. When I added the user interests to each rating,
<pre>
| rating | userid | itemid  | interest1 | interest2 | ... | interest20 |
| 1      | 43751  | 6874199 | 1         | 0         | ... | 1          |
| 1      | 30947  | 418829  | 1         | 0         | ... | 0          |
</pre>
the result was quite good: the RMSE was 0.48. But as you know, there is a lot of redundancy. So I moved the interests to another SFrame ("For that you will need an SFrame which contains the userid, and list of fields"), as you said, like
<pre>
userid | interests
43751  | [id1,id2,..id3]
</pre>
However, in this case the RMSE went back up to 0.59, so it seems the algorithm does not consider the side feature. How do you compute the interactions among the interest ids, users, and items in the vector format? I also tried the dict format, which does not work either.

The result I want is the one from the second type of input. But when I have high-dimensional data, like 10k interests, I cannot list them one by one. What should I do? I'd really appreciate your help!


User 18 | 12/20/2014, 1:16:54 AM

Hi @hdtxjtu,

If the side information is user-specific, don't pass it in as observation side data (i.e., as columns in the observation rating SFrame). Instead, pass it into recommender.create as user_data. See the last example in the <a href="http://graphlab.com/products/create/docs/generated/graphlab.recommender.factorizationrecommender.create.html">API doc for recommender.create()</a>.

In your case, encoding the user interests as a dense array might be problematic because each of those numbers is interpreted as a numeric feature, whereas semantically they are categorical. It's also not space efficient, as you pointed out. Try encoding the user interests as sparse dictionary features:

<pre>
user_id | interests
43751   | {'interest1': 1, 'interest8': 1}
38103   | {'interest5': 1, 'interest6': 1, 'interest11': 1}
</pre>
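If the interests start out as a variable-length list of ids per user, the sparse encoding above can be produced with a few lines of plain Python. This is only a sketch: the helper name `interests_to_sparse_dict`, the `interest` key prefix, and the sample ids are assumptions made for illustration, and no GraphLab calls are involved.

```python
def interests_to_sparse_dict(interest_ids, prefix="interest"):
    """Map a variable-length list of interest ids to {'interest<id>': 1, ...}."""
    # Only interests the user actually has get a key; absent ones are implicit zeros.
    return {"%s%s" % (prefix, i): 1 for i in interest_ids}

# Hypothetical input: one variable-length list of interest ids per user.
user_interests = {43751: [1, 8], 38103: [5, 6, 11]}

# One row per user, sorted by user id so the output order is deterministic.
user_rows = [{"user_id": uid, "interests": interests_to_sparse_dict(ids)}
             for uid, ids in sorted(user_interests.items())]
```

Because only the interests a user has appear as keys, the size of each row depends on the number of interests per user rather than on the total number of distinct interests, which is what matters with something like 10k interests.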

Suppose your observation rating SFrame above is called obs_sf. You can then obtain the dictionary representation of the user info with something like this:

<pre>
import graphlab
import graphlab.aggregate as agg

# First, de-duplicate the user info from the ratings table.
# You can skip this step if you already have a user_info SFrame with one row per user id.
user_info = obs_sf.groupby('user_id', {
    'interest1': agg.SELECT_ONE('interest1'),
    'interest2': agg.SELECT_ONE('interest2'),
    # ... you'll have to write out all the SELECT_ONE operations for each column here
    'interest20': agg.SELECT_ONE('interest20')})

# Use SFrame.pack() to pack all the interest* columns into a sparse dictionary
# (the SFrame API reference lists this method as pack_columns(); use that name if pack() is unavailable).
user_info = user_info.pack(column_prefix='interest', new_column_name='interest_dict', dtype=dict)
user_info_dict = user_info[['user_id', 'interest_dict']]

# Build a recommender
rec_model = graphlab.recommender.create(obs_sf, target='rating', user_data=user_info_dict)
</pre>
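To check what shape the user info should end up in, the de-duplicate-then-pack steps above can be mimicked in plain Python on a handful of rows. This is a sketch of the idea only, not of how GraphLab does it internally: the helper `build_user_info`, the three interest columns, and the sample rows are all made up for illustration.

```python
# Hypothetical per-rating rows with the user's 0/1 interest columns repeated on every rating.
obs_rows = [
    {"rating": 1, "user_id": 43751, "item_id": 6874199, "interest1": 1, "interest2": 0, "interest3": 1},
    {"rating": 1, "user_id": 43751, "item_id": 418829, "interest1": 1, "interest2": 0, "interest3": 1},
    {"rating": 1, "user_id": 30947, "item_id": 418829, "interest1": 1, "interest2": 0, "interest3": 0},
]

def build_user_info(rows, prefix="interest"):
    """Return one row per user, with the non-zero interest columns packed into a dict."""
    user_info = {}
    for row in rows:
        uid = row["user_id"]
        if uid in user_info:
            continue  # keep the first row seen per user, i.e. one value per group
        # Zero entries are dropped so the packed dictionary stays sparse.
        user_info[uid] = {k: v for k, v in row.items()
                          if k.startswith(prefix) and v != 0}
    return [{"user_id": uid, "interest_dict": d} for uid, d in user_info.items()]

user_info_rows = build_user_info(obs_rows)
```

Each user appears exactly once in `user_info_rows`, no matter how many ratings they have, which removes the redundancy of repeating the interest columns on every rating row.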

You might need to use factorization_recommender.create() directly and try out different hyperparameter settings. Adding side features can sometimes require tuning. Also, try side_data_factorization=False, which is the option to train a plain matrix factorization model as opposed to the more general factorization machine model; depending on your data, a simpler model can sometimes lead to better performance.
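For intuition about the two model families, here is a small plain-Python sketch of the textbook scoring formulas: a matrix factorization score with a single user-item interaction, and a second-order factorization machine score in which every pair of active features interacts. It is not GraphLab's implementation, and the feature names and weights are toy values chosen by hand.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def mf_score(bias, user_bias, item_bias, user_factors, item_factors):
    """Plain matrix factorization: biases plus one user-item interaction."""
    return bias + user_bias + item_bias + dot(user_factors, item_factors)

def fm_score(features, bias, weights, factors):
    """Second-order factorization machine over a sparse {name: value} input.

    Every pair of active features (user, item, and side features alike)
    interacts through the dot product of their latent factor vectors.
    """
    names = sorted(features)
    score = bias + sum(weights[n] * features[n] for n in names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score += dot(factors[a], factors[b]) * features[a] * features[b]
    return score

# Toy parameters, chosen by hand purely for illustration.
weights = {"user=43751": 0.1, "item=6874199": 0.2, "interest1": 0.05}
factors = {"user=43751": [1.0, 0.0], "item=6874199": [0.5, 0.5], "interest1": [0.0, 2.0]}

plain = mf_score(0.5, 0.1, 0.2, factors["user=43751"], factors["item=6874199"])
no_side = fm_score({"user=43751": 1, "item=6874199": 1}, 0.5, weights, factors)
with_side = fm_score({"user=43751": 1, "item=6874199": 1, "interest1": 1},
                     0.5, weights, factors)
```

With only the user and item features active, the factorization machine score reduces to the matrix factorization score. Adding a side feature introduces extra interaction terms and therefore extra parameters to fit, which is one reason tuning matters more once side features are involved.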

Let us know if you are able to get a better RMSE this way.

Alice