graphlab.recommender.factorization_recommender.create Missing values for user side data

User 2129 | 8/4/2015, 5:09:47 PM

I have tried to load the side data into the recommender, but my side data is kind of spotty with missing values.

If the recommender still can't use missing values as of yet, is there a command I can use to apply SFrame.fillna to all columns?

Comments

User 1207 | 8/4/2015, 9:11:58 PM

Hi Michael,

What kind of data is it? If it is numeric data, then setting all the NaNs to 0 has the effect of them being ignored by the linear model (though this isn't really a true missing-data solution, it may be enough for your case).

The true way to handle missing numeric data in the recommender -- albeit slightly more expensive -- is to replace the numeric column containing NaNs with a dictionary column, where one key holds the numeric value and another flags a missing value. To do this, you can convert the numeric column with sa.apply(lambda x: {"value" : x} if x is not None else {"None" : 1}).
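The transform above, shown on a plain Python list so its behavior is easy to check; with GraphLab you would call it on the SArray itself, e.g. sa.apply(to_dict). The column name "age" and the sample values are made up for illustration.

```python
def to_dict(x):
    # A present value becomes {"value": x}; a missing one becomes {"None": 1},
    # which gives the model a separate "missingness" indicator term instead of
    # silently treating missing as zero.
    return {"value": x} if x is not None else {"None": 1}

ages = [34, None, 27]          # stand-in for a spotty numeric side-data column
converted = [to_dict(x) for x in ages]
print(converted)
# [{'value': 34}, {'None': 1}, {'value': 27}]
```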


User 2129 | 8/6/2015, 2:40:58 AM

It is user demographics, as well as some attitude questions on music. Naturally, there will be some who don't put anything for their age, region or occupation status.

For categorical variables, do I need to dummy-code them first? (It seems that it works fine without, but just to be sure.)

If I decide to cull some side info rows, what will happen to users without side information? E.g. users 1-10 have observations, but only users 5-10 have side information. Will this introduce bias?

After I have got my model, is there a way I could see whether a certain piece of side information has been used and how important it is? The get('coefficients') command only gives the numerical values, while m3['user_side_data_column_names'] is just names.


User 1207 | 8/7/2015, 9:45:36 PM

Hi Michael,

It should just work if you have categorical variables, as those will just show up as their own category.

To frame the answers for the rest of your question, there are two types of models used for the side information depending on whether side_data_factorization is True or not at model creation. If side_data_factorization is off, it fits all of the side information using a linear model, which causes each category or term in the side data to weight the answer positively or negatively depending on what fits the data best. Generally speaking, most of the power of this model is in the latent factors associated with the users and items, and the side terms give slight adjustments.

You can see the weighting in the side data terms with the side data entry of m.get("coefficients"). The linear terms give these weights, with larger magnitude weights having a more significant effect.
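One quick way to eyeball those weights is to rank the side-data terms by absolute coefficient magnitude. A minimal sketch on mock data -- the dict below merely stands in for one side-data entry of m.get("coefficients"); the real return value is an SFrame-like structure whose exact layout may differ, and the term names and values here are invented:

```python
# Mock side-data coefficients: term name -> linear weight.
side_coefs = {
    "age_18-25": 0.12,
    "region_EU": -0.31,
    "occupation_student": 0.05,
}

# Larger |weight| means a more significant effect, regardless of sign.
ranked = sorted(side_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)
# [('region_EU', -0.31), ('age_18-25', 0.12), ('occupation_student', 0.05)]
```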

If side data factorization is on, then the effect of the side features is more difficult to interpret. In this case, they also interact with the user and item latent factors, and as a result, it isn't really possible to say how much of an effect the user/item tags have on the model without looking at the associated user and item factors as well.

The setup I recommend for this is to use the linear model for the side features (so side_data_factorization = False), but then to take numeric features and bin them using one of the feature transforms. This tends to give a good model while also keeping the side coefficients interpretable.
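A minimal sketch of the binning idea with a hand-rolled bucketing function; the bin edges and the "age" feature are illustrative assumptions, not from the thread. With GraphLab you would apply it to the column via something like sf['age'] = sf['age'].apply(bin_age), turning the numeric feature into a categorical one with its own linear coefficient per bucket.

```python
def bin_age(age, edges=(18, 25, 35, 50)):
    # Map a numeric age to a categorical bucket label; missing values get
    # their own bucket so they contribute a separate coefficient.
    if age is None:
        return "missing"
    for i, edge in enumerate(edges):
        if age < edge:
            return "bin_%d" % i
    return "bin_%d" % len(edges)

print([bin_age(a) for a in (10, 20, 60, None)])
# ['bin_0', 'bin_1', 'bin_4', 'missing']
```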

-- Hoyt


User 2129 | 8/13/2015, 5:32:21 AM

Hi Hoyt,

Thanks for the heads up =)

So just to paraphrase, if I set side_data_factorization = False: predicted score of an item = (factors from user) · (factors from item) + sum of (linear weight × each side data factor)?

Whereas if it is True: predicted score of an item = (factors from user) · (factors from item), where the side data are "merged" into the factors?

Another follow-up question: currently num_factors is set as a parameter in the model, so does the factorization create command make only that number of factors (default 8), or does it generate a list and select the best 8?


User 1207 | 8/13/2015, 5:37:48 PM

Hi Michael,

In the case of side_data_factorization = False, your interpretation is correct.

A latent factor is just a vector of length num_factors, with one latent factor associated with each user, each item, and, if side_data_factorization is true, each side feature category and numerical column. In the side_data_factorization = True case, the score is generated by summing the dot product of the latent factors of every term with every other term. For example, if you have a categorical side feature "genre", we would assign a different latent factor to each distinct value in genre, then the calculated score would be the linear model in the no-factorization case, plus the dot product of the genre latent factor with the user latent factor and the dot product of the genre latent factor with the item latent factor.

Does that answer your question?

-- Hoyt


User 2129 | 8/15/2015, 4:27:43 AM

Yup that really answers it =)