Does side data for a matrix factorization recommender result in an overfitted model?

User 3228 | 2/19/2016, 7:39:54 PM

Hey everyone, I just started using GraphLab Create for our data science immersive course at Galvanize. We used the matrix factorization recommender model on the movie ratings dataset. Including demographic side data seems to hurt our predictions; perhaps it is overfitting the model? But why would the side data cause overfitting instead of augmenting the model?
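For reference, this is roughly what we're doing (a sketch, not our exact code; the file and column names below are placeholders for our actual schema):

import graphlab

ratings = graphlab.SFrame('ratings.csv')     # user_id, item_id, rating
user_info = graphlab.SFrame('users.csv')     # user_id, age, gender, occupation

# Baseline: plain matrix factorization on the interactions only.
baseline = graphlab.recommender.factorization_recommender.create(
    ratings, user_id='user_id', item_id='item_id', target='rating')

# Same model with demographic side data -- this version scores worse for us.
with_side = graphlab.recommender.factorization_recommender.create(
    ratings, user_id='user_id', item_id='item_id', target='rating',
    user_data=user_info)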

On another topic, I also used fillna on the Kaggle BNP Paribas dataset to fill all missing values with zero, and it deleted some rows. I couldn't figure out what was going on.

Any thoughts would be appreciated.

Comments

User 1207 | 2/19/2016, 11:40:52 PM

Hey heyengel,

There are some challenging issues inherent in how our current model handles side information, and it seems you may be hitting some of them. We are working internally on a much better way to model side information, which should resolve some of these issues, but it's not yet in GLC.

First, a surprising amount of information is already contained in the user-item interaction pairs. As a result, the model typically finds side information much less useful than one would expect; many types of side information simply don't help that much.

Second, while the model may indeed overfit -- I've definitely seen that happen -- what usually happens is that the model becomes more complicated, which tends to introduce many more local minima into the optimization problem. A local minimum is a set of factor values that may be far from the best values but that the optimization has no obvious way to improve, so it gets stuck at a point that is not really optimal, and your model ends up worse than the plain MF model.
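One quick way to see this for yourself is to train the same model from several random seeds and compare the results; with side data the spread is usually noticeably wider. A rough sketch (assuming you already have a train/validation split in train and valid, and user side data in user_info):

for seed in [0, 1, 2, 3]:
    m = graphlab.recommender.factorization_recommender.create(
        train, user_id='user_id', item_id='item_id', target='rating',
        user_data=user_info, random_seed=seed, verbose=False)
    # A wide spread across seeds suggests the optimizer is landing
    # in different local minima.
    print seed, m.evaluate_rmse(valid, target='rating')['rmse_overall']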

The main thing I have found helpful in addressing this is to use categorical side features instead of numerical ones -- if you have numerical side features, try binning them with one of the feature transformers, which groups the values into discrete bins rather than working with each value directly. Often the best model is one with all numerical features binned, side_data_factorization set to False, and linear_regularization adjusted to a value comparable to the regularization value. Still, don't be surprised if your model is not substantially better, as much of that information is already contained in the interaction data.
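In code, that looks roughly like this (a sketch -- I'm assuming a numerical 'age' column in the user side table, and using the FeatureBinner transformer from graphlab.feature_engineering, if I remember that API right):

from graphlab import feature_engineering

# Bin the numerical column into quantile buckets so the model treats
# it as categorical instead of fitting the raw value.
binner = feature_engineering.create(
    user_info,
    feature_engineering.FeatureBinner(features=['age'],
                                      strategy='quantile', num_bins=10))
user_info_binned = binner.transform(user_info)

m = graphlab.recommender.factorization_recommender.create(
    ratings, user_id='user_id', item_id='item_id', target='rating',
    user_data=user_info_binned,
    side_data_factorization=False,   # side features enter linearly only
    linear_regularization=1e-7,      # tune to be comparable to regularization
    regularization=1e-7)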

Hope that helps! -- Hoyt


User 12 | 2/20/2016, 1:43:41 AM

Hi @heyengel, regarding your second issue, can you post a bit more information about what happened? I ran the following snippet on the train.csv file from the Kaggle page and I don't see any missing rows.

import graphlab

# Fill missing values in every column with zero, one column at a time.
sf = graphlab.SFrame('train.csv')
for c in sf.column_names():
    sf = sf.fillna(c, 0)
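If rows really are disappearing on your end, a quick check would be to print the row count before and after the fill, e.g.:

sf = graphlab.SFrame('train.csv')
print 'rows before:', sf.num_rows()
for c in sf.column_names():
    sf = sf.fillna(c, 0)
print 'rows after:', sf.num_rows()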

Thanks, Brian