FactorizationRecommender: Handling missing values in side data

User 3030 | 1/15/2016, 3:38:04 PM

I have a ratings table containing the variables (User, Item, Rating, Quantity). The Quantity variable is a side data feature and is missing for some observations in the training dataset. It provides relevant information about clusters of users, so it is too valuable to discard from training. FactorizationRecommender does not allow undefined values in the Quantity variable. How do I handle this scenario? Also, how would you handle it if Quantity were a categorical variable?


User 1207 | 1/18/2016, 9:37:23 PM

Hello Nullstellensatz,

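By default, numeric side features are treated as numeric variables, and string features are treated as categorical. Given your use case and the presence of missing values, the numeric treatment is likely not what you want. If you first convert Quantity to a string using .astype(str), it will be treated as categorical, and the missing values should be fine. If Quantity were already a categorical (string) column, it would be handled this way by default.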
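
To make the idea concrete, here is a minimal plain-Python sketch of turning a numeric column with gaps into categorical string labels. The helper `to_category` and the "missing" label are hypothetical names chosen for illustration, not part of the library; on an actual column, the .astype(str) call mentioned above would do the conversion.

```python
def to_category(quantity, missing_label="missing"):
    """Return a string label for a quantity.

    Assumption for illustration: a missing value (None) is given its own
    explicit category instead of being dropped.
    """
    if quantity is None:
        return missing_label
    return str(quantity)


# Hypothetical Quantity column with one missing observation.
quantities = [3, None, 12, 3]
labels = [to_category(q) for q in quantities]
```

With an explicit label, observations with an unknown Quantity form their own category rather than being excluded from training.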

The other thing to be aware of is that your model may not perform well if Quantity takes many distinct values, because each value can influence the model's score in a slightly different way. If there is not much training data for each value, the model can run into problems. Ideally, len(X["Quantity"].unique()) should be much smaller than the number of observations you have. If this is not the case, you will want to bin the values so that they still give the model adequate information but do not allow it to overfit, for example by putting 10-49 in one bin, 50-99 in another, 100-499 in another, etc.
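
A rough plain-Python sketch of that binning follows. The bin edges come from the example ranges above; the "<10" and "500+" bins, the "missing" label, and the names `BIN_EDGES`, `BIN_LABELS`, and `bin_quantity` are assumptions added for illustration.

```python
from bisect import bisect_right

# Lower edges of each bin after the first; taken from the example ranges.
BIN_EDGES = [10, 50, 100, 500]
# One label per interval; "<10" and "500+" are assumed catch-all bins.
BIN_LABELS = ["<10", "10-49", "50-99", "100-499", "500+"]


def bin_quantity(quantity, missing_label="missing"):
    """Map a numeric quantity to a coarse string bin; None gets its own label."""
    if quantity is None:
        return missing_label
    # bisect_right counts how many edges are <= quantity, which is the bin index.
    return BIN_LABELS[bisect_right(BIN_EDGES, quantity)]


# Hypothetical Quantity values, including one missing observation.
quantities = [3, 12, 49, 50, None, 120, 499, 2000]
binned = [bin_quantity(q) for q in quantities]

# The number of distinct categories should stay far below the number of rows.
n_categories = len(set(binned))
```

The resulting string labels can then be used as the categorical Quantity feature. How coarse to make the bins depends on how many observations fall into each one.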