Rationale for evaluation suggestion in documentation

User 3218 | 2/19/2016, 1:07:21 AM


I noticed this recently in the user guide for recommender evaluation:

-- begin quote -- To accurately evaluate the precision-recall of a model trained on explicit rating data, it's important to only include highly rated items in your test set, as these are the items a user would likely choose. Creating such a test set can be done with a handful of SFrame operations and gl.recommender.random_split_by_user:

high_rated_data = data[data["rating"] >= 4]
low_rated_data = data[data["rating"] < 4]
train_data_1, test_data = gl.recommender.random_split_by_user(
    high_rated_data, user_id='user', item_id='movie')
train_data = train_data_1.append(low_rated_data)
-- end quote --

I don't get this. If your goal is to build a recommender on explicit rating data, doesn't it make sense to test how well the model does on low-rating examples too? I agree the high-rating examples are important, but by this construction, a model that always spits out high ratings will "win" even if it does terribly on those low-rating examples. Perhaps I am totally missing some point here. Can someone clarify?


User 12 | 2/19/2016, 11:26:20 PM

Hi @delip, I think the key distinction here is that the user guide is talking (in that paragraph) about evaluating the recommendations, not the predicted ratings. In real life a user is likely to choose movies with a 4 or 5 star rating, so it makes sense to only include those in the test set. It's true that a model that does poorly at predicting low ratings might still "win", but that's OK, as long as it's making the correct recommendations for the movies the user would actually watch.

Thanks, Brian

User 3218 | 2/22/2016, 5:23:35 PM

@brian thanks for the comment. Going by this logic, wouldn't a model that just blindly outputs a high rating for everything be a "winner"? Given the way this test set is constructed, that model would produce a very low (or zero) RMSE, or 100% accuracy, but in practice it would perform poorly in production. I'm not sure this is the right thing to suggest in the documentation. At least from an ML theory point of view, this evaluation is not kosher.

User 1207 | 2/22/2016, 5:48:18 PM

Hi @delip,

One clarification on what Brian said -- precision/recall as an evaluation metric does not consider the actual scores, just which items were chosen by recommend() and whether they also appear in the test set.

Consider a quick example -- suppose a user only watches action movies, and thus gives all action movies a high score and all other movies a low score. A test set constructed using the above method would then be full of action movies for this user. The precision/recall metric for that user would be high only if recommend() recommended mostly action movies, which would happen only if the model scored action movies higher than all the other movies. If the model instead outputs a high rating for everything, the precision/recall would be quite low, even though the RMSE on this test set would look artificially good.

In other words, you would not want to use a test set constructed in this way to test RMSE, as, like you said, it would be artificially good. But precision/recall does not work that way, making this the correct way to construct such a test set.
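To make Hoyt's point concrete, here is a minimal pure-Python sketch (not using GraphLab; the user, item names, and both "models" are made up for illustration). It compares a constant "everything is 5 stars" model against a model that actually learned the user's taste, evaluated on a test set containing only the user's highly rated items. The constant model gets a perfect RMSE on that test set, yet its precision@k is poor because its ranking is arbitrary:

```python
# Toy illustration: a constant "always 5 stars" model looks perfect on RMSE
# over a high-rated-only test set, yet fails on precision@k, which only
# checks whether the top-ranked items appear in the test set.
import math

# Hypothetical user who loves action movies and dislikes everything else.
test_set = {"action1": 5, "action2": 5}          # high-rated items only
catalog = ["action1", "action2", "drama1", "drama2", "comedy1"]

def rmse(predict, truth):
    errs = [(predict(item) - r) ** 2 for item, r in truth.items()]
    return math.sqrt(sum(errs) / len(errs))

def precision_at_k(ranked_items, truth, k=2):
    hits = sum(1 for item in ranked_items[:k] if item in truth)
    return hits / k

# Model A: blindly predicts 5 for everything, so its ranking is arbitrary.
predict_a = lambda item: 5.0
ranking_a = ["drama1", "comedy1", "action1", "drama2", "action2"]

# Model B: learned the user's taste; ranks action movies first.
predict_b = lambda item: 4.8 if item.startswith("action") else 1.5
ranking_b = sorted(catalog, key=predict_b, reverse=True)

print(rmse(predict_a, test_set))            # 0.0 -- artificially perfect
print(precision_at_k(ranking_a, test_set))  # 0.0 -- no test items in top 2
print(rmse(predict_b, test_set))            # 0.2 -- small honest error
print(precision_at_k(ranking_b, test_set))  # 1.0 -- both top items are hits
```

So on a high-rated-only test set the two metrics disagree exactly as described: RMSE rewards the degenerate model, precision/recall exposes it.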

Hope that helps clarify things! -- Hoyt