Finding "true" evaluation metrics in RankingFactorizationRecommender

User 3230 | 3/1/2016, 9:58:28 AM

https://dato.com/products/create/docs/generated/graphlab.recommender.rankingfactorizationrecommender.RankingFactorizationRecommender.evaluate.html#graphlab.recommender.rankingfactorizationrecommender.RankingFactorizationRecommender.evaluate

After experimenting with the evaluation outputs, I've realized that the reported precision and recall metrics do not really match the "true" metrics.

For example, user X has only 6 items and is part of the test set; 50% (3) of his items are held out.

For cut-offs 1 and 2 the metrics do not really reflect the true picture, as the recommender has not yet been given enough chances to make 3 correct guesses. Conversely, even if all 3 items have been recovered by K = 10, precision has to taper off because there are no more correct answers left to guess.
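
To make that ceiling concrete, here is a small illustration (my own sketch, not GraphLab output) of the best-case precision and recall at each cutoff when only 3 items are held out:

```python
# Best achievable precision@k and recall@k when only 3 items are held out for the user.
n_holdout = 3
for k in [1, 2, 3, 5, 10]:
    best_hits = min(n_holdout, k)              # the top-k list can contain at most this many correct items
    max_precision = best_hits / float(k)       # capped at 3/k once k > 3
    max_recall = best_hits / float(n_holdout)  # cannot reach 1 until k >= 3
    print(k, max_precision, max_recall)
```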

An example output from GraphLab's evaluate function:

What I did instead was to calculate the TP, FP, TN and FN from scratch, assuming that the maximum number of guesses is the total catalogue size. I also defined a new cutoff, which is "number of guesses - number of holdout items". My ROC curve can then be plotted based on cutoffs greater than or equal to zero.
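
Roughly, the counts I am computing look like this sketch (illustrative code only; recommended, holdout_items and n_catalogue are placeholders for my own data, not part of the GraphLab API):

```python
# Sketch of my adjusted confusion-matrix counts at a given cutoff k.
def adjusted_counts(recommended, holdout_items, n_catalogue, k):
    top_k = set(recommended[:k])          # the model's first k guesses for this user
    holdout = set(holdout_items)
    tp = len(top_k & holdout)             # held-out items the model found
    fp = len(top_k - holdout)             # guesses that were not held-out items
    fn = len(holdout - top_k)             # held-out items the model missed
    tn = n_catalogue - tp - fp - fn       # everything else in the catalogue
    adjusted_cutoff = k - len(holdout)    # "number of guesses - number of holdout items"
    return tp, fp, tn, fn, adjusted_cutoff
```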

1) Does the solver in the ranking factorization engine take care of this? At the moment I am not sure whether it is truly minimizing the right quantities.

2) Is there a way I can plug this modified TP/FP/TN/FN calculation into GraphLab to obtain the corrected precision and recall metrics?

Comments

User 1359 | 3/7/2016, 4:39:54 AM

Thanks for both questions! I'm looking into this and will get back to you on it soon.


User 1207 | 3/7/2016, 7:10:59 PM

Hello Razorwind,

I'm not sure I understand your particular example -- in your output, TP goes up to 4, indicating there are 4 items in the test set. At 4 items, recall reaches 1 and precision starts to decrease, which is what I would expect to see.

Now, to your specific questions. None of the models directly optimizes precision-recall, as that metric requires a test set. The objectives they do optimize, however, are known to be good proxies for achieving good precision and recall. The averaged precision and recall statistics simply average these per-user values, and I believe this is the standard way of aggregating PR measures. However, if you have a use case for an adjusted measure, then I would be happy to explore it.

As for adjusting the precision and recall to your proposed scale, you could start by getting the per-user item counts in the training set, using groupby with a count aggregator to count how many items are present for each user. evaluate_precision_recall() gives back the per-user precision and recall, so you could adjust those values to your scale and then average the results.
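
For example, something along these lines should work (a rough sketch; train_data, test_data and model are your own objects, and you may need to adjust the column names and the exact key returned by evaluate_precision_recall() to your version):

```python
import graphlab as gl

# Per-user item counts in the training set, via groupby with a COUNT aggregator.
train_counts = train_data.groupby('user_id',
                                  {'n_train_items': gl.aggregate.COUNT()})

# Per-user precision and recall at each cutoff.
pr = model.evaluate_precision_recall(test_data)
per_user = pr['precision_recall_by_user']   # check the exact key in your version

# Attach the counts, apply your own adjustment to the 'precision'/'recall'
# columns here, and then average the adjusted values across users.
per_user = per_user.join(train_counts, on='user_id', how='left')
averaged = per_user.groupby('cutoff',
                            {'avg_precision': gl.aggregate.MEAN('precision'),
                             'avg_recall': gl.aggregate.MEAN('recall')})
```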

Hope that helps! -- Hoyt