Question on Pearson correlation metric

User 1285 | 2/15/2015, 7:39:30 AM

Hi all,

I have a question regarding the computation of the Pearson correlation. Assume that we have the following item,user,rating matrix:

    item,user,rating
    1,2,1
    1,5,1
    1,12,1
    880,2,1
    880,5,1
    880,12,1

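I am running the following example:

    import graphlab

    # Load the ratings file shown above; "rating" is read as an integer column.
    trainingdata = graphlab.SFrame.read_csv("matrix", column_type_hints={"rating": int})

    # Build an item similarity model using the Pearson metric.
    model = graphlab.recommender.item_similarity_recommender.create(
        trainingdata, user_id="user", item_id="item", target="rating",
        similarity_type="pearson")

    # Ask for up to 50 similar items for every item.
    simitems = model.get_similar_items(items=None, k=50)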

The simitems matrix is empty. If we apply the formula for the Pearson correlation to the above example, I think there is a division by zero, correct? The average rating is 1 for every item and the actual ratings are all 1, so every deviation from the mean is 0. However, I would expect these items to have a high similarity value, since they were rated identically by the same users.
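For reference, a commonly used form of the Pearson similarity between two items i and j is written out below. This is an assumption, since the formula referred to above is not reproduced in this post, and the symbols are illustrative: U_ij is the set of users who rated both items, r_ui is user u's rating of item i, and \bar{r}_i is the mean rating of item i.

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\,
        \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}
```

With every rating equal to 1, each difference r_ui - \bar{r}_i is 0, so both the numerator and the denominator are 0.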

Thank you


User 1207 | 2/17/2015, 8:58:46 PM

Hello PKouki,

Thanks for the precise example. I'm also seeing this behavior -- we're looking into this, and we'll get back to you shortly.

Thanks! -- Hoyt

User 1207 | 3/5/2015, 6:39:47 PM

Hi PKouki,

Sorry for not getting back to you earlier. The behavior you are seeing is actually consistent with the Pearson metric: it works on the deviation of each rating from the user's mean rating over all the items they have rated, and in your example those deviations are uniformly 0. As a result, the scores end up being 0 for all items for these users. In GLC 1.3, the item similarity algorithm ends up filtering out all scores equal to 0, which is why the result is empty, but this behavior is something we'll look at in detail for our next release.
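To make this concrete, here is a minimal sketch in plain Python that centers each rating on the user's mean and then computes the two parts of a Pearson score for items 1 and 880. It is only an illustration under stated assumptions, not the GraphLab Create implementation; the helper names center_by_user and pearson_terms are made up for this example.

```python
from collections import defaultdict

# The (item, user, rating) triples from the question.
ratings = [(1, 2, 1), (1, 5, 1), (1, 12, 1),
           (880, 2, 1), (880, 5, 1), (880, 12, 1)]


def center_by_user(triples):
    """Subtract each user's mean rating from that user's ratings."""
    by_user = defaultdict(list)
    for _item, user, rating in triples:
        by_user[user].append(rating)
    means = {user: sum(vals) / len(vals) for user, vals in by_user.items()}
    return {(item, user): rating - means[user]
            for item, user, rating in triples}


def pearson_terms(centered, item_a, item_b):
    """Return (numerator, denominator) of the Pearson score for two items,
    computed over the users who rated both (hypothetical helper)."""
    users_a = {u for (i, u) in centered if i == item_a}
    users_b = {u for (i, u) in centered if i == item_b}
    common = sorted(users_a & users_b)
    a = [centered[(item_a, u)] for u in common]
    b = [centered[(item_b, u)] for u in common]
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return numerator, denominator


centered = center_by_user(ratings)
numerator, denominator = pearson_terms(centered, 1, 880)
print(centered)                # every centered rating is 0.0
print(numerator, denominator)  # both are 0.0
```

Because both the numerator and the denominator are 0, the ratio itself is undefined, so any implementation has to pick a convention for this case. A score of 0, as described above, is one such choice.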

Thanks, -- Hoyt