item_similarity with pearson correlation

User 902 | 11/4/2014, 12:11:03 PM

Hello, I was trying the item_similarity_recommender feature of GraphLab Create. The Jaccard and cosine metrics work nicely, but Pearson behaves strangely for binary implicit input. The following code, for instance, returns no results: <code>
import graphlab

# Implicit feedback: each row only says that a user interacted with an item
sf = graphlab.SFrame({'user_id': ['0', '0', '0', '1', '1', '2', '2', '2'],
                      'item_id': ['a', 'b', 'c', 'a', 'b', 'b', 'c', 'd']})
m = graphlab.item_similarity_recommender.create(sf, similarity_type='pearson')
m.get_similar_items(['a'])
</code> It seems that all scores are equal to 0. Maybe I'm missing something here, but I think the Pearson correlation between items 'a' ([1,1,0]) and 'c' ([1,0,1]) should be -0.5 in this example. Shouldn't it?
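
The -0.5 figure can be checked by hand with the textbook Pearson formula, treating each missing interaction as 0. The sketch below is plain Python and does not use GraphLab Create; the helper name `pearson` and the variable names are chosen here for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Textbook Pearson correlation over two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# One entry per user (0, 1, 2); 1 = interacted, 0 = missing treated as zero
item_a = [1, 1, 0]
item_c = [1, 0, 1]

r_ac = pearson(item_a, item_c)
print(round(r_ac, 6))  # -0.5
```

Both items have mean 2/3, so the covariance term sums to -1/3 and each variance term to 2/3, giving (-1/3) / (2/3) = -0.5.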

Regards

Comments

User 19 | 11/4/2014, 5:38:35 PM

The way we have implemented Pearson similarity is <a href="http://graphlab.com/products/create/docs/generated/graphlab.recommender.itemsimilarityrecommender.ItemSimilarityRecommender.html#graphlab.recommender.itemsimilarityrecommender.ItemSimilarityRecommender">documented here</a>. As you'll see, the mean rating is subtracted from each rating, and the sum runs only over the users that items i and j have in common, U_ij. For implicit data, every rating is 1 and every mean rating is 1, so each mean-centered term is 0 and the numerator is 0. We chose this definition to be consistent with <a href="http://en.wikipedia.org/wiki/Collaborativefiltering">Wikipedia's description</a>.
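
To make the effect concrete, here is a minimal sketch of a Pearson similarity that sums only over the common users, as described above. It is not GraphLab Create's actual code: the function name `pearson_common_users`, the dict-based input, taking each item's mean over its own observed ratings, and returning 0.0 when the denominator vanishes are all assumptions made for illustration.

```python
from math import sqrt

def pearson_common_users(ratings_i, ratings_j):
    """Pearson similarity summed only over users who rated both items.

    ratings_i, ratings_j: dicts mapping user id -> rating for one item.
    Assumption: each item's mean is taken over its own observed ratings.
    """
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    common = set(ratings_i) & set(ratings_j)

    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den_i = sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common))
    den_j = sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    if den_i == 0 or den_j == 0:
        return 0.0  # assumption: report 0 when the ratio is undefined
    return num / (den_i * den_j)

# Implicit data: every observed rating is 1, so every centered term is 0
implicit_a = {'0': 1, '1': 1}
implicit_c = {'0': 1, '2': 1}
score_implicit = pearson_common_users(implicit_a, implicit_c)

# Explicit ratings (made-up values) give a non-trivial score
explicit_x = {'0': 5, '1': 1, '2': 3}
explicit_y = {'0': 4, '1': 2, '2': 3}
score_explicit = pearson_common_users(explicit_x, explicit_y)

print(score_implicit, round(score_explicit, 6))
```

With all-ones input the centered ratings vanish for every item pair, which matches the all-zero scores reported in the question; the explicit example shows that the same formula is informative once ratings vary.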

Thank you for getting in touch. Please feel free to ask more questions!


User 902 | 11/5/2014, 11:39:05 AM

Thanks for your quick answer!

I understand. Your Pearson similarity doesn't treat missing values as 0; it ignores them in the calculation. I suppose this provides an efficient way to deal with very big sparse vectors. However, I think this behaviour has 2 flaws:
- It's not useful for implicit binary data.
- It doesn't treat all item/user vectors as having the same dimension, so it's not consistent with other Pearson correlation implementations such as scipy.stats.pearsonr or R's cor.


User 1768 | 12/21/2015, 6:16:48 PM

Hi, I have an SFrame with n users, and each user has a time-series vector of values. For each pair of users, I need to compute the correlation between their time-series vectors over a fixed time period. The values are continuous. Is it possible to calculate Pearson’s rho for this in GraphLab Create? As far as I know, the Pearson correlation in GraphLab Create supports only a single feature value per user (not a vector) and only categorical values. Is that correct? Please advise if this is possible. Thanks
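
For the pairwise time-series case described above, one option that does not depend on the recommender toolkit is to compute Pearson’s rho directly on the vectors. The following is a minimal sketch in plain Python, assuming every user's series has the same length and fits in memory; the `series` dict, its user ids, and its values are made up for illustration, and nothing here calls GraphLab Create.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's rho for two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float('nan')  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one continuous time series per user over the same period
series = {
    'u1': [1.0, 2.0, 3.0, 4.0],
    'u2': [2.0, 4.0, 6.0, 8.0],
    'u3': [4.0, 3.0, 2.0, 1.0],
}

# Correlation for every unordered pair of users
pairwise = {
    (a, b): pearson(series[a], series[b])
    for a, b in combinations(sorted(series), 2)
}

for pair, rho in sorted(pairwise.items()):
    print(pair, round(rho, 6))
```

The number of pairs grows quadratically with the number of users, so for large n a vectorised routine such as NumPy's `corrcoef` on a users-by-time matrix would likely be a better fit than this loop.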