Matrix factorization get_similar_items broken?

User 1933 | 2/23/2016, 3:19:01 PM

Hey guys - I've found what I think are two bugs with the factorization recommender's get_similar_items method.

I trained the model no problem, using:

model = gl.recommender.factorization_recommender.create(observation_data=data,target='rating',num_factors=k,solver='als')

But when I try to get similar items using the model I run into two problems:

PROBLEM 1:

If I run

result = model.get_similar_items(items=None,k=1000)

I just get an empty SFrame as the result. Manually specifying items seems to work well enough, so I can get the desired behavior by using

result = model.get_similar_items(items=range(n_items),k=1000)

But given that the docs say "If ‘None’, then return the k most similar items for all items in the training set", this looks like a bug.

PROBLEM 2:

This is the really problematic one. It looks like the similarity ranking is not properly sorting results. If we look at result as defined above:

In [31]: result
Out[31]:
Columns:
        item_id int
        similar int
        distance        float
        rank    int

Rows: 112312000

Data:
+---------+---------+---------------+------+
| item_id | similar |    distance   | rank |
+---------+---------+---------------+------+
|    0    |  63843  | 1.88831686974 |  1   |
|    0    |  107668 | 1.88170653582 |  2   |
|    0    |  100487 | 1.86177480221 |  3   |
|    0    |  86830  | 1.85725367069 |  4   |
|    0    |  112107 | 1.84893345833 |  5   |
|    0    |  69697  | 1.83206981421 |  6   |
|    0    |  25821  | 1.83021712303 |  7   |
|    0    |  49364  | 1.82699304819 |  8   |
|    0    |  40848  | 1.82070058584 |  9   |
|    0    |  86387  | 1.82024693489 |  10  |
+---------+---------+---------------+------+
[112312000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

The problem is evident. get_similar_items is returning the items with greatest cosine distance from the target items, which is of course the opposite of what we want. The most similar items are those with the smallest distance. I confirmed this by actually computing the cosine distance manually between the factors for item 0 and item 63843, which matched the distance above.
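For reference, a minimal sketch of that kind of manual check, in plain Python. The vectors are made up for illustration and are not the model's actual factors; the cosine_distance helper is defined here and is not part of GraphLab Create. Cosine distance is taken as 1 minus cosine similarity, so it runs from 0 (same direction) to 2 (opposite direction), which puts values around 1.88 near the dissimilar end of the range.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up factor vectors, for illustration only.
target = [1.0, 0.0]
aligned = [2.0, 0.0]    # same direction as target
opposed = [-1.0, 0.0]   # opposite direction to target

d_aligned = cosine_distance(target, aligned)   # similar item: small distance
d_opposed = cosine_distance(target, opposed)   # dissimilar item: large distance
```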

The only workaround I can see for the moment would require something like

result = model.get_similar_items(items=range(n_items),k=n_items)

to get all the similarities, but then I'd have to do a crazy groupby operation on a 12B row SFrame, which I'm not looking forward to. Is there an alternate workaround you can think of? I'm on 1.8, so perhaps this is fixed in 1.8.2?

Thanks!

Comments

User 19 | 2/23/2016, 6:58:49 PM

Hi jlorince,

Yes, this is a bug. We agree that we should be returning the items with the highest cosine similarity (that is, the smallest cosine distance). Thanks for reporting this!

As a workaround, you could get the user factors and item factors from model['coefficients'] and create a NearestNeighborsModel to return the most similar items (using either cosine or Euclidean distance).
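The idea behind this workaround can be sketched in plain Python. This is a brute-force stand-in for a nearest neighbors model, not the GraphLab Create API: the item_factors dict below is made-up data, and how the factors are pulled out of model['coefficients'] is assumed rather than shown. The helpers cosine_distance and most_similar_items are hypothetical names defined here.

```python
import heapq
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def most_similar_items(item_factors, k):
    """Map each item id to its k nearest other items, smallest distance first.

    item_factors: dict of item id -> list of floats (the latent factors).
    Returns a dict of item id -> list of (similar id, distance) pairs.
    """
    result = {}
    for item_id, vec in item_factors.items():
        candidates = (
            (other_id, cosine_distance(vec, other_vec))
            for other_id, other_vec in item_factors.items()
            if other_id != item_id  # skip the item itself
        )
        # nsmallest keeps the k pairs with the lowest distance, in ascending order.
        result[item_id] = heapq.nsmallest(k, candidates, key=lambda pair: pair[1])
    return result

# Made-up item factors, for illustration only.
item_factors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [-1.0, 0.0],
    3: [0.0, 1.0],
}
neighbors = most_similar_items(item_factors, k=2)
```

This compares every pair of items, so it only illustrates the ranking logic; for a catalog of the size discussed in this thread, an indexed nearest neighbors implementation is the practical choice.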

Cheers, Chris


User 1933 | 2/24/2016, 1:29:55 AM

Yup - that's exactly what I ended up doing! Thanks!