User 1933 | 2/23/2016, 3:19:01 PM

Hey guys - I've found what I think are two bugs with the factorization recommender's `get_similar_items` method.

I trained the model no problem, using:

`model = gl.recommender.factorization_recommender.create(observation_data=data,target='rating',num_factors=k,solver='als')`

But when I try to get similar items using the model I run into two problems:

PROBLEM 1:

If I run

`result = model.get_similar_items(items=None,k=1000)`

I just get an empty SFrame as the result. Manually specifying `items` seems to work well enough, so I can get the desired behavior by using

`result = model.get_similar_items(items=range(n_items),k=1000)`

But given that the docs say `If ‘None’, then return the k most similar items for all items in the training set`, this looks like a bug.

PROBLEM 2:

This is the really problematic one. It looks like the similarity ranking is not properly sorting results. If we look at `result` as defined above:

```
In [31]: result
Out[31]:
Columns:
item_id int
similar int
distance float
rank int
Rows: 112312000
Data:
+---------+---------+---------------+------+
| item_id | similar | distance | rank |
+---------+---------+---------------+------+
| 0 | 63843 | 1.88831686974 | 1 |
| 0 | 107668 | 1.88170653582 | 2 |
| 0 | 100487 | 1.86177480221 | 3 |
| 0 | 86830 | 1.85725367069 | 4 |
| 0 | 112107 | 1.84893345833 | 5 |
| 0 | 69697 | 1.83206981421 | 6 |
| 0 | 25821 | 1.83021712303 | 7 |
| 0 | 49364 | 1.82699304819 | 8 |
| 0 | 40848 | 1.82070058584 | 9 |
| 0 | 86387 | 1.82024693489 | 10 |
+---------+---------+---------------+------+
[112312000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
```

The problem is evident. `get_similar_items` is returning the items with the *greatest* cosine distance from the target items, which is of course the opposite of what we want. The most similar items are those with the *smallest* distance. I confirmed this by actually computing the cosine distance manually between the factors for item 0 and item 63843, which matched the distance above.
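
A manual check along those lines can be sketched as follows. This is only an illustration: it assumes cosine distance is defined as 1 minus cosine similarity, and the vectors are made-up values, not factors taken from the model.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of u and v (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical factor vectors, for illustration only.
target = [1.0, 0.0]
same_direction = [2.0, 0.0]
opposite = [-1.0, 0.0]

print(cosine_distance(target, same_direction))  # 0.0
print(cosine_distance(target, opposite))        # 2.0
```

Under that definition the distance ranges from 0 to 2, so values around 1.88 sit near the "least similar" end of the scale.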

The only workaround I can think of for the moment would be something like

`result = model.get_similar_items(items=range(n_items),k=n_items)`

to get *all* the similarities, but then I'd have to do a crazy groupby operation on a 12B-row SFrame, which I'm not looking forward to. Is there an alternate workaround you can think of? I'm on 1.8, so perhaps this is fixed in 1.8.2?

Thanks!

User 19 | 2/23/2016, 6:58:49 PM

Hi jlorince,

Yes, this is a bug. We agree that we should be returning the items with the highest cosine similarity (equivalently, the smallest cosine distance). Thanks for reporting this!

As a workaround, you could get the user factors and item factors from `model['coefficients']` and create a `NearestNeighborsModel` to return the most similar items (using either cosine or Euclidean distance).
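
To show the ordering this workaround should produce, here is a minimal brute-force sketch in plain Python. It is a stand-in for the nearest-neighbors model, not GraphLab code: the `most_similar_items` helper, the dict input format, and the factor values are all hypothetical, and it assumes cosine distance means 1 minus cosine similarity.

```python
import heapq
import math

def normalize(v):
    """Scale v to unit length so a dot product gives cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_items(item_factors, k):
    """Map each item id to its k nearest items as (distance, id), closest first.

    item_factors: dict of item id -> factor vector (hypothetical input format).
    """
    unit = {item: normalize(vec) for item, vec in item_factors.items()}
    result = {}
    for item, u in unit.items():
        # Cosine distance to every other item, skipping the item itself.
        candidates = (
            (1.0 - sum(a * b for a, b in zip(u, v)), other)
            for other, v in unit.items()
            if other != item
        )
        # Keep the k *smallest* distances, i.e. the most similar items.
        result[item] = heapq.nsmallest(k, candidates)
    return result

# Made-up factors, for illustration only.
factors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}
neighbors = most_similar_items(factors, k=2)
print([other for _, other in neighbors[0]])  # [1, 2]
```

Comparing every pair is quadratic in the number of items, so this only illustrates the expected ranking; at the scale in this thread, a proper nearest-neighbors index is the practical route.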

Cheers, Chris

User 1933 | 2/24/2016, 1:29:55 AM

Yup - that's exactly what I ended up doing! Thanks!