How to use groupby/aggregate to return top 5 items from one column as scored by another

User 2568 | 5/3/2016, 3:30:00 AM

I'm competing in the [Kaggle Expedia Hotel Recommendations](https://www.kaggle.com/c/expedia-hotel-recommendations/data) competition, and I have an SFrame of hotel search log data with 35M rows. I want to group by search id and return the top five hotels by relevance for each id. In effect, I want aggregate.ARGMAX('hotel', 'relevance') to return a list of the top 5 items, not just one.

This is detailed in this notebook

I have three questions:

1. In the notebook, is there a better way to write the code? It seems fine and the performance is OK, but I'm interested in learning.
2. I'd like to request that ARGMAX be extended to take an optional third parameter that determines how many items to return. This would not be unusual for recommendation-type problems.
3. There is an odd behaviour noted at the end of the notebook, where an int() is converted to a float(). Is this a bug or my misunderstanding?

Comments

User 1189 | 5/3/2016, 5:44:52 PM

Hi,

1, 2. Code looks fine. I agree an ARGMAX top-k will be very useful here. I will look into supporting that.

3. The result can be forced to a list using ...apply(lambda ..., list), which bypasses type inference by explicitly stating the type. Sometimes the type inference can be awkward.
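For reference, the top-k behaviour being requested can be sketched in plain Python, independent of SFrame. This is only an illustration of the intended semantics, not the notebook's code or an SFrame API: the function name `top_k_per_group`, the `(search_id, hotel, relevance)` row layout, and the sample data are all assumptions made for the example.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k=5):
    """Map each search id to its hotels, highest relevance first, at most k each.

    `rows` is an iterable of (search_id, hotel, relevance) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for search_id, hotel, relevance in rows:
        groups[search_id].append((relevance, hotel))
    return {
        search_id: [
            hotel
            for _, hotel in heapq.nlargest(k, pairs, key=lambda pair: pair[0])
        ]
        for search_id, pairs in groups.items()
    }


# Made-up sample rows, for illustration only.
rows = [
    (1, "A", 0.2), (1, "B", 0.9), (1, "C", 0.5),
    (2, "D", 0.1), (2, "E", 0.7),
]
top2 = top_k_per_group(rows, k=2)
```

This holds every group in memory, so it only shows what a top-k ARGMAX would compute; at 35M rows the grouping should still be done by the SFrame aggregators.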

Yucheng