data growing in left join

User 2565 | 2/6/2016, 9:21:12 PM

full_data = train_data.join(products_data, on='product_uid', how='left')
print len(train_data)
print len(products_data)
print len(full_data)

74067 124428 74320

Isn't my full_data supposed to be the same size as my train_data, given that I do a left outer join?

Comments

User 2565 | 2/10/2016, 1:51:46 PM

The documentation for join states that left is "Equivalent to a SQL left outer join." It led me to think that the requirement for the key was that is was a unique primary key, but this is not required and may not be the case.

In the following example, in the sounds SFrame, the ID 1 is not unique, and therefore the left join appears to be growing.

>>> import graphlab as gl
>>> animals = gl.SFrame({'id': [1, 2, 3, 4],
...                      'name': ['dog', 'cat', 'sheep', 'cow']})
>>> sounds = gl.SFrame({'id': [1, 3, 4, 5, 1],
...                     'sound': ['woof', 'baa', 'moo', 'oink', 'warf']})
>>> animals.join(sounds, how='left')
+----+-------+-------+
| id |  name | sound |
+----+-------+-------+
| 1  |  dog  |  woof |
| 3  | sheep |  baa  |
| 4  |  cow  |  moo  |
| 1  |  dog  |  warf |
| 2  |  cat  |  None |
+----+-------+-------+
[5 rows x 3 columns]

User 2565 | 2/10/2016, 2:21:25 PM

I'm thinking about deleting the rows whose "primary keys" are duplicates, but I didn't find out how to do that yet.


User 2565 | 2/12/2016, 3:16:54 PM

One solution that works for me

import graphlab.aggregate as agg
data.groupby(key_columns='id',
             operations={'name':gl.aggregate.SELECT_ONE('name'),
                         'sound':gl.aggregate.SELECT_ONE('sound')})