Distinct count in groupby?

User 512 | 9/14/2015, 4:37:54 PM

Is there any way to do distinct value count in groupby? The "COUNT" function sees only counting the number of rows, not distinct values.

Comments

User 91 | 9/14/2015, 4:58:55 PM

Can you provide a sample code snipped where the COUNT does not perform this operation? If you include the column whose distinct value you are interested in into the groupby columns, then you can get the distinct count (you will need an additional groupby)

As an example, suppose I have an SFrame with 2 columns a and b

`Columns: a int b int Columns: a int b int

Rows: 12

Data: +---+---+ | a | b | +---+---+ | 1 | 1 | | 2 | 2 | | 3 | 1 | | 1 | 2 | | 2 | 1 | | 3 | 2 | | 1 | 1 | | 2 | 2 | | 3 | 1 | | 1 | 2 | +---+---+ [12 rows x 2 columns] Note: Only the head of the SFrame is printed. You can use printrows(numrows=m, num_columns=n) to print more rows and columns. `

Then I can get the number if unique entries in 'a' grouped by 'b' as follows

` sf.groupby(['b', 'a'], gl.aggregate.COUNT('a')).groupby('b', gl.aggregate.COUNT('a'))

Columns: b int Count int

Rows: 2

Data: +---+-------+ | b | Count | +---+-------+ | 2 | 3 | | 1 | 3 | +---+-------+ [2 rows x 2 columns] `

Note: that this is not the same as sf.groupby('b', gl.aggregate.COUNT('a')) which will not give you unique values of a.

Note 2: We are releasing sf.groupby('b', gl.aggregate.COUNT_DISTINCT('a'))` in the upcoming release (coming very soon) which will make this easier.


User 512 | 9/14/2015, 5:59:57 PM

Thanks for the information! Below is an example:

test = gl.SFrame({'id':[1,1,2,3],'val':['A','A','B','C']})

And I want to have the outputs as:

id COUNT COUNT_DISTINCT 1 2 1 2 1 1 3 1 1


User 19 | 9/14/2015, 6:03:12 PM

In the next version (1.6), you can get this output via:

`

test.groupby('id', {'count': gl.aggregate.COUNT(), 'countdistinct': gl.aggregate.COUNTDISTINCT('id')}) Columns: id int count int count_distinct int

Rows: 3

Data: +----+-------+----------------+ | id | count | count_distinct | +----+-------+----------------+ | 3 | 1 | 1 | | 1 | 2 | 1 | | 2 | 1 | 1 | +----+-------+----------------+

`


User 91 | 9/14/2015, 6:09:48 PM

The following gives you counts:

`

test.groupby(['id', 'val'], gl.aggregate.COUNT('val')) Columns: id int val str Count int

Rows: 3

Data: +----+-----+-------+ | id | val | Count | +----+-----+-------+ | 3 | C | 1 | | 1 | A | 2 | | 2 | B | 1 | +----+-----+-------+ [3 rows x 3 columns] ` while the following gives you distinct counts

`

test.groupby(['id', 'val'], gl.aggregate.COUNT('val')) .groupby('id', {'distinct': gl.aggregate.COUNT('val')})

Columns: id int distinct int

Rows: 3

Data: +----+----------+ | id | distinct | +----+----------+ | 2 | 1 | | 3 | 1 | | 1 | 1 | +----+----------+ [3 rows x 2 columns] `

You can combine them in one call as follows:

`

test.groupby(['id', 'val'], gl.aggregate.COUNT('val')) .groupby('id', {'distinct': gl.aggregate.COUNT('val'), 'count': gl.aggregate.SUM('Count')})

id	int
count	int
distinct	int

Rows: 3

Data: +----+-------+-----------+ | id | count | distinct | +----+-------+-----------+ | 2 | 1 | 1 | | 3 | 1 | 1 | | 1 | 2 | 1 | +----+-------+-----------+ [3 rows x 3 columns] `

Note that the next release will have a COUNT_DISTINCT which will make this easier.


User 512 | 9/14/2015, 6:38:04 PM

Thanks! That is exactly what I was thinking about (using two groupby to get the distinct counts), I am just wondering if there is any better way to do that. I like the new "COUNT_DISTINCT" option in 1.6, do you know when that will be available?