Question about SFrame.sample(...)

User 1255 | 5/29/2015, 3:47:11 AM

Hi!

Thanks for the great work on GraphLab!

Question: Is the following SFrame.sample(...) behavior a feature or a bug?

` import graphlab as gl

sf = gl.SFrame({"id":range(0,15000)})

print len(sf.sample(.3)) #prints 4400
print len(sf.sample(.3)) #prints 3587
print len(sf.sample(.3)) #prints 4375
print len(sf.sample(.3)) #prints 3716 `

Ideally the returned SFrame should contain "approximately the fraction times the number of rows", e.g., 0.3 * 15000 ~= 4500.

See https://dato.com/products/create/docs/generated/graphlab.SFrame.sample.html

Cheers! :)

Comments

User 1178 | 5/29/2015, 6:07:28 PM

Hi Jason,

This behavior is by design. Sampling currently works by iterating over the SFrame rows in parallel and, for each row, tossing a coin weighted by the sampling probability to decide whether to keep that row. So the result won't contain exactly the number of rows you expect. I ran the same code you have and it returns roughly 4500 rows:

` In [10]: print len(sf.sample(.3))
4529

In [11]: print len(sf.sample(.3))
4480

In [12]: print len(sf.sample(.3))
4487

In [13]: print len(sf.sample(.3))
4487

In [14]: print len(sf.sample(.3))
4475

`
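
The per-row coin toss described above can be sketched in plain Python. This is only an illustration of the idea, not GraphLab's actual implementation: the helper name `bernoulli_sample` is made up for this example, it runs serially rather than in parallel, and it uses the standard `random` module instead of whatever generator the SFrame uses internally.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of GraphLab.
    """
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the test below succeeds
    # with probability `fraction` for every row, independently.
    return [row for row in rows if rng.random() < fraction]


rows = range(15000)
# Five independent samples, each with a different seed.
counts = [len(bernoulli_sample(rows, 0.3, seed=s)) for s in range(5)]
print(counts)
```

Under this scheme the row count changes from call to call, but it should stay close to 0.3 * 15000 = 4500.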


User 1255 | 5/29/2015, 7:28:48 PM

Hi Ping Wang,

Thanks for the prompt reply! :)

I understand the number of rows returned will be approximate (i.e., ~4500 in our example), but I find it strange that the function returned SFrames with row counts far from ~4500.

When convenient, can you try the following snippet?

` from time import sleep
import graphlab as gl

sf = gl.SFrame({"id":range(0,15000)})

for i in range(0,5):
    sleep(2)
    print len(sf.sample(.3))

prints

3597

3629

5638

3969

4254

`

P.S. This is using the Dato-Core code available on github.com, which should be equivalent to GraphLab-Create 1.3 (minus the fancy stuff!).

Sincerely,

[JASON@CHAW]


User 19 | 6/8/2015, 4:47:34 PM

Hi Jason,

I agree: the values you report are outside the expected range. On GLC 1.4, I get the following with your snippet (using a larger range): 4412 4413 4548 4478 4494 4504 4533 4518 4488 4523 4469 4480
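
As a rough check on what "expected range" means here: assuming each row is kept independently with probability 0.3 (the coin-toss scheme described earlier in the thread), the number of rows returned follows a binomial distribution. The sketch below computes its mean and standard deviation and compares the counts reported above against them. The helper name `binomial_spread` is made up for this example.

```python
import math


def binomial_spread(n_rows, fraction):
    """Mean and standard deviation of the row count when each of
    `n_rows` rows is kept independently with probability `fraction`.

    Hypothetical helper for illustration only.
    """
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1.0 - fraction))
    return mean, std


mean, std = binomial_spread(15000, 0.3)

# Row counts reported earlier in the thread for sf.sample(.3).
reported = [3597, 3629, 5638, 3969, 4254]
z_scores = [(count - mean) / std for count in reported]

for count, z in zip(reported, z_scores):
    print("%d rows is %.1f standard deviations from the mean" % (count, z))
```

With 15000 rows and a fraction of 0.3, the standard deviation is about 56 rows, so independent coin tosses should almost always land within a couple of hundred rows of 4500. Counts such as 3597 or 5638 are far outside that band.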