Cache problem when performing filter and iterate

User 3134 | 1/27/2016, 9:33:59 AM

Hi,

I have a strange problem I hadn't observed before with SFrames. Here's some sample code.

rrp_frame.add_column(graphlab.SArray(add_data['sdly_rooms']), 'sdly_rooms')
rrp_frame.add_column(graphlab.SArray(add_data['sdly_revenue']), 'sdly_revenue')
print len(rrp_frame)

final_rrp_frame = rrp_frame[rrp_frame['sdly_rooms'] != -1]
print len(final_rrp_frame)

The first print works fine; the output is 107332. However, the second print statement just hangs and I have to interrupt the kernel. I used to iterate over more than 200,000 records before, but this is the first time I am observing this issue. I am using graphlab-create 1.8. Every time I run the print statement, I get this in the unity server log.

1453887050 : INFO:     (execute_node:129): Materializing only column subset: digraph G {
	"4437760008" [label="B: PyLambda"]
	"4460661576" [label="A: UP(B:0;C:0)"]
	"4460694600" [label="D: SF(S204)"]
	"4866496392" [label="C: PyLambda"]
	"4437760008" -> "4460661576"
	"4460694600" -> "4437760008"
	"4866496392" -> "4460661576"
	"4460694600" -> "4866496392"
}
1453887050 : INFO:     (new_cache:157): Cache Utilization:18503301

Please help.

Comments

User 16 | 1/28/2016, 1:59:17 AM

Hi tejas_revup -

SFrame operations are done lazily; things are not computed until they are needed.

Executing final_rrp_frame = rrp_frame[rrp_frame['sdly_rooms'] != -1] will be very fast because it is not doing any actual work yet. However, when final_rrp_frame is needed, in this case to print its length, that work has to happen. The log output you posted seems to confirm that this is what is happening.
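Here is a minimal sketch that illustrates the lazy behaviour (the data and column name below are made up, not your actual rrp_frame):

import time
import graphlab

# Hypothetical stand-in for rrp_frame: one integer column with some -1 sentinel values.
sf = graphlab.SFrame({'sdly_rooms': [-1, 5, 12, -1, 7] * 200000})

start = time.time()
filtered = sf[sf['sdly_rooms'] != -1]  # lazy: just records the filter, no rows are scanned yet
print 'filter returned after %.3f seconds' % (time.time() - start)

start = time.time()
print len(filtered)  # forces the filter to actually run (materialization), so the time shows up here
print 'len returned after %.3f seconds' % (time.time() - start)

If anything upstream of your filter uses Python lambdas (the PyLambda nodes in your log), that materialization step has to run those lambdas row by row, which can be much slower than a plain vectorized comparison like the one above.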

You probably just need to give it more time.

Thanks, Toby