improve speed, sframe left join

User 2266 | 7/20/2016, 4:42:17 AM

Hi, paid customer here. We are currently running Dato (Turi) on a 24-CPU / 96 GB machine, with the following config:

Setting overall SFrame memory consumption = 40GB*2

gl.set_runtime_config("GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY", int(42949672960*2))

Setting a single file SFrame memory consumption = 40GB*1.5

gl.set_runtime_config("GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE", int(42949672960*1.5))

Setting sort buffer to 8 GB

gl.set_runtime_config("GRAPHLAB_SFRAME_SORT_BUFFER_SIZE", int(8589934592))

Setting table join buffer to 2048576000 cells

gl.set_runtime_config("GRAPHLAB_SFRAME_JOIN_BUFFER_NUM_CELLS", int(2048576000))

Setting group-by buffer to 2048576000 rows

gl.set_runtime_config("GRAPHLAB_SFRAME_GROUPBY_BUFFER_NUM_ROWS", int(2048576000))

Setting file handle pool size to 4000

gl.set_runtime_config("GRAPHLAB_SFRAME_FILE_HANDLE_POOL_SIZE", int(4000))

Configure Dato (GraphLab) to use all available CPUs
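
For reference, the byte values in the config above can be checked with a few lines of plain Python. This is only a sanity-check sketch: it assumes "GB" in the post means GiB (1024**3 bytes), and the helper name `to_gib` is made up for illustration. Going by their names, the join and group-by settings are counts of cells and rows rather than byte sizes, so they are left out.

```python
# Sanity check of the byte sizes used in the runtime config.
# Assumption: "GB" means GiB, i.e. 1024**3 bytes.
GIB = 1024 ** 3


def to_gib(num_bytes):
    """Convert a byte count to GiB (hypothetical helper, not part of GraphLab)."""
    return num_bytes / float(GIB)


cache_capacity = int(42949672960 * 2)           # overall cache capacity
cache_capacity_per_file = int(42949672960 * 1.5)  # per-file cache capacity
sort_buffer = int(8589934592)                    # sort buffer

print(to_gib(cache_capacity))           # 80.0
print(to_gib(cache_capacity_per_file))  # 60.0
print(to_gib(sort_buffer))              # 8.0
```

On these numbers the overall cache capacity plus the sort buffer come to 88 GiB, which is close to the 96 GB of RAM on the machine.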


We have 2 SFrames: 100 million rows on the left side and a mere 20,000 rows on the right side. Running the following command

test = sfTrain.join(sfLag, on=['key1','key2','key3'], how='left' )

takes forever.
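
To make the operation concrete, here is a minimal plain-Python sketch of what a left join on three key columns computes when the right side is small enough to index in memory. It illustrates the semantics only and is not how SFrame implements `join`. The function name `left_join` and the sample columns `x` and `lag` are hypothetical, and name collisions between non-key columns are ignored.

```python
def left_join(left_rows, right_rows, keys):
    """Left join two lists of dicts on `keys` via a hash index on the right side."""
    # Build an index from key tuple to the matching right-side rows.
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    # Non-key columns contributed by the right side.
    right_cols = [c for c in (right_rows[0] if right_rows else {}) if c not in keys]

    out = []
    for row in left_rows:
        matches = index.get(tuple(row[k] for k in keys))
        if matches:
            for m in matches:
                merged = dict(row)
                merged.update({c: m[c] for c in right_cols})
                out.append(merged)
        else:
            # No match: keep the left row and fill right columns with None.
            merged = dict(row)
            merged.update({c: None for c in right_cols})
            out.append(merged)
    return out


# Hypothetical sample data.
left = [
    {"key1": 1, "key2": "a", "key3": 10, "x": 0.5},
    {"key1": 2, "key2": "b", "key3": 20, "x": 1.5},
]
right = [{"key1": 1, "key2": "a", "key3": 10, "lag": 7}]

result = left_join(left, right, ["key1", "key2", "key3"])
print(result)
```

Every left row survives the join; the second row has no match on the right, so its `lag` value is `None`.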

There are about 64 SFrames which we would like to left join, and at the current estimate it would take a good 7 hours just to left join these small tables.
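
As a rough back-of-the-envelope check on that estimate, the sketch below turns 7 hours over 64 joins into a per-join time and a row throughput. It assumes every join takes about the same time and that the left side stays at 100 million rows throughout; neither is stated in the post.

```python
# Rough per-join cost implied by the estimate.
# Assumptions: all joins take equal time; the left side stays at 100M rows.
total_hours = 7.0
num_joins = 64
left_rows = 100 * 10 ** 6

seconds_per_join = total_hours * 3600 / num_joins
rows_per_second = left_rows / seconds_per_join

print(seconds_per_join)        # 393.75
print(int(rows_per_second))    # left-side rows processed per second
```

That is roughly six and a half minutes per join, or on the order of 250,000 left-side rows per second.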

Is there something we could do on our end to speed this up?


User 2266 | 7/20/2016, 12:09:51 PM