SFrame aggregation fails

User 4616 | 5/23/2016, 8:50:37 AM

Hi, I have an SFrame read from a 140 GB file, and after filtering and trimming I am working with an SFrame of shape (160,582,072, 7). When I try to aggregate the SFrame, the IPython process crashes: Process finished with exit code 137

How can I see what went wrong? How can I still group this SFrame?

The dmesg log shows out of memory:

[80968.192357] Out of memory: Kill process 32029 (python) score 436 or sacrifice child
[80968.192359] Killed process 32029 (python) total-vm:15724008kB, anon-rss:14113180kB, file-rss:0kB
[80975.978876] ata6: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[80975.978880] ata6: irq_stat 0x00400040, connection status changed
[80975.978882] ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
[80975.978884] ata6: hard resetting link
[80976.701110] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Thanks, Keren

Comments

User 1774 | 5/23/2016, 5:15:14 PM

Hi Keren, the GraphLab logs are under /tmp, and I would be grateful if you could post yours. What kind of aggregation did you try to do? Also, what are your machine's specs - number of CPUs, RAM, and disk sizes?


User 4616 | 5/29/2016, 12:52:44 PM

Hi Guy, I'm using an i7 CPU, 16 GB of RAM, and a 500 GB hard drive.

My main goal is to perform a distinct operation on all the fields. I've tried to use the unique function, but it kept crashing, so I tried to group by all the columns and add a 'count' column using the following line:

sfmnf = sf.groupby(sf.column_names(), {'count': agg.COUNT(c.MANUFACTURER_NAME)})
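
For reference, the intent as a minimal runnable sketch (assuming sf is the loaded SFrame; agg.COUNT() with no argument simply counts the rows in each group):

import graphlab
from graphlab import aggregate as agg

# Group by every column: identical rows fall into the same group,
# so the result is the set of distinct rows plus a duplicate count.
sfmnf = sf.groupby(sf.column_names(), {'count': agg.COUNT()})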

This is part of the log that it produces (copied from the top; the underscores in identifiers were stripped by the forum, restored here):

1464525921 : INFO: (initialize_globals_from_environment:280): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /home/kerenk/miniconda/envs/garage/lib/python2.7/site-packages/certifi/cacert.pem
1464525921 : INFO: (initialize_globals_from_environment:280): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
1464525921 : INFO: (comm_server:87): my alt bind address: inproc://sframe_server
1464525921 : INFO: (comm_server:125): Server listening on: inproc://sframe_server
1464525921 : INFO: (comm_server:127): Server Control listening on: inproc://sframe_server_control
1464525921 : INFO: (comm_server:129): Server status published on: inproc://sframe_server_status
1464525921 : INFO: (register_function:461): Registering function object_factory_base::make_object
1464525921 : INFO: (register_function:461): Registering function object_factory_base::ping
1464525921 : INFO: (register_function:461): Registering function object_factory_base::delete_object
1464525921 : INFO: (register_function:461): Registering function object_factory_base::get_status_publish_address
1464525921 : INFO: (register_function:461): Registering function object_factory_base::get_control_address
1464525921 : INFO: (register_function:461): Registering function object_factory_base::sync_objects
1464525921 : INFO: (register_toolkit_function:32): Function entry
1464525921 : INFO: (register_toolkit_class:17): Function entry
1464525921 : INFO: (odbc_connector:179): Function entry
1464525921 : INFO: (~odbc_connector:183): Function entry
1464525921 : INFO: (clear:189): Function entry
1464525921 : INFO: (connect_odbc_shim:140): Trying "/libodbc.so"
1464525921 : INFO: (connect_odbc_shim:140): Trying "/libodbc.dylib"
1464525921 : INFO: (connect_odbc_shim:140): Trying "libodbc.so"
1464525921 : INFO: (connect_odbc_shim:140): Trying "libodbc.dylib"
1464525921 : INFO: (connect_odbc_shim:165): Unable to load libodbc.{so,dylib}
1464525921 : INFO: (connect_odbc_shim:168): /libodbc.so: cannot open shared object file: No such file or directory
1464525921 : INFO: (connect_odbc_shim:168): /libodbc.dylib: cannot open shared object file: No such file or directory
1464525921 : INFO: (connect_odbc_shim:168): libodbc.so: cannot open shared object file: No such file or directory
1464525921 : INFO: (connect_odbc_shim:168): libodbc.dylib: cannot open shared object file: No such file or directory
1464525921 : INFO: (register_toolkit_class:17): Function entry
1464525921 : INFO: (unity_global:42): Function entry
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::summary
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_vertex_fields
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_edge_fields
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_vertex_field_types
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_edge_field_types
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_vertices
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::get_edges
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::save_graph
1464525921 : INFO: (register_function:461): Registering function unity_sgraph_base::load_graph
1464525921 : INFO: (register_function:461): Regis



User 15 | 5/31/2016, 6:24:44 PM

Hi,

What are the column types of your SFrame?

I'm guessing what's happening is that either:
- one or more of your columns has values that are much bigger than expected (for instance, large strings or dictionaries)
- many rows in your dataset are the same, which makes one "group" much larger than expected

I think it's probably the first one, as that's the more common way to run us out of memory. You can combat this by setting some runtime config variables that tell the underlying algorithms to split their buckets into smaller chunks. For groupby, this is:

graphlab.set_runtime_config('GRAPHLAB_SFRAME_GROUPBY_BUFFER_NUM_ROWS', <some smaller number than it is>)

Since you moved to groupby from unique because it was crashing, I would suggest you go back to using unique, as that is the intended tool for the job. It's kind of hidden, but unique is just a wrapper around our join algorithm. To do the tweaking for that one, you'd do it this way:

graphlab.set_runtime_config('GRAPHLAB_SFRAME_JOIN_BUFFER_NUM_CELLS', <some smaller number than it is>)

You can check what the variables are set at currently with:

graphlab.get_runtime_config()
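
Put together, the workaround looks something like the sketch below (get_runtime_config() returns a dict keyed by variable name; halving the values is illustrative, not a recommendation - any sufficiently smaller number should do):

import graphlab

# Check the current buffer sizes.
cfg = graphlab.get_runtime_config()
groupby_rows = cfg['GRAPHLAB_SFRAME_GROUPBY_BUFFER_NUM_ROWS']
join_cells = cfg['GRAPHLAB_SFRAME_JOIN_BUFFER_NUM_CELLS']

# Shrink the buffers so the algorithms work in smaller in-memory
# chunks. Keep reducing if the process is still killed.
graphlab.set_runtime_config('GRAPHLAB_SFRAME_GROUPBY_BUFFER_NUM_ROWS', groupby_rows // 2)
graphlab.set_runtime_config('GRAPHLAB_SFRAME_JOIN_BUFFER_NUM_CELLS', join_cells // 2)

# Retry the deduplication.
distinct = sf.unique()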

I apologize that this is happening, and for the arcane set of instructions to work around it. Basically any time an SFrame runs your system out of memory, it is a bug, so thank you for reporting it. We'll work on fixing it so you don't have to do this in the future. In the meantime, I hope this can work around your error.


User 1774 | 5/31/2016, 6:56:38 PM

Also, although (as Evan pointed out) we are supposed to be able to handle these sizes, perhaps a little map-reduce can work here, something like:

part1, part2 = sf.random_split(0.5)  # split - can be to a larger number of parts...
part1 = part1.unique()               # map 1
part2 = part2.unique()               # map 2
sf = part1.append(part2)             # shuffle
sf = sf.unique()                     # reduce

You could do a similar thing with groupby + gl.aggregate.COUNT() on the map stages and groupby + gl.aggregate.SUM("Count") on the reduce stages, as sketched below.
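
A minimal sketch of that groupby variant (the 'Count' column name is illustrative):

import graphlab as gl

cols = sf.column_names()

# Map: collapse duplicates within each part, keeping a partial count.
part1, part2 = sf.random_split(0.5)
part1 = part1.groupby(cols, {'Count': gl.aggregate.COUNT()})
part2 = part2.groupby(cols, {'Count': gl.aggregate.COUNT()})

# Shuffle + reduce: a row present in both parts appears twice after the
# append, so group again and sum the partial counts.
sf_counts = part1.append(part2).groupby(cols, {'Count': gl.aggregate.SUM('Count')})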

The question is: do you expect to see many duplicates in this data? (If there are few duplicates, the map stages won't shrink the parts much, so this won't save memory.)