Server is dying during groupby

User 2172 | 8/17/2015, 6:15:20 AM

My code is very simple:

    import graphlab
    from graphlab import aggregate as agg

    user_cat = graphlab.SFrame.read_csv("my-dataset.csv", header=True)
    user_cat_count = user_cat.groupby(key_columns=['user_id', 'cat_id'],
                                      operations={'count': agg.COUNT()})

The csv file has only two columns and is pretty large, about 75GB. Every time I run the code it crashes on the groupby line with the following error: "Unable to reach server for 4 consecutive pings. Server is considered dead. Please exit and restart." I am using graphlab 1.5.1 on a pretty strong server (24 cores and about 600GB RAM). Am I missing something, or is this a bug? The log shows:

    1439727064 : PROGRESS: (parsecsvstosframe:1104): Parsing completed. Parsed 4915072714 lines in 957.456 secs.
    1439727064 : INFO: (newcache:166): Cache Utilization:4295005184
    1439727065 : INFO: (groupbyaggregate:1021): Function entry
    1439727065 : INFO: (groupbyaggregate:1025): Args: Keys:
    1439727065 : INFO: (groupbyaggregate:1026): user_id,
    1439727065 : INFO: (groupbyaggregate:1026): cat_id,
    1439727065 : INFO: (groupbyaggregate:1027): Groups:
    1439727065 : INFO: (groupbyaggregate:1030): ,
    1439727065 : INFO: (groupbyaggregate:1032): |
    1439727065 : INFO: (groupbyaggregate:1034): Operations:
    1439727065 : INFO: (groupbyaggregate:1035): 0xe6945f0,
    1439727065 : INFO: (groupbyaggregate:1036):
    1439727065 : INFO: (createarraysforwriting:236): Opening Frame for writing to with 96 segments and 3 columns
    1439727065 : INFO: (groupbyaggregate:215): Filling group container:
    1439727065 : INFO: (materialize:219): Materializing: digraph G { "47853328" [label="B: SF(S1,S2)"] "47856048" [label="A: PR(1,0)"] "47853328" -> "47856048" }
    1439728283 : INFO: (main:611): Quiting with received character: 10 feof = 0
    1439728284 : INFO: (~comm_server:207): Function entry
    1439728284 : INFO: (stop:234): Function entry


User 15 | 8/17/2015, 8:46:53 PM


There are known issues with groupby running out of memory if the data has a very lopsided distribution among the group-keys (one group has almost all the values). It's quite troubling that you have 600 GB of RAM and the dataset is only 75 GB though, so I would say this is a bug. There are a few things you can do to help us debug this:

1) Determine if the unity_server process is actually crashing

According to the log, it actually looks like it exited normally.

2) Note memory usage during groupby.

Is your machine actually running out of memory?

3) Tell us what OS you are running

4) Run the example on a smaller subset of the data

If it still crashes on a subset, the problem isn't sheer data size; and if there's one really busy user/category pair, a subset run can help confirm whether that skew is the culprit.
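To check for a lopsided key distribution without going through GraphLab at all, you can stream a sample of the CSV with plain Python and count the most frequent (user_id, cat_id) pairs. This is just a minimal sketch; `top_pairs` is a hypothetical helper name, and the inline sample stands in for your actual my-dataset.csv:

```python
import csv
import io
from collections import Counter

def top_pairs(csv_file, n=5, limit=None):
    """Return the n most common (user_id, cat_id) pairs.

    Streams the file row by row, so memory use is proportional to the
    number of distinct pairs, not the number of rows. `limit` caps how
    many rows to scan, which is enough for a quick skew check."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for i, row in enumerate(reader):
        if limit is not None and i >= limit:
            break
        counts[(row['user_id'], row['cat_id'])] += 1
    return counts.most_common(n)

# Tiny inline sample standing in for my-dataset.csv (made-up data):
sample = io.StringIO(
    "user_id,cat_id\n"
    "1,a\n1,a\n1,a\n2,b\n3,a\n"
)
print(top_pairs(sample, n=2))  # -> [(('1', 'a'), 3), (('2', 'b'), 1)]
```

If one pair dominates the counts even in the first few million rows, that points at the known skewed-groupby memory issue rather than a general capacity problem.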