User 1841 | 4/29/2015, 7:33:48 PM
I have been working with GraphLab and everything went smoothly until I recently tried running it on a larger graph, the Friendster network from SNAP. GraphLab seems to just stop computing during or shortly after the setup phase (before any output is produced).
GraphLab starts normally and I can see each node in my cluster working. Then, at some point, all the machines just seem to idle; there is almost no work happening on each machine. At most, a single core is being used on each machine (vs. the expected 32 cores). There are no errors; the program is running, but nothing progresses. Even the log files stop updating. In my last run, here were the last few lines of the log file:
DEBUG: dc_dist_object.hpp(__parent_to_child_barrier_release:1393): Barrier Release 1
DEBUG: dc_dist_object.hpp(barrier:1454): barrier phase 2 complete
DEBUG: dc_dist_object.hpp(__parent_to_child_barrier_release:1393): Barrier Release 1
DEBUG: dc_dist_object.hpp(barrier:1454): barrier phase 2 complete
One thing I found in the log from a completed run, that was not in the log of a bad run, was the line "dctcpcomm.cpp(close:247): Closing listening socket". Otherwise, in general, the log files look the same and nothing really stands out to me.
The exact same setup, code, etc. works if I use the LiveJournal network. Moreover, I have split the Friendster network into 32 pieces (for parallel input), and I have tried running GraphLab on subsets of those pieces. For example, it works fine if I run it on the first 16 files, or on the last 16 files; however, when I run it on all 32 files (the full network) it exhibits the above behavior and never completes.
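For context, a split like the one described above could be reproduced with something along these lines. This is only a minimal sketch: the round-robin scheme, the function name `split_edge_list`, and the `part-NN.txt` file names are assumptions for illustration, not necessarily how the 32 files were actually produced.

```python
import os


def split_edge_list(src_path, out_dir, num_parts=32):
    """Distribute the edges of a text edge list round-robin over num_parts files.

    Hypothetical helper: assumes one edge per line and '#' comment lines,
    as in SNAP-style edge lists.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part-{i:02d}.txt") for i in range(num_parts)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(src_path) as src:
            kept = 0
            for line in src:
                # Skip comment headers and blank lines so every part holds only edges.
                if line.startswith("#") or not line.strip():
                    continue
                outs[kept % num_parts].write(line)
                kept += 1
    finally:
        for f in outs:
            f.close()
    return paths
```

Running only on the first or last 16 files then amounts to pointing the loader at half of the returned paths, which roughly halves the edge count without changing anything else in the setup.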
This led me to suspect a memory problem, but I don't believe that is the cause: each machine in the cluster I am using has 60GB of RAM, and GraphLab seems to be using only around 10GB.
Please let me know if you have any insights into what may be causing this issue and how to fix it... I am out of ideas. I would be happy to attach any logs or anything else that may be useful for figuring this out. I am running SSSP, connected components, and PageRank on an 8-machine Amazon EC2 cluster.