PowerGraph: Machines idling during computation on a large network

User 1841 | 4/29/2015, 7:33:48 PM

Hi all,

I have been working with GraphLab and everything went smoothly until recently, when I tried running on a larger graph, the Friendster network from SNAP. GraphLab seems to just stop computing sometime during or after the setup phase (before any output is written).

GraphLab starts normally and I can see each node in my cluster working. Then, at some point, all the machines just seem to idle; there is almost no work happening on any machine. At most, a single core is being used on each machine (vs. the expected 32 cores). There are no errors; the program is still running, but nothing progresses. Even the log files stop updating. In my last run, these were the last few lines of the log file:

DEBUG:    dc_dist_object.hpp(__parent_to_child_barrier_release:1393): Barrier Release 1
DEBUG:    dc_dist_object.hpp(barrier:1454): barrier phase 2 complete
DEBUG:    dc_dist_object.hpp(__parent_to_child_barrier_release:1393): Barrier Release 1
DEBUG:    dc_dist_object.hpp(barrier:1454): barrier phase 2 complete

One thing I found in the log from a completed run that was not in the log of a bad run was the line "dctcpcomm.cpp(close:247): Closing listening socket". Otherwise, the log files look the same and nothing really stands out to me.

The exact same setup, code, etc. works if I use the LiveJournal network. Moreover, I have split the Friendster network into 32 pieces (for parallel input), and I have tried running GraphLab on subsets of those pieces. For example, it works fine if I run it on the first 16 files or on the last 16 files; however, when I run it on all 32 files (the full network), it exhibits the above behavior and won't complete.
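For anyone trying to reproduce this, here is a minimal sketch of one way to split an edge list into 32 pieces. The post does not say how the split was actually done, so everything here is an assumption: the function name `split_edge_list`, the `part-NN.tsv` file names, the round-robin assignment, and the input layout (one edge per line, with `#` comment lines, as in SNAP downloads).

```python
from pathlib import Path


def split_edge_list(src, out_dir, num_shards=32):
    """Split a text edge list into num_shards files, round-robin by edge.

    Assumes one edge per line and '#' comment lines (hypothetical layout).
    Returns the list of shard paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: part-00.tsv, part-01.tsv, ...
    paths = [out_dir / f"part-{i:02d}.tsv" for i in range(num_shards)]
    shards = [p.open("w") for p in paths]
    try:
        with open(src) as f:
            n = 0
            for line in f:
                # Skip comment and blank lines so shards hold only edges.
                if line.startswith("#") or not line.strip():
                    continue
                # Normalize the line ending, then assign round-robin.
                shards[n % num_shards].write(line.rstrip("\n") + "\n")
                n += 1
    finally:
        for s in shards:
            s.close()
    return paths
```

With shards laid out like this, running on `paths[:16]`, then `paths[16:]`, then all of `paths` mirrors the experiment described above. Round-robin keeps shard sizes within one edge of each other, so each half carries about half of the edges.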

This led me to suspect a memory problem, but I doubt that is the cause, because the cluster I am using has 60 GB of RAM per machine and GraphLab seems to be using only around 10 GB.

Please let me know if you have any insights into what may be causing this issue and how to fix it; I am out of ideas. I would be happy to attach logs or anything else that may be useful for figuring this out. I am running SSSP, connected components, and PageRank on an 8-machine Amazon EC2 cluster.

Thanks, Steve

Comments

User 1592 | 4/30/2015, 6:26:43 AM

Hi Steve, the PowerGraph code you are using is unfortunately deprecated. We suggest switching to our newer code base, GraphLab Create.

We have recently improved the performance of our newest code, GraphLab Create, so that it scales to 130 billion edges and 3.5 billion nodes on a single multicore machine. In this scenario, PageRank takes 10 minutes per iteration on an EC2 cr1.8xlarge instance.
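To put those figures in perspective, here is a quick back-of-the-envelope calculation using only the numbers quoted above. It assumes each PageRank iteration touches every edge once, which is a simplification rather than a statement about how GraphLab Create works internally.

```python
# Figures quoted in the post above.
edges = 130e9                 # 130 billion edges
nodes = 3.5e9                 # 3.5 billion nodes
seconds_per_iter = 10 * 60    # 10 minutes per PageRank iteration

# Assumption: one pass over every edge per iteration.
edges_per_second = edges / seconds_per_iter
avg_degree = edges / nodes

print(f"{edges_per_second / 1e6:.0f} million edges/s")  # 217 million edges/s
print(f"{avg_degree:.1f} edges per node")               # 37.1 edges per node
```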

The algorithms you mentioned are available in GraphLab Create as well; we recommend trying them out. Keep us posted.


User 1841 | 4/30/2015, 4:30:26 PM

Thanks for the response, Danny.

I didn't use GraphLab Create because I am particularly interested in distributed graph systems, and it was my understanding that GraphLab Create is free but runs on a single machine only. The distributed version, Dato Distributed, is not free. Is this correct?

Also, I am coming at this from an academic perspective rather than for a commercial application, so open source is important to me. Is my understanding correct that the underlying engine of GraphLab Create (the single-machine and out-of-core parts) is open source, but nothing else is?

Thanks again for your help.


User 1592 | 4/30/2015, 4:50:19 PM

That is all correct: GraphLab Create runs on a single multicore machine (apart from a few parallel-for style features). We are working on a distributed version. Stay tuned!