Async engine hanging

User 170 | 4/4/2014, 4:26:36 PM

I'm running PageRank and SSSP with the async engine as part of an experiment. I'm having an issue where some runs will hang at "INFO: fiber_async_consensus.cpp(end_done_critical_section:105): 22: Termination Possible" (where 22 can be any of the machines). This doesn't happen consistently: sometimes it'll do 5 runs on one graph okay, 3 runs on another graph okay, and then hang. Other times it will hang on the very first run. So far I've tried on both 16 and 32 workers, on EC2. Is there a deadlock? Did a worker lose connection (within GraphLab, I mean---I can always restart Hadoop without issues after killing the run)? Am I doing something stupid?

I'm using MPICH2 with the following command:

mpiexec -f ./machines -n 32 \
    "$GRAPHLABDIR"/release/toolkits/graph_analytics/pagerank \
    --tol ${tol} \
    --engine async \
    --format adjgps \
    --graph_opts ingress=random \
    --graph "$hdfspath"/user/ubuntu/input/${inputgraph} \
    --saveprefix "$hdfspath"/user/ubuntu/graphlab-output/

(adjgps is the adj input format without the number of edges, so each line is: vertex-id dst1 dst2 dst3 ...)
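
To make the format concrete, here's a minimal parser sketch (plain Python, not part of GraphLab; the function name is just for illustration):

```python
def parse_adjgps(lines):
    """Parse "adjgps" adjacency lines ("vertex-id dst1 dst2 dst3 ...",
    no edge count) into a dict mapping source id -> list of dest ids."""
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # first field is the source vertex; the rest are its out-neighbors
        graph[int(fields[0])] = [int(f) for f in fields[1:]]
    return graph
```

A vertex with no out-edges is just a line containing only its own id.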

Here are some complete logs (on 32 workers) demonstrating some ways it can hang: (PageRank, soc-LiveJournal1) (PageRank, com-Orkut) (SSSP, com-Orkut)

While the above are examples with the SNAP graphs, I've also observed this with the arabic-2005 graph from WebGraph.

EDIT: Didn't see the attach option... also attached the logs.


User 20 | 4/4/2014, 4:53:56 PM


Try dropping the --graph_opts "ingress=random", or try fewer machines.

In certain cases (which are unfortunately not entirely well understood; perhaps when the graph is too small for the number of machines, or something like that), we get a situation where:

1) There are only one or two vertices active in the graph (so maybe only one or two machines are active).

2) All the remaining machines have nothing left to do and thus begin the termination detection algorithm.

3) Those one or two vertices wake up more machines, which cancels the termination state (which is somewhat costly).

4) Steps 2-3 may cycle for a very very long time.
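
As a toy illustration of that cycle (my own sketch, not GraphLab's actual consensus code): model each round as an attempt by the idle machines to agree on termination, which gets cancelled whenever a still-active vertex wakes someone up. With a high wake probability, nearly every attempt is cancelled and termination takes a long time:

```python
import random

def simulate(num_machines, max_rounds, p_wake, seed=1):
    """Toy model of the attempt/cancel cycle in steps 2-3.
    Each round, the idle machines start termination detection; with
    probability p_wake an active vertex wakes a machine and cancels
    the attempt. Returns (rounds_until_termination, cancelled_attempts)."""
    rng = random.Random(seed)
    cancelled = 0
    for r in range(1, max_rounds + 1):
        if rng.random() < p_wake:
            cancelled += 1       # step 3: a wake-up cancels the attempt
        else:
            return r, cancelled  # step 2 finally succeeds everywhere
    return max_rounds, cancelled # never terminated within max_rounds
```

With p_wake = 0 termination succeeds on the first attempt; with p_wake = 1 every attempt is cancelled, which is the degenerate hang-like behavior described above.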

Also, it is certainly also possible that there is a (possibly distributed) deadlock somewhere in (3) that I am not aware of (make sure you have the most current pull from the repository).