Error while using asynchronous engine with multiple machines

User 1007 | 11/30/2014, 11:11:44 AM

Hi, I have recently been studying asynchronous graph processing.

I chose PageRank to evaluate this mechanism, and copied the existing PageRank source code from graphlab/toolkits/graph_analytics/pagerank.cpp.

I ran it in three configurations:

(1) Synchronous
(2) Asynchronous with factorized = true
(3) Asynchronous with factorized = false

However, errors occurred when I ran (2) and (3).
The test environment, execution command, and error message are as follows.

  • Platform: AWS EC2, m3.xlarge * 33 (1 master, 32 workers)
  • Test data: #Vertices = 10^7, #Edges = 10^8, with a power-law skew factor of 1.76
  • GraphLab version: 2.2 (PowerGraph)
  • Command: mpiexec -n 32 env CLASSPATH=`/home/ubuntu/hadoop/bin/hadoop classpath` \
    ~/graphlab/release/apps/pagerankapp/pagerankapp \
    --tol ${tol} \
    --engine async \
    --format adj \
    --graph_opts ingress=random \
    --graph "$hdfspath"/user/${USER}/input/${inputgraph} \
    --saveprefix "$hdfspath"/"$outputdir"

  • Error message:

      INFO: asyncconsistentengine.hpp(setendgamemode:805): Endgame mode
      0: 6FAILED!! Cannot Stop Eating!
      0: 46FAILED!! Cannot Stop Eating!
      0: 5FAILED!! Cannot Stop Eating!

I found a related discussion on the GraphLab forum: http://forum.graphlab.com/discussion/186/error-while-using-asynchronous-engine

Following the suggestions there, I tried:

(1) commenting out #define RPCDONOT … (Yucheng's suggestion); and
(2) dividing the input file into smaller parts.

After that, it sometimes works with 2 machines. With more machines (e.g. 4), GraphLab gets stuck.

Could you give me some advice about this? I would be grateful for any help you can provide.

Comments

User 6 | 12/7/2014, 7:23:55 AM

Hi, it sounds like a PowerGraph bug. We do not have a workaround for it, other than varying the number of machines.