User 11 | 4/7/2014, 7:38:13 PM
I run PageRank on 20 machines and occasionally get a connection error like the one below:
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
FATAL: dc_tcp_comm.cpp(connect:463): Failed to establish connection
FATAL: dc_tcp_comm.cpp(connect:463): Failed to establish connection
[ip-10-225-136-93:04577] *** Process received signal ***
[ip-10-225-136-93:04577] Signal: Aborted (6)
[ip-10-225-136-93:04577] Signal code: (-6)
[ip-10-225-136-93:04577] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbb0) [0x7f22d8a9ebb0]
[ip-10-225-136-93:04577] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f22d6623f77]
[ip-10-225-136-93:04577] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f22d66275e8]
My guess is that the error message is printed at line 463 of http://docs.graphlab.org/dctcpcomm8cppsource.html.
I have two questions:
1) Does the cluster size have an impact on this? For example, should this function make more than 10 attempts, or wait longer than 1 second between them, when handling medium or large clusters?
2) I was expecting to see the connection attempt logged 10 times because of this line: logstream(LOG_INFO) << "Trying to connect from " << curid << " -> " << target << " on port " << portnums[target] << "\n"; Do you have any idea why it does not appear in the output?
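To make question 1 concrete, here is a minimal sketch of the kind of configurable retry loop I have in mind. It is not GraphLab's actual code: connect_with_retry, try_connect, max_attempts and wait are all hypothetical names I made up for illustration, and the real connect function may be structured quite differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// Hypothetical helper, NOT part of GraphLab.
// Calls try_connect up to max_attempts times and sleeps for `wait`
// between failed attempts. Every attempt is logged to stderr.
// Returns the 1-based number of the attempt that succeeded,
// or 0 if every attempt failed.
int connect_with_retry(const std::function<bool()>& try_connect,
                       int max_attempts,
                       std::chrono::milliseconds wait) {
  assert(max_attempts > 0);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    std::fprintf(stderr, "Trying to connect (attempt %d of %d)\n",
                 attempt, max_attempts);
    if (try_connect()) {
      return attempt;  // connected, stop retrying
    }
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(wait);  // give the peer time to start listening
    }
  }
  return 0;  // caller decides whether this is fatal
}
```

With something like this, the current behaviour as I understand it would correspond to max_attempts = 10 and wait = 1 second, and a larger cluster could pass bigger values if slow-starting peers turn out to be the cause.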