dc_tcp_comm.cpp(connect:463): Failed to establish connection

User 11 | 4/7/2014, 7:38:13 PM


I run PageRank on 20 machines and occasionally get a connection error like the one below:

GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID:
Subnet Mask:
Will find first IPv4 non-loopback address matching the subnet
FATAL: dc_tcp_comm.cpp(connect:463): Failed to establish connection
FATAL: dc_tcp_comm.cpp(connect:463): Failed to establish connection
[ip-10-225-136-93:04577] * Process received signal *
[ip-10-225-136-93:04577] Signal: Aborted (6)
[ip-10-225-136-93:04577] Signal code: (-6)
[ip-10-225-136-93:04577] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbb0) [0x7f22d8a9ebb0]
[ip-10-225-136-93:04577] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f22d6623f77]
[ip-10-225-136-93:04577] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f22d66275e8]

My guess is that the error message is printed from line 463 of http://docs.graphlab.org/dctcpcomm8cppsource.html

I have two questions here:

1) Does the cluster size have an impact on this? For example, should this function make more than 10 attempts, or wait more than 1 second between attempts, when handling medium/large clusters?

2) I was expecting to see the connection attempt logged 10 times because of this line: " logstream(LOG_INFO) << "Trying to connect from " << curid << " -> " << target << " on port " << portnums[target] << "\n"; ". Do you have a clue why that message does not appear?



User 20 | 4/8/2014, 4:49:10 AM

The default log level is marginally above LOG_INFO, so LOG_INFO messages don't appear. What kind of network are you running on? On startup we establish all-pairs connections, and though we have not really encountered this before, it is certainly possible that certain network configurations / firewall rules may not like that very much.

User 11 | 4/8/2014, 5:01:19 AM

The connections work most of the time but occasionally fail, which hurts my automated scripts. I use Amazon AWS.

Do you suggest looking at something in particular?

User 20 | 4/8/2014, 5:02:41 AM

What kind of instances are you using? The cluster compute instances or the regular instances?

User 11 | 4/8/2014, 5:27:37 AM

I use c3.large instances. They are compute instances but have limited memory.

User 20 | 4/14/2014, 4:56:08 PM


Sorry for dropping the ball. I am not sure... Check firewall rules if any, and check the system logs to see if anything is triggering security problems. We have used Ubuntu 12.04 on cc2.8x machines successfully.


User 350 | 10/24/2014, 8:18:08 PM

I had the same problem and solved it by adding a delay, "timer::sleep(3);", after closing the socket on line 448 of the file http://docs.graphlab.org/dctcpcomm8cppsource.html and before creating a new socket on line 449 (i.e., add the delay between closing the old socket and creating the new one).
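
For anyone who wants to see the shape of that workaround without opening the source, below is a rough, self-contained sketch of a retry loop that pauses between cleaning up a failed attempt and starting the next one. It is not the GraphLab code: connect_with_retry, its parameters, and the callbacks are hypothetical names chosen for illustration, and std::this_thread::sleep_for stands in for timer::sleep.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of GraphLab.
// Calls `attempt` up to `max_attempts` times. After each failed attempt it
// calls `cleanup` (for example, closing the old socket), then pauses before
// the next attempt creates a new one.
// Returns the 1-based number of the attempt that succeeded, or 0 if all failed.
int connect_with_retry(const std::function<bool()>& attempt,
                       const std::function<void()>& cleanup,
                       int max_attempts,
                       std::chrono::milliseconds pause) {
  for (int i = 1; i <= max_attempts; ++i) {
    if (attempt()) return i;
    cleanup();
    // Pause only if another attempt will follow.
    if (i < max_attempts) std::this_thread::sleep_for(pause);
  }
  return 0;
}
```

In a real setting, `attempt` would create a socket and try to connect, and `cleanup` would close it; a pause of std::chrono::seconds(3) would correspond to the 3-second delay described above. Whether a longer pause or more attempts is the right fix for larger clusters is not something this sketch settles.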