mpirsync occasionally fails to sync GraphLab binaries

User 11 | 3/25/2014, 2:38:25 PM

Hi all,

I'm using the script "script/mpirsync" to sync the GraphLab source code across all cluster nodes. However, I occasionally get an error similar to:

    FATAL: dc_tcp_comm.cpp(accept_handler:532): MD5 mismatch.
    Process 18 has hash 8ba29a1a5c92a72126e482e0fa9591cf
    Process 1 has hash 27cdbc3f81faedfdd9bb0a697d8a0909
    GraphLab requires all machines to run exactly the same binary.
    [ip-10-67-145-98:08761] *** Process received signal ***
    [ip-10-67-145-98:08761] Signal: Aborted (6)
    [ip-10-67-145-98:08761] Signal code: (-6)
    [ip-10-67-145-98:08761] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbb0) [0x7f5b869c0bb0]
    [ip-10-67-145-98:08761] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f5b84545f77]
    [ip-10-67-145-98:08761] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f5b845495e8]
    [ip-10-67-145-98:08761] [ 3] /home/ubuntu/graphlab/release/apps/wgb/wgbpagerank(_ZN8graphlab7dc_impl11dc_tcp_comm14accept_handlerEv+0x9aa) [0x5ca03a]
    [ip-10-67-145-98:08761] [ 4] /home/ubuntu/graphlab/release/apps/wgb/wgbpagerank(_ZN8graphlab6thread6invokeEPv+0x30) [0x572de0]
    [ip-10-67-145-98:08761] [ 5] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f6e) [0x7f5b869b8f6e]
    [ip-10-67-145-98:08761] [ 6] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5b846099cd]
    [ip-10-67-145-98:08761] *** End of error message ***
    FATAL: dc_tcp_comm.cpp(accept_handler:532): MD5 mismatch.
    Process 23 has hash 8ba29a1a5c92a72126e482e0fa9591cf
    Process 1 has hash 27cdbc3f81faedfdd9bb0a697d8a0909
    GraphLab requires all machines to run exactly the same binary.
    [ip-10-225-132-31:07855] *** Process received signal ***
    [ip-10-225-132-31:07855] Signal: Aborted (6)
    [ip-10-225-132-31:07855] Signal code: (-6)

This is a medium-sized cluster of 30 slaves. After debugging the problem, it turns out that the following command in the mpirsync script OCCASIONALLY fails:

    mpiexec.openmpi -hostfile /home/ubuntu/machines -nolocal -pernode rsync -e 'ssh -v -o StrictHostKeyChecking=no -i /home/ubuntu/.ssh/id_rsa' -avz --exclude '*.make' --exclude '*.cmake' --exclude '*.internal' --exclude '*.includecache' --exclude '*.o' ip-10-109-134-226.ec2.internal:/home/ubuntu/graphlab/release/ /home/ubuntu/graphlab/release/

I enabled the "-v" option in the ssh command to trace the problem down. The error looks like:

    debug1: Connection established.
    debug1: identity file /home/ubuntu/.ssh/id_rsa type -1
    debug1: identity file /home/ubuntu/.ssh/id_rsa-cert type -1
    debug1: Enabling compatibility mode for protocol 2.0
    debug1: Local version string SSH-2.0-OpenSSH_6.2p2 Ubuntu-6ubuntu0.1
    ssh_exchange_identification: Connection closed by remote host
    debug1: Connecting to ip-10-109-134-226.ec2.internal [10.109.134.226] port 22.
    rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
    rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
    debug1: Connection established.

I tried running the command for each slave individually and it works, so the problem is not the ssh connection itself. In fact, running the same command did work once.

I wonder whether anyone else has encountered this issue before. The problem is fixed for my cluster, which is running now, but I'd like to find the root cause so that I don't hit the same problem again. Is it possible that a bad network connection causes this, since all 30 slaves try to connect to the master node at the same time? I know that GraphLab has been used on much larger clusters, but those machines might have had better network connectivity too. Any insights would be appreciated.
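For anyone who wants to check whether a sync really reached every node before launching a job, here is a minimal sketch. It assumes that comparing `md5sum` output of the binary on each node is a reasonable proxy for GraphLab's own hash check; the helper name `count_distinct_hashes` is made up for illustration.

```sh
# count_distinct_hashes: read `md5sum`-style lines ("<hash>  <file>") on stdin
# and print how many distinct hashes appear. A result of 1 means every
# node reported the same binary.
count_distinct_hashes() {
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Possible usage on the master (hostfile and binary path taken from the post above):
#   mpiexec.openmpi -hostfile /home/ubuntu/machines -pernode \
#       md5sum /home/ubuntu/graphlab/release/apps/wgb/wgbpagerank | count_distinct_hashes
```

Anything other than 1 would suggest that at least one node still has a stale binary and the rsync step should be rerun.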

Thanks, -Khaled

Comments

User 6 | 3/25/2014, 3:14:56 PM

Hi Khaled, It does sound like a networking problem, since rsync fails and not GraphLab. The only thing that comes to mind is the sshd_config setup. See for example: http://www.openssh.org/cgi-bin/man.cgi?query=sshdconfig

    MaxSessions: Specifies the maximum number of open sessions permitted per network connection. The default is 10.

Maybe, because MaxSessions defaults to 10, the connection sometimes fails for 30 machines but works for an individual connection.

Best,


User 11 | 3/25/2014, 3:26:52 PM

This is a good place to start tracing the issue.

Thank you Danny, -Khaled


User 6 | 3/25/2014, 5:11:16 PM

A second parameter which may be related is MaxStartups:

    MaxStartups: Specifies the maximum number of concurrent unauthenticated connections to the SSH daemon. Additional connections will be dropped until authentication succeeds or the LoginGraceTime expires for a connection. The default is 10:30:100.
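To see why the default could produce occasional rather than constant failures, here is a rough sketch of the arithmetic. It assumes the three-field "start:rate:full" form documented in the sshd_config man page: no drops below `start` unauthenticated connections, a drop probability of `rate` percent at `start`, rising linearly to 100 percent at `full`. The function name `drop_percent` is made up for illustration.

```sh
# drop_percent START RATE FULL N
# Approximate percentage chance that sshd refuses a new connection when
# N unauthenticated connections are already pending (integer arithmetic).
drop_percent() {
    if [ "$4" -lt "$1" ]; then
        echo 0                       # below "start": nothing is dropped
    elif [ "$4" -ge "$3" ]; then
        echo 100                     # at or above "full": everything is dropped
    else
        # linear interpolation between RATE (at START) and 100 (at FULL)
        echo $(( $2 + (100 - $2) * ($4 - $1) / ($3 - $1) ))
    fi
}

drop_percent 10 30 100 30   # prints 45
```

Under these assumptions, with 30 slaves connecting at nearly the same time, a new connection could be refused with a probability of roughly 45%, while a single connection is never refused.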


User 11 | 3/25/2014, 6:50:11 PM

Thank you Danny. MaxSessions alone did not solve the problem, but adding MaxStartups fixed it.
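For reference, a sketch of what the change might look like on the master node. The values in the comment are illustrative assumptions, not the exact settings used here, and `show_ssh_limits` is a hypothetical helper name.

```sh
# Illustrative /etc/ssh/sshd_config lines on the master (values are assumptions):
#   MaxSessions 30
#   MaxStartups 100:30:200
# sshd has to be restarted after editing, e.g. on Ubuntu: sudo service ssh restart

# show_ssh_limits [FILE]: print the active (uncommented) MaxSessions and
# MaxStartups lines from FILE (default: /etc/ssh/sshd_config).
show_ssh_limits() {
    grep -iE '^[[:space:]]*(MaxSessions|MaxStartups)[[:space:]]' "${1:-/etc/ssh/sshd_config}"
}
```

If the helper prints nothing, neither option is set explicitly and sshd falls back to its built-in defaults.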

-Khaled