User 11 | 3/25/2014, 2:38:25 PM
I'm using the script "script/mpirsync" to sync the GraphLab source code across all cluster nodes. However, I occasionally get an error similar to:
FATAL: dc_tcp_comm.cpp(accept_handler:532): MD5 mismatch.
Process 18 has hash 8ba29a1a5c92a72126e482e0fa9591cf
Process 1 has hash 27cdbc3f81faedfdd9bb0a697d8a0909
GraphLab requires all machines to run exactly the same binary.
[ip-10-67-145-98:08761] * Process received signal *
[ip-10-67-145-98:08761] Signal: Aborted (6)
[ip-10-67-145-98:08761] Signal code: (-6)
[ip-10-67-145-98:08761] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbb0) [0x7f5b869c0bb0]
[ip-10-67-145-98:08761] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f5b84545f77]
[ip-10-67-145-98:08761] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f5b845495e8]
[ip-10-67-145-98:08761] [ 3] /home/ubuntu/graphlab/release/apps/wgb/wgbpagerank(_ZN8graphlab7dc_impl11dc_tcp_comm14accept_handlerEv+0x9aa) [0x5ca03a]
[ip-10-67-145-98:08761] [ 4] /home/ubuntu/graphlab/release/apps/wgb/wgbpagerank(_ZN8graphlab6thread6invokeEPv+0x30) [0x572de0]
[ip-10-67-145-98:08761] [ 5] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f6e) [0x7f5b869b8f6e]
[ip-10-67-145-98:08761] [ 6] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5b846099cd]
[ip-10-67-145-98:08761] * End of error message *
FATAL: dc_tcp_comm.cpp(accept_handler:532): MD5 mismatch.
Process 23 has hash 8ba29a1a5c92a72126e482e0fa9591cf
Process 1 has hash 27cdbc3f81faedfdd9bb0a697d8a0909
GraphLab requires all machines to run exactly the same binary.
[ip-10-225-132-31:07855] * Process received signal *
[ip-10-225-132-31:07855] Signal: Aborted (6)
[ip-10-225-132-31:07855] Signal code: (-6)
This is a medium-sized cluster of 30 slaves. After debugging the problem, it turns out that the following command in the mpirsync script occasionally fails:
mpiexec.openmpi -hostfile /home/ubuntu/machines -nolocal -pernode rsync -e 'ssh -v -o StrictHostKeyChecking=no -i /home/ubuntu/.ssh/id_rsa' -avz --exclude '.make' --exclude '.cmake' --exclude '.internal' --exclude '.includecache' --exclude '*.o' ip-10-109-134-226.ec2.internal:/home/ubuntu/graphlab/release/ /home/ubuntu/graphlab/release/
I enabled the "-v" option in the ssh command to track the problem down. The error looks like:
debug1: Connection established.
debug1: identity file /home/ubuntu/.ssh/id_rsa type -1
debug1: identity file /home/ubuntu/.ssh/id_rsa-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.2p2 Ubuntu-6ubuntu0.1
ssh_exchange_identification: Connection closed by remote host
debug1: Connecting to ip-10-109-134-226.ec2.internal [10.109.134.226] port 22.
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
debug1: Connection established.
I tried running the command for each slave individually and it works, so the problem is not the ssh connection itself. In fact, running the same command again did work once.
I wonder whether anyone else has encountered this issue before. The problem is fixed on my cluster for now and it is running, but I would like to fix it properly so that I do not hit the same problem again. Is it possible that a bad network connection causes this, because all 30 slaves try to connect to the master node at the same time? I know that GraphLab has been used on much larger clusters, but those machines might have had better network connectivity too. I'm looking for any insights, please.
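One possible workaround, sketched in Python, would be to stagger and retry the per-node rsync. This assumes the failures come from too many slaves opening SSH connections to the master at the same moment (sshd limits concurrent unauthenticated connections through its MaxStartups setting, which may or may not be the cause here). The function name run_with_retry, its parameters, and the default delays are all hypothetical choices for illustration, not part of mpirsync.

```python
import random
import subprocess
import sys
import time


def run_with_retry(cmd, attempts=5, base_delay=1.0, max_jitter=5.0,
                   run=subprocess.call, sleep=time.sleep, rng=random.uniform):
    """Run cmd until it exits with 0, up to `attempts` times.

    Before every attempt, wait a random time in [0, max_jitter] seconds so
    that the nodes do not all connect to the master at once. After a failed
    attempt, wait base_delay * 2**(attempt - 1) seconds before trying again.
    Returns the number of attempts used; raises RuntimeError if all fail.
    The run/sleep/rng parameters exist only to make the function testable.
    """
    for attempt in range(1, attempts + 1):
        sleep(rng(0.0, max_jitter))      # random stagger before connecting
        if run(cmd) == 0:                # exit code 0 means rsync succeeded
            return attempt
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("command failed after %d attempts: %r" % (attempts, cmd))


# When invoked with arguments, treat them as the command to run,
# e.g. the rsync command line from the mpiexec call above.
if __name__ == "__main__" and len(sys.argv) > 1:
    run_with_retry(sys.argv[1:])
```

If this were saved on every node as, say, rsync_retry.py (a made-up file name), the mpiexec line would launch "python rsync_retry.py rsync ..." in place of the bare "rsync ...". A failed sync would then raise an error instead of silently leaving an old binary on that node, which is what seems to lead to the MD5 mismatch later.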