Fwd: ERROR: dc_tcp_comm.cpp(new_socket:378): Check failed: all_addrs[id]==addr [1439064402 == 184617482]

User 55 | 2/25/2014, 3:50:34 PM

Hello!

I have installed GraphLab on a cluster (8 nodes, IB FDR, OpenMPI 1.6.4). When I try to run it, I get the following error:

frolo@A11:~/graphlab/debug/apps/graphHPC-contest> mpirun -np 2 ./sssp2 --graph /local/rmat-24.txt --ncpus=2
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
ERROR: dc_tcp_comm.cpp(new_socket:378): Check failed: all_addrs[id]==addr [1439064402 == 184617482]
[A12:26890] *** Process received signal ***
[A12:26890] Signal: Aborted (6)
[A12:26890] Signal code: (-6)
[A12:26890] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x7f4650d277c0]
[A12:26890] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x7f464f036b55]
[A12:26890] [ 2] /lib64/libc.so.6(abort+0x181) [0x7f464f038131]
[A12:26890] [ 3] ./sssp2(_ZN8graphlab7dc_impl11dc_tcp_comm10new_socketEiP11sockaddr_int+0x2ae) [0x7fd8e0]
[A12:26890] [ 4] ./sssp2(_ZN8graphlab7dc_impl11dc_tcp_comm14accept_handlerEv+0x54a) [0x7fe1b0]
[A12:26890] [ 5] ./sssp2(_ZNK5boost4_mfi3mf0IvN8graphlab7dc_impl11dc_tcp_commEEclEPS4_+0x61) [0x801769]
[A12:26890] [ 6] ./sssp2(_ZN5boost3_bi5list1INS0_5valueIPN8graphlab7dc_impl11dc_tcp_commEEEEclINS_4_mfi3mf0IvS5_EENS0_5list0EEEvNS0_4typeIvEERT_RT0_i+0x41) [0x801cdf]
[A12:26890] [ 7] ./sssp2(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf0IvN8graphlab7dc_impl11dc_tcp_commEEENS0_5list1INS0_5valueIPS6_EEEEEclEv+0x36) [0x801d1c]
[A12:26890] [ 8] ./sssp2(_ZN5boost6detail8function26void_function_obj_invoker0INS_3_bi6bind_tIvNS_4_mfi3mf0IvN8graphlab7dc_impl11dc_tcp_commEEENS3_5list1INS3_5valueIPS9_EEEEEEvE6invokeERNS1_15function_bufferE+0x1d) [0x801d3b]
[A12:26890] [ 9] ./sssp2(_ZNK5boost9function0IvEclEv+0x69) [0x6dcf61]
[A12:26890] [10] ./sssp2(_ZN8graphlab6thread6invokeEPv+0x35) [0x799515]
[A12:26890] [11] /lib64/libpthread.so.0(+0x77b6) [0x7f4650d1f7b6]
[A12:26890] [12] /lib64/libc.so.6(clone+0x6d) [0x7f464f0db9cd]
[A12:26890] *** End of error message ***
[A11:29978] [[6151,0],0]-[[6151,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


mpirun noticed that process rank 1 with PID 26890 on node A12 exited on signal 6 (Aborted).

I will appreciate any help with this issue. Thank you.

Best, Alexander

P.S. Maybe it would be useful to introduce a special category for troubleshooting? ;-)

Comments

User 20 | 2/25/2014, 5:33:12 PM

Do you have multiple network links between your machines? Basically, the IP I was expecting to receive a connection from was not the same as the IP that actually connected to me.

If you have multiple networks, you may need to constrain the IP subnet by setting the GRAPHLAB_SUBNET_ID and, optionally, the GRAPHLAB_SUBNET_MASK environment variables.

For instance: mpirun -np 2 env GRAPHLAB_SUBNET_ID=192.168.0.0 ./sssp .... will constrain the system to use only IP addresses beginning with 192.168.
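Conceptually, the match is a simple bitwise test: an interface address is eligible when (addr & mask) == (subnet_id & mask), and the first non-loopback IPv4 match wins. Below is a minimal standalone sketch of that selection rule using standard getifaddrs(); this is illustrative code, not the actual dc_tcp_comm source, and error handling is omitted:

#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
  // Defaults mirror the log above: 0.0.0.0/0.0.0.0 matches any address,
  // so the first non-loopback IPv4 interface gets picked.
  const char* id_env   = getenv("GRAPHLAB_SUBNET_ID");
  const char* mask_env = getenv("GRAPHLAB_SUBNET_MASK");
  uint32_t subnet = id_env   ? ntohl(inet_addr(id_env))   : 0;
  uint32_t mask   = mask_env ? ntohl(inet_addr(mask_env)) : 0;

  ifaddrs* ifas = nullptr;
  if (getifaddrs(&ifas) != 0) return 1;
  for (ifaddrs* ifa = ifas; ifa; ifa = ifa->ifa_next) {
    if (!ifa->ifa_addr || ifa->ifa_addr->sa_family != AF_INET) continue;
    if (ifa->ifa_flags & IFF_LOOPBACK) continue;   // skip 127.0.0.1
    uint32_t addr = ntohl(((sockaddr_in*)ifa->ifa_addr)->sin_addr.s_addr);
    if ((addr & mask) == (subnet & mask)) {        // first match wins
      in_addr chosen;
      chosen.s_addr = htonl(addr);
      printf("selected %s on %s\n", inet_ntoa(chosen), ifa->ifa_name);
      break;
    }
  }
  freeifaddrs(ifas);
  return 0;
}

So on a box with both an Ethernet and an IPoIB interface, whichever one happens to enumerate first gets picked unless the subnet variables pin it down; that mismatch is exactly what the all_addrs[id]==addr check catches.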


User 55 | 2/25/2014, 6:23:32 PM

Thank you. I set these variables and the problem disappeared. What does GraphLab use TCP/IP for?


User 20 | 2/28/2014, 7:47:37 PM

We use TCP/IP for distributed communication. We use MPI to negotiate the initial setup, but then we create our own TCP/IP sockets for communication (since MPI doesn't handle asynchronous communication very well).
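Schematically, the bootstrap looks like the sketch below; this is illustrative code, not the actual dc_tcp_comm implementation, and error handling is omitted. Each process opens a TCP listener, MPI_Allgather distributes everyone's endpoint, and from then on the data path is plain sockets:

#include <mpi.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Each rank opens a listening TCP socket on an OS-assigned port.
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in sin = {};
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl(INADDR_ANY);
  bind(listener, (sockaddr*)&sin, sizeof(sin));
  listen(listener, nprocs);
  socklen_t len = sizeof(sin);
  getsockname(listener, (sockaddr*)&sin, &len);
  unsigned short my_port = ntohs(sin.sin_port);

  // MPI is only used here: gather everyone's port (and, in the real system,
  // the IP selected by the subnet rule) so ranks can dial each other directly.
  std::vector<unsigned short> ports(nprocs);
  MPI_Allgather(&my_port, 1, MPI_UNSIGNED_SHORT,
                ports.data(), 1, MPI_UNSIGNED_SHORT, MPI_COMM_WORLD);

  // ... from here on, ranks connect() to each other's ports and exchange
  // messages over raw TCP; MPI is no longer involved in the data path ...

  close(listener);
  MPI_Finalize();
  return 0;
}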


User 55 | 3/1/2014, 8:51:03 AM

Well, do you use MPI communication for the synchronous engine? Have you tested GraphLab on IPoIB?


User 6 | 3/1/2014, 5:37:29 PM

No to both. All communication is via TCP/IP. We did not test GraphLab on IPoIB, although this topic pops up occasionally.


User 20 | 3/1/2014, 9:43:09 PM

I think some people have tried it on IPoIB. My understanding is that it works.


User 55 | 3/2/2014, 2:04:00 PM

Ok, thank you.

Yet another question, for my understanding of the GraphLab philosophy (by the way, which name is more correct: GraphLab or PowerGraph?). Is using TCP/IP instead of MPI just a technical decision ('MPI does not do asynchronous communication very well'; could you please clarify that?), or a strategic choice for Ethernet-powered clusters (or cloud computing)? That is, are MLDM applications used mostly by people who are far from HPC and don't need high performance, or have I misunderstood something?

Just to draw an analogy with Pregel: there are also not many Pregel implementations specialized for clusters on IB or custom HPC systems (such as BG/Q). The single exception known to me is Mizan.

Thank you, Alex


User 20 | 3/4/2014, 7:10:34 PM

Hi,

Basically, the MPI one-sided API is quite difficult to control efficiently when handling very large numbers of small messages. We used to have an MPI communicator based on MPI_Isend, but it turns out that on regular Ethernet a direct TCP socket is faster. Also, we performed most of our testing on EC2, so Ethernet performance was more important. I do have hopes for a direct ibverbs implementation, though.
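To make the small-message point concrete, here is a hypothetical sender sketch (not GraphLab code): with MPI_Isend, every message carries its own MPI_Request, and its buffer must stay alive until completion, so per-message bookkeeping dominates at high message rates.

#include <mpi.h>
#include <vector>

// Hypothetical: send a batch of tiny messages to one peer via MPI_Isend.
void send_many_small(int dest, const std::vector<std::vector<char> >& msgs) {
  std::vector<MPI_Request> reqs(msgs.size());
  for (size_t i = 0; i < msgs.size(); ++i) {
    // One request per message, and msgs[i] must stay valid until Waitall:
    // the sender is forced to track per-message state.
    MPI_Isend((void*)msgs[i].data(), (int)msgs[i].size(), MPI_CHAR, dest,
              /*tag=*/0, MPI_COMM_WORLD, &reqs[i]);
  }
  MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}
// With a TCP socket the sender can instead append all of these to one byte
// stream and flush them with a single write(), amortizing the per-message cost.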

Yucheng


User 2080 | 7/13/2015, 8:09:03 AM

Hello, everyone. I have installed PowerGraph on my host and on a VMware VM. When I tried to run the PageRank demo, I got similar errors. I have set the GRAPHLAB_SUBNET_ID and GRAPHLAB_SUBNET_MASK environment variables in my .bashrc file, but it didn't seem to work. The error looks like this:

Here is my hostfile:
192.168.244.1
192.168.244.129

Any comments will help. Thank you so much :)