Something wrong with MPI

User 872 | 10/23/2014, 9:08:23 AM

Hello!

When I test MPI with the command: mpiexec -n 2 --hostfile ~/machines env GRAPHLAB_SUBNET_ID=10.228.91.114 /home/lyuwei/graphlab/release/demoapps/rpc/rpc_example1, I get the correct result as described in the tutorial.

However, when I run pagerank on my two-machine cluster with the command: mpiexec -n 2 -hostfile ~/machines env GRAPHLAB_SUBNET_ID=10.228.91.114 /home/lyuwei/graphlab/release/toolkits/graph_analytics/pagerank --powerlaw=100000, only one machine works, and I get the message: mpi_tools.hpp(init:63): MPI Support was not compiled.

So could anyone help me solve this problem? Thank you all very much!

Lyuwei

Comments

User 6 | 10/23/2014, 10:28:53 AM

Hi, it seems you compiled rpc_example and pagerank under different environments. Please do the following (from the GraphLab root directory):

make clean
./configure
cd release/toolkits/graph_analytics
make

When running configure you need to verify that MPI was detected properly. It should output a trace saying it was found (or not found).
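After the rebuild, one quick sanity check (a sketch, assuming the pagerank binary is dynamically linked; a statically linked build would show nothing here) is to confirm the binary actually pulls in an MPI library:

ldd /home/lyuwei/graphlab/release/toolkits/graph_analytics/pagerank | grep -i mpi

If MPI support was compiled in dynamically, this should list something like libmpi.so.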

best,


User 872 | 10/24/2014, 5:46:34 PM

Hi,

Thank you very much, Danny. After I reconfigured, MPI works now. However, during configure there was a message: Could NOT find ANT (missing: ANT_EXEC). Does that matter?

Thank you again,

Lyuwei


User 6 | 10/24/2014, 6:39:46 PM

This is optional (only needed for linking to Java); please ignore it.


User 872 | 10/26/2014, 7:15:58 AM

Hi, thank you, Danny. With your help I can run it smoothly on 2 nodes. However, when I add another node, a new problem occurs when I run "mpiexec -n 3 -hostfile ~/machines env GRAPHLAB_SUBNET_ID=10.228.91.114 /home/lyuwei/graphlab/release/toolkits/graph_analytics/pagerank --powerlaw=100000".

The output is:

GRAPHLAB_SUBNET_ID specified, but GRAPHLAB_SUBNET_MASK not specified. We will try to guess a subnet mask
Subnet ID: 10.228.91.114
Subnet Mask: 255.255.255.254
Will find first IPv4 non-loopback address matching the subnet
Unable to find a network matching the requested subnet
[hydra][[37516,1],0][btl_tcp_endpoint.c:458:mca_btl_tcp_endpoint_recv_blocking] recv(10) failed: Connection reset by peer (104)


Sorry! You were supposed to get help about:
    client handshake fail
from the file:
    help-mpi-btl-tcp.txt
But I couldn't find that topic in the file. Sorry!



mpiexec has exited due to process rank 2 with PID 37782 on node 10.228.91.116 exiting improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

  2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpiexec (as reported here).


Because my netmask is 255.255.255.0, which is different from what GraphLab guessed, I set GRAPHLAB_SUBNET_MASK to 255.255.255.0. However, the output is:

GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined. Using default values
Subnet ID: 10.228.91.114
Subnet Mask: 255.255.255.0
Will find first IPv4 non-loopback address matching the subnet
Unable to find a network matching the requested subnet
Unable to find a network matching the requested subnet
Unable to find a network matching the requested subnet


mpiexec has exited due to process rank 2 with PID 37956 on node 10.228.91.116 exiting improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

  2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpiexec (as reported here).


So, would you please tell me what the problem is and how to solve it?

Thank you very much!

Best, Lyuwei


User 6 | 10/26/2014, 7:59:51 AM

I suspect the meaningful error is: "Unable to find a network matching the requested subnet". You need to verify that all nodes can ssh into each other and that the subnet you specified is indeed accessible from all machines.
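To add some detail here: the log suggests GraphLab picks the first non-loopback interface whose address falls inside the requested subnet, i.e. (address AND mask) equals (subnet ID AND mask). With the guessed mask 255.255.255.254 and subnet ID 10.228.91.114, only 10.228.91.114 and 10.228.91.115 pass that test (10.228.91.116 AND 255.255.255.254 gives 10.228.91.116, not 10.228.91.114), which lines up with the first two nodes working and the third failing. A quick check to run from the master (a sketch, assuming iproute2 is installed on the slaves):

ssh 10.228.91.116 'ip -4 addr show'   # does an address in 10.228.91.0/24 appear, and is the interface UP?

Every node must show an address inside the subnet you pass to GraphLab.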


User 872 | 10/27/2014, 2:52:26 AM

Thank you. However, my servers can ssh into each other without a password, and my slaves have the same configuration (/etc/hosts, netmask, gateway). My master node has a virtual network card that it uses to communicate with the two slaves. Could the problem be related to that?


User 6 | 10/27/2014, 5:18:41 AM

Maybe the slaves are not able to access the IP of the virtual network card?


User 872 | 10/27/2014, 6:22:19 AM

Well, I am able to ping the virtual IP from my slaves. Is there something else I can check? Thank you


User 6 | 10/27/2014, 7:09:21 AM

On second thought, it seems you did not specify GRAPHLAB_SUBNET_MASK. What is your subnet mask? Is it 255.255.255.0?
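One more thing worth ruling out: depending on how the variable was set, a mask exported only in your local shell is not necessarily forwarded to the remote ranks, which would explain the "environment variables not defined" line in the second run. Passing both variables through env on the mpiexec line itself avoids that. A sketch (10.228.91.0 is used as the subnet ID here; under a /24 mask it selects the same subnet as 10.228.91.114, since only the masked bits are compared):

mpiexec -n 3 -hostfile ~/machines env GRAPHLAB_SUBNET_ID=10.228.91.0 GRAPHLAB_SUBNET_MASK=255.255.255.0 /home/lyuwei/graphlab/release/toolkits/graph_analytics/pagerank --powerlaw=100000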


User 872 | 10/27/2014, 10:12:57 AM

Yes. When I specify it, the result is the same as above. PowerGraph can run on my first two nodes (114, 115), but it cannot work on 116. In addition, the hardware and software of the two slave nodes are the same.


User 6 | 10/30/2014, 6:04:25 AM

You will need to check your configuration; something is different with node 116.


User 872 | 11/3/2014, 11:16:37 PM

Hi,

My nodes' configuration is as below:

<b class="Bold">hydra(master)</b> <b class="Bold">/etc/hosts</b> 127.0.0.1 localhost.localdomain localhost 10.228.91.114 hydra 140.114.91.16 hydra.hydra.sslab.cs.nthu.edu.tw hydra

::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters

10.228.91.115 zeus 10.228.91.116 adam

<b class="Bold">/etc/network/interfaces</b> auto lo iface lo inet loopback

The primary network interface

auto eth0 iface eth0 inet static address 140.114.91.16 netmask 255.255.255.0 network 140.114.91.0 broadcast 140.114.91.255 gateway 140.114.91.254 # dns-* options are implemented by the resolvconf package, if installed dns-nameservers 140.114.63.1 dns-search hydra.sslab.cs.nthu.edu.tw

auto eth0:0 iface eth0:0 inet static address 10.228.91.114 netmask 255.255.255.0 gateway 10.228.91.254 dns-nameservers 140.114.63.1

<b class="Bold">zeus(My normal slave)</b> <b class="Bold">/etc/hosts</b> 127.0.0.1 localhost.localdomain localhost 10.228.91.115 zeus

::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters

10.228.91.114 hydra 10.228.91.116 adam

<b class="Bold">/etc/network/interfaces</b> auto lo iface lo inet loopback

auto eth0 iface eth0 inet static address 10.228.91.115 netmask 255.255.255.0 gateway 10.228.91.254 dns-nameservers 140.114.63.1

<b class="Bold">adam(My abnormal slave)</b> <b class="Bold">/etc/hosts</b> 127.0.0.1 localhost.localdomain localhost 10.228.91.116 adam

The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters

10.228.91.114 hydra 10.228.91.115 zeus

<b class="Bold">/etc/network/interfaces</b> auto lo iface lo inet loopback

auto eth0 iface eth0 inet static address 10.228.91.116 gateway 10.228.91.254 netmask 255.255.255.0 dns-nameservers 140.114.63.1

<b class="Bold">So, that's all my configuration and my SSH is OK among the three. Would you help me find out the problem?</b>