GraphLab update fails due to inability to resolve host names in OpenStack cluster

User 295 | 5/7/2014, 11:41:42 PM

Dear Graphlab,

I am trying to use GraphLab with an OpenStack cluster. I managed to get the gl-ec2 script to "launch launchtest" with one slave, per the tutorial.

As instructed, I tried "gl-ec2 update", which compiles for a while and then fails with

"ssh could not resolve hostname <valid-openstack-instance-identifier>".   

This is part of a call to

mpiexec.openmpi -hostfile <localpath> -nolocal -pernode rsync -e 'ssh -o StrictHostKeyChecking=no -i <pathtokey> -avz --exclude ...

Due to the dynamic nature of our cloud, we don't have DNS entries for the created VMs. Am I right in thinking that this is the problem here? Is there a way to tell the script to use IP addresses instead?




User 6 | 5/8/2014, 5:22:04 AM

If this is indeed the call: mpiexec.openmpi -hostfile -nolocal -pernode rsync -e 'ssh -o StrictHostKeyChecking=no -i -avz --exclude ...

it is wrong, since you have to specify the hostfile name. The correct command should be mpiexec.openmpi -hostfile /path/to/ -nolocal -pernode rsync -e 'ssh -o StrictHostKeyChecking=no -i -avz --exclude ...

where the hostfile should list the names of the machines, one machine per line.
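
A minimal sketch of building such a hostfile with IP addresses instead of DNS names, on the assumption that Open MPI accepts an IP address wherever a host name would go. The two addresses and the default ./machines path are placeholders, not values from this thread:

```shell
#!/bin/sh
# Placeholder private IPs of the VMs; substitute the addresses of your instances.
ips="10.0.0.11 10.0.0.12"

# Hypothetical output path; override with HOSTFILE=/home/ubuntu/machines if needed.
hostfile="${HOSTFILE:-./machines}"

# Start from an empty file, then write one address per line.
: > "$hostfile"
for ip in $ips; do
  printf '%s\n' "$ip" >> "$hostfile"
done

# Show the result so it can be checked by eye.
cat "$hostfile"
```

Whether gl-ec2 regenerates this file on its own is something to check in the script before relying on a hand-written copy.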

User 295 | 5/9/2014, 11:47:27 PM

Dear Danny,

Thanks for your super quick response. Sorry about the post: I tried to anonymize files and used angle brackets, which were probably eaten by the forum software. The host file was /home/ubuntu/machines, as expected. I have made some progress: on running graphlab update, it adds the server to the list of known hosts and then we get a permission denied (publickey) message. Looking at the console and inspecting the script, it looks like it is the mpirsync call at the end that is failing. I assume this copies the build to all the slaves? So the master can't log into the slaves. I'm guessing the original GraphLab AMI had some sort of keys prearranged to communicate between instances, which my custom-built GraphLab image for our local stack does not. I am pursuing this now.



User 6 | 5/10/2014, 3:36:54 AM

Hi Bob, basically we assume all-to-all communication, namely that the GraphLab master can connect to the slaves via ssh and vice versa. This is a communication issue that has to be fixed before GraphLab can run. There are many examples on the web of how to do it. Here is one example:
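
One common approach, sketched under assumptions rather than taken from the GraphLab scripts: create a passphrase-less key pair on the master and append its public key to authorized_keys on every machine in the hostfile. The login user (ubuntu), the cloud-issued key path in CLOUD_KEY, and the function names are hypothetical. This covers only the master-to-slave direction; for all-to-all access the same steps would have to be repeated from each node, or one key pair shared across the image.

```shell
#!/bin/sh
# Assumed defaults; every one of these is a placeholder to adjust.
KEY="${KEY:-$HOME/.ssh/id_rsa}"                        # key pair the master will use
CLOUD_KEY="${CLOUD_KEY:-$HOME/.ssh/cloud_key.pem}"     # key that already grants access
HOSTFILE="${HOSTFILE:-/home/ubuntu/machines}"          # one host or IP per line
REMOTE_USER="${REMOTE_USER:-ubuntu}"

make_key() {
  # Generate a passphrase-less RSA key pair unless one already exists.
  [ -f "$KEY" ] || ssh-keygen -q -t rsa -N "" -f "$KEY"
}

push_key() {
  # Append the master's public key to authorized_keys on each listed host,
  # logging in with the cloud-issued key.
  while read -r host; do
    [ -n "$host" ] || continue   # skip blank lines
    cat "$KEY.pub" | ssh -i "$CLOUD_KEY" -o StrictHostKeyChecking=no \
      "$REMOTE_USER@$host" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
  done < "$HOSTFILE"
}

# Usage, run on the master once the hostfile is in place:
#   make_key && push_key
```

Afterwards, something like ssh -o BatchMode=yes ubuntu@<slave> true run from the master should return without asking for a password. If it still reports permission denied (publickey), the key did not reach that slave.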