srun: error:<nodeName>: slurm_send_recv_rc_msg_only_one to <nodeName>:<nodeName>: Connection refused

User 316 | 7/8/2014, 10:43:41 PM

Hi all - Just wondering if anyone has ever tried the GraphLab nmf algorithm with the SLURM queuing system. I'm getting "Connection refused" messages. Below is the job submission command:

<pre>srun -N 16 graphlab-2.1-release-openmpi/release/toolkits/collaborativefiltering/nmf --matrix smallnetflix --maxiter=5 --D=160 --ncpus=16 --predictions=output/output</pre>

See the attached log for errors.

Comments

User 6 | 7/9/2014, 5:49:05 AM

It seems like a problem in your MPI setup. Can you run a simple command like "ls -la" instead of the nmf command?
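For example, a minimal sanity check along these lines (just a sketch; adjust the node count to match your failing job):

<pre>srun -N 16 ls -la   # if this also fails, the problem is in the SLURM/MPI setup rather than in GraphLab</pre>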


User 316 | 7/9/2014, 3:12:32 PM

Yes, I can. It's a pretty big cluster and I'm the only one with the issue. The application runs fine on another cluster with Torque, which is why I suspect there's something wrong with the SLURM setup. So, question: has CMU ever tried GraphLab with SLURM?


User 6 | 7/9/2014, 5:05:26 PM

We never tried it with SLURM and are not aware of anyone who ever did.


User 316 | 7/9/2014, 5:07:56 PM

OK. I'll see what I can get out of LLNL about SLURM and the above-mentioned error. I'll get back to you as I learn more.


User 316 | 7/9/2014, 5:28:27 PM

Btw - I ran rpc_example1 with different numbers of nodes, and apparently there's no error up to 6 nodes. Anything above that throws the same error. I suspect I don't have enough sockets available on the nodes.
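For reference, the test was essentially a loop like this (the rpc_example1 path is just wherever my build put it):

<pre># run the GraphLab RPC example at increasing node counts to see where it breaks
for n in 2 4 6 7 8; do
  echo "== $n nodes =="
  srun -N $n ./rpc_example1
done</pre>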


User 511 | 7/23/2014, 9:13:58 PM

I ran into the same errors trying to launch a GraphLab job using srun on our cluster with Slurm 2.3.3 and OpenMPI 1.8. For me the errors were intermittent: even a two-node job would sometimes fail in this way.

I was able to work around it by using salloc to create a Slurm allocation for the job, and then using mpirun to launch the GraphLab job inside that allocation. For the example you gave, try something like: <pre>salloc -N 16 -- mpirun graphlab-2.1-release-openmpi/release/toolkits/collaborativefiltering/nmf --matrix smallnetflix --maxiter=5 --D=160 --ncpus=16 --predictions=output/output</pre>

(You may need to substitute your MPI package's equivalent command for mpirun.)
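If it still misbehaves, it may help to first confirm that mpirun really sees the whole allocation by launching a trivial command inside it, e.g.:

<pre>salloc -N 16 -- mpirun hostname</pre>

This should print hostnames from all of the allocated nodes (one line per launched process); if it doesn't, the problem is in the Slurm/Open MPI integration rather than in GraphLab.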