User 121 | 7/28/2014, 12:35:52 PM
I am running GraphLab on the DAS4 cluster. I have already consulted the GraphLab tutorial (linked below) for cluster deployment of GraphLab.
However, I cannot use the mpirun command on the DAS4 cluster, because the machine allocation cannot be hard-coded. Hence I had to use the prun command, which runs a parallel program (MPI, for example) on the compute nodes of the cluster.
I ran the following command:
prun -np 2 ~/graphlab/release/apps/concomp/concomp --input ~/graphlab/inputs/tsv1001000 --format tsv
It is supposed to run concomp (the connected components program) on 2 machines.
But it actually creates a cluster with only 1 instance, runs the program there, and then tries to run the program on another machine, where it fails because of some MPI_INIT errors.
I ensured that initparamfrommpi() was called when initializing the distributedcontrol object. However, the allgather() function from mpitools.hpp did not return the proper number of machines, which is 2.
Any hint on how to run GraphLab on this cluster successfully will be highly appreciated. Thank you.
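One check that might narrow this down is whether each process was actually started by an MPI launcher. The sketch below assumes the MPI in use is Open MPI (the ORTE messages in the log suggest so), whose mpirun exports OMPI_COMM_WORLD_SIZE to every rank. If the variable is missing, the process was most likely started without the launcher and MPI would treat it as a single-process job, which would match a machine count of 1. The helper names launcher_world_size and describe_launch are hypothetical and not part of GraphLab.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of GraphLab): returns the world size that the
// MPI launcher advertises through the environment, or 0 when the variable is
// missing or is not a positive integer.
// Assumption: Open MPI, whose mpirun sets OMPI_COMM_WORLD_SIZE for each rank.
int launcher_world_size(const char* var_name = "OMPI_COMM_WORLD_SIZE") {
    const char* value = std::getenv(var_name);
    if (value == nullptr || *value == '\0') return 0;
    char* end = nullptr;
    long parsed = std::strtol(value, &end, 10);
    if (*end != '\0' || parsed <= 0) return 0;
    return static_cast<int>(parsed);
}

// Hypothetical helper: turns the check into a one-line diagnostic message.
std::string describe_launch(int expected) {
    int size = launcher_world_size();
    if (size == 0) {
        return "no MPI launcher detected: process will likely start alone";
    }
    if (size != expected) {
        return "launcher reports " + std::to_string(size) +
               " processes, expected " + std::to_string(expected);
    }
    return "launcher reports the expected " + std::to_string(expected) +
           " processes";
}
```

Printing describe_launch(2) at the top of main(), before any MPI initialization, in a small test program started with the same prun command would show whether prun hands the processes to the MPI launcher or simply starts two independent copies.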
http://www.cs.vu.nl/das4/jobs.shtml
http://graphlab.org/projects/tutorials.html#cluster
// Console output starts
[uji300@fs0 graphlab]$ prun -np 2 ~/graphlab/release/apps/concomp/concomp --input ~/graphlab/inputs/tsv1001000 --format tsv
GRAPHLABSUBNETID/GRAPHLABSUBNETMASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dcinitfrommpi.cpp(initparamfrommpi:51): Will Listen on: 10.149.0.52:59435
INFO: distributedingressbase.hpp(finalize:259): Graph Finalize: constructing local graph
INFO: memoryinfo.cpp(logusage:90): Memory Info: Finished populating local graph. Heap: 12 MB Allocated: 10.7577 MB
INFO: distributedingressbase.hpp(finalize:304): Graph Finalize: finalizing local graph.
INFO: dynamiclocalgraph.hpp(finalize:339): Graph finalized in 0.000287 secs
Allocated: 11.639 MB
INFO: distributedingressbase.hpp(exchangeglobalinfo:521): Graph Finalize: exchange global statistics
INFO: distributedingressbase.hpp(exchangeglobalinfo:546): Graph info: nverts: 64 nedges: 2000 nreplicas: 64 replication factor: 1
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedingressbase.hpp(finalize:199): Finalizing Graph...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: memoryinfo.cpp(logusage:90): Memory Info: Before Engine Initialization Heap: 13 MB Allocated: 10.7763 MB
INFO: memoryinfo.cpp(logusage:90): Memory Info: After Engine Initialization Heap: 13 MB Allocated: 10.7793 MB
INFO: synchronousengine.hpp(start:1299): Iteration counter will only output every 5 seconds.
INFO: synchronousengine.hpp(start:1314): 0: Starting iteration: 0
INFO: synchronousengine.hpp(start:1363): Active vertices: 64
INFO: synchronous_engine.hpp(start:1412): Running Aggregators
Updates: 191
The Grid Engine ras component is not able to read the $PEHOSTFILE for the Grid Engine nodes. The $PEHOSTFILE environment variable shows the file is located at:
The following error is returned:
No such file or directory
[node054:01726] [[38006,0],0] ORTEERRORLOG: Error in file rasgridenginemodule.c at line 79
[node054:01726] * Process received signal *
[node054:01726] Signal: Segmentation fault (11)
[node054:01726] Signal code: Address not mapped (1)
[node054:01726] Failing at address: (nil)
[node054:01726] [ 0] /lib64/libpthread.so.0() [0x3e12a0f710]