Errors while running GraphLab on the cluster (DAS4)

User 121 | 7/28/2014, 12:35:52 PM

Hi all,

I am running GraphLab on the DAS4 cluster [1]. I have already consulted [2] for cluster deployment of GraphLab.

However, I cannot use the mpirun command on the DAS4 cluster, because machine allocation cannot be hard-coded there. Instead I have to use the prun command, which runs a parallel program (MPI, for example) on the compute nodes of the cluster.

I ran the following command:

prun -np 2 ~/graphlab/release/apps/concomp/concomp --input ~/graphlab/inputs/tsv1001000 --format tsv

It is supposed to run concomp (the connected components program) on 2 machines.

However, it actually creates a cluster with only 1 instance, runs the program there, and then tries to run the program on the other machine, where it fails with MPI_INIT errors.

I made sure that init_param_from_mpi() is called when the distributed_control object is initialized. However, the all_gather() function from mpi_tools.hpp did not return the proper number of machines, which should be 2.
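A quick way to check whether prun is actually launching two MPI ranks, independently of GraphLab, is a tiny probe script; this is only a diagnostic sketch and assumes mpi4py is available on the compute nodes:

# mpi_probe.py -- each rank prints its rank, the world size, and its host
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

Launched the same way as concomp (for example, prun -np 2 python mpi_probe.py), it should report a world size of 2; if every copy reports a world size of 1, the launcher is not handing the node allocation to MPI.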

Any hint on how to run GraphLab on this cluster successfully would be highly appreciated. Thank you.

[1] http://www.cs.vu.nl/das4/jobs.shtml
[2] http://graphlab.org/projects/tutorials.html#cluster

// Console output starts
[uji300@fs0 graphlab]$ prun -np 2 ~/graphlab/release/apps/concomp/concomp --input ~/graphlab/inputs/tsv1001000 --format tsv
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined. Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc_init_from_mpi.cpp(init_param_from_mpi:51): Will Listen on: 10.149.0.52:59435
INFO: distributed_ingress_base.hpp(finalize:259): Graph Finalize: constructing local graph
INFO: memory_info.cpp(log_usage:90): Memory Info: Finished populating local graph.
  Heap: 12 MB  Allocated: 10.7577 MB
INFO: distributed_ingress_base.hpp(finalize:304): Graph Finalize: finalizing local graph.
INFO: dynamic_local_graph.hpp(finalize:339): Graph finalized in 0.000287 secs
  Allocated: 11.639 MB
#vertices: 64  #edges: 2000
INFO: distributed_ingress_base.hpp(exchange_global_info:521): Graph Finalize: exchange global statistics
INFO: distributed_ingress_base.hpp(exchange_global_info:546): Graph info:
  nverts: 64
  nedges: 2000
  nreplicas: 64
  replication factor: 1
INFO: omni_engine.hpp(omni_engine:191): Using the Synchronous engine.
INFO: distributed_graph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributed_ingress_base.hpp(finalize:199): Finalizing Graph...
INFO: distributed_ingress_base.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: memory_info.cpp(log_usage:90): Memory Info: Before Engine Initialization
  Heap: 13 MB  Allocated: 10.7763 MB
INFO: memory_info.cpp(log_usage:90): Memory Info: After Engine Initialization
  Heap: 13 MB  Allocated: 10.7793 MB
INFO: synchronous_engine.hpp(start:1299): Iteration counter will only output every 5 seconds.
INFO: synchronous_engine.hpp(start:1314): 0: Starting iteration: 0
INFO: synchronous_engine.hpp(start:1363): Active vertices: 64
INFO: synchronous_engine.hpp(start:1412): Running Aggregators
Updates: 191


The Grid Engine ras component is not able to read the $PE_HOSTFILE for the Grid Engine nodes. The $PE_HOSTFILE environment variable shows the file is located at:

/cm/local/apps/sge/current/spool/node052/active_jobs/4557852.1/pe_hostfile

The following error is returned:

No such file or directory


[node054:01726] [[38006,0],0] ORTE_ERROR_LOG: Error in file ras_gridengine_module.c at line 79
[node054:01726] *** Process received signal ***
[node054:01726] Signal: Segmentation fault (11)
[node054:01726] Signal code: Address not mapped (1)
[node054:01726] Failing at address: (nil)
[node054:01726] [ 0] /lib64/libpthread.so.0() [0x3e12a0f710]
[node054:01726] […]


FYI: If you are using Anaconda and having problems with NumPy

Hello everyone,

I ran into an issue a few days ago and found out something that may be affecting many GraphLab users who run it with Anaconda on Windows: NumPy was unable to load, and consequently so was everything that requires it (Matplotlib, etc.).

It turns out that the current NumPy build (1.10.4) for Windows is problematic (more info here).

Possible workarounds are downgrading to build 1.10.1 or forcing an upgrade to 1.11.0 if your dependencies allow it. Downgrading was easy for me using:

conda install numpy=1.10.1
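A quick way to confirm which build ended up active (a minimal check, nothing GraphLab-specific assumed):

import numpy
print(numpy.__version__)  # expect 1.10.1 after the downgrade, or 1.11.0 after the upgrade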

Thanks for your attention!

Rafael

[…]

import os
os.environ["GRAPHLAB_DISABLE_LAMBDA_SHM"] = "1"
os.environ["GRAPHLAB_FORCE_IPC_TO_TCP_FALLBACK"] = "1"
import graphlab as gl

3. Test out your lambda worker code in this environment. If it works, then you can make the above configuration permanent by running:

gl.sys_util.write_config_file_value("GRAPHLAB_DISABLE_LAMBDA_SHM", "1")
gl.sys_util.write_config_file_value("GRAPHLAB_FORCE_IPC_TO_TCP_FALLBACK", "1")

Note that this can be undone by setting these to "0" instead of "1", or by editing the file given by gl.sys_util.get_config_file().
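For example, to switch back to the defaults later using the same helpers:

import graphlab as gl

# revert both settings
gl.sys_util.write_config_file_value("GRAPHLAB_DISABLE_LAMBDA_SHM", "0")
gl.sys_util.write_config_file_value("GRAPHLAB_FORCE_IPC_TO_TCP_FALLBACK", "0")

# location of the config file, if you prefer to edit it by hand
print(gl.sys_util.get_config_file())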

4. If the lambda workers do not work after trying step 1, then there are two things we would very much appreciate you doing to help us track down the issue.

4.1. First, execute the following code in a clean Python shell where you have not yet imported GraphLab Create. At the end, it prints out the path to a zip file that, if you could send it to us, will help us diagnose the issue. Please create a support ticket […]

Comments

User 121 | 7/28/2014, 3:15:25 PM

The problem is resolved. I had to use the -sge-script option, which supplies the Sun Grid Engine script for running the jobs. Now it creates the cluster with the proper number of instances.
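For anyone hitting the same issue, the working invocation looks roughly like the following. The OpenMPI helper script path is the usual DAS4 convention and is an assumption on my part, not something quoted from this thread:

prun -sge-script $PRUN_ETC/prun-openmpi -np 2 ~/graphlab/release/apps/concomp/concomp --input ~/graphlab/inputs/tsv1001000 --format tsv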