[PowerGraph] The program crashes when running on the cluster

User 2150 | 8/8/2015, 10:27:44 AM

Hi,

I have written a program in PowerGraph that runs well on a single machine. However, the program crashes when running on a cluster.

The command I used is:

mpiexec -n 2 --pernode --hostfile ~/machines ./GI

The error message follows:

GRAPHLABSUBNETID/GRAPHLABSUBNETMASK environment variables not defined. Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc.cpp(init:573): Cluster of 2 instances created.
INFO: distributedgraph.hpp(setingressmethod:3214): Automatically determine ingress method: grid
INFO: distributedgraph.hpp(loadfromposixfs:2199): Loading graph from file: ./graph.txt
INFO: distributedingressbase.hpp(finalize:199): Finalizing Graph...
[compute-0-2:28696] * Process received signal *
[compute-0-2:28696] Signal: Segmentation fault (11)
[compute-0-2:28696] Signal code: Address not mapped (1)
[compute-0-2:28696] Failing at address: 0x2175780
[compute-0-2:28696] [ 0] /lib64/libpthread.so.0[0x33eac0f4a0]
[compute-0-2:28696] [ 1] ./GI(ZN13PatternVertexISsEaSERKS0+0x14)[0x718f4a]
[compute-0-2:28696] [ 2] ./GI(ZNSt11__copymoveILb0ELb0ESt26randomaccessiteratortagE8__copymIPK13PatternVertexISsEPS4EET0TS9S8+0x53)[0x71904e]
[compute-0-2:28696] [ 3] ./GI(ZSt13copymoveaILb0EPK13PatternVertexISsEPS1ET1T0S6S5+0x2f)[0x702f54]
[compute-0-2:28696] [ 4] ./GI(ZSt14copymovea2ILb0EN9gnu_cxx17normaliteratorIPK13PatternVertexISsESt6vectorIS3SaIS3EEEENS1IPS3S8EEET1T0SDSC+0x4f)[0x6eb16d]
[compute-0-2:28696] [ 5] ./GI(ZSt4copyIN9__gnucxx17normaliteratorIPK13PatternVertexISsESt6vectorIS3SaIS3EEEENS1IPS3S8EEET0TSDSC+0x3f)[0x6d4d4a]
[compute-0-2:28696] [ 6] ./GI(ZNSt6vectorI13PatternVertexISsESaIS1EEaSERKS3+0x26c)[0x6c3168]
[compute-0-2:28696] [ 7] ./GI(ZN7PatternISsEaSERKS0+0x23)[0x6b1eff]
[compute-0-2:28696] [ 8] ./GI(ZN15DataGraphVertexISsEaSERKS0+0x3e)[0x6b1f9c]
[compute-0-2:28696] [ 9] ./GI(ZN8graphlab19dynamiclocalgraphI15DataGraphVertexISsENS5emptyEE10addvertexEmRKS2+0xc3)[0x6e1c7f]
[compute-0-2:28696] [10] ./GI(ZN8graphlab24distributedingressbaseI15DataGraphVertexISsENS5emptyEE8finalizeEv+0x10fe)[0x6ce21e]
[compute-0-2:28696] [11] ./GI(ZN8graphlab17distributedgraphI15DataGraphVertexISsENS5emptyEE8finalizeEv+0x18e)[0x6bc124]
[compute-0-2:28696] [12] ./GI(ZN8graphlab18synchronousengineI12PatternMatchISsEEC2ERNS19distributedcontrolERNS17distributedgraphI15DataGraphVertexISsENS5emptyEEERKNS16graphlaboptionsE+0x10d6)[0x6ab8b0]
[compute-0-2:28696] [13] ./GI(ZN8graphlab11omniengineI12PatternMatchISsEEC1ERNS19distributedcontrolERNS17distributedgraphI15DataGraphVertexISsENS5emptyEEERKSsRKNS16graphlaboptionsE+0x1fc)[0x6a2000]
[compute-0-2:28696] [14] ./GI(main+0x16f)[0x6904e2]
[compute-0-2:28696] [15] /lib64/libc.so.6(libcstartmain+0xfd)[0x33ea41ecdd]
[compute-0-2:28696] [16] ./GI[0x68fed9]
[compute-0-2:28696] * End of error message *
mpiexec noticed that process rank 1 with PID 0 on node compute-0-2 exited on signal 11 (Segmentation fault).

From what I found by searching on this topic, the problem appears under MPI, and the cause might be related to serialization. However, I have implemented save() and load() for every class, and I also tried graphlab::IS_POD_TYPE without save() and load().

Another hint: the member variables in my classes have types such as int, enum, string, and vector. The enum or string might not be serializable, but would graphlab::IS_POD_TYPE solve this?

I really cannot figure out what is wrong in my code. I would appreciate any advice. Thanks in advance.

Comments

User 1592 | 8/9/2015, 6:12:11 AM

Hi, you will need to debug your program :-)

IS_POD_TYPE can be used only for simple fixed-size types, such as int or a struct of ints. For dynamically sized types such as vectors and strings, you will need to implement your own load() and save().
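To illustrate the fixed-size restriction, here is a minimal sketch in standard C++ (no GraphLab headers). It assumes that "simple type" means roughly what the standard calls trivially copyable, which is an approximation and not GraphLab's own definition. FixedVertex, DynamicVertex, and can_be_byte_copied are hypothetical names made up for this example.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical vertex type with only fixed-size members.
// Copying its raw bytes reproduces the whole object.
struct FixedVertex {
    int id;
    double rank;
};

// Hypothetical vertex type with dynamically sized members.
// std::string and std::vector keep their contents behind pointers, so a raw
// byte copy sent to another machine would carry addresses that are not valid
// in the receiving process.
struct DynamicVertex {
    int id;
    std::string label;
    std::vector<int> neighbors;
};

// Assumption: trivially copyable is used here as a stand-in for "safe to
// serialize as plain bytes".
template <typename T>
constexpr bool can_be_byte_copied() {
    return std::is_trivially_copyable<T>::value;
}

static_assert(can_be_byte_copied<FixedVertex>(),
              "a struct of ints and doubles can be copied byte by byte");
static_assert(!can_be_byte_copied<DynamicVertex>(),
              "a struct holding string or vector needs explicit save/load");
```

A check like this in your own code can catch a class that was marked as a plain-data type by mistake, since the compiler rejects it before the program ever reaches the cluster.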

My suggestion is to compile and run in debug mode, add traces, and use your MPI implementation's debugging options (for Open MPI, for example: https://www.open-mpi.org/faq/?category=debugging) to see where the code fails.


User 2150 | 8/9/2015, 12:09:10 PM

Hi Danny,

Thanks for your reply. Do you think the bug is due to serialization?

The documentation says: "The following template containers are serializable as long as the contained types are all serializable. This can be recursively applied." So does vector fall into this category?
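The recursive rule quoted above can be sketched in plain C++ as follows. This is not the GraphLab archive API: write_value, read_value, and round_trip are hypothetical stand-ins written over standard streams, only to show how a container becomes serializable once its element type is.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Base case: a fixed-size int is written as its raw bytes.
inline void write_value(std::ostream& out, int v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
}
inline void read_value(std::istream& in, int& v) {
    in.read(reinterpret_cast<char*>(&v), sizeof v);
}

// Base case: a string is written as a length followed by its characters.
inline void write_value(std::ostream& out, const std::string& s) {
    std::uint64_t n = s.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(s.data(), static_cast<std::streamsize>(n));
}
inline void read_value(std::istream& in, std::string& s) {
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    s.resize(static_cast<std::size_t>(n));  // allocate before reading
    if (n > 0) in.read(&s[0], static_cast<std::streamsize>(n));
}

// Recursive case: a vector<T> works for any T that has write_value and
// read_value, including another vector.
template <typename T>
void write_value(std::ostream& out, const std::vector<T>& v) {
    std::uint64_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const T& e : v) write_value(out, e);
}
template <typename T>
void read_value(std::istream& in, std::vector<T>& v) {
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    v.resize(static_cast<std::size_t>(n));  // allocate before reading
    for (T& e : v) read_value(in, e);
}

// Writes a value to an in-memory buffer and reads it back.
template <typename T>
T round_trip(const T& value) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    write_value(buf, value);
    T copy{};
    read_value(buf, copy);
    return copy;
}
```

With these stand-ins, a std::vector<std::vector<std::string>> round-trips without any extra code, because each level defers to the level below it.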

As for string, the "Page Rank" example in the tutorial also uses a string variable, "pagename". I tried that example and it runs well on the cluster. So is string serializable or not?

Is enum serializable?

If not, should I use "Out of Place Serialization"? Could you give me any example?

Thanks again.


User 1592 | 8/9/2015, 1:43:48 PM

Everything is serializable as long as you provide load() and save() functions that save it and load it, allocating the required memory on load.
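As a rough illustration of that pattern, here is a self-contained sketch. OutArchive and InArchive are stand-ins written for this example and are not GraphLab's archive classes. The PatternVertex and DataGraphVertex layouts are hypothetical simplifications (the real classes in the question are not shown), and Role is an invented enum. The enum is stored as a fixed-width integer, and load() reads the fields in exactly the order save() wrote them.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Stand-in output archive (assumption, not the GraphLab API).
struct OutArchive {
    std::ostringstream buf;
    void put_u64(std::uint64_t v) { buf.write(reinterpret_cast<const char*>(&v), sizeof v); }
    void put_i32(std::int32_t v) { buf.write(reinterpret_cast<const char*>(&v), sizeof v); }
    void put_string(const std::string& s) {
        put_u64(s.size());
        buf.write(s.data(), static_cast<std::streamsize>(s.size()));
    }
};

// Stand-in input archive (assumption, not the GraphLab API).
struct InArchive {
    std::istringstream buf;
    explicit InArchive(const std::string& bytes) : buf(bytes) {}
    std::uint64_t get_u64() { std::uint64_t v = 0; buf.read(reinterpret_cast<char*>(&v), sizeof v); return v; }
    std::int32_t get_i32() { std::int32_t v = 0; buf.read(reinterpret_cast<char*>(&v), sizeof v); return v; }
    std::string get_string() {
        std::string s(static_cast<std::size_t>(get_u64()), '\0');  // allocate first
        if (!s.empty()) buf.read(&s[0], static_cast<std::streamsize>(s.size()));
        return s;
    }
};

enum class Role { Source, Sink };  // hypothetical enum member

struct PatternVertex {  // hypothetical layout
    int id = 0;
    Role role = Role::Source;
    std::string label;

    void save(OutArchive& out) const {
        out.put_i32(id);
        out.put_i32(static_cast<std::int32_t>(role));  // enum as integer
        out.put_string(label);
    }
    void load(InArchive& in) {  // same field order as save()
        id = in.get_i32();
        role = static_cast<Role>(in.get_i32());
        label = in.get_string();
    }
};

struct DataGraphVertex {  // hypothetical layout
    std::string name;
    std::vector<PatternVertex> pattern;

    void save(OutArchive& out) const {
        out.put_string(name);
        out.put_u64(pattern.size());  // element count first
        for (const PatternVertex& v : pattern) v.save(out);
    }
    void load(InArchive& in) {
        name = in.get_string();
        pattern.clear();
        pattern.resize(static_cast<std::size_t>(in.get_u64()));  // allocate, then fill
        for (PatternVertex& v : pattern) v.load(in);
    }
};

// Saves a vertex to bytes and loads it into a fresh object.
inline DataGraphVertex round_trip_vertex(const DataGraphVertex& v) {
    OutArchive out;
    v.save(out);
    InArchive in(out.buf.str());
    DataGraphVertex copy;
    copy.load(in);
    return copy;
}
```

A round-trip check of this kind can be run on a single machine, which makes it a cheap way to test save() and load() before involving MPI.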