GraphLab server crashes and I get a communication failure error

User 690 | 12/30/2014, 1:24:58 AM

Hi everybody, the GraphLab server is crashing and I am getting a Communication Failure error. I would love any feedback on how to debug this code.

Thanks, Sunil.

code and input https://gist.github.com/cd933c7693cc3ce230a5

server log https://gist.github.com/bb6ff78b0010d57b8960

command line output

[INFO] Start server at: ipc:///tmp/graphlabserver-21851 - Server binary: /home/sunil/graphlab/lib/python2.7/site-packages/graphlab/unityserver - Server log: /tmp/graphlabserver1419901688.log
[INFO] GraphLab Server Version: 1.2
PROGRESS: Finished parsing file /home/hdfs/sunil/ctr/sample.100.csv
PROGRESS: Parsing completed. Parsed 99 lines in 0.037826 secs.
allfeatures : ('CreativeId', 'UserCountry', 'EAId')
modelcolumns : ('CreativeId', 'UserCountry', 'EAId', 'OS', 'PublisherId', 'AppId', 'ExchangeId', 'sinH', 'cosH', 'Click')
Unable to reach server for 3 consecutive pings. Server is considered dead. Please exit and restart.
Traceback (most recent call last):
  File "ctr.py", line 94, in <module>
    models = {tuple(catfeatures):createmodel(data,catfeatures,contfeatures,'Click') for catfeatures in categoricalfeaturepreference}
  File "ctr.py", line 94, in <dictcomp>
    models = {tuple(catfeatures):createmodel(data,catfeatures,contfeatures,'Click') for catfeatures in categoricalfeaturepreference}
  File "ctr.py", line 45, in createmodel
    model = gl.logisticclassifier.create(modeldata,target=target,features=features,solver='auto',maxiterations=12)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/classifier/logisticclassifier.py", line 284, in create
    classweights = classweights)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/supervisedlearning.py", line 335, in create
    ret = graphlab.toolkits.main.run("supervisedlearningtrain", options, verbose=verbose)
  File "/home/sunil/graphlab/lib/python2.7/site-packages/graphlab/toolkits/main.py", line 57, in run
    (success, message, params) = unity.runtoolkit(toolkitname, options)
  File "cyunity.pyx", line 70, in graphlab.cython.cyunity.UnityGlobalProxy.runtoolkit
  File "cyunity.pyx", line 74, in graphlab.cython.cyunity.UnityGlobalProxy.runtoolkit
RuntimeError: Communication Failure: 113.
[INFO] Stopping the server connection.
Unable to reach server for 4 consecutive pings. Server is considered dead. Please exit and restart.
[WARNING] <type 'exceptions.IOError'>
[WARNING] <type 'exceptions.ValueError'>

Comments

User 690 | 12/30/2014, 3:00:24 AM

Can somebody suggest how I can attach a debugger to the GraphLab process? Thanks, Sunil.


User 690 | 12/30/2014, 5:14:59 AM

The stack trace I got on attaching to the unity_server is here

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7f5ac080f700 (LWP 14675)]
0x000000000166e3fd in graphlab::supervised::logistic_regression::train() ()
(gdb) bt
#0 0x000000000166e3fd in graphlab::supervised::logistic_regression::train() ()
#1 0x00000000016b482a in graphlab::supervised::train(graphlab::toolkitfunctioninvocation&) ()
#2 0x00000000013aa16c in std::Functionhandler<graphlab::toolkitfunctionresponsetype ()(graphlab::toolkitfunctioninvocation&), graphlab::toolkitfunctionresponsetype (*)(graphlab::toolkitfunctioninvocation&)>::Minvoke(std::Anydata const&, graphlab::toolkitfunctioninvocation&) ()
#3 0x00000000011e2963 in graphlab::unityglobal::runtoolkit ()
#4 0x0000000000ab63e3 in ZN6cppipc13dispatchimplIN8graphlab17unityglobalbaseEMS2FNS130toolkitfunctionresponsetypeESsRSt3mapISsN5boost7variantINS56detail7variant14recursiveflagINS113flexibletypeEEEJSt10sharedptrINS117unitysgraphbaseEENS111dataframetESCINS110modelbaseEESCINS117unitysframebaseEESCINS117unitysarraybaseEES4ISsNS518recursivevariantESt4lessISsESaISt4pairIKSsSMEEESt6vectorISMSaISMEENS517recursivewrapperINS121functionclosureinfoEEEEEESOSaISPISQS10EEEEE7executeEPvPNS11commserverERNS18iarchiveERNS18oarchiveE ()
#5 0x00000000010dab1e in cppipc::commserver::callback(libfault::zmqmsgvector&, libfault::zmqmsg_vector&) ()
#6 0x00000000010e8f0a in libfault::asyncreplysocket::processjob(libfault::asyncreplysocket::threaddata*, libfault::zmqmsgvector*) ()
#7 0x00000000010ea218 in libfault::asyncreplysocket::threadfunction(libfault::asyncreplysocket::threaddata*) ()
#8 0x0000000001303bfa in thread_proxy ()
#9 0x00000035102079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000350fee886d in clone () from /lib64/libc.so.6
(gdb)


User 14 | 12/30/2014, 5:30:45 AM

There seems to be an arithmetic error triggering SIGFPE. We will look into this problem. Thank you for providing such useful information.


User 690 | 12/30/2014, 5:43:09 AM

What intrigues me is that code that was working yesterday is now also crashing, with no change in code or data. The only thing that could have happened is the license expiring. Is that a possibility? Can you please clarify? Thanks, Sunil.


User 14 | 12/30/2014, 5:51:26 AM

The license is not going to expire. What may change is the input data, random seed, or some particular combination of both which causes a divide by zero.


User 690 | 12/30/2014, 7:02:38 AM

Thanks for the info about the random-seed. I will try to set the random-seed and see if I can get around the problem. I do know that the input data is the same.


User 940 | 12/31/2014, 6:41:23 PM

Hi Sunil,

We are able to reproduce the issue and are currently working on debugging it. We will let you know as soon as we find out what it is. In the meantime, could you use boostedtreesclassifier?

Thanks again for providing us with extremely useful information.

-Piotr


User 940 | 12/31/2014, 6:57:32 PM

Hi Sunil,

The problem has to do with the fact that your data sample only has one target class. This breaks some internal assumptions in our code. We will handle this error case in our next release.

For now, you can check if you have only one target class by using <a href="http://graphlab.com/products/create/docs/generated/graphlab.SArray.unique.html#graphlab.SArray.unique">SArray.unique()</a>.
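A minimal sketch of that check, written against plain Python lists so it runs without GraphLab installed. The helper names `distinct_classes` and `has_multiple_classes` are hypothetical. With an SFrame, the equivalent would presumably be `len(data['Click'].unique()) >= 2`, where `data` and `'Click'` are the names from the traceback above; that exact expression is an assumption, not something confirmed in this thread.

```python
def distinct_classes(target_values):
    """Return the sorted distinct labels found in a target column."""
    return sorted(set(target_values))


def has_multiple_classes(target_values):
    """True when the target column holds at least two distinct labels."""
    return len(set(target_values)) >= 2


# Hypothetical target columns standing in for data['Click'].
only_zeros = [0, 0, 0, 0]
mixed = [0, 1, 0, 0]

print(distinct_classes(only_zeros), has_multiple_classes(only_zeros))
print(distinct_classes(mixed), has_multiple_classes(mixed))
```

Running a check like this before training would let the script skip or report a single-class sample instead of sending it to the server.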

Does this solve your problem?

Cheers! -Piotr


User 690 | 1/2/2015, 2:04:56 AM

Hi Piotr, While the data sample I gave you has only one target class, the data I was actually training on has both classes. However, all the class-0 rows are grouped together at the beginning of the dataset and all the class-1 rows are grouped together at the end, and I encounter this floating point exception there as well. Would shuffling the data work around this problem? The data is approximately 91 GB with 220 million rows. Thanks, Sunil.


User 940 | 1/2/2015, 6:15:46 AM

Hi Sunil,

My guess is that shuffling the data may help. However, I'm not certain. We will continue investigating and let you know the solution as soon as possible.

Thanks for your patience!

-Piotr


User 690 | 1/3/2015, 1:52:21 PM

Hi Piotr, Shuffling the data did the trick. However, it is an unnecessarily expensive process. I would love it if the original bug gets fixed. Thanks, Sunil.
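The workaround can be illustrated with a small in-memory sketch in plain Python. It only shows the idea: rows sorted by class are shuffled so that both classes appear early in the data. The helpers `shuffle_rows` and `classes_in_prefix` and the toy rows are hypothetical, and an in-memory shuffle like this would not be practical for a 91 GB dataset. Why the ordering triggers the crash is not established in this thread.

```python
import random


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows; a fixed seed makes the order repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def classes_in_prefix(rows, n, target_index=-1):
    """Return the distinct target labels among the first n rows."""
    return set(row[target_index] for row in rows[:n])


# Hypothetical (feature, Click) rows, sorted by class as described in the thread.
rows = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]
shuffled = shuffle_rows(rows, seed=42)

print(sorted(classes_in_prefix(rows, 20)))      # class-sorted data: one class only
print(sorted(classes_in_prefix(shuffled, 20)))  # after shuffling
```

Fixing the seed keeps the shuffled order identical between runs, which helps when trying to reproduce a crash.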


User 91 | 1/3/2015, 4:38:33 PM

Can you send us some sample data where the crash happens even when both classes are present?


User 2036 | 6/19/2015, 2:25:49 AM

GraphLab seems to be incompatible with the NLTK package. I lost a day's work just figuring that out.


User 91 | 6/19/2015, 2:30:53 AM

I have used graphlab and nltk together in the past. Can you tell us a bit more about what the issue is so that we can look into it?