Unable to reach server for XX consecutive pings.

User 542 | 7/29/2014, 7:47:21 PM

Using GraphLab 0.9 on Ubuntu 14.04:

When running a triple_apply function on a large graph (25 million vertices, 450 million edges) on a reasonable machine (12 cores, 32 GB RAM), I end up seeing this in the GraphLab logs:

"Unable to reach server for 25 consecutive pings."

over and over, with the 25 increasing to about 200 or so, and then the function dies in the Python interpreter with Communication Failure 113:

File "/usr/local/lib/python2.7/dist-packages/graphlab/datastructures/sgraph.py", line 839, in tripleapply return SGraph(proxy=g.proxy.lambdatripleapply(tripleapplyfn, mutatedfields)) File "cygraph.pyx", line 181, in graphlab.cython.cygraph.UnityGraphProxy.lambdatripleapply File "cygraph.pyx", line 185, in graphlab.cython.cygraph.UnityGraphProxy.lambdatripleapply RuntimeError: Runtime Exception: 0. Communication Failure: 113.

Watching top: while it runs, the pylambda runners fill up my RAM and take about 50% of all of the cores. Then it seems to flush (I presume to reload from disk), the pylambda workers fill back up, and a few moments later I start seeing these messages.

Any ideas? A timeout error? Anything I can do to fix it?

EDIT - In addition, killing the Python interpreter and the workers requires kill -9 signals.

EDIT EDIT - Never mind the previous EDIT; in a second attempt I was able to Ctrl-D, wait a few minutes, and Python/GraphLab terminated normally.

Comments

User 19 | 7/29/2014, 7:52:04 PM

It sounds like the GraphLab engine has crashed. Would it be possible for you to post a reproducible example, including sample data and the code for the triple_apply you are trying to perform?


User 542 | 7/29/2014, 8:05:38 PM

The triple_apply code works for the web-Google.txt.gz example you provide -- works really great!

<pre><code>
def sssp_fn(src, edge, dst):
    sdist = src['distance']
    ddist = dst['distance']
    newDist = sdist + 1
    dst['distance'] = min(ddist, newDist)
    return (src, edge, dst)
</code></pre>
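For reference, we drive it roughly like this (a sketch; source_vid stands in for whichever source vertex we pick, and the 1e30 sentinel is just our initialization choice):

<pre><code>
# Seed 'distance' with a large sentinel everywhere except the source,
# then run triple_apply; we repeat this until distances stop changing.
g.vertices['distance'] = g.vertices['__id'].apply(
    lambda x: 1e30 if x != source_vid else 0.0)
g = g.triple_apply(sssp_fn, mutated_fields=['distance'])
</code></pre>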

The dataset we're using is the page links of Wikipedia. A pretty standard scale-free directed graph.

BTW - I think your version of sssp is broken (or I don't know how to run it correctly), because it gives nowhere near the correct answers. Right now we're just playing with capabilities.


User 542 | 7/30/2014, 9:04:44 PM

Any ideas here?


User 14 | 7/30/2014, 9:27:47 PM

First of all, thanks for trying out GraphLab Create.

The builtin sssp has a bug (or feature) where it does not ignore the vertex 'distance' field if it already exists. For example, if the input graph already has a 'distance' field on the vertex table, it will use that value as a "warm start". Other than that, gl.shortest_path.create should behave identically to the triple_apply version, and run much faster. Here is a short code snippet to verify:

<pre><code>
import graphlab as gl
import time

def sssp_fn(src, edge, dst):
    sdist = src['distance']
    ddist = dst['distance']
    newDist = sdist + 1
    newDist = min(ddist, newDist)
    if not dst['changed']:
        dst['changed'] = ddist != newDist
    dst['distance'] = newDist
    return (src, edge, dst)

def sssp_triple_apply(input_graph, src_vid):
    g = gl.SGraph(input_graph.vertices, input_graph.edges)
    g.vertices['distance'] = g.vertices['__id'].apply(
        lambda x: 1e30 if x != src_vid else 0.0)
    it = 0
    num_changed = len(g.vertices)
    start = time.time()
    while num_changed > 0:
        g.vertices['changed'] = 0
        g = g.triple_apply(sssp_fn, ['distance', 'changed'])
        num_changed = g.vertices['changed'].sum()
        print 'Iteration %d: num_vertices changed = %d' % (it, num_changed)
        it += 1
    print 'Triple apply sssp finished in: %f secs' % (time.time() - start)
    return g

# Load graph
# g = gl.load_graph('/data/webgraphs/email-Enron.txt.gz', 'snap')
# g = gl.load_graph('/data/webgraphs/com-orkut.ungraph.txt.gz', 'snap')
g = gl.load_graph('/data/webgraphs/web-Google.txt.gz', 'snap')

# Run builtin sssp
m = gl.shortest_path.create(g, 0)
builtin_sssp_distance = m['graph'].vertices.sort('__id')
print 'Builtin sssp finished in %f secs' % m['runtime']

# Run triple apply sssp
triple_apply_sssp_distance = sssp_triple_apply(g, 0).vertices.sort('__id')

# Compare the results
print 'Number of vertices that disagree on the distance: %d' % \
    (builtin_sssp_distance['distance'] != triple_apply_sssp_distance['distance']).sum()
</code></pre>
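If your input graph already carries a 'distance' field and you want a cold start from the builtin toolkit, one workaround (an untested sketch) is to rebuild the graph without that field first:

<pre><code>
# Drop a pre-existing 'distance' vertex field so shortest_path.create
# does not pick it up as a warm start.
vertex_fields = [f for f in g.vertices.column_names() if f != 'distance']
g_cold = gl.SGraph(g.vertices[vertex_fields], g.edges)
m = gl.shortest_path.create(g_cold, 0)
</code></pre>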

The crash you experienced was caused by a worker running out of memory. The triple_apply lambda worker uses much more memory due to the communication needed between the master and the workers. Specifically, the memory footprint scales as |V| * NUM_VERTEX_FIELDS / NUM_GRAPH_PARTITIONS. We are working on bringing down the memory cost and making it more fault-tolerant.
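To make that formula concrete, here is a rough back-of-envelope for the 25M-vertex graph from the original post (the bytes-per-value and partition-count figures below are assumptions for illustration, not measured values):

<pre><code>
# Rough estimate of the per-partition lambda-worker buffer implied by
# |V| * NUM_VERTEX_FIELDS / NUM_GRAPH_PARTITIONS. Both the bytes-per-value
# figure and the partition count are assumed, not measured.
num_vertices = 25e6       # the graph from the original post
num_vertex_fields = 3     # __id, distance, changed
num_partitions = 8        # assumed; see set_runtime_config below
bytes_per_value = 8       # assumed average size of one field value

per_partition_mb = (num_vertices * num_vertex_fields * bytes_per_value
                    / num_partitions / 2 ** 20)
print 'approx. buffer per partition: %.0f MB' % per_partition_mb
</code></pre>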

Meanwhile, if your graph has vertex metadata that is not used by the triple_apply, dropping those fields will reduce the memory cost significantly; in the above example, all we need is ['__id', 'distance', 'changed'] (see the sketch below). You can also try increasing the number of graph partitions:

<pre><code>
print gl.get_runtime_config()
gl.set_runtime_config('GRAPHLAB_SGRAPH_DEFAULT_NUM_PARTITIONS', 16)
</code></pre>
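A sketch of the field-dropping idea (assuming the default '__src_id'/'__dst_id' edge fields; adjust to your schema):

<pre><code>
# Rebuild the SGraph keeping only the vertex fields that the
# triple_apply actually reads or mutates.
slim_g = gl.SGraph(g.vertices[['__id', 'distance', 'changed']],
                   g.edges[['__src_id', '__dst_id']])
</code></pre>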

Best regards, -jay


User 542 | 7/31/2014, 1:31:01 PM

I ran this code; the only change was the location of web-Google.txt.gz.

This is the output:

<pre><code>
Incremental plotting is not supported when TKAgg is used as the backend
[INFO] Start server at: ipc:///tmp/graphlab_server-52922 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1406813140.log
[INFO] GraphLab Server Version: 0.9.0
PROGRESS: Read 3565117 lines. Lines per second: 387139
PROGRESS: Finished parsing file /data/web-Google.txt.gz
PROGRESS: Parsing completed. Parsed 5105039 lines in 9.40858 secs.
PROGRESS: Num vertices updated: 47
PROGRESS: Num vertices updated: 1927
PROGRESS: Num vertices updated: 54709
PROGRESS: Num vertices updated: 365938
PROGRESS: Num vertices updated: 463512
PROGRESS: Num vertices updated: 296923
PROGRESS: Num vertices updated: 123545
PROGRESS: Num vertices updated: 36684
PROGRESS: Num vertices updated: 11126
PROGRESS: Num vertices updated: 3120
PROGRESS: Num vertices updated: 895
PROGRESS: Num vertices updated: 258
PROGRESS: Num vertices updated: 83
PROGRESS: Num vertices updated: 37
PROGRESS: Num vertices updated: 12
PROGRESS: Num vertices updated: 5
PROGRESS: Num vertices updated: 2
PROGRESS: Num vertices updated: 4
PROGRESS: Num vertices updated: 1
PROGRESS: Num vertices updated: 0
Builtin sssp finished in 12.974241 secs
PROGRESS: First Use of Python Lambda: Starting Lambda Workers. This might take a few seconds.
Iteration 0: num_vertices changed = 1019
Iteration 0: num_vertices changed = 32524
Iteration 0: num_vertices changed = 261386
Iteration 0: num_vertices changed = 417771
Iteration 0: num_vertices changed = 276898
Iteration 0: num_vertices changed = 128416
Unable to reach server for 3 consecutive pings. Server is considered dead. Please exit and restart.
Traceback (most recent call last):
  File "pings.py", line 42, in <module>
    triple_apply_sssp_distance = sssp_triple_apply(g, 0).vertices.sort('__id')
  File "pings.py", line 21, in sssp_triple_apply
    g.vertices['changed'] = 0
  File "/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/gframe.py", line 208, in __setitem__
    super(GFrame, self).__setitem__(key, value)
  File "/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.py", line 2042, in __setitem__
    self.add_column(sa_value, tmp_name)
  File "/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/gframe.py", line 80, in add_column
    self.__graph__.__proxy__ = graph_proxy
  File "/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.py", line 23, in __exit__
    raise exc_type(exc_value)
RuntimeError: Communication Failure: 113.

Unable to reach server for 4 consecutive pings. Server is considered dead. Please exit and restart.
Unable to reach server for 5 consecutive pings. Server is considered dead. Please exit and restart.
Unable to reach server for 6 consecutive pings. Server is considered dead. Please exit and restart.

[INFO] Stopping the server connection.
Unable to reach server for 7 consecutive pings. Server is considered dead. Please exit and restart.
[INFO] GraphLab server shutdown
</code></pre>

</code></pre>

The machine I'm running on has 64 cores and 256 GB of RAM available (mostly empty).


User 14 | 7/31/2014, 5:47:23 PM

Can this be consistently reproduced on your machine? I've tried the code on my desktop, my laptop (OS X), and an EC2 machine, but I cannot reproduce the crash.

One suspicion is that the machine is running out of tmp file space. Can you please send me the log file /tmp/graphlab_server_1406813140.log? Also, can you please check the disk space for /var/tmp? Thanks.
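If it helps, the free space can also be checked from the same Python session with the standard library:

<pre><code>
import os

# Report free space on /var/tmp (roughly what `df -h /var/tmp` shows).
st = os.statvfs('/var/tmp')
print 'free space on /var/tmp: %.1f GB' % (st.f_bavail * st.f_frsize / float(2 ** 30))
</code></pre>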


User 542 | 8/1/2014, 3:04:12 AM

Thanks for your attention to this.

/var/tmp is nowhere near full (it has 120 GB or so free).

The log is attached.

Thanks!


User 14 | 8/1/2014, 5:33:29 PM

We have a hypothesis about this issue. The machine has 64 cores; as a result, many file handles are opened at the same time, exceeding the file handle limit. To help us verify, would you mind trying the following and running the same code again? Thanks.

In your bash shell, before starting GraphLab:

<pre><code>
$ ulimit -n 2048
</code></pre>

In your Python shell or script:

<pre><code>
gl.set_runtime_config('GRAPHLAB_SFRAME_FILE_HANDLE_POOL_SIZE', 2048)
</code></pre>
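To confirm the new limit actually took effect inside the Python process (standard library only):

<pre><code>
import resource

# The soft limit should report 2048 after the ulimit change above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'open file limit: soft=%d, hard=%d' % (soft, hard)
</code></pre>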


User 542 | 8/4/2014, 4:12:53 PM

That seems to have done the trick (on your code). We have our own code that is still having problems, but I think those are our own program's problems.

Thanks.

TW


User 190 | 1/12/2015, 4:25:04 PM

Did you guys ever sort this out?

I seem to be having similar issues (server communication errors when running triple_apply functions).