[RuntimeError: Communication Failure: 113] when trying to sort SFrame

User 1319 | 3/29/2015, 2:52:34 AM

Hi,

I'm getting this [RuntimeError: Communication Failure: 113] when trying to sort an SFrame that contains ~450,000 images. I ran this sorting step several times on my laptop and on an EC2 GPU instance, with the same result.

Any suggestions on how to resolve this issue?

Note: After this error happens, the execution of any GraphLab code will produce the same error. I have to restart the IPython notebook in order to be able to execute GraphLab code again.

<b class="Bold">Code:</b>

<pre class="CodeBlock"><code>train["shuffle"] = random.sample(xrange(train.num_rows()), train.num_rows())
train = train.sort("shuffle")</code></pre>

<b class="Bold">The error traceback:</b>

<pre class="CodeBlock"><code>RuntimeError                              Traceback (most recent call last)
<ipython-input-8-dfeff17fdc80> in <module>()
      1 train["shuffle"] = random.sample(xrange(train.num_rows()), train.num_rows())
----> 2 train = train.sort("shuffle")

/home/ubuntu/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.pyc in sort(self, sort_columns, ascending)
   4986
   4987         with cython_context():
-> 4988             return SFrame(_proxy=self.__proxy__.sort(sort_column_names, sort_column_orders))
   4989
   4990     def dropna(self, columns=None, how='any'):

/home/ubuntu/anaconda/lib/python2.7/site-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     37     def __exit__(self, exc_type, exc_value, traceback):
     38         if not self.show_cython_trace and exc_type:
---> 39             raise exc_type(exc_value)

RuntimeError: Communication Failure: 113.</code></pre>

Thanks,

Tarek

Comments

User 1207 | 3/30/2015, 10:06:14 PM

Hey Tarek,

Thanks for reporting this issue. We're looking into this here -- we've been able to reproduce it, and we're working on a fix.

In the meantime, the following code does a shuffle using other operations that shouldn't hit this problem. It's ugly and a lot slower, but it should produce okay results:

<pre class="CodeBlock"><code>def shuffle(Z, n):
    for i in xrange(5):
        n += n / (7 - i)
        nblocks = (Z.num_rows() / n) + 1
        block_order = range(nblocks)
        random.shuffle(block_order)
        Y = Z[n * block_order[0]:n * (block_order[0] + 1)]
        for bi in block_order[1:]:
            Y = Y.append(Z[n * bi:n * (bi + 1)])
        Z = Y

        k = 3 + i
        Y = Z[::k]
        for i in range(k):
            Y = Y.append(Z[i::k])

        Z = Y

    return Z

X_train = shuffle(X, 1000)</code></pre>


User 1319 | 3/30/2015, 11:16:16 PM

Hi @hoytak,

Thank you for your quick reply and the workaround code.

Looking at my laptop's System Monitor (Ubuntu 14), SFrame.sort consumes all of the main memory and swap (4GB each).

Does this error happen because the sorting is done in memory, and eventually, it will crash if the SFrame is relatively big?

Thanks,

Tarek


User 1207 | 3/31/2015, 6:12:17 PM

Hi Tarek,

4GB is a little small; we generally target our tuning parameters for around 8GB of memory. However, these constants are easily adjustable -- see http://forum.dato.com/discussion/835/detailed-settings-for-graphlab-set-runtime-config. The ones you would probably be most interested in are GRAPHLAB_SFRAME_SORT_BUFFER_NUM_CELLS and GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY (you should decrease both values). You can get the current values using <code class="CodeInline">gl.get_runtime_config()</code> and set them with <code class="CodeInline">gl.set_runtime_config</code>.
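As a rough sketch of what lowering those two settings might look like: the setting names below are written with underscores on the assumption that the forum formatting stripped them, the example values are made up, and halving is an arbitrary starting point rather than a recommendation. The sketch also assumes that the get call returns a dict mapping setting names to numeric values; the GraphLab part is skipped if the package is not installed.

```python
def scaled_config(current, names, factor=0.5):
    """Return {name: reduced value} for the selected settings."""
    return {name: int(current[name] * factor) for name in names}


# Setting names as read from the post (underscores assumed).
NAMES = [
    "GRAPHLAB_SFRAME_SORT_BUFFER_NUM_CELLS",
    "GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY",
]

# Hypothetical values standing in for the real runtime configuration.
example_current = {NAMES[0]: 1000, NAMES[1]: 2000}
new_values = scaled_config(example_current, NAMES)
print(new_values)

try:
    import graphlab as gl
except ImportError:
    gl = None  # GraphLab Create is not installed; nothing to apply

if gl is not None:
    # Assumption: get_runtime_config() returns a dict of name -> value.
    current = gl.get_runtime_config()
    for name, value in scaled_config(current, NAMES).items():
        gl.set_runtime_config(name, value)
```

If the sort still stalls, the same helper can be rerun with a smaller `factor`.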

The error tends to happen because the process that monitors the compute process incorrectly concludes that it has crashed, when in fact swapping and disk I/O have simply stalled it for a number of seconds while it waits for the OS. We're working on a more robust way to monitor the compute process.

Thanks! -- Hoyt


User 1319 | 3/31/2015, 7:28:53 PM

Hi Hoyt,

Thank you for your reply regarding the memory.

Regarding the code you provided, I believe it has a bug: it returns an SFrame with more rows than the original SFrame.

For example, calling the method on an SFrame (train) with 20 rows [ train = shuffle(train, 4) ] returns an SFrame with 55 rows! Am I missing something?

Cheers,

Tarek


User 1207 | 4/1/2015, 5:46:15 PM

Yes, you are correct -- the range(k) near the end should be range(1, k). Below is the corrected code:

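<pre class="CodeBlock"><code>import random

def shuffle(Z, n):
    for i in xrange(5):
        # Grow the block size a little on each pass.
        n += n / (7 - i)

        # Step 1: cut Z into blocks of n rows and append them in random order.
        nblocks = (Z.num_rows() / n) + 1
        block_order = range(nblocks)
        random.shuffle(block_order)
        Y = Z[n * block_order[0]:n * (block_order[0] + 1)]
        for bi in block_order[1:]:
            Y = Y.append(Z[n * bi:n * (bi + 1)])
        Z = Y

        # Step 2: interleave rows with stride k. Z[::k] already covers
        # offset 0, so the loop starts at 1 to avoid duplicating rows.
        k = 3 + i
        Y = Z[::k]
        for i in range(1, k):
            Y = Y.append(Z[i::k])

        Z = Y

    return Z

X = shuffle(X, 1000)</code></pre>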
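The row counts reported in this thread can be checked without GraphLab by running the same two steps on a plain Python list. This is only a stand-in sketch: `shuffle_list` and its `first_offset` parameter are hypothetical names introduced here, with list slicing and `extend` taking the place of SFrame slicing and `append`.

```python
import random


def shuffle_list(rows, n, first_offset=1):
    """List-based stand-in for the SFrame workaround.

    first_offset=1 mirrors range(1, k); first_offset=0 mirrors range(k).
    """
    rows = list(rows)
    for i in range(5):
        n += n // (7 - i)

        # Step 1: reorder blocks of n rows at random.
        nblocks = len(rows) // n + 1
        block_order = list(range(nblocks))
        random.shuffle(block_order)
        reordered = []
        for bi in block_order:
            reordered.extend(rows[n * bi:n * (bi + 1)])
        rows = reordered

        # Step 2: interleave with stride k; rows[::k] is offset 0.
        k = 3 + i
        interleaved = rows[::k]
        for j in range(first_offset, k):
            interleaved.extend(rows[j::k])
        rows = interleaved
    return rows


random.seed(0)
fixed = shuffle_list(range(20), 4)
buggy = shuffle_list(range(20), 4, first_offset=0)
print(len(fixed), sorted(fixed) == list(range(20)))  # 20 True
print(len(buggy))                                    # 55
```

With `first_offset=0`, every pass appends the offset-0 rows a second time, so 20 rows grow to 27, 34, 41, 48 and finally 55, which matches the count reported above.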


User 1319 | 4/1/2015, 7:34:39 PM

Thanks @hoytak. I highly appreciate your time. Tarek