Graphlab IOError?

User 512 | 10/15/2015, 3:37:54 PM

I was parsing a fairly large text file. At the end, GraphLab stopped with an IOError and a ValueError:

PROGRESS: Finished parsing file 20151010.gz
PROGRESS: Parsing completed. Parsed 170757196 lines in 2459.32 secs.
PROGRESS: Less than 48 successfully started. Using only 35 workers.
PROGRESS: All operations will proceed as normal, but lambda operations will not be able to use all available cores.
PROGRESS: To help us diagnose this issue, please send the log file to
PROGRESS: (The location of the log file is printed at the start of the GraphLab server).
[INFO] Stopping the server connection.
[WARNING] <type 'exceptions.IOError'>
[WARNING] <type 'exceptions.ValueError'>

Below are my GraphLab settings:

gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 48)
gl.set_runtime_config('GRAPHLAB_CACHE_FILE_LOCATIONS', '/share01:/share02:/share03:/share04:/share05')
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY', 2147483648 * 35)  # 70 GiB
gl.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 134217728 * 500)  # 62.5 GiB
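The two cache capacities above are written as byte products, which makes them easy to misread. A minimal sketch that converts them to GiB (the helper name `bytes_to_gib` is made up for illustration and is not part of GraphLab):

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB as a float."""
    return num_bytes / float(GIB)


# Values copied from the runtime settings in the post.
total_cache_bytes = 2147483648 * 35      # 2 GiB * 35
per_file_cache_bytes = 134217728 * 500   # 128 MiB * 500

print("total cache:    %.1f GiB" % bytes_to_gib(total_cache_bytes))
print("per-file cache: %.1f GiB" % bytes_to_gib(per_file_cache_bytes))
```

This comes to 70 GiB for the total and 62.5 GiB for the per-file limit.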

How could I fix this error? Thanks!


User 512 | 10/15/2015, 8:53:59 PM

[WARNING] 'exceptions.IOError'
[WARNING] 'exceptions.ValueError'

User 512 | 10/16/2015, 3:36:49 PM

It also shows "Communication Failure: 113"

User 512 | 10/16/2015, 3:52:47 PM

[INFO] Start server at: ipc:///tmp/graphlabserver-118613 - Server binary: /share01/home/anaconda/lib/python2.7/site-packages/graphlab/unityserver - Server log: /tmp/graphlabserver1445010581.log
[INFO] GraphLab Server Version: 1.6.1
Traceback (most recent call last):
  File "", line 483, in <module>
    blazant=gl.load_sframe('/share02/blazant')
  File "/share01/home/anaconda/lib/python2.7/site-packages/graphlab/datastructures/", line 208, in load_sframe
    sf = SFrame(data=filename)
  File "/share01/home/anaconda/lib/python2.7/site-packages/graphlab/datastructures/", line 867, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/share01/home/anaconda/lib/python2.7/site-packages/graphlab/cython/", line 49, in __exit__
    raise exctype(excvalue)
RuntimeError: Communication Failure: 113.
[INFO] Stopping the server connection.
[WARNING] type 'exceptions.IOError'
[WARNING] type 'exceptions.ValueError'

User 512 | 10/16/2015, 3:53:22 PM

Could you help me with this ASAP? Now even loading a file causes the IOError.

User 15 | 10/16/2015, 5:19:20 PM

Hi Shuning,

It's unclear from your output when the error is actually happening, or what the circumstances around it are.

Are you parsing a file, or are you loading a saved SFrame? Is this part of a larger script that runs things with lambda workers, or is this immediately when the SFrame is loaded/parsed? Are you reporting more than one error here?

Once you explain the exact error case, could you send a server log for that run to


User 512 | 10/16/2015, 6:13:25 PM

The errors all come from the same script, but the point at which they occur differs. Sometimes it happens when an SFrame is loaded and sometimes when an SFrame is saved, but the common factors are:

  • The error relates to IOError/ValueError exceptions
  • The error relates to Communication Failure: 113

My script is posted below. It is pretty simple.

blazant = gl.load_sframe('/share02/blazant')  # Error can happen here
sf_stream = gl.SFrame.read_csv(file, header=False, delimiter='', column_type_hints=str)  # Input file is large, but this runs fine
sf_stream = sf_stream.unpack('X1', column_name_prefix="")  # This runs fine
sf_stream = sf_stream.join(blazant, on={'srcip':'IP'}, how='left')  # This runs fine
sf_stream.save('output')  # Error can happen here

User 15 | 10/16/2015, 6:37:17 PM

IOError, ValueError, and Communication Failure are all pretty unhelpful. They basically just mean the server crashed.

I'll need the log from a run to debug anything.

User 512 | 10/17/2015, 4:21:11 AM

Sure, I have emailed the log to Dato support.

User 512 | 10/20/2015, 4:30:56 PM

The log is attached here.

User 512 | 10/20/2015, 4:31:26 PM

Could you please take a look at it? I haven't heard anything back from Dato support. Thanks!

User 15 | 10/20/2015, 5:18:57 PM

This is pretty confusing to navigate. Let's try to focus on one of the error conditions first. It's very weird for load_sframe to be failing. Can you post the log from /tmp/graphlabserver1445010581.log, from when the load_sframe call failed? Also, what is the filesystem on /share02 (or on all of the share0x mounts, for that matter)? The only ways I could see load_sframe causing a crash are if the filesystem gave out while we were reading, if the SFrame was partly corrupted, or if some other process was writing to that location while you were loading. Does this failure (just the load_sframe failure) happen often?
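For the filesystem question, one way to check on Linux is to look the path up in /proc/mounts. A rough sketch, assuming a Linux host; the function name `filesystem_type` and the sample mount lines are illustrative only, and mount points containing spaces (escaped as \040 in /proc/mounts) are not handled:

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text is the content of /proc/mounts: one mount per line, with the
    mount point in the second field and the filesystem type in the third.
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) matching mount point.
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        print(filesystem_type("/share02", f.read()))
```

A result such as `('/share02', 'nfs4')` would indicate a network mount, whereas a match on `/` alone means the path is just a directory on the root filesystem.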

User 512 | 10/20/2015, 5:31:44 PM

Sure, the log is attached. The filesystem is NFS, and they are all network-mounted drives. The load_sframe failure does not happen often, though.

User 512 | 10/20/2015, 5:34:01 PM

The main issue is the failure when saving the big SFrame at the final step. It has about 170M rows and 130 columns, and every time GraphLab shows an IOError and quits.

User 15 | 10/20/2015, 6:56:59 PM

Thanks for the log.

Remember, since the SFrame is lazy, the issue is probably not in save. It's probably in any of the operations leading up to save, and save causes them all to materialize. Also, the code snippet you provided does not show any apply statements, but the log you first posted shows a lot of pylambda activity, so it seems like there has to be some apply statement somewhere. Is that true? That seems to be what's failing. If not, and save really is what's failing (you can call sf_stream.__materialize__() after each operation to force materialization and narrow the problem down), then I would think it would have something to do with the filesystem, as would the load_sframe problem.
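The narrowing-down idea can be sketched as a small helper that runs the pipeline one step at a time and forces materialization after each. This is only a sketch: `run_and_materialize` and the step names are hypothetical, and the only GraphLab-specific assumption is that each step returns an object with the `__materialize__()` method mentioned above.

```python
def run_and_materialize(steps, initial=None):
    """Run (name, func) steps in order, materializing each result.

    Each func receives the previous step's result. Returns a tuple
    (last_result, failed_step_name); failed_step_name is None on success.
    """
    current = initial
    for name, func in steps:
        try:
            current = func(current)
            # SFrame operations are lazy, so force evaluation now; a failure
            # is then attributed to this step rather than to a later save.
            materialize = getattr(current, "__materialize__", None)
            if callable(materialize):
                materialize()
        except Exception as exc:
            print("step %r failed: %s: %s" % (name, type(exc).__name__, exc))
            return current, name
        print("step %r OK" % name)
    return current, None
```

With the script from this thread, the steps would be the read_csv, unpack, join, and save calls, each wrapped in a function or lambda taking the previous result. If the server process itself dies, the helper can only report the step during which the connection was lost; the server log is still needed for the cause.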

User 512 | 10/20/2015, 8:52:49 PM

Thanks! I will try that.