Trouble with Dato-Distributed

User 3531 | 3/22/2016, 3:33:49 PM

I ran the example in Dato Distributed, but it got stuck and I had to interrupt it.

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-7-c5b626097df0> in <module>()
----> 1 print j.get_results()

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_results(self)
    571         if is_map_job:
    572             LOGGER.info("To retrieve partial results from the map job while it is running, please use get_map_results()")
--> 573         self.wait_for_job_finish()
    574
    575         status = self.get_status(silent=True)

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in wait_for_job_finish(self)
    698         Wait for the job to reach final state
    699         '''
--> 700         while not self.job_finished():
    701             time.sleep(1)
    702

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in job_finished(self)
    692         Returns whether or not the job has finished
    693         '''
--> 694         return self.is_final_state(self.get_status(silent = True))
    695
    696     def wait_for_job_finish(self):

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_status(self, silent)
    403             return self.status
    404
--> 405         self.status = self.get_status(silent = True)
    406         if self.is_final_state(self.status):
    407             self._finalize()

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_status(self, silent)
   1205         # Get our job status file first
   1206         status_file = self.exec_dir + '/status'
-> 1207         status = self.load_file_and_parse(status_file, self.parse_status_file, silent = silent, test_url=False)
   1208
   1209         # status file should never be none now

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in load_file_and_parse(self, filename, parser_func, silent, test_url)
    345                                               hdfs_path = filename,
    346                                               local_path = local_file_name,
--> 347                                               hadoop_conf_dir=self.environment.hadoop_conf_dir)
    348
    349         elif file_util.is_s3_path(filename):

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/util/file_util.pyc in download_from_hdfs(hdfs_path, local_path, hadoop_conf_dir, is_dir)
    498     else:
    499         base_command = 'hadoop fs -get \"%s\" \"%s\" ' % (hdfs_path, local_path)
--> 500     exit_code, stdo, stde = hdfs_exec_command(base_command)
    501
    502     if exit_code != 0:

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/util/file_util.pyc in hdfs_exec_command(command, silent)
    789     pobj = subprocess.Popen(command, stdout=PIPE, stderr=PIPE, shell=True)
    790
--> 791     stdo, stde = pobj.communicate()
    792     exit_code = pobj.returncode
    793

/mirror/anaconda2/lib/python2.7/subprocess.pyc in communicate(self, input)
    797             return (stdout, stderr)
    798
--> 799         return self._communicate(input)
    800
    801

/mirror/anaconda2/lib/python2.7/subprocess.pyc in _communicate(self, input)
   1407
   1408             if _has_poll:
-> 1409                 stdout, stderr = self._communicate_with_poll(input)
   1410             else:
   1411                 stdout, stderr = self._communicate_with_select(input)

/mirror/anaconda2/lib/python2.7/subprocess.pyc in _communicate_with_poll(self, input)
   1461             while fd2file:
   1462                 try:
-> 1463                     ready = poller.poll()
   1464                 except select.error, e:
   1465                     if e.args[0] == errno.EINTR:

KeyboardInterrupt:

What's wrong with it? Here are the log and stdout. Thanks!

Comments

User 17 | 3/23/2016, 5:22:45 PM

Hi @qyy0180,

Reading the application log, it looks like there's a problem with the container launch out on your cluster. We'll need to see those logs to find out what the problem is. In the example you sent, the container in question is container_1458540669273_0003_01_000003, which ran on graphlabslave1. You can look for logs under /var/log/hadoop-yarn .... or they may be elsewhere, depending on how you have them configured. Worst case, search your server for the container id.

You can also try using yarn logs -applicationId <your app id> to get the logs if you have log aggregation enabled.

Post the logs up here when you find them! They should point us to the problem.
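
For example, something along these lines (just a sketch; the application id below is a placeholder, not the real one from your run) will pull the aggregated logs from Python on the node where you submitted the job:

import subprocess

# Placeholder -- substitute the application id of your failed job.
app_id = 'application_XXXXXXXXXXXXX_XXXX'

# Requires YARN log aggregation to be enabled on the cluster.
logs = subprocess.check_output('yarn logs -applicationId %s' % app_id, shell=True)
print logs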


User 3531 | 3/28/2016, 11:48:26 AM

Hi @romero,

I'm sorry, I can't find them on my machines. So I tried it again, and here is a new issue.

RuntimeError                              Traceback (most recent call last)
<ipython-input-6-c5b626097df0> in <module>()
----> 1 print j.get_results()

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_results(self)
    584         # retrieved, so we don't suggest any further actions.
    585         else:
--> 586             raise RuntimeError("The job execution failed. Cannot retrieve results.")
    587
    588         # status should be "Completed" at this point

RuntimeError: The job execution failed. Cannot retrieve results.

And here are the logs. Thanks!


User 1178 | 3/31/2016, 9:45:43 PM

Hi,

What is the output you get when you run the following commands?

print j.get_status()
print j.get_metrics()
print j.get_error()

Usually, if your job failed, get_error() will give you information about why it failed. If the job failed because of your user code (an exception was thrown inside the user code), get_metrics() will give you more information about the failure. If you cannot get any information out of these two APIs, then it is usually something wrong with the environment setup, and that is where the YARN logs would help.
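
For instance, here is a rough sketch of how those calls fit together (assuming j is the job object you already have from submitting the job):

try:
    print j.get_results()      # raises RuntimeError when the job failed
except RuntimeError:
    print j.get_status()       # final state, e.g. 'Failed'
    print j.get_metrics()      # per-task metrics, including user-code exceptions
    print j.get_error()        # environment / deployment errors, if any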

From looking at your logs, it seems that hadoop cannot be found on one of the worker nodes. Can YARN jobs be executed on all of the nodes?

Thanks!