What's the result of Dato Distributed's example?

User 3531 | 3/15/2016, 3:29:13 PM

I ran the example in the README of Dato Distributed, but it got stuck.

What's wrong with it? Here is the log.

Thanks!

Comments

User 15 | 3/15/2016, 9:08:05 PM

Hi @qyy0180

Unfortunately the server log doesn't give us any useful information about the errors of the actual job. Those live on the machines where you deployed the job. Are you able to get the application ID and use that to get the logs? It should be "yarn logs -applicationId <your application id>" if you are on Hadoop.

Evan


User 3531 | 3/16/2016, 2:03:22 AM

Hi @EvanSamanas ,

Is this the right log? Thanks!


User 15 | 3/16/2016, 8:18:05 PM

Hi @qyy0180

This doesn't look like an application-specific log; rather, it's the log of a ResourceManager that is attempting to deploy applications. I don't see any useful error messages in here. How did you get this log?


User 3531 | 3/17/2016, 1:08:07 AM

Hi @EvanSamanas,

The environment is Hadoop 2.6.4, and I found the log in hadoop-2.6.4/logs/.

After I interrupted j.get_results(), the following message came out.

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-14-bbef447d63a4> in <module>()
----> 1 j.get_results()

/mirror/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_results(self)
    571         if is_map_job:
    572             LOGGER.info("To retrieve partial results from the map job while it is running, please use get_map_results()")
--> 573         self.wait_for_job_finish()
    574
    575         status = self.get_status(silent=True)

/mirror/anaconda2/envs/dato-env/lib/python2.7/site-packages/graphlab/deploy/job.pyc in wait_for_job_finish(self)
    699         '''
    700         while not self.job_finished():
--> 701             time.sleep(1)
    702
    703     def get_start_time(self):

KeyboardInterrupt:

I think the problem comes from the 'sleep'. Why does it sleep? Thanks!
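For context, the blocking shown in the traceback above is just a polling loop, not the failure itself: get_results waits for the job to finish, sleeping between status checks. A minimal sketch of that pattern (the function and status names here are assumptions for illustration, not GraphLab's actual API):

```python
import time

def wait_for_job_finish(job, poll_interval=1):
    """Poll a job until it reaches a terminal state.

    The sleep only throttles the status checks while the cluster-side
    job runs (or never starts); a job stuck in a pending state will
    block here indefinitely, which is what the KeyboardInterrupt shows.
    """
    # Terminal state names below are assumptions for illustration.
    while job.get_status(silent=True) not in ("Completed", "Failed", "Canceled"):
        time.sleep(poll_interval)
    return job.get_status(silent=True)
```

So interrupting the sleep is harmless; the real question is why the cluster-side job never reaches a terminal state.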


User 3531 | 3/17/2016, 7:01:56 AM

Hi @EvanSamanas,

The problem seems complicated. Sometimes it shows the following error.

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-bbef447d63a4> in <module>()
----> 1 j.get_results()

/mirror/anaconda2/lib/python2.7/site-packages/graphlab/deploy/job.pyc in get_results(self)
    584         # retrieved, so we don't suggest any further actions.
    585         else:
--> 586             raise RuntimeError("The job execution failed. Cannot retrieve results.")
    587
    588         # status should be "Completed" at this point

RuntimeError: The job execution failed. Cannot retrieve results.

BTW, could you tell me where the application-specific logs usually are?

Thanks!


User 17 | 3/17/2016, 5:52:16 PM

Hey @qyy0180

You can find application-specific logs out on your Hadoop cluster. We're looking for the Application Master log first, which will tell us which nodes your job is attempting to execute on. If you have access to your ResourceManager's UI, you can click the application id and view this log. If you have log aggregation enabled, you can view container logs by issuing the command

yarn logs -applicationId <your application id>

If not, you need to view logs from wherever your logging directory is configured; on my system that happens to be /var/log/hadoop-yarn/container/<application id>/ ... etc. The file we're looking for is called gl_AppMaster.stdout.

This file will tell you where the containers for Dato Distributed attempted to run. Finding the logs from those containers will give us more insight as to why your job isn't running.
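Since the directory layout under the log root varies by cluster, a recursive search is the simplest way to track the file down. A small sketch (the helper name and the default filename handling are illustrative, not part of any Dato tooling):

```python
import os

def find_app_master_logs(log_root, filename="gl_AppMaster.stdout"):
    """Walk a YARN container-log directory and return every path to the
    given log file.

    log_root is whatever your node's logging directory is configured to
    (e.g. /var/log/hadoop-yarn/container); the layout underneath varies
    by cluster, so we just search recursively.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(log_root):
        if filename in filenames:
            hits.append(os.path.join(dirpath, filename))
    return sorted(hits)
```

For example, find_app_master_logs("/var/log/hadoop-yarn/container") would list every gl_AppMaster.stdout under that tree, one per application attempt.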

You can also use yarn application -status <your application id>

or

job.get_status()

to get a report of the status of your application according to the YARN ResourceManager. If the application never enters the "running" state, you may have a problem with resource allocation.

-Romero.


User 3531 | 3/18/2016, 7:04:34 AM

Hey @romero,

Thank you for your patience! Here are the logs.


User 3531 | 3/18/2016, 12:18:36 PM

Hey @romero,

I have solved the issue above; it was caused by a bad memory allocation. But now I have another one. Here are the logs. Thanks!