Error when running parallel_for_each

User 912 | 2/24/2015, 9:34:36 PM

Hello,

I am trying to run parallel_for_each on Hadoop, but I keep getting this error:

File "<stdin>", line 1, in <module> File "/home/vyara/graphlab/local/lib/python2.7/site-packages/graphlab/deploy/parallel.py", line 158, in parallelforeach tempdir = job.getresultspath(environment, name, 'temp') File "/home/vyara/graphlab/local/lib/python2.7/site-packages/graphlab/deploy/job.py", line 391, in getresultspath "getconf", "-confKey", "fs.defaultFS"]).rstrip() File "/usr/lib/python2.7/subprocess.py", line 573, in checkoutput raise CalledProcessError(retcode, cmd, output=output) subprocess.CalledProcessError: Command '['hdfs', '--config', '/home/vyara/etc/hadoop/conf/', 'getconf', '-confKey', 'fs.defaultFS']' returned non-zero exit status 1

Do you have an idea what might be causing it (in case it's a common problem)? Here's the call to the function:

parallel_job = graphlab.deploy.parallel_for_each(parallel_task, params, name='Parallel Task Job1', environment=env)

where

env = graphlab.deploy.environment.Hadoop('h')
env.set_config_dir('~/etc/hadoop/conf/')
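
For reference, the failing check can be reproduced outside of GraphLab with a few lines. This just re-runs the exact command from the traceback above against the config directory (the path is the expanded one from my traceback), in case that helps narrow it down:

<pre class="CodeBlock"><code>import subprocess

# Re-run the command from the traceback by hand; if this fails, the config
# directory handed to the Hadoop environment is wrong or unreachable.
config_dir = '/home/vyara/etc/hadoop/conf/'  # expanded path from the traceback
try:
    default_fs = subprocess.check_output(
        ['hdfs', '--config', config_dir, 'getconf', '-confKey', 'fs.defaultFS']).rstrip()
    print('fs.defaultFS: ' + default_fs)
except subprocess.CalledProcessError as e:
    print('hdfs getconf failed with exit status %d' % e.returncode)</code></pre>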

Cheers.

Comments

User 912 | 2/26/2015, 9:12:33 AM

Any thoughts? :(


User 912 | 2/26/2015, 10:30:32 AM

Actually I solved this. Cheers anyway.


User 91 | 2/26/2015, 3:45:15 PM

Sorry for not responding. Could you elaborate on your solution?


User 912 | 2/27/2015, 5:38:31 PM

Well, because I am working on my dissertation project, I need to ssh through two servers to get access to the school's Hadoop cluster, where I have my own home directory (for my user). I knew the configuration directory was in /tmp/Hadoop/etc.. and I was pointing to that directory when creating the Hadoop environment. However, Python translates this to /user/vyara/tmp/etc.., which is not the actual directory.

In the end I just symlinked to the directory, something like ln -s /etc/.. ~/hadoop-conf, and used that to point to the conf directory; it finally managed to send the job to the cluster.
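
To illustrate what went wrong (this is just a sketch of the path expansion behaviour, not GraphLab's actual code): a ~-style config path gets expanded against my own home directory, so handing the environment a path that already resolves to the real conf directory, here via the symlink, avoids the mix-up:

<pre class="CodeBlock"><code>import os
import graphlab

# '~/etc/hadoop/conf/' expands against MY home directory...
print(os.path.expanduser('~/etc/hadoop/conf/'))
# -> /home/vyara/etc/hadoop/conf/  (the path from the traceback, which doesn't exist)

# ...so point the environment at a path that already resolves to the real conf dir,
# here the symlink created with: ln -s <real conf dir> ~/hadoop-conf
env = graphlab.deploy.environment.Hadoop('h')
env.set_config_dir(os.path.expanduser('~/hadoop-conf'))</code></pre>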

I am having different issues now - an exception about an unknown container / failing to launch a container. If you have any advice about that, I would be more than happy to hear it. :)


User 1178 | 3/10/2015, 3:18:03 AM

Hi vyara,

Can you give us more information about the container error message you got?

Thanks! Ping


User 912 | 3/10/2015, 10:08:12 AM

Hi Ping,

Here's the exception:

<pre class="CodeBlock"><code>Initializing ApplicationMaster Application master for app, appId=10, clustertimestamp=1424439736004, attemptId=1 Starting ApplicationMaster Max mem capabililty of resources in this cluster 1164 Max vcores capabililty of resources in this cluster 4 Container memory specified above max threshold of cluster. Using max value., specified=4096, max=1164 numStages=1 starting a new stage, stagedirectory=0 currentstage=0 totalContainersInStage=1 Requested container ask: Capability[<memory:1164, vCores:2>]Priority[0] Got response from RM for container ask, allocatedCnt=1 Launching shell command on a new container., containerId=container1424439736004001001000002, containerNode=bigdata-03.dcs.gla.ac.uk:8041, containerNodeURI=bigdata-03.dcs.gla.ac.uk:8042, containerResourceMemory1164, containerResourceVirtualCores2 Setting up container launch container for containerid=container1424439736004001001000002 ctx.commands=/bin/bash glcreatebasevirtenv.sh -n hdfs -a -g 0 -t topasync.zip/steps/0/0 -s hadoop-exec-dir-PexrF4/hadoopwrap.py &><LOGDIR>/gllog Got response from RM for container ask, completedCnt=1 Got container status for containerID=container1424439736004001001000002, state=COMPLETE, exitStatus=127, diagnostics=Exception from container-launch. Container id: container1424439736004001001_000002 Exit code: 127 Stack trace: ExitCodeException exitCode=127: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 127

container failed, not gonna retry. incremending totalcompletedthisstage,failedthisstage,failedallstages totalCompletedThisStage=1 totalFailedThisStage=1 totalFailedAllStages=1 totalContainersInStage:1 - totalRequestedThisStage:1 waitingForStage to finish at stage=0 and there are 1 threads stage 0 should be finished incrementing stage Artifact application completed. Stopping running containers Application completed. Signalling finish to RM Application Master failed. exiting</code></pre>

My supervisor's assistant actually told me yesterday that this exception only appears on the log page, but that it apparently has something to do with missing libraries when GraphLab tries to install its tar.gz on the cluster. When I create the Hadoop environment I don't specify the gl_source argument, because I thought GraphLab would then download and install everything by itself. I'm not sure if that could be the problem. :/
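
In case it is the gl_source thing, I suppose I could try pointing it at a tarball the cluster can reach explicitly. Something roughly like this - I haven't verified the exact argument, and the HDFS path is just a placeholder:

<pre class="CodeBlock"><code>import os
import graphlab

# Hypothetical: pass gl_source explicitly instead of relying on the automatic
# download/install. The tarball path below is only a placeholder.
env = graphlab.deploy.environment.Hadoop(
    'h', gl_source='hdfs:///user/vyara/graphlab-create.tar.gz')
env.set_config_dir(os.path.expanduser('~/hadoop-conf'))</code></pre>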

Cheers.


User 17 | 3/16/2015, 5:59:40 PM

Hi vyara,

There should be a log from that container's execution that may provide a bit more information about what's going wrong during execution. Can you try the yarn logs command to see if you can access this log?

Also, sometimes deleting the <your hdfs username>/GraphLabDeploys directory on HDFS may be helpful, in case a bad or incomplete copy of the gl_source got uploaded.
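
Concretely, something along these lines should pull the container logs and clear out the deploy directory. The application id here is only my guess based on the AM log above - substitute the one YARN actually reports for your run:

<pre class="CodeBlock"><code>import subprocess

# Application id guessed from the AM log above (clustertimestamp=1424439736004, appId=10);
# replace it with the id YARN reports for your run.
app_id = 'application_1424439736004_0010'

# Aggregated container logs; exit status 127 usually means "command not found",
# so look for which command or library the container could not find.
print(subprocess.check_output(['yarn', 'logs', '-applicationId', app_id]))

# If a bad/incomplete gl_source upload is suspected, remove the deploy directory on
# HDFS so it gets re-uploaded on the next run (path is relative to your HDFS home dir).
subprocess.call(['hadoop', 'fs', '-rm', '-r', 'GraphLabDeploys'])</code></pre>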

Thanks!