How to run a job on Hadoop?

User 2060 | 6/19/2015, 8:26:06 PM

I tried to follow the "EC2 and Hadoop" tutorial, and I am experimenting with the Hadoop cluster at my university.

Below is my code:

```python
In [1]: import graphlab as gl

In [3]: def add(x, y):
   ...:     return x + y

In [17]: hd = gl.deploy.environment.Hadoop('hd3',
   ....:     gl_source='hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/luning/graphlab',
   ....:     config_dir="/etc/hadoop/conf.dist/")

In [18]: job = gl.deploy.job.create(add, x = 1, y = 1, environment = hd)
[INFO] Validating job.
[INFO] Validation complete. Job: 'add-Jun-20-2015-04-11-06' ready for execution.
[INFO] attempting to find hadoop core-site.xml at /etc/hadoop/conf.dist
[INFO] configs = [{'namenode': 'eta1.larc.smu.edu.sg', 'port': 8020}]
[INFO] job working directory: hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/dato_distributed/jobs/add-Jun-20-2015-04-11-06
[INFO] Submitting job to Hadoop cluster using command= hadoop --config /etc/hadoop/conf.dist jar /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/graphlabHadoopYarn-1.0.jar -worker_port_end 9200 -glc_version 1.4.1 -appname dato_distributed -worker_port_start 9100 -gl_source hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/luning/graphlab -jar /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/graphlabHadoopYarn-1.0.jar -json -product_key /var/tmp/graphlab-dujuan/16093/hadoop_job_ee45fac0-49de-4a5f-bf72-116b415493da/product_key -job_working_dir hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/dato_distributed/jobs/add-Jun-20-2015-04-11-06 -commander_port_end 9100 -commander_port_start 9000 -container_memory 4096 -resource_dir /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/dato_distributed/pipeline -num_containers 3 -container_vcores 2

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-e88626927877> in <module>()
----> 1 job = gl.deploy.job.create(add, x = 1, y = 1, environment = hd)

/home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/job.pyc in create(function, name, environment, **kwargs)
    258     LOGGER.info("Validation complete. Job: '%s' ready for execution." % name)
    259     exec_env = env.get_execution_env(environment)
--> 260     job = exec_env.run_job(job)
    261
    262     # Save the job and return to user

/home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/execution_environment.pyc in run_job(job, session)
    426
    427     if 'app_id' not in job_info or not job_info['app_id']:
--> 428         raise RuntimeError("Error submitting application or determining application id. Please confirm that you"
    429                            " can correctly access the Hadoop environment specified (try running the above"
    430                            " command from a terminal to see more diagnostic output).")

RuntimeError: Error submitting application or determining application id. Please confirm that you can correctly access the Hadoop environment specified (try running the above command from a terminal to see more diagnostic output).
```
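The error is raised at the submission step: `job.create()` shells out to the `hadoop` client (see the "Submitting job to Hadoop cluster using command=" log line), so the first thing to rule out is basic cluster access from the same machine. Below is a minimal pre-flight sketch; the host, config directory, and HDFS path are copied from the session above and will differ on other clusters:

```python
import subprocess

# Values copied from the session above -- adjust for your cluster.
CONFIG_DIR = "/etc/hadoop/conf.dist"
GL_SOURCE = "hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/luning/graphlab"

def run(cmd):
    # Echo the command, then print its combined stdout/stderr.
    print("$ " + " ".join(cmd))
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    print(out)
    return proc.returncode

# 1. Is the hadoop client installed and runnable?
run(["hadoop", "version"])

# 2. Can this machine reach HDFS and see the gl_source directory?
run(["hadoop", "--config", CONFIG_DIR, "fs", "-ls", GL_SOURCE])
```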

I also tried running the command from the log output manually:

```
hadoop --config /etc/hadoop/conf.dist jar /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/graphlabHadoopYarn-1.0.jar -worker_port_end 9200 -glc_version 1.4.1 -appname dato_distributed -worker_port_start 9100 -gl_source hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/luning/graphlab -jar /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/graphlabHadoopYarn-1.0.jar -json -product_key /var/tmp/graphlab-dujuan/16093/hadoop_job_39781ac1-018a-4ef7-81f1-b65f73572540/product_key -job_working_dir hdfs://eta1.larc.smu.edu.sg:8020/user/dujuan/dato_distributed/jobs/add-Jun-20-2015-04-04-51 -commander_port_end 9100 -commander_port_start 9000 -container_memory 4096 -resource_dir /home/dujuan/luning/anaconda/envs/graphlab/lib/python2.7/site-packages/graphlab/deploy/dato_distributed/pipeline -num_containers 3 -container_vcores 2
```
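When the hand-run command fails too, capturing its combined output makes the YARN client's diagnostics easier to inspect and share. A small sketch (`submit.log` is an arbitrary file name; paste the full command printed by GraphLab Create into `cmd`):

```python
import subprocess

# Paste the exact submit command printed by GraphLab Create here.
cmd = "hadoop --config /etc/hadoop/conf.dist jar ..."  # full command as above

with open("submit.log", "w") as log:
    # Merge stderr into stdout so the YARN client's errors land in the log.
    proc = subprocess.Popen(cmd, shell=True, stdout=log,
                            stderr=subprocess.STDOUT)
    proc.wait()
print("exit status: %d" % proc.returncode)
```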


Comments

User 1178 | 6/22/2015, 5:33:11 PM

Hi Luning,

To help us diagnose the problem, can you give a little more information about your environment?

  1. Can you send us the output of the following commands? (If you are not comfortable sharing the information, you may email me: ping at dato dot com.) A small script that gathers all of this is sketched after this list.

    hadoop version
    hadoop classpath
    cat /etc/hadoop/conf.dist

  2. Can you tell us if you have YARN (Hadoop v2) installed? If so, what version do you have?
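Here is a minimal helper to collect all of the above in one go (a sketch, assuming `hadoop` is on the PATH; since /etc/hadoop/conf.dist is a directory, it prints each *.xml config file inside it rather than `cat`-ing the directory itself):

```python
import glob
import subprocess

# Shell out for the two hadoop commands requested above.
for cmd in (["hadoop", "version"], ["hadoop", "classpath"]):
    print("$ " + " ".join(cmd))
    print(subprocess.check_output(cmd))

# /etc/hadoop/conf.dist is a directory, so dump each XML config in it.
for path in sorted(glob.glob("/etc/hadoop/conf.dist/*.xml")):
    print("==> %s <==" % path)
    with open(path) as f:
        print(f.read())
```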

Thanks! Ping


User 2060 | 6/23/2015, 3:59:25 AM

Hi, Wang Ping,

This is my output:

```
[dujuan@eta1 ~]$ hadoop version
Hadoop 2.0.0-cdh4.6.0
Subversion git://rhel64-6-0-mk4.jenkins.cloudera.com/data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hadoop-2.0.0-cdh4.6.0/src/hadoop-common-project/hadoop-common -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on Wed Feb 26 01:58:53 PST 2014
From source with checksum a9d36604dfb55479c0648f2653c69095
This command was run using /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.6.0.jar
[dujuan@eta1 ~]$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*
```

/etc/hadoop/conf.dist is a directory; which file should I cat?


User 1178 | 6/23/2015, 6:04:56 PM

Hi Luning,

The Hadoop installation you have is quite old and uses the old API design. GraphLab Create is certified against CDH 5.0 and later, which has a more stable API surface. Would you mind upgrading your Hadoop system to a newer version of CDH?

The 2.0.0 version has YarnClient in the following namespace:

org.apache.hadoop.yarn.client.YarnClient

While the newer versions have it in the following namespace:

org.apache.hadoop.yarn.client.api.YarnClient

Thanks!

Ping
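One quick way to confirm which of the two YarnClient classes Ping describes is actually shipped on a cluster is to scan the YARN client jars for both fully-qualified names. A minimal sketch, assuming CDH's usual /usr/lib/hadoop-yarn layout from the classpath output above:

```python
import glob
import zipfile

# Candidate fully-qualified names for YarnClient (old vs. new API).
OLD = "org/apache/hadoop/yarn/client/YarnClient.class"
NEW = "org/apache/hadoop/yarn/client/api/YarnClient.class"

# Jar locations follow the CDH layout shown in the classpath above.
for jar in sorted(glob.glob("/usr/lib/hadoop-yarn/*.jar")):
    try:
        names = set(zipfile.ZipFile(jar).namelist())
    except zipfile.BadZipfile:
        continue  # skip files that are not valid jars
    for cls in (OLD, NEW):
        if cls in names:
            print("%s contains %s" % (jar, cls))
```

On a CDH 4.x install this should report only the old class path, matching the submission failure above.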


User 2060 | 6/24/2015, 1:41:57 AM

Hi Wang Ping,

Thank you for your response. Although I cannot install or upgrade to the latest version, I now know the cause.