Running GraphLab along with Hadoop

User 690 | 12/12/2014, 8:51:49 AM

Hi everybody, there are at least two different web pages about how to work with Hadoop: http://pivotalhd.docs.pivotal.io/doc/2100/webhelp/topics/GraphLab.html http://www.i-programmer.info/news/197-data-mining/7584-graphlab-create.html http://graphlab.com/products/create/docs/generated/graphlab.deploy.environment.Hadoop.html

This URL says I can't use GraphLab Create on a Hadoop cluster alongside YARN (it being the resource negotiator, of course): http://forum.graphlab.com/discussion/667/scheduling-graphlab-create-jobs-across-multiple-yarn-containers I want to use the logistic classifier (using Python). Can I currently run my logistic classifier code across machines in a Hadoop cluster? If this is not possible, can I set up a parallel MPI cluster on the same Hadoop cluster and run this code?

Please advise,

Thanks, Sunil.

Comments

User 6 | 12/12/2014, 9:06:01 AM

Hi Sunil, so far GraphLab Create (our newest version) supports a single multicore machine; you can schedule it using YARN on Cloudera CDH5. GraphLab Create is optimized for disk access, meaning you can run very big logistic models on a single machine - billions of rows - even if they do not fit in physical memory.
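
For reference, training such a model on a single machine looks roughly like the sketch below (the file path, target, and feature names are placeholders, not taken from this thread):

    import graphlab as gl

    # SFrames are disk-backed, so the dataset does not have to fit in RAM.
    # 'training_data.csv', 'label' and the feature names are illustrative.
    data = gl.SFrame.read_csv('training_data.csv')

    # Train a logistic classifier on the full dataset on one machine.
    model = gl.logistic_classifier.create(data,
                                          target='label',
                                          features=['feature_1', 'feature_2'])

    # Evaluate on the training set and save the model for later use.
    print(model.evaluate(data))
    model.save('my_logistic_model')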

Previously, we had a distributed codebase called PowerGraph, which is soon going to be deprecated. There you could run on a cluster; we used MPI for scheduling the run. However, PowerGraph did not have logistic regression implemented in it.

We recommend trying out GraphLab Create; let us know if you have any performance issues.


User 690 | 12/12/2014, 10:49:52 AM

Thanks, Danny, for the response. I am using GraphLab Create right now. I have written my logistic classifier using the example from the logistic classifier page on your website. Can you please point me to an end-to-end working example of using YARN to allocate resources for the single-computer shared-memory execution?

PS: I currently do not have any performance issues. I am quite happy with the performance of GraphLab's logistic_classifier, which is an order of magnitude faster than the map-reduce implementation. I was just enquiring about future scalability needs. Are there future plans for a distributed GraphLab Create?

Thanks, Sunil.


User 10 | 12/12/2014, 9:35:55 PM

Hey Sunil -

We have an end-to-end example of submitting Jobs to a YARN cluster available as a Notebook here: http://graphlab.com/learn/gallery/notebooks/datapipelinerecsysintro.html

The way to submit work to a YARN cluster is through the GraphLab Data Pipelines feature. Currently, GraphLab Data Pipelines are supported on the Cloudera Hadoop distribution (CDH5); other YARN-based distributions should work as well, but CDH5 is the only one we officially support today. We will be adding support for other distributions in the future.

There is more documentation about this feature in the Data Pipelines section of the User Guide, here: http://graphlab.com/learn/userguide/index.html#DeploymentGraphLabData_Pipelines
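
As a rough sketch (not a verbatim excerpt from the Notebook above), submitting work through Data Pipelines looks something like the following; the environment name, config_dir argument, and job-creation call are illustrative, so please confirm the exact signatures against the Hadoop environment docs and the User Guide linked above:

    import graphlab as gl

    def train_model(path):
        # This function is what runs in a YARN container on the cluster.
        data = gl.SFrame.read_csv(path)
        return gl.logistic_classifier.create(data, target='label')

    # Define a CDH5/YARN execution environment. The constructor arguments
    # shown here are illustrative; see the Hadoop environment docs for the
    # exact parameters your GraphLab Create version expects.
    hadoop_env = gl.deploy.environment.Hadoop('my-cdh5-cluster',
                                              config_dir='/etc/hadoop/conf')

    # Submit the function as a job to the cluster; again, verify the exact
    # job-creation API against the Data Pipelines section of the User Guide.
    job = gl.deploy.job.create(train_model,
                               environment=hadoop_env,
                               path='hdfs:///path/to/training_data.csv')
    print(job.get_status())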

Please let us know if you have any questions getting started!

As to distributed execution in GraphLab Create: yes, we have plans for that and it is on our roadmap. Today you can run Tasks in parallel with the graphlab.deploy.parallel_for_each API and with the graphlab.toolkits.model_parameter_search API. We will continue to add distributed execution functionality in the coming months, so stay tuned.
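
A hypothetical sketch of such a parameter search is shown below; the exact signature of model_parameter_search varies between releases, so treat the argument names here as illustrative and check them against the API docs for your version:

    import graphlab as gl

    # Illustrative only: search over the l2_penalty hyperparameter of the
    # logistic classifier, running the candidate models in parallel.
    train = gl.SFrame.read_csv('training_data.csv')

    search = gl.toolkits.model_parameter_search(
        gl.logistic_classifier.create,        # model factory to evaluate
        train,                                # training data
        target='label',                       # fixed argument for every run
        l2_penalty=[0.001, 0.01, 0.1, 1.0])   # candidate values to try

    # Inspect the results of each configuration once the runs complete.
    print(search)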

Thanks.

Rajat


User 690 | 12/17/2014, 3:45:18 AM

Thanks Rajat!