Train distributed on Spark Cluster

User 1637 | 3/30/2015, 4:14:23 PM

Hi all, I saw this very interesting notebook and I was wondering whether the training is done on a Spark cluster or only locally?

I am very curious about the distributed part. I have not yet succeeded in running a training job on my Hadoop cluster.

Thanks in advance for your help.



User 1190 | 3/31/2015, 6:48:12 PM

Hi @CourbeB

Model training is done locally, on a single machine, but it scales beyond memory. The point of the notebook is to show that you can do the data crunching in a distributed fashion with Spark, and then use GraphLab Create for efficient model training on a single machine.
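
To make the two-stage pattern concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the notebook's code: a thread pool stands in for Spark's distributed map/reduce, a hand-written least-squares fit stands in for GraphLab Create's training, and all function names and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_partition(rows):
    """Stage 1 (stand-in for a Spark map task): reduce raw rows to per-user (count, total)."""
    totals = {}
    for user_id, amount in rows:
        count, total = totals.get(user_id, (0, 0.0))
        totals[user_id] = (count + 1, total + amount)
    return totals


def merge(partials):
    """Stand-in for the reduce step: combine the per-partition summaries."""
    merged = {}
    for part in partials:
        for user_id, (count, total) in part.items():
            c, t = merged.get(user_id, (0, 0.0))
            merged[user_id] = (c + count, t + total)
    return merged


def fit_line(points):
    """Stage 2 (stand-in for single-machine training): least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical raw data, already split into partitions as a cluster would hold it.
partitions = [
    [("u1", 2.0), ("u2", 4.0), ("u1", 2.0)],
    [("u2", 4.0), ("u3", 6.0), ("u3", 6.0), ("u3", 6.0)],
]

# Heavy data crunching happens in parallel across partitions...
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(summarize_partition, partitions))
features = merge(partials)

# ...and only the much smaller summarized table reaches the single training machine.
slope, intercept = fit_line(sorted(features.values()))
```

The design point is that the expensive step runs per partition and only the summarized result is handed to the trainer, which is why training on one machine can remain practical.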

Thanks, jay

User 1637 | 4/1/2015, 9:01:24 AM

Thanks a lot for your reply. Does Dato plan to add Spark clusters as a new deployment environment?

User 1190 | 4/6/2015, 6:30:04 PM

Hi @CourbeB,

We already support a Hadoop/YARN cluster as a deployment environment. Because a Spark cluster runs on Hadoop/YARN, it should already be supported.

Can you elaborate on your use case, in case I misunderstood your question?

Thanks, -jay

User 1637 | 4/7/2015, 8:57:30 AM


Currently, GraphLab only uses Hadoop MapReduce jobs. When you use Hadoop as the deployment environment, you specify where the Hadoop configuration file is, but you cannot choose between Hadoop MapReduce and Spark. What I mean is that I would prefer to run Spark jobs instead of MapReduce jobs. This is not possible, is it?

Thanks, Baptiste

User 1190 | 4/7/2015, 5:39:34 PM

Hi @CourbeB,

Good question.

If you are asking about using Spark Standalone Mode as the deployment environment, the current answer is no.

If you are using Spark on YARN mode, then GraphLab's job runs the same way as Spark on YARN, using YARN for job scheduling.

If neither of the above fits your use case, can you please elaborate a little more on what you are trying to achieve?

Thanks, -jay
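
For readers unsure which of the two modes their cluster uses, the Spark master URL usually tells you. The sketch below classifies a master URL and applies the rule from this thread (YARN yes, standalone no). The function names are hypothetical helpers, not part of GraphLab Create or Spark, and the final check encodes only what is stated above.

```python
def spark_deploy_mode(master_url):
    """Classify a Spark master URL by the cluster manager it points at."""
    url = master_url.strip().lower()
    if url.startswith("spark://"):
        return "standalone"  # Spark's own standalone cluster manager
    if url == "yarn" or url.startswith("yarn-"):
        return "yarn"        # "yarn", or the older "yarn-client" / "yarn-cluster"
    if url.startswith("mesos://"):
        return "mesos"
    if url.startswith("local"):
        return "local"       # "local", "local[4]", "local[*]"
    return "unknown"


def graphlab_can_share_cluster(master_url):
    """Hypothetical check based only on this thread: YARN-managed clusters are supported."""
    return spark_deploy_mode(master_url) == "yarn"


for master in ["spark://master-host:7077", "yarn-client", "local[*]"]:
    print(master, "->", spark_deploy_mode(master), graphlab_can_share_cluster(master))
```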