Parallelized Python

User 956 | 11/24/2014, 9:32:18 PM

Hey all,

I have a question regarding parallelism in GraphLab. I am currently implementing a graph algorithm in Python for GraphLab and wondering how to parallelize the Python execution. The algorithm maps onto the typical MapReduce paradigm: I compute some values on an SFrame, then do a merge, and repeat this several times. The work on the SFrame could (logically) be parallelized; it is currently driven by a for-each loop. My question is: how can I parallelize this? Does GraphLab handle this on its own? Should I use an interface other than the Python one?

The algorithm (spreading activation) looks like this: start at a specific vertex in the graph; go to all neighbors and calculate a value for each of them (the same value for all neighbors); then go to the neighbors' neighbors, calculate values for them, and sum the values.
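To make the steps above concrete, here is a minimal plain-Python sketch of spreading activation, independent of GraphLab (the adjacency-dict representation, `decay` factor, and function name are my own assumptions for illustration, not from the eventual implementation):

```python
from collections import defaultdict

def spreading_activation(graph, seed, decay=0.5, hops=2):
    """Spread activation outwards from a seed vertex.

    `graph` is a plain adjacency dict {node: [neighbors]}. Each hop
    passes a decayed share of a node's activation to all its neighbors
    (the same value for each), and contributions arriving at the same
    node are summed.
    """
    activation = defaultdict(float)
    activation[seed] = 1.0
    frontier = {seed}
    for _ in range(hops):
        next_contrib = defaultdict(float)
        for node in frontier:
            for nbr in graph.get(node, []):
                # every neighbor receives the same decayed value
                next_contrib[nbr] += activation[node] * decay
        for nbr, value in next_contrib.items():
            activation[nbr] += value
        frontier = set(next_contrib)
    return dict(activation)
```

The inner loop over the frontier is the part that is logically independent per vertex, which is exactly what one would want GraphLab to parallelize.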

As soon as I have implemented the algorithm I will publish it as how-to, blog post or discussion.

Thanks for your help! Tarek

Comments

User 6 | 11/25/2014, 6:00:29 AM

Hi Tarek, Here is an example for <a href="https://github.com/graphlab-code/how-to/blob/master/triple_apply_weighted_pagerank.py">weighted pagerank</a>. It does a weighted average calculation. Please take a look and let us know if this helps you - the <a href="http://graphlab.com/products/create/docs/generated/graphlab.SGraph.triple_apply.html">triple_apply()</a> call is parallelized.
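For readers unfamiliar with it: `SGraph.triple_apply()` calls a user function on each edge, passing the source vertex, the edge, and the destination vertex as dicts, and runs those calls in parallel across edges. A sketch of what such an update function could look like for spreading activation (the field names `activation` and `weight` and the decay constant are assumptions, not taken from the linked example):

```python
def spread_update(src, edge, dst):
    """Edge update in the style of an SGraph.triple_apply function:
    receives (source vertex, edge, destination vertex) as dicts and
    returns them, possibly mutated. Here the destination accumulates
    a decayed, edge-weighted share of the source's activation."""
    dst['activation'] += src['activation'] * edge['weight'] * 0.5
    return src, edge, dst
```

The function itself is pure Python; GraphLab takes care of applying it to every edge in parallel while locking the two endpoint vertices of each edge.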


User 16 | 11/25/2014, 6:15:31 PM

Hi Tarek,

We will have a parallel-for-each very soon (with the next version of GraphLab Create). Using this you will be able to easily spin up multiple EC2 hosts that can execute your calculations in parallel.


User 956 | 11/27/2014, 7:19:18 PM

Hi Danny, Hi Tobi,

Thanks for your responses. The parallel-for-each is not available in GraphLab 1.1, is it? I implemented the algorithm using triple_apply, but I noticed that its performance is worse than the for-each loop over SFrames, so I am looking forward to the parallel-for-each to improve performance.

I would like to share the algorithm with the community, but I suspect it is too complicated for a how-to. What would be the best way to do this: a discussion in the forum or a blog post? Please let me know if you think this is a good idea.

Cheers, Tarek


User 10 | 12/9/2014, 6:37:05 PM

Hey Tarek -

The graphlab.deploy.parallel_for_each API is available in GraphLab Create 1.1 (though we need to do a better job documenting it). The API docs are here: http://graphlab.com/products/create/docs/generated/graphlab.deploy.parallel_for_each.html#graphlab.deploy.parallel_for_each. This API launches a set of EC2 instances to process the Tasks in parallel. If all the Tasks operate on the same overall SFrame, some additional logic may be required to ensure that only one Task writes to the SFrame at a time (in the MapReduce paradigm, essentially ensuring that all the Mappers have completed before the Reducers run).
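The "mappers finish before reducers start" barrier mentioned above can be illustrated in plain Python with `concurrent.futures`, without any GraphLab machinery (the word-count mapper/reducer here are purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def map_reduce(chunks, mapper, reducer, workers=4):
    """Run mappers in parallel, then reduce the collected results.
    Materializing all mapper outputs with list() before calling the
    reducer is the barrier that keeps the two phases from overlapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, chunks))  # blocks until all mappers finish
    return reducer(mapped)

# Illustrative mapper/reducer: word counting over text chunks.
def count_words(chunk):
    return Counter(chunk.split())

def merge_counts(counters):
    total = Counter()
    for c in counters:
        total.update(c)
    return total
```

Having each mapper return its own result object, and merging only in the reduce phase, also avoids the shared-writer problem entirely, since no two Tasks ever write to the same structure.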

We would love your contribution! I think the best way to share it would be as a contributed IPython Notebook in the GraphLab Gallery (http://graphlab.com/learn/gallery/index.html), plus a blog post by you sharing your findings (and linking to the notebook). Let's discuss offline and finalize the plan.

Take care.

Rajat