Correctly executing the default PageRank in the analytics toolkit

User 90 | 3/6/2014, 2:59:20 AM

I am trying to run the default PageRank provided in the analytics toolkit on a cluster of 12 nodes where each node has 32 cores.

(1) I want to control the number of computation threads per node, and hence I am using --ncpus=X, where I vary X from 1 to 30. However, for all of my runs, exactly 77 processes are created on each of the 12 nodes. Is this the correct behavior? I was expecting some threads for communication/synchronization, but isn't 77 a lot?

(2) Here is what I notice:

  --ncpus=2 ---> 19.3 seconds
  --ncpus=4 ---> 70.9 seconds
  --ncpus=8 ---> does not complete in 10 minutes

Is this trend expected? Also, the outputs for --ncpus=2 and --ncpus=4 do not match: for example, vertex 0 is 0.345408 when --ncpus=2 and it is 0.256067 when --ncpus=4. This is surely incorrect; what am I doing wrong?

(3) Is there a way to determine whether my request for --ncpus was accepted or not (maybe some kind of verbose mode)? Currently, the output does not say anything related to the --ncpus parameter.

Command: mpirun -n 12 -ppn 1 -hostfile hostfile.txt ./pagerank --graph=graph-ip.txt --format=snap --saveprefix=output --ncpus=X

I have attached the output of a run (when --ncpus=2).

Comments

User 20 | 3/6/2014, 4:14:12 AM

Interesting. Each process spawns a ton of threads, but most of them are idle. 32 of those threads are from OpenMP (which is still used in a few places), and another 32 are the internal thread pool. That makes 64. The rest are small utility threads that wake up periodically to do things like collect logs.
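If you want to double-check the count yourself on Linux, counting the per-thread entries under /proc for one of the pagerank processes should report the same number, e.g. ls /proc/<pid>/task | wc -l.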

The default execution behavior of pagerank is synchronous execution with dynamic scheduling (i.e. vertices which have not converged are scheduled for execution in the next round). Executions should generally be deterministic, up to numeric precision issues (we are using double precision everywhere). If you are getting variable answers and variable convergence behavior, your graph might be particularly difficult (an extremely skewed power-law graph)? In truth, we have never encountered such a graph before.

You could instead consider fixing the number of iterations: ./pagerank --iterations=20, which will run 20 complete rounds of pagerank with no dynamic scheduling, and see what you get.
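To make "synchronous execution with dynamic scheduling" concrete, here is a minimal single-machine sketch on a toy three-vertex graph (the names and the graph are illustrative, not the toolkit's engine code). Updates are applied at a barrier at the end of each round, which is why the result should not depend on the thread count; a vertex whose rank changed beyond tolerance signals its out-neighbors for the next round.

    // Toy synchronous PageRank with dynamic scheduling (illustrative sketch).
    #include <cmath>
    #include <cstdio>
    #include <set>
    #include <vector>

    int main() {
        // Edges: 0->2, 1->0, 1->2, 2->0, 2->1.
        std::vector<std::vector<int>> in  = {{1, 2}, {2}, {0, 1}};   // in-neighbors
        std::vector<std::vector<int>> out = {{2}, {0, 2}, {0, 1}};   // out-neighbors
        std::vector<int> outdeg = {1, 2, 2};

        const double d = 0.85, tol = 1e-9;
        std::vector<double> rank(3, 1.0), next = rank;
        std::set<int> active = {0, 1, 2};              // initially all vertices

        while (!active.empty()) {
            std::set<int> next_round;
            for (int v : active) {                     // one synchronous round
                double sum = 0;
                for (int u : in[v]) sum += rank[u] / outdeg[u];
                next[v] = (1 - d) + d * sum;
                if (std::fabs(next[v] - rank[v]) > tol)         // still changing:
                    for (int w : out[v]) next_round.insert(w);  // schedule neighbors
            }
            rank = next;                               // barrier: apply all updates
            active = next_round;                       // dynamic schedule shrinks
        }
        for (int v = 0; v < 3; ++v) std::printf("vertex %d: %f\n", v, rank[v]);
    }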


User 90 | 3/6/2014, 5:46:32 AM

My input graph is a standard power-law graph. I think the PageRank algorithm does not assume any particular structure in its input, and it is expected to return the same stable answer across multiple runs.

I tried fixing the number of iterations; that does not help.

Two consecutive runs of the following command (kept exactly the same) produce different outputs: mpirun -n 12 -ppn 1 -hostfile hostfile.txt ./pagerank --graph=graph-ip.txt --format=snap --saveprefix=output --ncpus=2 --iterations=20

Also, two consecutive runs of the following command (kept exactly the same) produce different outputs: mpirun -n 12 -ppn 1 -hostfile hostfile.txt ./pagerank --graph=graph-ip.txt --format=snap --saveprefix=output --ncpus=2

Is this because I am not using the stable 2.2 release (as recommended by you in my previous post: http://forum.graphlab.com/discussion/51/unable-to-compile-graphlab-analytics-with-mpi#latest)?


User 20 | 3/6/2014, 6:40:48 AM

Hmm... That's interesting. When was your checkout taken? It is preferable that you use the most current head of the repository.


User 90 | 3/6/2014, 6:47:03 AM

My checkout was taken at 3:12 PM today (March 05). I pulled the latest version again just now and am still facing the same issue.


User 20 | 3/6/2014, 6:57:22 AM

Interesting. I am not sure. Let me ask another collaborator to see if he has any thoughts. Is it possible to share your data?


User 90 | 3/6/2014, 7:15:57 AM

Sure. What data do you need?


User 20 | 3/6/2014, 7:18:58 AM

The graph file for which you are finding that repeated runs do not give the same answer. If you can find a relatively small version of the file which has the problem, I can try to take a look.


User 90 | 3/6/2014, 7:22:51 AM

I am using the LiveJournal network from SNAP available at http://snap.stanford.edu/data/soc-LiveJournal1.html.


User 20 | 3/6/2014, 7:23:41 AM

Hmm... We have performed runs on this graph before without issues. Let me try this on our systems and I will let you know.


User 90 | 3/6/2014, 6:15:43 PM

I took a fresh checkout this morning and am working with that now. After various experiments, here is what I have noticed.

  1. If I do not set the --ncpus parameter, multiple runs output the same stable answer. It takes 50 iterations and 180-200 seconds.

  2. If I set --ncpus=30, multiple runs output the same stable answer (note that the number of cores on each node is 32). This answer also matches the answer in case 1, and it likewise takes 50 iterations and 180-200 seconds.

  3. If I set --ncpus=2, multiple runs output the same stable answer. However, it does not match the answer in the above cases, and it takes only 30-37 iterations and 19-21 seconds -- neither of which matches the above cases.

  4. If I set --ncpus=25, the algorithm does not converge in a reasonable time; the number of active nodes keeps increasing/decreasing randomly, so I had to kill the process manually when it was on its 64th iteration.

I am attaching the outputs for cases 1, 3, and 4.

--- Which is the correct answer for PageRank?

--- Does this mean that the --ncpus parameter does not work correctly? How can I fix this and move on?

--- Just to be 100% sure: if I do not set the --ncpus parameter, it defaults to the number of cores - 2 (which should be 30 in my case since each node has 32 cores). Am I right?


User 20 | 3/6/2014, 6:17:15 PM

That's very interesting. Indeed, the default is #cores - 2. It could very possibly be a bug in the ncpus setting. I will look into it. Thanks for your debugging work!


User 90 | 3/6/2014, 6:31:04 PM

Okay. Just to get a rough estimate, will this take a lot of time? Hoping for a quick fix :)


User 20 | 3/7/2014, 1:17:53 AM

I think I found the issue. The fix may impact performance very marginally, if at all.

Apply it with patch -p 1 < patch.txt in the repository root.


User 90 | 3/7/2014, 2:36:23 AM

I have applied the patch. Now if I set --ncpus=2, it works. However, if I set --ncpus=30 or --ncpus=25, the program crashes.

I am attaching the output here.


User 20 | 3/7/2014, 3:40:16 AM

Progress! Got it. Silly error. Undo the last patch and apply this new one. I am only "marginally" satisfied with this solution, since it introduces a potential point of contention. Let me know if there is too much performance degradation.

Basically, the reason is that we allocate #cores - 2 (30 in your case) thread-local receive buffers. Then, when --ncpus is < 30, we end up skipping some of these buffers when processing received messages. I added a "work-stealing" mechanism so that when a thread is done with its own buffer, it looks around the other buffers to pick up something to process. The side effect is that I had to add a lock.

I am not certain how contended this lock will be in the grand scheme of things. There is already a ton of buffer combining going on, so each time the lock is acquired the thread will pick up quite a lot of messages; the lock may not be too contended in practice.
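To sketch the idea (illustrative names only, not the actual patch): there are more receive buffers than worker threads, so each worker drains its own buffer first and then walks the remaining, ownerless buffers under their locks.

    // Work-stealing over thread-local receive buffers (illustrative sketch).
    // Compile with -pthread.
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct RecvBuffer {
        std::mutex lock;          // the lock the patch introduces
        std::deque<int> msgs;     // combined incoming messages
    };

    void process(int msg) { std::printf("processed %d\n", msg); }

    // Drain own buffer first, then steal from buffers that have no owner
    // because ncpus < number of buffers.
    void worker(std::vector<RecvBuffer>& bufs, std::size_t my_id) {
        for (std::size_t i = 0; i < bufs.size(); ++i) {
            RecvBuffer& b = bufs[(my_id + i) % bufs.size()];
            std::lock_guard<std::mutex> g(b.lock);
            while (!b.msgs.empty()) { process(b.msgs.front()); b.msgs.pop_front(); }
        }
    }

    int main() {
        const std::size_t nbuffers = 30, ncpus = 2;  // #cores - 2 buffers, --ncpus=2
        std::vector<RecvBuffer> bufs(nbuffers);
        for (std::size_t i = 0; i < nbuffers; ++i) bufs[i].msgs.push_back((int)i);

        std::vector<std::thread> threads;
        for (std::size_t t = 0; t < ncpus; ++t)
            threads.emplace_back(worker, std::ref(bufs), t);
        for (auto& t : threads) t.join();
    }

Because each acquisition drains a whole batch of combined messages, the time spent holding the lock is amortized over many messages, which is why it may not be too contended in practice.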


User 20 | 3/7/2014, 4:26:51 AM

Hmm... I may need a lock elsewhere too. Wait.


User 20 | 3/7/2014, 4:31:56 AM

Sorry. Here is an updated patch.

I am really not too excited about this solution. Part of the objective of the design is to avoid the use of locks by making everything use thread-local buffers; I get a lot of performance that way. This sort of ends up putting locks back. Let me know if this works, and maybe I will try to come up with a better solution some other time.


User 90 | 3/7/2014, 7:01:26 AM

Thanks Yucheng for your help! This patch works fine; I am able to set --ncpus to different values and the answer is always consistent.

To be 100% sure that the execution is correct, is the following behavior expected?

  PageRank: --engine=synchronous takes 180 seconds, --engine=asynchronous takes 300 seconds.
  SSSP: --engine=synchronous takes 15 seconds, --engine=asynchronous takes 170 seconds.

I was expecting asynchronous to be faster than synchronous. If this is correct behavior, can you please explain why?


User 20 | 3/7/2014, 6:06:16 PM

Asynchronous might be faster sometimes, but it can be quite dataset-dependent and hard to tune. Basically it is a balance of two things:

1. The asynchronous algorithm converges faster than the synchronous one, in the sense that it requires less work in total (fewer vertex updates); but
2. The asynchronous implementation has a much higher system overhead, simply due to the complexity of managing asynchrony. (Currently it requires nearly 2x the message overhead of the synchronous algorithm; our message headers are a little too big.)

The asynchronous implementation works by creating thousands of user-mode threads (fibers) per machine, each running one vertex program. So performance gains may only be evident on large graphs where there is enough work for so many fibers. Also, the number of fibers to create depends somewhat on the task as well as the network type: the faster the network, the fewer fibers needed. You can tune the number of fibers with --engine_opt="nfibers=xxxx". I think the default is 10000; this seems to be about right for regular Gigabit Ethernet. You can try tuning it down.
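For example, an asynchronous run with a smaller fiber count could look like this (adapting your earlier command; the value 5000 is just an illustration, not a recommendation):

  mpirun -n 12 -ppn 1 -hostfile hostfile.txt ./pagerank --graph=graph-ip.txt --format=snap --saveprefix=output --engine=asynchronous --engine_opt="nfibers=5000"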


User 90 | 3/8/2014, 6:01:40 PM

Thanks Yucheng. That helps!


User 33 | 3/8/2014, 7:35:38 PM

There is no definite answer to the question of which engine, sync or async, has better performance. The performance depends heavily on the input graph, the algorithm, the hardware configuration, and even the partitioning. Our project PowerSwitch provides a study of this problem and some cases (http://ipads.se.sjtu.edu.cn/projects/powerswitch.html).

For your cases:

  1. PageRank performs much better in sync mode, while LBP performs notably better in async mode.

  2. SSSP is relatively complex. The diameter of your input graph, LiveJournal, is quite low (17), so execution in sync mode converges very quickly; sync SSSP needs roughly one superstep per hop, so the round count tracks the diameter. However, if you try roadNet-CA (http://snap.stanford.edu/data/roadNet-CA.html) with SSSP, async mode may provide better performance due to its high diameter (849). More detailed information can be found in our tech report (http://ipads.se.sjtu.edu.cn/projects/powerswitch/PowerSwitch-IPADSTR-2013-003.pdf).

Rong, IPADS http://ipads.se.sjtu.edu.cn/projects/powerlyra.html


User 253 | 4/23/2014, 10:16:27 PM

Hi, any idea when this issue will be fixed and rolled out in a release?