Doubt about partitioning the original file in PowerGraph in parallel

User 944 | 11/19/2014, 3:32:58 AM

Hi,

I have an input file stored in TSV format, like below:

9999	7943
9999	9358

The file is in HDFS and I have 3 distributed machines. The command below is executed and completes successfully.

mpiexec -n 3 --hostfile /root/machines env CLASSPATH=$CLASSPATH /home/hongsibao/graphlab/graphlab-master/release/toolkits/graph_analytics/pagerank --graph=hdfs://10.67.238.65:9000/pr10W --format=tsv --iterations=9 --engine=synchronous --saveprefix=/home/xuke/1106/

My questions are:

1. I don't think the original file is divided into 3 parts handled by each machine separately, because when I check the memory usage, the values on the 3 machines are almost the same as the result of executing on a single machine.

So I think each machine deals with the whole data set, not a part of it. Why?

  2. The result files in the output path exist on each machine, and the total number of vertices across those files is the same, 100,000. I thought there should be only one file on the master machine, not 3 shares. Is my opinion correct?
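As the question itself observes, each machine writes its own result files under the saveprefix directory, so multiple output files are expected. A minimal local sketch of combining such shards into one file (the shard names here are made up as stand-ins; in a real run you would first copy the shard files from each machine's /home/xuke/1106/ directory to one host):

```shell
# Demo with stand-in shard files (names are hypothetical):
mkdir -p parts
printf '1\t0.15\n2\t0.25\n' > parts/shard_1_of_3
printf '3\t0.35\n'          > parts/shard_2_of_3
printf '4\t0.45\n'          > parts/shard_3_of_3

# Concatenate the per-machine shards into one combined result file.
cat parts/shard_* > pagerank_all.tsv
wc -l pagerank_all.tsv
```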

Any help would be appreciated. BR,

Comments

User 6 | 11/19/2014, 8:05:30 AM

Please read and follow section 1 here: http://graphlab.org/projects/tutorials.html#perf_tuning


User 944 | 11/19/2014, 8:59:22 AM

Dear Danny,

Thanks a lot for your comments.

I have checked what you posted, and here it is: "GraphLab has built-in parallel loading of the input graph. However, for efficient parallel loading, the input file should be split into multiple disjoint sub-files. When using a single input file, the graph loading becomes serial (which is bad!)."

Anyway, I still do not understand completely. Based on the above, does it mean each machine deals with the same input file over HDFS? Because the occupied memory and running time of each machine are similar, the same as the result on a single machine.
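The split the tutorial describes can be sketched like this. The file names, part count, and HDFS directory below are assumptions based on the command earlier in the thread, and `split -n l/3` requires GNU coreutils:

```shell
# Sketch: break a single edge list into 3 disjoint sub-files so each
# machine can load one part in parallel (tiny stand-in input file).
printf '9999\t7943\n9999\t9358\n1\t2\n' > pr10W
split -n l/3 -d pr10W pr10W.part-   # produces pr10W.part-00 .. part-02
ls pr10W.part-*

# Upload the parts and point --graph at the directory instead of the
# single file (requires a running HDFS cluster; path is an assumption):
#   hdfs dfs -mkdir -p /pr10W-parts
#   hdfs dfs -put pr10W.part-* /pr10W-parts/
#   ... --graph=hdfs://10.67.238.65:9000/pr10W-parts/ --format=tsv
```

With one part per machine, each loader reads a disjoint slice instead of the whole file, which is what makes the loading phase parallel.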