Parallel Loading of the Input Graph in GraphLab

User 84 | 3/3/2014, 5:31:00 PM

     I am trying to process a very large data set (about 150GB) with GraphLab, so I am sure I need to load the input data set in parallel using multiple nodes. Fortunately, I found that, as the following statement shows, GraphLab provides a parallel-loading feature. But I did not find how to edit my command line to use this feature. Could anyone help me? Thanks in advance.
     1: Understanding input graph loading

GraphLab has built-in parallel loading of the input graph. However, for efficient parallel loading, the input file should be split into multiple disjoint sub-files. When using a single input file, the graph loading becomes serial (which is bad!). Each MPI instance has a single loader of the input graph attached to it (no matter how many CPUs are used by that MPI instance). Tip: always split your input file into at least as many pieces as the number of MPI processes you are using. Regards, Yue

Comments

User 6 | 3/3/2014, 5:37:03 PM

Hi, you are right. To speed up GraphLab's loading of large graphs, it is recommended to split them first into smaller disjoint chunks. Please use the Linux "split -l" command. You can first count the number of lines using "wc -l filename" and then split the file into pieces of X lines each.
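As a toy sketch of that recipe (the file name and sizes here are made up; the real input in this thread is about 150GB, so the numbers are illustrative only):

```shell
# Toy stand-in for the real edge list: 8 tab-separated edges.
seq 1 8 | awk '{print $1 "\t" ($1+1)}' > edges.tsv

# Count the lines first...
wc -l edges.tsv

# ...then split into fixed-size line chunks. With 8 lines and -l 2
# this produces 4 files: chunk_aa chunk_ab chunk_ac chunk_ad.
split -l 2 edges.tsv chunk_
ls chunk_*
```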


User 84 | 3/3/2014, 6:40:09 PM

Thanks Danny. Assume I have 8 subfiles: 01.tsv, 02.tsv, ..., 08.tsv. How should I edit the command line to achieve parallel loading? For example, how should I edit the following command? Thanks.

mpirun -np 8 -machinefile $PBS_NODEFILE ~/graphlab/release/toolkits/graphanalytics/pagerank --graph=/home/yxzhao/iosig/filewritetime/hello --format=tsv --graph_opts="ingress=random" --engine=synchronous


User 6 | 3/3/2014, 6:42:46 PM

If you put all the subfiles into the folder /home/yxzhao/iosig/filewritetime/hello then they should be read in parallel.
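For example, a hypothetical small-scale version of that layout (the commented mpirun line is the command from the question; it needs the real cluster and paths to run):

```shell
# Put the chunks into one directory and point --graph at it;
# GraphLab then assigns the files to its loaders in parallel.
mkdir -p hello
seq 1 80 | awk '{print $1 "\t" ($1+1)}' > input.tsv
split -l 10 input.tsv hello/part-   # 8 chunks: hello/part-aa ... part-ah
ls hello | wc -l                    # 8

# mpirun -np 8 -machinefile $PBS_NODEFILE \
#   ~/graphlab/release/toolkits/graphanalytics/pagerank \
#   --graph=/home/yxzhao/iosig/filewritetime/hello --format=tsv \
#   --graph_opts="ingress=random" --engine=synchronous
```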


User 11 | 3/17/2014, 10:14:07 PM

Hi Danny,

I have a follow-up question. I understand that the master MPI process should have access to all files. I wonder what your recommendation is to achieve the best GraphLab performance in an AWS setting. Should we partition the dataset, keep it in a directory on the master node, and then sync this directory to all slaves? Or is it satisfactory to have all files on the master and sync ONE file to each slave (so slaves have access to disjoint parts of the dataset but the master has the entire dataset)? Another option could be setting up an NFS directory, but I'm not confident about the network overhead of using it during the loading process.

Thank you, -Khaled


User 20 | 3/18/2014, 5:20:18 PM

Sharing of files across independent machines is indeed a problem. The best solution I think is to set up HDFS across the machines and use that.


User 568 | 8/3/2014, 1:48:35 AM

Is there a way for a single multicore machine to use parallel loading?


User 33 | 8/3/2014, 3:49:01 AM

In fact, GraphLab already uses multiple threads to load input files in parallel. You just need to split the file into N×M disjoint pieces, where N is the number of machines and M is the number of cores.
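If GNU coreutils is available, split can also produce an exact number of line-aligned chunks directly, which saves computing the lines-per-chunk count by hand. A toy sketch with K = 32 pieces (e.g. N = 2 machines × M = 16 cores; the input file is made up):

```shell
seq 1 1000 > edges.tsv        # toy stand-in for the real input

# -n l/32 = split into exactly 32 chunks without breaking any line.
split -n l/32 edges.tsv part_
ls part_* | wc -l             # 32
```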


User 568 | 8/3/2014, 4:00:59 PM

I don't think it works properly.

./my_app --graph /dataset/orkut.txt takes 200s

Then I split it into about 10 different files under /dataset/orkut-split

./my_app --graph /dataset/orkut-split still takes 200s

mpiexec -n 4 ./my_app --graph /dataset/orkut-split takes about 40s

It seems that GraphLab doesn't use multiple threads when loading files in a single process.


User 33 | 8/4/2014, 1:37:52 AM

Where did you get 200s and 40s? Please provide a complete log. Parallel loading only reduces the ingress time (loading and finalizing); it has no impact on the runtime.


User 568 | 8/4/2014, 5:50:07 PM

Yeah, I know. I used graphlab::timer to measure the loading and finalizing times separately.

The pseudo-code is like:

<pre class="CodeBlock"><code>timer.start();
graph.load(dc);
printf("Load time: %f\n", timer.current_time());

timer_finalize.start();
graph.finalize();
printf("Finalize time: %f\n", timer_finalize.current_time());
</code></pre>

I use engine.elapsed_seconds() to measure the compute time, so they are separate.


User 33 | 8/5/2014, 5:00:25 AM

I believe that parallel loading works even on a single machine. So the problem may come from the command line, the configuration, the version of the code, or anything else. The log would be helpful.

[single file] <pre class="CodeBlock"><code>mpiexec -f ~/mpd.hosts -n 1 ./pagerank --graph=/data/sdb1/yanzhe/realworld-data/web-google-single --format=snap --engine sync --engine_opts max_iterations=5
...
INFO: dc.cpp(init:573): Cluster of 1 instances created.
Loading graph.
INFO: distributed_graph.hpp(set_ingress_method:3314): Automatically determine ingress method: grid
Loading graph in format: snap
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-google-single
Loading graph. Finished in 3.0045
Finalizing graph.
...
</code></pre>

[multiple files] <pre class="CodeBlock"><code>mpiexec -f ~/mpd.hosts -n 1 ./pagerank --graph=/data/sdb1/yanzhe/realworld-data/web-Google/web-Google --format=snap --engine sync --engine_opts max_iterations=5
...
INFO: dc.cpp(init:573): Cluster of 1 instances created.
Loading graph.
INFO: distributed_graph.hpp(set_ingress_method:3314): Automatically determine ingress method: grid
Loading graph in format: snap
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleaa
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleaw
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleam
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleau
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleao
...
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googleal
INFO: distributed_graph.hpp(load_from_posixfs:2255): Loading graph from file: /data/sdb1/yanzhe/realworld-data/web-Google/web-Googlebt
Loading graph. Finished in 0.628164
Finalizing graph.
...
</code></pre>


User 568 | 8/5/2014, 2:43:38 PM

<pre class="CodeBlock"><code>./m_pagerank --graph /dataset/web-Google.txt
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined. Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc.cpp(init:573): Cluster of 1 instances created.
INFO: distributed_graph.hpp(set_ingress_method:3201): Automatically determine ingress method: grid
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/web-Google.txt
Load time : 7.669127 seconds
</code></pre>


User 568 | 8/5/2014, 2:44:18 PM

<pre class="CodeBlock"><code>./m_pagerank --graph /dataset/google
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined. Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc.cpp(init:573): Cluster of 1 instances created.
INFO: distributed_graph.hpp(set_ingress_method:3201): Automatically determine ingress method: grid
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaa
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleac
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googlead
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaf
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleab
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleae
Load time : 8.761849 seconds
</code></pre>


User 568 | 8/5/2014, 2:47:30 PM

<pre class="CodeBlock"><code>mpiexec -n 4 ./m_pagerank --graph /dataset/google
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined. Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
INFO: dc.cpp(init:573): Cluster of 4 instances created.
WARNING: dc.cpp(init:587): Duplicate IP address: 10.191.169.245
WARNING: dc.cpp(init:592): For maximum performance, GraphLab strongly prefers running just one process per machine.
INFO: distributed_graph.hpp(set_ingress_method:3201): Automatically determine ingress method: grid
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googlead
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaf
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleac
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaa
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleae
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleab
Load time : 2.720055 seconds
Load time : 2.729057 seconds
Load time : 2.729105 seconds
Load time : 2.729088 seconds
</code></pre>


User 568 | 8/5/2014, 2:57:46 PM

GraphLab was downloaded recently from https://github.com/graphlab-code/graphlab


User 33 | 8/5/2014, 3:05:20 PM

https://github.com/graphlab-code/graphlab/blob/master/src/graphlab/graph/distributed_graph.hpp#L2184-L2190

GraphLab depends on OpenMP to support parallel loading. I guess you have not installed OpenMP?


User 6 | 8/5/2014, 3:13:25 PM

Each MPI node loads one file in parallel. On how many machines are you running? If you run on a single machine, you will not benefit from parallel loading.


User 568 | 8/5/2014, 4:09:01 PM

Hi Rongchen,

My gcc version is 4.8.2, so it comes with OpenMP, and I checked the configure log: it added -fopenmp to the compiler flags.


User 568 | 8/5/2014, 4:12:12 PM

Hi Danny,

I am running on a single machine with multicores.

So you mean GraphLab can only do parallel loading across MPI nodes, and there is no parallel loading for a single node with multiple threads?

If that is the case, why would Rongchen get a speedup when using a single MPI node?


User 33 | 8/5/2014, 4:57:16 PM

Hi echoxxx

The code at the link I mentioned should be executed in parallel, and of course it also works on my machine. But currently I have no idea why it does not work on your side...

Rong


User 568 | 8/5/2014, 6:48:05 PM

Hi,

Do you know why Danny said "If you run on a single machine you will not benefit from parallel loading"? :(


User 568 | 8/5/2014, 7:17:52 PM

Actually, when I run

./m_pagerank --graph /dataset/google

I think multiple threads were started, because multiple lines of the log were printed at the same time.

<pre class="CodeBlock"><code>INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaa
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleac
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googlead
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleaf
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleab
INFO: distributed_graph.hpp(load_from_posixfs:2190): Loading graph from file: /dataset/google/googleae
</code></pre>

So multiple threads may start reading together, but for some reason there is no performance improvement compared to a single thread.

Could the disk be the bottleneck? The disk can read at up to 100 MB/s, and the Google dataset is no more than 100 MB, so at full bandwidth it should take about one second to read; yet it takes 8 seconds. So the disk is far from being the bottleneck?


User 33 | 8/6/2014, 12:59:37 AM

I don't think disk is the problem.

I do remember there was a performance issue in parallel loading: https://github.com/graphlab-code/graphlab/pull/103

But it was fixed recently (May 20): https://github.com/graphlab-code/graphlab/commit/12e21dde659f5d34348788c21df29fc4418f6dc8#diff-d41d8cd98f00b204e9800998ecf8427e

Once again, sorry: when did you pull the code from GraphLab? You can check the source code, src/graphlab/graph/ingress/distributed_constrained_random_ingress.hpp.

It should include the following lines: https://github.com/graphlab-code/graphlab/blob/master/src/graphlab/graph/ingress/distributed_constrained_random_ingress.hpp#L84-L88


User 568 | 8/6/2014, 6:19:45 PM

I checked the source code and it contains the lines you mentioned.


User 33 | 8/7/2014, 8:10:21 AM

Can you try the code from https://github.com/realstolz/powerlyra? Please use the same command, configuration, and everything else.


User 33 | 8/7/2014, 4:15:02 PM

//Forget above post

It seems that the GraphLab code supports parallel loading even on a single machine. But the code does not scale well to too many threads (cores). Currently, GraphLab directly uses all cores for loading, and this cannot be configured manually.

I ran a rough experiment.

machine 1: <b class="Bold">24</b> cores (2×12), single file: <b class="Bold">2.27</b>s, multiple (48) files: <b class="Bold">0.62</b>s

machine 2: <b class="Bold">40</b> cores (4×10), single file: <b class="Bold">2.19</b>s, multiple (48) files: <b class="Bold">2.11</b>s

machine 3: <b class="Bold">16</b> cores (2×8), single file: <b class="Bold">2.94</b>s, multiple (48) files: <b class="Bold">0.52</b>s

BTW, how many cores does your machine have?


User 568 | 8/7/2014, 5:46:38 PM

I have 16 cores.


User 568 | 8/7/2014, 5:47:40 PM

16 physical cores and 32 logical cores. I think GraphLab will use the number of physical cores minus 2, right? I remember that 2 cores are reserved for MPI communication.


User 568 | 8/7/2014, 6:01:33 PM

I didn't change the configuration. I just used ./configure or ./configure --no_mpi.


User 33 | 8/7/2014, 6:07:17 PM

I think the parallelism of loading is not controlled by GraphLab, since GraphLab uses OpenMP to parallelize the loading phase, not the pthreads or fibers that are used for computation and are controlled by GraphLab.
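For reference, the pattern in question is just an OpenMP loop over the chunk list. Below is a minimal self-contained sketch (this is not GraphLab's actual code, and the file names are placeholders). Note that if the binary was not built with -fopenmp, the pragma is silently ignored and the loop runs serially, which would produce exactly the flat timings reported above:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Read each input chunk on its own OpenMP thread, mimicking
// per-file parallel ingress. Each iteration writes to a distinct
// slot of the result vector, so no locking is needed.
std::vector<std::string> load_chunks(const std::vector<std::string>& files) {
    std::vector<std::string> contents(files.size());
    #pragma omp parallel for  // ignored (i.e. serial) without -fopenmp
    for (long i = 0; i < static_cast<long>(files.size()); ++i) {
        std::ifstream in(files[i]);
        std::stringstream ss;
        ss << in.rdbuf();       // slurp the whole chunk
        contents[i] = ss.str();
    }
    return contents;
}
```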

Maybe disabling HT would be better for loading.


User 568 | 8/7/2014, 6:23:27 PM

What do you mean by HT? Hyper-Threading?


User 33 | 8/7/2014, 7:15:20 PM

yep, just a guess


User 568 | 8/7/2014, 9:35:42 PM

I turned off HTT. With just a single file, it takes 8 seconds. If I split it into 6 files, then the loading time varies a lot: three runs take 8s, 15s, and 11s respectively. I think there is multithread contention here.


User 33 | 8/8/2014, 1:32:33 AM

I agree that there exists contention.

You have 16 cores, so why split the file into only 6? That means only 6 threads can really take part in loading (the others will idle), so the performance will depend heavily on which cores the working threads happen to be scheduled on.

I guess your server, with 16 cores, is SMP (multiple chips, each with multiple cores: 2×8 or 4×4). Some cores share the last-level cache and others do not. Threads on different cores within the same chip will contend for the last-level cache and the memory controller, while threads on cores of different chips can use more caches and memory controllers.

I had thought you split the file into more than 32 partitions, so that all logical cores took part in loading. That is why I suggested turning off HT, which avoids heavy contention on the L1 cache between the two hardware threads within one core.

You can try 16 or more partitions without HT.


User 568 | 8/8/2014, 2:36:58 AM

I split it into more (32) partitions; the time remains the same.

If there were no contention, the performance should scale, even with only 6 cores in use.

My computer has 2 processors, each with 8 cores and a 20MB shared L3 cache.

I still think it's because of contention.


User 33 | 8/8/2014, 2:53:18 AM

You can use "export OMP_NUM_THREADS=xxx" in bash to control the number of threads used by OpenMP, and try different configurations.
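A sketch of sweeping that variable (the binary name and dataset path are the ones from this thread; the actual run is commented out since it needs the dataset):

```shell
# OpenMP reads OMP_NUM_THREADS at program startup, so the loader's
# thread count can be varied without recompiling or touching GraphLab.
for t in 1 2 4 8 16; do
  export OMP_NUM_THREADS=$t
  echo "threads=$OMP_NUM_THREADS"
  # ./m_pagerank --graph /dataset/google
done
```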

The scalability on my 16-core machine is moderate. It also has 2 processors (AMD Opteron 4280), each of which has 8 cores. Single file: 2.94s, multiple (48) files: 0.52s.

Hence, even though there is some contention in the code, it can still provide a speedup on some machines.


User 568 | 8/8/2014, 3:04:54 AM

I tried it and it doesn't work.

Even if there is contention, there should be some speedup. But it seems there is no improvement at all :( I downloaded my code within the last month, so it should be the newest, right?


User 33 | 8/8/2014, 3:14:21 AM

I think there have been no related commits in the last month.

I also cannot get an improvement (sometimes even a slowdown) on some machines, e.g. my 40-core server. I tried to control the number of threads on this 40-core server, but the results are even stranger: with a fixed 48 partitions, 1, 2, 16, or 24 threads get a small improvement, but 4, 8, 12, 32, or 40 threads are worse.

Currently, I have no idea about it :\


User 568 | 8/8/2014, 3:17:17 AM

Yeah, with multithreading it is really hard to reason about performance and to debug. Anyway, thank you very much :)