Is there any way to speed up loading/partitioning of the graph?

User 231 | 5/9/2014, 5:56:34 AM

Hi there,

These days I am working on a large graph, testing the simple Single Source Shortest Path (SSSP) task from the toolkit package. The problem I've run into is that GraphLab spends far too much time loading (and, I guess, partitioning) the graph. My graph has 172,655,479 vertices and 1,544,271,504 edges, so it's fairly large. I run the SSSP code on a 32-node cluster (each node has 4 cores and 8 GB of memory). The whole job takes about 4000 seconds to finish, but the logs show that the actual engine time is just 19 seconds! It seems the remaining hour-plus is all spent loading and partitioning the graph data.

Is there any way to speed up the loading process? Or is there some other configuration option that I missed? Thanks for your answers!

Comments

User 6 | 5/9/2014, 5:59:36 AM

Hi Javier, did you split your input file into disjoint parts? This could significantly speed up the loading phase. We assume you have access to a shared NFS folder where all the files can be found. You can use the Linux "split -l" command to split the file into equal parts of l lines each.
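
A minimal sketch of that step, assuming the input file is called graph.adj and the parts are written to a shared NFS folder (both names are placeholders, not from this thread):

    # Split the single input file into equal pieces of (for example) 1,000,000
    # lines each; every worker can then read a different piece in parallel.
    # "graph.adj" and "/nfs/shared/" are assumed names for illustration only.
    split -l 1000000 graph.adj /nfs/shared/graph.adj.part-
    ls /nfs/shared/graph.adj.part-*    # ...part-aa, ...part-ab, ...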


User 231 | 5/9/2014, 6:59:48 AM

I didn't split it. When you say "disjoint", do you mean a disjoint graph? Or can I just split the file using the split command?
My data is loaded from HDFS in the "adj" format; would it be faster to use an NFS folder instead?
Any ideas? Thanks!


User 33 | 5/9/2014, 1:20:23 PM

Just split your input file using the 'split' command. The number of sub-files should equal the number of nodes times the cores per node (32 * 4 = 128).
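
For example, with GNU split this can be done in one command (the file name "graph.adj" is an assumption, and the -n l/N option needs a reasonably recent GNU coreutils):

    # Produce exactly 128 parts (32 nodes * 4 cores per node), cutting only
    # at line boundaries so no adjacency line is broken across files.
    split -n l/128 -d -a 3 graph.adj graph.adj.part-
    ls graph.adj.part-* | wc -l    # should print 128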


User 231 | 5/10/2014, 5:11:02 PM

It works, thanks. Can you tell me why a simple split makes the loading faster?


User 6 | 5/10/2014, 7:07:13 PM

Because GraphLab is capable of loading the input files in parallel. If there is only a single input file, only one node can load it. But if there are many files, the GraphLab master node distributes the work across the slaves, which load disjoint parts of the graph.


User 231 | 5/11/2014, 3:05:00 AM

Thanks!