Upper Bound of File/Graph Size

User 84 | 3/5/2014, 6:32:11 PM

Hello, is there any upper bound on input file or graph size in GraphLab for a given hardware environment? Our cluster configuration is as follows: 1 login node and 64 compute nodes. Each compute node has 16 GB of RAM and 2 CPU sockets, each with a quad-core Intel Xeon 2.66 GHz CPU.

I am working on processing a large dataset (10 GB). I split the dataset into 8 and 17 sub-files, respectively, and then assigned 8 and 17 nodes to the job, respectively. But every time the nodes crashed. The attachment is the log; it gets stuck at the end. Thanks.

Comments

User 20 | 3/5/2014, 7:37:19 PM

Hi,

The graph seems to be about 680M edges? It looks like the machines are running out of memory. What task are you running?

It looks like you are close though: the graph almost successfully finalized with about 12GB allocated per machine.
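To get a feel for whether a graph of this size can fit at a given machine count, a rough back-of-the-envelope sketch like the one below can help. Everything in it is an assumption for illustration (the total in-memory footprint, replication overhead, and headroom are placeholders, not figures from GraphLab itself):

```python
import math

def machines_needed(total_graph_gb, ram_per_machine_gb,
                    replication_overhead=1.2, headroom=0.75):
    """Estimate how many machines are needed to hold a distributed graph.

    All inputs are illustrative assumptions:
      total_graph_gb       -- in-memory size of the finalized graph
      replication_overhead -- extra cost of vertices replicated across
                              partitions (it grows with the machine count)
      headroom             -- fraction of RAM kept free for the algorithm
    """
    usable_gb = ram_per_machine_gb * headroom
    return math.ceil(total_graph_gb * replication_overhead / usable_gb)

# Example: if the ~680M-edge graph needs roughly 100 GB in memory overall,
# 16 GB machines with 25% headroom suggest at least this many machines:
print(machines_needed(total_graph_gb=100, ram_per_machine_gb=16))  # -> 10
```

The headroom factor is there because finalizing the graph is not the whole story; the engine running the task needs working memory on top of the graph itself.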

My suggestion is to try more machines (you can have fewer or more splits than machines, that's fine; it will just load slightly slower if loading is imbalanced).

Good machine counts are perfect squares (16, 25, 36, 49, 64) or one of 13, 31, 57, which will use the PDS partitioning heuristic.
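For reference, those counts correspond to two partitioning layouts: perfect squares suit a grid layout, while 13, 31, 57 have the form p^2 + p + 1 for a prime p (3, 5, 7), which is the shape the PDS construction uses. A small helper like the following (my own sketch, not part of GraphLab) can check whether a candidate machine count fits either pattern:

```python
import math

def is_perfect_square(n):
    r = math.isqrt(n)
    return r * r == n

def is_pds_count(n):
    """True if n == p*p + p + 1 for some prime p (the form PDS expects)."""
    # Invert p^2 + p + 1 = n for a positive integer p, then check primality.
    p = (math.isqrt(4 * n - 3) - 1) // 2
    if p < 2 or p * p + p + 1 != n:
        return False
    return all(p % d != 0 for d in range(2, math.isqrt(p) + 1))

for n in (13, 16, 25, 31, 36, 49, 57, 64):
    kind = ("square (grid)" if is_perfect_square(n)
            else "PDS" if is_pds_count(n)
            else "neither")
    print(n, kind)
```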

Don't set a partitioning heuristic on the command line (i.e. do not set --graph_opts="ingress=[something]"). We will try to autodetect and set the best heuristic accordingly.


User 84 | 3/6/2014, 5:53:27 PM

Hi Yucheng, thanks very much for your help. Yes, the graph has about 680M edges, and the task I was running is PageRank. I retested on the same dataset and found that the runs with 14 or more splits are OK, while the runs with 13 or fewer splits all failed. Thanks again.