User 453 | 7/10/2014, 4:10:57 AM

Hi all, I'm executing add_edge for a finalized graph, the code is as follows:

```
bool add_missing_assign_edges(graph_type& graph, const graphlab::vertex_set& token_set)
{
// Get missing edges from graph
edge_list missing_edges = graph.map_reduce_vertices<edge_list>(accumulate_missing_assign_edges, token_set);
uint32_t remaining = missing_edges.edges.size();
logstream(LOG_INFO) << remaining << " edges need to be added\n";
for(std::vector<edge_pair>::iterator it = missing_edges.edges.begin();
it != missing_edges.edges.end(); ++it) {
graph.add_edge((*it).first, (*it).second, edge(E_ASSIGN, 0));
}
graph.finalize();
return true;
}
```

The output of remaining is 1526169, but after finalize, I actually get 9157014 edges, which is 6 times than I expected. And I'm running on a 6 machine cluster, so I'm thinking maybe I should try execute this on only one machine? Any thoughts about how to do that? Or I did it in a wrong way?

Thanks!

User 453 | 7/10/2014, 3:20:28 PM

If you guys need more information please let me know, thanks! Currently I'm trying to execute this only with graph.procid() == 0, but it will stuck at finalizing step.

User 453 | 7/10/2014, 4:13:58 PM

I tried with 2 machines, the edges is roughly 2 times larger than expected, so I assume each machine executes add_edge separately so there will be duplicate edges... But why there will be duplicate edges since Graphlab does not allow it?

User 14 | 7/10/2014, 6:08:18 PM

graphlab does allow duplicate edge, but no duplicate vertices.

User 453 | 7/10/2014, 6:11:21 PM

http://docs.graphlab.org/classgraphlab*1*1distributed__graph.html#a5b52fb7c39c5b9ac0fe8710f48642982

However, each edge direction may only be added exactly once. i.e. if edge 5->6 is added already, no other calls to add edge 5->6 should be made.

So this is just a instruction, not a constraint? Is there any solution for remove duplicate edges?

User 14 | 7/10/2014, 6:34:18 PM

Yes, it is instruction not constraint. The graph doesn't care about multiple edges with the same source and destination, and some algorithm may not care as well. If you do not want to add duplicate edges, We recommend you perform deduplication during the data preprocessing stage.

User 453 | 7/10/2014, 6:35:58 PM

Actually those edges are generated during the execution, so I may not able to deduplication in preprocessing stage. Could I do that on-the-fly?

User 453 | 7/10/2014, 6:41:15 PM

Or in other words, could I execute add_edge on only 1 machine so I can prevent duplicating?

User 14 | 7/10/2014, 6:56:08 PM

In your case, edge*list missing*edges is computed on every machine, and add_edge will add the same edge to the graph N times, where N is the number of machine you have. You probably just want to do it on the master node like this:

if (dc.procid() == 0) {
edge*list missing*edges = graph.map*reduce*vertices ...
graph.add_edges(...)
}
graph.finalize();

I hope this helps.

Thanks, -jay

User 453 | 7/10/2014, 7:23:21 PM

Hi Jay,

Thanks for your help! I tried your solution, but it will stop at enter finalize:

INFO: distributed_graph.hpp(finalize:702): Distributed graph: enter finalize

The modified code is as follows:

411 bool add*missing*assign*edges(graphlab::distributed*control& dc,
412 graph*type& graph, const graphlab::vertex*set& token*set)
413 {
414 //TODO: If this becomes a bottleneck, use dc.numprocs to split edge adding to different
415 // process
416
417 if(dc.procid() == 0) {
418 // Get missing edges from graph
419 edge*list missing

Thanks! Really appreciate it.

User 14 | 7/10/2014, 8:06:59 PM

The code looks good to me. Can you make sure "graph.finalize()" is called on all machines?

You can put a line on 430 like logstream(LOG_INFO) << "machine id: " << dc.procid() << std::endl;

User 453 | 7/10/2014, 11:22:43 PM

Hi Jay,

I move L419 (map*reduce*edges) out of "if" and then it could run now. I think finalize() will lock the process so map*reduce*vertices can not be executed...

Although it could run right now, the number of edges it created is slightly less than expected... Here is the new code:

415 bool add*missing*assign*edges(graphlab::distributed*control& dc,
416 graph*type& graph, const graphlab::vertex*set& token*set)
417 {
420 // I move this out
421 edge*list missing*edges = graph.map*reduce*vertices<edge*list>
422 (accumulate*missing*assign*edges, token*set);
423
424 if(dc.procid() == 0) {
425 // Get missing edges from graph
426
427 uint32*t remaining = missing*edges.edges.size();
428 logstream(LOG*INFO) << remaining << " edges need to be added\n";
429
430 for(std::vector<edge*pair>::iterator it = missing*edges.edges.begin();
431 it != missing*edges.edges.end(); ++it) {
432 graph.add_edge((*it).first, (*it).second, edge(E*ASSIGN, 0));
433 }
434 }
435
436 logstream(LOG*INFO) << "machine id: " << dc.procid() << "\n";
437 graph.finalize();
438 return true;
439 }

INFO: hdtm.cpp(add*missing*assign_edges:428): 1526234 edges need to be added
INFO: hdtm.cpp(main:801): Added 1481428 edges

Could you give me a hint on that? I have tried add dc.full_barrier(); before finalize(), but that does not work either. Maybe one can not create an edge if the source and/or target vertex does not have a copy on that machine?

Thanks!

User 453 | 7/10/2014, 11:53:48 PM

Some more info, I tried print out all failed add*edge, but there is no output. So I assume the problem may occur in ingress*ptr->add*edge(source, target, edata); (L891 in distributed*graph.hpp). I did not specify any special partition algorithm and I think I'm using "grid" as ingress method.

INFO: distributed*graph.hpp(set*ingress_method:3201): Automatically determine ingress method: grid

User 14 | 7/11/2014, 12:00:48 AM

My bad, map*reduce*vertices has to be called on all machine, and that's why it was halting. For the new issue, I suspect that it is because there are self edges -- those with source==target -- in the "missing_edges" and self edges will be ignored during ingress. Can you double check on it?

User 453 | 7/11/2014, 12:56:08 AM

Hi Jay,

Unfortunately they are not... If I execute the same program with mpiexec -n 1, then the number of edges would be correct, and I also outputting all failed add_edge, but there is nothing.

Thanks!

User 14 | 7/11/2014, 2:23:29 AM

How do you output all failed add_edge?

User 453 | 7/11/2014, 2:25:44 AM

Oh sorry I did not explain that part. I'm outputting it by testing the result of add_edge

446 bool res = graph.add_edge((*it).first, (*it).second, edge(E*ASSIGN, 0));
447 if(!res) logstream(LOG*INFO) << "create " << (*it).first << "->" << (*it).second << "failed\n";

Is that the right way to do it?

User 14 | 7/11/2014, 5:11:29 AM

Hmm.. I'm not sure what's going wrong here. A better sanity check for it will be trying to add a fixed set of edges to an empty graph, and see if the program does the right thing. For example, instead of adding edges from missing_edge.edges, try adding edges from [(i,i+1) for i in 0...10000]. Then you can check whether the numbers are correct.

User 453 | 7/14/2014, 5:01:51 PM

Hi Jay,
Yes you are right. There is no problem with edge addition. The actual problem is, I'm using a vertex*set that generated before finalize() to track the changes. since I have changed the graph, the old vertex*set does not apply on this new graph so I always missing some edges. Now I'm re-selecting the vertex_set after finalize() and it works perfectly! Thanks for your help!

Do you mind to share the internal mechanism of vertex*set? When I execute my program on a 1 machine cluster, seems the changes won't affect vertex*set, but if I'm running it on a real cluster, it will change along with graph.

Thanks!

User 14 | 7/15/2014, 3:06:29 AM

vertex*set is distributed. Each node has a local graph and the vertex*set tracks vertices in the local graph. When you have one machine, it is OK because the local graph is everything.