add_edge adds more edges than expected (expected edges * number of machines)?

User 453 | 7/10/2014, 4:10:57 AM

Hi all, I'm executing add_edge for a finalized graph, the code is as follows:

			bool add_missing_assign_edges(graph_type& graph, const graphlab::vertex_set& token_set) {
			   // Get missing edges from graph
			   edge_list missing_edges = graph.map_reduce_vertices<edge_list>(accumulate_missing_assign_edges, token_set);

			   uint32_t remaining = missing_edges.edges.size();
			   logstream(LOG_INFO) << remaining << " edges need to be added\n";

			   for(std::vector<edge_pair>::iterator it = missing_edges.edges.begin();
			       it != missing_edges.edges.end(); ++it) {
			     graph.add_edge((*it).first, (*it).second, edge(E_ASSIGN, 0));
			   }
			   return true;
			}

The output of remaining is 1526169, but after finalize I actually get 9157014 edges, about 6 times what I expected. I'm running on a 6-machine cluster, so I'm thinking maybe I should try executing this on only one machine? Any thoughts on how to do that? Or am I doing it the wrong way?



User 453 | 7/10/2014, 3:20:28 PM

If you guys need more information please let me know, thanks! Currently I'm trying to execute this only when graph.procid() == 0, but it gets stuck at the finalize step.

User 453 | 7/10/2014, 4:13:58 PM

I tried with 2 machines, and the edge count is roughly 2 times larger than expected, so I assume each machine executes add_edge separately, producing duplicate edges... But why are there duplicate edges, since GraphLab does not allow them?

User 14 | 7/10/2014, 6:08:18 PM

GraphLab does allow duplicate edges, but not duplicate vertices.

User 453 | 7/10/2014, 6:11:21 PM

"However, each edge direction may only be added exactly once. i.e. if edge 5->6 is added already, no other calls to add edge 5->6 should be made."

So this is just an instruction, not a constraint? Is there any way to remove duplicate edges?

User 14 | 7/10/2014, 6:34:18 PM

Yes, it is an instruction, not a constraint. The graph doesn't care about multiple edges with the same source and destination, and some algorithms may not care either. If you do not want duplicate edges, we recommend performing deduplication during the data preprocessing stage.
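For example, a minimal deduplication sketch in plain C++, assuming edges are collected as (source, target) id pairs before being fed to add_edge (the edge_pair typedef here is illustrative, not necessarily the one from your code):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<uint64_t, uint64_t> edge_pair;

// Remove duplicate (source, target) pairs: sort so equal pairs become
// adjacent, then std::unique compacts them and erase drops the tail.
std::vector<edge_pair> dedup_edges(std::vector<edge_pair> edges) {
  std::sort(edges.begin(), edges.end());
  edges.erase(std::unique(edges.begin(), edges.end()), edges.end());
  return edges;
}
```

You would run this over the raw edge file (or the collected edge list) once, before any add_edge calls.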

User 453 | 7/10/2014, 6:35:58 PM

Actually those edges are generated during execution, so I may not be able to deduplicate in the preprocessing stage. Could I do that on the fly?

User 453 | 7/10/2014, 6:41:15 PM

Or in other words, could I execute add_edge on only 1 machine so I can prevent the duplication?

User 14 | 7/10/2014, 6:56:08 PM

In your case, edge_list missing_edges is computed on every machine, and add_edge will add the same edge to the graph N times, where N is the number of machines you have. You probably just want to do it on the master node, like this:

			if (dc.procid() == 0) {
			  edge_list missing_edges = graph.map_reduce_vertices ...
			  graph.add_edges(...)
			}
			graph.finalize();

I hope this helps.

Thanks, -jay

User 453 | 7/10/2014, 7:23:21 PM

Hi Jay,

Thanks for your help! I tried your solution, but it stops when entering finalize:

INFO: distributed_graph.hpp(finalize:702): Distributed graph: enter finalize

The modified code is as follows:

			411 bool add_missing_assign_edges(graphlab::distributed_control& dc,
			412                               graph_type& graph, const graphlab::vertex_set& token_set)
			413 {
			414   //TODO: If this becomes a bottleneck, use dc.numprocs to split edge adding to different
			415   // process
			416
			417   if(dc.procid() == 0) {
			418     // Get missing edges from graph
			419     edge_list missing_edges = graph.map_reduce_vertices<edge_list>
			420         (accumulate_missing_assign_edges, token_set);
			421
			422     uint32_t remaining = missing_edges.edges.size();
			423     logstream(LOG_INFO) << remaining << " edges need to be added\n";
			424
			425     for(std::vector<edge_pair>::iterator it = missing_edges.edges.begin();
			426         it != missing_edges.edges.end(); ++it) {
			427       graph.add_edge((*it).first, (*it).second, edge(E_ASSIGN, 0));
			428     }
			429   }
			430
			431   graph.finalize();
			432   return true;
			433 }

Thanks! Really appreciate it.

User 14 | 7/10/2014, 8:06:59 PM

The code looks good to me. Can you make sure "graph.finalize()" is called on all machines?

You can put a line at 430 like: logstream(LOG_INFO) << "machine id: " << dc.procid() << std::endl;

User 453 | 7/10/2014, 11:22:43 PM

Hi Jay,

I moved L419 (the map_reduce_vertices call) out of the "if" and now it runs. I think finalize() blocks the process, so map_reduce_vertices cannot be executed...

Although it runs now, the number of edges created is slightly less than expected... Here is the new code:

			415 bool add_missing_assign_edges(graphlab::distributed_control& dc,
			416                               graph_type& graph, const graphlab::vertex_set& token_set)
			417 {
			420   // I move this out
			421   edge_list missing_edges = graph.map_reduce_vertices<edge_list>
			422       (accumulate_missing_assign_edges, token_set);
			423
			424   if(dc.procid() == 0) {
			425     // Get missing edges from graph
			426
			427     uint32_t remaining = missing_edges.edges.size();
			428     logstream(LOG_INFO) << remaining << " edges need to be added\n";
			429
			430     for(std::vector<edge_pair>::iterator it = missing_edges.edges.begin();
			431         it != missing_edges.edges.end(); ++it) {
			432       graph.add_edge((*it).first, (*it).second, edge(E_ASSIGN, 0));
			433     }
			434   }
			435
			436   logstream(LOG_INFO) << "machine id: " << dc.procid() << "\n";
			437   graph.finalize();
			438   return true;
			439 }

INFO: hdtm.cpp(add_missing_assign_edges:428): 1526234 edges need to be added
INFO: hdtm.cpp(main:801): Added 1481428 edges

Could you give me a hint on that? I have tried adding dc.full_barrier(); before finalize(), but that does not work either. Maybe one cannot create an edge if the source and/or target vertex does not have a copy on that machine?


User 453 | 7/10/2014, 11:53:48 PM

Some more info: I tried printing out all failed add_edge calls, but there is no output. So I assume the problem may occur in ingress_ptr->add_edge(source, target, edata); (L891 in distributed_graph.hpp). I did not specify any special partition algorithm, and I think I'm using "grid" as the ingress method.

INFO: distributed_graph.hpp(set_ingress_method:3201): Automatically determine ingress method: grid

User 14 | 7/11/2014, 12:00:48 AM

My bad, map_reduce_vertices has to be called on all machines, and that's why it was hanging. For the new issue, I suspect it is because there are self edges -- those with source == target -- in missing_edges, and self edges will be ignored during ingress. Can you double-check that?
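A quick way to check, sketched in plain C++ over (source, target) pairs (the edge_pair typedef here is illustrative, standing in for whatever pair type holds your edge list):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<uint64_t, uint64_t> edge_pair;

// Count self edges (source == target), the ones ingress would drop.
size_t count_self_edges(const std::vector<edge_pair>& edges) {
  size_t n = 0;
  for (size_t i = 0; i < edges.size(); ++i)
    if (edges[i].first == edges[i].second) ++n;
  return n;
}
```

If the count matches the gap between "edges need to be added" and the final edge count, self edges are the culprit.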

User 453 | 7/11/2014, 12:56:08 AM

Hi Jay,

Unfortunately they are not... If I execute the same program with mpiexec -n 1, the number of edges is correct, and I also output all failed add_edge calls, but there is nothing.


User 14 | 7/11/2014, 2:23:29 AM

How do you output all failed add_edge?

User 453 | 7/11/2014, 2:25:44 AM

Oh sorry, I did not explain that part. I'm outputting it by testing the result of add_edge:

			bool res = graph.add_edge((*it).first, (*it).second, edge(E_ASSIGN, 0));
			if(!res)
			  logstream(LOG_INFO) << "create " << (*it).first << "->" << (*it).second << " failed\n";

Is that the right way to do it?

User 14 | 7/11/2014, 5:11:29 AM

Hmm... I'm not sure what's going wrong here. A better sanity check would be to add a fixed set of edges to an empty graph and see if the program does the right thing. For example, instead of adding edges from missing_edges.edges, try adding the edges [(i, i+1) for i in 0...10000]. Then you can check whether the numbers are correct.
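The synthetic edge set could be built with a small helper like this (a sketch in plain C++; you would feed the resulting pairs to graph.add_edge and compare the count after finalize):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<uint64_t, uint64_t> edge_pair;

// Build the fixed chain (0,1), (1,2), ..., (n-1, n): exactly n edges,
// no duplicates and no self edges, so the post-finalize count is known.
std::vector<edge_pair> make_chain_edges(uint64_t n) {
  std::vector<edge_pair> edges;
  edges.reserve(n);
  for (uint64_t i = 0; i < n; ++i)
    edges.push_back(std::make_pair(i, i + 1));
  return edges;
}
```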

User 453 | 7/14/2014, 5:01:51 PM

Hi Jay, yes, you are right. There is no problem with the edge addition. The actual problem is that I was using a vertex_set generated before finalize() to track the changes. Since I have changed the graph, the old vertex_set does not apply to the new graph, so I was always missing some edges. Now I re-select the vertex_set after finalize() and it works perfectly! Thanks for your help!

Do you mind sharing the internal mechanism of vertex_set? When I execute my program on a 1-machine cluster, it seems the changes don't affect the vertex_set, but if I run it on a real cluster, it changes along with the graph.


User 14 | 7/15/2014, 3:06:29 AM

vertex_set is distributed. Each node has a local graph, and the vertex_set tracks vertices in the local graph. When you have one machine, it is OK because the local graph is everything.