Easiest method to extract connected component subgraphs

User 942 | 11/23/2014, 5:50:07 PM

Hello,

I am trying to extract all the subgraphs from connected components. What is the optimal way to do it ?

I have tried several approach such as grouping vertices by component ids and then trying to apply a closure to extract the edges of the subgraph, creating the subgraph and dumping it to disk but I have a Pickling error: cannot pickle object of class graph.SGraph (even though I use json)

I also tried the getneighborhood <code class="CodeInline">g.getneighborhood(ids=map(int, x['ids'].tolist()), radius=1, full_subgraph=True)</code>

But I have the same issue. I think graph lab does not like to apply a closure which does a <b class="Bold">filter_by operation on a global graph</b> inside and <b class="Bold">apply</b> function but I am not sure.

I now have switched to a triple-apply function to label the edge with their respective componentid. In my graph I now have nodes and edges labeled with their componentid. Is it possible to extract subgraphs in parallel ?

Thank you

Comments

User 949 | 11/23/2014, 8:57:16 PM

I wrote an implementation of the label propagation algorithm: http://pastebin.com/fThJHxs4

This version assigns each nodes with the most common label in its neighborhood. See the full algorithm here: http://arxiv.org/abs/0709.2938

You can modify it to assign the smallest label in the neighborhood to each node, thus propagating the minimal node in each component. I will post an updated version of the code suitable for your needs tomorrow.


User 14 | 11/23/2014, 9:35:37 PM

Hi @kikohs,

If you have component id on each edge, a "sequential" implementation of all subgraph extraction is as simple as

for cid in range(num_components): sub_v = g.get_vertices([], {'cid', cid}) sub_e = g.get_edges([], [], {'cid': cid}) sub_g = SGraph(sub_v, sub_e) sub_g.save(...) The getvertices/getedges inside loop is parallelized. If you want to achieve higher degree of parallelism, you may want to checkout jobs/tasks http://graphlab.com/products/create/docs/graphlab.deploy.html#tasks which allow you to define tasks running distributed.

Jay


User 942 | 11/24/2014, 3:27:58 PM

Thanks for your insights.

@guy4261‌ How does your algorithm compares to the builtin GraphLab Create weakly connected components ? In my case, the nodes are already labelled, I merely wanted to extract all the subgraphs in parallel. However, I keep your code and your article because I may need to do label propagation in a near future :)

@JayGu‌ I thought it would be easier to parallelize the subgraph extraction. I will use the "sequential" version for now. Thanks


User 949 | 11/26/2014, 12:23:07 AM

@kikohs My algorithm compares to not reading closely enough :sweat_smile:

  1. I didn't notice that graphlab have connected components extraction.
  2. I didn't notice (or rather misunderstood) that you don't need connected components, but instead all subgraphs. I'm still not sure what do you mean by this - a powerset of the nodes in each connected component and their edges? A subset of the edges (even more work)? If you can elaborate, that sounds like a very nice exercise. Good luck with that anyway!

User 942 | 11/26/2014, 9:37:31 AM

@JayGu‌ The "sequential" extraction takes more than 5 hours (I stopped after that) while finding and labelling the nodes and edges takes 60 seconds. I have 10k components but it really takes forever. Do you have any tips ?

@guy4261 I want to extract each connected component as a subgraph of the original graph. If my graph G has 3 connected components, it means there are 3 disconnected subgraphs which can be extracted. Since it is way too long to extract them, I will try to work on each component using the groupby('component_id') and applying functions to them.


User 6 | 11/26/2014, 5:00:06 PM

I suggest the following

<pre class="CodeBlock"><code>g = gl.SGraph() g = g.addedges(data, srcfield='guid', dstfield='domain') g = g.addedges(data, srcfield='domain', dstfield='guid') cc = gl.connected_components.create(g) data = data.join(cc['componentid'], on={'domain':'__id'}) # add a new column called componentid to the data table g2 = gl.SGraph() g2 = g2.add_edges(data[data['componentid'] == 0], srcfield='guid', dst_field='domain') # subgraph with comp id 0 and so on</code></pre>


User 942 | 11/27/2014, 3:32:32 PM

@DannyBickson‌ here is my code <pre class="CodeBlock"><code> def extractcomponents(g, cc, minsize=2): comps = [] nodes = g.vertices edges = g.edges cc = cc[cc['Count'] > min_size] for cid in cc['component_id']: n = nodes[nodes['component_id'] == cid] e = edges[edges['component_id'] == cid] c = gl.SGraph(n, e) comps.append(c) return comps </code></pre>

It still takes hours. Is your code different ? Is there some underlying mechanism that could improve the process ?


User 6 | 11/27/2014, 4:23:51 PM

I suspect the append is a performance bottleneck. Why do you need separate graphs? you can create a single graph with disjoint components inside of it. The graph does not have to be connected. You can still compute shortest path from nodes which are in one of the disjoint graphs etc.


User 942 | 11/27/2014, 7:32:45 PM

I need to compute a lot of stats per component. I will try to manage them as a whole.


User 18 | 11/29/2014, 2:21:59 AM

@kikohs‌ what are the dimensions of your graph? i.e., how many vertices, edges, and how many vertex and edge properties? Also, how many different connected components are there?

The code looks fine. Nothing jumped out at me. So my first guess would be that if the vertex and edge tables have a lot of extra columns denoting vertex and edge properties, then constructing the new SFrames could take a while. Another possibility is that there are many many small connected components, and you are spending a lot time pulling out two or three vertices/edges at a time.

BTW, SGraph.vertices and SGraph.edges are special, mutable properties of the graph. They can be the left-hand side of an assignment. When using them as RHS operands, I prefer SGraph.getvertices() and SGraph.getedges(). That way I won't accidentally modify the original graph.


User 942 | 12/2/2014, 9:35:32 AM

@alicez The graph size varies between 200k-700k nodes and 100k - 200k edges. with only one or two properties.

However, they are a lot of connected components: 10k - 50k. I think it is the issue here (most of them are size 2 to 4).

I partially solved the problem by manipulating all components at once.