User 52 | 1/8/2015, 9:03:30 PM

I am actually working on a problem for my research work. Please suggest the most optimal way to do it.

I have a Bipartite graph which is well connected .In this graph we have two types of vertices namely major and minor. There is no edge between major vertices and there is no edge between minor vertices. What i want to do is the following:

- I want to find the number of paths of length 2 between every pair of Major vertices.
- Degree of all the major vertices.

Remember i have a large number of vertices and edges.

Thanks, Kapil

User 954 | 1/8/2015, 9:51:12 PM

Hi Kapil,

You can use <a href="http://graphlab.com/products/create/docs/generated/graphlab.SGraph.triple*apply.html#graphlab.SGraph.triple*apply">triple*apply()</a> operator to compute degree of vertices. It is very fast. In our <a href="http://dato.com/learn/userguide/index.html#Working*with*data">user-guide</a> , we describe how to use triple*apply() operator to compute degree information.

In order to find the number of paths of length 2, you can define another triple_apply() operator on the result of the previous one as follow:

<pre class='CodeBlock'><code>
def number*path*len_2(src, edge, dst):
edge['path_count'] += ( src['degree'] - 1 )
edge['path*count'] += ( dst['degree'] - 1 )
return (src, edge, dst)
g = g.triple*apply(number*path*len*2, mutated*fields=['path_count'])
</code></pre>

the 'path*count' field on each edge E is the number paths with length 2 that includes edge E in their path.
If you sum up all 'path*count' values, then you have two times of the total number of paths with length 2. ( so you need to divide that by two :) )

User 52 | 1/12/2015, 11:24:12 AM

hey soroush,

yes i was able to compute degrees using triple apply as u suggested

But ,as i have a bipartite graph and i want to find number of paths of length 2 between the vertices which doesnt have an edge between them. As there is no edge between them so no path count .

User 1189 | 1/12/2015, 7:39:41 PM

Hi,

This operation is actually slightly tricky. I think I can figure out a way to do with triple apply. Essentially you first accumulate at each minor vertex, the list of major vertices connected to it. Then you do one more pass across each edge which will allow you to accumulate at each major vertex, the length 2 path counts. However, an issue is that the output could be quite large. (in the worst cast #vertices ^ 2).

An alternative possibility will be to do so with a series of join and groupby operations.

User 52 | 1/14/2015, 9:38:44 AM

hey Thanks for the reply,

Do you know how can i apply the operation (triple apply) to selective set of vertices like only to minors.

Can u explain a bit how u are saying we can use groupby and join?

Thanks

Kapil

User 1189 | 1/14/2015, 7:01:28 PM

<b class="Bold">Using triple apply</b> With triple apply it will something like:

Assume each vertex has the attributes:
- id (int)
- is*major (int)
- neighbors (list). initialized to empty list.
- second*degree_neighbors (dict) initialized to empty dict.

Then a triple apply could go for instance: def get_neighbors(src, edge, dst): if src['is_major'] == 0: src['neighbors'].append(dst['id']) if dst['is_major'] == 0: dst['neighbors'].append(src['id'])

Alternatively, if all your edges only lead from major to minor, then you don't need the is_major field and you can take advantage of the triple apply arguments and just go:

def get_neighbors(src, edge, dst): dst['neighbors'].append(src['id'])

To collect it back to the majors will be then just operating on src:
def collect*second*deg_neighbors(src, edge, dst):
for i in dst['neighbors']:
src['second_degree_neighbors'][i] = src['second_degree_neighbors'][i] + 1

These are also easy to implement in C++ using the SDK which will give you really good performance.

Do note that this is ultimately going to be quite memory intensive depending on the size of your graph. Since in the worst case the neighbor list, or the second*degree*neighbor dictionary can be of size #V.

<b class="Bold">Using Tables</b> Using join and groupby's is essentially taking the observation that given an edge list, the 2nd degree neighborhood is essentially the computation of a self-join on the edge list. This may or may not be faster than the above depending on the size of your graph.