How does nearest neighbor in GraphLab Create handle 5 million nodes ?

User 674 | 6/9/2015, 6:07:43 PM

Dear all, It is very amazing to me that GraphLab Create can handle 5 million very very easily.

I came up with a version of k nearest neighbor in PowerGraph, [SOURCE].

The high level idea: Load S (the given set of points) and R ( query set ) from disk into graph. Connect each node r in R to every node in S.

Invoke the setDistance engine on all R points. Gather: for each edge, gather dist and vid to current vertex. Apply: do a quicksort to both dist and vid at the same time. Scatter: empty.

So at the end of the day, each node in R will have the K nearest neighbors and respective distances.

However, there are two problems with this code: 1. This code cannot use a C++ pair to group dist and vid together. If I use pair, compile error will show up. 2. This code cannot handle a larger number of points. The largest number of points I can handle at the moment is 2000. If I try with more points, segmentation fault will show up.

What I need help with I want figure out if I am loading points into PowerGraph correctly or not. I would truly appreciate some pointer on how to add C++ pair support to PowerGraph. The most desirable, I would love to understand what I might have done wrong to handle only 2000 points instead of millions of points.

Any input/pointer is truly appreciated, please do not be shy.

Thank you very much, Qiyuan Qiu

Comments

User 19 | 6/9/2015, 10:50:54 PM

Hi Qiyuan,

Thanks for getting in touch. The open source PowerGraph project is deprecated, but it's straightforward to develop scalable C++ applications using our scalable graph datastructure, the SGraph. Please check out our SDK here: https://github.com/dato-code/GraphLab-Create-SDK. In this case you will want to write your k-nearest neighbor in terms of a tripleApply function. See here for an example: https://github.com/dato-code/GraphLab-Create-SDK/blob/master/sdkexample/sgraphweighted_pagerank.cpp#L88

Happy to help, Chris


User 674 | 6/10/2015, 1:48:32 AM

Hi Chris,

Thank you very much for the pointers.

I choose PowerGraph because it is the only distributed version. (Graphlab Create is for single machine AFAIK). Is there currently any distributed GraphLab Create we have access to ?

Thanks, Qiyuan Qiu


User 19 | 6/10/2015, 5:21:09 AM

Hi Qiyuan,

As of now we don't, but please stay tuned! We are always aiming to build features for our customers. Can you tell us a bit more about your particular use case? Is nearest neighbors your main algorithm? What are the applications of interest? Getting feedback helps us prioritize how soon we release certain features.

Thanks, Chris


User 674 | 6/10/2015, 2:49:22 PM

Hi Chris,

At the moment we are happy with GraphLab Create. We want to understand the performance characteristic in a distributed context. In other words, we want to experiment and see what the pros and cons are for both of these setting (single machine / cluster).

Thanks, Qiyuan


User 19 | 6/10/2015, 4:23:52 PM

Hi Qiyuan,

Glad to hear you're enjoying GraphLab Create!

For your experiments, is there a particular type of algorithm of interest? Or just data processing in general?

Thanks Chris