Why is the implementation of GraphLab k-means++ different from the original algorithm ?

User 1041 | 12/9/2014, 3:30:54 PM

I'm doing research with the powerful GraphLab k-means. I notice the <a href="https://github.com/graphlab-code/graphlab/blob/master/toolkits/clustering/kmeans.cpp">k-means++ algorithm implemented in GraphLab</a> is different from the original algorithm in <a href="http://rangevoting.org/kMeansPlusPlus.pdf">Arthur's paper</a>.

The original algorithm(from Arthur's paper) is outlined as following: <img src="http://cdn.vanillaforums.com/graphlab.vanillaforums.com/editor/re/97hwmgyunuet.png" alt="" />

In this algorithm, before sampling a new center ci, the cumulative probability is calculated for each point. The implementation of Arthur's code follow this algorithm. However, the GraphLab k-means++ does not follow this algorithm. As shown in the following code (from k-means++ algorithm implemented in GraphLab), GraphLab k-means++ updates cumulative probability when it iterates through the dataset. For each point, line 452 does a Bernoulli experiment with probability myp. But, myp becomes larger and larger for points being iterated later. My question is that: is this a bug ? Because apparently, GraphLab k-means++ is not following original k-means++ algorithm. Or if it is not a bug, how does this implementation approximate the original k-means++ algorithm ? <img src="http://cdn.vanillaforums.com/graphlab.vanillaforums.com/editor/tc/jowlfqmwk0m0.png" alt="" />

Thanks for your attention ! I'm looking forward to your reply.


User 6 | 12/12/2014, 5:37:10 AM

Hi, We just double checked and our implementation of k-means++ in GraphLab Create is identical to Arthur's algorithm. We recommend using Graphlab Create.


User 1041 | 12/12/2014, 6:19:45 AM


why do you think it's identical to Arthur's algorithm ?
Firstly, GraphLab Create's k-means++ implementation is apparently different from the algorithm description.

Secondly, in Arthur's implementation of k-means++, there is cumulative probability calculation and searching process (line 307 to 313) while there is nothing similar in GraphLab Create's k-means++ implementation.

/// Arthur's implemention. 301 // Repeatedly choose more centers 302 for (int newcluster = 1; newcluster < k; newcluster++) { 303 Scalar temp = 0; 304 for (i = 0; i < n; i++) 305 temp += distsq[i]; 306 while (1) { 307 Scalar cutoff = (rand() / Scalar(RANDMAX)) * totalcost; 308 Scalar curcost = 0; 309 for (i = 0; i < n; i++) { 310 curcost += distsq[i]; 311 if (curcost >= cutoff) 312 break; 313 } 314 if (i < n) 315 break; 316 } 317 memcpy(centers + newcluster*d, points + pointindices[i]*d, d*sizeof(Scalar)); 318 totalcost = SeedKmppUpdateAssignment(topnode, newcluster, centers, dist_sq); 319 }

Thirdly, I have implement k-means++ in GraphLab Create and it's very different from the current implementation but similar to Arthur's algorithm description and implementation. It is like:

981 for(KMEANSINITIALIZATION=1; KMEANSINITIALIZATION < NUMCLUSTERS; KMEANSINITIALIZATION++) 982 { 983 //Get a random number y uniformly between 0 and sumD. 984 //sumD is equal to cumuDistances[numofvertices-1]. 985 double randomY = graphlab::random::rand01(); 986 randomY *= cumuDistances[numofvertices-1]; 987 988 //Find i s.t. D(x1)^2 + D(x2)^2 + ... D(xi)^2 >= y > D(x1)^2 + D(x2)^2 + ... + D(x(i-1))^2. 989 int i; 990 for(i=0; i<numofvertices; i++) 991 { 992 if(cumuDistances[i] >= randomY) 993 { 994 break; 995 } 996 } 997 998 //Add x_i to CLUSTERS. 999 CLUSTERS[KMEANSINITIALIZATION].center = graph.vertex(i).data().point; 1000 1001 //Calculate D(xi) for each xi. 1002 //Calculate the cumulative distance sumD = (D(x1)^2 + D(x2)^2 + ... + D(xn)^2). 1003 graph.computeCumulativeDistance(kmeansppinitialization, cumuDistances);

Please give me evidence and convince me ! Thanks.

User 6 | 12/12/2014, 6:37:08 AM

HI, If I am not wrong, you are referring to PowerGraph, which is our older open source. We have recently switched to GraphLab Create, which the major parts will be released as open source as well. Here is the k-means documentation: http://graphlab.com/products/create/docs/generated/graphlab.kmeans.create.html#graphlab.kmeans.create

Since PowerGraph code is going to be deprecated soon, we recommend switching to GraphLab Create.

User 1041 | 12/12/2014, 2:52:12 PM

Yes. It is PowerGraph. Thanks for letting me know GraphLab Create. Because I will implement algorithms in GraphLab, if I switch to GraphLab Create, do I have the source code for debugging ? If not now, when will the code be open ?


User 6 | 12/13/2014, 3:46:34 PM

We are working on releasing the majority of GraphLab Create as open source. It will take a few more months - stay tune we will announce on this forum.

By the way, if you need to debug our implementation, it is possible to set the cluster centers by a fixed value using the command line initial_centers. See <a href="http://graphlab.com/products/create/docs/generated/graphlab.kmeans.create.html#graphlab.kmeans.create">here</a>.

User 1041 | 12/17/2014, 3:19:32 PM

Thanks a lot, Danny. I'm going to implement k-means|| in GraphLab PowerGraph.