User 1041 | 12/9/2014, 3:30:54 PM
I'm doing research with the powerful GraphLab k-means. I notice the <a href="https://github.com/graphlab-code/graphlab/blob/master/toolkits/clustering/kmeans.cpp">k-means++ algorithm implemented in GraphLab</a> is different from the original algorithm in <a href="http://rangevoting.org/kMeansPlusPlus.pdf">Arthur's paper</a>.
The original algorithm(from Arthur's paper) is outlined as following: <img src="http://cdn.vanillaforums.com/graphlab.vanillaforums.com/editor/re/97hwmgyunuet.png" alt="" />
In this algorithm, before sampling a new center ci, the cumulative probability is calculated for each point. The implementation of Arthur's code follow this algorithm. However, the GraphLab k-means++ does not follow this algorithm. As shown in the following code (from k-means++ algorithm implemented in GraphLab), GraphLab k-means++ updates cumulative probability when it iterates through the dataset. For each point, line 452 does a Bernoulli experiment with probability myp. But, myp becomes larger and larger for points being iterated later. My question is that: is this a bug ? Because apparently, GraphLab k-means++ is not following original k-means++ algorithm. Or if it is not a bug, how does this implementation approximate the original k-means++ algorithm ? <img src="http://cdn.vanillaforums.com/graphlab.vanillaforums.com/editor/tc/jowlfqmwk0m0.png" alt="" />
Thanks for your attention ! I'm looking forward to your reply.