Failing to obtain good k-means performance

User 4000 | 3/25/2016, 8:45:44 AM

Hello,

I am running the k-means routine from the Python bindings on a file with 66 million samples and 8 dimensions, on a 4 NUMA-node, 48-core, 1 TB RAM Linux box. I find that a very large portion of the time is spent in system time rather than computation. I checked the runtime configuration, and every item that affects performance is set to a very large value (several hundred GB, for example, compared to the 6.2 GB file size on disk). It is taking 20+ minutes to compute 10 clusters when I would expect ~5 minutes. Is there a list of other configurations I should check to get better performance?

Thank you!

Comments

User 2156 | 3/25/2016, 8:44:29 PM

Hi Disa,

We will discuss this issue and take action on it. Thank you for your feedback.


User 12 | 3/25/2016, 8:58:47 PM

Hi Disa, We're always looking for ways to improve model speed, so I'm happy to work with you on this. A couple questions to get started:

  • How many clusters are you using?

  • What method does the model say it's using for the training?

  • What is the number of unpacked features? You mentioned that you have 8 dimensions, so I'm guessing the answer is 8, but I just want to double check. Are any of those 8 features more complex types like lists?

  • Are all of the features floats?

  • Try separating the selection of the initial cluster centers from the iterative assignment of points to clusters, and time each piece. To get the initial cluster centers, set the max_iterations parameter to 0. Then, for the second part, take the centers from that model and pass them to the initial_centers parameter with the usual number of iterations (10 or 20, etc.). How long does each of these steps take? (See the sketch after this list.)
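
Here is a rough sketch of that two-step timing, assuming the graphlab.kmeans.create signature with num_clusters, max_iterations, and initial_centers, and that the model's cluster_info SFrame holds the centers. The file path and the cluster_info column selection are illustrative and may need tweaking for your data or GLC version:

    import time
    import graphlab as gl

    sf = gl.SFrame('data.csv')   # placeholder -- point this at your 66M-row dataset
    k = 10

    # Step 1: pick the initial cluster centers only (no Lloyd iterations).
    t0 = time.time()
    init_model = gl.kmeans.create(sf, num_clusters=k, max_iterations=0)
    print('initialization: %.1f s' % (time.time() - t0))

    # Step 2: run the iterations, seeded with the centers from step 1.
    # cluster_info also carries bookkeeping columns (cluster_id, size, ...),
    # so keep only the original feature columns before passing it back.
    centers = init_model.cluster_info[sf.column_names()]
    t0 = time.time()
    model = gl.kmeans.create(sf, num_clusters=k, initial_centers=centers,
                             max_iterations=10)
    print('iterations: %.1f s' % (time.time() - t0))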

Thanks, Brian


User 1207 | 3/26/2016, 12:43:38 AM

Hey Disa,

One thing to add -- if you are running on a NUMA architecture, you can usually greatly improve the performance of GLC by controlling it with the numactl utility. Graphlab, and k-means included, assumes fast, relatively uniform memory access, but it can scale very well on limited resources. Thus I would recommend trying to run python (and then GLC) using numactl to see if this helps -- try numactl --localaccess -H <N-N> ipython, where <N-N> is a set of cpu cores corresponding to a single physical CPU. You can find out which core numbers are on a single CPU using numactl --hardware.
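
For example (the core range is illustrative -- use whatever numactl --hardware reports for one node on your machine):

    # Show which CPU cores and memory belong to each NUMA node
    numactl --hardware

    # Suppose node 0 owns cores 0-11: bind ipython to those cores and
    # allocate memory on the local node only
    numactl --localalloc --physcpubind=0-11 ipython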

That's another thing to try :-).

-- Hoyt