Wednesday, July 13, 2011

Kmeans: an easy clustering algorithm... maybe too easy for real world problems

I was writing the first post about the document auto classification, when a friend contacted me for an opinion related to clustering algorithm.
I had a quick look at the dataset (4 components, I plotted the four combinations of 3D subsets, and I saw that the shapes were pretty much like an ellipsoid).
Let me report the call:
me:"what kind of clustering algorithm did you use?"
him: "k-means!"
me: "the stupid k-means?"
him: "yes, the data set is pretty much regular and the points are well divided so I believe kmeans should work properly...Do you think I made a mistake to implement it?"
me: "no you are a fantastic developer. More easily it cannot work with data organized like ellipsoid shape! ...try to visit wikipedia, there is a nice sample with iris data set to explain where it fails"
him: "but iris data set is a tricky set, the points are overlapped, my scenario is different: it is more easy!".
me: "ok, this evening visit my blog, I'll show you a nice sample where the data are definitely divided, but the kmeans fails!"
Two set of points,  in different colors the clusters obtained, the two big points are the centroids.
Just to have another prospective, I have done the same test in 3D:

Two set of points,  in different colors the clusters obtained, the two big points are the centroids.
Of course, increasing the number of centroids, the situation gets worse and worse:
I really hope that my dear friend will be convinced that doesn't exist an algorithm useful for all kind of problem!
Moreover... I would have in my job dataset "tricky" like Iris!!!
Contact me for the notebook to play with the simulations.
Stay Tuned.
PS: I'll be on vacation until the end of the month... see you soon!


  1. Mathematica has 4 clustering algorithms: "Agglomerate", "Optimize", "KMeans", "PAM". You can play with them using the following code:

    data1 = Table[{Random[]/5, Random[]}, {100}];
    data2 = Table[{Random[]/5 + distance, Random[]}, {100}];
    dataM = Flatten[{data1, data2}, 1] // RandomChoice[#, Length[#]] &;
    clusters =
    ClusteringComponents[dataM, 2, 1,
    DistanceFunction -> EuclideanDistance, Method -> myMethod];
    Point@Pick[dataM, clusters, 1],
    Point@Pick[dataM, clusters, 2]
    }, PlotLabel -> myMethod
    {myMethod, {"Agglomerate", "Optimize", "KMeans", "PAM"}}
    {distance, 0.2, 0.8, 0.1}
    , ImageSize -> 600, Frame -> All

  2. @Sjoerd C. de Vries : thanks for your comment and source code.
    As you mentioned, Mathematica provides many different kind of clustering algo, and many distance functions.
    ...BTW, defining a proper distance function, also the kmeans works properly with the ellipsoid data set I shown in the post.
    For example a dissimilarity distance (based on variance measure) easily solve the problem.