I had a quick look at the dataset (4 components, I plotted the four combinations of 3D subsets, and I saw that the shapes were pretty much like an ellipsoid).
Let me report the call:
me:"what kind of clustering algorithm did you use?"
him: "k-means!"
me: "the stupid k-means?"
him: "yes, the data set is pretty much regular and the points are well divided so I believe kmeans should work properly...Do you think I made a mistake to implement it?"
me: "no you are a fantastic developer. More easily it cannot work with data organized like ellipsoid shape! ...try to visit wikipedia, there is a nice sample with iris data set to explain where it fails"
him: "but iris data set is a tricky set, the points are overlapped, my scenario is different: it is more easy!".
me: "ok, this evening visit my blog, I'll show you a nice sample where the data are definitely divided, but the kmeans fails!"
Two set of points, in different colors the clusters obtained, the two big points are the centroids. |
|
I really hope that my dear friend will be convinced that doesn't exist an algorithm useful for all kind of problem!
Moreover... I would have in my job dataset "tricky" like Iris!!!
Contact me for the notebook to play with the simulations.
Moreover... I would have in my job dataset "tricky" like Iris!!!
Contact me for the notebook to play with the simulations.
Stay Tuned.
PS: I'll be on vacation until the end of the month... see you soon!
Mathematica has 4 clustering algorithms: "Agglomerate", "Optimize", "KMeans", "PAM". You can play with them using the following code:
ReplyDeleteGraphicsGrid[
Table[
data1 = Table[{Random[]/5, Random[]}, {100}];
data2 = Table[{Random[]/5 + distance, Random[]}, {100}];
dataM = Flatten[{data1, data2}, 1] // RandomChoice[#, Length[#]] &;
Table[
clusters =
ClusteringComponents[dataM, 2, 1,
DistanceFunction -> EuclideanDistance, Method -> myMethod];
Graphics[
{Green,
Point@Pick[dataM, clusters, 1],
Red,
Point@Pick[dataM, clusters, 2]
}, PlotLabel -> myMethod
],
{myMethod, {"Agglomerate", "Optimize", "KMeans", "PAM"}}
],
{distance, 0.2, 0.8, 0.1}
]
, ImageSize -> 600, Frame -> All
]
@Sjoerd C. de Vries : thanks for your comment and source code.
ReplyDeleteAs you mentioned, Mathematica provides many different kind of clustering algo, and many distance functions.
...BTW, defining a proper distance function, also the kmeans works properly with the ellipsoid data set I shown in the post.
For example a dissimilarity distance (based on variance measure) easily solve the problem.
Cheers
cristian
ReplyDeleteThanks for sharing, check out
Data Mining Service Providers
Data Mining Process