Tuesday, December 9, 2014

Partitional clustering: number of clusters and performances a quantitative analysis

Abstract
Partitional clustering methods as k-medoid or k-means require an input parameter to specify the number of clusters to partition the data.
The complexity in time of such algorithms strictly depends on
  • the number of clusters used to initialise the computation.
  • the steps to update the centroids.
Whenever the similarity distance doesn't allow the determination of the centroid thru the analytical methods, the complexity in time tends to explode.
In the post I show an heuristic to minimise the complexity in time in non-Euclidean space.
Number of computational steps for standard k-mean executed. Chart depicts  the steps by number of points $N$ in the range [1k,10k] and  by number of clusters in the range [5,$N/10$].
The cardinality of the clusters has been simulated assuming uniform distribution.
The red line indicates the minimum number of steps.
Introduction
K-means or k-medoids are widely used because its implementation is very easy  and the complexity in time is low (it's linear over the iterations*#clusters#*#points).
Is the complexity linear  for whatever problem?
The answer is no!

The problem
The problem arises when the objects to be clustered don't lie in a space for which the computation of the centroids can be done thru analytical methods.
In this context, the computational complexity to update the centroids becomes preponderant respect the other steps and the overall complexity is not any longer linear in time.

The complexity in case of non-Euclidean space
In a Euclidean space, the points of a cluster can be averaged. 
In non-Euclidean spaces, there is no guarantee that points have an “average” (...some puritan of the measure theory might stuck up their nose on that) so we have to choose one of the members of the cluster as centroid.
As explained in the former post, for each cluster the computation of the centroid requires (in case of symmetric distance) $\binom{K_i}{2}$ computational steps.
The overall complexity (for each iteration) is the following:
  $ \sum_{i=1}^k \binom{z_i}{2}+k \cdot n$ 

A dataset having several points might require a number of computations simply not feasible; in that case it's essential the minimisation and optimisation of each param of the algorithm to reduce the complexity.
A numeric example will clarify the point.
Let's assume a data set of 10.000 elements, and let's assume that the compression rate of the clustering is 90% (meaning the number of clusters are 1.000).

  • If the clusters cardinality is homogeneous, than the number of steps for the first iteration will be = 10.045.000.
  • If the clusters cardinality is highly unbalanced, for instance in case that a cluster contains 95% of the points, than the number of steps is = 50.360.655.
As you can see... even if the data set is relatively small the computational effort is very high!

Determination of the #clusters that minimises the complexity in time

There are no many choices ... the only option is to determine the number of clusters that minimises the computational steps.

You might argue that the # clusters should be selected according to the distribution of the data set, but it's also true the following:
  • In the real world you don't know a priori the exact number of clusters and all the heuristics to estimate them are quite often not really helpful in non-Euclidean space with several points. I prefer so, to have a sub-optimal clustering solution but with much faster answer :)
  • clustering analysis is usually one of the many analysis that usually you perform to understand your data, therefore even if the clustering is not perfect, it doesn't hurt too much. Clustering it's just one piece of the entire puzzle!
The approach
If the clustering divides the data set in clusters having more or less the same size, the  solution is quite straightforward, we can consider the minimum of the function described above, thru the analysis of first derivate.
Assuming that each $z_i = \frac{n}{k}$ then the formula can be rewritten as:
$ f(n,k)=\frac{n^2}{2 k}+k n-\frac{n}{2}$
The value of $k$ that minimises the first derivative of the such function is:
$ \frac{\partial f(n,k) }{\partial k}=\frac{\partial}{\partial k} \frac{n^2}{2 k}+k n-\frac{n}{2} = n-\frac{n^2}{2 k^2}$.
And it takes value = 0 with $k= \frac{\sqrt{n}}{\sqrt{2}}$.

In all the other cases for which the clusters size is not homogeneous, then we can easily simulate it!

Computations for a data set of 10k points.
 The computations are estimated by number of clusters (cluster size ~ [2,1000]) and with unbalanced clusters sizes.
The red line indicates the minimum. 
As you can see the difference between  the computational steps to be executed in case of "best" configuration of #clusters and all the other possible configurations is quite impressive: it's almost 1*10^7 in worst case. And this is for each iteration!
I let you calculate how much time you can save working with the best configuration :).
Stay tuned!
c.

35 comments:

  1. This is a very informative article about Data Mining. It is true that now a days data mining is playing a crucial role in the business growth or competition check, but we should done this job by experts only. Loginworks has expertise in providing the best data mining services on demand. Please check them at:

    http://www.loginworks.com/data-mining-services-various-type/

    ReplyDelete
  2. Internet Data Mining is a process that involves searching, collecting, filtering and analyzing the data from online sources. There are many Data Mining Company which providing the Data Mining Service

    ReplyDelete
  3. Thanks For this information. There are many Data Mining Companies which provides Data Mining Service, web data scraping and web data extraction services.

    ReplyDelete
  4. Thanks, these Data mining services information are really valuable. You have great search on it. Please write more on Benefits of Data Mining. Now you are subscribed.

    ReplyDelete
  5. Every one wants to get discounts on his online shopping purchase and this website is giving access to generate free amazon card via amazon gift card code generator that has the authority to provide up to 100$ amazon card in a single go...!! just check this out and enjoy shopping online only at amazon gift card.

    ReplyDelete
  6. This is really a useful and informative post. Through your portal, one can enhance the knowledge in Analytics world Data Mining.

    ReplyDelete
  7. The comprehensive data warehouse solutions provided by your company incorporate excellent data modelling along with the creative governance in the data migration or data integration.

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. Thank you so much for this nice information. Hope so many people will get aware of this and useful as well. And please keep update like this.

    Text Analytics Companies

    Sentiment Analysis Tool

    ReplyDelete
  10. Top quality blog with unique content and found valuable looking forward for next updated thank you
    Ethical Hacking Course in Bangalore

    ReplyDelete

  11. Through this post, I realize that your great information in playing with all the pieces was exceptionally useful. I advise this is the primary spot where I discover issues I've been scanning for. You have a smart yet alluring method of composing.
    360DigiTMG project management training

    ReplyDelete
  12. Students often face difficulty in completing their homework on time leading to poor academic grades. It is often hectic and tiring to compete with other academic works but Homework Help is just to place to look for help. It provides all the essential to complete homework before their deadlines.

    ReplyDelete
  13. Conducting good research help you to face this problem and you can also take Assignment Help Online from experienced professionals. These professionals are well trained in finding authentic sources for collecting information for assignment topics.

    ReplyDelete
  14. What you are sharing your all blogs can you realize is this? Its was really great experience to your all readers. I can't explain about you and your skills. Thankyou so much for sharing this great and useful information about many interesting niche, I hope keep more post for sharing. Good luck for your new posts :) psychology assignment help - science assignment help - assignment help new york - ruby assignment help

    ReplyDelete
  15. Thanks for posting this fantastic content. I appreciate your effort. Checkout How To Start A Small Business At Home For 12 Year Olds

    ReplyDelete
  16. Are you looking for Homework Help from the best experts? If yes, then there is no need to search. We have a team of experts who have experience in completing the assignments with perfection. Contact GreatAssignmentHelp to get all types of assignment help on time

    ReplyDelete
  17. The new report by Expert Market Research titled, ‘Global Chocolate Market Report and Forecast 2022-2027’, gives an in-depth analysis of the global Chocolate Market, assessing the market based on its segments like categories, product types, growth, qualities, processed types, applications, and regions. The report tracks the latest trends in the industry and studies their impact on the overall market upcoming years. The chocolate market is being driven by the growing demand for chocolate in the food and beverage industry, as a popular flavouring ingredient. For instance, chocolate is used in producing cookies and cakes, and can also be found in beverages, Additionally, the wide variety of chocolates available in the market are also attracting further consumer investments, which is expected to positively impact the market growth. The major players in the market are Nestlé S.A., Ferrero Group, Mars, Incorporated, Mondelez International, Hershey Co., Barry Callebaut AG, and Kerry Group plc, many others are lead of this industry. Expert Market Research provides market intelligence across a range of industry verticals which include Pharmaceuticals, Health, Food and Beverage, Chemical and Materials, Energy and Mining. For more information about us, Kindly visit our website also.

    ReplyDelete
  18. One of the primary drivers driving the market’s favourable outlook is significant growth in the Life Science Market, which is accompanied by an increased demand for analytical insights for clinical trials. Other growth-inducing factors include technological advancements such as the creation of new telemedicine, mhealth, and e-prescribing systems. These analytics solutions are also commonly used to generate precision and tailored medicines, which rely on precise genomic data from patients with specific medical needs.

    Read More:

    Integrated Workplace Management System Market

    Siding Market

    Chlor-Alkali Market

    Solid Waste Management Market

    Advanced Functional Materials Market

    Polypropylene Market

    Drilling Fluids Market

    Alopecia Market

    Laparoscopy Devices Market

    ReplyDelete
  19. I Like your post. It gives more information to us.
    Sex Crimes Lawyer VA

    ReplyDelete
  20. I simply wanted to write down a quick word to say thanks to you for that wonderful information you are showing on this site. USA verified PayPal account

    ReplyDelete
  21. Anyone who wants to learn Hindi in the simplest possible method should enrol in Ziyyara's online Hindi language course.

    ReplyDelete
  22. Hi.
    Thank you for sharing this information, this blog was good and it’s give good information about The problem arises when the objects to be clustered don't lie in a space for which the computation of the centroids can be done thru analytical methods. thanks for sharing this information with us.
    Here is sharing some OTM information may be its helpful to you.
    OTM Training

    ReplyDelete
  23. "Dissertation writing services offer expert support to students tackling the intricacies of creating a thesis. Guiding through topic selection, literature review, and research design, these services ensure a robust scholarly document. With experienced writers, they provide aid in data collection, analysis, and synthesis of findings. Constructive critiques refine content. Caution is advised to choose ethical and credible services that prioritize authenticity. Such services alleviate the daunting task, providing students with well-structured, top-quality dissertations. Ultimately, Dissertation writing services
    streamline the process, helping students achieve academic success."

    ReplyDelete
  24. Thank you for sharing this informative blog with us. Your blog is very useful for us. Are you a student pursuing a management course in Australia and facing challenges with your assignments? Look no further! Our Management Assignment Help service in Australia is here to assist you in achieving academic excellence and mastering the art of management. Our dedicated team of expert writers and subject matter specialists is well-versed in the intricacies of management principles, theories, and practices. We offer comprehensive support for all aspects of management assignments, including:


    ReplyDelete
  25. I want to express my sincere appreciation for sharing an outstanding article. Your dedication to providing valuable information is truly admirable and has not only helped me but, I believe, numerous others as well. Your commitment to delivering informative content is highly valued, and I encourage you to keep up this exceptional standard. Your contributions make a meaningful impact, and we appreciate the knowledge and insights you consistently provide. On the hunt for assignment aid? Your search terminates right here! We extend our management assignment help services within your reach, featuring a proficient squad of 2000+ assignment professionals. Rest assured, we're committed to shepherding you through your assignments. Queries? Contact us promptly.

    ReplyDelete