Text & Data Mining by practical means: Partitional clustering: number of clusters and performances a quantitative analysis

Tuesday, December 9, 2014

Partitional clustering: number of clusters and performances a quantitative analysis

Abstract
Partitional clustering methods as k-medoid or k-means require an input parameter to specify the number of clusters to partition the data.
The complexity in time of such algorithms strictly depends on

the number of clusters used to initialise the computation.
the steps to update the centroids.

Whenever the similarity distance doesn't allow the determination of the centroid thru the analytical methods, the complexity in time tends to explode.
In the post I show an heuristic to minimise the complexity in time in non-Euclidean space.

Number of computational steps for standard k-mean executed. Chart depicts the steps by number of points $N$ in the range [1k,10k] and by number of clusters in the range [5,$N/10$].
The cardinality of the clusters has been simulated assuming uniform distribution.
The red line indicates the minimum number of steps.

Introduction
K-means or k-medoids are widely used because its implementation is very easy and the complexity in time is low (it's linear over the iterations*#clusters#*#points).
Is the complexity linear for whatever problem?
The answer is no!

The problem
The problem arises when the objects to be clustered don't lie in a space for which the computation of the centroids can be done thru analytical methods.
In this context, the computational complexity to update the centroids becomes preponderant respect the other steps and the overall complexity is not any longer linear in time.

The complexity in case of non-Euclidean space
In a Euclidean space, the points of a cluster can be averaged.
In non-Euclidean spaces, there is no guarantee that points have an “average” (...some puritan of the measure theory might stuck up their nose on that) so we have to choose one of the members of the cluster as centroid.
As explained in the former post, for each cluster the computation of the centroid requires (in case of symmetric distance) $\binom{K_i}{2}$ computational steps.
The overall complexity (for each iteration) is the following:
$ \sum_{i=1}^k \binom{z_i}{2}+k \cdot n$

A dataset having several points might require a number of computations simply not feasible; in that case it's essential the minimisation and optimisation of each param of the algorithm to reduce the complexity.
A numeric example will clarify the point.
Let's assume a data set of 10.000 elements, and let's assume that the compression rate of the clustering is 90% (meaning the number of clusters are 1.000).

If the clusters cardinality is homogeneous, than the number of steps for the first iteration will be = 10.045.000.
If the clusters cardinality is highly unbalanced, for instance in case that a cluster contains 95% of the points, than the number of steps is = 50.360.655.

As you can see... even if the data set is relatively small the computational effort is very high!

Determination of the #clusters that minimises the complexity in time

There are no many choices ... the only option is to determine the number of clusters that minimises the computational steps.

You might argue that the # clusters should be selected according to the distribution of the data set, but it's also true the following:

In the real world you don't know a priori the exact number of clusters and all the heuristics to estimate them are quite often not really helpful in non-Euclidean space with several points. I prefer so, to have a sub-optimal clustering solution but with much faster answer :)
clustering analysis is usually one of the many analysis that usually you perform to understand your data, therefore even if the clustering is not perfect, it doesn't hurt too much. Clustering it's just one piece of the entire puzzle!

The approach
If the clustering divides the data set in clusters having more or less the same size, the solution is quite straightforward, we can consider the minimum of the function described above, thru the analysis of first derivate.
Assuming that each $z_i = \frac{n}{k}$ then the formula can be rewritten as:
$ f(n,k)=\frac{n^2}{2 k}+k n-\frac{n}{2}$
The value of $k$ that minimises the first derivative of the such function is:
$ \frac{\partial f(n,k) }{\partial k}=\frac{\partial}{\partial k} \frac{n^2}{2 k}+k n-\frac{n}{2} = n-\frac{n^2}{2 k^2}$.
And it takes value = 0 with $k= \frac{\sqrt{n}}{\sqrt{2}}$.

In all the other cases for which the clusters size is not homogeneous, then we can easily simulate it!

Computations for a data set of 10k points.
The computations are estimated by number of clusters (cluster size ~ [2,1000]) and with unbalanced clusters sizes.
The red line indicates the minimum.

As you can see the difference between the computational steps to be executed in case of "best" configuration of #clusters and all the other possible configurations is quite impressive: it's almost 1*10^7 in worst case. And this is for each iteration!

I let you calculate how much time you can save working with the best configuration :).

Stay tuned!

40 comments:

SamuelJune 2, 2015 at 6:37 AM
This is a very informative article about Data Mining. It is true that now a days data mining is playing a crucial role in the business growth or competition check, but we should done this job by experts only. Loginworks has expertise in providing the best data mining services on demand. Please check them at:

http://www.loginworks.com/data-mining-services-various-type/
ReplyDelete
Replies
UnknownJuly 21, 2015 at 3:28 AM
Internet Data Mining is a process that involves searching, collecting, filtering and analyzing the data from online sources. There are many Data Mining Company which providing the Data Mining Service
ReplyDelete
Replies
AnonymousOctober 28, 2015 at 5:38 AM
Thanks For this information. There are many Data Mining Companies which provides Data Mining Service, web data scraping and web data extraction services.
ReplyDelete
Replies
Zaheer FazalNovember 30, 2015 at 10:34 PM
Thanks, these Data mining services information are really valuable. You have great search on it. Please write more on Benefits of Data Mining. Now you are subscribed.
ReplyDelete
Replies
AnonymousMay 16, 2016 at 6:53 PM
Every one wants to get discounts on his online shopping purchase and this website is giving access to generate free amazon card via amazon gift card code generator that has the authority to provide up to 100$ amazon card in a single go...!! just check this out and enjoy shopping online only at amazon gift card.
ReplyDelete
Replies
katherineharrykwelApril 17, 2019 at 8:55 AM
This is really a useful and informative post. Through your portal, one can enhance the knowledge in Analytics world Data Mining.
ReplyDelete
Replies
leoJune 27, 2019 at 12:58 AM
Amazing content.
Data Mining Process
ReplyDelete
Replies
NikishaDecember 19, 2019 at 2:39 AM
The comprehensive data warehouse solutions provided by your company incorporate excellent data modelling along with the creative governance in the data migration or data integration.
ReplyDelete
Replies
Hir InfotechDecember 30, 2019 at 4:10 AM
This comment has been removed by the author.
ReplyDelete
Replies
CharlesJanuary 27, 2020 at 9:18 PM
Thank you so much for this nice information. Hope so many people will get aware of this and useful as well. And please keep update like this.

Text Analytics Companies

Sentiment Analysis Tool
ReplyDelete
Replies
Ethical Hacking CourseNovember 7, 2020 at 5:38 AM
Top quality blog with unique content and found valuable looking forward for next updated thank you
Ethical Hacking Course in Bangalore
ReplyDelete
Replies
AnonymousNovember 8, 2020 at 8:41 PM

Through this post, I realize that your great information in playing with all the pieces was exceptionally useful. I advise this is the primary spot where I discover issues I've been scanning for. You have a smart yet alluring method of composing.
360DigiTMG project management training
ReplyDelete
Replies
Glen MaxyJuly 19, 2021 at 12:19 AM
Students often face difficulty in completing their homework on time leading to poor academic grades. It is often hectic and tiring to compete with other academic works but Homework Help is just to place to look for help. It provides all the essential to complete homework before their deadlines.
ReplyDelete
Replies
James MartinDecember 21, 2021 at 2:11 AM
Conducting good research help you to face this problem and you can also take Assignment Help Online from experienced professionals. These professionals are well trained in finding authentic sources for collecting information for assignment topics.
ReplyDelete
Replies
kinglalisahDecember 22, 2021 at 8:55 AM
What you are sharing your all blogs can you realize is this? Its was really great experience to your all readers. I can't explain about you and your skills. Thankyou so much for sharing this great and useful information about many interesting niche, I hope keep more post for sharing. Good luck for your new posts :) psychology assignment help - science assignment help - assignment help new york - ruby assignment help
ReplyDelete
Replies
Latest GistFebruary 4, 2022 at 7:45 AM
Thanks for posting this fantastic content. I appreciate your effort. Checkout How To Start A Small Business At Home For 12 Year Olds
ReplyDelete
Replies
Assignment HelpApril 26, 2022 at 5:57 AM
Are you looking for Homework Help from the best experts? If yes, then there is no need to search. We have a team of experts who have experience in completing the assignments with perfection. Contact GreatAssignmentHelp to get all types of assignment help on time
ReplyDelete
Replies
Jessica LucasJune 9, 2022 at 4:08 AM
The new report by Expert Market Research titled, ‘Global Chocolate Market Report and Forecast 2022-2027’, gives an in-depth analysis of the global Chocolate Market, assessing the market based on its segments like categories, product types, growth, qualities, processed types, applications, and regions. The report tracks the latest trends in the industry and studies their impact on the overall market upcoming years. The chocolate market is being driven by the growing demand for chocolate in the food and beverage industry, as a popular flavouring ingredient. For instance, chocolate is used in producing cookies and cakes, and can also be found in beverages, Additionally, the wide variety of chocolates available in the market are also attracting further consumer investments, which is expected to positively impact the market growth. The major players in the market are Nestlé S.A., Ferrero Group, Mars, Incorporated, Mondelez International, Hershey Co., Barry Callebaut AG, and Kerry Group plc, many others are lead of this industry. Expert Market Research provides market intelligence across a range of industry verticals which include Pharmaceuticals, Health, Food and Beverage, Chemical and Materials, Energy and Mining. For more information about us, Kindly visit our website also.
ReplyDelete
Replies
louis waneJuly 5, 2022 at 2:19 AM
One of the primary drivers driving the market’s favourable outlook is significant growth in the Life Science Market, which is accompanied by an increased demand for analytical insights for clinical trials. Other growth-inducing factors include technological advancements such as the creation of new telemedicine, mhealth, and e-prescribing systems. These analytics solutions are also commonly used to generate precision and tailored medicines, which rely on precise genomic data from patients with specific medical needs.

Read More:

Integrated Workplace Management System Market

Siding Market

Chlor-Alkali Market

Solid Waste Management Market

Advanced Functional Materials Market

Polypropylene Market

Drilling Fluids Market

Alopecia Market

Laparoscopy Devices Market
ReplyDelete
Replies
SrisLawyersJuly 19, 2022 at 12:52 AM
I Like your post. It gives more information to us.
Sex Crimes Lawyer VA

ReplyDelete
Replies
BesthyiptemplateJuly 24, 2022 at 1:46 AM
I simply wanted to write down a quick word to say thanks to you for that wonderful information you are showing on this site. USA verified PayPal account
ReplyDelete
Replies
zarifAugust 23, 2022 at 12:23 AM
Anyone who wants to learn Hindi in the simplest possible method should enrol in Ziyyara's online Hindi language course.
ReplyDelete
Replies
nareshJuly 20, 2023 at 5:56 AM
Hi.
Thank you for sharing this information, this blog was good and it’s give good information about The problem arises when the objects to be clustered don't lie in a space for which the computation of the centroids can be done thru analytical methods. thanks for sharing this information with us.
Here is sharing some OTM information may be its helpful to you.
OTM Training

ReplyDelete
Replies
johnsonmioAugust 31, 2023 at 9:20 AM
"Dissertation writing services offer expert support to students tackling the intricacies of creating a thesis. Guiding through topic selection, literature review, and research design, these services ensure a robust scholarly document. With experienced writers, they provide aid in data collection, analysis, and synthesis of findings. Constructive critiques refine content. Caution is advised to choose ethical and credible services that prioritize authenticity. Such services alleviate the daunting task, providing students with well-structured, top-quality dissertations. Ultimately, Dissertation writing services
streamline the process, helping students achieve academic success."
ReplyDelete
Replies
KataresSeptember 28, 2023 at 2:01 PM
Eskişehir
Adana
Sivas
Kayseri
Samsun
TN553
ReplyDelete
Replies
DenizKızıGüzeli71October 16, 2023 at 4:16 PM
https://titandijital.com.tr/
amasya parça eşya taşıma
adıyaman parça eşya taşıma
hatay parça eşya taşıma
giresun parça eşya taşıma
İZAX2
ReplyDelete
Replies
NeonNomad1QOctober 20, 2023 at 12:56 AM
aydın evden eve nakliyat
yozgat evden eve nakliyat
kırklareli evden eve nakliyat
antep evden eve nakliyat
konya evden eve nakliyat
46G5WQ
ReplyDelete
Replies
AstralSerpentess12ATOctober 23, 2023 at 3:21 AM
izmir evden eve nakliyat
malatya evden eve nakliyat
hatay evden eve nakliyat
kocaeli evden eve nakliyat
mersin evden eve nakliyat
2GCT3
ReplyDelete
Replies
jhon carryOctober 24, 2023 at 7:41 PM
Thank you for sharing this informative blog with us. Your blog is very useful for us. Are you a student pursuing a management course in Australia and facing challenges with your assignments? Look no further! Our Management Assignment Help service in Australia is here to assist you in achieving academic excellence and mastering the art of management. Our dedicated team of expert writers and subject matter specialists is well-versed in the intricacies of management principles, theories, and practices. We offer comprehensive support for all aspects of management assignments, including:

ReplyDelete
Replies
1E464TaylorA487ENovember 7, 2023 at 4:07 AM
CFD8E
Adıyaman Evden Eve Nakliyat
Bitlis Evden Eve Nakliyat
Kırıkkale Evden Eve Nakliyat
Kırşehir Evden Eve Nakliyat
Artvin Evden Eve Nakliyat
ReplyDelete
Replies
65CE3Laura520B9November 11, 2023 at 7:41 PM
A8792
Kocaeli Şehirler Arası Nakliyat
Paribu Güvenilir mi
Bone Coin Hangi Borsada
Çanakkale Parça Eşya Taşıma
Sinop Şehirler Arası Nakliyat
İzmir Şehir İçi Nakliyat
Eskişehir Lojistik
Gölbaşı Boya Ustası
Muş Lojistik
ReplyDelete
Replies
A1F2EGlenn6EBA0December 22, 2023 at 4:38 AM
50A88
kırklareli mobil sohbet chat
bedava sohbet siteleri
ücretsiz sohbet sitesi
antalya kadınlarla rastgele sohbet
çanakkale ücretsiz sohbet uygulaması
canlı sohbet bedava
balıkesir ücretsiz sohbet
muğla sesli görüntülü sohbet
sohbet sitesi
ReplyDelete
Replies
30295AmberED008January 20, 2024 at 7:12 AM
B7636
Görüntülü Sohbet Parasız
Spotify Takipçi Satın Al
Dlive Takipçi Hilesi
Periscope Takipçi Satın Al
Meta Coin Hangi Borsada
Binance Sahibi Kim
Okex Borsası Güvenilir mi
Görüntülü Sohbet
Coin Nedir
ReplyDelete
Replies
Assignment Help OnlineJanuary 24, 2024 at 4:13 AM
I want to express my sincere appreciation for sharing an outstanding article. Your dedication to providing valuable information is truly admirable and has not only helped me but, I believe, numerous others as well. Your commitment to delivering informative content is highly valued, and I encourage you to keep up this exceptional standard. Your contributions make a meaningful impact, and we appreciate the knowledge and insights you consistently provide. On the hunt for assignment aid? Your search terminates right here! We extend our management assignment help services within your reach, featuring a proficient squad of 2000+ assignment professionals. Rest assured, we're committed to shepherding you through your assignments. Queries? Contact us promptly.
ReplyDelete
Replies
3595DDevyn962A5February 10, 2024 at 1:12 PM
A70D5
dexscreener
satoshivm
layerzero
poocoin
thorchain
DefiLlama
debank
phantom wallet
uwulend finance
ReplyDelete
Replies
ISO In Saudi ArabiaAugust 27, 2024 at 11:49 PM
I didn’t expect to find such an easy-to-follow explanation on this subject. The writer did a wonderful job breaking it down in a way that’s accessible to everyone. Big thanks for the clarity! Visit our link for ISO Certification in Jeddah

ReplyDelete
Replies
AnonymousJanuary 30, 2025 at 5:47 PM
AF3CF795E1
takipçi kasma instagram
ReplyDelete
Replies
AnonymousFebruary 1, 2025 at 12:29 PM
2589C7CA2A
takipçi paketleri
ReplyDelete
Replies
AnonymousFebruary 5, 2025 at 11:44 PM
C1DC9E2C43
kadın takipçi
Kaspersky Etkinleştirme Kodu
M3u Listesi
Township Promosyon Kodu
Total Football Hediye Kodu
Pokemon GO Promosyon Kodu
Pubg New State Promosyon Kodu
Sıra Bulucu
Titan War Hediye Kodu
ReplyDelete
Replies
AnonymousMay 20, 2025 at 9:05 PM
This study on partitional clustering provides a quantitative analysis of how the number of clusters affects clustering performance. It highlights the trade-off between choosing too few or too many clusters and shows the impact on accuracy, cohesion, and separation metrics, offering useful insights for optimizing cluster selection.

4.1-mini
<a href="https://digitalfloats.com/graphic-design-course-in-
hyderabad/">Partitional clustering: number of clusters and performances a quantitative analysis;/a>
https://digitalfloats.com/graphic-design-course-in-hyderabad/
ReplyDelete
Replies

Add comment

Pages

Tuesday, December 9, 2014

Partitional clustering: number of clusters and performances a quantitative analysis

40 comments: