This post presents a quick and effective method that leverages text clustering for data de-duplication and normalisation.
The customised version proposed in this post bypasses the well-known problem of having to assign the number of clusters in advance.
Text Clustering results: each row depicts a cluster found.
Introduction
One of the most common problems in data management is consistency: customer addresses, company names and any other attribute in string format can be represented in multiple formats, often mutually equivalent.
Data normalisation presents some challenges, especially in the following scenarios:
- Variations/mutations of the trusted source of information are not known a priori.
- The data to be normalised are characterised by high volume.
- There are no deterministic and static rules to normalise the data.
- ...you don't have plenty of time to find the perfect solution :), and you have to deliver fast!
Let's assume that you have already found, for your specific use case, the best distance to judge how far one record is from another.
To make the story a bit more realistic, let's assume hereafter that our attribute to be normalised is a string.
...I would hope that none of you really thought of creating a big matrix to keep track of all the possible distances among the possible values of the attribute... Why? If you have, for instance, 10k possible values, you need to compute at least 49,995,000 comparisons (assuming that your distance is symmetric!), because the complexity follows $\binom{N}{2}$.
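As a quick sanity check (a minimal sketch of mine in Python, not part of the original post), that figure can be verified directly:

```python
from math import comb

# Symmetric pairwise comparisons among 10k values: C(N, 2) = N * (N - 1) / 2
print(comb(10_000, 2))  # 49995000
```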
You might perform the brute-force approach over a statistically representative sample set, but in that case it is really hard to validate the actual accuracy of your results.
Why is Partitional Clustering efficient?
As you know, this is a "practical mean" blog, so I encourage you to refer to real books if you are looking for the theoretical proof. Anyhow, it's easy to convince yourself of the following:
- number of comparisons to be performed assuming $K$ clusters, $I$ iterations, $N$ elements to be clustered: $N \cdot K \cdot I$.
- number of comparisons to update the centroids: $I \cdot \left(\binom{K_1}{2}+\binom{K_2}{2}+\cdots+\binom{K_K}{2}\right) \ll \binom{N}{2}$, where $K_j$ is the number of elements in the $j$-th cluster.
The comparison of the computational complexity of the brute-force approach with partitional clustering is well depicted by the example below, where I assumed ten iterations to reach convergence, 10 clusters, and a data set that grows from 1,000 to 20,000 points (a small sketch reproducing these counts follows the figure):
The computational effort: in red the brute force approach, in blue the clustering.
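Below is a minimal sketch (my reconstruction, not code from the post) that reproduces the two operation counts under the same assumptions; the equal-size-cluster simplification is mine:

```python
from math import comb

K = 10   # clusters
I = 10   # iterations to converge
for N in range(1_000, 20_001, 1_000):
    brute_force = comb(N, 2)            # all symmetric pairwise comparisons
    assignment = N * K * I              # point-to-medoid comparisons
    # Medoid update: pairwise comparisons inside each cluster, per iteration,
    # assuming (for simplicity) K equally sized clusters of N // K points.
    update = I * K * comb(N // K, 2)
    print(N, brute_force, assignment + update)
```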
The overall strategy is the following (a sketch of this pipeline follows the list):
1. Group together records that look similar to each other using k-medoids clustering.
2. Store the centroids found in an index.
3. Delete (only if the use case allows it) all the members of the clusters, with the exception of the centroids.
4. For each new record to be stored in the DB, perform the comparison with the elements of the index: if the distance is too high, add the new record to the index.
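Here is a minimal, hedged sketch of that pipeline in Python; the `distance` function, the index (a plain list of centroid strings) and the threshold `theta` are placeholders of mine, not the post's actual implementation:

```python
def normalise_record(record: str, centroid_index: list[str], distance, theta: float) -> str:
    """Map `record` onto an existing centroid, or register it as a new canonical value."""
    if centroid_index:
        best = min(centroid_index, key=lambda c: distance(record, c))
        if distance(record, best) <= theta:
            return best               # close enough: reuse the existing canonical form
    centroid_index.append(record)     # too far from every centroid: new canonical entry
    return record
```

New records are normalised by mapping them onto the closest centroid, and the index grows only when a genuinely new value shows up.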
K-medoids vs K-means
The main difference of K-medoids with respect to the well-known K-means is in the assignment of the centroids:
- in K-means the centroid is the barycenter of the cluster, so it might not be a real record.
- in K-medoids, the centroid is a record of the cluster that minimises the sum of the distances with respect to all the other points of the cluster. Of course, depending on the case, you might prefer to minimise the variance, or even define more sophisticated minimisation criteria.
- K-means is usually much faster, since the barycenter is usually computed as the average of the positions of the records in the space;
- K-medoids requires (for each cluster) the additional comparison of the distances of each point of the cluster against all the other points of the cluster, which explains the centroid-update term in the complexity formula above. A sketch of this medoid update follows the list.
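A minimal sketch of the medoid update (my illustration, assuming a generic `distance` function) makes the extra cost explicit: every point is compared against every other point of its cluster:

```python
def update_medoid(cluster: list[str], distance) -> str:
    """Return the element of the cluster minimising the sum of distances
    to all the other elements (the medoid)."""
    return min(cluster, key=lambda candidate: sum(distance(candidate, other)
                                                  for other in cluster))
```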
...But there is still something that makes me unhappy with K-medoids: it's parametric, meaning it requires a priori knowledge of the number of clusters.
In the real world, simply forget it!
Customised version for auto-determination of the number of clusters.
- Instantiate the standard k-medoids with an arbitrary number of clusters (in the case analysed later I used just two clusters as initialisation).
- After the medoid-update step, compare the medoids with each other: if two medoids are very similar, then merge the two clusters.
- Calculate the mean distance for each cluster (this computation can take advantage of the operations performed during the medoid update): if the mean distance exceeds a threshold (which indicates the sparsity of the points of the cluster), then split the cluster. In this experiment I used a wild approach: each point of the split cluster has been considered a new cluster.
- Go back to step 2 until convergence (a sketch of the merge/split step follows the list).
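Below is a hedged sketch of the merge/split step: it is my reconstruction of the idea, not the author's code. It reuses the `update_medoid` helper sketched earlier and assumes a normalised distance in [0, 1] (smaller means more similar), with `theta` playing the role of the single threshold discussed below:

```python
def merge_and_split(clusters: list[list[str]], distance, theta: float) -> list[list[str]]:
    """One merge/split pass of the customised k-medoids iteration (illustrative)."""
    # Merge: if a cluster's medoid is within theta of an existing group's medoid,
    # fuse the two clusters.
    merged: list[list[str]] = []
    for cluster in clusters:
        medoid = update_medoid(cluster, distance)
        for group in merged:
            if distance(medoid, update_medoid(group, distance)) < theta:
                group.extend(cluster)
                break
        else:
            merged.append(list(cluster))

    # Split: if a cluster is too sparse (mean distance from its medoid above 1 - theta),
    # explode it: every point becomes its own new cluster.
    result: list[list[str]] = []
    for cluster in merged:
        medoid = update_medoid(cluster, distance)
        mean_dist = sum(distance(medoid, p) for p in cluster) / len(cluster)
        if len(cluster) > 1 and mean_dist > 1.0 - theta:
            result.extend([p] for p in cluster)
        else:
            result.append(cluster)
    return result
```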
You might argue that such a customisation doesn't require the "number of clusters" as a parameter but introduces two thresholds: one for the cluster merging and another one for the splitting.
Actually the threshold is just one:
- if $\theta$ is the minimum acceptable distance for which two strings might be considered the same, then $1-\theta$ is the threshold at which two strings are absolutely different!
Experiment: Address de-duplication.
As a data set I downloaded from this link the addresses of the pizzerias in New York (I'm Italian, and I love pizza :D).
To assess the accuracy of the algorithm, I randomised each entry by introducing Gaussian noise.
Data Perturbation:
The perturbation method I used is based on randomised swapping of the characters of the string (a sketch follows the list):
- Select the number $n$ of characters you want to swap. Such a number is drawn within the range [0, StringLength[str]] using a Gaussian distribution.
- Draw $n$ couples of integers (between [1, StringLength[str]]).
- For each couple, swap the corresponding first character with the second.
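A minimal Python sketch of this perturbation (my reconstruction of the steps above; the spread `sigma` of the Gaussian is an assumption, and Python's 0-based indexing replaces the [1, StringLength[str]] range):

```python
import random

def perturb(s: str, sigma: float = 2.0) -> str:
    """Randomly swap characters of `s`; the number of swaps is Gaussian-distributed."""
    chars = list(s)
    if not chars:
        return s
    # Number of swaps drawn from a Gaussian, clamped to [0, len(s)].
    n = min(len(chars), max(0, round(random.gauss(0, sigma))))
    for _ in range(n):
        i, j = random.randrange(len(chars)), random.randrange(len(chars))
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```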
The definition of distance is crucial: several options are available. In this experiment I used the common Needleman-Wunsch similarity. I'm pretty sure that good results can also be achieved with the Jaro-Winkler and Levenshtein distances.
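Purely as an illustration (not the distance used in the experiment), here is a normalised Levenshtein-based distance, one of the alternatives mentioned above and simpler to sketch than Needleman-Wunsch; it returns 0 for identical strings and 1 for completely different ones:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalised_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```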
Experiment Setting
I first estimated the parameter of the customised k-medoids algorithm: the merging/splitting threshold.
The value depends on the nature of the data.
I randomly selected several couples of strings, measured their distances, sorted them by value and asked myself:
- What's the minimum acceptable distance between two strings to be considered the same?
I noticed that a 70% match is the good cut-off.
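A hedged sketch of this estimation procedure (my illustration; `normalised_distance` is the stand-in distance sketched above, and the sample size is arbitrary):

```python
import random

def sample_distances(records: list[str], n_pairs: int = 500) -> list[float]:
    """Sample random couples of strings and return their sorted distances,
    so the merging/splitting threshold can be picked by inspection."""
    return sorted(normalised_distance(*random.sample(records, 2)) for _ in range(n_pairs))
```

Eyeballing where the sorted distances jump from "variants of the same address" to "genuinely different addresses" suggests the threshold.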
In the next posts I'll show a more refined technique based on the convergence slope of the clustering... but that's another story.
Results:
The data set contains 50 records:
- 10 different addresses
- for each address 5 deviations generated with the aforementioned perturbation.
The chart below shows how many times (out of 100 trials) the algorithm found exactly $k$ clusters.
The first bar says that in 22 out of 100 trials the algorithm returned 10 clusters.
- As you can see, in 70% of the cases the algorithm partitions the data set into [9, 12] clusters!
Best case: achieved in 20% of the cases. Steps to converge: ~26.
Worst case: returned 17 clusters. Convergence after 40 steps.
Convergence of the algorithm:
The time to converge is the following:
- On average the algorithm converges after 36 steps.
As usual: stay tuned!
cristian