Sunday, September 21, 2014

Text Clustering: a non-parametric version of the K-medoids algorithm for data de-duplication

Abstract
This post presents a quick and effective method to leverage text clustering for data de-duplication and normalisation.
The customised version proposed here bypasses the well-known problem of having to specify the number of clusters in advance.
Text Clustering results: each row depicts a cluster found. 


Introduction
One of the most common problems in data management is consistency: customer addresses, company names and any other attribute in string format can be represented in multiple, often mutually equivalent, formats.
Data normalisation presents some challenges, especially in the following scenarios:
  • Variations/mutations of the trusted source of information are not known a priori.
  • The data to be normalised are characterised by high volume.
  • There are no deterministic and static rules to normalise the data.
  • ...you don't have plenty of time to find the perfect solution :), and you have to deliver fast!
Why the brute-force approach is a waste of time!
Let's assume you have already found, for your specific use case, the best distance to judge how far one record is from another.
To make the story a bit more realistic, let's assume hereafter that the attribute to be normalised is a string.
...I would hope that nobody really thought of building a big matrix to keep track of all the possible distances among the possible values of the attribute... Why? If you have, for instance, 10k possible values, you need to compute at least 49,995,000 comparisons (assuming that your distance is symmetric!), because the number of comparisons grows as Binomial[N,2].
You might run the brute-force approach over a statistically representative sample set, but in that case it's really hard to validate the real accuracy of your results.

Why is partitional clustering efficient?
As you know, this is a "practical mean" blog, so I encourage you to refer to real books if you are looking for the theoretical proofs. Anyhow, it's easy to convince yourself of the following:
  • number of comparisons to be performed, assuming $K$ clusters, $I$ iterations and $N$ elements to be clustered: $N \cdot K \cdot I$;
  • number of comparisons needed to update the medoids: $I \cdot \left(\binom{N_1}{2}+\binom{N_2}{2}+\cdots+\binom{N_K}{2}\right) \ll \binom{N}{2}$, where $N_i$ is the size of the $i$-th cluster.

The computational complexity of the brute-force approach and of partitional clustering are compared in the example below, where I assumed ten iterations to reach convergence, 10 clusters, and a data set that grows from 1,000 to 20,000 points:
The computational effort: in red the  brute force approach, in blue the clustering.
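To get a feel for the numbers, here is a minimal back-of-the-envelope sketch (Python), assuming clusters of roughly equal size $N/K$, that reproduces the two curves of the chart:

```python
from math import comb

K, I = 10, 10  # clusters and iterations, as assumed for the chart above

for N in range(1000, 20001, 1000):
    brute_force = comb(N, 2)                 # all pairwise comparisons
    assignment = N * K * I                   # point-to-medoid comparisons
    medoid_update = I * K * comb(N // K, 2)  # per-cluster pairwise comparisons
    clustering = assignment + medoid_update
    print(f"N={N:6d}  brute force={brute_force:12,d}  clustering={clustering:12,d}")
```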
The approach in a nutshell 
1. Group together records that look similar to each other using K-medoids clustering.
2. Store the centroids found in an index.
3. Delete (only if the use case allows it) all the members of each cluster except the centroids.
4. For each new record to be stored in the DB, compare it with the elements of the index: if the distance from all of them is too high, add the new record to the index.
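A minimal sketch of step 4, assuming a generic `distance` function and a `max_distance` cut-off (both names are illustrative, not taken from any library):

```python
def deduplicate(record, index, distance, max_distance):
    """Step 4: return the canonical form of `record`, updating the index if needed.

    `index` is a plain list of centroid strings, `distance` any string distance,
    `max_distance` the largest distance at which two strings are still
    considered the same entity.
    """
    if index:
        nearest = min(index, key=lambda c: distance(record, c))
        if distance(record, nearest) <= max_distance:
            return nearest        # variant of a known entity: de-duplicated
    index.append(record)          # unseen entity: it becomes a new centroid
    return record
```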

K-medoids vs K-means
The main difference between K-medoids and the well-known K-means lies in the assignment of the centroids:
  • in K-means the centroid is the barycentre of the cluster, so it might not be a real record;
  • in K-medoids, the centroid is a record of the cluster that minimises the sum of the distances with respect to all the other points of the cluster. Of course, depending on the case, you might prefer to minimise the variance, or even define more sophisticated minimisation criteria;
  • K-means is usually much faster, since the barycentre is simply computed as the average of the positions of the records in the space;
  • K-medoids requires (for each cluster) the additional comparison of the distances of each point of the cluster from all the other points of the cluster (this explains the binomial terms in the complexity formula above).
K-medoids is more robust than K-means in the presence of noise and outliers.
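For illustration, a minimal sketch of the medoid update under the "sum of distances" criterion (the `distance` function is a placeholder for whatever string distance you prefer):

```python
def medoid(cluster, distance):
    """Return the member of `cluster` that minimises the sum of the
    distances to all the other members."""
    return min(cluster, key=lambda m: sum(distance(m, other) for other in cluster))
```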
...But there is still something that makes me unhappy with K-medoids: it's parametric, meaning it requires a priori knowledge of the number of clusters.
In the real world, simply forget it!

Customised version for auto-determination of the number of clusters.
  1. Instantiate the standard K-medoids with an arbitrary number of clusters (in the case analysed later I used just two clusters as initialisation).
  2. After the medoid-update step, compare the medoids with each other: if two medoids are very similar, then merge the two clusters.
  3. Calculate the mean distance within each cluster (this computation can take advantage of the operations performed during the medoid update): if the mean distance overcomes a threshold (which indicates that the points of the cluster are too sparse), then split the cluster. In this experiment I used a wild approach: each point of the split cluster has been considered as a new cluster.
  4. Go to step 2 until convergence.
You might argue that such a customisation doesn't require the "number of clusters" as a parameter but introduces two thresholds: one for cluster merging and another one for splitting.
Actually the threshold is just one:
  • if $\theta$ is the minimum acceptable similarity at which two strings can still be considered the same, then $1-\theta$ is the distance beyond which two strings are definitely different (assuming similarity and distance normalised to $[0,1]$)!
The advantage of the approach lies in the fact that the threshold can easily be estimated and verified. Estimating the number of clusters is not that easy!
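Putting the pieces together, here is a minimal sketch of the merge/split loop, assuming a `similarity` function normalised to $[0,1]$ and the single threshold $\theta$ discussed above; it is a simplified illustration of the idea, not the exact code used for the experiment:

```python
import random

def medoid(cluster, similarity):
    # the member maximising the total similarity to the rest of the cluster
    return max(cluster, key=lambda c: sum(similarity(c, o) for o in cluster))

def cluster_strings(records, similarity, theta, max_iter=100, seed=0):
    """Non-parametric K-medoids sketch: start from two arbitrary clusters,
    merge clusters whose medoids are more similar than `theta`, and split
    clusters whose mean similarity to the medoid falls below `theta`."""
    random.seed(seed)
    medoids = random.sample(records, 2)                    # step 1
    for _ in range(max_iter):
        # assignment: every record goes to its most similar medoid
        clusters = {m: [] for m in medoids}
        for r in records:
            clusters[max(medoids, key=lambda m: similarity(r, m))].append(r)
        # medoid update for each non-empty cluster
        updated = {medoid(c, similarity): c for c in clusters.values() if c}
        # step 2 (merge): drop a medoid that is too similar to one already kept
        merged = []
        for m in updated:
            if all(similarity(m, k) < theta for k in merged):
                merged.append(m)
        # step 3 (split): sparse clusters are broken up; wildly, every member
        # of a split cluster becomes its own cluster
        new_medoids = []
        for m in merged:
            members = updated[m]
            mean_sim = sum(similarity(m, x) for x in members) / len(members)
            if mean_sim < theta and len(members) > 1:
                new_medoids.extend(members)
            else:
                new_medoids.append(m)
        if set(new_medoids) == set(medoids):               # step 4: convergence
            break
        medoids = new_medoids
    return medoids
```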

Experiment: Address de-duplication.
As data set, I downloaded from this link the addresses of pizzerias in New York (I'm Italian, and I love pizza :D).
To assess the accuracy of the algorithm, I perturbed each entry by introducing Gaussian noise.

Data Perturbation:
The perturbation method I used is based on randomised swapping of characters of the string (a minimal sketch follows the list):
  1. Select the number $n$ of characters you want to swap. This number is drawn within the range [0, StringLength[str]] using a Gaussian distribution.
  2. Draw $n$ pairs of integers (each within [1, StringLength[str]]).
  3. For each pair, swap the character at the first position with the character at the second.
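A minimal sketch of the perturbation, assuming the number of swaps is drawn from a Gaussian whose parameters are not stated above (the ones below are arbitrary):

```python
import random

def perturb(s, sigma=2.0):
    """Swap a random number of character pairs in the string `s`.

    The number of swaps is drawn from a Gaussian (sigma is an assumption)
    and clipped to [0, len(s)]."""
    chars = list(s)
    n = min(int(round(abs(random.gauss(0, sigma)))), len(chars))
    for _ in range(n):
        i, j = random.randrange(len(chars)), random.randrange(len(chars))
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(perturb("2 Lafayette St, New York"))  # e.g. '2 Lafeyatte St, New York'
```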
Distance
The definition of distance is crucial: several options are available. In this experiment I used the common Needleman-Wunsch similarity. I'm pretty sure that good results can also be achieved with the Jaro-Winkler and Levenshtein distances.
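For completeness, a minimal sketch of a normalised similarity in $[0,1]$; to keep the code short it uses the plain Levenshtein distance (one of the alternatives mentioned above) rather than Needleman-Wunsch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised to [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```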

Experiment Setting
First, I estimated the parameter of the customised K-medoids algorithm: the merging/splitting threshold.
Its value depends on the nature of the data.
  • What's the minimum acceptable similarity between two strings for them to be considered the same?
I randomly selected several pairs of strings, measured their similarities and sorted them by value.
I noticed that a matching threshold of about 70% is the good one.
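The estimation itself takes only a few lines (the `similarity` function and the sample size are, again, just assumptions of this sketch):

```python
import random

def estimate_threshold(records, similarity, n_pairs=200, seed=0):
    """Print the sorted similarities of random pairs: eyeballing where the
    'same entity' pairs start suggests the merge/split threshold."""
    random.seed(seed)
    sims = sorted(similarity(*random.sample(records, 2)) for _ in range(n_pairs))
    for s in sims:
        print(f"{s:.3f}")
    return sims
```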
In the next posts I'll show a more refined technique based on the convergence slope of the clustering... but that's another story.

Results:
The data set contains 50 records:
  • 10 different addresses
  •  for each address 5 deviations generated with the aforementioned perturbation.
The customised K-medoids found the best result (that is, 10 clusters each containing the 5 deviated versions of an address) in ~20% of the cases (first bar of the chart below).
The chart below shows how many times (out of 100 trials) the algorithm found exactly $k$ clusters.
The first bar says that 22 times out of 100 trials, the algorithm returned 10 clusters.
  • As you can see, in 70% of the cases the algorithm partitions the data set into [9, 12] clusters!
Below, the best and worst cases.
Best case: achieved in 20% of the cases. Steps to converge: ~26.
Worst case: returned 17 clusters. Convergence after 40 steps.
Convergence of the algorithm:
The time to converge is the following:

  • On average the algorithm converges after 36 steps.
I noticed that the longer the algorithm takes to converge, the lower the probability of correctly determining the number of clusters.

As usual: stay tuned!
cristian