Tuesday, December 9, 2014

Partitional clustering: number of clusters and performances a quantitative analysis

Partitional clustering methods as k-medoid or k-means require an input parameter to specify the number of clusters to partition the data.
The complexity in time of such algorithms strictly depends on
  • the number of clusters used to initialise the computation.
  • the steps to update the centroids.
Whenever the similarity distance doesn't allow the determination of the centroid thru the analytical methods, the complexity in time tends to explode.
In the post I show an heuristic to minimise the complexity in time in non-Euclidean space.
Number of computational steps for standard k-mean executed. Chart depicts  the steps by number of points $N$ in the range [1k,10k] and  by number of clusters in the range [5,$N/10$].
The cardinality of the clusters has been simulated assuming uniform distribution.
The red line indicates the minimum number of steps.
K-means or k-medoids are widely used because its implementation is very easy  and the complexity in time is low (it's linear over the iterations*#clusters#*#points).
Is the complexity linear  for whatever problem?
The answer is no!

The problem
The problem arises when the objects to be clustered don't lie in a space for which the computation of the centroids can be done thru analytical methods.
In this context, the computational complexity to update the centroids becomes preponderant respect the other steps and the overall complexity is not any longer linear in time.

The complexity in case of non-Euclidean space
In a Euclidean space, the points of a cluster can be averaged. 
In non-Euclidean spaces, there is no guarantee that points have an “average” (...some puritan of the measure theory might stuck up their nose on that) so we have to choose one of the members of the cluster as centroid.
As explained in the former post, for each cluster the computation of the centroid requires (in case of symmetric distance) $\binom{K_i}{2}$ computational steps.
The overall complexity (for each iteration) is the following:
  $ \sum_{i=1}^k \binom{z_i}{2}+k \cdot n$ 

A dataset having several points might require a number of computations simply not feasible; in that case it's essential the minimisation and optimisation of each param of the algorithm to reduce the complexity.
A numeric example will clarify the point.
Let's assume a data set of 10.000 elements, and let's assume that the compression rate of the clustering is 90% (meaning the number of clusters are 1.000).

  • If the clusters cardinality is homogeneous, than the number of steps for the first iteration will be = 10.045.000.
  • If the clusters cardinality is highly unbalanced, for instance in case that a cluster contains 95% of the points, than the number of steps is = 50.360.655.
As you can see... even if the data set is relatively small the computational effort is very high!

Determination of the #clusters that minimises the complexity in time

There are no many choices ... the only option is to determine the number of clusters that minimises the computational steps.

You might argue that the # clusters should be selected according to the distribution of the data set, but it's also true the following:
  • In the real world you don't know a priori the exact number of clusters and all the heuristics to estimate them are quite often not really helpful in non-Euclidean space with several points. I prefer so, to have a sub-optimal clustering solution but with much faster answer :)
  • clustering analysis is usually one of the many analysis that usually you perform to understand your data, therefore even if the clustering is not perfect, it doesn't hurt too much. Clustering it's just one piece of the entire puzzle!
The approach
If the clustering divides the data set in clusters having more or less the same size, the  solution is quite straightforward, we can consider the minimum of the function described above, thru the analysis of first derivate.
Assuming that each $z_i = \frac{n}{k}$ then the formula can be rewritten as:
$ f(n,k)=\frac{n^2}{2 k}+k n-\frac{n}{2}$
The value of $k$ that minimises the first derivative of the such function is:
$ \frac{\partial f(n,k) }{\partial k}=\frac{\partial}{\partial k} \frac{n^2}{2 k}+k n-\frac{n}{2} = n-\frac{n^2}{2 k^2}$.
And it takes value = 0 with $k= \frac{\sqrt{n}}{\sqrt{2}}$.

In all the other cases for which the clusters size is not homogeneous, then we can easily simulate it!

Computations for a data set of 10k points.
 The computations are estimated by number of clusters (cluster size ~ [2,1000]) and with unbalanced clusters sizes.
The red line indicates the minimum. 
As you can see the difference between  the computational steps to be executed in case of "best" configuration of #clusters and all the other possible configurations is quite impressive: it's almost 1*10^7 in worst case. And this is for each iteration!
I let you calculate how much time you can save working with the best configuration :).
Stay tuned!

Sunday, September 21, 2014

Text Clustering, a non parametric version of K-medoids algorithm for data de-duplication

It's presented a quick and effective method to leverage text clustering for data de-duplication and normalisation.
The customised version proposed in this post bypasses the very well known problem of the assignment of the number of clusters.
Text Clustering results: each row depicts a cluster found. 

One of the most common problems in the data management  is about the consistency: customer addresses, company names and whatever attribute in string format can be represented in multiple formats, often mutually equivalent.
Data normalisation presents some challenges, especially in the following scenarios:
  • Variations/mutation of the trusted source of information are not a priori known.
  • The data to be normalised are characterised by high volume.
  • There are no deterministic and static rules to normalise the data.
  • ...you don't have plenty time to find the perfect solution :), and you have to deliver fast!
Why the brute force approach it's wasted time!
Let's assume that you already found for your specific use case the best distance to judge how far is a record  from another one.
To make the story a bit more realistic, let's assume hereafter that our attribute to be normalised is a string.
...I would hope that nobody of you really thought to create a big matrix to keep track of the all possible distances among the possible values of the attribute... Why? If you have for instance 10k possible values, you need to computate at least 49.995.000 comparison (assuming that your distance is symmetric!), that's because the complexity follows Binomial[N,2].
You might perform the brute force approach over a statistically representetive sample set, but in that case it's really hard the validation of the real accuracy of your results.

Why Partitional Clustering is efficient?
As you know, this is a "practical mean" blog, so I encourage you to refer to real books if you are looking for the theoretical proof. Anyhow it's easy to get convinced about the following:
  • number of comparisons to be performed assuming $K$ clusters, $I$ iterations, $N$ elements to be clustered: $N \cdot K \cdot I$.
  • number of comparison to update the centroids: $I \cdot (\binom{K_1}{2}+\binom{K_2}{2}+\cdots+\binom{K_K}{2}) << \binom{N}{2}$

The comparison of the computational complexity of the brute force approach with the partitional clustering is well depicted by the below example, where I assumed ten iterations to get the convergence, 10 clusters, a data set that grows from 1000 to 20.000 points:
The computational effort: in red the  brute force approach, in blue the clustering.
The approach in a nutshell 
1. Group together records that look similar each other using k-medoid clustering 
2. Store in a index the centroids found.
3. delete (only if the use case allows it) all the member of the clusters with exception of the centroids.
4. for each new record to be stored in the DB, perform the comparison with the elements of the index: if the distance is to high, add the new record to the index.

K-medoids vs K-means
The main difference of K-medoid respect the well known K-means is in the assignment of the centroids:
  • in K-means the centroid is the barycenter of the cluster, so it might not be a real record.
  • in K-medoids, the centroid is a record of the cluster that minimises the sum of the distances respect all the other points of the cluster. Of course, depending on the case, you might prefer minimise the variance, or even define more sophisticated minimisation criteria.
  • K-means is usually much faster, since the computation of the barycenter it's usually taken as average of the position of the records in the space,
  • K-medoids requires (for each cluster) the additional comparison of the distances of the points of the cluster from all the other points of the clusters (that explains the formula mentioned in the complexity formula.
K-medoids is more robust than k-means in the presence of noise and outliers.
...But there is still something that makes me unhappy with K-medoids: It's parametric, meaning it requires the priori knowledge of the number of clusters.
In the real world simply forget it! 

Customised version for auto-determination of number of clusters.
  1. Instantiate the standard k-medoids with arbitrary number of classes (in the case analysed later I used just two clusters as initialisation).
  2. after the operation to update the medoids,  compare the medoids with the others: if two medois are very similar, then merge the two clusters.
  3. calculate the mean distances (this computation can take advantages of the operations performed during the medoids updating) for each cluster: if the mean distance overcome a threshold (that indicates the sparsity of the points of the cluster) then split the cluster. In this experiment I used a wild approach: each point of the split cluster have been consider as new cluster.
  4. goto step 2 until convergence. 
You might argue that such customisation doesn't require the "number of cluster" as param but introduce two thresholds: one for the clustering merging and another one for the splitting.
Actually the threshold is just one:
  • if $\theta$ is the minimum acceptable distance for which two strings might be considered as the same, then $1- \theta$ is the threshold for which two strings are absolutely different!
The advantage of the approach lyes on the fact the estimation of the threshold can be easily estimated and verified. The estimation of #clusters it's not that easy!

Experiment: Addresses de-duplication.
As data set I downloaded from this link the addressees of the pizzeria in New York (I'm Italian, and I love pizza :D).
To assess the accuracy of the algorithm I randomised each entry introducing a gaussian noise.

Data Perturbation:
The perturbation method I used, is based on randomised swapping of characters of the string:
  1. Select the number $n$ of characters you want to swap. Such number is drawn within the range [0,StringLength[str]] using a gaussian distribution.
  2. Draw $n$ couples of integers (between[1,StringLength[str]]).
  3. For each couple swap the corresponding first character with the second.
The definition of distance is crucial: several options are available. In this experiment I used the common Nedelman-Wunsch similarity. I'm pretty sure that good results can be achieved also with Jaro-Winkler and Levenshtein distances.

Experiment Setting
I firstly estimated the param of the customised k-medoid algo: the Merging/Splitting threshold.
The value is dependent by the nature of the data.
  • What's the minimum acceptable distance between two strings to be considered the same?
I selected randomly several couples of strings, measured the distances and sorted by value.
I noticed that 70% of matching it's the good one.
In the next posts I'll show a technique more refined based on the convergence slope of the clustering... but that's another story.

The data set contains 50 records:
  • 10 different addresses
  •  for each address 5 deviations generated with the aforementioned perturbation.
The customised K-medoids found the best result (that is 10 clusters containing the 5 deviated addresses) ~20% of the cases (first bar of the below chart)
The below chart shows how many time (out of 100 trials) the alto found exactly $k$ number of clusters.
The first bar says that 22 times out 100 trials, the alto returned 10 clusters.
  • As you can see, in 70% of the cases the algorithm partitions the data set in [9,12] clusters!
Below the best and worst case
Best case: achieved 20% of the cases. Steps to converge ~26.  
Worst case: returned 17 clusters. Convergence after 40 steps.
Convergence of the algorithm:
The time to converge is the following:

  • On average the algorithm converges after 36 steps.
I noticed that longer is time to converge, lower is the probability to determine correctly the number of clusters.

As usual: stay tuned!

Tuesday, July 29, 2014

Fractals: parameters estimation (part IV)

In the former posts we discussed about the following points:
  1. There are special points in the contour of the fractal that can be used to derive its contour.
  2. Such points can be used to describe the fractal thru an iterative and deterministic algorithm. 
It's still open the main issue: can we leverage the two above findings to determine the fractals parameters?
The answer is yes, provided that we use a good technique to extract the contour points of the fractal image.

The algorithm (practical explanation)
We already noticed that points in the contour of the fractal are in tight relationship each other.
Since each linear transformation is described by exactly six parameters, for their determination we need at least 6 points for each transformation function.
The below image makes the concept clear:
Image-1: Relationship between points of the fractal contour

The steps to obtain the estimation of the params are the following:
  1. Extraction of the contour points thru convex hull algorithm;
  2. Build relationships among the points. 

Left Image: Extraction of contour points thru Convex Hull.
Right Image: Relationship among the points extracted.
How to build the relationships among the points?
As showed both in the deterministic algorithm presented in the last post and in the above image, the points preserve an ordering that can be leveraged to simplify the complexity of the problem.
Consider for instance the fern case:
  • the transformation that leads to the fixed point lying on the contour (highlighted in image-1) can be obtained (as explained in the points 1,2 of the deterministic algorithm) just creating the relationship among the consecutive points of the set $ l_1 $ described in Image-1 and in the below image:

Set of points to determine the Transformations
At this point we obtained the parameters of the first transformation, but still, we don't know neither the number of transformations we really need to describe the fractal image,  nor the probability associated to them.
For the first problem, we have to strategies: the brute force, that is, try all the possible combinations, or more wisely try the soft computing approach described below.

The fuzzy clustering approach (...unfortunately here some formulas are required)
The fuzzy clustering is a very well known algorithm that assigns the membership of a point to a cluster with a certain "degree", so the point $p_i$ might belong to the cluster $A$ with degree $x$ and to the cluster $B$ with degree $y$.

In this problem I modified the algorithm in this way:
  • the clusters are the mapping functions that we want to discover.
  • the probability associated to the fractal maps can be easily derived by the "size" of each cluster.
  • The fuzzy-membership function is based on the minimisation of the distance between the point obtained applying to $x_i $ the contractive map $x_i^{'} = \tau_j(x_i) $  and all the other points falling in a reasonable neighbour. The animation below describes it.

  • The update step of the centroids is aimed to minimise the distance between $ d(\tau (x_i)),\phi(x_i,\tau_j))$ computed over each $\tau$ and each $x_i$, where 
  • $\phi(x_i,\tau_j) = \arg\min_{x_s} d(\tau_j(x_i),x_s) $
  • To minimise the above function,  I used the gradient technique, working on the the following cost function:
  • $ E(\theta)= \sum_{i=1}^{n}{(\mu_{x_i}(\tau_j)\cdot[d(\tau_j(x_i),\phi(x_i,\tau_j))]^2)} $
  • For each mapping function,  the correction to each param is the following:
  • $\theta_i= \theta_i - \eta \frac{\partial E(\theta)}{\partial \theta_i}$

    Starting from the image of the fern (a jpg image containing 100.000 points) the application of the algorithm for the determination of the contractive maps gives the following result:
    Contractive maps detection algorithm: in black the points of the original fern, in red,green, blue the three estimated maps.
    • The results are not bad, the weakness of the algorithm lays on the extraction of the points of the Convex hull of the fractal.
    • The application of smarter contour extraction algorithm should improve further the accuracy of the algorithm.
    In the last posts about fractals I often abused the term contour of fractal. ...Theoretically speaking, a fractal cannot have a contour, but to make clear the pragmatical aspects I decided voluntarily such terminology.

    Saturday, June 7, 2014

    Fractals: a deterministic and recursive algorithm to reproduce it (Part II)

    Fern fractal estimation thru recursive and deterministic algorithm.

    A fractal can be described as fixed point of a Markov process: given the proper contractive maps it's easy to generate it.
    The question is: given the contractive maps, is it possible to generate the fractal using a pure deterministic approach?

    We already observed that the iteration of the contractive maps starting from a fixed point that lyes on the border of the fractal returns points of the convex hull of the fractal.

    • What's happen if we recursively apply such schema to the points obtained from the the above procedure?
    • Is it possible recreate the markov process (or at least approximate it) removing any probabilistic aspects?
    The below algorithm, given the contractive maps of the fractal, returns at each iteration better approximation of it.

    At each iteration it considers the fractal as better approximation of the contour of the original image.

    The Algorithm (by practical means)
    1. Consider a fixed point $P_0$ of a contractive map $\tau_1$ that lyes on the convex hull of the fractal.
    2. Choose randomly a point $P_i$ of the fractal and apply the above contractive map until the distance to $P_0$ is negligible.
    3. Map the point calculated at step 2 using sequentially all the contractive maps.
    4. Map each point obtained from point 3 the former step with $\tau_1$ till the distance to $P_0$ is negligible.
    5. If[ITERATIONS< K]:
      1. K--;
      2. For each point obtained from point 4 go to point 3.
    To explain how it works I plotted the list of points got using $K=1$ iterations of the algorithm:

    Bigger is $K$, more accurate will be the result.
    The above procedure works only if the contractive map $\tau_1$ has a fixed point that lyes on the convex hull of the fractal!!


    I iterated the algorithm with $K=4$ times. At each iteration the algorithm returns a deeper description of the contour of the fractal (...even though definition of contour for a fractal doesn't make any sense, it gives at least a practical meaning):
    Results of the Recursive Algorithm iterated with K=1,2,3,4. 

    If we try to overlap the results obtained with the original fern we get:
    Original Fern, and the overlap with the results obtained using the recursive algorithm (K=4).
    Conclusion and next steps
    I showed how to depict the IFS generated thru the markovian process as a recursive and completely deterministic process.
    We noticed also, that in the fractal there are special points (as for instance $P_0$) that play a crucial role to describe the IFS.

    The question now is the following:
    is it possible leverage such special points and the above recursive algorithm to estimate the parameters of the contractive maps?
    ... I will show you a procedure that partially answer the question and some other example of the algorithm above described!
    Stay Tuned.

    Saturday, May 24, 2014

    Fractal parameters estimation thru contour analysis - Part I

    I really love fractals theory. Nothing fascinates me as the amazing structures generated by a chaotic processes that describe (hyperbolic) fractals.
    It's fairly easy to draw a fractal image:
    • The input
      • A small set of very easy function (linear transformations)
      • A probability to choose one of the aforementioned functions.
    • The process
      • Choose randomly a starting point.
      • Choose randomly one of the above functions according to the above defined probability.
      • Map the starting point using the above selected function.
      • iterate as long as you want :)
    • Admire the meaning of the life!
      • just plot the points
    Wherever you are, whatever decision you make, despite the chaos, the overall picture will be always the same. More you will move, more clear the picture will be!

    When I approached for the first time fractals,  after the reading of the great Barnley's book (Fractals Everywhere) I opened the laptop and I wrote the routine to draw my first IFS!
    ...By mistake I also plotted the lines that join the points:

    ...It's quite a messy, isn't it?
    Sometimes it's better to give up to connect the points, just relax and wait until everything will be clear. If we focus on every single detail we risk to lose what really counts!

    What is a fractal (Iterated Fractal System)?
    There are many formal definitions to describe an IFS, in my opinion the most effective describe a fractal as the fixed point of the Markov process (mentioned above).
    The process converges to the fractal :)

    The powerful of the fractal
    In literature there are plenty articles in which fractals are used to solve complex problems (financial analysis, stock exchange preditictions, description of biological processes,...).
    Also in Computer Science fractals fond a good collocation, think about the image compression: the ivy leaf image takes around 25MB of space, all the information to obtain it thru the IFS takes less 1KB!!
    The problem 
    Given the ivy leaf image, what are the parameters of the Markov process to generate it?
    A bit more formally what we have to estimate is:
    • A set of functions: $\{ \tau_1, \cdots,\tau_N \}$, where $\tau_i(p_0)=A_1 \cdot p_0 +Q_1$, where:
    $ A= \begin{bmatrix}\alpha & \beta\\ \gamma & \delta \end{bmatrix} Q= \begin{bmatrix} \epsilon\\ \zeta \end{bmatrix} $
    • We need also the estimate the probability to choose each $\tau_j$
    The combination of the collage theorem and the box counting approach is the most common technique to solve the problem.
    When I was student I approached the problem from a different angle. I have to say that the results obtained were partials but I still think that something more can be done :)
    Before to start we need one more notion: the contractive maps (have a look at Banach Theorem)
    Under certains conditions the iteration  $\tau_j$ led to a fixed point:
    Example of contractive map applied to an ellipsoid.
    It converges to a fixed pointed.

    First conjecture:

    • An IFS is characterised by a fixed point that lies on its convex hull.
    • From a fixed point that lies on the border of the IFS, the iterations of the contractive maps that generate the fractal return the convex hull of the fractal.

    The ivy leaf IFS is generated by 4 contractive maps. Each color describes the map used to generate the point.
    The above animated gif shows the meaning of the conjecture.
    An experimental evidence about the fact that at least one fixed point lies on the convex hull of the fractal can be obtained changing the params of the maps:
    The light blue points depict the fixed point of the maps used to plot the fractals.
    Despite the changes in the maps, the fixed point on top of the leaf still lays on the convex hull of the fractal.
    In the next post I'll show you a nice recursive algorithm I found to obtain different levels of convex hull for an IFS.
    Stay tuned

    Friday, May 2, 2014

    Waiting Time Polynomials: tech explanation - last part

    This is the last step to complete the explanation of the waiting time polynomials formula.

    Unfortunately it's a bit technical, but I'm sure that it can be understood without deep math knowledge.
    At the very end if you can't explain it simply you don't understand it well enough! (A. Einstein)

    The trick
    Last time we left with the tally of overall waiting time $ w(x_i) = \phi(x_i)-|x_i| $ where $\phi(x_i)$ returns just the position of the last $|x_i|$ in the vector $V$.
    Let's have a look at the following example that will be used during the remaining steps.
    There are two questions that might be answered:

    • given $|x_1|= i, |x_2|= j, |x_3|= k $ what are the tuples $\{w(x_1),w(x_2),w(x_3)\}$?
    • given a $\{w(x_1)=I,w(x_2)=J,w(x_3)\}=K$, how many vectors $V$ can be built?
    To answer to the first question, I noticed that the three coloured cells are the only ones that really count.
    The idea is the following:

    1. consider the three cells as placeholders
    2. analyse the admitted values for each of them
    3. replace the placeholders with all the possible permutations of the alphabet $\{x_1,x_2,x_3\}$.

    Let's start with the case depicted in the above image, where we assumed that $\phi(x_1) < \phi(x_3) < \phi(x_2) $, then we have the following facts:
    • $ \phi(x_1)$ can take values between: $0 \leq \phi(x_1) \leq |x_1|+|x_2|+ |x_3|-2$
    • $ \phi(x_2)$ can take just one value: $|V|=|x_1|+|x_2|+ |x_3|$
    • The upper bound of $ \phi(x_3)$ is $|V|-1$ because it can slide till the second last element of $V$, that is $\phi(x_2)-1$
    • what about the lower bound of $\phi (x_3)$? We have two cases depicted in the below image:

    To sum up, so far we explained the following part of the formula (I hope you don't mind that I changed a bit the indexes notation):

    We have now to consider that for each configuration of $\{w(x_1)=I,w(x_2)=J,w(x_3)\}=K$ we can have more than one vector $V$.
    Do you like combinatorics? The remaining part it's just matter of tally, and the only formula we need is the formula for permutation with repetitions. I let you rub up the concept on your favourite website for trivial combinatorics.

    The formula can be split in two chunks, because we have to blocks of cells to be filled

    1. In how many ways can we fill the cells between the positions $[1,\phi(x_1)]$?
    2. In how many ways can we fill the cells between the positions $[phi(x_1),phi(x_2)]$? 

    Let's answer the first question we have to find the values for the denominator of the following:
    • we have $|x_1|-1$ cells that can be filled.
    • it contains all the instances of $x_1$ (except for the last the occupied $\phi(x_1)$)
    • the number of $x_3$ instances depends by $\phi(x_1)$ and $\phi(x_3)$:

    • the computation of the number of instances of $x_2$ in the first slot is straightforward, and it can easily derived by difference:
      • $\frac{(\phi(x_1)-1)!}{(|x_1|-1)!(|x_3|-Min(|x_3|,\phi(x_3)))!\#(x_3)!}$;
      • $(|x_1|-1)+(|x_3|-Min(|x_3|,\phi(x_3)))+\#(x_2)=\phi(x_1)-1$
      • so $ \#(x_2)= \phi(x_1)-|x_1|-(|x_3|-Min(|x_3|,\phi(x_3))) $
    This explains the following boxed part of the formula:

    The final step is to count in how many ways we can fill the slot 2 depicted by the interval $[phi(x_1),phi(x_2)]$ and to make the formula more readable let's rename $(|x_3|-Min(|x_3|,\phi(x_3))= \epsilon$.
    As we did for the first slot we have to identify the values of the denominator of the below formula:

    • Out of $|x_3|$ instances, $\epsilon$ have been placed in the slot 1, so the slot 2 contains exactly $|x_3|-1- \epsilon$.
    • again by difference we can get the instances of $x_2$:
      • the occurrences of $x_2$ before $\phi(x_3)$ are exactly $\phi(x_3)- (|x_1|+|x_3|)$
      • the occurrences of $x_2$ in the slot 1 (listed above) are: $ \#(x_2)= \phi(x_1)-|x_1|-\epsilon $
      • that is : $ \#(x_2)=\phi(x_3)- (|x_1|+|x_3|)- \phi(x_1)+|x_1|+ \epsilon$
      • finally we have: $ \#(x_2)=\phi(x_3)-\phi(x_1)-|x_3|+ \epsilon$

    That's complete the proof of the formula.
    It's quite easy now extend the formula to more variables. The only expedient to make it easier is to remove from the formula the operator $Min$ splitting the formulas in two branches. 
    I'll show it in paper.
    Note about Complexity
    What's the most painful point of this formula?
    ... The introduction of placeholders requires to apply the formula for each permutation of the variables involved. It means that having $k$ variables we need to calculate the formula $k!$
    Anyway I don't expect to use the formula with large set of variables, after all the principle of selecting the right and small set of features is always recommended!

    As usual Stay Tuned.

      Saturday, April 19, 2014

      Waiting Time Polynomials: how to derive the analytical formula: Part IV

      Introduction before you start
      I got many clarification requests about the Waiting Time Polynomials I published on the blog in the last three posts.
      The paper is almost ready to be submitted for review, but I think that some technical explanation might be interesting also for not academic audience.
      I consider myself a curious and hungry seasoned student, and I know how can be tedious read formulas and mathematical passages especially when it comes from a blog!!
      So why technical explanations?
      The answer is in the following quote of one of my favourite scientists, Gregory Chaitin. In "The quest for Omega" he wrote:

      The books I loved were books where the author’s personality shows through, books with lots of words, explanations and ideas, not just formulas and equations! I still think that the best way to learn a new idea is to see its history, to see why someone was forced to go through the painful and wonderful process of giving birth to a new idea! To the person who discovered it, a new idea seems inevitable, unavoidable. The first paper may be clumsy, the first proof may not be polished, but that is raw creation for you, just as messy as making love, just as messy as giving birth! But you will be able to see where the new idea comes from. If a proof is “elegant”, if it’s the result of two-hundred years of finicky polishing, it will be as inscrutable as a direct divine revelation, and it’s impossible to guess how anyone could have discovered or invented it. It will give you no insight, no, probably none at all. 

      That's the spirit that leads the following explanation!

      Definition of the problem
      Given an alphabet of 3 elements $\{X_1,X_2,X_3\}$, the function $w(X_i) $ counts  the number of failed trials before the last event $ X_i $.
      Consider now the following configuration: \[ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0 \]

      • What are the admitted sequences  $\{w(X_1),w(X_2),w(X_3)\}$ ?

      Step I: Find all the possible configurations of events
      How can we list the sequences of length $Z$ that can be built with $ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0$ ?

      Example of overall waiting time $w(x_i)$  in a succession of events.
      • once we set the values of the first two variables, the third it's determined by $Z-i-j$.
      • we imposed that all the variables occur at least once, so we $X_1$ can assume all the values between $[1,Z-2]$.
      • for each value of $X_1$ the variable $X_2$ can assume values between $[1,Z-i]$.
      •  $p_i$ is the probability that $X_i$ occur in a Bernullian trial.
      Now we have all the ingredients to make the cake:

       $ \sum_{i=1}^{Z}\sum_{j=1}^{Z-i}\sum_{k=1}^{Z-i-j}{p_1^ip_2^jp_3^k}$  

      In the first two summations,  $i$ assumes values between $[1,Z]$ just to keep the formula cleaned.
      ...I let you proof why the result doesn't change :).
      last point about this step:the limit of the above summation $ Z \rightarrow \infty = \frac{p_1 p_2 p_3}{\left(p_1-1\right) \left(p_2-1\right) \left(p_3-1\right)}$ 
      Such limit will be used to build the probabilistic density function.
      Curiosity (helpful for complexity analysis...):
      • The number of sequences that can be built with vectors of length $[3,Z]$ are $\binom{Z}{3}$
      • The number of sequences that can be built with vectors of length $Z$ are $\binom{Z}{2}$
      Step II: Waiting for an event!
      What's the easiest way to describe the overall waiting time for an event in a finite succession?
      There are many ways to get the $w(x_i)$, the easiest I found is given by the position of the last occurrence of $x_i$ minus the number of occurrences of $x_i$.
      For instance, let's consider $w(x_1)$:
      • The position of the last occurrence of $x_1= 8$;
      •  $\left \vert{X_1} \right \vert = 4 $ 
      • $w(X_1)=4$
      Where we are:
      The first two steps explain the circled pieces of the formula:

      What the "overall waiting time" for?
      For each event $X_i$ we are counting the holes among all the occurrences, so smaller is the overall waiting time, closer each other are the events $X_i$: it's a measure of proximity for the occurrences of $X_i$.
      What I did, is to extend such measure (it would be interesting to prove that it's really a measure!) to different kind of events (aleatory variables) ${X_1, X_2,...,X_n}$ over the discrete line of the time.
      There are several area for which such kind of analysis might be helpful, I showed last time an its application as powerful document classifier, where each variable $X_i$ is a word of a document.
      If we consider a document as a succession of $Z$ words, the proximity measure inducted by the waiting time polynomials is a sort of finger print for the document, since for similar documents we expect that the same words are characterised by similar overall waiting time.
      Moreover, the dependency among the words are considered, since we are taking in account simultaneously an arbitrary number of words (the alphabet ${X_1, X_2,...,X_n}$).

      In the next step I'll explain the logic to get the remaining pieces of the puzzle, that will make easier the generalisation of the approach to an arbitrary alphabet.
      Stay Tuned!