Text & Data Mining by practical means: Waiting Time Polynomials: how to derive the analytical formula: Part IV

Saturday, April 19, 2014

Waiting Time Polynomials: how to derive the analytical formula: Part IV

Introduction before you start
I got many clarification requests about the Waiting Time Polynomials I published on the blog in the last three posts.
The paper is almost ready to be submitted for review, but I think that some technical explanation might be interesting also for not academic audience.
I consider myself a curious and hungry seasoned student, and I know how can be tedious read formulas and mathematical passages especially when it comes from a blog!!
So why technical explanations?
The answer is in the following quote of one of my favourite scientists, Gregory Chaitin. In "The quest for Omega" he wrote:

The books I loved were books where the author’s personality shows through, books with lots of words, explanations and ideas, not just formulas and equations! I still think that the best way to learn a new idea is to see its history, to see why someone was forced to go through the painful and wonderful process of giving birth to a new idea! To the person who discovered it, a new idea seems inevitable, unavoidable. The first paper may be clumsy, the first proof may not be polished, but that is raw creation for you, just as messy as making love, just as messy as giving birth! But you will be able to see where the new idea comes from. If a proof is “elegant”, if it’s the result of two-hundred years of finicky polishing, it will be as inscrutable as a direct divine revelation, and it’s impossible to guess how anyone could have discovered or invented it. It will give you no insight, no, probably none at all.

That's the spirit that leads the following explanation!

Definition of the problem

Given an alphabet of 3 elements $\{X_1,X_2,X_3\}$, the function $w(X_i) $ counts the number of failed trials before the last event $ X_i $.
Consider now the following configuration: \[ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0 \]

What are the admitted sequences $\{w(X_1),w(X_2),w(X_3)\}$ ?

Step I: Find all the possible configurations of events
How can we list the sequences of length $Z$ that can be built with $ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0$ ?

Example of overall waiting time $w(x_i)$ in a succession of events.

once we set the values of the first two variables, the third it's determined by $Z-i-j$.
we imposed that all the variables occur at least once, so we $X_1$ can assume all the values between $[1,Z-2]$.
for each value of $X_1$ the variable $X_2$ can assume values between $[1,Z-i]$.
$p_i$ is the probability that $X_i$ occur in a Bernullian trial.

Now we have all the ingredients to make the cake:

$ \sum_{i=1}^{Z}\sum_{j=1}^{Z-i}\sum_{k=1}^{Z-i-j}{p_1^ip_2^jp_3^k}$

In the first two summations, $i$ assumes values between $[1,Z]$ just to keep the formula cleaned.

...I let you proof why the result doesn't change :).
last point about this step:the limit of the above summation $ Z \rightarrow \infty = \frac{p_1 p_2 p_3}{\left(p_1-1\right) \left(p_2-1\right) \left(p_3-1\right)}$
Such limit will be used to build the probabilistic density function.
Curiosity (helpful for complexity analysis...):

The number of sequences that can be built with vectors of length $[3,Z]$ are $\binom{Z}{3}$

The number of sequences that can be built with vectors of length $Z$ are $\binom{Z}{2}$

Step II: Waiting for an event!

What's the easiest way to describe the overall waiting time for an event in a finite succession?
There are many ways to get the $w(x_i)$, the easiest I found is given by the position of the last occurrence of $x_i$ minus the number of occurrences of $x_i$.
For instance, let's consider $w(x_1)$:

The position of the last occurrence of $x_1= 8$;
$\left \vert{X_1} \right \vert = 4 $
$w(X_1)=4$

Where we are:
The first two steps explain the circled pieces of the formula:

What the "overall waiting time" for?

For each event $X_i$ we are counting the holes among all the occurrences, so smaller is the overall waiting time, closer each other are the events $X_i$: it's a measure of proximity for the occurrences of $X_i$.

What I did, is to extend such measure (it would be interesting to prove that it's really a measure!) to different kind of events (aleatory variables) ${X_1, X_2,...,X_n}$ over the discrete line of the time.

Applications

There are several area for which such kind of analysis might be helpful, I showed last time an its application as powerful document classifier, where each variable $X_i$ is a word of a document.

If we consider a document as a succession of $Z$ words, the proximity measure inducted by the waiting time polynomials is a sort of finger print for the document, since for similar documents we expect that the same words are characterised by similar overall waiting time.
Moreover, the dependency among the words are considered, since we are taking in account simultaneously an arbitrary number of words (the alphabet ${X_1, X_2,...,X_n}$).

In the next step I'll explain the logic to get the remaining pieces of the puzzle, that will make easier the generalisation of the approach to an arbitrary alphabet.
Stay Tuned!
cristian