Text & Data Mining by practical means. Text mining, data mining, predictive analytics: a space to exchange ideas that work in enterprise contexts. By Cristian Mesiano.<br />
<br />
<b>Partitional clustering: number of clusters and performance, a quantitative analysis</b><br />
<b>Abstract</b><br />
Partitional clustering methods such as k-medoids or k-means require an input parameter specifying the number of clusters into which to partition the data.<br />
The time complexity of such algorithms strictly depends on<br />
<ul>
<li>the number of clusters used to initialise the computation.</li>
<li>the steps to update the centroids.</li>
</ul>
Whenever the similarity distance doesn't allow the centroid to be determined through analytical methods, the time complexity tends to explode.<br />
In this post I show a heuristic to minimise the time complexity in non-Euclidean spaces.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgakC-vENcnPVzWR7RVO7iokw4oqirjSkrykm-iUo9m6z939x4ON5pOrjxj8Y9PErdmKDQwVwi6HkV6zaxzmM7fEvy4wxIQrOawiI_zCxRzQRIFixcvEqNfL419W0sGNvFTEuCB9pefICc/s1600/clusterAnalysis.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgakC-vENcnPVzWR7RVO7iokw4oqirjSkrykm-iUo9m6z939x4ON5pOrjxj8Y9PErdmKDQwVwi6HkV6zaxzmM7fEvy4wxIQrOawiI_zCxRzQRIFixcvEqNfL419W0sGNvFTEuCB9pefICc/s1600/clusterAnalysis.bmp" height="444" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Number of computational steps for standard k-mean executed. Chart depicts the steps by number of points $N$ in the range [1k,10k] and by number of clusters in the range [5,$N/10$]. <br />
The cardinality of the clusters has been simulated assuming uniform distribution. <br />
The red line indicates the minimum number of steps.</td></tr>
</tbody></table>
<b>Introduction</b><br />
K-means and k-medoids are widely used because their implementation is very easy and their time complexity is low (it's linear in #iterations $\cdot$ #clusters $\cdot$ #points).<br />
Is the complexity linear for every problem?<br />
The answer is no!<br />
<br />
<b>The problem</b><br />
The problem arises when the objects to be clustered don't lie in a space where the centroids can be computed through analytical methods.<br />
In this context, the computational cost of updating the centroids becomes dominant with respect to the other steps, and the overall complexity is no longer linear in time.<br />
<br />
<b>The complexity in case of non-Euclidean space</b><br />
<span style="font-family: CMR10; font-size: 10pt;">I</span><span style="font-family: inherit;">n a Euclidean space, the points of a cluster
can be averaged. </span><br />
<span style="font-family: inherit;">In non-Euclidean
spaces, there is no guarantee that points have an “average” (...some puritan of the measure theory might stuck up their nose on that) so we have to choose one of the members of the cluster as centroid.</span><br />
<span style="font-family: inherit;">As explained in the former post, for each cluster the computation of the centroid requires (in case of </span>symmetric<span style="font-family: inherit;"> distance) </span>$\binom{K_i}{2}$<span style="font-family: inherit;"> computational steps.</span><br />
<span style="font-family: inherit;">The overall complexity (for each iteration) is the following:</span><br />
<span style="font-family: inherit;"> </span> $ \sum_{i=1}^k \binom{z_i}{2}+k \cdot n$ <span style="font-family: inherit;"></span><br />
<span style="font-family: inherit;"><br /></span>A dataset having several points might require a number of computations simply not feasible; in that case it's essential the minimisation and optimisation of each param of the algorithm to reduce the complexity.<br />
A numeric example will clarify the point.<br />
Let's assume a data set of 10,000 elements, and let's assume that the compression rate of the clustering is 90% (meaning the number of clusters is 1,000).<br />
<br />
<ul>
<li>If the clusters' cardinality is homogeneous, then the number of steps for the first iteration will be 10,045,000.</li>
<li>If the clusters' cardinality is highly unbalanced, for instance when one cluster contains 95% of the points, then the number of steps is 50,360,655.</li>
</ul>
As you can see... even if the data set is relatively small the computational effort is very high!<br />
<br />
<b>Determination of the #clusters that minimises the complexity in time</b><br />
<br />
There are not many choices... the only option is to determine the number of clusters that minimises the computational steps.<br />
<br />
You might argue that the #clusters should be selected according to the distribution of the data set, but the following is also true:<br />
<ul>
<li>In the real world you don't know a priori the exact number of clusters, and the heuristics to estimate it are quite often not really helpful in non-Euclidean spaces with many points. So I prefer a sub-optimal clustering solution with a much faster answer :)</li>
<li>Clustering analysis is usually one of the many analyses you perform to understand your data, so even if the clustering is not perfect, it doesn't hurt too much. Clustering is just one piece of the entire puzzle!</li>
</ul>
<b>The approach</b><br />
If the clustering divides the data set into clusters having more or less the same size, the solution is quite straightforward: we can find the minimum of the function described above through analysis of its first derivative.<br />
Assuming that each $z_i = \frac{n}{k}$, the formula can be rewritten as:<br />
$ f(n,k)=\frac{n^2}{2 k}+k n-\frac{n}{2}$<br />
The value of $k$ that minimises the function is found by setting the first derivative to zero:<br />
$ \frac{\partial f(n,k) }{\partial k}=\frac{\partial}{\partial k} \left( \frac{n^2}{2 k}+k n-\frac{n}{2} \right) = n-\frac{n^2}{2 k^2}$.<br />
It takes the value 0 at $k= \frac{\sqrt{n}}{\sqrt{2}}$.<br />
<br />
In all the other cases, where the cluster sizes are not homogeneous, we can easily simulate it!<br />
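As a sanity check, here is a minimal Python sketch of both routes; the Dirichlet draw is just one arbitrary way to simulate unbalanced cluster sizes, and the step count mirrors the formula above:<br />
<pre><code>import numpy as np

# Steps per iteration: sum_i C(z_i, 2) for the centroid updates,
# plus k*n for the assignment step (see the formula above).
def steps(sizes, n):
    return sum(z * (z - 1) // 2 for z in sizes) + len(sizes) * n

n = 10_000
print("analytic optimum for balanced clusters:", round((n / 2) ** 0.5))

# Unbalanced case: draw random cluster sizes for each candidate k
# and keep the k with the fewest steps (this mirrors the chart below).
results = []
for k in range(2, 1001):
    sizes = np.random.multinomial(n, np.random.dirichlet(np.ones(k)))
    results.append((steps(sizes, n), k))
print("simulated optimum (steps, k):", min(results))
</code></pre>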
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPXItk6Dqvr8wONZhPirlT5A4MuDpe0L08tw2uRknY_HfBpZDTZ4dE-4Vi4dAKn340FYcMNsRKYiMJQ6Ubljw27HwEsfkgzwONz2dfaxpev3B-OF-BnDXPZT86P3XcVL658H4-uBcr38/s1600/clusterAnalysis1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPXItk6Dqvr8wONZhPirlT5A4MuDpe0L08tw2uRknY_HfBpZDTZ4dE-4Vi4dAKn340FYcMNsRKYiMJQ6Ubljw27HwEsfkgzwONz2dfaxpev3B-OF-BnDXPZT86P3XcVL658H4-uBcr38/s1600/clusterAnalysis1.bmp" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Computations for a data set of 10k points.<br />
The computations are estimated by number of clusters (in the range [2,1000]) and with unbalanced cluster sizes.<br />
The red line indicates the minimum. </td></tr>
</tbody></table>
As you can see, the difference between the computational steps executed with the "best" configuration of #clusters and all the other possible configurations is quite impressive: it's almost $10^7$ in the worst case. And this is for each iteration!<br />
<div>
I let you calculate how much time you can save working with the best configuration :).</div>
<div>
Stay tuned!</div>
<div>
c.</div>
<b>Text Clustering: a non-parametric version of the K-medoids algorithm for data de-duplication</b><br />
<b>Abstract</b><br />
This post presents a quick and effective method to leverage text clustering for data de-duplication and normalisation.<br />
The customised version proposed here bypasses the very well known problem of assigning the number of clusters.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnDu7XuSerJoR7hkod5XDp9aL0qVtTlFt0flMJUN1t4h5bVhmh08oPSh0mAmWuaynhKPdaJxOpOhQrPRa0JmOYw3iQWD-MXRnvLABOdvxap6ucm5lAESWhfloQu3jmndAJTy6cAxL65Zg/s1600/resBest_5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnDu7XuSerJoR7hkod5XDp9aL0qVtTlFt0flMJUN1t4h5bVhmh08oPSh0mAmWuaynhKPdaJxOpOhQrPRa0JmOYw3iQWD-MXRnvLABOdvxap6ucm5lAESWhfloQu3jmndAJTy6cAxL65Zg/s1600/resBest_5.bmp" height="308" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Text Clustering results: each row depicts a cluster found. </td></tr>
</tbody></table>
<br />
<b><br /></b>
<b>Introduction</b><br />
One of the most common problems in data management is consistency: customer addresses, company names, and any attribute in string format can be represented in multiple formats, often mutually equivalent.<br />
Data normalisation presents some challenges, especially in the following scenarios:<br />
<ul>
<li>Variations/mutations of the trusted source of information are not known a priori.</li>
<li>The data to be normalised are characterised by high volume.</li>
<li>There are no deterministic and static rules to normalise the data.</li>
<li>...you don't have plenty of time to find the perfect solution :), and you have to deliver fast!</li>
</ul>
<b>Why the brute force approach is a waste of time!</b><br />
Let's assume that you have already found, for your specific use case, the best distance to judge how far one record is from another.<br />
To make the story a bit more realistic, let's assume hereafter that our attribute to be normalised is a string.<br />
...I would hope that none of you really thought of creating a big matrix to keep track of all the possible distances among the possible values of the attribute... Why? If you have for instance 10k possible values, you need to compute at least 49,995,000 comparisons (assuming that your distance is symmetric!); that's because the complexity follows $\binom{N}{2}$.<br />
You might perform the brute force approach over a statistically representative sample set, but in that case it's really hard to validate the real accuracy of your results.<br />
<b><br /></b>
<b>Why Partitional Clustering is efficient?</b><br />
As you know, this is a "practical means" blog, so I encourage you to refer to real books if you are looking for theoretical proofs. Anyhow, it's easy to get convinced of the following:<br />
<ul>
<li>number of comparisons to be performed, assuming $K$ clusters, $I$ iterations, and $N$ elements to be clustered: $N \cdot K \cdot I$.</li>
<li>number of comparisons to update the centroids: $I \cdot \left( \binom{K_1}{2}+\binom{K_2}{2}+\cdots+\binom{K_K}{2} \right) \ll \binom{N}{2}$</li>
</ul>
<br />
The comparison of the computational complexity of the brute force approach with partitional clustering is well depicted by the example below, where I assumed ten iterations to reach convergence, 10 clusters, and a data set that grows from 1,000 to 20,000 points:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4x_pE90Dg5TniWPNLpKNWL6y2K3UW298A4BZe7aWcKHbl04HkVBAa6cMYPxVxAsexwkaV4t9wSbUG37Zhjk8BytCCb4-w_L5bT6zbAE5ppXwo-rBGeNeOcs9hzsRxnBcIAObkmmP5fyo/s1600/Untitled-1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4x_pE90Dg5TniWPNLpKNWL6y2K3UW298A4BZe7aWcKHbl04HkVBAa6cMYPxVxAsexwkaV4t9wSbUG37Zhjk8BytCCb4-w_L5bT6zbAE5ppXwo-rBGeNeOcs9hzsRxnBcIAObkmmP5fyo/s1600/Untitled-1.bmp" height="225" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The computational effort: in red the brute force approach, in blue the clustering.</td></tr>
</tbody></table>
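The same comparison can be reproduced with a few lines of Python (the balanced cluster size $N/K$ is an assumption made to keep the estimate simple):<br />
<pre><code>from math import comb

def brute_force(n):
    return comb(n, 2)          # all pairwise comparisons

def clustering(n, k=10, iters=10):
    # n*k*iters assignment comparisons plus the per-iteration
    # centroid updates on (roughly balanced) clusters of size n/k
    return n * k * iters + iters * k * comb(n // k, 2)

for n in (1_000, 10_000, 20_000):
    print(n, brute_force(n), clustering(n))
</code></pre>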
<b>The approach in a nutshell </b><br />
<div>
1. Group together records that look similar to each other using k-medoids clustering. </div>
<div>
2. Store the centroids found in an index.</div>
<div>
3. Delete (only if the use case allows it) all the members of the clusters except the centroids.</div>
<div>
4. For each new record to be stored in the DB, perform the comparison with the elements of the index: if the distance is too high, add the new record to the index (a minimal sketch of this step follows).</div>
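A minimal Python sketch of step 4, assuming a distance normalised to [0,1] and a hypothetical threshold <code>theta</code>:<br />
<pre><code>def normalise(records, index, distance, theta=0.3):
    # Compare each new record against the centroids kept in the index;
    # if it is far from all of them, treat it as a new canonical value.
    for r in records:
        nearest = min((distance(r, c) for c in index), default=1.0)
        if nearest > theta:
            index.append(r)    # new entry for the index
        # else: r is a duplicate/variant of an existing centroid
    return index
</code></pre>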
<div>
<b><br /></b></div>
<div>
<b>K-medoids vs K-means</b></div>
<div>
The main difference between K-medoids and the well known K-means is in the assignment of the centroids:</div>
<div>
<ul>
<li>in K-means the centroid is the barycentre of the cluster, so it might not be a real record.</li>
<li>in K-medoids, the centroid is a record of the cluster that minimises the sum of the distances with respect to all the other points of the cluster. Of course, depending on the case, you might prefer to minimise the variance, or even define more sophisticated minimisation criteria.</li>
<li>K-means is usually much faster, since the barycentre is usually computed as the average of the positions of the records in the space;</li>
<li>K-medoids requires (for each cluster) the additional comparison of the distances of each point of the cluster from all the other points of the cluster (this explains the combinatorial term in the complexity formula above).</li>
</ul>
<span style="font-family: inherit;">K-medoids is more robust than k-means in the presence of noise and
outliers.</span></div>
<div>
<span style="font-family: inherit;">...But there is still something that makes me unhappy with K-medoids: It's parametric, meaning it requires the priori knowledge of the number of clusters.</span></div>
<div>
<span style="font-family: inherit;">In the real world simply forget it! </span></div>
<div>
<span style="font-family: TimesNewRomanPSMT;"><br /></span></div>
<div>
<span style="font-family: TimesNewRomanPSMT;"><b>Customised version for auto-determination of number of clusters.</b></span></div>
<div>
<ol>
<li><span style="font-family: TimesNewRomanPSMT;"><i>Instantiate the standard k-medoids with arbitrary number of classes (in the case analysed later I used just two clusters as initialisation).</i></span></li>
<li><i><span style="font-family: TimesNewRomanPSMT;">after the operation to update the medoids, compare the </span><span style="font-family: TimesNewRomanPSMT;">medoids with the others: if two medois are very similar, then merge the two clusters.</span></i></li>
<li><span style="font-family: TimesNewRomanPSMT;"><i>calculate the mean distances (this computation can take advantages of the operations performed during the medoids updating) for each cluster: if the mean distance overcome a threshold (that indicates the sparsity of the points of the cluster) then split the cluster. In this experiment I used a wild approach: each point of the split cluster have been consider as new cluster.</i></span></li>
<li><span style="font-family: TimesNewRomanPSMT;"><i>goto step 2 until convergence. </i></span></li>
</ol>
</div>
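A minimal Python sketch of the loop, assuming a precomputed, normalised distance matrix <code>D</code> as a NumPy array (values in [0,1], zeros on the diagonal); $\theta$ is the merge threshold and $1-\theta$ the split threshold, as discussed below:<br />
<pre><code>import numpy as np

def kmedoids_autok(D, theta, max_iter=50):
    n = len(D)
    medoids = [0, n // 2]                          # step 1: start with 2 clusters
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        # update: each medoid minimises the sum of in-cluster distances
        updated = []
        for c in range(len(medoids)):
            members = np.where(labels == c)[0]
            if len(members):
                sub = D[np.ix_(members, members)]
                updated.append(members[np.argmin(sub.sum(axis=1))])
        # step 2: merge clusters whose medoids are very similar
        merged = []
        for m in updated:
            if all(D[m, m2] > theta for m2 in merged):
                merged.append(m)
        # step 3: split clusters whose mean internal distance is too high
        final = list(merged)
        labels = np.argmin(D[:, merged], axis=1)
        for c, m in enumerate(merged):
            members = np.where(labels == c)[0]
            if len(members) > 1 and D[m, members].mean() > 1 - theta:
                final.extend(int(p) for p in members if p != m)  # "wild" split
        if set(final) == set(medoids):             # step 4: until convergence
            break
        medoids = final
    return medoids
</code></pre>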
<div>
You might argue that such a customisation doesn't require the "number of clusters" as a parameter but introduces two thresholds: one for the cluster merging and another one for the splitting.<br />
Actually the threshold is just one:<br />
<ul>
<li>if $\theta$ is the minimum acceptable distance at which two strings might be considered the same, then $1- \theta$ is the threshold at which two strings are absolutely different!</li>
</ul>
The advantage of the approach lies in the fact that the threshold can easily be estimated and verified. Estimating the #clusters is not that easy!<br />
<b><br /></b>
<b>Experiment: Addresses de-duplication.</b><br />
As data set, I downloaded from this <a href="http://gmaps-samples.googlecode.com/svn/trunk/articles-phpsqlsearch/phpsqlsearch_data.csv" target="_blank">link</a> the addresses of pizzerias in New York (I'm Italian, and I love pizza :D).<br />
To assess the accuracy of the algorithm, I randomised each entry by introducing Gaussian noise.<br />
<b><br /></b>
<b>Data Perturbation:</b><br />
The perturbation method I used is based on randomised swapping of characters of the string, as sketched in the code after this list:<br />
<ol>
<li>Select the number $n$ of characters you want to swap. This number is drawn within the range [0,StringLength[str]] using a Gaussian distribution.</li>
<li>Draw $n$ couples of integers (between [1,StringLength[str]]).</li>
<li>For each couple, swap the corresponding first character with the second.</li>
</ol>
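A Python sketch of the perturbation; the spread <code>sigma</code> of the Gaussian is a hypothetical choice:<br />
<pre><code>import random

def perturb(s, sigma=2.0):
    # 1. number of swaps, drawn from a Gaussian and clipped to [0, len(s)]
    n = min(len(s), max(0, round(abs(random.gauss(0, sigma)))))
    chars = list(s)
    for _ in range(n):
        # 2. draw a couple of positions and 3. swap the two characters
        i = random.randrange(len(chars))
        j = random.randrange(len(chars))
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(perturb("60 Greenwich Ave, New York"))
</code></pre>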
<b>Distance</b><br />
The definition of the distance is crucial: several options are available. In this experiment I used the common <a href="http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm" target="_blank">Needleman-Wunsch</a> similarity. I'm pretty sure that good results can also be achieved with the Jaro-Winkler and Levenshtein distances.<br />
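For reference, a self-contained sketch of a Needleman-Wunsch similarity normalised to [0,1]; the match/mismatch/gap scores are assumptions, not necessarily the values used in the experiment:<br />
<pre><code>def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # global alignment score via dynamic programming
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]

def similarity(a, b):
    # normalise the alignment score so that 1 means identical strings
    return max(0.0, nw_score(a, b) / max(len(a), len(b), 1))
</code></pre>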
<br />
<div>
<b>Experiment Setting</b><br />
First I estimated the parameter of the customised k-medoids algorithm: the merging/splitting threshold.<br />
Its value depends on the nature of the data.<br />
<ul>
<li>What's the minimum acceptable distance between two strings to be considered the same?</li>
</ul>
</div>
I randomly selected several couples of strings, measured the distances and sorted them by value.<br />
I noticed that a 70% similarity is a good cut-off for a match.<br />
In the next posts I'll show a more refined technique based on the convergence slope of the clustering... but that's another story.<br />
<br />
<b>Results:</b><br />
The data set contains 50 records:<br />
<ul>
<li>10 different addresses</li>
<li>for each address, 5 deviations generated with the aforementioned perturbation.</li>
</ul>
The customised K-medoids found the best result (that is, 10 clusters each containing the 5 deviated addresses) in ~20% of the cases (first bar of the chart below).<br />
The chart below shows how many times (out of 100 trials) the algorithm found exactly $k$ clusters.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5DhaVSFOJm0BAh-DNyLE5M5FFNSjS7VHwYp_V-u5p2XNi7ive9qh5kPW3NSTSNVpIVz_HlFT7ri-26YPbV6b_MVq1jMgO9jOIMTNwN1kiEofOPUt-SsCFZj3_a-67Bn0AO-H0vD-4cZg/s1600/barchart.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5DhaVSFOJm0BAh-DNyLE5M5FFNSjS7VHwYp_V-u5p2XNi7ive9qh5kPW3NSTSNVpIVz_HlFT7ri-26YPbV6b_MVq1jMgO9jOIMTNwN1kiEofOPUt-SsCFZj3_a-67Bn0AO-H0vD-4cZg/s1600/barchart.bmp" height="275" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The first bar says that 22 times out 100 trials, the alto returned 10 clusters.</td></tr>
</tbody></table>
<ul>
<li>As you can see, in 70% of the cases the algorithm partitions the data set into [9,12] clusters!</li>
</ul>
Below, the best and worst cases:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_sRyCvZtt9n5Zb6A632qrwsgffPMye_4q15n8zz9twCfsI258P18tqEZ9FGz09B884T6HxYi7yVrsj9lgeM00AaBwc3k8RlYOHzpG7csl0OlyoDu1zcaeD8x15vEfQVDkleLvW5hIKM/s1600/resBest_5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_sRyCvZtt9n5Zb6A632qrwsgffPMye_4q15n8zz9twCfsI258P18tqEZ9FGz09B884T6HxYi7yVrsj9lgeM00AaBwc3k8RlYOHzpG7csl0OlyoDu1zcaeD8x15vEfQVDkleLvW5hIKM/s1600/resBest_5.bmp" height="310" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Best case: achieved 20% of the cases. Steps to converge ~26. </td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uPJRruPDf7gYySAvfUsZ5XX0I7UyPZhCmFZwlpw-90v7_2jbsqI6s-2xqjPTAs2GxIy7AgsHRjSKbYfhwYq3n8OlX4_ql-P-GqsdQBBjtV5Kf0vGqfwRTiXOFdcC2aWFrY2UxSg20As/s1600/worst.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uPJRruPDf7gYySAvfUsZ5XX0I7UyPZhCmFZwlpw-90v7_2jbsqI6s-2xqjPTAs2GxIy7AgsHRjSKbYfhwYq3n8OlX4_ql-P-GqsdQBBjtV5Kf0vGqfwRTiXOFdcC2aWFrY2UxSg20As/s1600/worst.bmp" height="360" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Worst case: returned 17 clusters. Convergence after 40 steps.</td></tr>
</tbody></table>
<b>Convergence of the algorithm:</b><br />
The time to converge is the following:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoOLiBPfWzm4pO0Oh_ofrKSUqFFsxTIeaXlkazFyo1bRBVV8rQx7BOHqwEq1ClwlFoYETxo1L_1Vz4ycL8HCf_t5UA7fv1zxQniYxhHLWY8Z0SiI_lBSLAA0IEl6paU5zpPHA3BrhyB3o/s1600/convergenceTime.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoOLiBPfWzm4pO0Oh_ofrKSUqFFsxTIeaXlkazFyo1bRBVV8rQx7BOHqwEq1ClwlFoYETxo1L_1Vz4ycL8HCf_t5UA7fv1zxQniYxhHLWY8Z0SiI_lBSLAA0IEl6paU5zpPHA3BrhyB3o/s1600/convergenceTime.bmp" /></a></div>
<br />
<ul>
<li>On average the algorithm converges after 36 steps.</li>
</ul>
I noticed that the longer the time to converge, the lower the probability of determining the number of clusters correctly.<br />
<br />
As usual: stay tuned!<br />
cristian<br />
<b><br /></b></div>
<b>Fractals: parameters estimation (part IV)</b><br />
<b>Introduction</b><br />
In the former posts we discussed the following points:<br />
<ol>
<li>There are special points in the contour of the fractal that can be used to derive the rest of the contour.</li>
<li>Such points can be used to describe the fractal through an iterative and deterministic algorithm. </li>
</ol>
The main issue is still open: can we leverage the two findings above to determine the fractal's parameters?<br />
The answer is yes, provided that we use a good technique to extract the contour points of the fractal image.<br />
<br />
<b>The algorithm (practical explanation)</b><br />
We already noticed that the points in the contour of the fractal are in a tight relationship with each other.<br />
Since each linear transformation is described by exactly six parameters, we need at least 6 points per transformation function to determine them.<br />
The image below makes the concept clear:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7FOwKlB_Up7_WnhiuddE66MIEQd6u9-K5ASl4PA59zXxhUZ6in4LNeZuLTQpUOhVjxnMUATH5TqCFTlISmXTC7cTrA1Ga4b6w4cmzJZ0QenLGx8zoHj2F_78ww78j2msb0HTy8AQulq4/s1600/paramEstimation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7FOwKlB_Up7_WnhiuddE66MIEQd6u9-K5ASl4PA59zXxhUZ6in4LNeZuLTQpUOhVjxnMUATH5TqCFTlISmXTC7cTrA1Ga4b6w4cmzJZ0QenLGx8zoHj2F_78ww78j2msb0HTy8AQulq4/s1600/paramEstimation.png" height="394" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px;">Image-1: Relationship between points of the fractal contour</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The steps to obtain the estimation of the parameters are the following:<br />
<ol>
<li>Extraction of the contour points through the convex hull algorithm;</li>
<li>Building relationships among the points.</li>
</ol>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4kX3ySMoSah1ANvBuSgj-99O5hMJrY5VmqCpzrZ4DAbETriRfSBXKeU8HdLODjg5oCFRNpRet8HDIWDDTR7nJUTMGkwPUBd6qPp0avw8VjnECGve2QkqZiyLontf3LJaM1UeIy3bvws/s1600/CHAndRelationship.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4kX3ySMoSah1ANvBuSgj-99O5hMJrY5VmqCpzrZ4DAbETriRfSBXKeU8HdLODjg5oCFRNpRet8HDIWDDTR7nJUTMGkwPUBd6qPp0avw8VjnECGve2QkqZiyLontf3LJaM1UeIy3bvws/s1600/CHAndRelationship.png" height="252" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Left Image: Extraction of contour points thru Convex Hull.<br />
Right Image: Relationship among the points extracted.</td></tr>
</tbody></table>
<b>How to build the relationships among the points?</b><br />
As shown both in the deterministic algorithm presented in the last post and in the image above, the points preserve an ordering that can be leveraged to reduce the complexity of the problem.<br />
Consider for instance the fern case:<br />
<ul>
<li>the transformation that leads to the fixed point lying on the contour (highlighted in Image-1) can be obtained (as explained in points 1 and 2 of the deterministic algorithm) just by creating relationships among the consecutive points of the set $ l_1 $ described in <i>Image-1</i> and in the image below:</li>
</ul>
<br />
<div style="text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcBR9A89o7iwRfCrrjrNhxSsPG4uAI-7X_R-vFRLvOtzH7ifHKRFdt0BWQmKo2QqrNApbDFGGetqbpTp-y3FQglixaxcyhiRe1KQ1WgrDgbyk7zv2euDbuqW6uiv2cMwvTM2WUe31bW8A/s1600/sets2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcBR9A89o7iwRfCrrjrNhxSsPG4uAI-7X_R-vFRLvOtzH7ifHKRFdt0BWQmKo2QqrNApbDFGGetqbpTp-y3FQglixaxcyhiRe1KQ1WgrDgbyk7zv2euDbuqW6uiv2cMwvTM2WUe31bW8A/s1600/sets2.png" height="640" width="452" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Set of points to determine the Transformations</td></tr>
</tbody></table>
At this point we have obtained the parameters of the first transformation, but we still know neither the number of transformations we really need to describe the fractal image, nor the probabilities associated with them.<br />
<div>
For the first problem we have two strategies: brute force, that is, trying all the possible combinations, or, more wisely, the soft computing approach described below.</div>
<div>
<b><br /></b></div>
<div>
<b>The fuzzy clustering approach (...unfortunately, some formulas are required here)</b></div>
<div>
Fuzzy clustering is a very well known algorithm that assigns the membership of a point to a cluster with a certain "degree": the point $p_i$ might belong to cluster $A$ with degree $x$ and to cluster $B$ with degree $y$.</div>
<div>
<br />
In this problem I modified the algorithm in this way:</div>
<div>
<ul>
<li>the clusters are the mapping functions that we want to discover.</li>
<li>the probability associated with the fractal maps can easily be derived from the "size" of each cluster.</li>
<li>The fuzzy membership function is based on the minimisation of the distance between the point obtained by applying the contractive map to $x_i$, $x_i^{'} = \tau_j(x_i) $, and all the other points falling in a reasonable neighbourhood. The animation below describes it.</li>
</ul>
<ol>
</ol>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='440' src='https://www.blogger.com/video.g?token=AD6v5dzC8KZkXyp81Ff716ZcUbO4madNpJxkUKcZeTAc6nmV5mcrbdaQNNXLfmUKYHghrZRie7RJyBBqRmruGVGeQg' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<ul>
<li style="text-align: left;"><span style="font-family: inherit;">The update step of the centroids is aimed to minimise the distance between $ d(\tau (x_i)),\phi(x_i,\tau_j))$ </span>computed<span style="font-family: inherit;"> over each $\tau$ and each $x_i$, where </span></li>
$\phi(x_i,\tau_j) = \arg\min_{x_s} d(\tau_j(x_i),x_s) $
<li><span style="font-family: inherit;">To minimise the above function, I used the gradient technique, working on the the following cost function:</span></li>
<li><span style="font-family: inherit; text-align: center;">$ E(\theta)= \sum_{i=1}^{n}{(\mu_{x_i}(\tau_j)\cdot[d(\tau_j(x_i),\phi(x_i,\tau_j))]^2)} $</span></li>
<li style="text-align: left;">For each mapping function, the correction to each param is the following:</li>
<li><span style="text-align: center;">$\theta_i= \theta_i - \eta \frac{\partial E(\theta)}{\partial \theta_i}$</span></li>
</ul>
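A numerical sketch of this update in Python: the six parameters of a candidate map are packed into $\theta$ and adjusted with a finite-difference gradient of $E(\theta)$ (the learning rate <code>eta</code> and step <code>h</code> are hypothetical):<br />
<pre><code>import numpy as np

def cost(theta, X, mu):
    # E(theta) = sum_i mu_i * d(tau(x_i), phi(x_i, tau))^2, where
    # phi(x_i, tau) is the fractal point nearest to tau(x_i)
    A, q = theta[:4].reshape(2, 2), theta[4:]
    Y = X @ A.T + q                              # tau applied to every point
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return float((mu * d2.min(axis=1)).sum())

def gradient_step(theta, X, mu, eta=1e-3, h=1e-5):
    # finite-difference gradient over the six affine parameters
    grad = np.zeros(6)
    for i in range(6):
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (cost(up, X, mu) - cost(down, X, mu)) / (2 * h)
    return theta - eta * grad
</code></pre>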
<span style="font-family: inherit;"></span><br />
<span style="font-family: inherit;">
</span>
<br />
<ul></ul>
<ul>
</ul>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
<b>Results</b></div>
<div style="text-align: left;">
Starting from the image of the fern (a jpg image containing 100,000 points), the application of the algorithm for the determination of the contractive maps gives the following result:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8j9bDg_XB6cxhQIUGuvT0zIMVTwCCd2mAUAZ18uBtAQ5Ol03nFPq0Pkqn-yfSaWZj07NDm7bI1GF_XmHWdCiMzgvHpJj33hoHfHxB86VhnikzTUaKKK9F_zxVwTw4MnurSRJ7mvXymnQ/s1600/res.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8j9bDg_XB6cxhQIUGuvT0zIMVTwCCd2mAUAZ18uBtAQ5Ol03nFPq0Pkqn-yfSaWZj07NDm7bI1GF_XmHWdCiMzgvHpJj33hoHfHxB86VhnikzTUaKKK9F_zxVwTw4MnurSRJ7mvXymnQ/s1600/res.png" height="402" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Contractive maps detection algorithm: in black the points of the original fern, in red,green, blue the three estimated maps.</td></tr>
</tbody></table>
<div style="text-align: left;">
</div>
<ul>
<li>The results are not bad; the weakness of the algorithm lies in the extraction of the points of the convex hull of the fractal.</li>
<li>The application of a smarter contour extraction algorithm should further improve the accuracy.</li>
</ul>
<div>
<b>Disclaimer</b></div>
<div>
In the last posts about fractals I often abused the term "contour" of a fractal. ...Theoretically speaking, a fractal cannot have a contour, but to make the pragmatic aspects clear I deliberately chose that terminology.</div>
<div>
<br /></div>
<b>Fractals: a deterministic and recursive algorithm to reproduce it (Part II)</b><br />
<b>Abstract:</b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxpGBX05D0vzcJvlVkiST-F089Gvu8zdFcZGcgI2xg9j72Jl8C8L9esbEpeDaoTtHL_j44IIkw0IgYYwWeRalrGhXO2hlfdvTjIDpETdTpm8FUV3tORTk9ZT59-uklQ23q2jeZtFFf5HA/s1600/recursionalgo.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxpGBX05D0vzcJvlVkiST-F089Gvu8zdFcZGcgI2xg9j72Jl8C8L9esbEpeDaoTtHL_j44IIkw0IgYYwWeRalrGhXO2hlfdvTjIDpETdTpm8FUV3tORTk9ZT59-uklQ23q2jeZtFFf5HA/s1600/recursionalgo.bmp" height="198" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fern fractal estimation thru recursive and deterministic algorithm.</td></tr>
</tbody></table>
<b><br /></b>
A fractal can be described as the fixed point of a Markov process: given the proper contractive maps, it's easy to generate it.<br />
<div>
The question is: given the contractive maps, is it possible to generate the fractal using a purely deterministic approach?</div>
<div>
<b><br /></b></div>
<div>
<b>Problem</b></div>
<div>
We already observed that iterating the contractive maps, starting from a fixed point that lies on the border of the fractal, returns points of the convex hull of the fractal.</div>
<div>
<br />
<ul>
<li>What happens if we recursively apply this schema to the points obtained from the above procedure?</li>
<li>Is it possible to recreate the Markov process (or at least approximate it), removing any probabilistic aspects?</li>
</ul>
</div>
<div>
The algorithm below, given the contractive maps of the fractal, returns at each iteration a better approximation of it.<br />
<br />
At each iteration it produces a finer approximation of the contour of the original image.<br />
<br /></div>
<div>
<b>The Algorithm (by practical means)</b></div>
<div>
<ol>
<li><i>Consider a fixed point $P_0$ of a contractive map $\tau_1$ that lies on the convex hull of the fractal.</i></li>
<li><i>Choose randomly a point $P_i$ of the fractal and apply the above contractive map until the distance to $P_0$ is negligible.</i></li>
<li><i>Map the point calculated at step 2 sequentially with all the contractive maps.</i></li>
<li><i>Map each point obtained from step 3 with $\tau_1$ until the distance to $P_0$ is negligible.</i></li>
<li><i>If[ITERATIONS< K]:</i></li>
<ol>
<li><i>K--;</i></li>
<li><i>For each point obtained from step 4, go to step 3.</i></li>
</ol>
</ol>
</div>
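To make the procedure concrete, here is a Python sketch of it. The maps below are the well-known Barnsley fern coefficients, used as an assumed example; $\tau_1$ is the spiral map, whose fixed point (the tip of the fern) lies on the hull. Skipping $\tau_1$ itself at step 3 is a simplification, since its images are already covered by the orbits of step 4:<br />
<pre><code>import numpy as np

# Barnsley fern maps tau(p) = A p + q; the first entry is tau_1
MAPS = [
    (np.array([[0.85, 0.04], [-0.04, 0.85]]), np.array([0.0, 1.60])),
    (np.array([[0.00, 0.00], [0.00, 0.16]]), np.array([0.0, 0.00])),
    (np.array([[0.20, -0.26], [0.23, 0.22]]), np.array([0.0, 1.60])),
    (np.array([[-0.15, 0.28], [0.26, 0.24]]), np.array([0.0, 0.44])),
]
A1, q1 = MAPS[0]
P0 = np.linalg.solve(np.eye(2) - A1, q1)    # step 1: fixed point of tau_1

def orbit(p, eps=1e-3):
    # steps 2/4: iterate tau_1 until the distance to P0 is negligible,
    # collecting the whole trajectory
    traj = []
    while np.linalg.norm(p - P0) > eps:
        traj.append(p)
        p = A1 @ p + q1
    return traj

def recursive_ifs(K):
    frontier, points = [P0], [P0]
    for _ in range(K):                       # step 5: repeat K times
        new_frontier = []
        for p in frontier:
            for A, q in MAPS[1:]:            # step 3: apply the other maps
                traj = orbit(A @ p + q)      # step 4: iterate tau_1 again
                new_frontier += traj
                points += traj
        frontier = new_frontier
    return np.array(points)

print(recursive_ifs(K=2).shape)              # the points to plot
</code></pre>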
<div>
To explain how it works, I plotted the list of points obtained using $K=1$ iteration of the algorithm:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpM0mLNykZN2JvS3PgoYuIpylNDMZextJWH8aVtexU3VJwAPM079o8vHGRw34YnBdbzy8zevjM-nMPk6n4Mvges7Ju1twuyCjT55qVvq4BbiBVTbxkiC5kFG63b4KOYzLQge1kZV-VYNM/s1600/gifR1.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpM0mLNykZN2JvS3PgoYuIpylNDMZextJWH8aVtexU3VJwAPM079o8vHGRw34YnBdbzy8zevjM-nMPk6n4Mvges7Ju1twuyCjT55qVvq4BbiBVTbxkiC5kFG63b4KOYzLQge1kZV-VYNM/s1600/gifR1.gif" height="411" width="640" /></a></div>
<br />
<br />
<br />
The bigger $K$ is, the more accurate the result will be.<br />
<b>The above procedure works only if the contractive map $\tau_1$ has a fixed point that lies on the convex hull of the fractal!!</b><br />
<br />
<b>Results:</b><br />
<br />
I iterated the algorithm with $K=4$. At each iteration the algorithm returns a deeper description of the contour of the fractal (...even though the definition of contour doesn't make any sense for a fractal, it at least gives a practical meaning):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL9d57pzIDYc32sgcvebnrFCTi2p4qz1wa_VY6tLYE9-gzos0-hrerKVlSUyUBn0rFlpm4wmOUCeW7VYwYYKsJNJ7zupdogMdrvfzZMJz9vWpT6qYkMmP_Cx0yGEg51TFguGUyV9AXMLQ/s1600/recursionalgo.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL9d57pzIDYc32sgcvebnrFCTi2p4qz1wa_VY6tLYE9-gzos0-hrerKVlSUyUBn0rFlpm4wmOUCeW7VYwYYKsJNJ7zupdogMdrvfzZMJz9vWpT6qYkMmP_Cx0yGEg51TFguGUyV9AXMLQ/s1600/recursionalgo.bmp" height="396" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Results of the Recursive Algorithm iterated with K=1,2,3,4. </td></tr>
</tbody></table>
<br />
If we overlap the results obtained with the original fern, we get:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK5ZcYT1xtBpDJPGD591eb7hTyjclGtvudX32uzMK2SgF7lYE0e0A-YpiKZkVOdu5xinJT1Hu3NCXyYnM5tzBNY1Ft7T9lOnUcfNRxsBiP9RfZgPnWdyTMmrmQq39lyhZx6PfhkjDzumQ/s1600/algoRec2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK5ZcYT1xtBpDJPGD591eb7hTyjclGtvudX32uzMK2SgF7lYE0e0A-YpiKZkVOdu5xinJT1Hu3NCXyYnM5tzBNY1Ft7T9lOnUcfNRxsBiP9RfZgPnWdyTMmrmQq39lyhZx6PfhkjDzumQ/s1600/algoRec2.bmp" height="640" width="608" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original Fern, and the overlap with the results obtained using the recursive algorithm (K=4).</td></tr>
</tbody></table>
<b>Conclusion and next steps</b><br />
I showed how to express the IFS generated through the Markov process as a recursive and completely deterministic process.<br />
We also noticed that in the fractal there are special points (as for instance $P_0$) that play a crucial role in describing the IFS.<br />
<br />
The question now is the following:<br />
is it possible to leverage such special points and the above recursive algorithm to estimate the parameters of the contractive maps?<br />
... I will show you a procedure that partially answers the question, and some other examples of the algorithm described above!<br />
Stay Tuned.<br />
Cristian</div>
<b>Fractal parameters estimation through contour analysis - Part I</b><br />
<b>Introduction</b><br />
I really love fractal theory. Nothing fascinates me like the amazing structures generated by the chaotic processes that describe (hyperbolic) fractals.<br />
It's fairly easy to draw a fractal image:<br />
<ul>
<li><i>The input</i></li>
<ul>
<li>A small set of very easy functions (linear transformations)</li>
<li>A probability of choosing one of the aforementioned functions.</li>
</ul>
<li><i>The process</i></li>
<ul>
<li>Choose randomly a starting point.</li>
<li>Choose randomly one of the above functions according to the above defined probability.</li>
<li>Map the starting point using the above selected function.</li>
<li>iterate as long as you want :)</li>
</ul>
<li><i>Admire the meaning of life!</i></li>
<ul>
<li>just plot the points</li>
</ul>
</ul>
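As a minimal sketch, the whole process fits in a few lines of Python; the maps and probabilities below are the well-known Barnsley fern coefficients, used here as an assumed example:<br />
<pre><code>import random

# IFS: each entry is (a, b, c, d, e, f, p) for the linear transformation
# (x, y) -> (a x + b y + e, c x + d y + f), chosen with probability p
MAPS = [
    ( 0.00,  0.00,  0.00, 0.16, 0.0, 0.00, 0.01),
    ( 0.85,  0.04, -0.04, 0.85, 0.0, 1.60, 0.85),
    ( 0.20, -0.26,  0.23, 0.22, 0.0, 1.60, 0.07),
    (-0.15,  0.28,  0.26, 0.24, 0.0, 0.44, 0.07),
]

def chaos_game(n_points=100_000):
    x, y = 0.0, 0.0                  # choose randomly a starting point
    points = []
    for _ in range(n_points):
        # choose randomly one of the functions according to its probability
        a, b, c, d, e, f, _ = random.choices(MAPS, weights=[m[6] for m in MAPS])[0]
        x, y = a * x + b * y + e, c * x + d * y + f   # map the point
        points.append((x, y))
    return points                    # ...just plot the points
</code></pre>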
Wherever you start, whatever decision you make, despite the chaos, the overall picture will always be the same. The more you move, the clearer the picture will be!<br />
<br />
When I approached fractals for the first time, after reading the great Barnsley's book (Fractals Everywhere), I opened the laptop and wrote the routine to draw my first IFS!<br />
...By mistake I also plotted the lines that join the points:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='400' src='https://www.blogger.com/video.g?token=AD6v5dzJyM4LXdhKsT1QHS3AHkBG2zunrDxFmVk_X6OlYrvcHpxET3yhaw_2GwVXdcVDmm4od0ZNesTMHCjKRHd-uQ' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
...It's quite messy, isn't it?<br />
Sometimes it's better to give up connecting the points, just relax and wait until everything becomes clear. If we focus on every single detail, we risk losing what really counts!<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<br />
<div style="text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='450' src='https://www.blogger.com/video.g?token=AD6v5dxliJCJ_xIvdrBnx9LygDqDRH7HB-qlSsgZmqOFntr18SBonns-ZeKgT_xDN_PLEKQWrD9hcFzM_C3Lyw-xLw' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
<b>What is a fractal (Iterated Function System)?</b><br />
There are many formal definitions of an IFS; in my opinion the most effective describes a fractal as the fixed point of the Markov process (mentioned above).<br />
The process converges to the fractal :)<br />
<br />
<b>The power of fractals</b><br />
In the literature there are plenty of articles in which fractals are used to solve complex problems (financial analysis, stock exchange predictions, descriptions of biological processes, ...).<br />
Fractals also found a good place in Computer Science; think about image compression: the ivy leaf image takes around 25 MB of space, while all the information needed to obtain it through the IFS takes less than 1 KB!!<br />
<b>The problem </b><br />
Given the ivy leaf image, what are the parameters of the Markov process that generates it?<br />
A bit more formally, what we have to estimate is:<br />
<ul>
<li>A set of functions: $\{ \tau_1, \cdots,\tau_N \}$, where $\tau_i(p_0)=A_i \cdot p_0 +Q_i$, with:</li>
</ul>
<div>
<div style="font-family: Monaco; font-size: 12px;">
<div style="text-align: center;">
$ A= \begin{bmatrix}\alpha & \beta\\ \gamma & \delta \end{bmatrix} Q= \begin{bmatrix} \epsilon\\ \zeta \end{bmatrix} $</div>
</div>
</div>
<div>
<ul>
<li>We also need to estimate the probability of choosing each $\tau_j$</li>
</ul>
The combination of the <i>collage theorem</i> and the box counting approach is the most common technique to solve the problem.<br />
When I was a student I approached the problem from a different angle. I have to say that the results obtained were partial, but I still think that something more can be done :)<br />
Before starting we need one more notion: contractive maps (have a look at the Banach fixed-point theorem).<br />
Under certain conditions, the iteration of $\tau_j$ leads to a fixed point:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuieorZ1EgJZ65YIkCzBBSdttwfIefS5avXLIPGau6BIQEIGCvPrm4Fgn6FaBDCY_hjTptgWPa7SrAikKhojngexs5TUsmWJSNgsIxuXHFBZdMT7D3o09Q407J6sT9n6tgovrRFseibd0/s1600/puntiFissi2.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuieorZ1EgJZ65YIkCzBBSdttwfIefS5avXLIPGau6BIQEIGCvPrm4Fgn6FaBDCY_hjTptgWPa7SrAikKhojngexs5TUsmWJSNgsIxuXHFBZdMT7D3o09Q407J6sT9n6tgovrRFseibd0/s1600/puntiFissi2.gif" height="320" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of contractive map applied to an ellipsoid.<br />
It converges to a fixed pointed.</td></tr>
</tbody></table>
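In code: iterating a contractive affine map from any starting point converges to the unique fixed point $p^* = (I-A)^{-1} Q$. The map below is a hypothetical example with contraction factor below 1:<br />
<pre><code>import numpy as np

A = np.array([[0.85, 0.04], [-0.04, 0.85]])   # hypothetical contractive map
Q = np.array([0.0, 1.6])

p = np.array([10.0, -3.0])                    # arbitrary starting point
for _ in range(200):
    p = A @ p + Q                             # iterate tau(p) = A p + Q

print(p)                                      # the orbit has converged...
print(np.linalg.solve(np.eye(2) - A, Q))      # ...to the closed-form fixed point
</code></pre>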
<br />
<br />
<b>First conjecture:</b><br />
<br />
<ul>
<li>An IFS is characterised by a fixed point that lies on its convex hull.</li>
<li>From a fixed point that lies on the border of the IFS, the iterations of the contractive maps that generate the fractal return the convex hull of the fractal.</li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtPLXahdUL37YUl15huYLio7uyj_wuTydWSPmbLMPwlevV3fLoe16RcMvJp0kxMLSORsgMutuSeMcozKL6U4kblyvBIHLhNiWBqiL7B3CBMWhLSVI6G_Hk_BY6rK-yHZMhXtgzKArAtSo/s1600/bordiDalPF.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtPLXahdUL37YUl15huYLio7uyj_wuTydWSPmbLMPwlevV3fLoe16RcMvJp0kxMLSORsgMutuSeMcozKL6U4kblyvBIHLhNiWBqiL7B3CBMWhLSVI6G_Hk_BY6rK-yHZMhXtgzKArAtSo/s1600/bordiDalPF.gif" height="316" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The ivy leaf IFS is generated by 4 contractive maps. Each color describes the map used to generate the point.</td></tr>
</tbody></table>
The animated gif above shows the meaning of the conjecture.</div>
<div>
Experimental evidence that at least one fixed point lies on the convex hull of the fractal can be obtained by changing the parameters of the maps:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxoxh7F6EH08syKr-pv7aJks4oBsOrZjKyZ_v4HA_eqWRZ0_FtT-dI6PmToqTuJF0m1PQgEAdTFc5FYbcy74aw659XO2QKINMOt6yRJ7NXc38iIDklzfd4L-FnKi-ascU7dg4HwAV8ho/s1600/congetturaPF.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxoxh7F6EH08syKr-pv7aJks4oBsOrZjKyZ_v4HA_eqWRZ0_FtT-dI6PmToqTuJF0m1PQgEAdTFc5FYbcy74aw659XO2QKINMOt6yRJ7NXc38iIDklzfd4L-FnKi-ascU7dg4HwAV8ho/s1600/congetturaPF.gif" height="247" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The light blue points depict the fixed point of the maps used to plot the fractals.</td></tr>
</tbody></table>
<div>
Despite the changes in the maps, the fixed point at the top of the leaf still lies on the convex hull of the fractal.</div>
<div>
In the next post I'll show you a nice recursive algorithm I found to obtain different <i>levels</i> of convex hull for an IFS.</div>
<div>
Stay tuned</div>
<div>
Cristian.</div>
<b>Waiting Time Polynomials: tech explanation - last part</b><br />
This is the last step to complete the explanation of the waiting time polynomials formula.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
Unfortunately it's a bit technical, but I'm sure that it can be understood without deep math knowledge.<br />
At the very end: <i>if you can't explain it simply, you don't understand it well enough!</i> (A. Einstein)<br />
<br />
<b>The trick</b><br />
Last time we left off with the tally of the overall waiting time $ w(x_i) = \phi(x_i)-|x_i| $, where $\phi(x_i)$ returns the position of the last $x_i$ in the vector $V$.<br />
Let's have a look at the following example, which will be used during the remaining steps.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHvLxGLVoe9iihGw9Q5LLJC3xZDyEyrR93i-ViuA-ovZIg3GwIzTeDqRhl78fYpbBigxlZCVe2Q6OPpwlTF4gBsBLSIs-BkjgU3WvHb0jR5aBRS62ZLZu9Ta2dwiJexQnIwXGVTkw7PU/s1600/forpaperExample_1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHvLxGLVoe9iihGw9Q5LLJC3xZDyEyrR93i-ViuA-ovZIg3GwIzTeDqRhl78fYpbBigxlZCVe2Q6OPpwlTF4gBsBLSIs-BkjgU3WvHb0jR5aBRS62ZLZu9Ta2dwiJexQnIwXGVTkw7PU/s1600/forpaperExample_1.jpg" /></a></div>
There are two questions that might be answered:<br />
<br />
<ul>
<li>given $|x_1|= i, |x_2|= j, |x_3|= k $, what are the admitted tuples $\{w(x_1),w(x_2),w(x_3)\}$?</li>
<li>given $\{w(x_1)=I,\, w(x_2)=J,\, w(x_3)=K\}$, how many vectors $V$ can be built?</li>
</ul>
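Both questions can be checked by brute force for small counts; here is a Python sketch that follows directly from the definition $w(x_i)=\phi(x_i)-|x_i|$:<br />
<pre><code>from itertools import permutations

def waiting_times(V, alphabet=(1, 2, 3)):
    # w(x_i) = phi(x_i) - |x_i|, where phi(x_i) is the 1-based
    # position of the last occurrence of x_i in the vector V
    w = []
    for a in alphabet:
        phi = max(pos for pos, v in enumerate(V, start=1) if v == a)
        w.append(phi - V.count(a))
    return tuple(w)

def tally(i, j, k):
    # enumerate all distinct vectors V with |x_1|=i, |x_2|=j, |x_3|=k
    # and count how many of them realise each waiting-time tuple
    counts = {}
    for V in set(permutations((1,) * i + (2,) * j + (3,) * k)):
        wt = waiting_times(V)
        counts[wt] = counts.get(wt, 0) + 1
    return counts

print(tally(2, 2, 2))   # {(w1, w2, w3): number of vectors V, ...}
</code></pre>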
To answer the first question, I noticed that the three coloured cells are the only ones that really count.<br />
The idea is the following:<br />
<br />
<ol>
<li>consider the three cells as placeholders</li>
<li>analyse the admitted values for each of them</li>
<li>replace the placeholders with all the possible permutations of the alphabet $\{x_1,x_2,x_3\}$.</li>
</ol>
<br />
Let's start with the case depicted in the image above, where we assumed that $\phi(x_1) < \phi(x_3) < \phi(x_2) $; then we have the following facts:<br />
<ul>
<li>$ \phi(x_1)$ can take values between: $0 \leq \phi(x_1) \leq |x_1|+|x_2|+ |x_3|-2$</li>
<li>$ \phi(x_2)$ can take just one value: $|V|=|x_1|+|x_2|+ |x_3|$</li>
<li>The upper bound of $ \phi(x_3)$ is $|V|-1$, because it can slide up to the second-to-last element of $V$, that is, $\phi(x_2)-1$.</li>
<li>What about the lower bound of $\phi (x_3)$? We have two cases, depicted in the image below:</li>
</ul>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs4jnlIS3Jp2HC6DM5oJrnNOi1l-csHMk0UuEgVu2z4Mya4q11m9_i2fMUPd1zjqFbUz6qzxbsGHseKsd-YMQuQfmUcnhclZWImNpk6h9WQg0yJQf2SIKhLzpIZBmQvhH9PKmVzeEgZs/s1600/x1x2x3phi3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs4jnlIS3Jp2HC6DM5oJrnNOi1l-csHMk0UuEgVu2z4Mya4q11m9_i2fMUPd1zjqFbUz6qzxbsGHseKsd-YMQuQfmUcnhclZWImNpk6h9WQg0yJQf2SIKhLzpIZBmQvhH9PKmVzeEgZs/s1600/x1x2x3phi3.png" height="327" width="400" /></a></div>
<br />
<br />
To sum up, so far we have explained the following part of the formula (I hope you don't mind that I changed the index notation a bit):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB222xwlysN7cL34R_XqLY0zrdfJRbeKVWErREzSppgrXSeUyd6pArKaxWTEKYWsdRzual3EiYfMnLED40yLJ6j8U0k6SbFkgmAvHEPTEreHNPSoKPM-9tQmDXtHulFzYUSCHCV161R-A/s1600/formula_q_indexes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB222xwlysN7cL34R_XqLY0zrdfJRbeKVWErREzSppgrXSeUyd6pArKaxWTEKYWsdRzual3EiYfMnLED40yLJ6j8U0k6SbFkgmAvHEPTEreHNPSoKPM-9tQmDXtHulFzYUSCHCV161R-A/s1600/formula_q_indexes.png" height="71" width="400" /></a></div>
<br /></div>
<div>
We now have to consider that for each configuration $\{w(x_1)=I,\, w(x_2)=J,\, w(x_3)=K\}$ we can have more than one vector $V$.<br />
Do you like combinatorics? The remaining part is just a matter of tallying, and the only formula we need is the one for permutations with repetitions. I'll let you brush up on the concept on your favourite website for basic combinatorics.<br />
<br />
The formula can be split into two chunks, because we have two blocks of cells to be filled:<br />
<br />
<ol>
<li>In how many ways can we fill the cells between the positions $[1,\phi(x_1)]$?</li>
<li>In how many ways can we fill the cells between the positions $[\phi(x_1),\phi(x_2)]$? </li>
</ol>
<br />
To answer the first question, we have to find the values of the denominator of the following:<br />
<div style="font-family: Monaco; font-size: 12px;">
\[\frac{(\phi(x_1)-1)!}{\#(x_1)!\#(x_2)!\#(x_3)!}\]</div>
<ul>
<li>we have $\phi(x_1)-1$ cells that can be filled.</li>
<li>these cells contain all the instances of $x_1$ (except for the last one, which occupies position $\phi(x_1)$), i.e. $|x_1|-1$ of them</li>
<li>the number of $x_3$ instances depends on $\phi(x_1)$ and $\phi(x_3)$:</li>
</ul>
<ul>
</ul>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiwjSl4WndBeCnWcf1a_dg8PQulf-RfNpGPtFdjHzyd0j51p6TEh1TmeGqS3E4R-SnWnQL1pIn2LdJm7m9NfE1AUInAdJE1yqcJHmkU2q8TuAdlJYNaJVK1wj27Ny86ZSUb9eajXL4FR8/s1600/slot1x1x2x3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiwjSl4WndBeCnWcf1a_dg8PQulf-RfNpGPtFdjHzyd0j51p6TEh1TmeGqS3E4R-SnWnQL1pIn2LdJm7m9NfE1AUInAdJE1yqcJHmkU2q8TuAdlJYNaJVK1wj27Ny86ZSUb9eajXL4FR8/s1600/slot1x1x2x3.png" height="327" width="400" /></a></div>
<br />
<ul>
<li>the computation of the number of instances of $x_2$ in the first slot is straightforward, and it can easily be derived by difference:</li>
<ul>
<li>$\frac{(\phi(x_1)-1)!}{(|x_1|-1)!\,(|x_3|-\min(|x_3|,\phi(x_3)))!\,\#(x_2)!}$;</li>
<li>$(|x_1|-1)+(|x_3|-\min(|x_3|,\phi(x_3)))+\#(x_2)=\phi(x_1)-1$</li>
<li>so $ \#(x_2)= \phi(x_1)-|x_1|-(|x_3|-\min(|x_3|,\phi(x_3))) $</li>
</ul>
</ul>
</ul>
This explains the following boxed part of the formula:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZrGFsEwkz8k3X9-CBHc6WWkn98yByjtfBbAVmNTEn2f4eHb3h0k1-CVBdZnmMWyEPNK0IwBJ9HohyphenhyphenScaUax9Vp3AB9Fpm42q3IFjvNOPEYxhloTYexlnRT7d_Ul9FAhwRN2-8e9DSR-A/s1600/formula_slot1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZrGFsEwkz8k3X9-CBHc6WWkn98yByjtfBbAVmNTEn2f4eHb3h0k1-CVBdZnmMWyEPNK0IwBJ9HohyphenhyphenScaUax9Vp3AB9Fpm42q3IFjvNOPEYxhloTYexlnRT7d_Ul9FAhwRN2-8e9DSR-A/s1600/formula_slot1.png" height="70" width="400" /></a></div>
<div>
<br /></div>
<div>
The final step is to count in how many ways we can fill slot 2, given by the interval $[\phi(x_1),\phi(x_3)]$; to make the formula more readable, let's rename $(|x_3|-\min(|x_3|,\phi(x_3))) = \epsilon$.</div>
<div>
As we did for the first slot, we have to identify the values of the denominator of the formula below:</div>
<div>
<div>
<div style="font-family: Monaco; font-size: 12px;">
\[\frac{(\phi(x_3)-\phi(x_1)-1)!}{\#(x_2)!\#(x_3)!}\]</div>
<br />
<ul>
<li>Out of $|x_3|$ instances, $\epsilon$ have been placed in slot 1, so slot 2 contains exactly $|x_3|-1- \epsilon$.</li>
<li>again by difference we can get the instances of $x_2$:</li>
<ul>
<li>the occurrences of $x_2$ before $\phi(x_3)$ are exactly $\phi(x_3)- (|x_1|+|x_3|)$</li>
<li>the occurrences of $x_2$ in the slot 1 (listed above) are: $ \#(x_2)= \phi(x_1)-|x_1|-\epsilon $</li>
<li>that is : $ \#(x_2)=\phi(x_3)- (|x_1|+|x_3|)- \phi(x_1)+|x_1|+ \epsilon$</li>
<li>finally we have: $ \#(x_2)=\phi(x_3)-\phi(x_1)-|x_3|+ \epsilon$</li>
</ul>
</ul>
<br />
<span style="font-family: inherit;">That's complete the proof of the formula.</span><br />
<span style="font-family: inherit;">It's quite easy now extend the formula to more variables. The only expedient to make it easier is to remove from the formula the operator $Min$ splitting the formulas in two branches. </span><br />
<span style="font-family: inherit;">I'll show it in paper.</span><br />
<span style="font-family: inherit;"><b>Note about Complexity</b></span><br />
What's the most painful point of this formula?<br />
<span style="font-family: inherit;">... The introduction of placeholders requires applying the formula for each permutation of the variables involved: having $k$ variables, we need to evaluate the formula $k!$ times.</span><br />
<span style="font-family: inherit;">Anyway, I don't expect to use the formula </span>with large sets of variables; after all, the principle of selecting a small and relevant set of features is always recommended!<br />
<br />
<span style="font-family: inherit;">As usual Stay Tuned.</span><br />
<span style="font-family: inherit;">Cristian.</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-42621368039004871342014-04-19T05:42:00.002-07:002014-04-19T10:51:30.384-07:00Waiting Time Polynomials: how to derive the analytical formula: Part IV<div>
<b>Introduction before you start</b><br />
I got many clarification requests about the Waiting Time Polynomials I published on the blog in the last three posts.<br />
The paper is almost ready to be submitted for review, but I think that some technical explanation might be interesting also for a non-academic audience.<br />
I consider myself a curious and hungry seasoned student, and I know how tedious it can be to read formulas and mathematical passages, especially on a blog!!<br />
So why technical explanations? <br />
The answer is in the following quote from one of my favourite scientists, Gregory Chaitin. In "The quest for Omega" he wrote:<br />
<span style="font-size: x-small;"><i><span style="font-family: 'CMR12';"><br /></span></i>
</span><i><span style="font-family: 'CMR12';">The books I loved were books where the author’s personality shows through,
books with lots of words, explanations and ideas, not just formulas and
equations! I still think that the best way to learn a new idea is to see
its history, to see why someone was forced to go through the painful and
wonderful process of giving birth to a new idea! To the person who discovered
it, a new idea seems inevitable, unavoidable. The first paper may be clumsy,
the first proof may not be polished, but that is raw creation for you, just as
messy as making love, just as messy as giving birth! But you </span><span style="font-family: 'CMBX12';">will </span><span style="font-family: 'CMR12';">be able
to see where the new idea comes from. If a proof is “elegant”, if it’s the
result of two-hundred years of finicky polishing, it will be as inscrutable as a
direct divine revelation, and it’s impossible to guess how anyone could have
discovered or invented it. It will give you no insight, no, probably none at
all.</span><span style="font-family: 'CMR12';"> </span></i><br />
<div class="page" title="Page 23">
<div class="layoutArea">
<div class="column">
<span style="font-family: CMR12;"><br /></span>
<br />
<span style="font-family: inherit;">That's the spirit that leads the following explanation!</span><br />
<br /></div>
</div>
</div>
<b>Definition of the problem</b></div>
<div>
Given an alphabet of 3 elements <span style="font-size: x-small;">$\{X_1,X_2,X_3\}$</span>, the function <span style="font-size: x-small;">$w(X_i) $</span> counts the number of failed trials before the last event <span style="font-size: x-small;">$ X_i $</span>.<br />
Consider now the following configuration:<span style="text-align: center;"> </span><span style="font-size: x-small; text-align: center;">\[ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0 \]</span><br />
<br />
<ul>
<li>What are the admitted sequences <span style="font-size: x-small;">$\{w(X_1),w(X_2),w(X_3)\}$ </span>?</li>
</ul>
<br />
<b>Step I: Find all the possible configurations of events</b><br />
How can we list the sequences of length $Z$ that can be built with <span style="text-align: center;"><span style="font-size: x-small;">$ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0$</span> ?</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfaZecCkxdatn8mp_tY0jFgl2k7jj3nO0h97HY_xj8sPfxeDMJhzxulcn3aoXVX5ZbpQXawrywkj45D1bYhPAum_5vkG2D-0_t7Cl5PuxcBi3Du33Mr0IIHBIgJBQ_cEUTOPSE0-sss-I/s1600/vector.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfaZecCkxdatn8mp_tY0jFgl2k7jj3nO0h97HY_xj8sPfxeDMJhzxulcn3aoXVX5ZbpQXawrywkj45D1bYhPAum_5vkG2D-0_t7Cl5PuxcBi3Du33Mr0IIHBIgJBQ_cEUTOPSE0-sss-I/s1600/vector.jpg" height="158" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of overall waiting time $w(x_i)$ in a succession of events.</td></tr>
</tbody></table>
</div>
<div>
<ul>
<li><span style="text-align: center;">once we set the values of the first two variables, the third it's determined by <span style="font-size: x-small;">$Z-i-j$</span>.</span></li>
<li><span style="text-align: center;">we imposed that all the variables </span>occur at least once, so we $X_1$ can assume all the values between <span style="font-size: x-small;">$[1,Z-2]$.</span></li>
<li>for each value of <span style="font-size: x-small;">$X_1$</span> the variable <span style="font-size: x-small;">$X_2$</span> can assume values between <span style="font-size: x-small;">$[1,Z-i]$.</span></li>
<li> $p_i$ is the probability that $X_i$ occur in a Bernullian trial.</li>
</ul>
Now we have all the ingredients to make the cake:</div>
<div>
<br />
<div style="text-align: center;">
<span style="font-size: x-small;"> <span style="font-family: Monaco;">$ \sum_{i=1}^{Z}\sum_{j=1}^{Z-i}\sum_{k=1}^{Z-i-j}{p_1^ip_2^jp_3^k}$</span><span style="font-family: Monaco;"> </span> </span></div>
</div>
<div>
<br /></div>
<div>
In the first two summations, the upper limits have been left at <span style="font-size: x-small;">$Z$</span> and <span style="font-size: x-small;">$Z-i$</span> just to keep the formula clean.</div>
<div>
...I let you prove why the result doesn't change :).<br />
One last point about this step: the limit of the above summation for <span style="font-size: 13px;">$Z \rightarrow \infty$ is $\frac{p_1 p_2 p_3}{\left(1-p_1\right) \left(1-p_2\right) \left(1-p_3\right)}$</span><br />
This limit will be used to build the probability density function.<br />
Curiosity (helpful for complexity analysis...):<br />
<ul>
<li>The number of sequences that can be built with vectors of length in <span style="font-size: x-small;">$[3,Z]$</span> is <span style="font-size: x-small;">$\binom{Z}{3}$</span></li>
<li>The number of sequences that can be built with vectors of length exactly <span style="font-size: x-small;">$Z$</span> is <span style="font-size: x-small;">$\binom{Z-1}{2}$</span> (both counts are checked in the sketch below)</li>
</ul>
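A quick numerical sanity check of these counts (a minimal Python sketch, assuming nothing beyond the definitions above):<br />
<pre>
from math import comb

def configurations(Z):
    # all (i, j, k) with i + j + k == Z and i, j, k at least 1
    return [(i, j, Z - i - j)
            for i in range(1, Z - 1)
            for j in range(1, Z - i)]

Z = 10
print(len(configurations(Z)), comb(Z - 1, 2))             # 36 36
print(sum(len(configurations(z)) for z in range(3, Z + 1)),
      comb(Z, 3))                                         # 120 120
</pre>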
<b>Step II: Waiting for an event!</b></div>
<div>
<div>
<span style="text-align: center;">What's the easiest way to describe the overall waiting time for an event in a finite succession?</span><br />
<span style="text-align: center;">There are many ways to get the <span style="font-size: x-small;">$w(x_i)$</span>, the easiest I found is given by the position of the last occurrence of<span style="font-size: x-small;"> $x_i$ </span>minus the number of occurrences of<span style="font-size: x-small;"> $x_i$</span>.</span><br />
For instance, let's consider <span style="font-size: x-small;">$w(x_1)$</span>:</div>
<div style="text-align: left;">
<ul>
<li>The position of the last occurrence of <span style="font-size: x-small;">$x_1$</span> is <span style="font-size: x-small;">$8$</span>;</li>
<li><span style="font-size: x-small;"> <span style="text-align: center;">$\left \vert{X_1} \right \vert = 4 $ </span></span></li>
<li><span style="font-size: x-small;">$w(X_1)=4$</span></li>
</ul>
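In code, the definition is a one-liner; here is a minimal Python sketch (the example sequence is mine, not the one in the figure):<br />
<pre>
def overall_waiting_time(seq, sym):
    # position of the last occurrence (1-based)
    # minus the number of occurrences of sym
    positions = [i + 1 for i, s in enumerate(seq) if s == sym]
    return positions[-1] - len(positions)

# x1 occurs 4 times, the last at position 8, so w(x1) = 8 - 4 = 4
seq = ['x1', 'x2', 'x1', 'x3', 'x1', 'x2', 'x3', 'x1', 'x2']
print(overall_waiting_time(seq, 'x1'))  # 4
</pre>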
<b>Where we are:</b><br />
The first two steps explain the circled pieces of the formula:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL47NSylmrnHPhvhyfWjseRsQwzwV4aio_8dAU5hhSPsr3Db7FhGekmL1b97Ivqq85MSIdu6Tk7mJl4cNJir12Bd9AO-ZdSWXiZ40B1d9dlR7KP5J2tjAGtrwDeiW7K3ChZU53RblC8fs/s1600/formulaStepIV.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL47NSylmrnHPhvhyfWjseRsQwzwV4aio_8dAU5hhSPsr3Db7FhGekmL1b97Ivqq85MSIdu6Tk7mJl4cNJir12Bd9AO-ZdSWXiZ40B1d9dlR7KP5J2tjAGtrwDeiW7K3ChZU53RblC8fs/s1600/formulaStepIV.bmp" height="70" width="400" /></a></div>
<br />
<b>What the "<i>overall waiting time</i>" for?</b></div>
<div style="text-align: left;">
For each event <span style="font-size: x-small;">$X_i$ </span>we are counting the holes among all its occurrences, so the smaller the overall waiting time, the closer together the occurrences of <span style="font-size: x-small;">$X_i$</span> are: it's a measure of proximity for the occurrences of <span style="font-size: x-small;">$X_i$</span>.</div>
<div style="text-align: left;">
What I did is to extend such a measure (it would be interesting to prove that it's really a measure!) to different kinds of events (random variables) <span style="font-size: x-small;">$\{X_1, X_2,...,X_n\}$</span> over the discrete time line.</div>
<div style="text-align: left;">
<b>Applications</b></div>
<div style="text-align: left;">
There are several areas for which this kind of analysis might be helpful; last time I showed its application as a powerful document classifier, where each variable <span style="font-size: x-small;">$X_i$</span> is a word of a document.</div>
<div style="text-align: left;">
If we consider a document as a succession of <span style="font-size: x-small;">$Z$</span> words, the proximity measure induced by the waiting time polynomials is a sort of fingerprint for the document, since for similar documents we expect the same words to be characterised by similar overall waiting times.<br />
Moreover, the dependencies among the words are considered, since we simultaneously take into account an arbitrary number of words (the alphabet <span style="font-size: x-small;">$\{X_1, X_2,...,X_n\}$</span>).</div>
<div style="text-align: left;">
<br />
In the next post I'll explain the logic behind the remaining pieces of the puzzle, which will make it easier to generalise the approach to an arbitrary alphabet.<br />
Stay Tuned!<br />
cristian</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com161tag:blogger.com,1999:blog-7631116270195175228.post-11517365870013432482014-04-08T12:41:00.002-07:002014-04-09T12:48:25.294-07:00Coefficients of Waiting Time Polynomials listed in OEIS Database!I'm happy to announce that the sequence generated by the coefficients of the waiting time polynomials has been listed in the OEIS database (The On-Line Encyclopedia of Integer Sequences).<br />
The sequence is: <a href="https://oeis.org/A239700" target="_blank">A239700</a>.<br />
In the next posts I'm going to explain how I derived the analytical formula of the polynomials.<br />
As usual: Stay Tuned!<br />
cristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-33518631146971672952014-03-28T13:59:00.000-07:002014-04-07T08:40:12.645-07:00Coefficients of Waiting Time Polynomials: a nice representation I was doing some simulations for the paper I'm writing about the <a href="http://textanddatamining.blogspot.ch/2014/03/waiting-time-polynomials-document.html" target="_blank">waiting time polynomials</a>, and I got something unexpected.<br />
If we sort in lexicographic order all the polynomials generated for a given length of the events vector, and we consider their coefficients, what we get is an apparently irregular series of integers.<br />
To capture some regularity, I decided to plot this matrix:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb6FiLPwQlnNg7jeZYBy4UakbfsvoHl8DOqKF_paGAK9bSkNdve6z15gZVG5irYbfEo6W9cH6OcVy9hXsPsQombSAE679CdYeb_PAZc4kulwZhlIm4d2KYqsdGiphyXFDWkvSuxwf_Www/s1600/coefficientMatrix.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb6FiLPwQlnNg7jeZYBy4UakbfsvoHl8DOqKF_paGAK9bSkNdve6z15gZVG5irYbfEo6W9cH6OcVy9hXsPsQombSAE679CdYeb_PAZc4kulwZhlIm4d2KYqsdGiphyXFDWkvSuxwf_Www/s1600/coefficientMatrix.bmp" height="372" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The matrix plot, obtained considering an alphabet of three words and vectors having length=35.</td></tr>
</tbody></table>
<br />
Isn't it nice?<br />
Sometimes an appropriate graphical representation helps capture interesting aspects of your data.<br />
<br />
Stay Tuned<br />
CristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-85563826640073463122014-03-18T15:02:00.001-07:002014-03-24T12:47:01.296-07:00Waiting Time Polynomials Document Classifier - Part III<b>Abstract</b><br />
This post presents an innovative definition of polynomials associated with waiting time processes (analyzed in the former <a href="http://textanddatamining.blogspot.ch/2014/03/statistical-distribution-to-describe.html" target="_blank">posts</a>). Such polynomials are here successfully used as a document classifier. Comparative tests with SVM show significant accuracy improvements.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji_ATL53FcJYvJAykNm3PCsYFHA-ns2jHSf0TClQnAKOnNizznna8U6kVNqNFW0KVm0AtcfvQBS7EVY4MgppkliHoU-msW8Jp8XgdIHXiMrUs-2tjEJpOPVV6Zj6jOnvR6aZyLUtFevDg/s1600/WTPvsSVMsize35.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji_ATL53FcJYvJAykNm3PCsYFHA-ns2jHSf0TClQnAKOnNizznna8U6kVNqNFW0KVm0AtcfvQBS7EVY4MgppkliHoU-msW8Jp8XgdIHXiMrUs-2tjEJpOPVV6Zj6jOnvR6aZyLUtFevDg/s1600/WTPvsSVMsize35.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Boolean Classification tests based on test set of 8k randomly generated documents composed using an alphabet of three words.</td></tr>
</tbody></table>
<b>Introduction</b><br />
To encourage those who found the formula I presented a couple of posts ago quite intricate, I'm going to show you a practical application of it, which might be a good incentive to analyze the formal aspects with more attention :)<br />
What I show you today is one of several applications of this approach: a document classifier with higher accuracy than traditional methods such as SVM (trained with a Gaussian kernel) and back-propagated neural networks.<br />
<br />
<b>Characteristics of the classifier</b><br />
<ul>
<li>It's a supervised learning algorithm</li>
<li>It's completely non-parametric</li>
<li>It can be used natively to classify multi-class datasets.</li>
</ul>
<br />
<b>The Algorithm</b><br />
Let's assume we have a training set composed of two classes of documents: <em>Cl_1, Cl_2.</em><br />
<strong></strong><br />
<strong> </strong><u>Learning Phase: Estimation of Geometrically distributed Random Variables.</u><br />
<ol>
<li><em>Define an alphabet of three words {w1,w2,w3} using frequency criteria or more sophisticated techniques.</em></li>
<li><em>For each class of training set:</em></li>
<ul>
<li><em>estimate the parameters {p1, p2, p3} of the respective geometrically distributed random variables {X1(w1),X2(w2),X3(w3)}.</em></li>
</ul>
<li style="text-align: center;"><em>Calculate the polynomials associated to {X1(w1),X2(w2),X3(w3)} using:</em></li>
</ol>
<em></em><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRwr_ajZ-E3xqIH9BbjQYwQbKhs0i-h7yyhXoD2wb0hdKOr7txVYnimQ6sPNuMi3KtY1mxWFDNvmCMekNXVPAq47xtHXR6TagB_AOpFpkdBAE8gSx-LqU7rYvJxmvfF7nHvmqwyLayd8/s1600/formulaCorrected.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRwr_ajZ-E3xqIH9BbjQYwQbKhs0i-h7yyhXoD2wb0hdKOr7txVYnimQ6sPNuMi3KtY1mxWFDNvmCMekNXVPAq47xtHXR6TagB_AOpFpkdBAE8gSx-LqU7rYvJxmvfF7nHvmqwyLayd8/s1600/formulaCorrected.bmp" height="70" width="400" /></a></div>
<br />
<br />
<strong> </strong><u>Testing Phase: document classification</u><br />
<ol>
<li><em> for each document Di of the test set:</em></li>
<ul>
<li><em>Identify the number of occurrences of {w1,w2,w3}: {O_w1,O_w2,O_w3}</em></li>
<li><em>Select the polynomial whose associated term matches the occurrences, i.e. the one multiplying:</em></li>
<li style="text-align: center;"><em>p1^O_w1 p2^O_w2 p3^O_w3 (a code sketch of this phase follows the list).</em></li>
<li><em>Calculate the value of the polynomials P_Cl_1,<span class="Apple-style-span" style="font-style: normal;"><em> P_Cl_2 </em></span>using:</em></li>
<ol>
<li><em><span class="Apple-style-span" style="font-style: normal;"><em>{p1, p2, p3} estimated for Cl_1</em></span></em></li>
<li><i><span class="Apple-style-span" style="font-style: normal;"><em>{p1, p2, p3} </em></span></i><em><span class="Apple-style-span" style="font-style: normal;"><em>estimated for Cl_2</em></span></em></li>
</ol>
</ul>
<li><i>Classify the document:</i></li>
<ol>
<li><i>If (<span class="Apple-style-span" style="font-style: normal;"><em>P_Cl_1</em></span>></i><em><span class="Apple-style-span" style="font-style: normal;"><em>P_Cl_2) Di belongs to Cl_1</em></span></em></li>
<li><em><span class="Apple-style-span" style="font-style: normal;"><em>Else </em></span></em><em><span class="Apple-style-span" style="font-style: normal;"><em>Di belongs to Cl_2</em></span></em></li>
</ol>
</ol>
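To fix ideas, here is a skeletal Python version of the testing phase. The function <i>waiting_time_polynomial</i> is a placeholder for the boxed formula above (not reproduced here), and all names are my own:<br />
<pre>
from collections import Counter

def classify(doc_words, alphabet, params_cl1, params_cl2,
             waiting_time_polynomial):
    # waiting_time_polynomial(counts, (p1, p2, p3)) stands in for the
    # formula above: it evaluates the polynomial selected by the
    # occurrence counts, with the class parameters plugged in
    occ = Counter(w for w in doc_words if w in alphabet)
    counts = tuple(occ[a] for a in alphabet)   # (O_w1, O_w2, O_w3)
    p_cl1 = waiting_time_polynomial(counts, params_cl1)
    p_cl2 = waiting_time_polynomial(counts, params_cl2)
    return 'Cl_1' if p_cl1 > p_cl2 else 'Cl_2'
</pre>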
<b>Examples</b><br />
<i>1. How to select the polynomial</i><br />
<br />
Let's consider the polynomials calculated using the aforementioned formula and assume that in document <i>Di </i>the word <i>w1</i> is repeated 2 times, the word <i>w2 </i>is repeated 2 times and <i>w3 </i>is repeated 1 time.<br />
Then, in step 1 of the testing phase we have to choose the polynomial boxed in the list below:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3wD0u2GGj6GmPQ5YaUGeKU-L3qlPWl5VB-kSnijWRrOnzivbch2y54miWz9Hyjd0ENBxOZQLcg_xaR8RiO10oGeLrb_0vpd8AR6AtRJfr63lbBG26LjOqnLhwkma8nqgM4kEeCvHAMUM/s1600/exampleOfPolynomial.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3wD0u2GGj6GmPQ5YaUGeKU-L3qlPWl5VB-kSnijWRrOnzivbch2y54miWz9Hyjd0ENBxOZQLcg_xaR8RiO10oGeLrb_0vpd8AR6AtRJfr63lbBG26LjOqnLhwkma8nqgM4kEeCvHAMUM/s1600/exampleOfPolynomial.bmp" height="50" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Polynomials generated for <i>O_w1+ O_w2+O_w3</i>=5</td></tr>
</tbody></table>
<i>2. Same polynomial, different values of </i><i><span class="Apple-style-span" style="font-style: normal;"><em>p1, p2, p3.</em></span></i><br />
<div>
<br /></div>
<div>
- How many polynomials are generated for <i>O_w1+ O_w2+O_w3 </i>= 25?</div>
<div>
The answer is straightforward: 276, that is, all the possible configurations of 3 addends whose sum is 25. In the Encyclopedia of Integer Sequences there are exciting explanations of such <a href="https://oeis.org/A000217" target="_blank">series</a>.</div>
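A one-line check of that count (Python):<br />
<pre>
from math import comb
# compositions of 25 into 3 positive addends
print(sum(1 for i in range(1, 24) for j in range(1, 25 - i)),
      comb(24, 2))   # 276 276
</pre>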
<div>
<br /></div>
<div>
- How does the polynomial change for different values of <i><span class="Apple-style-span" style="font-style: normal;"><em>p1, p2, p3?</em></span></i></div>
<div>
It's quite impressive how different two polynomials can be despite the settings of <i>{p1, p2, p3}</i> being almost identical.</div>
<div>
Look at this example for which:</div>
<div>
<ul>
<li>On the left chart I plotted a polynomial with <em style="font-style: normal;">p1= 0.33, p2= 0.33, p3=0.34.</em></li>
<li>On the right chart I plotted a polynomial with <em style="font-style: normal;">p1= 0.4, p2= 0.3, p3=0.3.</em></li>
</ul>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE4aXz7yjy49pwKUIGEsItiQ6VnxOdrUnt3sRXpu3SfoeyQ_VeygcQL4I8HMGxe3-dyStcOU4HixxK_11Qu9CWkY1obk-djvwPb1kOYEapDEYgeL36urr-OhUMj607Q5KHGLfezRjvWr4/s1600/plotValues_almostthesame.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE4aXz7yjy49pwKUIGEsItiQ6VnxOdrUnt3sRXpu3SfoeyQ_VeygcQL4I8HMGxe3-dyStcOU4HixxK_11Qu9CWkY1obk-djvwPb1kOYEapDEYgeL36urr-OhUMj607Q5KHGLfezRjvWr4/s1600/plotValues_almostthesame.bmp" height="175" width="400" /></a></div>
<div>
<ul>
<li>In the first case the polynomial takes its maximum value for <i>O_w1=8, O_w2=8, O_w3</i>=9 ...not a big surprise!</li>
<li>In the second case the polynomial takes its maximum value for <i>O_w1=15, O_w2=9, O_w3</i>=1. This result is more complicated to explain!</li>
</ul>
<b>Document classification comparative test</b></div>
<div>
<br /></div>
<div>
To test the accuracy of my method I performed a comparative test using a randomly generated document set.</div>
<div>
<ul>
<li>The training set was composed of 100 documents (used to train a Gaussian-kernel SVM and to estimate the parameters p_i for my method).</li>
<li>The test set was composed of (see the generation sketch after this list):</li>
<ul>
<li><i>Cl_1: </i>4000 documents randomly generated using a configuration of <i>{p1, p2, p3}.</i></li>
<li><i>Cl_2: </i>4000 documents randomly generated using a different configuration of <i>{p1, p2, p3}.</i></li>
</ul>
<li>The accuracy has been tested using different configurations of <i>{p1, p2, p3}</i> and considering different sizes of documents.</li>
</ul>
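This is roughly how such a test set can be generated (a Python sketch; that each position is an independent draw is an assumption of mine about the generation scheme):<br />
<pre>
import random

def generate_document(probs, size, alphabet=('w1', 'w2', 'w3')):
    # each of the `size` positions is an independent draw of one
    # word of the alphabet with probabilities (p1, p2, p3)
    return random.choices(alphabet, weights=probs, k=size)

cl1 = [generate_document((0.33, 0.33, 0.34), 25) for _ in range(4000)]
cl2 = [generate_document((0.40, 0.30, 0.30), 25) for _ in range(4000)]
</pre>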
<b>Results</b></div>
<div>
First experiment: </div>
<div>
only the first 25 words have been considered to estimate the parameters <i>{p1, p2, p3}, </i>to train the SVM, and to test the accuracy.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmO-ONn3ATDZo0LxaAeIcyjoyQ4QlMkL8w4m_LHMGdj6Rtbk6lI58g4CRba2ZzgUKij2b0RCo2hN0UBub8v6NtaD0ftqebT7aC8K8NxWRXLpZhyphenhyphenUI6IKEpF84NMXTruOMfnNUH4dSRxs/s1600/WTPvsSVMsize25.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmO-ONn3ATDZo0LxaAeIcyjoyQ4QlMkL8w4m_LHMGdj6Rtbk6lI58g4CRba2ZzgUKij2b0RCo2hN0UBub8v6NtaD0ftqebT7aC8K8NxWRXLpZhyphenhyphenUI6IKEpF84NMXTruOMfnNUH4dSRxs/s1600/WTPvsSVMsize25.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Accuracy results considering just the first 25 words.</td></tr>
</tbody></table>
<div>
Second experiment: </div>
<div>
Same test as above using the first 35 words of the documents.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0i21PHEytTaPOz03bVljT75GeMey_VV8cB-41UQMgiM4xOyhJzdbsJlKdFkezDZnTqCO2BhCsgIdL6oSRHui8SMt3ADo52qufVBaKmmay6rntiOrhqjKFWC49urKtf4gy2KfSs-DH5bc/s1600/WTPvsSVMsize35.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0i21PHEytTaPOz03bVljT75GeMey_VV8cB-41UQMgiM4xOyhJzdbsJlKdFkezDZnTqCO2BhCsgIdL6oSRHui8SMt3ADo52qufVBaKmmay6rntiOrhqjKFWC49urKtf4gy2KfSs-DH5bc/s1600/WTPvsSVMsize35.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">Accuracy results considering just the first 35 words.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
All the results shown here refer to the average accuracy over 10 different randomly generated test sets, with the SVM parameters tuned to maximize accuracy.</div>
<div>
</div>
<div>
<b>Considerations</b></div>
<div>
As you can see, the method I'm proposing, based on the definition of "Waiting Time Polynomials", performs significantly better than SVM.</div>
<div>
More comparative tests will be shown in the publication I'm writing on this topic.</div>
<div>
<br /></div>
<div>
<b>Further notes</b></div>
<div>
Processes based on waiting time or geometrically distributed random variables are extremely important for risk assessment and risk evaluation.</div>
<div>
I'll show you in another post some applications of such polynomials in this field.</div>
<div>
As usual, stay tuned</div>
<div>
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com5tag:blogger.com,1999:blog-7631116270195175228.post-64764179088623106392014-03-06T09:51:00.000-08:002014-07-30T02:09:42.709-07:00The Mind Map project: a new concept of Enterprise Knowledge Management<b>Abstract</b><br />
A project to build an innovative Information Management tool to extract, correlate and expose unstructured information.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB-JEd-tD_-Oq2uF1_lT1LcJkohW7AecEKfRxTE214_COTJTbe19CQtEebhPPxaOa4QeesIxDCBiB2V9JTXA08DdmBkqTeoUPX-w_f7gO7VNdklRVEpr7esIMvKNxwyBIYCxwJQ4UjdbQ/s1600/MM4.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB-JEd-tD_-Oq2uF1_lT1LcJkohW7AecEKfRxTE214_COTJTbe19CQtEebhPPxaOa4QeesIxDCBiB2V9JTXA08DdmBkqTeoUPX-w_f7gO7VNdklRVEpr7esIMvKNxwyBIYCxwJQ4UjdbQ/s1600/MM4.bmp" height="355" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A screen shot of one of the mindmap generated by the tool.<br />
Different colors have been used to depict sub-concepts. </td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b>The demo</b><br />
Some time ago I published on this blog some posts where I presented algorithms I designed to extract relevant information from documents and, more generally, unstructured content (such as tweets, blog posts, web pages).<br />
I don't want to spend too many words on it; I guess a demo better describes the idea I have in mind.<br />
It's still a prototype, and a lot of work is still required, but I hope the video conveys the idea.<br />
In the video I tested the prototype of the application using a Wikipedia <a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">page</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='400' src='https://www.youtube.com/embed/TFnD-BTctXI?feature=player_embedded' frameborder='0'></iframe></div>
<br />
PS<br />
<b>For the best experience, watch the video on YouTube with the "HD" quality option.</b><br />
<br />
...Looking forward to receiving feedback! <br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Stay Tuned</div>
<div class="separator" style="clear: both; text-align: left;">
cristian</div>
<br />Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com5tag:blogger.com,1999:blog-7631116270195175228.post-70948818181261003852014-03-03T23:15:00.001-08:002014-03-24T13:05:02.575-07:00Statistical Distribution to describe a Document - My Contribute Part II<b>Abstract</b><br />
<b><span class="Apple-style-span" style="color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; font-weight: normal; line-height: 18px;">it's presented the second step to generalize a probabilistic density function (based on geometrical distribution) to describe a document thru the underlying stochastic process.</span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7gJtUcRys9q-xoEcl6MzrMjkUNzIa6kw91wGVJkaQfV6HrhlCBbaOV8LyxWcLKk97eCPQUpxiac5BI41c6sXUqLX9bIKWLj5CGufL0EWCcITYx5swi2cI8130cQWsqBJ-b7KXIKp3qQ/s1600/03503503.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7gJtUcRys9q-xoEcl6MzrMjkUNzIa6kw91wGVJkaQfV6HrhlCBbaOV8LyxWcLKk97eCPQUpxiac5BI41c6sXUqLX9bIKWLj5CGufL0EWCcITYx5swi2cI8130cQWsqBJ-b7KXIKp3qQ/s1600/03503503.jpg" height="230" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Marginal PDF and CDF for document described by three words.<br />
In red has been depicted the distribution of a document having size = 35, in Yellow size=34, in Blue size 33.<br />
<br /></td></tr>
</tbody></table>
The marginal distribution that describes the density of a document of fixed size composed of three words is the following:<br />
<div>
<div style="text-align: center;">
<div style="text-align: -webkit-auto;">
<br /></div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRQm4CuVUf-PjrNctIpe0txqlkGw9Em0ARmXvKNx2XvWLJLnhhdDvXun7CvvXxUy7oZit09k3pNjfXQy_HvZdPVBG1TQVAOYlMV63lJbwolcxHVokqLz6ZrDiT_kNE9jCsy10dB6id3XQ/s1600/formulaCorrected.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRQm4CuVUf-PjrNctIpe0txqlkGw9Em0ARmXvKNx2XvWLJLnhhdDvXun7CvvXxUy7oZit09k3pNjfXQy_HvZdPVBG1TQVAOYlMV63lJbwolcxHVokqLz6ZrDiT_kNE9jCsy10dB6id3XQ/s1600/formulaCorrected.bmp" height="70" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis7jRMwgJROq6OOwjW9WqPA0IJAERTAPT7s5babQQ3bsQ9z0GVYgur7tSFIW0kJWI3Se0BiEl-kPBu9ubZY9lS2I_ZWKtm1UPAmrG7H-qbYdemnwhTqUCwN0Z4LF3nQdjtZ6uIR4w3w3Q/s1600/indexesCoeff.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis7jRMwgJROq6OOwjW9WqPA0IJAERTAPT7s5babQQ3bsQ9z0GVYgur7tSFIW0kJWI3Se0BiEl-kPBu9ubZY9lS2I_ZWKtm1UPAmrG7H-qbYdemnwhTqUCwN0Z4LF3nQdjtZ6uIR4w3w3Q/s1600/indexesCoeff.bmp" height="160" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of alpha, beta, gamma parameters determination. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b><br /></b>
<b>Introduction</b></div>
<div>
The bag-of-words approach and almost all the related techniques to extract features from a document are based on the manipulation of the frequencies associated with the words of the document. Such methodologies tend to fail when documents are characterized by exactly the same bag of words and the same frequencies.</div>
<div>
<br /></div>
<div>
The proposed approach bypasses this problem since it takes into account the waiting times of the words in a document.</div>
<div>
<br /></div>
<div>
<b>From 2 variables to 3 variables</b></div>
<div>
In the former post I presented the easiest case: a document depicted by two words. The generalization of the problem is not a painless process.</div>
<div>
To understand the tricks used to tackle the problem, I'm going to explain the passage from 2 to 3 variables.</div>
<div>
...To be continued.</div>
<div>
c.</div>
<div>
PS: I'm back!</div>
<div>
</div>
<div>
<br /></div>
<div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-86853900248329096892013-10-29T14:39:00.000-07:002013-10-29T14:39:55.647-07:00Document as Stochastic Process: My contribute - part 1<b>Abstract</b><br />
A statistical discrete distribution (based on geometric random variables) to depict a document. This post presents a probability density function to describe a document through the underlying stochastic process.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1u0VOVKEJFa8EOT_etAKZu5IsdKWuazeIBb6d39f57uky2SkoqjwdyOiRlSTo-QFhqJd3c2dPJvbZaPqdYHXDHig3IOFh_XVjZHfy43fvbQwke4l1oow9nhGeLiCD-uceKRDlKSmGd8g/s1600/abstract.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1u0VOVKEJFa8EOT_etAKZu5IsdKWuazeIBb6d39f57uky2SkoqjwdyOiRlSTo-QFhqJd3c2dPJvbZaPqdYHXDHig3IOFh_XVjZHfy43fvbQwke4l1oow9nhGeLiCD-uceKRDlKSmGd8g/s400/abstract.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
PDF-CDF charts and PDF analytical formula for document of length Z.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Results for a document of two words.</div>
</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<b>Introduction</b><br />
We have often discussed the importance of the feature extraction step in representing a document.<br />
<div>
So far, I have mostly presented techniques based on "bag of words" and some more sophisticated approaches to capture the relations among such features (graph entropy, mutual information, and so on).</div>
<div>
<br /></div>
<div>
In the last weeks I started to think about a different perspective in order to capture the essence of a document.</div>
<div>
<b><br /></b></div>
<div>
<b>The idea</b></div>
<div>
What is a document from a statistical point of view?</div>
<div>
I tried to give an answer, and I came to the following (expected) conclusions:</div>
<div>
<ol>
<li>A document is a stochastic process of random variables - the words of the document.</li>
<li>In a document, as in any statistical process, the occurrences of the random variables and functions over them don't provide a complete description of the process.</li>
<li>The joint distribution of the waiting times between two occurrences of the same variable (word) encloses all the significant links among the words. </li>
</ol>
The idea is to depict a document as a stochastic process of "waiting time" random variables.</div>
<div>
This formalization allows us to think of a document as a statistical distribution characterized by its own density and cumulative functions.<br />
The waiting time is very well depicted by the geometric distribution.</div>
<div>
<br /></div>
<div>
<b>Formalization </b></div>
<div>
A document is a process of geometrically distributed random variables:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYHHIn_MXPRT6IMmCvsJ276BOe3OtlXC6C5tBoT4dold32WKtDLeDx6n6mJV8mEL2ybsVvyNIdm5vMPBNvJtAAkfRVh01nvecmYlOUMq1r76AbnW_HQUwdhpgs5e8Jut-MCHxFBBc3c04/s1600/formalization.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="20" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYHHIn_MXPRT6IMmCvsJ276BOe3OtlXC6C5tBoT4dold32WKtDLeDx6n6mJV8mEL2ybsVvyNIdm5vMPBNvJtAAkfRVh01nvecmYlOUMq1r76AbnW_HQUwdhpgs5e8Jut-MCHxFBBc3c04/s400/formalization.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">N= size of the document, j = #words.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
where each random variable is described as follow:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh02gPNxbLKc20QfKm85_wvmc6cYuoQUe7wmaecfolzLWe7HbkF5pwZTSHAF_l2h1V2e8e5YoMGTAwd3sQbT-nwnxFk6ZYlc5Dvb6StPptFptOOdh9FW-miYkYTyDS0AKXl8sr8sTzhuMk/s1600/varXgeom.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="18" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh02gPNxbLKc20QfKm85_wvmc6cYuoQUe7wmaecfolzLWe7HbkF5pwZTSHAF_l2h1V2e8e5YoMGTAwd3sQbT-nwnxFk6ZYlc5Dvb6StPptFptOOdh9FW-miYkYTyDS0AKXl8sr8sTzhuMk/s400/varXgeom.bmp" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Let's consider the following document composed of a sequence of two words: "a" and "b".</div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: left;">
<b>The Problem</b></div>
<ul>
<li>What is the probability function associated with a document Dk?</li>
</ul>
Given the following document:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAaVb5osI1qXrdzxxoxTWgw6rpNko74w9Js4wlmrDqhw7xGN7hx3n_PHebOVW9U5S0Uiuxg96ypfhFckoXJJcJf8JJ2rzDfsDFcD6K7ZUcbK66Zq0ZebBRn5jDbQxHQbY8nnCSoPmsG50/s1600/document.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAaVb5osI1qXrdzxxoxTWgw6rpNko74w9Js4wlmrDqhw7xGN7hx3n_PHebOVW9U5S0Uiuxg96ypfhFckoXJJcJf8JJ2rzDfsDFcD6K7ZUcbK66Zq0ZebBRn5jDbQxHQbY8nnCSoPmsG50/s1600/document.bmp" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The Probability of Dk is given by:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSbyTF6kVkqPyfO8Ud4Ddzaytq4usvBA0hwSFJq6Ng0ylcDKd1Vtl2wlPXETrtpORa8U7pB4Br-kGbJrDsoXmkMDbgEC5vjSPqsW8KyaF2gbvsdRBG0FHLXn3uDOjF0aQPmr4p6uTNiQQ/s1600/DocProbability.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="15" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSbyTF6kVkqPyfO8Ud4Ddzaytq4usvBA0hwSFJq6Ng0ylcDKd1Vtl2wlPXETrtpORa8U7pB4Br-kGbJrDsoXmkMDbgEC5vjSPqsW8KyaF2gbvsdRBG0FHLXn3uDOjF0aQPmr4p6uTNiQQ/s320/DocProbability.bmp" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The image below should better explain what I mean:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMl1CC6YKBH7tPAlnrNY4mC9B2YGUzrLO6EBO01yWer949cdT98GjIrj1_uqLl0nI0YSpuPeKLLX9UJlL4Xo6o7ojb4c1Klt9QKe4Qynre5-uMAIy_aLSVaR79MPctFtqHLok8y0h9lMs/s1600/docProb.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMl1CC6YKBH7tPAlnrNY4mC9B2YGUzrLO6EBO01yWer949cdT98GjIrj1_uqLl0nI0YSpuPeKLLX9UJlL4Xo6o7ojb4c1Klt9QKe4Qynre5-uMAIy_aLSVaR79MPctFtqHLok8y0h9lMs/s400/docProb.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Probability of Document Dk<br />
<b>q = 1-p</b></td></tr>
</tbody></table>
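As a sanity check, the probability in the picture can be computed directly. A Python sketch for the two-word case, assuming, as in the figure, that each position is an independent draw of "a" with probability p_a and "b" with q_a = 1 - p_a:<br />
<pre>
def doc_probability(doc, p_a):
    # product of the per-position factors: p_a for every "a" and
    # q_a = 1 - p_a for every "b" (the waiting-time gaps)
    n_a = doc.count('a')
    return p_a ** n_a * (1.0 - p_a) ** (len(doc) - n_a)

print(doc_probability("abbabaab", 0.3))  # 0.3**4 * 0.7**4
</pre>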
<div class="separator" style="clear: both; text-align: left;">
<b>Probability Density Function </b></div>
<div class="separator" style="clear: both; text-align: left;">
<u>-> From now on, if not clearly specified, I will refer to a document of two words (as described above).</u></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Questions:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>What is the analytical representation of the above function?</li>
<li>Is the above function a probability function?</li>
</ol>
<br />
<div class="separator" style="clear: both; text-align: left;">
<b>1. Analytical formulation</b></div>
<div class="separator" style="clear: both; text-align: left;">
This is the fun part: I like combinatorics... the art of tallying!</div>
<div class="separator" style="clear: both; text-align: left;">
To solve the problem there are, I'm sure, several strategies; I used the one most aligned with my mind setting: I put all the options on the table and start gathering them together according to some criteria :)</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ul>
<li>The formulation of the "p" is a joke: the sum of the exponents must be equal to the size of the vector.</li>
<li>the formulation for the "q" requires some deeper thinking.</li>
</ul>
As you know, I don't like too many technicalities, so I'll spare you the theoretical aspects.<br />
I think it's more worthwhile to show how the solution comes about!<br />
<u><br /></u>
<u>Exponents:</u><br />
Let's put the <b>q_a</b> and <b>q_b</b> exponents, properly sorted, in a matrix:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHnH5fckzwc-DxeEKuLfA9z7G7ONHjCIihf5W7MxONA6BZ68Ewkl_BJl1ZXtPkBaBV3Lc-CY2Z3grtCprDKybrn-_QgbRiyhaodjGbrFZKfMIGPQcXesUf49ewXtPu9eCRn5KDidM4qJA/s1600/matrixExponents.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHnH5fckzwc-DxeEKuLfA9z7G7ONHjCIihf5W7MxONA6BZ68Ewkl_BJl1ZXtPkBaBV3Lc-CY2Z3grtCprDKybrn-_QgbRiyhaodjGbrFZKfMIGPQcXesUf49ewXtPu9eCRn5KDidM4qJA/s320/matrixExponents.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The split over the anti-diagonal makes the calculus easier.</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: left;">
<u>Coefficients:</u></div>
<div class="separator" style="clear: both; text-align: left;">
The same configuration of exponents can be related to different configurations of the words in the document:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLfxc_TvL_wadNdxzEPAU5Hm1JNd1_386laKTp6aBavMQjt13zLIqkDRVDfjgbSYwCkZZ4DROX3oOQmNgxjW-Y1aRl2hc6v5PSm2cs_gWMvgQU00TV2SA8x_w0FxdeX19aqmfpyWfpkpY/s1600/coefficients.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLfxc_TvL_wadNdxzEPAU5Hm1JNd1_386laKTp6aBavMQjt13zLIqkDRVDfjgbSYwCkZZ4DROX3oOQmNgxjW-Y1aRl2hc6v5PSm2cs_gWMvgQU00TV2SA8x_w0FxdeX19aqmfpyWfpkpY/s320/coefficients.jpg" width="224" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The different configurations (20) of two words in a document having length = 8. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
Let's put the occurrences of the different configurations of words for the same exponent configuration in a matrix:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCc1-fdsqscMUmlDgotgqy3wUK07xTmCnqregOza0pHuM_NjJlGENh1RQnJi3EKc9mC0MSrcPVo6pYiqL6isKxfgnCA-EyVoeeJxuwWp6wBmys6AkbH7HRXdHX22aeAWOwRw0opPOssf4/s1600/matrixCoefficient.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCc1-fdsqscMUmlDgotgqy3wUK07xTmCnqregOza0pHuM_NjJlGENh1RQnJi3EKc9mC0MSrcPVo6pYiqL6isKxfgnCA-EyVoeeJxuwWp6wBmys6AkbH7HRXdHX22aeAWOwRw0opPOssf4/s1600/matrixCoefficient.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
# o<span class="Apple-style-span" style="font-size: x-small;">ccurrences of the different configurations of words for the same exponents configuration</span><span class="Apple-style-span" style="font-size: small;">.</span></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-size: x-small;">Do you see the similarities</span><span class="Apple-style-span" style="font-size: small;"> </span><span class="Apple-style-span" style="font-size: x-small;">respect the two halves of the matrix?</span></div>
</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
Is it easier now find the "low" to build the above matrix?<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div>
... If you don't like playing around with combinatorics, you can ask for help from <a href="http://oeis.org/" target="_blank">The On-Line Encyclopedia of Integer Sequences®</a>!</div>
<div>
Ok, ok you are too lazy...: the solution is Binomial[k+i,k].</div>
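If you want to regenerate the half-matrix yourself, a couple of lines suffice (Python; the 0-based indices i and k are my convention):<br />
<pre>
from math import comb

size = 6
half = [[comb(k + i, k) for k in range(size)] for i in range(size)]
for row in half:
    print(row)   # Binomial[k + i, k]: 1 1 1 ..., 1 2 3 ..., 1 3 6 ...
</pre>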
<div>
<b><br /></b></div>
<div>
<b>The Analytical formula</b></div>
<div>
The above matrices suggest that the analytical formula is composed of two main addends: one for the orange side and another for the green side.<br />
For a document of two words, the probability density function is the following:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZKZpgpOtZagZkMmFhizh1q8iSXPBHCeePFEsAv6RQno-kYKB3IgJ26NqZ0FwHDwjaQyEFNeqSwqRTl9onz2m2K_DGjRVWDGP0zlLj5l9lhhZEkp9ce_XBU1CZ1phQOOmfKY0UusCRBjE/s1600/Formula.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZKZpgpOtZagZkMmFhizh1q8iSXPBHCeePFEsAv6RQno-kYKB3IgJ26NqZ0FwHDwjaQyEFNeqSwqRTl9onz2m2K_DGjRVWDGP0zlLj5l9lhhZEkp9ce_XBU1CZ1phQOOmfKY0UusCRBjE/s400/Formula.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PDF for a document composed by two words.<br />
qa = 1-pa = pb; pa+pb = 1.</td></tr>
</tbody></table>
I don't want to enter into theoretical details to prove that the above formula is a probability function (after all, this is a blog by practical means!).<br />
The only point I want to remark on is that the convergence to a PDF is guaranteed as the length of the document tends to infinity.<br />
The above formula can be further reduced and simplified (in the easy case of 2 words) if you consider the following conditions:<br />
<div style="text-align: center;">
<b>qa = 1-pa = pb => qb=pa (pa+pb = 1)</b><span class="Apple-style-span" style="font-size: 13px;">.</span></div>
Basically, in the above case the PDF is a function of 1 variable.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRH3G0qBPjlhgcPGZIGkfH28W20a-g-WSvIEWAiBSQ4WZ1pzVIkTCChL6EnZAQJe-sGhjOQt2xfgZVmQID3wPrrzw52_Z1Eve7UMTzrynqWmeeuprcQ9nwpcXQEQmCPpJbPPGjSr_-Yt4/s1600/abstract.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRH3G0qBPjlhgcPGZIGkfH28W20a-g-WSvIEWAiBSQ4WZ1pzVIkTCChL6EnZAQJe-sGhjOQt2xfgZVmQID3wPrrzw52_Z1Eve7UMTzrynqWmeeuprcQ9nwpcXQEQmCPpJbPPGjSr_-Yt4/s400/abstract.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PDF and CDF for a document of 2 words.</td></tr>
</tbody></table>
In the CDF, we sum the probability contributions of all documents composed of two words having length = 1, 2, ..., Infinity.<br />
As you can see, the contribution of documents having length >50 is negligible for whatever value of pa.<br />
<br />
These kinds of distributions based on geometric variables are often used in maintenance and reliability problems, so I encourage you to do some reading on that!<br />
<br />
<br />
<b>Next Steps</b><br />
In the next posts I'm going to explain<br />
<ul>
<li>some properties of the function</li>
<li>its extension to multiple variables </li>
<li>entropy</li>
<li>practical applications </li>
</ul>
<b>Last Consideration</b><br />
I did the above exercise just for fun, without any pretension of having discovered something new and/or original: if you have some paper/book/reference about identical or similar work, please don't hesitate to contact me! <br />
<br />
Don't rush, stay tuned<br />
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-19548337490549038632013-06-30T06:06:00.002-07:002013-06-30T06:06:55.830-07:00Image objects recognition: a basic approachTo complete the brief dissertation on the most popular problems of image analysis, I did a couple of experiments on a very intriguing topic: the identification of specific objects within an image.<br />
I didn't use "state of art" algorithms (...my knowledge on the field is not deep and I wont use techniques that I don't fully understand!).<br />
The target of this experiment isn't to show the best way, but just to show the rudiments of the matter.<br />
<br />
<b>Objective</b><br />
Given an image called the <i>target image</i>, and another image called the <i>feature</i>, the objective is to identify the <i>feature</i> within the <i>target image</i>.<br />
<i><br /></i>
<b>Technique </b><br />
I decided to play with a technique that I studied in a signal processing course at university: matrix correlation.<br />
<br />
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<span class="Apple-style-span" style="font-family: inherit;">The matrix correlation is typically used to detect similarities between two 2D signals, which are often saved in matrice<span class="Apple-style-span" style="font-family: inherit;">s. </span></span></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The magnitude of the correlation shows how similar the signals are. If the correlation is large, the two signals are considered very similar; if the correlation is zero, the two signals are considered independent. Analytical definitions and explanations are available in any signal processing book.</div>
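For the curious, the core of the method fits in a few lines. Below is a minimal Python sketch (my own, not the exact code used here), assuming the target and the feature are 0/1 numpy arrays (e.g. after the preprocessing described below):<br />
<pre>
import numpy as np
from scipy.signal import correlate2d

def match_feature(target, feature, threshold=0.9):
    # 2-D cross-correlation: high values where the feature pattern
    # overlaps a similar pattern in the target
    corr = correlate2d(target, feature, mode='same')
    corr = corr / corr.max()               # crude normalization
    return np.argwhere(corr >= threshold)  # candidate (row, col) peaks
</pre>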
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Experiment: Ball detection</b></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
In the first example I will try to detect the balls in the target image:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidF5fhNEC2MUiHvS7RBrfE2lTZt16Bst-5ScRHfg_YC8v3MdqOAEGawZ05v6YGCxoST3ytlAIkkCgAN7XFU2BkW6UX6m-IIjFwBS0FJEQ1eiJ94bDLJfCalDEIcU0vhlAjoh-YY_UivGc/s628/exp_balls.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidF5fhNEC2MUiHvS7RBrfE2lTZt16Bst-5ScRHfg_YC8v3MdqOAEGawZ05v6YGCxoST3ytlAIkkCgAN7XFU2BkW6UX6m-IIjFwBS0FJEQ1eiJ94bDLJfCalDEIcU0vhlAjoh-YY_UivGc/s320/exp_balls.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left the target image. On the right the feature (the ball!) to be identified in the image.</td></tr>
</tbody></table>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
To solve the problem, I tried to apply the algorithm directly to the original images, but the results were terrible, so I applied some image manipulation techniques to make the input more digestible for the main algorithm: I never tire of saying that preprocessing is crucial to solving a problem!</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The algorithms used here are edge detection and binarization. So I fed the correlation algorithm with the following input:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPm51TPuVX2K64xeUOJ-IqzNGspffMzzg4HyE5e0Brtl75dPNjtCODp0ncOcH_u56p7EeLezgzNeyEeHKI_IeUeDvPhbfVuzjzhPFVyhJjSKxJHd5pk-3JVO8DeYAVceAT0WA4ZRUZXCk/s905/binarization.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPm51TPuVX2K64xeUOJ-IqzNGspffMzzg4HyE5e0Brtl75dPNjtCODp0ncOcH_u56p7EeLezgzNeyEeHKI_IeUeDvPhbfVuzjzhPFVyhJjSKxJHd5pk-3JVO8DeYAVceAT0WA4ZRUZXCk/s320/binarization.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The input used to feed the correlation algorithm obtained with Binarization and eEge Detection techniques.</td></tr>
</tbody></table>
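<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
As a rough idea of this preprocessing step, here is a hedged sketch using OpenCV; the Otsu binarization and the Canny thresholds are illustrative assumptions of mine, since the post doesn't state which filters or parameters were used:</div>
<pre>
import cv2

def preprocess(path):
    # Grayscale, then Otsu binarization, then Canny edge detection.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 100, 200)
    return edges
</pre>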
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Results</b></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
Finally we can look at the output and check whether the algorithm correctly identified the balls in the target image!</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
To highlight the results I played around with colors and image merging :)</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_M-Ex6VvwoSvjfXpflKwz_PxSjfbbPTybMuIGFuNOjV3iPxA62R2r3GfqI2bqqTvXc8SqQVwo8sXmFkJ2ia1YOzat0ZIQdvUhiF0yA0O7B8qSwmqAxwhN4fbOHAAVKFKoD-mUGJJS9I0/s1074/ballsRes.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_M-Ex6VvwoSvjfXpflKwz_PxSjfbbPTybMuIGFuNOjV3iPxA62R2r3GfqI2bqqTvXc8SqQVwo8sXmFkJ2ia1YOzat0ZIQdvUhiF0yA0O7B8qSwmqAxwhN4fbOHAAVKFKoD-mUGJJS9I0/s400/ballsRes.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Results: balls identified in the image </td></tr>
</tbody></table>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
As you can see the results are not bad at all, and I'm sure that by playing around with the preprocessing parameters it's possible to improve the accuracy.</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Experiment: Painting Recognition</b></div>
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
</div>
<div style="text-align: left;">
I applied the same technique to identify the paintings hung on a wall:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9AUas8xWk6hOPVLIbvwL_Qz65G0qhPBNZIUlfFVJc2wK2iJoUi0H2PfW_VGLz_mzNuuLLDCtuHLpX4vVtPajOaljaDgTGvActY4ZPR6y_3R0lFh0Lcrf5ulhFOSqcblf2-J_spA0-G4/s1600/quadri2.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9AUas8xWk6hOPVLIbvwL_Qz65G0qhPBNZIUlfFVJc2wK2iJoUi0H2PfW_VGLz_mzNuuLLDCtuHLpX4vVtPajOaljaDgTGvActY4ZPR6y_3R0lFh0Lcrf5ulhFOSqcblf2-J_spA0-G4/s320/quadri2.gif" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Target image: paintings hung on a wall</td></tr>
</tbody></table>
<b>Results:</b><br />
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The techniques used are the same as in the above experiment, so let's go straight to the results:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH1IrZz7dY_xWsBTS4yJ9gdLCLfujWlmffM8YB0e6LrSfH6P4TFmjl4XVOQ73yqoAVvPz-t1VpMmzLir5K4pR3xif4oD5EY2BHmhwks_O8kIQSV1kBLkdfQ1eN9DqHyjwx-XKUlpy7TQE/s1600/res_paints2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH1IrZz7dY_xWsBTS4yJ9gdLCLfujWlmffM8YB0e6LrSfH6P4TFmjl4XVOQ73yqoAVvPz-t1VpMmzLir5K4pR3xif4oD5EY2BHmhwks_O8kIQSV1kBLkdfQ1eN9DqHyjwx-XKUlpy7TQE/s400/res_paints2.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Paintings detected</td></tr>
</tbody></table>
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Considerations</b></div>
<div>
The technique is quite rough, and the accuracy depends on:</div>
<div>
<ul>
<li>The preprocessing techniques applied </li>
<li>The measure used to compute the correlation (in both experiments I used cosine distance)</li>
</ul>
The algorithm per se doesn't seem general-purpose enough for this kind of task, but it's a good starting point for approaching the problem. If the feature to be detected is always the same (for instance, as in face recognition), the accuracy may be acceptable.</div>
<div>
Stay Tuned</div>
<div>
Cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-56902433978070678762013-05-15T13:39:00.000-07:002013-05-15T14:02:00.842-07:00Image comparison: SURF algorithm In the former post I showed a naive approach that is easy to implement and works quite well, provided that the image is not subject to distortions.<br />
I decided to dig a bit deeper into the topic, just for the sake of curiosity (curiosity and imagination are the skills I most appreciate at work!).<br />
<br />
SURF stands for "Speeded Up Robust Features"; it was published by Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool in 2008. The original paper is available <a href="ftp://ftp.vision.ee.ethz.ch/publications/articles/eth_biwi_00517.pdf" target="_blank">here</a>.<br />
<br />
I strongly recommend reading the paper: it's self-contained and clear enough to be read even by a neophyte like me.<br />
As usual, I'll go straight to the experiments and discuss the results.<br />
<br />
<b>Experiment</b><br />
I considered the same data set as in the last post, and I compared all possible pairs of images to check the accuracy of the algorithm.<br />
<br />
The first step of the algorithm is to determine the key features of an image.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw3cGjEzabUMrjjDbHRxUUyie89zM9bDt3Sh8e9taI2GyYmo0_0RA0NNwqs8YOfQ1rKDLl8X06cq0ELjtuIwfMHybtTTsE7KylUZba80w8dmIsLChpb1lnpDFV7B7BhucnvMY3-ba4DNw/s1600/keypoints.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw3cGjEzabUMrjjDbHRxUUyie89zM9bDt3Sh8e9taI2GyYmo0_0RA0NNwqs8YOfQ1rKDLl8X06cq0ELjtuIwfMHybtTTsE7KylUZba80w8dmIsLChpb1lnpDFV7B7BhucnvMY3-ba4DNw/s320/keypoints.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">key features detection</td></tr>
</tbody></table>
Once the features have been detected, it's possible to compare them through Euclidean measures or similar.<br />
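<div>
For the curious, here is a minimal sketch of this detect-and-match step with OpenCV. The file names and the Hessian threshold are placeholders of mine; also note that SURF is patented, so it requires an OpenCV build with the non-free xfeatures2d module enabled:</div>
<pre>
import cv2

img1 = cv2.imread("image_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("image_b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect key points and compute their SURF descriptors
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# Brute-force matching with Euclidean (L2) distance between descriptors
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.match(des1, des2)
print(len(matches), "common points")
</pre>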
<div>
<br /></div>
<div>
In the image below I plotted the common points found between two images and their positions:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjucM9XWe-V_mDRRLtcBE6GX2O4UCPTotpRhHyBmHobGISQhoDEAdCFgCFwpSB1wSySobSfKpc51JzY1woT6c1IhmqMgQgFl6hYR1VHBpjVes87Eq0r_cUYg9bosZ0cqMkBH66yf0EZlSg/s1600/SURF.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjucM9XWe-V_mDRRLtcBE6GX2O4UCPTotpRhHyBmHobGISQhoDEAdCFgCFwpSB1wSySobSfKpc51JzY1woT6c1IhmqMgQgFl6hYR1VHBpjVes87Eq0r_cUYg9bosZ0cqMkBH66yf0EZlSg/s320/SURF.jpg" width="226" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Common points found between two images. In yellow I depicted them and their position.</td></tr>
</tbody></table>
If you zoom in a bit you will notice that thousands of common points were found. The algorithm took around four minutes to find them.<br />
<div>
In a production environment this amount of time might not be acceptable. </div>
<div>
To improve the situation and reduce the computational time, I processed the images with binarization filters; after this preprocessing, the time dropped from 4 minutes to 25 seconds.</div>
<div>
<br /></div>
<div>
<b>Results</b></div>
<div>
<b><br /></b></div>
<div>
The matrix below lists all the comparisons done to check the accuracy.</div>
<div>
The correct pairs lie on the principal diagonal of the matrix.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGNXcZ42oEW_3DvuME40NbOQI2KTsyEJ6ICA1L5Ta2aZT53cZ7aLdZp63JMqUadDARfQtT0AGsoYIq2Z3zZbmK-ULYojMM0FI6E8_nkv8pWNW2hCzCv-VZf5MV3ekMeEqOUK5lSc5gNog/s1600/grid.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGNXcZ42oEW_3DvuME40NbOQI2KTsyEJ6ICA1L5Ta2aZT53cZ7aLdZp63JMqUadDARfQtT0AGsoYIq2Z3zZbmK-ULYojMM0FI6E8_nkv8pWNW2hCzCv-VZf5MV3ekMeEqOUK5lSc5gNog/s320/grid.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been listed the correct pairs</td></tr>
</tbody></table>
<div>
To show the results graphically, I plotted over the matrix the number of common points found for each pair; the chosen colors highlight that number. </div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXF8f1D3tM4J0nMVU0dhlhNqWQjTBwHF9Mmf7QHAUuJVy-ZimBFnONqcJm6JCs3EGWNJxBGo1Ti9vczhl5VTPKI0igAjS_Qtvxl8KnpohQWuYljqsGa4l2AtfD221uLWFSTSsA1_-5hwE/s1600/histog3D_1.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXF8f1D3tM4J0nMVU0dhlhNqWQjTBwHF9Mmf7QHAUuJVy-ZimBFnONqcJm6JCs3EGWNJxBGo1Ti9vczhl5VTPKI0igAjS_Qtvxl8KnpohQWuYljqsGa4l2AtfD221uLWFSTSsA1_-5hwE/s1600/histog3D_1.bmp" /></a></div>
<div>
<b><br /></b></div>
<div>
<b>Conclusions</b></div>
<div>
As you can see, for all the pairs on the principal diagonal the algorithm detected the highest number of common points.</div>
<div>
<br /></div>
<div>
In the next post I will show how to use this algorithm to identify logos in a document.</div>
<div>
Stay tuned</div>
<div>
Cristian<br />
<div>
<br /></div>
</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-76760608944605680562013-03-30T08:29:00.002-07:002013-03-30T13:31:20.384-07:00Image comparison for dummies: an easy and fast approach<div class="separator" style="clear: both; text-align: left;">
Since I started blogging, I have presented different methods, techniques and tricks to solve problems related to data mining and text mining (...I'm still puzzled about the difference between text mining and text analytics). </div>
Today I would like to present a different approach to solve a common problem in the text analytics area: the document comparison problem.<br />
<b><br /></b>
<b>The problem</b><br />
Usually, if you want to compare two documents to check whether they are pretty much the same, you can apply some easy tricks based on the frequency and position of the words.<br />
Sometimes the documents require OCR processing to obtain their "text rendition", and such processing is usually very expensive in terms of computational time (often more than 10 seconds per page): what if you have to perform many comparison checks and you are short on time?<br />
...In that case I suggest considering the documents as images!<br />
<br />
<b>Image processing approach</b><br />
This branch of computer science has evolved quite fast over the last decade and has reached an extremely high level of sophistication.<br />
I'm not an expert at all, but many libraries (often free, such as <a href="http://opencv.org/" target="_blank">opencv</a>) are available, offering several functionalities: image comparison, image recognition and so on.<br />
Unfortunately I'm stubborn and curious, and I like to do experiments :).<br />
<br />
So I decided to implement a very easy algorithm able to compare very similar documents with high accuracy.<br />
<b><br /></b>
<b>Algorithm:</b><br />
1. Resize the images (so that they have the same number of pixels)<br />
2. Binarize the images<br />
3. Compute the histogram distribution<br />
4. Perform the comparison using the Kullback distance between the two empirical distributions obtained from the histograms (see the sketch below).<br />
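<div>
Here is a minimal Python sketch of the four steps, under my own assumptions: I use the grayscale histogram rather than a strictly binary one (a two-level histogram would carry little information, and the post doesn't specify the exact filters), and I take the symmetrized Kullback-Leibler divergence as the "Kullback distance":</div>
<pre>
import numpy as np
from PIL import Image

def signature(path, size=(256, 256), eps=1e-9):
    # Steps 1-3: resize to a common size, reduce to gray levels,
    # and turn the histogram into an empirical distribution.
    img = Image.open(path).convert("L").resize(size)
    hist = np.bincount(np.asarray(img).ravel(), minlength=256) + eps
    return hist / hist.sum()

def kullback(p, q):
    # Step 4: symmetrized Kullback-Leibler divergence.
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Smaller value = more similar documents, e.g.:
# d = kullback(signature("doc1.png"), signature("doc2.png"))
</pre>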
<br />
Considerations:<br />
<br />
<ul>
<li>Such an algorithm is not robust to linear transformations/distortions of the image (but this is not the case when you compare documents...).</li>
<li>It's very fast compared to solutions that require OCR processing.</li>
<li>It could easily be improved by considering sub-blocks of the images (and giving higher weight to blocks in the middle of the image).</li>
<li>Through some assumptions based on eigenvectors, it could also be helpful in the case of linear transformations of the image.</li>
</ul>
<b>Experiment</b><br />
<br />
I selected 10 random images (pictures, medical analyses, text, screenshots):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_Qzsy4WTIND1cZ1puTSPLP4Jc0TkEwWKfVuVdFukUlbWzF_T7SY7D6kCtR8Y6FsMefvfZSZWHRGAVhNU4w6EEuezLFNy7W5aU2cM6LKM6KN1m4YG0jJ88PCwQ_t6ZovjxKfQFX7uDfoY/s1600/OriginalSampleSet.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_Qzsy4WTIND1cZ1puTSPLP4Jc0TkEwWKfVuVdFukUlbWzF_T7SY7D6kCtR8Y6FsMefvfZSZWHRGAVhNU4w6EEuezLFNy7W5aU2cM6LKM6KN1m4YG0jJ88PCwQ_t6ZovjxKfQFX7uDfoY/s400/OriginalSampleSet.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original sample set </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
For each document, after applying some image adjustment filters, I computed the histogram:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrR_FlQcab4ZHRfzBadexOLOydES1LKimQvuTxl5ASnHkxbQ20Q8uEFoESOPS_PNMhdrsgIEPnU69-ruyrUT1uL3_s3JMUkf-Grw8oymWsvyuO7jcDlGw2xBvP_TbSsYZLjqSM_ZvlVwI/s1600/histogramsSampleSets.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrR_FlQcab4ZHRfzBadexOLOydES1LKimQvuTxl5ASnHkxbQ20Q8uEFoESOPS_PNMhdrsgIEPnU69-ruyrUT1uL3_s3JMUkf-Grw8oymWsvyuO7jcDlGw2xBvP_TbSsYZLjqSM_ZvlVwI/s400/histogramsSampleSets.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Histograms for the above images</td></tr>
</tbody></table>
I then introduced some noise into each of the documents and computed the histograms again:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKHNPewniuHIVNIxGpRLwMKJZyAE6Mxs0hTrXyx1oah9B6nu1uMAkRhjvjV_b9xUQUz1fSYVHiEnFMEdTZuq-qLj08KJH7bpcLVmVt1QdmvaCJSaUafeWCg_QCoTagreELv0WwMxmk1Mo/s1600/NoisedSampleSet.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKHNPewniuHIVNIxGpRLwMKJZyAE6Mxs0hTrXyx1oah9B6nu1uMAkRhjvjV_b9xUQUz1fSYVHiEnFMEdTZuq-qLj08KJH7bpcLVmVt1QdmvaCJSaUafeWCg_QCoTagreELv0WwMxmk1Mo/s400/NoisedSampleSet.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Noised sample set.</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIqpRpzOk7ApQFZlDmYrNTefNRoqzNVe7_nzix3bKCQxxQ3HhwuNxuSq8j9DnEXpj6SleOW1dVgCWopszGgYNL8BQEv9LX4Ai4JkfPjn-giyElpbWfX11XIXHnTDJGRLu1YHBRJDduYD8/s1600/histogramsNoisedSets.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIqpRpzOk7ApQFZlDmYrNTefNRoqzNVe7_nzix3bKCQxxQ3HhwuNxuSq8j9DnEXpj6SleOW1dVgCWopszGgYNL8BQEv9LX4Ai4JkfPjn-giyElpbWfX11XIXHnTDJGRLu1YHBRJDduYD8/s400/histogramsNoisedSets.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Histograms for the noised images</td></tr>
</tbody></table>
As you can see, even though the noise introduced is quite strong, the characteristic histogram seems to preserve its shape quite well.<br />
<div>
<br /></div>
<div>
<b>Results:</b></div>
<div>
I compared every possible pair composed of an original image and a noised image, and I ranked the results of the comparison.</div>
<div>
...The test was passed with an accuracy of 100%.</div>
<div>
I have to say that the images are quite heterogeneous, but further tests (I'll publish them in one of the next posts) showed that the method is quite robust (with all the limitations explained above).</div>
<div>
Here are some of the best matches found:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijtqbD_5m82ZsV7v4a3AEVXbM0W3EE525LM-A66aYaw7FUaB1-d7FxT1NzqInpEqSJXqroTFZR5g_Jma_17FWhxNfzYmHg5cLwk5IgN7Wsyjm4hGBffwW8FgGLXdbZlzczhx6_0TYZzN4/s1600/res.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijtqbD_5m82ZsV7v4a3AEVXbM0W3EE525LM-A66aYaw7FUaB1-d7FxT1NzqInpEqSJXqroTFZR5g_Jma_17FWhxNfzYmHg5cLwk5IgN7Wsyjm4hGBffwW8FgGLXdbZlzczhx6_0TYZzN4/s400/res.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Some of the best matching found</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
...I have to say that image processing is a fascinating science!<br />
Stay tuned<br />
cristian<br />
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com6tag:blogger.com,1999:blog-7631116270195175228.post-71328875636834303012013-02-19T14:07:00.000-08:002013-02-19T14:16:37.526-08:00Mind maps of #textanalytics - automatic generationWhat I want to show you today is a "snippet" of a project I'm working on during my spare time.<br />
The project aims to gather, provide and represent knowledge in a different way.<br />
For the time being I don't want to disclose too much, but the main idea is to leverage my concept of graph entropy to capture the information in a document (or in a data set) and create a new kind of indexer.<br />
<br />
One of the outputs of the algorithm is the generation of a mind map.<br />
<br />
<b>Experiment</b><br />
Through the API provided by Twitter I downloaded the last 200 tweets from the hashtag #textanalytics.<br />
I processed the tweets, and here are the results:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMunbg9TkHjzpAJP5N-2kd8VuDTt2vsm-H9ATjVPcfmMpyS_Dz0v2LmGpIfmIys_nS43oJ4U1DAe33v0FtqlEtcz7PyhnYWKjGZ-sZTEQrbq3nQgE5_aG6VzZ_Ie3p3zCU2-0rv2pbe6U/s1600/textanalytics.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMunbg9TkHjzpAJP5N-2kd8VuDTt2vsm-H9ATjVPcfmMpyS_Dz0v2LmGpIfmIys_nS43oJ4U1DAe33v0FtqlEtcz7PyhnYWKjGZ-sZTEQrbq3nQgE5_aG6VzZ_Ie3p3zCU2-0rv2pbe6U/s400/textanalytics.bmp" width="395" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">#textanalytics mind map</td></tr>
</tbody></table>
<br />
The first 10 relevant words, extracted through the graph entropy criterion, are depicted in red.<br />
It's interesting to notice how intuitively the map shows the links among the words.<br />
Of course the map can be enriched with more links or more relevant words: it's up to the user.<br />
In the next posts I'll show you how the clustering techniques discussed in the former posts can be profitably used to create homogeneous groups of related words.<br />
What do you think?<br />
Stay tuned<br />
cristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com7tag:blogger.com,1999:blog-7631116270195175228.post-82587561742016106322013-01-05T05:58:00.001-08:002013-01-05T13:06:30.637-08:00Graph Clustering through "prominent vertexes": some clarificationI received many emails about the last post, most of them asking for explanations of the technique I developed.<br />
I have to clarify that:<br />
<br />
<ol>
<li>This method is just a consideration about the role played by some vertexes in a generic (directed) graph.</li>
<li>The results posted in the former post can be improved a lot (I'm working on it, among the several streams I've opened around the concept of graph entropy!).</li>
<li>The approach I'm going to explain in this post is not exactly the one I used in the former, because I prefer to focus on the main idea rather than on the optimization (but essential) aspects.</li>
</ol>
<div>
Let's consider the following graph:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDC7RivvvWM9fq-_Uzm6b0LMaBxOUIIW4A-Yh4dch4KedMyHXzSE5WUs4RbcSWO7AJdiy5r-AebRd0af_Mvr2JkWrzmc9KDpY-1SY9GfcV0BddrQCyTw5xa6UOeU-IMmQr_gXN7n2ScI/s1600/gTest.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDC7RivvvWM9fq-_Uzm6b0LMaBxOUIIW4A-Yh4dch4KedMyHXzSE5WUs4RbcSWO7AJdiy5r-AebRd0af_Mvr2JkWrzmc9KDpY-1SY9GfcV0BddrQCyTw5xa6UOeU-IMmQr_gXN7n2ScI/s400/gTest.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Let's consider two nodes: "<b>cbr</b>" and "<b>zx</b>", then let's take all the paths that connect the two nodes:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg94CiV5wl0i0U-vu-8sjion9d0QZNuqLInnVjdcVNnjzlIthjr2JY3qoA1nQ5SKfhbEks4ledTDfytatFKLJtJYbn2uapOR0Py34zhdC2Ng2EIy68ORRtDlDygnO4Jv71c4IInof0bgBc/s1600/paths.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg94CiV5wl0i0U-vu-8sjion9d0QZNuqLInnVjdcVNnjzlIthjr2JY3qoA1nQ5SKfhbEks4ledTDfytatFKLJtJYbn2uapOR0Py34zhdC2Ng2EIy68ORRtDlDygnO4Jv71c4IInof0bgBc/s400/paths.jpg" width="335" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">All the possible paths between the node "<b>cbr</b>" and "<b>zx</b>"</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
We now have to consider the intersection (excluding the starting and ending points) of all the paths: </div>
<div class="separator" style="clear: both; text-align: left;">
<b>such nodes are essential to reach the node "zx" from "cbr", because if you exclude just one of them, you cannot reach the ending point! </b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyh1PzpJNUtBr7ioH1ysr_JV9hjybqlmiM74e9fNvOlbWtWlpS4DZ2B8eke28aa2T2gX6HbRkWaHaXxGgspT37s0Eso9v6KpP94Cbvsb0U-kFiLZ-8fBVXX4mwgsha57Yeo-OfvjtwHc/s1600/prominent+nodes.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyh1PzpJNUtBr7ioH1ysr_JV9hjybqlmiM74e9fNvOlbWtWlpS4DZ2B8eke28aa2T2gX6HbRkWaHaXxGgspT37s0Eso9v6KpP94Cbvsb0U-kFiLZ-8fBVXX4mwgsha57Yeo-OfvjtwHc/s400/prominent+nodes.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The prominent nodes for the pair ("cbr","zx"): </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
With the above procedure we found the prominent nodes for the pair ("cbr", "zx").</div>
<div class="separator" style="clear: both; text-align: left;">
Of course, to determine the prominent nodes of the entire graph you have to consider all the pairs of nodes: that's what justifies the complexity > $\binom{n}{2}$ of the procedure (n = order of the graph). This figure doesn't even include the cost of obtaining the paths between the nodes!</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The final result can be obtained by taking the intersection of the "prominent nodes" over all pairs of nodes!</div>
<div class="separator" style="clear: both; text-align: left;">
As you can imagine this is an expensive way to obtain the prominent nodes: I'll provide updates on improvements to it.</div>
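<div class="separator" style="clear: both; text-align: left;">
To make the per-pair step concrete, here is a minimal sketch with networkx (the toy graph is mine, not the one from the figures; for large graphs, enumerating all simple paths is exactly the expensive part mentioned above):</div>
<pre>
import networkx as nx

def prominent_nodes(G, a, b):
    # Nodes shared by every simple path from a to b (endpoints excluded):
    # removing any one of them makes b unreachable from a.
    paths = list(nx.all_simple_paths(G, a, b))
    if not paths:
        return set()
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    return common - {a, b}

# Toy graph: every route from 1 to 4 must pass through 3.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3), (3, 4), (3, 5), (5, 4)])
print(prominent_nodes(G, 1, 4))   # prints {3}
</pre>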
<div class="separator" style="clear: both; text-align: left;">
Stay tuned </div>
<div class="separator" style="clear: both; text-align: left;">
cristian</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
<br />
<br />Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-8070880853819980722012-12-18T15:23:00.001-08:002012-12-18T22:43:13.192-08:00Graph clustering: an approach based on "prominent vertexes"I was doing some experiments on the entropy graph, and I noticed that under some conditions it is a good marker of special vertexes.<br />
I'll try to explain the concept without formulas: that's really challenging because of my English :)<br />
These vertexes play a special role, and I decided (arbitrarily) to call them "Prominent Vertexes": basically, if you remove them you disconnect the graph, and the remaining connected components are the clusters that induce the graph partitioning. To be more precise, the connected components yield only N-1 clusters; the last cluster is obtained as the complement of those N-1 clusters with respect to the original graph.<br />
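<div>
As a small illustration of this partitioning step, here is a sketch with networkx; it assumes the prominent vertexes have already been computed, and the set operations simply mirror the description above:</div>
<pre>
import networkx as nx

def clusters_from_prominent(G, prominent):
    # Remove the prominent vertexes: the surviving connected
    # components are N-1 of the clusters...
    H = G.copy()
    H.remove_nodes_from(prominent)
    parts = [set(c) for c in nx.connected_components(H)]
    # ...and the last cluster is the complement of those clusters
    # with respect to the original graph.
    covered = set().union(*parts) if parts else set()
    parts.append(set(G.nodes) - covered)
    return parts
</pre>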
<br />
<b>The intuitive explanation</b><br />
<br />
<ul>
<li>Consider all the possible paths between two nodes <b>A </b>and <b>B </b>in a graph</li>
<li>Take the intersection of the above paths.</li>
<li>The result is a set of nodes that are essential to connect the two nodes: these nodes are the prominent nodes for the vertexes <b>A </b>and <b>B</b></li>
</ul>
<div>
The above schema can be applied (with some expedients) to all the Binomial[N,2] pairs of vertexes.</div>
<div>
<br /></div>
<div>
<b>Some results</b></div>
<div>
<b>1st example</b></div>
<div>
Let's consider the following graph and the respective "prominent vertexes":</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLNlgn7e37KovihR5Expqgq47pp7PpGeN7Y-LFfWfQrBuM7Y5dDkCZyCThWoJJPvnrmXQugWzZ6GQdTdm6iF_m8F8fncwOTAlduEOYH1xXB1uDxL17essnN_BRb0y1OOkBBBqsJrCv-I/s1600/influencingFactors_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLNlgn7e37KovihR5Expqgq47pp7PpGeN7Y-LFfWfQrBuM7Y5dDkCZyCThWoJJPvnrmXQugWzZ6GQdTdm6iF_m8F8fncwOTAlduEOYH1xXB1uDxL17essnN_BRb0y1OOkBBBqsJrCv-I/s400/influencingFactors_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In yellow have been depicted the prominent nodes for the graph</td></tr>
</tbody></table>
The clusters obtained considering the connected components are the following:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvEEtbwNvT_brCZSAG3uCaZtllGdnu8hK1ZPFajTMfaYu8SNex2jpO-lc_xzvyZyeVEcr_6f9p-FaHl_WD-BDbY5Lg4sQAoxWnPWnF78sjXqO2zGck41ARJECIhRWKAuC44AwrSCi_2eU/s1600/clustering_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvEEtbwNvT_brCZSAG3uCaZtllGdnu8hK1ZPFajTMfaYu8SNex2jpO-lc_xzvyZyeVEcr_6f9p-FaHl_WD-BDbY5Lg4sQAoxWnPWnF78sjXqO2zGck41ARJECIhRWKAuC44AwrSCi_2eU/s400/clustering_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 3 clusters obtained on the graph</td></tr>
</tbody></table>
<div>
As usual I compared my approach against another technique: <span class="Apple-style-span" style="font-family: inherit;">A. Clauset, "Finding Local Community Structure in Networks," <i>Physical Review E</i>, <b>72</b>, 026132, 2005.</span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCd7Vaa-ZHnYE7pkEP9pzaTVd4yJZGU7tZJnGL37gIwjntDEkzl2BV3r37wqpbqFBgQatwB5bwmzKPJvo_lMVB8dOrDk_E4MydqAdVzRgkvKjl9YYG6uZaosNvHqVYlROeKzHx3c8MD8A/s1600/community_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCd7Vaa-ZHnYE7pkEP9pzaTVd4yJZGU7tZJnGL37gIwjntDEkzl2BV3r37wqpbqFBgQatwB5bwmzKPJvo_lMVB8dOrDk_E4MydqAdVzRgkvKjl9YYG6uZaosNvHqVYlROeKzHx3c8MD8A/s400/community_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 6 clusters obtained through the "Clauset" method.</td></tr>
</tbody></table>
<div>
<span class="Apple-style-span" style="font-family: inherit;"><b>Comment:</b></span></div>
<div>
As you can see, the proposed approach produces a much less fragmented result than the Clauset method.</div>
<div>
<br /></div>
<div>
<b>2nd example</b></div>
<div>
Here is another example:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh3OdZdevPsHrQNjQE2gSQ03RJvCGE4z9HmD1BXpd85z3FMOIjor-yfEz-eMfDLZ68TQG0AdEbeQ54eJRqKGft3pEh69FGc2Hrlk9AefskaQkDr5cJqPkylQvIMcJxUMdrNNSCtzd2Uro/s1600/clustering_2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh3OdZdevPsHrQNjQE2gSQ03RJvCGE4z9HmD1BXpd85z3FMOIjor-yfEz-eMfDLZ68TQG0AdEbeQ54eJRqKGft3pEh69FGc2Hrlk9AefskaQkDr5cJqPkylQvIMcJxUMdrNNSCtzd2Uro/s400/clustering_2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 2 clusters obtained with the prominent vertexes method,</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7KOIx_zDtUD4G_1NEf19EishmGI803ePaHcxLnHodpfangF-FHu9Gl8Xwxy0e_6eEQqOkcDEPrL177kLMQafRbw5CKabiCcS4TOHs7xMgCB_FJLSAnREM1RCeW8GBioJ_MUUmIgROdP0/s1600/community_2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7KOIx_zDtUD4G_1NEf19EishmGI803ePaHcxLnHodpfangF-FHu9Gl8Xwxy0e_6eEQqOkcDEPrL177kLMQafRbw5CKabiCcS4TOHs7xMgCB_FJLSAnREM1RCeW8GBioJ_MUUmIgROdP0/s400/community_2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 6 clusters obtained through the Clauset method.</td></tr>
</tbody></table>
<div>
<b>Comment:</b></div>
<div>
As in the former example, the "prominent vertexes" method seems to behave better (even if the result is not optimal).</div>
<div>
<br /></div>
<div>
<b>Considerations</b></div>
<div>
<b><br /></b></div>
<div>
<b>PRO</b></div>
<div>
<ul>
<li>The proposed method seems to be a promising graph clustering algorithm.</li>
<li>The method doesn't require parameters (e.g. the number of clusters).</li>
<li>The method doesn't fragment the nodes too much. </li>
</ul>
<div>
<b>CONS</b></div>
</div>
<div>
<ul>
<li>The complexity is (for the current implementation) quite high, ~Binomial[Graph Order, 2].</li>
<li>The method requires improvements and a theoretical explanation.</li>
<li>The method has lower accuracy when the vertex degrees are very small.</li>
</ul>
<div>
The code will be released as soon as the formalization of the Entropy Graph is completed.</div>
<div>
Looking forward to receiving your comments.</div>
<div>
<div>
Stay tuned & Merry Christmas!</div>
</div>
<div>
<br /></div>
<div>
c. </div>
</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-57490305619775557042012-11-14T15:10:00.002-08:002012-11-14T15:10:47.931-08:00Document Clustering and Graph Clustering: graph entropy as linkage function<div class="separator" style="clear: both; text-align: center;">
</div>
Let's continue our discussion about the applications of the graph entropy concept.<br />
Today I'm going to show how we can re-use the same concept for document clustering.<br />
What I want to highlight is that through such methodology it's possible to:<br />
<br />
<ol>
<li>extract from a document the relevant words (as discussed <a href="http://feedproxy.google.com/~r/TextDataMiningByPracticalMeans/~3/inxOX5Z6Vow/graph-entropy-to-extract-relevant-words.html" target="_blank">here</a>);</li>
<li>cluster the words of a document (as discussed <a href="http://feedproxy.google.com/~r/TextDataMiningByPracticalMeans/~3/_FdaTn6LYcs/key-words-through-graph-entropy.html" target="_blank">here</a>);</li>
<li>cluster sets of documents;</li>
<li>cluster a graph and assign a ranking score to each cluster by homogeneity criteria.</li>
</ol>
<div>
<b>The Experiment</b></div>
<div>
For this experiment I chose a subset of the standard dataset called "<a href="http://qwone.com/~jason/20Newsgroups/" target="_blank">the 20 Newsgroup dataset</a>".</div>
<div>
The documents of this data set are emails selected from different newsgroups.</div>
<div>
I selected some documents from the categories: Hockey, Motorcycle, Atheist and Electronics.</div>
<div>
You can download <a href="http://www.ziddu.com/download/20881027/tr2.zip.html" target="_blank">here</a> the documents used for this experiment.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
<div>
<b>The procedure </b></div>
<div>
<ul>
<li>For each document, build the respective graph by filtering with a stopword list and stemming the words (a sketch of this step follows the list). </li>
<li>For each graph, calculate the graph entropy value of each vertex (word).</li>
<li>For each graph, extract the first <b>k</b> relevant vertexes (sorted by graph entropy value).</li>
<li>Perform hierarchical clustering (in this case I used, once again, an approach based on simulated annealing).</li>
</ul>
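<div>
The post doesn't say exactly how the word graph is built; a common choice, and the one assumed in this sketch, is to link words that co-occur within a small sliding window (the stopword list here is trimmed, and stemming, e.g. with NLTK's PorterStemmer, is omitted for brevity). The entropy scoring of the vertexes relies on my own, not yet published, definition, so it isn't reproduced here.</div>
<pre>
import re
import networkx as nx

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}   # trimmed list

def document_graph(text, window=2):
    # Vertices are words; edges link words co-occurring within `window`.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    G = nx.Graph()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window + 1]:
            if v != w:
                G.add_edge(w, v)
    return G
</pre>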
<b>The results</b></div>
<div>
Here are the early results obtained through the clustering.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_V0Pi9Ir38PqLnss-g3Xl8hea38yOIq27uDEhB223bFlfJ99zwYsPr6Vdj1Swis-WPFq6V3kvx9ixcryNIVs6-JOrnrnlD_8FZ-xppeClNd57QMHGbeDcBB3Du324xdPzd4Jyeir-rY/s1600/clusters2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_V0Pi9Ir38PqLnss-g3Xl8hea38yOIq27uDEhB223bFlfJ99zwYsPr6Vdj1Swis-WPFq6V3kvx9ixcryNIVs6-JOrnrnlD_8FZ-xppeClNd57QMHGbeDcBB3Du324xdPzd4Jyeir-rY/s400/clusters2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Each line depicts a cluster (the elements are the filename). The clusters have been sorted by homogeneity criteria assigned by the clustering algorithm. </td></tr>
</tbody></table>
<b>Considerations</b><br />
<div>
The results are promising but not perfect. The accuracy achieved is:</div>
<div>
<ul>
<li><b>75.0%</b> for the first cluster </li>
<li><b>62.5%</b> for the second cluster</li>
<li><b>50.0%</b> for the fifth cluster</li>
<li><b>37.5%</b> for the third and fourth clusters</li>
</ul>
</div>
<div>
The reasons for the bad aggregations are related to the class "atheist". This class contains documents characterized by high variety: very short documents, text not strictly pertinent to the topic, and a quite general topic.</div>
<div>
Moreover, the clustering algorithm hasn't been optimized/customized for this problem.</div>
<div>
Let's try to analyze the results from a different perspective:</div>
<div>
<b><br /></b></div>
<div>
<b>Something more...</b></div>
<div>
I mentioned before that this approach is built on the concept of graph entropy.</div>
<div>
Here is the graph representation of the words related to each cluster.</div>
<div>
For each cluster I selected the first 150 relevant words.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI1w12zrKWmNzKXPMyjeEsiCqAV_65meukRg-FgVhGUX1Sn7avaZr8tM5NFz5Wy66nmtXobqtCnF9pN5ITPtpxRPRw7mhTCVN4R3OxgnJ6OiqoJK8Nva41EiBvYJq6j6ehloINCkeVzes/s1600/cl1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI1w12zrKWmNzKXPMyjeEsiCqAV_65meukRg-FgVhGUX1Sn7avaZr8tM5NFz5Wy66nmtXobqtCnF9pN5ITPtpxRPRw7mhTCVN4R3OxgnJ6OiqoJK8Nva41EiBvYJq6j6ehloINCkeVzes/s320/cl1.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Red has been highlighted the first cluster</td></tr>
</tbody></table>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBVvTJOISzZ-uvZuOB50HDkXBOHnOt-rgnmyDxZWy2Us6p0h6dbcbnflIc0e9gjO2rUspxEaNYJcgVzxFmVxjYhoZmToPzIV2663GzgnlSWUiuzxJfCXkl8O2nbM84H1uBkyS1zgVhjpU/s1600/cl2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBVvTJOISzZ-uvZuOB50HDkXBOHnOt-rgnmyDxZWy2Us6p0h6dbcbnflIc0e9gjO2rUspxEaNYJcgVzxFmVxjYhoZmToPzIV2663GzgnlSWUiuzxJfCXkl8O2nbM84H1uBkyS1zgVhjpU/s320/cl2.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In dark Blue Has been Highlighted the second cluster</td></tr>
</tbody></table>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rDJgGQliMa18OQXJTJi_8GWyQwfylRg5UBFO9khwTg8tQz20H0psiLh2xjljmccVLW3z1ENzrZzT995TSWtgx0yZPZhmTnA9MrnwR4eBcQB3IoZani_shIezAHqUZEsR_vndhZ-8sDA/s1600/cl3.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rDJgGQliMa18OQXJTJi_8GWyQwfylRg5UBFO9khwTg8tQz20H0psiLh2xjljmccVLW3z1ENzrZzT995TSWtgx0yZPZhmTnA9MrnwR4eBcQB3IoZani_shIezAHqUZEsR_vndhZ-8sDA/s320/cl3.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Yellow has been depicted the third cluster</td></tr>
</tbody></table>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMvutQtm10ceuduVs891ES42OX1uB3Gm27DSaKtiocAHGbRc3kC7giqMpBhIKVsxj_3SrPpxBOsdZpTDro0bAlxNbDsYL9nSPoVz4hdVYIEtoFW9_ETvpzK3O1hIzrYddt96ydU8r_R6g/s1600/cl4.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMvutQtm10ceuduVs891ES42OX1uB3Gm27DSaKtiocAHGbRc3kC7giqMpBhIKVsxj_3SrPpxBOsdZpTDro0bAlxNbDsYL9nSPoVz4hdVYIEtoFW9_ETvpzK3O1hIzrYddt96ydU8r_R6g/s320/cl4.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Green has been depicted the fourth cluster.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHKsVLmzchvEfQu9uNwTCfcDKIEHTJXcT1bqQ2-uqeCHjuFuHaouYPRgbu1BOLDDiIv2NhfzU38ug7uWJQXPEhPM1c6Yh0PXGTWue18WGgzwfV9MhRtchY9rzpLtzbhCMinl6oYWQHXz4/s1600/cl5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHKsVLmzchvEfQu9uNwTCfcDKIEHTJXcT1bqQ2-uqeCHjuFuHaouYPRgbu1BOLDDiIv2NhfzU38ug7uWJQXPEhPM1c6Yh0PXGTWue18WGgzwfV9MhRtchY9rzpLtzbhCMinl6oYWQHXz4/s320/cl5.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Brown has been depicted the fifth cluster.</td></tr>
</tbody></table>
<div>
<div>
<div>
<b>Notes</b></div>
<div>
<b>Such representation of the clustering highlights that the linkage function used in this experiment depicts quite well the different areas of the graph.</b></div>
<div>
<ul>
<li>The red cluster grouped together the right block of vertexes.</li>
<li>The brown cluster grouped the right side of the graph.</li>
<li>The blue cluster focused on the core of the graph.</li>
<li>The yellow cluster grouped together the corona around the core of the graph.</li>
<li>The green cluster grouped together the corona around the periphery of the graph.</li>
</ul>
<div>
<b>The next steps</b></div>
</div>
</div>
</div>
<div>
In the next post I'll try to refine the results, optimizing the clustering algorithm and trying different kinds of clustering algorithms.</div>
<div>
Stay Tuned</div>
<div>
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com4tag:blogger.com,1999:blog-7631116270195175228.post-2006952645078396562012-10-23T14:49:00.000-07:002012-10-24T10:28:50.870-07:00Key words through graph entropy Hierarchical clusteringIn the last post I showed how to extract key words from a text through a principle called graph entropy.<br />
Today I'm going to show another application of graph entropy: extracting clusters of key words.<br />
<b>Why</b><br />
The key words of a document depict the main topic of the content, but if the document is big there are often many different subtopics related to the main one.<br />
From this perspective, clusters of keywords should make it easier for the reader to identify the key points of a document.<br />
Moreover, imagine implementing a search engine based on clusters of relevant words instead of the common indexing of atomic words: it would enable document comparison, taxonomy definition, and much more!<br />
<b>How</b><br />
The definition of graph entropy I'm working on assigns to each word of the document a relevance score and a subgraph of words topologically close to it.<br />
The clustering should maximize the relevance score obtained by merging two words into the same cluster.<br />
It's easy to see that we are facing a combinatorial maximization problem.<br />
The idea is to take advantage of simulated annealing (a bit revisited and adapted to the purpose) to identify a sub-optimal merging solution at each step of the merging phase of the hierarchical clustering.<br />
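<div>
For reference, this is the generic simulated-annealing skeleton. The revisited variant used here, and in particular its energy function based on the graph entropy relevance score, isn't published, so everything below (parameter names, cooling schedule) is only an illustrative assumption:</div>
<pre>
import math
import random

def anneal(init, energy, neighbor, t0=1.0, cooling=0.995, steps=5000):
    # Generic simulated annealing: minimizes `energy` over states
    # produced by the `neighbor` move (e.g. alternative merges).
    cur, cur_e = init, energy(init)
    best, best_e = cur, cur_e
    t = t0
    for _ in range(steps):
        cand = neighbor(cur)
        cand_e = energy(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature decreases.
        if cand_e < cur_e or random.random() < math.exp((cur_e - cand_e) / t):
            cur, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = cur, cur_e
        t *= cooling                 # geometric temperature schedule
    return best
</pre>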
<b>Experiment</b><br />
As test document I adopted the complete version of the file we used in the last post: <a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">Nuclear_weapon</a>.<br />
Here are the clusters of the first 100 relevant words extracted:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKuP6hJi49LGuiv2h1NvsMH4Tmi1UNywXy6DLyq39GmljubQrW7yMRoWDYgJYEuWhvInoEXc6BTIuJYyRlNNfuJyOvdi2vZ2X72SiTxnWylsv1dARLwiwdd0lpLnUt8uv6uFykwN-QhR4/s1600/clusters.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKuP6hJi49LGuiv2h1NvsMH4Tmi1UNywXy6DLyq39GmljubQrW7yMRoWDYgJYEuWhvInoEXc6BTIuJYyRlNNfuJyOvdi2vZ2X72SiTxnWylsv1dARLwiwdd0lpLnUt8uv6uFykwN-QhR4/s400/clusters.jpg" width="233" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The three clusters obtained.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
It's worth highlighting the following considerations:<br />
<br />
<ul>
<li>The first cluster merged together words such as "<b>material, uranium, plutonium, isotope</b>" and "<b>war, attack, arm</b>", and also "<b>proliferation, movement, control, development</b>".</li>
<li>The second cluster (which has the lowest rank) aggregates words such as "<b>japan, japanese, place, israel, iraq, american</b>" and "<b>ton, tnt, yeld</b>". </li>
<li>The third cluster (which has the highest rank) describes the primary topic quite well, merging all the most important words of the document! </li>
</ul>
<br />
Of course, the procedure is still in its "incubator" phase, and the accuracy of the clusters rests on the performance of the annealing clustering (...maybe different algorithms would perform better in this context... but to show a rough solution I guess it's enough :D)<br />
This is the optimization process for the last merging stage (I presume the temperature schedule requires some adjustment):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlttbjYmDLAABOFfR4UUrVaq5HYG-afHLLXdOf3xjnuFxFEUP04bIwJnEwhYLHRxwqYCccQhKFzcoXMTNDIIFTol7lZaLjBIX_hJYZDLlOB1jrMM-ZvhlczE3B4GzQ2HdTMDrkjuXsfLA/s1600/HSAC.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="197" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlttbjYmDLAABOFfR4UUrVaq5HYG-afHLLXdOf3xjnuFxFEUP04bIwJnEwhYLHRxwqYCccQhKFzcoXMTNDIIFTol7lZaLjBIX_hJYZDLlOB1jrMM-ZvhlczE3B4GzQ2HdTMDrkjuXsfLA/s320/HSAC.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Optimization curve through Simulated Annealing Hierarchical Clustering (last merging stage)</td></tr>
</tbody></table>
<b><br /></b>
<b>Next steps:</b><br />
Looking forward to receiving comments and suggestions.<br />
...It would be interesting to use this methodology to create a new kind of full-text search engine, totally independent of word frequency and visit frequency.<br />
<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>The doc</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Here is the document parsed and colored according to the clustering assignment (only the first 100 relevant features, ranked through the graph entropy method, have been highlighted).<br />
<br />
<span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; color: #006a79; font-family: Verdana; font-size: 12px; font-weight: 800;"><a href="http://www.ziddu.com/download/20701582/nuclearWeapons.pdf.html" target="_blank">http://www.ziddu.com/download/20701582/nuclearWeapons.pdf.html</a></span></div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Stay tuned</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
cristian.<br />
<br />
<img src="webkit-fake-url://FC72FDF0-7156-4CD7-BC1B-3C5B91AF5858/application.pdf" /></div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-84409126064949759322012-09-24T13:26:00.002-07:002012-09-24T13:39:37.733-07:00Graph Entropy to extract relevant wordsI would like to share with you some early results of research I'm doing in the field of "graph entropy" applied to text mining problems.<br />
<br />
There are many definitions of graph entropy; my favorite is very well described in the work of J. Körner: "Coding of an information source having ambiguous alphabet and the entropy of graphs" (1973).<br />
<span class="Apple-style-span" style="font-family: inherit;"><br /></span>
<b>Why Graph Entropy is so important?</b><br />
Based on the main concept of entropy, the following assumptions hold:<br />
<br />
<ul>
<li>The entropy of a graph should be a functional of the stability of the structure (so that it depicts in some way the distribution of the edges of the graph).</li>
<li>Subsets of vertexes quite isolated from the rest of the graph are characterized by a high stability (low entropy).</li>
<li>It's quite easy to use entropy as a measure for graph clustering.</li>
</ul>
<div>
As you can imagine a smart definition of graph entropy can be helpful in many problems related to text mining.</div>
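<div>
To make the first assumption tangible, here is a toy stand-in: the Shannon entropy of a graph's degree distribution. To be clear, this is neither Körner's entropy nor the definition I'm studying (which isn't published yet); it only illustrates the idea of entropy as a functional of how the edges are distributed:</div>
<pre>
import math
import networkx as nx

def degree_entropy(G):
    # Shannon entropy of the normalized degree sequence: the more
    # unevenly the edges are spread over the vertexes, the lower it is.
    degs = [d for _, d in G.degree() if d]
    total = sum(degs)
    return -sum((d / total) * math.log2(d / total) for d in degs)

# A 10-cycle spreads its edges evenly; a 10-node star concentrates them.
print(degree_entropy(nx.cycle_graph(10)))   # ~3.32 (maximal)
print(degree_entropy(nx.star_graph(9)))     # ~2.58 (lower)
</pre>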
<div>
<br /></div>
<div>
Let's see an application of graph entropy to extract relevant words in a document.</div>
<div>
<br /></div>
<div>
The experiment has been done using the first section of the definition of "<a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">nuclear weapons</a>".</div>
<div>
<br /></div>
<div>
<b>Results</b></div>
<div>
<b>Graph Entropy:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9LppZJt8wV-LXteRj0y5QXOyidxGFDdWK5QhqRI7hM74mN2HtZhhniVvRhnLnhPtz2tG4TkHCtArN70qMd3kChBagiehZnGm6r0EI4V_OAX1kLx0pAddcSHebRgtiITLGuHsjbpBHBcE/s1600/grEntr.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9LppZJt8wV-LXteRj0y5QXOyidxGFDdWK5QhqRI7hM74mN2HtZhhniVvRhnLnhPtz2tG4TkHCtArN70qMd3kChBagiehZnGm6r0EI4V_OAX1kLx0pAddcSHebRgtiITLGuHsjbpBHBcE/s400/grEntr.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been highlighted the relevant words extracted through Graph Entropy</td></tr>
</tbody></table>
<div>
Here are the first 25 words extracted - in red I depicted the words that, in my opinion, shouldn't have been selected:</div>
<div>
weapon, reaction, nuclear release, <span class="Apple-style-span" style="color: red;">consider, acknowledge</span>, explosive, weaponsa, detonate, test, ton, bomb, energy, tnt, first, <span class="Apple-style-span" style="color: red;">possess</span>, small, device, <span class="Apple-style-span" style="color: red;">unite</span>, hiroshima, <span class="Apple-style-span" style="color: red;">chronologically</span>, thermonuclear, force, nagasaki</div>
<div>
<b><br /></b></div>
<div>
<b>Frequency based:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMvSffLIWow_IrF8yFrkVuNX9jLSS12Vcy-DQ-gPZJ0tbesA5TOjdY_0Uv-NVX7YDGsjZhEUsw-Xjg75GgQofjOtFTzgvC_2GwvhnBuimZGMWzh5oH7y7oRMNYiKvgYYc6nHjr4DS5PLE/s1600/freq.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMvSffLIWow_IrF8yFrkVuNX9jLSS12Vcy-DQ-gPZJ0tbesA5TOjdY_0Uv-NVX7YDGsjZhEUsw-Xjg75GgQofjOtFTzgvC_2GwvhnBuimZGMWzh5oH7y7oRMNYiKvgYYc6nHjr4DS5PLE/s400/freq.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">In red have been highlighted the relevant words extracted through Frequency relevance.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
Words extracted through the frequency relevance: </div>
<div>
nuclear, weapon, bomb, fission, test, <span class="Apple-style-span" style="color: red;">possess</span>, bombing, detonate, <span class="Apple-style-span" style="color: red;">state</span>, <span class="Apple-style-span" style="color: red;">unite</span>, tnt, ton, first, energy, release, <span class="Apple-style-span" style="color: red;">acknowledge</span>, weaponsa, <span class="Apple-style-span" style="color: red;">status</span>, japan, nagasaki, hiroshima, <span class="Apple-style-span" style="color: red;">japanese, name, code, type</span></div>
<div>
<span class="Apple-style-span" style="color: red;"><br /></span></div>
<div>
<b>Closeness Centrality:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18RWsZ6LTYDF-h0iEfRBjp9_3PACBi2WXoPDSQqYmSbms-ipWI7tY12zrLtmYjaEgM8SSYlokEjXm1PZ-ttbYg2YqPKYhz9PexjUZ2yGu_hFk6Tgr5zrdukWkuyoxhyphenhyphenon64_jbOUPnUQ/s1600/closenessCentr.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18RWsZ6LTYDF-h0iEfRBjp9_3PACBi2WXoPDSQqYmSbms-ipWI7tY12zrLtmYjaEgM8SSYlokEjXm1PZ-ttbYg2YqPKYhz9PexjUZ2yGu_hFk6Tgr5zrdukWkuyoxhyphenhyphenon64_jbOUPnUQ/s400/closenessCentr.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been highlighted the relevant words extracted through Closeness Centrality method.</td></tr>
</tbody></table>
<div>
<b></b><br />
<div style="font-weight: normal;">
<b>Words extracted through Closeness Centrality: </b></div>
<b>
</b>nuclear, weapon, detonate, <span class="Apple-style-span" style="color: red;">possess</span>, nagasaki, first, thermonuclear, small, force, <span class="Apple-style-span" style="color: red;">estimate</span>, <span class="Apple-style-span" style="color: red;">debut, fabricate, succeed</span>, radiation, tnt, <span class="Apple-style-span" style="color: red;">acknowledge, consider, believe</span>, hiroshima, <span class="Apple-style-span" style="color: red;">know, nation</span>, boy, explode, matte, <span class="Apple-style-span" style="color: red;">date.</span></div>
<div>
<b><br /></b></div>
<div>
<b>Considerations:</b></div>
<div>
<b><br /></b></div>
<div>
<b>
</b>
<li style="font-weight: normal;"><b>The method based on graph entropy seems provide the more accurate results (5 errors respect 9 and 11 of the other methods).</b></li>
<b>
<li style="font-weight: normal;">The graph entropy depicts better the core of the graph containing the relevant words.</li>
<li style="font-weight: normal;">I tried to expand the number of relevant features and the accuracy of the other two methods tends to worsen quickly:</li>
</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ1VMQoR_2DB_GeZn7_YJSW0GFO5C5YmW3bXWdziel8GcJag-QxPIP4-nvn2R9ZFrfvs-PfA7iinVDGhCFSJGMOq-BdBIlrLsrpKs286dD1QX96TSGtmzmX_kE6fsMClV19JKTXXLXacA/s1600/extracted.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ1VMQoR_2DB_GeZn7_YJSW0GFO5C5YmW3bXWdziel8GcJag-QxPIP4-nvn2R9ZFrfvs-PfA7iinVDGhCFSJGMOq-BdBIlrLsrpKs286dD1QX96TSGtmzmX_kE6fsMClV19JKTXXLXacA/s400/extracted.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<span class="Apple-style-span" style="font-size: x-small;">First 40 relevant words using Graph Entropy, Frequency method and Closeness Centrality.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<span class="Apple-style-span" style="font-size: x-small;">Notice how the graph Entropy preserves better the core of the graph respect the other two methods.</span></div>
<div>
<div>
<br /></div>
<div>
Stay tuned<br />
cristian</div>
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-82492852385792613762012-08-12T01:58:00.000-07:002012-08-12T01:58:04.382-07:00Function minimization: Simulated Annealing led by variance criteria vs Nelder Mead<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><b><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Most data mining problems can be reduced to a minimization/maximization problem.</span></span></span></b><br />
Whenever you are looking for the best trade-off between costs and benefits, you are solving a minimization problem.<br />
<div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Often the number of the variables that affects the cost function is high and the domain of these variables is in a dense set and, last but not least: in the real world problem often, you have no clue about the analytical form of the function.</span></span></span><br />
<span class="Apple-style-span" style="border-collapse: collapse; color: #222222;">Formally such problems belong to the multidimensional unconstrained optimization family.</span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><b><span lang="EN"><span style="font-family: inherit; font-size: small;">Examples</span></span></b></span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Let's consider easy scenarios where the function cost is conditionated just by two parameters.<span class="Apple-style-span" style="font-size: x-small;"><b></b></span></span></span></span></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi83hQhVAi8v-d7ATj3RsNortIgaXY2Rcfs4A0eW-fkwE2F3LHrlgwvKXeGFe0T9kc_WrXC9rTJzwKk1VNmNGUVkXvqpyiEHBPGEk11PVHfDVwYq5AwGR8VvcZQeJn1bS7KBGcwaeO64cs/s1600/f1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi83hQhVAi8v-d7ATj3RsNortIgaXY2Rcfs4A0eW-fkwE2F3LHrlgwvKXeGFe0T9kc_WrXC9rTJzwKk1VNmNGUVkXvqpyiEHBPGEk11PVHfDVwYq5AwGR8VvcZQeJn1bS7KBGcwaeO64cs/s400/f1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Schaffer f6 function: in the contour plot dark purple to depict local minimums.</td></tr>
</tbody></table><br />
<b><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiKB7PlnS7NJO3Wwfu7T4qSXBZhLTakIu5ugExRPrq07N_hwmhYeDz6fEh2p-9lAowYNSok-dduqQoUO-ObO1Nm7yDyBlGkPqp-JP2fVLTjQuER2UhGgAL4Ry7yJbEonCP-xfT1e2zhdA/s1600/f2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiKB7PlnS7NJO3Wwfu7T4qSXBZhLTakIu5ugExRPrq07N_hwmhYeDz6fEh2p-9lAowYNSok-dduqQoUO-ObO1Nm7yDyBlGkPqp-JP2fVLTjQuER2UhGgAL4Ry7yJbEonCP-xfT1e2zhdA/s400/f2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Another function where minimum ischaracterized by two valleys </td></tr>
</tbody></table></span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEyPvRiyzEPhJQH_HP6KUFDkpW209G6xLXd_wDJIB7IaTNmrdwvabOR5BLELShKANsesnGC7IQZBcoieD0BctR-wbLfFU_VRwc-k9wTIxJUID1MNrE-0i4rXv1ehpKjt-n6dz-v3BSKzU/s1600/f3.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEyPvRiyzEPhJQH_HP6KUFDkpW209G6xLXd_wDJIB7IaTNmrdwvabOR5BLELShKANsesnGC7IQZBcoieD0BctR-wbLfFU_VRwc-k9wTIxJUID1MNrE-0i4rXv1ehpKjt-n6dz-v3BSKzU/s400/f3.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This function presents several local minimums well depicted in the contour plot</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG6SpZmfpBmmiEQ2tUGldcwh6rvVK6sWhTpCPt9nJU8dQJpnB9QYol5HZpZJa3YjPzZWF_iOvuHjXIu0XyuYS94XyjpTeyNhWF00ecf5fiB2M7Zb6iMtX1dknlGo9JlOpn5-Sg_fLk8oE/s1600/f5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG6SpZmfpBmmiEQ2tUGldcwh6rvVK6sWhTpCPt9nJU8dQJpnB9QYol5HZpZJa3YjPzZWF_iOvuHjXIu0XyuYS94XyjpTeyNhWF00ecf5fiB2M7Zb6iMtX1dknlGo9JlOpn5-Sg_fLk8oE/s400/f5.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This function is characterized by a several local minimums having high range of values </td></tr>
</tbody></table><span class="Apple-style-span" style="border-collapse: collapse; color: #222222;"></span><br />
<div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span lang="EN"><span style="font-size: small;"><span style="font-family: inherit;">As you can see the above problems have in common a massive presence of local minimum :).</span></span></span></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Let's see how to handle these problems through an approach that I define hybrid, that is, obtained mixing different methods and properties.</span></span></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><b><span lang="EN"><span style="font-family: inherit; font-size: small;">Simulated Annealing (SA) "variance based" approach</span></span></b></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span style="font-size: small;"><span style="font-family: inherit;"><i><span lang="EN">Disclaimer: </span></i><span lang="EN">I had no the time to check whether this method has been already published somewhere. The underlying idea is quite simple, so I would assume that someone has already spent time in proofing convergence and better variations (and actually I don't think is a rocket science proof it).</span></span></span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">SA, belongs to the family of numerical method based on "search strategy" and its convergence requirements are related to the stationary conditions inducted by the underlying markovian process.</div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">In a former post I showed an its application in a discrete domain, where at each iteration we chose the next candidate solution by comparison with the current solution. The new solution was found "fishing randomly" a candidate in a discrete space.</div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">In the above problems the solution space is continuos, so we need a smart strategy to extract the new candidate. Let's see my approach!</div><div><span class="Apple-style-span" style="font-family: inherit;"><b>Variance trick </b> </span></div>Let's consider a sampling in two different region of a function <i>f :</i><br />
<ul><li>take sampling_1 in a region having a smooth minimum and evaluate those points;</li>
<li>take sampling_2 in a region having a spiky minimum and evaluate those points.</li>
</ul><div>Now take the variance of these evaluated sets.<br />
How does the variance of <i>f(sampling_1)</i> compare with the variance of <i>f(sampling_2)</i>?<br />
Here you are the answer:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMr5fHIqUTZ5NYY6amJsSEaRle1zBvH4-ujFV1iqXBPuhm7tyt24L-DKXWihpQ44iNEiTr05PPpFP7ldKIIpYFpbrIkjmyqate5JiZkVGiPtbtoOC-pvt3SNgpplbJVuoixMCD1jgw7i4/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMr5fHIqUTZ5NYY6amJsSEaRle1zBvH4-ujFV1iqXBPuhm7tyt24L-DKXWihpQ44iNEiTr05PPpFP7ldKIIpYFpbrIkjmyqate5JiZkVGiPtbtoOC-pvt3SNgpplbJVuoixMCD1jgw7i4/s400/Slide1.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: justify;">As you can see the variance can be used as indicator of a minimum region of the cost function.</div><div class="separator" style="clear: both; text-align: justify;">Instead of explore randomly the solution space, the approach I propose is led by the variance used as above. </div><div class="separator" style="clear: both; text-align: center;"></div><div class="" style="clear: both; text-align: justify;">What happens if the smooth valley is a global minimum?</div><div class="" style="clear: both; text-align: justify;">There are two intuitive factors to avoid that the algorithm jams in a local minimum:</div><div class="" style="clear: both; text-align: justify;">1. The acceptance law admit also pejorative solutions.</div><div class="" style="clear: both; text-align: justify;">2. The sampling procedure moderate by variance (if the variance is chosen high enough) allows explorations of better regions.</div><b><br />
</b> <b>The algorithm</b><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtRQT-pgb61cDJ9kc6AyIyaKd62v8qu__ia-Fh4VRxmxwZbTfqeEnvKiNPZmwedw3YE4qChhiVdVB4h08GWNxiLQ-BdaN8g24hrETTGhEaabHqzjVjXh36wn7aA5_5ZBJwfhFzZMFz4Eg/s1600/Slide1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtRQT-pgb61cDJ9kc6AyIyaKd62v8qu__ia-Fh4VRxmxwZbTfqeEnvKiNPZmwedw3YE4qChhiVdVB4h08GWNxiLQ-BdaN8g24hrETTGhEaabHqzjVjXh36wn7aA5_5ZBJwfhFzZMFz4Eg/s400/Slide1.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The meta code of the Simulated Annealing based on variance criteria</td></tr>
</tbody></table>Many optimizations can be done to the algorithm.<br />
<div>For instance, we could trigger the annealing acceptance procedure only if<br />
<div style="text-align: center;">min(evaluate(f, newSet)) > min(evaluate(f, curSet))</div><br />
</b> <b>Tests</b><br />
Let's see how the approach works.<br />
As usual, all my experiments are compared with other techniques. In this case I chose the well-known Nelder Mead method as the comparative baseline.<br />
I tried to optimize the Nelder Mead method by playing with its parameter settings: shrink ratio, contraction ratio and reflection ratio.<br />
All experiments have been done using the same initial points.<br />
<u>For the Nelder Mead method I plotted just the unique solutions found.</u><br />
<u>The approach I propose has always been run with the same settings (3000 cycles, variance sampling = 0.75).</u><br />
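<div>As a reference for reproducing the baseline, Nelder Mead is available off the shelf, for instance in SciPy; a minimal sketch (the starting point is arbitrary, and schaffer_f6 is the test function defined in the sketch above):</div>
<pre>
from scipy.optimize import minimize

# Nelder Mead baseline on the same test function; result.x is the
# solution found and result.fun its cost.
result = minimize(schaffer_f6, x0=[40.0, -30.0], method="Nelder-Mead")
print(result.x, result.fun)
</pre>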
<span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b>Experiment 1. </b></span></span><b><span style="font-family: arial, sans-serif; font-size: x-small;"></span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyXlIIQUF57HqLupArLiYa_1LBjwScnPy8VzbgMBEsdYgAqkVPL7kkfZPXLT8zXsBA9yNPjhhZaIJOu3K7VaAo7phIxvl3KkfIyjK-1QqRNtcKdeOHvpZynMPuVXJpiix1jtrhvwUJ0T0/s1600/schafferf6res2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyXlIIQUF57HqLupArLiYa_1LBjwScnPy8VzbgMBEsdYgAqkVPL7kkfZPXLT8zXsBA9yNPjhhZaIJOu3K7VaAo7phIxvl3KkfIyjK-1QqRNtcKdeOHvpZynMPuVXJpiix1jtrhvwUJ0T0/s400/schafferf6res2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.<br />
The last chart shows the space solution explored.</td></tr>
</tbody></table><span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;">Notice that method found a better solution respect Nelder Mead approach. </span></span><br />
<span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b><br />
</b></span></span> <span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b>Experiment 2</b></span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXl8JKBZSVEfS7rv_NFP6UTSeomvWhlPW5FAPtmD__85k9fiKIEBo1EoWrEbA1vrAQCosW6jz_5yJUd1yDo8qjQxlH3UlW42SOVEA6mm1QzOTKWDWJH7BiJDfLUI8ZU4NGVcwD2Svw9po/s1600/f2res.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXl8JKBZSVEfS7rv_NFP6UTSeomvWhlPW5FAPtmD__85k9fiKIEBo1EoWrEbA1vrAQCosW6jz_5yJUd1yDo8qjQxlH3UlW42SOVEA6mm1QzOTKWDWJH7BiJDfLUI8ZU4NGVcwD2Svw9po/s400/f2res.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table>SA found a better solution in very few steps.</div><div><br />
</div><div><b>Experiment 3</b></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvO1yZiAeLL1FKI_7JfxVh80dD15W8PiMK0hqP_LDt_Vt1DxHWl7b8K03s2H7wimarhGJPKvV4XU5Zr7ArS3YKCAuU-N-nG0aFJN5lweDx78FVVByl_jUMuKsCEAOmdS7wiCKGzbFERr8/s1600/f3res.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvO1yZiAeLL1FKI_7JfxVh80dD15W8PiMK0hqP_LDt_Vt1DxHWl7b8K03s2H7wimarhGJPKvV4XU5Zr7ArS3YKCAuU-N-nG0aFJN5lweDx78FVVByl_jUMuKsCEAOmdS7wiCKGzbFERr8/s400/f3res.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table><div>This experiment shows once again the good performance of this approach, even on a very challenging problem!</div><div><br />
</div><div><b>Experiment 4</b></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMDFKcWIQ97eTXhAje5BNDdlenbFsQYyIB05Cp5uNgVSGntOWcW1nrMzoHqy_o0suz-g0Pn_qeykm_c7TMIpzkEGpqnos2yBhkOVKP9zaVePgzPyl5h5X6nQws9nQReqjU4PDeLInQHgw/s1600/f5Res2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="343" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMDFKcWIQ97eTXhAje5BNDdlenbFsQYyIB05Cp5uNgVSGntOWcW1nrMzoHqy_o0suz-g0Pn_qeykm_c7TMIpzkEGpqnos2yBhkOVKP9zaVePgzPyl5h5X6nQws9nQReqjU4PDeLInQHgw/s400/f5Res2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table></td></tr>
</tbody></table>The convergence, as shown above, was reached very quickly.<br />
<div><br />
</div><div><b>Conclusion</b></div><div>The approach seems promising and works well in different contexts: I'm quite sure that many improvements can be implemented.</div><div>I'm looking forward to receiving your comments, ideas and other comparative tests.</div><div><br />
</div><div>Stay tuned!<br />
<div><b><br />
</b></div><div>cristian </div></div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-26828389041015328432012-07-10T11:09:00.000-07:002012-07-10T11:09:54.095-07:00Data Mining: Tools and CertificatesAs a member of many LinkedIn groups related to data mining & text mining, I read many threads about certificates that supposedly help both in job seeking and in consolidating a curriculum, and many other threads about miraculous tools able to solve whatever problem.<br />
<b>Is being certified really worth it?</b><br />
In my experience, a certificate in a specific data mining tool can be a plus on a curriculum, but it doesn't really help you improve your knowledge of the field.<br />
Let me explain better (<b>which is not easy with my bad English</b>): the certification system is a market, and its goal is to generate profit or to promote products.<br />
<br />
<b>Data mining tools</b><br />
My question is: do you really think a tool can exist that embraces all aspects of data mining?<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I guess that the number of problems data mining related are so high that maybe we could use the Cantor diagunalization to proof that are uncountable :)</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><br />
In my opinion, the common belief that by clicking here and there in some software you can obtain tangible benefits in mining your data is too naive.<br />
The "data mining" definition has been created by marketing industries just to summarize in a buzz word technics of applied statistics and applied mathematics to the data stored in your hard disk.<br />
I don't want to say that tools are useless, but it should be clear that a tool is only a means to solve a problem, not the solution itself.<br />
<ul><li>In the real world, problems are never standard, and you can really seldom take an algorithm as-is to solve them! ...Maybe I'm unlucky, but I have never solved a real problem through a standard method.</li>
<li>Tool X is able to load terabytes of data. So what? A good data miner should know that you don't need the entire population to analyze a phenomenon; you should be able to sample your population in order to ensure the required margin of accuracy! ...This technique is simply called statistics!</li>
<li>If you really want to claim "I know this approach very well", you must be able to implement it yourself: only by implementing it yourself can you deeply understand in which contexts the algorithm works, under which conditions it performs better than others, and so on. Don't rely on a single paper that compares a few techniques: if you change just one of the conditions, the results are terribly different.</li>
<li>Without theory you cannot go deep. Let's consider a tool like Mathematica or R: these tools give the user access to a large library of predefined algorithms and routines, they provide visualization functions to show results in a fancy way and, last but not least, they provide a complete programming language to code whatever you want. I love them, but I couldn't do anything without the theory behind the problem. Mathematica can provide me the algorithm to cluster a data set through k-means: but how can I be sure that it is the right algorithm for my problem? (<a href="http://textanddatamining.blogspot.it/2011/07/kmeans-easy-clustering-algorithm-maybe.html" target="_blank">click here to have a demo</a>).</li>
</ul><div>Actually, I would rather attend a course to deepen some aspects of multivariate statistics, or seminars on new methodologies for solving some problem, than pay plenty of money to learn every single detail of a tool that maybe won't be on the market in the next 5 years.<br />
<br />
</div><div>I know that companies often look for people certified on a famous tool just because they bought it and need to reduce the time to "integrate" a new resource into a team. Fair enough! ...but I think it is ridiculous to make certificates a strict requirement!</div><div>I'm really curious to hear about your experiences and opinions.<br />
cristian<br />
<br />
<br />
</div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com6tag:blogger.com,1999:blog-7631116270195175228.post-37497279246983918312012-07-03T11:46:00.001-07:002012-07-03T11:46:34.195-07:00Simulated Annealing: How to boost performance through Matrix Cost rescaling<div class="separator" style="clear: both; text-align: center;"></div>One of the most widely used algorithms in machine learning is simulated annealing (SA).<br />
The reasons for its popularity lie in:<br />
<ul><li>Simplicity of implementation</li>
<li>Broad spectrum of applicability</li>
</ul>There are many implementations of it available for almost all languages, and many books and papers on the topic; personally, the best book I can suggest is <a href="http://www.amazon.it/Simulated-Annealing-Boltzmann-Machines-Combinatorial/dp/0471921467/ref=sr_1_23?ie=UTF8&qid=1341308954&sr=8-23" target="_blank">Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing</a>, written by Korst & Aarts.<br />
<br />
<b>The problem</b><br />
In this post I'm going to describe a very canonical combinatorial problem (NP-hard) often approached through annealing methods: the <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem" target="_blank">complete Asymmetric Travelling Salesman Problem (ATSP)</a>.<br />
The ATSP, as you know, can be taken as a base to model several problems, for instance network routing, chemistry and physics problems. The ATSP is so charming (at least for me) because the nature of the problem is very easy and the way to find the best solution is extremely trivial: you just have to explore all the possible solutions!<br />
The only problem is the number of solutions you have to compare:<br />
<ul><li>with 10 towns you have to explore ~3.6*10^6 solutions (essentially 10!).</li>
<li>with 50 towns you have to explore ~3*10^64 solutions (essentially 50!).</li>
</ul>In the past (I'm not aware whether they are still running) there were many competitions to find the best solution of the asymmetric, complete TSP with more than 30000 towns.<br />
<br />
...Considering the time required to obtain the optimal solution, in many cases a sub-optimal solution obtained by exploring less than 1% of the solution space (as usually happens with SA) could be enough!<br />
<b><br />
</b><br />
<b>The experiments</b><br />
I considered two instances of the problem, the first one with 10 towns and the second one with 50 towns.<br />
Even if there are more efficient algorithms for this kind of TSP, I chose simulated annealing just to show how versatile it is and to show how <b>essential the rescaling operation is in order to maximize the performance of your algorithm. </b><br />
<br />
<b>Why rescaling? </b>(I'll try to be as little technical as possible)<br />
SA draws its power from the ability to overcome local minima by accepting, at time "t", a worse solution than the better one found at time "t-1". It assigns an acceptance probability, based on a sigmoidal function, that decreases as the cost of the new solution grows with respect to the current one (for instance, a Glauber-style rule $P_{accept} = 1/(1+e^{\Delta/T})$, with $\Delta$ the cost difference and $T$ the temperature, has this shape).<br />
So if the cost distance between the current solution and the new one is very high, the acceptance probability will be very small.<br />
This probability is mitigated by the temperature schedule, which allows a higher acceptance rate (of bad solutions) when the temperature of the system is high, and reduces it as the temperature decreases during the exploration of the solution space.<br />
The aim of the rescaling is to amplify the cost of the expensive legs between towns and to shrink the cost of the cheap, short ones.<br />
To obtain this, I apply a "quasi" n log n function to the cost matrix.<br />
(A publication on the topic should be ready by the end of the year).<br />
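<div>Since the exact rescaling function is not disclosed here (it is left for the upcoming publication), the Python sketch below only illustrates the mechanism with a plausible stand-in, c*log(1+c), which widens the gap between expensive and cheap legs. The matrix generator and the plain swap-based SA are my own illustrative reconstructions of the setup described under Results, not the original Mathematica code; note that the search runs on the rescaled costs while the final cost is reported on the original matrix.</div>
<pre>
import math
import random

import numpy as np

def rescale_costs(C):
    # Illustrative "quasi n log n" rescaling: amplifies large costs and
    # keeps small ones small. NOT the exact function used in the post.
    C = np.asarray(C, dtype=float)
    return C * np.log1p(C)

def make_cost_matrix(n, high, seed=0):
    # Random integer costs in [1, high]; a cheap "perimeter" cycle of
    # cost-1 legs is planted as a reference optimum (illustrative).
    rng = np.random.default_rng(seed)
    C = rng.integers(1, high + 1, size=(n, n)).astype(float)
    for a in range(n):
        C[a, (a + 1) % n] = 1.0
    np.fill_diagonal(C, 0.0)
    return C

def tour_cost(C, tour):
    return sum(C[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def sa_atsp(C, cycles=2000, beta=5.0, rescaled=True):
    C = np.asarray(C, dtype=float)
    S = rescale_costs(C) if rescaled else C   # search on the rescaled costs
    tour = list(range(len(C)))
    random.shuffle(tour)
    for t in range(1, cycles + 1):
        i, j = random.sample(range(len(C)), 2)
        cand = tour[:]
        cand[i], cand[j] = cand[j], cand[i]   # swap two towns
        delta = tour_cost(S, cand) - tour_cost(S, tour)
        # Acceptance of worse tours fades as the system cools (t grows).
        if delta <= 0 or random.random() < math.exp(-beta * delta * t / cycles):
            tour = cand
    return tour, tour_cost(C, tour)           # report cost on the ORIGINAL matrix

print(sa_atsp(make_cost_matrix(10, 10)))
</pre>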
<br />
<b>Results</b><br />
All experiments, as usual, have been done through a "home made" application written in Mathematica.<br />
<ul><li>The cost matrix has been built using a uniform distribution with integer values (1 to 10 for the first experiment and 1 to 100 for the second one).</li>
<li>Just to have a reference point, I assigned the minimum value (1) to the cost matrix entries along the external perimeter of the "town system".</li>
<li>The ATSP with 10 towns has optimal solution cost = 9. The ATSP with 50 towns has optimal solution cost = 49.</li>
</ul><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjctpDNSc4wOkah-yiivxpfv_g0ijy3DPc6ctjR3j0aogzYGu8FbOBVQbf8afHz5TEXG5AZhHC5fKysOJiMAGCElERbVNnpNpm6AxjU0aDX5Lg1UwtpFvqVmJTVcwpc7uwbq1wi8_TIYVg/s1600/sa_exp1_Nb5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjctpDNSc4wOkah-yiivxpfv_g0ijy3DPc6ctjR3j0aogzYGu8FbOBVQbf8afHz5TEXG5AZhHC5fKysOJiMAGCElERbVNnpNpm6AxjU0aDX5Lg1UwtpFvqVmJTVcwpc7uwbq1wi8_TIYVg/s400/sa_exp1_Nb5.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">After 2000 cycles, SA found a solution having cost = 13. (the graph on the II quadrant represents the acceptance prob.).<br />
The initial temperature beta has been set = 5.</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEOCOqeNGA7Xb-yZlNFnnBpsgkzaWUY7bWKIqI4zw9rLYWvqwI0N5aDpbiZN_OO6URKSBcRrZNy7Y7NBFtMIbkNqXzoewRSlgvq9k9bAHSfyyXs8ThX2UCWMq3bAKDTYNu_DiOhdHbKzs/s1600/sa_exp1_Mb5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEOCOqeNGA7Xb-yZlNFnnBpsgkzaWUY7bWKIqI4zw9rLYWvqwI0N5aDpbiZN_OO6URKSBcRrZNy7Y7NBFtMIbkNqXzoewRSlgvq9k9bAHSfyyXs8ThX2UCWMq3bAKDTYNu_DiOhdHbKzs/s400/sa_exp1_Mb5.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">After 1000 cycles, SA with MatrixCost rescaled found the optimal solution having cost 9.<br />
The initial temperature beta has been set = 5.</td></tr>
</tbody></table>I repeated the same experiment increasing the initial temperature to 15.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtqOf8NwedkWvW49oNlllpbo6PuSfknaR6rEIgNoQP76Dlml4W0hUQK3n9xK301-vFSguMxK7KD4gOG4OO5OqDvyY9W6d-jslH0LxFzjfDm_Vg8-dlN2BhMnONyXNpzGXIkRrHmkrANMc/s1600/sa_exp1_Nb15.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtqOf8NwedkWvW49oNlllpbo6PuSfknaR6rEIgNoQP76Dlml4W0hUQK3n9xK301-vFSguMxK7KD4gOG4OO5OqDvyY9W6d-jslH0LxFzjfDm_Vg8-dlN2BhMnONyXNpzGXIkRrHmkrANMc/s400/sa_exp1_Nb15.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SA launched with beta= 15. After 5000 cycles the sub optimal solution found has cost = 14<br />
<br />
</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvfgIdHejECWLt69zLRBPyVq8dJDlBg3w0_322jiuPQNpzyIA909Jfu77uY-PTFZrJLVaMkoh7k2qUf6fIJOK6LzKtD-RYt9wBTUgyp5zlykMhnk02AWckjnrxlnN7RBE8EwXrJu9IDUs/s1600/sa_exp1_Mb15_2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvfgIdHejECWLt69zLRBPyVq8dJDlBg3w0_322jiuPQNpzyIA909Jfu77uY-PTFZrJLVaMkoh7k2qUf6fIJOK6LzKtD-RYt9wBTUgyp5zlykMhnk02AWckjnrxlnN7RBE8EwXrJu9IDUs/s400/sa_exp1_Mb15_2.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SA with matrix cost rescaled after 2000 cycles found better solution (same initialization of the above experiment)</td></tr>
</tbody></table>A larger test has been done to understand the statistical behavior:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8NYojqaJVAPMe6gep1CvhhOY6Xu1nR6Ne9dktDCm7gIvam1Uiy7pKUkhEFWgRKssACcP7hH6fL9rOrxXeC_w38rafmVkidt65KfiYG6yCGyqprVCvpdQezqSSt934h0tTd1keYbXUzVY/s1600/comparisonbeta5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8NYojqaJVAPMe6gep1CvhhOY6Xu1nR6Ne9dktDCm7gIvam1Uiy7pKUkhEFWgRKssACcP7hH6fL9rOrxXeC_w38rafmVkidt65KfiYG6yCGyqprVCvpdQezqSSt934h0tTd1keYbXUzVY/s400/comparisonbeta5.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The blue line represents the solution costs of the traditional SA.<br />
The red line represents the solution costs of the SA with matrix cost rescaled </td></tr>
</tbody></table><div><br />
<div>And to conclude, here are the results obtained with higher complexity (50 towns).</div><div><br />
</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHpbYmWmXiRMT1_QtfsQwFuBTXUFnzkNfv_jTFJ0CYBmtmY7yKe0szylHZcENhzGX7dHuPTOWp-6I8ruRIsqJUm5tH42jrcoAV9MOfjR7yG0Mh4d9at1MvkVCvoMwPR3tFQzS4LZfBPQ/s1600/comparisonbeta5_prbDim50.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHpbYmWmXiRMT1_QtfsQwFuBTXUFnzkNfv_jTFJ0CYBmtmY7yKe0szylHZcENhzGX7dHuPTOWp-6I8ruRIsqJUm5tH42jrcoAV9MOfjR7yG0Mh4d9at1MvkVCvoMwPR3tFQzS4LZfBPQ/s400/comparisonbeta5_prbDim50.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ATSP (50 towns) comparison between traditional SA and SA with rescaled matrix cost.<br />
<br />
<br />
</td></tr>
</tbody></table>Let's try to increase the number of cycles from 1000 to 3000.</div><div class="separator" style="clear: both; text-align: center;"></div><div><br />
<div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMyi8-kFPsOwGco5Y_T2Alv7r44ztifZJ4yFLb_eQFSKOgyUetVuWoxWGOLJzXVwAobLQcOzp_fWbzfrbBtJP8lJquEH0YKzsBjTNBEZWQE6LPly2RpyriSb1NZ3CXRr4arZfJjZjUvlA/s1600/comparisonbeta5_prbDim50_3000.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMyi8-kFPsOwGco5Y_T2Alv7r44ztifZJ4yFLb_eQFSKOgyUetVuWoxWGOLJzXVwAobLQcOzp_fWbzfrbBtJP8lJquEH0YKzsBjTNBEZWQE6LPly2RpyriSb1NZ3CXRr4arZfJjZjUvlA/s400/comparisonbeta5_prbDim50_3000.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The % of success of the new strategy seems increase with the increase of the number of cycles.</td></tr>
</tbody></table><div><br />
Let's see what happens increasing once again the number of cycles:</div><div><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWkDfDcZFdnuveqhCtNlhAE7O1W6pdF1JUDCP_cbDakGQL4HdyhX1I4d-rn0a2CqxqCiN3g7w1b0YTLhPPdkOohyvr0t4tIjgHu_KjKXzEviQIna2SyAiT0AdMYuMmAmQoavT_0m-UEUY/s1600/comparisonbeta5_prbDim50_6000.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWkDfDcZFdnuveqhCtNlhAE7O1W6pdF1JUDCP_cbDakGQL4HdyhX1I4d-rn0a2CqxqCiN3g7w1b0YTLhPPdkOohyvr0t4tIjgHu_KjKXzEviQIna2SyAiT0AdMYuMmAmQoavT_0m-UEUY/s400/comparisonbeta5_prbDim50_6000.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The assumption seems to be confirmed: the new strategy performs better than the traditional SA even if the number of cycles is higher. </td></tr>
</tbody></table><b>Conclusion</b></div><div><ol><li>Often the accuracy/performance of an algorithm depends on how you feed it with data! Rescaling your data is almost always essential!</li>
<li>The strategy requires further tests and validation.</li>
</ol><div>...We will discuss the function used to rescale the cost matrix, and some other applications of SA to different problems, in the next posts.</div></div><div>cheers & stay tuned</div><div>P.S.</div><div>Special thanks to Tom V.: he is at the top of the list of contributors for the sponsorship program.</div></div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com7