Text & Data Mining by practical means. Text mining, data mining, predictive analytics: a space to exchange ideas that work in enterprise contexts. By Cristian Mesiano.<br />
<br />
<b>Partitional clustering: number of clusters and performance, a quantitative analysis</b><br />
<b>Abstract</b><br />
Partitional clustering methods such as k-medoids or k-means require an input parameter specifying the number of clusters into which to partition the data.<br />
The time complexity of such algorithms strictly depends on<br />
<ul>
<li>the number of clusters used to initialise the computation.</li>
<li>the steps to update the centroids.</li>
</ul>
Whenever the similarity distance doesn't allow the centroid to be determined through analytical methods, the time complexity tends to explode.<br />
In this post I show a heuristic to minimise the time complexity in non-Euclidean spaces.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgakC-vENcnPVzWR7RVO7iokw4oqirjSkrykm-iUo9m6z939x4ON5pOrjxj8Y9PErdmKDQwVwi6HkV6zaxzmM7fEvy4wxIQrOawiI_zCxRzQRIFixcvEqNfL419W0sGNvFTEuCB9pefICc/s1600/clusterAnalysis.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgakC-vENcnPVzWR7RVO7iokw4oqirjSkrykm-iUo9m6z939x4ON5pOrjxj8Y9PErdmKDQwVwi6HkV6zaxzmM7fEvy4wxIQrOawiI_zCxRzQRIFixcvEqNfL419W0sGNvFTEuCB9pefICc/s1600/clusterAnalysis.bmp" height="444" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Number of computational steps for standard k-mean executed. Chart depicts the steps by number of points $N$ in the range [1k,10k] and by number of clusters in the range [5,$N/10$]. <br />
The cardinality of the clusters has been simulated assuming uniform distribution. <br />
The red line indicates the minimum number of steps.</td></tr>
</tbody></table>
<b>Introduction</b><br />
K-means and k-medoids are widely used because their implementation is very easy and their time complexity is low (it's linear in #iterations $\cdot$ #clusters $\cdot$ #points).<br />
Is the complexity linear for every problem?<br />
The answer is no!<br />
<br />
<b>The problem</b><br />
The problem arises when the objects to be clustered don't lie in a space where the centroids can be computed through analytical methods.<br />
In this context, the computational cost of updating the centroids becomes dominant with respect to the other steps, and the overall complexity is no longer linear in time.<br />
<br />
<b>The complexity in case of non-Euclidean space</b><br />
<span style="font-family: CMR10; font-size: 10pt;">I</span><span style="font-family: inherit;">n a Euclidean space, the points of a cluster
can be averaged. </span><br />
<span style="font-family: inherit;">In non-Euclidean
spaces, there is no guarantee that points have an “average” (...some puritan of the measure theory might stuck up their nose on that) so we have to choose one of the members of the cluster as centroid.</span><br />
<span style="font-family: inherit;">As explained in the former post, for each cluster the computation of the centroid requires (in case of </span>symmetric<span style="font-family: inherit;"> distance) </span>$\binom{K_i}{2}$<span style="font-family: inherit;"> computational steps.</span><br />
<span style="font-family: inherit;">The overall complexity (for each iteration) is the following:</span><br />
<span style="font-family: inherit;"> </span> $ \sum_{i=1}^k \binom{z_i}{2}+k \cdot n$ <span style="font-family: inherit;"></span><br />
<span style="font-family: inherit;"><br /></span>A dataset having several points might require a number of computations simply not feasible; in that case it's essential the minimisation and optimisation of each param of the algorithm to reduce the complexity.<br />
A numeric example will clarify the point.<br />
Let's assume a data set of 10,000 elements, and let's assume that the compression rate of the clustering is 90% (meaning the number of clusters is 1,000).<br />
<br />
<ul>
<li>If the clusters' cardinality is homogeneous, then the number of steps for the first iteration will be 10,045,000.</li>
<li>If the clusters' cardinality is highly unbalanced, for instance when one cluster contains 95% of the points, then the number of steps is 50,360,655.</li>
</ul>
As you can see... even if the data set is relatively small the computational effort is very high!<br />
<br />
<b>Determination of the #clusters that minimises the complexity in time</b><br />
<br />
There are not many choices... the only option is to determine the number of clusters that minimises the computational steps.<br />
<br />
You might argue that the #clusters should be selected according to the distribution of the data set, but the following is also true:<br />
<ul>
<li>In the real world you don't know a priori the exact number of clusters, and the heuristics to estimate it are quite often not really helpful in non-Euclidean spaces with many points. So I prefer a sub-optimal clustering solution with a much faster answer :)</li>
<li>Clustering analysis is usually one of the many analyses you perform to understand your data, so even if the clustering is not perfect, it doesn't hurt too much. Clustering is just one piece of the entire puzzle!</li>
</ul>
<b>The approach</b><br />
If the clustering divides the data set into clusters having more or less the same size, the solution is quite straightforward: we can find the minimum of the function described above through analysis of its first derivative.<br />
Assuming that each $z_i = \frac{n}{k}$, the formula can be rewritten as:<br />
$ f(n,k)=\frac{n^2}{2 k}+k n-\frac{n}{2}$<br />
The value of $k$ that minimises the function is found by setting the first derivative to zero:<br />
$ \frac{\partial f(n,k) }{\partial k}=\frac{\partial}{\partial k} \left( \frac{n^2}{2 k}+k n-\frac{n}{2} \right) = n-\frac{n^2}{2 k^2}$.<br />
It takes the value 0 at $k= \frac{\sqrt{n}}{\sqrt{2}}$.<br />
<br />
In all the other cases, where the cluster sizes are not homogeneous, we can easily simulate it!<br />
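As a sanity check, here is a minimal Python sketch of both routes; the Dirichlet draw is just one arbitrary way to simulate unbalanced cluster sizes, and the step count mirrors the formula above:<br />
<pre><code>import numpy as np

# Steps per iteration: sum_i C(z_i, 2) for the centroid updates,
# plus k*n for the assignment step (see the formula above).
def steps(sizes, n):
    return sum(z * (z - 1) // 2 for z in sizes) + len(sizes) * n

n = 10_000
print("analytic optimum for balanced clusters:", round((n / 2) ** 0.5))

# Unbalanced case: draw random cluster sizes for each candidate k
# and keep the k with the fewest steps (this mirrors the chart below).
results = []
for k in range(2, 1001):
    sizes = np.random.multinomial(n, np.random.dirichlet(np.ones(k)))
    results.append((steps(sizes, n), k))
print("simulated optimum (steps, k):", min(results))
</code></pre>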
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPXItk6Dqvr8wONZhPirlT5A4MuDpe0L08tw2uRknY_HfBpZDTZ4dE-4Vi4dAKn340FYcMNsRKYiMJQ6Ubljw27HwEsfkgzwONz2dfaxpev3B-OF-BnDXPZT86P3XcVL658H4-uBcr38/s1600/clusterAnalysis1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPXItk6Dqvr8wONZhPirlT5A4MuDpe0L08tw2uRknY_HfBpZDTZ4dE-4Vi4dAKn340FYcMNsRKYiMJQ6Ubljw27HwEsfkgzwONz2dfaxpev3B-OF-BnDXPZT86P3XcVL658H4-uBcr38/s1600/clusterAnalysis1.bmp" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Computations for a data set of 10k points.<br />
The computations are estimated by number of clusters (in the range [2,1000]) and with unbalanced cluster sizes.<br />
The red line indicates the minimum. </td></tr>
</tbody></table>
As you can see, the difference between the computational steps executed with the "best" configuration of #clusters and all the other possible configurations is quite impressive: it's almost $10^7$ in the worst case. And this is for each iteration!<br />
<div>
I let you calculate how much time you can save working with the best configuration :).</div>
<div>
Stay tuned!</div>
<div>
c.</div>
<b>Text Clustering: a non-parametric version of the K-medoids algorithm for data de-duplication</b><br />
<b>Abstract</b><br />
This post presents a quick and effective method to leverage text clustering for data de-duplication and normalisation.<br />
The customised version proposed here bypasses the very well known problem of assigning the number of clusters.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnDu7XuSerJoR7hkod5XDp9aL0qVtTlFt0flMJUN1t4h5bVhmh08oPSh0mAmWuaynhKPdaJxOpOhQrPRa0JmOYw3iQWD-MXRnvLABOdvxap6ucm5lAESWhfloQu3jmndAJTy6cAxL65Zg/s1600/resBest_5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnDu7XuSerJoR7hkod5XDp9aL0qVtTlFt0flMJUN1t4h5bVhmh08oPSh0mAmWuaynhKPdaJxOpOhQrPRa0JmOYw3iQWD-MXRnvLABOdvxap6ucm5lAESWhfloQu3jmndAJTy6cAxL65Zg/s1600/resBest_5.bmp" height="308" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Text Clustering results: each row depicts a cluster found. </td></tr>
</tbody></table>
<br />
<b><br /></b>
<b>Introduction</b><br />
One of the most common problems in data management is consistency: customer addresses, company names, and any attribute in string format can be represented in multiple formats, often mutually equivalent.<br />
Data normalisation presents some challenges, especially in the following scenarios:<br />
<ul>
<li>Variations/mutations of the trusted source of information are not known a priori.</li>
<li>The data to be normalised are characterised by high volume.</li>
<li>There are no deterministic and static rules to normalise the data.</li>
<li>...you don't have plenty of time to find the perfect solution :), and you have to deliver fast!</li>
</ul>
<b>Why the brute force approach is a waste of time!</b><br />
Let's assume that you have already found, for your specific use case, the best distance to judge how far one record is from another.<br />
To make the story a bit more realistic, let's assume hereafter that our attribute to be normalised is a string.<br />
...I would hope that none of you really thought of creating a big matrix to keep track of all the possible distances among the possible values of the attribute... Why? If you have for instance 10k possible values, you need to compute at least 49,995,000 comparisons (assuming that your distance is symmetric!); that's because the complexity follows $\binom{N}{2}$.<br />
You might perform the brute force approach over a statistically representative sample set, but in that case it's really hard to validate the real accuracy of your results.<br />
<b><br /></b>
<b>Why Partitional Clustering is efficient?</b><br />
As you know, this is a "practical means" blog, so I encourage you to refer to real books if you are looking for theoretical proofs. Anyhow, it's easy to get convinced of the following:<br />
<ul>
<li>number of comparisons to be performed, assuming $K$ clusters, $I$ iterations, and $N$ elements to be clustered: $N \cdot K \cdot I$.</li>
<li>number of comparisons to update the centroids: $I \cdot \left( \binom{K_1}{2}+\binom{K_2}{2}+\cdots+\binom{K_K}{2} \right) \ll \binom{N}{2}$</li>
</ul>
<br />
The comparison of the computational complexity of the brute force approach with partitional clustering is well depicted by the example below, where I assumed ten iterations to reach convergence, 10 clusters, and a data set that grows from 1,000 to 20,000 points:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4x_pE90Dg5TniWPNLpKNWL6y2K3UW298A4BZe7aWcKHbl04HkVBAa6cMYPxVxAsexwkaV4t9wSbUG37Zhjk8BytCCb4-w_L5bT6zbAE5ppXwo-rBGeNeOcs9hzsRxnBcIAObkmmP5fyo/s1600/Untitled-1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4x_pE90Dg5TniWPNLpKNWL6y2K3UW298A4BZe7aWcKHbl04HkVBAa6cMYPxVxAsexwkaV4t9wSbUG37Zhjk8BytCCb4-w_L5bT6zbAE5ppXwo-rBGeNeOcs9hzsRxnBcIAObkmmP5fyo/s1600/Untitled-1.bmp" height="225" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The computational effort: in red the brute force approach, in blue the clustering.</td></tr>
</tbody></table>
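The same comparison can be reproduced with a few lines of Python (the balanced cluster size $N/K$ is an assumption made to keep the estimate simple):<br />
<pre><code>from math import comb

def brute_force(n):
    return comb(n, 2)          # all pairwise comparisons

def clustering(n, k=10, iters=10):
    # n*k*iters assignment comparisons plus the per-iteration
    # centroid updates on (roughly balanced) clusters of size n/k
    return n * k * iters + iters * k * comb(n // k, 2)

for n in (1_000, 10_000, 20_000):
    print(n, brute_force(n), clustering(n))
</code></pre>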
<b>The approach in a nutshell </b><br />
<div>
1. Group together records that look similar to each other using k-medoids clustering. </div>
<div>
2. Store the centroids found in an index.</div>
<div>
3. Delete (only if the use case allows it) all the members of the clusters except the centroids.</div>
<div>
4. For each new record to be stored in the DB, perform the comparison with the elements of the index: if the distance is too high, add the new record to the index (a minimal sketch of this step follows).</div>
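A minimal Python sketch of step 4, assuming a distance normalised to [0,1] and a hypothetical threshold <code>theta</code>:<br />
<pre><code>def normalise(records, index, distance, theta=0.3):
    # Compare each new record against the centroids kept in the index;
    # if it is far from all of them, treat it as a new canonical value.
    for r in records:
        nearest = min((distance(r, c) for c in index), default=1.0)
        if nearest > theta:
            index.append(r)    # new entry for the index
        # else: r is a duplicate/variant of an existing centroid
    return index
</code></pre>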
<div>
<b><br /></b></div>
<div>
<b>K-medoids vs K-means</b></div>
<div>
The main difference between K-medoids and the well known K-means is in the assignment of the centroids:</div>
<div>
<ul>
<li>in K-means the centroid is the barycentre of the cluster, so it might not be a real record.</li>
<li>in K-medoids, the centroid is a record of the cluster that minimises the sum of the distances with respect to all the other points of the cluster. Of course, depending on the case, you might prefer to minimise the variance, or even define more sophisticated minimisation criteria.</li>
<li>K-means is usually much faster, since the barycentre is usually computed as the average of the positions of the records in the space;</li>
<li>K-medoids requires (for each cluster) the additional comparison of the distances of each point of the cluster from all the other points of the cluster (this explains the combinatorial term in the complexity formula above).</li>
</ul>
<span style="font-family: inherit;">K-medoids is more robust than k-means in the presence of noise and
outliers.</span></div>
<div>
<span style="font-family: inherit;">...But there is still something that makes me unhappy with K-medoids: It's parametric, meaning it requires the priori knowledge of the number of clusters.</span></div>
<div>
<span style="font-family: inherit;">In the real world simply forget it! </span></div>
<div>
<span style="font-family: TimesNewRomanPSMT;"><br /></span></div>
<div>
<span style="font-family: TimesNewRomanPSMT;"><b>Customised version for auto-determination of number of clusters.</b></span></div>
<div>
<ol>
<li><span style="font-family: TimesNewRomanPSMT;"><i>Instantiate the standard k-medoids with arbitrary number of classes (in the case analysed later I used just two clusters as initialisation).</i></span></li>
<li><i><span style="font-family: TimesNewRomanPSMT;">after the operation to update the medoids, compare the </span><span style="font-family: TimesNewRomanPSMT;">medoids with the others: if two medois are very similar, then merge the two clusters.</span></i></li>
<li><span style="font-family: TimesNewRomanPSMT;"><i>calculate the mean distances (this computation can take advantages of the operations performed during the medoids updating) for each cluster: if the mean distance overcome a threshold (that indicates the sparsity of the points of the cluster) then split the cluster. In this experiment I used a wild approach: each point of the split cluster have been consider as new cluster.</i></span></li>
<li><span style="font-family: TimesNewRomanPSMT;"><i>goto step 2 until convergence. </i></span></li>
</ol>
</div>
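A minimal Python sketch of the loop, assuming a precomputed, normalised distance matrix <code>D</code> as a NumPy array (values in [0,1], zeros on the diagonal); $\theta$ is the merge threshold and $1-\theta$ the split threshold, as discussed below:<br />
<pre><code>import numpy as np

def kmedoids_autok(D, theta, max_iter=50):
    n = len(D)
    medoids = [0, n // 2]                          # step 1: start with 2 clusters
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        # update: each medoid minimises the sum of in-cluster distances
        updated = []
        for c in range(len(medoids)):
            members = np.where(labels == c)[0]
            if len(members):
                sub = D[np.ix_(members, members)]
                updated.append(members[np.argmin(sub.sum(axis=1))])
        # step 2: merge clusters whose medoids are very similar
        merged = []
        for m in updated:
            if all(D[m, m2] > theta for m2 in merged):
                merged.append(m)
        # step 3: split clusters whose mean internal distance is too high
        final = list(merged)
        labels = np.argmin(D[:, merged], axis=1)
        for c, m in enumerate(merged):
            members = np.where(labels == c)[0]
            if len(members) > 1 and D[m, members].mean() > 1 - theta:
                final.extend(int(p) for p in members if p != m)  # "wild" split
        if set(final) == set(medoids):             # step 4: until convergence
            break
        medoids = final
    return medoids
</code></pre>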
<div>
You might argue that such a customisation doesn't require the "number of clusters" as a parameter but introduces two thresholds: one for the cluster merging and another one for the splitting.<br />
Actually the threshold is just one:<br />
<ul>
<li>if $\theta$ is the minimum acceptable distance at which two strings might be considered the same, then $1- \theta$ is the threshold at which two strings are absolutely different!</li>
</ul>
The advantage of the approach lies in the fact that the threshold can easily be estimated and verified. Estimating the #clusters is not that easy!<br />
<b><br /></b>
<b>Experiment: Addresses de-duplication.</b><br />
As data set, I downloaded from this <a href="http://gmaps-samples.googlecode.com/svn/trunk/articles-phpsqlsearch/phpsqlsearch_data.csv" target="_blank">link</a> the addresses of pizzerias in New York (I'm Italian, and I love pizza :D).<br />
To assess the accuracy of the algorithm, I randomised each entry by introducing Gaussian noise.<br />
<b><br /></b>
<b>Data Perturbation:</b><br />
The perturbation method I used is based on randomised swapping of characters of the string, as sketched in the code after this list:<br />
<ol>
<li>Select the number $n$ of characters you want to swap. This number is drawn within the range [0,StringLength[str]] using a Gaussian distribution.</li>
<li>Draw $n$ couples of integers (between [1,StringLength[str]]).</li>
<li>For each couple, swap the corresponding first character with the second.</li>
</ol>
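A Python sketch of the perturbation; the spread <code>sigma</code> of the Gaussian is a hypothetical choice:<br />
<pre><code>import random

def perturb(s, sigma=2.0):
    # 1. number of swaps, drawn from a Gaussian and clipped to [0, len(s)]
    n = min(len(s), max(0, round(abs(random.gauss(0, sigma)))))
    chars = list(s)
    for _ in range(n):
        # 2. draw a couple of positions and 3. swap the two characters
        i = random.randrange(len(chars))
        j = random.randrange(len(chars))
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(perturb("60 Greenwich Ave, New York"))
</code></pre>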
<b>Distance</b><br />
The definition of the distance is crucial: several options are available. In this experiment I used the common <a href="http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm" target="_blank">Needleman-Wunsch</a> similarity. I'm pretty sure that good results can also be achieved with the Jaro-Winkler and Levenshtein distances.<br />
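For reference, a self-contained sketch of a Needleman-Wunsch similarity normalised to [0,1]; the match/mismatch/gap scores are assumptions, not necessarily the values used in the experiment:<br />
<pre><code>def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # global alignment score via dynamic programming
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]

def similarity(a, b):
    # normalise the alignment score so that 1 means identical strings
    return max(0.0, nw_score(a, b) / max(len(a), len(b), 1))
</code></pre>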
<br />
<div>
<b>Experiment Setting</b><br />
First I estimated the parameter of the customised k-medoids algorithm: the merging/splitting threshold.<br />
Its value depends on the nature of the data.<br />
<ul>
<li>What's the minimum acceptable distance between two strings to be considered the same?</li>
</ul>
</div>
I randomly selected several couples of strings, measured the distances and sorted them by value.<br />
I noticed that a 70% similarity is a good cut-off for a match.<br />
In the next posts I'll show a more refined technique based on the convergence slope of the clustering... but that's another story.<br />
<br />
<b>Results:</b><br />
The data set contains 50 records:<br />
<ul>
<li>10 different addresses</li>
<li>for each address, 5 deviations generated with the aforementioned perturbation.</li>
</ul>
The customised K-medoids found the best result (that is, 10 clusters each containing the 5 deviated addresses) in ~20% of the cases (first bar of the chart below).<br />
The chart below shows how many times (out of 100 trials) the algorithm found exactly $k$ clusters.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5DhaVSFOJm0BAh-DNyLE5M5FFNSjS7VHwYp_V-u5p2XNi7ive9qh5kPW3NSTSNVpIVz_HlFT7ri-26YPbV6b_MVq1jMgO9jOIMTNwN1kiEofOPUt-SsCFZj3_a-67Bn0AO-H0vD-4cZg/s1600/barchart.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5DhaVSFOJm0BAh-DNyLE5M5FFNSjS7VHwYp_V-u5p2XNi7ive9qh5kPW3NSTSNVpIVz_HlFT7ri-26YPbV6b_MVq1jMgO9jOIMTNwN1kiEofOPUt-SsCFZj3_a-67Bn0AO-H0vD-4cZg/s1600/barchart.bmp" height="275" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The first bar says that 22 times out 100 trials, the alto returned 10 clusters.</td></tr>
</tbody></table>
<ul>
<li>As you can see, in 70% of the cases the algorithm partitions the data set into [9,12] clusters!</li>
</ul>
Below, the best and worst cases:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_sRyCvZtt9n5Zb6A632qrwsgffPMye_4q15n8zz9twCfsI258P18tqEZ9FGz09B884T6HxYi7yVrsj9lgeM00AaBwc3k8RlYOHzpG7csl0OlyoDu1zcaeD8x15vEfQVDkleLvW5hIKM/s1600/resBest_5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_sRyCvZtt9n5Zb6A632qrwsgffPMye_4q15n8zz9twCfsI258P18tqEZ9FGz09B884T6HxYi7yVrsj9lgeM00AaBwc3k8RlYOHzpG7csl0OlyoDu1zcaeD8x15vEfQVDkleLvW5hIKM/s1600/resBest_5.bmp" height="310" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Best case: achieved 20% of the cases. Steps to converge ~26. </td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uPJRruPDf7gYySAvfUsZ5XX0I7UyPZhCmFZwlpw-90v7_2jbsqI6s-2xqjPTAs2GxIy7AgsHRjSKbYfhwYq3n8OlX4_ql-P-GqsdQBBjtV5Kf0vGqfwRTiXOFdcC2aWFrY2UxSg20As/s1600/worst.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uPJRruPDf7gYySAvfUsZ5XX0I7UyPZhCmFZwlpw-90v7_2jbsqI6s-2xqjPTAs2GxIy7AgsHRjSKbYfhwYq3n8OlX4_ql-P-GqsdQBBjtV5Kf0vGqfwRTiXOFdcC2aWFrY2UxSg20As/s1600/worst.bmp" height="360" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Worst case: returned 17 clusters. Convergence after 40 steps.</td></tr>
</tbody></table>
<b>Convergence of the algorithm:</b><br />
The time to converge is the following:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoOLiBPfWzm4pO0Oh_ofrKSUqFFsxTIeaXlkazFyo1bRBVV8rQx7BOHqwEq1ClwlFoYETxo1L_1Vz4ycL8HCf_t5UA7fv1zxQniYxhHLWY8Z0SiI_lBSLAA0IEl6paU5zpPHA3BrhyB3o/s1600/convergenceTime.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoOLiBPfWzm4pO0Oh_ofrKSUqFFsxTIeaXlkazFyo1bRBVV8rQx7BOHqwEq1ClwlFoYETxo1L_1Vz4ycL8HCf_t5UA7fv1zxQniYxhHLWY8Z0SiI_lBSLAA0IEl6paU5zpPHA3BrhyB3o/s1600/convergenceTime.bmp" /></a></div>
<br />
<ul>
<li>On average the algorithm converges after 36 steps.</li>
</ul>
I noticed that the longer the time to converge, the lower the probability of determining the number of clusters correctly.<br />
<br />
As usual: stay tuned!<br />
cristian<br />
<b><br /></b></div>
<b>Fractals: parameters estimation (part IV)</b><br />
<b>Introduction</b><br />
In the former posts we discussed the following points:<br />
<ol>
<li>There are special points in the contour of the fractal that can be used to derive the rest of the contour.</li>
<li>Such points can be used to describe the fractal through an iterative and deterministic algorithm. </li>
</ol>
The main issue is still open: can we leverage the two findings above to determine the fractal's parameters?<br />
The answer is yes, provided that we use a good technique to extract the contour points of the fractal image.<br />
<br />
<b>The algorithm (practical explanation)</b><br />
We already noticed that the points in the contour of the fractal are in a tight relationship with each other.<br />
Since each linear transformation is described by exactly six parameters, we need at least 6 points per transformation function to determine them.<br />
The image below makes the concept clear:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7FOwKlB_Up7_WnhiuddE66MIEQd6u9-K5ASl4PA59zXxhUZ6in4LNeZuLTQpUOhVjxnMUATH5TqCFTlISmXTC7cTrA1Ga4b6w4cmzJZ0QenLGx8zoHj2F_78ww78j2msb0HTy8AQulq4/s1600/paramEstimation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7FOwKlB_Up7_WnhiuddE66MIEQd6u9-K5ASl4PA59zXxhUZ6in4LNeZuLTQpUOhVjxnMUATH5TqCFTlISmXTC7cTrA1Ga4b6w4cmzJZ0QenLGx8zoHj2F_78ww78j2msb0HTy8AQulq4/s1600/paramEstimation.png" height="394" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px;">Image-1: Relationship between points of the fractal contour</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The steps to obtain the estimation of the parameters are the following:<br />
<ol>
<li>Extraction of the contour points through the convex hull algorithm;</li>
<li>Building relationships among the points.</li>
</ol>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4kX3ySMoSah1ANvBuSgj-99O5hMJrY5VmqCpzrZ4DAbETriRfSBXKeU8HdLODjg5oCFRNpRet8HDIWDDTR7nJUTMGkwPUBd6qPp0avw8VjnECGve2QkqZiyLontf3LJaM1UeIy3bvws/s1600/CHAndRelationship.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4kX3ySMoSah1ANvBuSgj-99O5hMJrY5VmqCpzrZ4DAbETriRfSBXKeU8HdLODjg5oCFRNpRet8HDIWDDTR7nJUTMGkwPUBd6qPp0avw8VjnECGve2QkqZiyLontf3LJaM1UeIy3bvws/s1600/CHAndRelationship.png" height="252" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Left Image: Extraction of contour points thru Convex Hull.<br />
Right Image: Relationship among the points extracted.</td></tr>
</tbody></table>
<b>How to build the relationships among the points?</b><br />
As shown both in the deterministic algorithm presented in the last post and in the image above, the points preserve an ordering that can be leveraged to reduce the complexity of the problem.<br />
Consider for instance the fern case:<br />
<ul>
<li>the transformation that leads to the fixed point lying on the contour (highlighted in Image-1) can be obtained (as explained in points 1 and 2 of the deterministic algorithm) just by creating relationships among the consecutive points of the set $ l_1 $ described in <i>Image-1</i> and in the image below:</li>
</ul>
<br />
<div style="text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcBR9A89o7iwRfCrrjrNhxSsPG4uAI-7X_R-vFRLvOtzH7ifHKRFdt0BWQmKo2QqrNApbDFGGetqbpTp-y3FQglixaxcyhiRe1KQ1WgrDgbyk7zv2euDbuqW6uiv2cMwvTM2WUe31bW8A/s1600/sets2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcBR9A89o7iwRfCrrjrNhxSsPG4uAI-7X_R-vFRLvOtzH7ifHKRFdt0BWQmKo2QqrNApbDFGGetqbpTp-y3FQglixaxcyhiRe1KQ1WgrDgbyk7zv2euDbuqW6uiv2cMwvTM2WUe31bW8A/s1600/sets2.png" height="640" width="452" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Set of points to determine the Transformations</td></tr>
</tbody></table>
At this point we have obtained the parameters of the first transformation, but we still know neither the number of transformations we really need to describe the fractal image, nor the probabilities associated with them.<br />
<div>
For the first problem we have two strategies: brute force, that is, trying all the possible combinations, or, more wisely, the soft computing approach described below.</div>
<div>
<b><br /></b></div>
<div>
<b>The fuzzy clustering approach (...unfortunately, some formulas are required here)</b></div>
<div>
Fuzzy clustering is a very well known algorithm that assigns the membership of a point to a cluster with a certain "degree": the point $p_i$ might belong to cluster $A$ with degree $x$ and to cluster $B$ with degree $y$.</div>
<div>
<br />
In this problem I modified the algorithm in this way:</div>
<div>
<ul>
<li>the clusters are the mapping functions that we want to discover.</li>
<li>the probability associated with the fractal maps can easily be derived from the "size" of each cluster.</li>
<li>The fuzzy membership function is based on the minimisation of the distance between the point obtained by applying the contractive map to $x_i$, $x_i^{'} = \tau_j(x_i) $, and all the other points falling in a reasonable neighbourhood. The animation below describes it.</li>
</ul>
<ol>
</ol>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='440' src='https://www.blogger.com/video.g?token=AD6v5dzC8KZkXyp81Ff716ZcUbO4madNpJxkUKcZeTAc6nmV5mcrbdaQNNXLfmUKYHghrZRie7RJyBBqRmruGVGeQg' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<ul>
<li style="text-align: left;"><span style="font-family: inherit;">The update step of the centroids is aimed to minimise the distance between $ d(\tau (x_i)),\phi(x_i,\tau_j))$ </span>computed<span style="font-family: inherit;"> over each $\tau$ and each $x_i$, where </span></li>
$\phi(x_i,\tau_j) = \arg\min_{x_s} d(\tau_j(x_i),x_s) $
<li><span style="font-family: inherit;">To minimise the above function, I used the gradient technique, working on the the following cost function:</span></li>
<li><span style="font-family: inherit; text-align: center;">$ E(\theta)= \sum_{i=1}^{n}{(\mu_{x_i}(\tau_j)\cdot[d(\tau_j(x_i),\phi(x_i,\tau_j))]^2)} $</span></li>
<li style="text-align: left;">For each mapping function, the correction to each param is the following:</li>
<li><span style="text-align: center;">$\theta_i= \theta_i - \eta \frac{\partial E(\theta)}{\partial \theta_i}$</span></li>
</ul>
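A numerical sketch of this update in Python: the six parameters of a candidate map are packed into $\theta$ and adjusted with a finite-difference gradient of $E(\theta)$ (the learning rate <code>eta</code> and step <code>h</code> are hypothetical):<br />
<pre><code>import numpy as np

def cost(theta, X, mu):
    # E(theta) = sum_i mu_i * d(tau(x_i), phi(x_i, tau))^2, where
    # phi(x_i, tau) is the fractal point nearest to tau(x_i)
    A, q = theta[:4].reshape(2, 2), theta[4:]
    Y = X @ A.T + q                              # tau applied to every point
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return float((mu * d2.min(axis=1)).sum())

def gradient_step(theta, X, mu, eta=1e-3, h=1e-5):
    # finite-difference gradient over the six affine parameters
    grad = np.zeros(6)
    for i in range(6):
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (cost(up, X, mu) - cost(down, X, mu)) / (2 * h)
    return theta - eta * grad
</code></pre>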
<span style="font-family: inherit;"></span><br />
<span style="font-family: inherit;">
</span>
<br />
<ul></ul>
<ul>
</ul>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
<b>Results</b></div>
<div style="text-align: left;">
Starting from the image of the fern (a jpg image containing 100,000 points), the application of the algorithm for the determination of the contractive maps gives the following result:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8j9bDg_XB6cxhQIUGuvT0zIMVTwCCd2mAUAZ18uBtAQ5Ol03nFPq0Pkqn-yfSaWZj07NDm7bI1GF_XmHWdCiMzgvHpJj33hoHfHxB86VhnikzTUaKKK9F_zxVwTw4MnurSRJ7mvXymnQ/s1600/res.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8j9bDg_XB6cxhQIUGuvT0zIMVTwCCd2mAUAZ18uBtAQ5Ol03nFPq0Pkqn-yfSaWZj07NDm7bI1GF_XmHWdCiMzgvHpJj33hoHfHxB86VhnikzTUaKKK9F_zxVwTw4MnurSRJ7mvXymnQ/s1600/res.png" height="402" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Contractive maps detection algorithm: in black the points of the original fern, in red,green, blue the three estimated maps.</td></tr>
</tbody></table>
<div style="text-align: left;">
</div>
<ul>
<li>The results are not bad; the weakness of the algorithm lies in the extraction of the points of the convex hull of the fractal.</li>
<li>The application of a smarter contour extraction algorithm should further improve the accuracy.</li>
</ul>
<div>
<b>Disclaimer</b></div>
<div>
In the last posts about fractals I often abused the term "contour" of a fractal. ...Theoretically speaking, a fractal cannot have a contour, but to make the pragmatic aspects clear I deliberately chose that terminology.</div>
<div>
<br /></div>
<b>Fractals: a deterministic and recursive algorithm to reproduce it (Part II)</b><br />
<b>Abstract:</b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxpGBX05D0vzcJvlVkiST-F089Gvu8zdFcZGcgI2xg9j72Jl8C8L9esbEpeDaoTtHL_j44IIkw0IgYYwWeRalrGhXO2hlfdvTjIDpETdTpm8FUV3tORTk9ZT59-uklQ23q2jeZtFFf5HA/s1600/recursionalgo.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxpGBX05D0vzcJvlVkiST-F089Gvu8zdFcZGcgI2xg9j72Jl8C8L9esbEpeDaoTtHL_j44IIkw0IgYYwWeRalrGhXO2hlfdvTjIDpETdTpm8FUV3tORTk9ZT59-uklQ23q2jeZtFFf5HA/s1600/recursionalgo.bmp" height="198" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fern fractal estimation thru recursive and deterministic algorithm.</td></tr>
</tbody></table>
<b><br /></b>
A fractal can be described as the fixed point of a Markov process: given the proper contractive maps, it's easy to generate it.<br />
<div>
The question is: given the contractive maps, is it possible to generate the fractal using a purely deterministic approach?</div>
<div>
<b><br /></b></div>
<div>
<b>Problem</b></div>
<div>
We already observed that iterating the contractive maps, starting from a fixed point that lies on the border of the fractal, returns points of the convex hull of the fractal.</div>
<div>
<br />
<ul>
<li>What happens if we recursively apply this schema to the points obtained from the above procedure?</li>
<li>Is it possible to recreate the Markov process (or at least approximate it), removing any probabilistic aspects?</li>
</ul>
</div>
<div>
The algorithm below, given the contractive maps of the fractal, returns at each iteration a better approximation of it.<br />
<br />
At each iteration it produces a finer approximation of the contour of the original image.<br />
<br /></div>
<div>
<b>The Algorithm (by practical means)</b></div>
<div>
<ol>
<li><i>Consider a fixed point $P_0$ of a contractive map $\tau_1$ that lies on the convex hull of the fractal.</i></li>
<li><i>Choose randomly a point $P_i$ of the fractal and apply the above contractive map until the distance to $P_0$ is negligible.</i></li>
<li><i>Map the point calculated at step 2 sequentially with all the contractive maps.</i></li>
<li><i>Map each point obtained from step 3 with $\tau_1$ until the distance to $P_0$ is negligible.</i></li>
<li><i>If[ITERATIONS< K]:</i></li>
<ol>
<li><i>K--;</i></li>
<li><i>For each point obtained from step 4, go to step 3.</i></li>
</ol>
</ol>
</div>
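To make the procedure concrete, here is a Python sketch of it. The maps below are the well-known Barnsley fern coefficients, used as an assumed example; $\tau_1$ is the spiral map, whose fixed point (the tip of the fern) lies on the hull. Skipping $\tau_1$ itself at step 3 is a simplification, since its images are already covered by the orbits of step 4:<br />
<pre><code>import numpy as np

# Barnsley fern maps tau(p) = A p + q; the first entry is tau_1
MAPS = [
    (np.array([[0.85, 0.04], [-0.04, 0.85]]), np.array([0.0, 1.60])),
    (np.array([[0.00, 0.00], [0.00, 0.16]]), np.array([0.0, 0.00])),
    (np.array([[0.20, -0.26], [0.23, 0.22]]), np.array([0.0, 1.60])),
    (np.array([[-0.15, 0.28], [0.26, 0.24]]), np.array([0.0, 0.44])),
]
A1, q1 = MAPS[0]
P0 = np.linalg.solve(np.eye(2) - A1, q1)    # step 1: fixed point of tau_1

def orbit(p, eps=1e-3):
    # steps 2/4: iterate tau_1 until the distance to P0 is negligible,
    # collecting the whole trajectory
    traj = []
    while np.linalg.norm(p - P0) > eps:
        traj.append(p)
        p = A1 @ p + q1
    return traj

def recursive_ifs(K):
    frontier, points = [P0], [P0]
    for _ in range(K):                       # step 5: repeat K times
        new_frontier = []
        for p in frontier:
            for A, q in MAPS[1:]:            # step 3: apply the other maps
                traj = orbit(A @ p + q)      # step 4: iterate tau_1 again
                new_frontier += traj
                points += traj
        frontier = new_frontier
    return np.array(points)

print(recursive_ifs(K=2).shape)              # the points to plot
</code></pre>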
<div>
To explain how it works, I plotted the list of points obtained using $K=1$ iteration of the algorithm:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpM0mLNykZN2JvS3PgoYuIpylNDMZextJWH8aVtexU3VJwAPM079o8vHGRw34YnBdbzy8zevjM-nMPk6n4Mvges7Ju1twuyCjT55qVvq4BbiBVTbxkiC5kFG63b4KOYzLQge1kZV-VYNM/s1600/gifR1.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpM0mLNykZN2JvS3PgoYuIpylNDMZextJWH8aVtexU3VJwAPM079o8vHGRw34YnBdbzy8zevjM-nMPk6n4Mvges7Ju1twuyCjT55qVvq4BbiBVTbxkiC5kFG63b4KOYzLQge1kZV-VYNM/s1600/gifR1.gif" height="411" width="640" /></a></div>
<br />
<br />
<br />
The bigger $K$ is, the more accurate the result will be.<br />
<b>The above procedure works only if the contractive map $\tau_1$ has a fixed point that lies on the convex hull of the fractal!!</b><br />
<br />
<b>Results:</b><br />
<br />
I iterated the algorithm with $K=4$. At each iteration the algorithm returns a deeper description of the contour of the fractal (...even though the definition of contour doesn't make any sense for a fractal, it at least gives a practical meaning):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL9d57pzIDYc32sgcvebnrFCTi2p4qz1wa_VY6tLYE9-gzos0-hrerKVlSUyUBn0rFlpm4wmOUCeW7VYwYYKsJNJ7zupdogMdrvfzZMJz9vWpT6qYkMmP_Cx0yGEg51TFguGUyV9AXMLQ/s1600/recursionalgo.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL9d57pzIDYc32sgcvebnrFCTi2p4qz1wa_VY6tLYE9-gzos0-hrerKVlSUyUBn0rFlpm4wmOUCeW7VYwYYKsJNJ7zupdogMdrvfzZMJz9vWpT6qYkMmP_Cx0yGEg51TFguGUyV9AXMLQ/s1600/recursionalgo.bmp" height="396" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Results of the Recursive Algorithm iterated with K=1,2,3,4. </td></tr>
</tbody></table>
<br />
If we overlap the results obtained with the original fern, we get:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK5ZcYT1xtBpDJPGD591eb7hTyjclGtvudX32uzMK2SgF7lYE0e0A-YpiKZkVOdu5xinJT1Hu3NCXyYnM5tzBNY1Ft7T9lOnUcfNRxsBiP9RfZgPnWdyTMmrmQq39lyhZx6PfhkjDzumQ/s1600/algoRec2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK5ZcYT1xtBpDJPGD591eb7hTyjclGtvudX32uzMK2SgF7lYE0e0A-YpiKZkVOdu5xinJT1Hu3NCXyYnM5tzBNY1Ft7T9lOnUcfNRxsBiP9RfZgPnWdyTMmrmQq39lyhZx6PfhkjDzumQ/s1600/algoRec2.bmp" height="640" width="608" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original Fern, and the overlap with the results obtained using the recursive algorithm (K=4).</td></tr>
</tbody></table>
<b>Conclusion and next steps</b><br />
I showed how to express the IFS generated through the Markov process as a recursive and completely deterministic process.<br />
We also noticed that in the fractal there are special points (as for instance $P_0$) that play a crucial role in describing the IFS.<br />
<br />
The question now is the following:<br />
is it possible to leverage such special points and the above recursive algorithm to estimate the parameters of the contractive maps?<br />
... I will show you a procedure that partially answers the question, and some other examples of the algorithm described above!<br />
Stay Tuned.<br />
Cristian</div>
<b>Fractal parameters estimation through contour analysis - Part I</b><br />
<b>Introduction</b><br />
I really love fractal theory. Nothing fascinates me like the amazing structures generated by the chaotic processes that describe (hyperbolic) fractals.<br />
It's fairly easy to draw a fractal image:<br />
<ul>
<li><i>The input</i></li>
<ul>
<li>A small set of very easy functions (linear transformations)</li>
<li>A probability of choosing one of the aforementioned functions.</li>
</ul>
<li><i>The process</i></li>
<ul>
<li>Choose randomly a starting point.</li>
<li>Choose randomly one of the above functions according to the above defined probability.</li>
<li>Map the starting point using the above selected function.</li>
<li>iterate as long as you want :)</li>
</ul>
<li><i>Admire the meaning of life!</i></li>
<ul>
<li>just plot the points</li>
</ul>
</ul>
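As a minimal sketch, the whole process fits in a few lines of Python; the maps and probabilities below are the well-known Barnsley fern coefficients, used here as an assumed example:<br />
<pre><code>import random

# IFS: each entry is (a, b, c, d, e, f, p) for the linear transformation
# (x, y) -> (a x + b y + e, c x + d y + f), chosen with probability p
MAPS = [
    ( 0.00,  0.00,  0.00, 0.16, 0.0, 0.00, 0.01),
    ( 0.85,  0.04, -0.04, 0.85, 0.0, 1.60, 0.85),
    ( 0.20, -0.26,  0.23, 0.22, 0.0, 1.60, 0.07),
    (-0.15,  0.28,  0.26, 0.24, 0.0, 0.44, 0.07),
]

def chaos_game(n_points=100_000):
    x, y = 0.0, 0.0                  # choose randomly a starting point
    points = []
    for _ in range(n_points):
        # choose randomly one of the functions according to its probability
        a, b, c, d, e, f, _ = random.choices(MAPS, weights=[m[6] for m in MAPS])[0]
        x, y = a * x + b * y + e, c * x + d * y + f   # map the point
        points.append((x, y))
    return points                    # ...just plot the points
</code></pre>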
Wherever you start, whatever decision you make, despite the chaos, the overall picture will always be the same. The more you move, the clearer the picture will be!<br />
<br />
When I approached fractals for the first time, after reading the great Barnsley's book (Fractals Everywhere), I opened the laptop and wrote the routine to draw my first IFS!<br />
...By mistake I also plotted the lines that join the points:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='400' src='https://www.blogger.com/video.g?token=AD6v5dzJyM4LXdhKsT1QHS3AHkBG2zunrDxFmVk_X6OlYrvcHpxET3yhaw_2GwVXdcVDmm4od0ZNesTMHCjKRHd-uQ' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
...It's quite messy, isn't it?<br />
Sometimes it's better to give up connecting the points, just relax and wait until everything becomes clear. If we focus on every single detail, we risk losing what really counts!<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<br />
<div style="text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='450' src='https://www.blogger.com/video.g?token=AD6v5dxliJCJ_xIvdrBnx9LygDqDRH7HB-qlSsgZmqOFntr18SBonns-ZeKgT_xDN_PLEKQWrD9hcFzM_C3Lyw-xLw' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
<b>What is a fractal (Iterated Function System)?</b><br />
There are many formal definitions of an IFS; in my opinion the most effective describes a fractal as the fixed point of the Markov process (mentioned above).<br />
The process converges to the fractal :)<br />
<br />
<b>The power of fractals</b><br />
In the literature there are plenty of articles in which fractals are used to solve complex problems (financial analysis, stock exchange predictions, descriptions of biological processes, ...).<br />
Fractals also found a good place in Computer Science; think about image compression: the ivy leaf image takes around 25 MB of space, while all the information needed to obtain it through the IFS takes less than 1 KB!!<br />
<b>The problem </b><br />
Given the ivy leaf image, what are the parameters of the Markov process that generates it?<br />
A bit more formally, what we have to estimate is:<br />
<ul>
<li>A set of functions: $\{ \tau_1, \cdots,\tau_N \}$, where $\tau_i(p_0)=A_i \cdot p_0 +Q_i$, with:</li>
</ul>
<div>
<div style="font-family: Monaco; font-size: 12px;">
<div style="text-align: center;">
$ A= \begin{bmatrix}\alpha & \beta\\ \gamma & \delta \end{bmatrix} Q= \begin{bmatrix} \epsilon\\ \zeta \end{bmatrix} $</div>
</div>
</div>
<div>
<ul>
<li>We also need to estimate the probability of choosing each $\tau_j$</li>
</ul>
The combination of the <i>collage theorem</i> and the box counting approach is the most common technique to solve the problem.<br />
When I was a student I approached the problem from a different angle. I have to say that the results obtained were partial, but I still think that something more can be done :)<br />
Before starting we need one more notion: contractive maps (have a look at the Banach fixed-point theorem).<br />
Under certain conditions, the iteration of $\tau_j$ leads to a fixed point:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuieorZ1EgJZ65YIkCzBBSdttwfIefS5avXLIPGau6BIQEIGCvPrm4Fgn6FaBDCY_hjTptgWPa7SrAikKhojngexs5TUsmWJSNgsIxuXHFBZdMT7D3o09Q407J6sT9n6tgovrRFseibd0/s1600/puntiFissi2.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuieorZ1EgJZ65YIkCzBBSdttwfIefS5avXLIPGau6BIQEIGCvPrm4Fgn6FaBDCY_hjTptgWPa7SrAikKhojngexs5TUsmWJSNgsIxuXHFBZdMT7D3o09Q407J6sT9n6tgovrRFseibd0/s1600/puntiFissi2.gif" height="320" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of contractive map applied to an ellipsoid.<br />
It converges to a fixed pointed.</td></tr>
</tbody></table>
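In code: iterating a contractive affine map from any starting point converges to the unique fixed point $p^* = (I-A)^{-1} Q$. The map below is a hypothetical example with contraction factor below 1:<br />
<pre><code>import numpy as np

A = np.array([[0.85, 0.04], [-0.04, 0.85]])   # hypothetical contractive map
Q = np.array([0.0, 1.6])

p = np.array([10.0, -3.0])                    # arbitrary starting point
for _ in range(200):
    p = A @ p + Q                             # iterate tau(p) = A p + Q

print(p)                                      # the orbit has converged...
print(np.linalg.solve(np.eye(2) - A, Q))      # ...to the closed-form fixed point
</code></pre>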
<br />
<br />
<b>First conjecture:</b><br />
<br />
<ul>
<li>An IFS is characterised by a fixed point that lies on its convex hull.</li>
<li>From a fixed point that lies on the border of the IFS, the iterations of the contractive maps that generate the fractal return the convex hull of the fractal.</li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtPLXahdUL37YUl15huYLio7uyj_wuTydWSPmbLMPwlevV3fLoe16RcMvJp0kxMLSORsgMutuSeMcozKL6U4kblyvBIHLhNiWBqiL7B3CBMWhLSVI6G_Hk_BY6rK-yHZMhXtgzKArAtSo/s1600/bordiDalPF.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtPLXahdUL37YUl15huYLio7uyj_wuTydWSPmbLMPwlevV3fLoe16RcMvJp0kxMLSORsgMutuSeMcozKL6U4kblyvBIHLhNiWBqiL7B3CBMWhLSVI6G_Hk_BY6rK-yHZMhXtgzKArAtSo/s1600/bordiDalPF.gif" height="316" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The ivy leaf IFS is generated by 4 contractive maps. Each color describes the map used to generate the point.</td></tr>
</tbody></table>
The animated gif above shows the meaning of the conjecture.</div>
<div>
Experimental evidence that at least one fixed point lies on the convex hull of the fractal can be obtained by changing the parameters of the maps:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxoxh7F6EH08syKr-pv7aJks4oBsOrZjKyZ_v4HA_eqWRZ0_FtT-dI6PmToqTuJF0m1PQgEAdTFc5FYbcy74aw659XO2QKINMOt6yRJ7NXc38iIDklzfd4L-FnKi-ascU7dg4HwAV8ho/s1600/congetturaPF.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbxoxh7F6EH08syKr-pv7aJks4oBsOrZjKyZ_v4HA_eqWRZ0_FtT-dI6PmToqTuJF0m1PQgEAdTFc5FYbcy74aw659XO2QKINMOt6yRJ7NXc38iIDklzfd4L-FnKi-ascU7dg4HwAV8ho/s1600/congetturaPF.gif" height="247" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The light blue points depict the fixed point of the maps used to plot the fractals.</td></tr>
</tbody></table>
<div>
Despite the changes in the maps, the fixed point at the top of the leaf still lies on the convex hull of the fractal.</div>
<div>
In the next post I'll show you a nice recursive algorithm I found to obtain different <i>levels</i> of convex hull for an IFS.</div>
<div>
Stay tuned</div>
<div>
Cristian.</div>
<b>Waiting Time Polynomials: tech explanation - last part</b><br />
This is the last step to complete the explanation of the waiting time polynomials formula.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
Unfortunately it's a bit technical, but I'm sure that it can be understood without deep math knowledge.<br />
At the very end: <i>if you can't explain it simply, you don't understand it well enough!</i> (A. Einstein)<br />
<br />
<b>The trick</b><br />
Last time we left off with the tally of the overall waiting time $ w(x_i) = \phi(x_i)-|x_i| $, where $\phi(x_i)$ returns the position of the last $x_i$ in the vector $V$.<br />
Let's have a look at the following example, which will be used during the remaining steps.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHvLxGLVoe9iihGw9Q5LLJC3xZDyEyrR93i-ViuA-ovZIg3GwIzTeDqRhl78fYpbBigxlZCVe2Q6OPpwlTF4gBsBLSIs-BkjgU3WvHb0jR5aBRS62ZLZu9Ta2dwiJexQnIwXGVTkw7PU/s1600/forpaperExample_1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHvLxGLVoe9iihGw9Q5LLJC3xZDyEyrR93i-ViuA-ovZIg3GwIzTeDqRhl78fYpbBigxlZCVe2Q6OPpwlTF4gBsBLSIs-BkjgU3WvHb0jR5aBRS62ZLZu9Ta2dwiJexQnIwXGVTkw7PU/s1600/forpaperExample_1.jpg" /></a></div>
There are two questions that might be answered:<br />
<br />
<ul>
<li>given $|x_1|= i, |x_2|= j, |x_3|= k $, what are the admitted tuples $\{w(x_1),w(x_2),w(x_3)\}$?</li>
<li>given $\{w(x_1)=I,\, w(x_2)=J,\, w(x_3)=K\}$, how many vectors $V$ can be built?</li>
</ul>
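Both questions can be checked by brute force for small counts; here is a Python sketch that follows directly from the definition $w(x_i)=\phi(x_i)-|x_i|$:<br />
<pre><code>from itertools import permutations

def waiting_times(V, alphabet=(1, 2, 3)):
    # w(x_i) = phi(x_i) - |x_i|, where phi(x_i) is the 1-based
    # position of the last occurrence of x_i in the vector V
    w = []
    for a in alphabet:
        phi = max(pos for pos, v in enumerate(V, start=1) if v == a)
        w.append(phi - V.count(a))
    return tuple(w)

def tally(i, j, k):
    # enumerate all distinct vectors V with |x_1|=i, |x_2|=j, |x_3|=k
    # and count how many of them realise each waiting-time tuple
    counts = {}
    for V in set(permutations((1,) * i + (2,) * j + (3,) * k)):
        wt = waiting_times(V)
        counts[wt] = counts.get(wt, 0) + 1
    return counts

print(tally(2, 2, 2))   # {(w1, w2, w3): number of vectors V, ...}
</code></pre>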
To answer the first question, I noticed that the three coloured cells are the only ones that really count.<br />
The idea is the following:<br />
<br />
<ol>
<li>consider the three cells as placeholders</li>
<li>analyse the admitted values for each of them</li>
<li>replace the placeholders with all the possible permutations of the alphabet $\{x_1,x_2,x_3\}$.</li>
</ol>
<br />
Let's start with the case depicted in the image above, where we assumed that $\phi(x_1) < \phi(x_3) < \phi(x_2) $; then we have the following facts:<br />
<ul>
<li>$ \phi(x_1)$ can take values between: $0 \leq \phi(x_1) \leq |x_1|+|x_2|+ |x_3|-2$</li>
<li>$ \phi(x_2)$ can take just one value: $|V|=|x_1|+|x_2|+ |x_3|$</li>
<li>The upper bound of $ \phi(x_3)$ is $|V|-1$, because it can slide up to the second-to-last element of $V$, that is, $\phi(x_2)-1$.</li>
<li>What about the lower bound of $\phi (x_3)$? We have two cases, depicted in the image below:</li>
</ul>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs4jnlIS3Jp2HC6DM5oJrnNOi1l-csHMk0UuEgVu2z4Mya4q11m9_i2fMUPd1zjqFbUz6qzxbsGHseKsd-YMQuQfmUcnhclZWImNpk6h9WQg0yJQf2SIKhLzpIZBmQvhH9PKmVzeEgZs/s1600/x1x2x3phi3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDs4jnlIS3Jp2HC6DM5oJrnNOi1l-csHMk0UuEgVu2z4Mya4q11m9_i2fMUPd1zjqFbUz6qzxbsGHseKsd-YMQuQfmUcnhclZWImNpk6h9WQg0yJQf2SIKhLzpIZBmQvhH9PKmVzeEgZs/s1600/x1x2x3phi3.png" height="327" width="400" /></a></div>
<br />
<br />
To sum up, so far we have explained the following part of the formula (I hope you don't mind that I changed the index notation a bit):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB222xwlysN7cL34R_XqLY0zrdfJRbeKVWErREzSppgrXSeUyd6pArKaxWTEKYWsdRzual3EiYfMnLED40yLJ6j8U0k6SbFkgmAvHEPTEreHNPSoKPM-9tQmDXtHulFzYUSCHCV161R-A/s1600/formula_q_indexes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB222xwlysN7cL34R_XqLY0zrdfJRbeKVWErREzSppgrXSeUyd6pArKaxWTEKYWsdRzual3EiYfMnLED40yLJ6j8U0k6SbFkgmAvHEPTEreHNPSoKPM-9tQmDXtHulFzYUSCHCV161R-A/s1600/formula_q_indexes.png" height="71" width="400" /></a></div>
<br /></div>
<div>
We now have to consider that for each configuration $\{w(x_1)=I,\, w(x_2)=J,\, w(x_3)=K\}$ we can have more than one vector $V$.<br />
Do you like combinatorics? The remaining part is just a matter of tallying, and the only formula we need is the one for permutations with repetitions. I'll let you brush up on the concept on your favourite website for basic combinatorics.<br />
<br />
The formula can be split into two chunks, because we have two blocks of cells to be filled:<br />
<br />
<ol>
<li>In how many ways can we fill the cells between the positions $[1,\phi(x_1)]$?</li>
<li>In how many ways can we fill the cells between the positions $[\phi(x_1),\phi(x_2)]$? </li>
</ol>
<br />
To answer the first question, we have to find the values of the denominator of the following:<br />
<div style="font-family: Monaco; font-size: 12px;">
\[\frac{(\phi(x_1)-1)!}{\#(x_1)!\#(x_2)!\#(x_3)!}\]</div>
<ul>
<li>we have $\phi(x_1)-1$ cells that can be filled.</li>
<li>these cells contain all the instances of $x_1$ (except for the last one, which occupies position $\phi(x_1)$), i.e. $|x_1|-1$ of them</li>
<li>the number of $x_3$ instances depends on $\phi(x_1)$ and $\phi(x_3)$:</li>
</ul>
<ul>
</ul>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiwjSl4WndBeCnWcf1a_dg8PQulf-RfNpGPtFdjHzyd0j51p6TEh1TmeGqS3E4R-SnWnQL1pIn2LdJm7m9NfE1AUInAdJE1yqcJHmkU2q8TuAdlJYNaJVK1wj27Ny86ZSUb9eajXL4FR8/s1600/slot1x1x2x3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiwjSl4WndBeCnWcf1a_dg8PQulf-RfNpGPtFdjHzyd0j51p6TEh1TmeGqS3E4R-SnWnQL1pIn2LdJm7m9NfE1AUInAdJE1yqcJHmkU2q8TuAdlJYNaJVK1wj27Ny86ZSUb9eajXL4FR8/s1600/slot1x1x2x3.png" height="327" width="400" /></a></div>
<br />
<ul>
<li>the computation of the number of instances of $x_2$ in the first slot is straightforward, and it can easily be derived by difference:</li>
<ul>
<li>$\frac{(\phi(x_1)-1)!}{(|x_1|-1)!\,(|x_3|-\min(|x_3|,\phi(x_3)))!\,\#(x_2)!}$;</li>
<li>$(|x_1|-1)+(|x_3|-\min(|x_3|,\phi(x_3)))+\#(x_2)=\phi(x_1)-1$</li>
<li>so $ \#(x_2)= \phi(x_1)-|x_1|-(|x_3|-\min(|x_3|,\phi(x_3))) $</li>
</ul>
</ul>
</ul>
This explains the following boxed part of the formula:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZrGFsEwkz8k3X9-CBHc6WWkn98yByjtfBbAVmNTEn2f4eHb3h0k1-CVBdZnmMWyEPNK0IwBJ9HohyphenhyphenScaUax9Vp3AB9Fpm42q3IFjvNOPEYxhloTYexlnRT7d_Ul9FAhwRN2-8e9DSR-A/s1600/formula_slot1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZrGFsEwkz8k3X9-CBHc6WWkn98yByjtfBbAVmNTEn2f4eHb3h0k1-CVBdZnmMWyEPNK0IwBJ9HohyphenhyphenScaUax9Vp3AB9Fpm42q3IFjvNOPEYxhloTYexlnRT7d_Ul9FAhwRN2-8e9DSR-A/s1600/formula_slot1.png" height="70" width="400" /></a></div>
<div>
<br /></div>
<div>
The final step is to count in how many ways we can fill slot 2, given by the interval $[\phi(x_1),\phi(x_3)]$; to make the formula more readable, let's rename $(|x_3|-\min(|x_3|,\phi(x_3))) = \epsilon$.</div>
<div>
As we did for the first slot, we have to identify the values of the denominator of the formula below:</div>
<div>
<div>
<div style="font-family: Monaco; font-size: 12px;">
\[\frac{(\phi(x_3)-\phi(x_1)-1)!}{\#(x_2)!\#(x_3)!}\]</div>
<br />
<ul>
<li>Out of $|x_3|$ instances, $\epsilon$ have been placed in slot 1, so slot 2 contains exactly $|x_3|-1- \epsilon$.</li>
<li>again by difference we can get the instances of $x_2$:</li>
<ul>
<li>the occurrences of $x_2$ before $\phi(x_3)$ are exactly $\phi(x_3)- (|x_1|+|x_3|)$</li>
<li>the occurrences of $x_2$ in the slot 1 (listed above) are: $ \#(x_2)= \phi(x_1)-|x_1|-\epsilon $</li>
<li>that is : $ \#(x_2)=\phi(x_3)- (|x_1|+|x_3|)- \phi(x_1)+|x_1|+ \epsilon$</li>
<li>finally we have: $ \#(x_2)=\phi(x_3)-\phi(x_1)-|x_3|+ \epsilon$</li>
</ul>
</ul>
<br />
<span style="font-family: inherit;">That's complete the proof of the formula.</span><br />
<span style="font-family: inherit;">It's quite easy now extend the formula to more variables. The only expedient to make it easier is to remove from the formula the operator $Min$ splitting the formulas in two branches. </span><br />
<span style="font-family: inherit;">I'll show it in paper.</span><br />
<span style="font-family: inherit;"><b>Note about Complexity</b></span><br />
What's the most painful point of this formula?<br />
<span style="font-family: inherit;">... The introduction of placeholders requires applying the formula for each permutation of the variables involved: having $k$ variables, we need to evaluate the formula $k!$ times.</span><br />
<span style="font-family: inherit;">Anyway, I don't expect to use the formula </span>with large sets of variables; after all, the principle of selecting a small and relevant set of features is always recommended!<br />
<br />
<span style="font-family: inherit;">As usual Stay Tuned.</span><br />
<span style="font-family: inherit;">Cristian.</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-42621368039004871342014-04-19T05:42:00.002-07:002014-04-19T10:51:30.384-07:00Waiting Time Polynomials: how to derive the analytical formula: Part IV<div>
<b>Introduction before you start</b><br />
I got many clarification requests about the Waiting Time Polynomials I published on the blog in the last three posts.<br />
The paper is almost ready to be submitted for review, but I think that some technical explanation might be interesting also for a non-academic audience.<br />
I consider myself a curious and hungry seasoned student, and I know how tedious it can be to read formulas and mathematical passages, especially on a blog!!<br />
So why technical explanations? <br />
The answer is in the following quote from one of my favourite scientists, Gregory Chaitin. In "The quest for Omega" he wrote:<br />
<span style="font-size: x-small;"><i><span style="font-family: 'CMR12';"><br /></span></i>
</span><i><span style="font-family: 'CMR12';">The books I loved were books where the author’s personality shows through,
books with lots of words, explanations and ideas, not just formulas and
equations! I still think that the best way to learn a new idea is to see
its history, to see why someone was forced to go through the painful and
wonderful process of giving birth to a new idea! To the person who discovered
it, a new idea seems inevitable, unavoidable. The first paper may be clumsy,
the first proof may not be polished, but that is raw creation for you, just as
messy as making love, just as messy as giving birth! But you </span><span style="font-family: 'CMBX12';">will </span><span style="font-family: 'CMR12';">be able
to see where the new idea comes from. If a proof is “elegant”, if it’s the
result of two-hundred years of finicky polishing, it will be as inscrutable as a
direct divine revelation, and it’s impossible to guess how anyone could have
discovered or invented it. It will give you no insight, no, probably none at
all.</span><span style="font-family: 'CMR12';"> </span></i><br />
<div class="page" title="Page 23">
<div class="layoutArea">
<div class="column">
<span style="font-family: CMR12;"><br /></span>
<br />
<span style="font-family: inherit;">That's the spirit that leads the following explanation!</span><br />
<br /></div>
</div>
</div>
<b>Definition of the problem</b></div>
<div>
Given an alphabet of 3 elements <span style="font-size: x-small;">$\{X_1,X_2,X_3\}$</span>, the function <span style="font-size: x-small;">$w(X_i) $</span> counts the number of failed trials before the last event <span style="font-size: x-small;">$ X_i $</span>.<br />
Consider now the following configuration:<span style="text-align: center;"> </span><span style="font-size: x-small; text-align: center;">\[ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0 \]</span><br />
<br />
<ul>
<li>What are the admitted sequences <span style="font-size: x-small;">$\{w(X_1),w(X_2),w(X_3)\}$ </span>?</li>
</ul>
<br />
<b>Step I: Find all the possible configurations of events</b><br />
How can we list the sequences of length $Z$ that can be built with <span style="text-align: center;"><span style="font-size: x-small;">$ \{\left\vert{X_1}\right\vert =i , \left\vert{X_2}\right\vert =j,\left\vert{X_3}\right\vert =k\}: i+j+k= Z \wedge i,j,k>0$</span> ?</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfaZecCkxdatn8mp_tY0jFgl2k7jj3nO0h97HY_xj8sPfxeDMJhzxulcn3aoXVX5ZbpQXawrywkj45D1bYhPAum_5vkG2D-0_t7Cl5PuxcBi3Du33Mr0IIHBIgJBQ_cEUTOPSE0-sss-I/s1600/vector.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfaZecCkxdatn8mp_tY0jFgl2k7jj3nO0h97HY_xj8sPfxeDMJhzxulcn3aoXVX5ZbpQXawrywkj45D1bYhPAum_5vkG2D-0_t7Cl5PuxcBi3Du33Mr0IIHBIgJBQ_cEUTOPSE0-sss-I/s1600/vector.jpg" height="158" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of overall waiting time $w(x_i)$ in a succession of events.</td></tr>
</tbody></table>
</div>
<div>
<ul>
<li><span style="text-align: center;">once we set the values of the first two variables, the third it's determined by <span style="font-size: x-small;">$Z-i-j$</span>.</span></li>
<li><span style="text-align: center;">we imposed that all the variables </span>occur at least once, so we $X_1$ can assume all the values between <span style="font-size: x-small;">$[1,Z-2]$.</span></li>
<li>for each value of <span style="font-size: x-small;">$X_1$</span> the variable <span style="font-size: x-small;">$X_2$</span> can assume values between <span style="font-size: x-small;">$[1,Z-i]$.</span></li>
<li> $p_i$ is the probability that $X_i$ occur in a Bernullian trial.</li>
</ul>
Now we have all the ingredients to make the cake:</div>
<div>
<br />
<div style="text-align: center;">
<span style="font-size: x-small;"> <span style="font-family: Monaco;">$ \sum_{i=1}^{Z}\sum_{j=1}^{Z-i}\sum_{k=1}^{Z-i-j}{p_1^ip_2^jp_3^k}$</span><span style="font-family: Monaco;"> </span> </span></div>
</div>
<div>
<br /></div>
<div>
In the first two summations, the upper limits have been left at <span style="font-size: x-small;">$Z$</span> and <span style="font-size: x-small;">$Z-i$</span> just to keep the formula clean.</div>
<div>
...I let you prove why the result doesn't change :).<br />
One last point about this step: the limit of the above summation for <span style="font-size: 13px;">$Z \rightarrow \infty$ is $\frac{p_1 p_2 p_3}{\left(1-p_1\right) \left(1-p_2\right) \left(1-p_3\right)}$</span><br />
This limit will be used to build the probability density function.<br />
Curiosity (helpful for complexity analysis...):<br />
<ul>
<li>The number of sequences that can be built with vectors of length in <span style="font-size: x-small;">$[3,Z]$</span> is <span style="font-size: x-small;">$\binom{Z}{3}$</span></li>
<li>The number of sequences that can be built with vectors of length exactly <span style="font-size: x-small;">$Z$</span> is <span style="font-size: x-small;">$\binom{Z-1}{2}$</span> (both counts are checked in the sketch below)</li>
</ul>
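A quick numerical sanity check of these counts (a minimal Python sketch, assuming nothing beyond the definitions above):<br />
<pre>
from math import comb

def configurations(Z):
    # all (i, j, k) with i + j + k == Z and i, j, k at least 1
    return [(i, j, Z - i - j)
            for i in range(1, Z - 1)
            for j in range(1, Z - i)]

Z = 10
print(len(configurations(Z)), comb(Z - 1, 2))             # 36 36
print(sum(len(configurations(z)) for z in range(3, Z + 1)),
      comb(Z, 3))                                         # 120 120
</pre>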
<b>Step II: Waiting for an event!</b></div>
<div>
<div>
<span style="text-align: center;">What's the easiest way to describe the overall waiting time for an event in a finite succession?</span><br />
<span style="text-align: center;">There are many ways to get the <span style="font-size: x-small;">$w(x_i)$</span>, the easiest I found is given by the position of the last occurrence of<span style="font-size: x-small;"> $x_i$ </span>minus the number of occurrences of<span style="font-size: x-small;"> $x_i$</span>.</span><br />
For instance, let's consider <span style="font-size: x-small;">$w(x_1)$</span>:</div>
<div style="text-align: left;">
<ul>
<li>The position of the last occurrence of <span style="font-size: x-small;">$x_1$</span> is <span style="font-size: x-small;">$8$</span>;</li>
<li><span style="font-size: x-small;"> <span style="text-align: center;">$\left \vert{X_1} \right \vert = 4 $ </span></span></li>
<li><span style="font-size: x-small;">$w(X_1)=4$</span></li>
</ul>
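In code, the definition is a one-liner; here is a minimal Python sketch (the example sequence is mine, not the one in the figure):<br />
<pre>
def overall_waiting_time(seq, sym):
    # position of the last occurrence (1-based)
    # minus the number of occurrences of sym
    positions = [i + 1 for i, s in enumerate(seq) if s == sym]
    return positions[-1] - len(positions)

# x1 occurs 4 times, the last at position 8, so w(x1) = 8 - 4 = 4
seq = ['x1', 'x2', 'x1', 'x3', 'x1', 'x2', 'x3', 'x1', 'x2']
print(overall_waiting_time(seq, 'x1'))  # 4
</pre>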
<b>Where we are:</b><br />
The first two steps explain the circled pieces of the formula:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL47NSylmrnHPhvhyfWjseRsQwzwV4aio_8dAU5hhSPsr3Db7FhGekmL1b97Ivqq85MSIdu6Tk7mJl4cNJir12Bd9AO-ZdSWXiZ40B1d9dlR7KP5J2tjAGtrwDeiW7K3ChZU53RblC8fs/s1600/formulaStepIV.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL47NSylmrnHPhvhyfWjseRsQwzwV4aio_8dAU5hhSPsr3Db7FhGekmL1b97Ivqq85MSIdu6Tk7mJl4cNJir12Bd9AO-ZdSWXiZ40B1d9dlR7KP5J2tjAGtrwDeiW7K3ChZU53RblC8fs/s1600/formulaStepIV.bmp" height="70" width="400" /></a></div>
<br />
<b>What the "<i>overall waiting time</i>" for?</b></div>
<div style="text-align: left;">
For each event <span style="font-size: x-small;">$X_i$ </span>we are counting the holes among all its occurrences, so the smaller the overall waiting time, the closer together the occurrences of <span style="font-size: x-small;">$X_i$</span> are: it's a measure of proximity for the occurrences of <span style="font-size: x-small;">$X_i$</span>.</div>
<div style="text-align: left;">
What I did is to extend such a measure (it would be interesting to prove that it's really a measure!) to different kinds of events (random variables) <span style="font-size: x-small;">$\{X_1, X_2,...,X_n\}$</span> over the discrete time line.</div>
<div style="text-align: left;">
<b>Applications</b></div>
<div style="text-align: left;">
There are several areas for which this kind of analysis might be helpful; last time I showed its application as a powerful document classifier, where each variable <span style="font-size: x-small;">$X_i$</span> is a word of a document.</div>
<div style="text-align: left;">
If we consider a document as a succession of <span style="font-size: x-small;">$Z$</span> words, the proximity measure induced by the waiting time polynomials is a sort of fingerprint for the document, since for similar documents we expect the same words to be characterised by similar overall waiting times.<br />
Moreover, the dependencies among the words are considered, since we simultaneously take into account an arbitrary number of words (the alphabet <span style="font-size: x-small;">$\{X_1, X_2,...,X_n\}$</span>).</div>
<div style="text-align: left;">
<br />
In the next post I'll explain the logic behind the remaining pieces of the puzzle, which will make it easier to generalise the approach to an arbitrary alphabet.<br />
Stay Tuned!<br />
cristian</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com161tag:blogger.com,1999:blog-7631116270195175228.post-11517365870013432482014-04-08T12:41:00.002-07:002014-04-09T12:48:25.294-07:00Coefficients of Waiting Time Polynomials listed in OEIS Database!I'm happy to announce that the sequence generated by the coefficients of the waiting time polynomials has been listed in the OEIS database (The On-Line Encyclopedia of Integer Sequences).<br />
The sequence is: <a href="https://oeis.org/A239700" target="_blank">A239700</a>.<br />
In the next posts I'm going to explain how I derived the analytical formula of the polynomials.<br />
As usual: Stay Tuned!<br />
cristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-33518631146971672952014-03-28T13:59:00.000-07:002014-04-07T08:40:12.645-07:00Coefficients of Waiting Time Polynomials: a nice representation I was doing some simulations for the paper I'm writing about the <a href="http://textanddatamining.blogspot.ch/2014/03/waiting-time-polynomials-document.html" target="_blank">waiting time polynomials</a>, and I got something unexpected.<br />
If we sort in lexicographic order all the polynomials generated for a given length of the events vector, and we consider their coefficients, what we get is an apparently irregular series of integers.<br />
To capture some regularity, I decided to plot this matrix:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb6FiLPwQlnNg7jeZYBy4UakbfsvoHl8DOqKF_paGAK9bSkNdve6z15gZVG5irYbfEo6W9cH6OcVy9hXsPsQombSAE679CdYeb_PAZc4kulwZhlIm4d2KYqsdGiphyXFDWkvSuxwf_Www/s1600/coefficientMatrix.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb6FiLPwQlnNg7jeZYBy4UakbfsvoHl8DOqKF_paGAK9bSkNdve6z15gZVG5irYbfEo6W9cH6OcVy9hXsPsQombSAE679CdYeb_PAZc4kulwZhlIm4d2KYqsdGiphyXFDWkvSuxwf_Www/s1600/coefficientMatrix.bmp" height="372" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The matrix plot, obtained considering an alphabet of three words and vectors having length=35.</td></tr>
</tbody></table>
<br />
Isn't it nice?<br />
Sometimes an appropriate graphical representation helps capture interesting aspects of your data.<br />
<br />
Stay Tuned<br />
CristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-85563826640073463122014-03-18T15:02:00.001-07:002014-03-24T12:47:01.296-07:00Waiting Time Polynomials Document Classifier - Part III<b>Abstract</b><br />
This post presents an innovative definition of polynomials associated with waiting time processes (analyzed in the former <a href="http://textanddatamining.blogspot.ch/2014/03/statistical-distribution-to-describe.html" target="_blank">posts</a>). Such polynomials are here successfully used as a document classifier. Comparative tests with SVM show significant accuracy improvements.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji_ATL53FcJYvJAykNm3PCsYFHA-ns2jHSf0TClQnAKOnNizznna8U6kVNqNFW0KVm0AtcfvQBS7EVY4MgppkliHoU-msW8Jp8XgdIHXiMrUs-2tjEJpOPVV6Zj6jOnvR6aZyLUtFevDg/s1600/WTPvsSVMsize35.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji_ATL53FcJYvJAykNm3PCsYFHA-ns2jHSf0TClQnAKOnNizznna8U6kVNqNFW0KVm0AtcfvQBS7EVY4MgppkliHoU-msW8Jp8XgdIHXiMrUs-2tjEJpOPVV6Zj6jOnvR6aZyLUtFevDg/s1600/WTPvsSVMsize35.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Boolean Classification tests based on test set of 8k randomly generated documents composed using an alphabet of three words.</td></tr>
</tbody></table>
<b>Introduction</b><br />
To encourage those who found the formula I presented a couple of posts ago quite intricate, I'm going to show you a practical application of it, which might be a good incentive to analyze the formal aspects with more attention :)<br />
What I show you today is one of several applications of this approach: a document classifier with higher accuracy than traditional methods such as SVM (trained with a Gaussian kernel) and back-propagated neural networks.<br />
<br />
<b>Characteristics of the classifier</b><br />
<ul>
<li>It's a supervised learning algorithm</li>
<li>It's completely non-parametric</li>
<li>It can be used natively to classify multi-class datasets.</li>
</ul>
<br />
<b>The Algorithm</b><br />
Let's assume we have a training set composed of two classes of documents: <em>Cl_1, Cl_2.</em><br />
<strong></strong><br />
<strong> </strong><u>Learning Phase: Estimation of Geometrically distributed Random Variables.</u><br />
<ol>
<li><em>Define an alphabet of three words {w1,w2,w3} using frequency criteria or more sophisticated techniques.</em></li>
<li><em>For each class of training set:</em></li>
<ul>
<li><em>estimate the parameters {p1, p2, p3} of the respective geometrically distributed random variables {X1(w1),X2(w2),X3(w3)}.</em></li>
</ul>
<li style="text-align: center;"><em>Calculate the polynomials associated to {X1(w1),X2(w2),X3(w3)} using:</em></li>
</ol>
<em></em><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRwr_ajZ-E3xqIH9BbjQYwQbKhs0i-h7yyhXoD2wb0hdKOr7txVYnimQ6sPNuMi3KtY1mxWFDNvmCMekNXVPAq47xtHXR6TagB_AOpFpkdBAE8gSx-LqU7rYvJxmvfF7nHvmqwyLayd8/s1600/formulaCorrected.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRwr_ajZ-E3xqIH9BbjQYwQbKhs0i-h7yyhXoD2wb0hdKOr7txVYnimQ6sPNuMi3KtY1mxWFDNvmCMekNXVPAq47xtHXR6TagB_AOpFpkdBAE8gSx-LqU7rYvJxmvfF7nHvmqwyLayd8/s1600/formulaCorrected.bmp" height="70" width="400" /></a></div>
<br />
<br />
<strong> </strong><u>Testing Phase: document classification</u><br />
<ol>
<li><em> for each document Di of the test set:</em></li>
<ul>
<li><em>Identify the number of occurrences of {w1,w2,w3}: {O_w1,O_w2,O_w3}</em></li>
<li><em>Select the polynomial whose associated term matches the occurrences, i.e. the one multiplying:</em></li>
<li style="text-align: center;"><em>p1^O_w1 p2^O_w2 p3^O_w3 (a code sketch of this phase follows the list).</em></li>
<li><em>Calculate the value of the polynomials P_Cl_1,<span class="Apple-style-span" style="font-style: normal;"><em> P_Cl_2 </em></span>using:</em></li>
<ol>
<li><em><span class="Apple-style-span" style="font-style: normal;"><em>{p1, p2, p3} estimated for Cl_1</em></span></em></li>
<li><i><span class="Apple-style-span" style="font-style: normal;"><em>{p1, p2, p3} </em></span></i><em><span class="Apple-style-span" style="font-style: normal;"><em>estimated for Cl_2</em></span></em></li>
</ol>
</ul>
<li><i>Classify the document:</i></li>
<ol>
<li><i>If (<span class="Apple-style-span" style="font-style: normal;"><em>P_Cl_1</em></span>></i><em><span class="Apple-style-span" style="font-style: normal;"><em>P_Cl_2) Di belongs to Cl_1</em></span></em></li>
<li><em><span class="Apple-style-span" style="font-style: normal;"><em>Else </em></span></em><em><span class="Apple-style-span" style="font-style: normal;"><em>Di belongs to Cl_2</em></span></em></li>
</ol>
</ol>
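To fix ideas, here is a skeletal Python version of the testing phase. The function <i>waiting_time_polynomial</i> is a placeholder for the boxed formula above (not reproduced here), and all names are my own:<br />
<pre>
from collections import Counter

def classify(doc_words, alphabet, params_cl1, params_cl2,
             waiting_time_polynomial):
    # waiting_time_polynomial(counts, (p1, p2, p3)) stands in for the
    # formula above: it evaluates the polynomial selected by the
    # occurrence counts, with the class parameters plugged in
    occ = Counter(w for w in doc_words if w in alphabet)
    counts = tuple(occ[a] for a in alphabet)   # (O_w1, O_w2, O_w3)
    p_cl1 = waiting_time_polynomial(counts, params_cl1)
    p_cl2 = waiting_time_polynomial(counts, params_cl2)
    return 'Cl_1' if p_cl1 > p_cl2 else 'Cl_2'
</pre>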
<b>Examples</b><br />
<i>1. How to select the polynomial</i><br />
<br />
Let's consider the polynomials calculated using the aforementioned formula and assume that in document <i>Di </i>the word <i>w1</i> is repeated 2 times, the word <i>w2 </i>is repeated 2 times and <i>w3 </i>is repeated 1 time.<br />
Then, in step 1 of the testing phase we have to choose the polynomial boxed in the list below:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3wD0u2GGj6GmPQ5YaUGeKU-L3qlPWl5VB-kSnijWRrOnzivbch2y54miWz9Hyjd0ENBxOZQLcg_xaR8RiO10oGeLrb_0vpd8AR6AtRJfr63lbBG26LjOqnLhwkma8nqgM4kEeCvHAMUM/s1600/exampleOfPolynomial.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3wD0u2GGj6GmPQ5YaUGeKU-L3qlPWl5VB-kSnijWRrOnzivbch2y54miWz9Hyjd0ENBxOZQLcg_xaR8RiO10oGeLrb_0vpd8AR6AtRJfr63lbBG26LjOqnLhwkma8nqgM4kEeCvHAMUM/s1600/exampleOfPolynomial.bmp" height="50" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Polynomials generated for <i>O_w1+ O_w2+O_w3</i>=5</td></tr>
</tbody></table>
<i>2. Same polynomial, different values of </i><i><span class="Apple-style-span" style="font-style: normal;"><em>p1, p2, p3.</em></span></i><br />
<div>
<br /></div>
<div>
- How many polynomials are generated for <i>O_w1+ O_w2+O_w3 </i>= 25?</div>
<div>
The answer is straightforward: 276, that is, all the possible configurations of 3 addends whose sum is 25. In the Encyclopedia of Integer Sequences there are exciting explanations of such <a href="https://oeis.org/A000217" target="_blank">series</a>.</div>
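A one-line check of that count (Python):<br />
<pre>
from math import comb
# compositions of 25 into 3 positive addends
print(sum(1 for i in range(1, 24) for j in range(1, 25 - i)),
      comb(24, 2))   # 276 276
</pre>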
<div>
<br /></div>
<div>
- How does the polynomial change for different values of <i><span class="Apple-style-span" style="font-style: normal;"><em>p1, p2, p3?</em></span></i></div>
<div>
It's quite impressive how different two polynomials can be despite the settings of <i>{p1, p2, p3}</i> being almost identical.</div>
<div>
Look at this example for which:</div>
<div>
<ul>
<li>On the left chart I plotted a polynomial with <em style="font-style: normal;">p1= 0.33, p2= 0.33, p3=0.34.</em></li>
<li>On the right chart I plotted a polynomial with <em style="font-style: normal;">p1= 0.4, p2= 0.3, p3=0.3.</em></li>
</ul>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE4aXz7yjy49pwKUIGEsItiQ6VnxOdrUnt3sRXpu3SfoeyQ_VeygcQL4I8HMGxe3-dyStcOU4HixxK_11Qu9CWkY1obk-djvwPb1kOYEapDEYgeL36urr-OhUMj607Q5KHGLfezRjvWr4/s1600/plotValues_almostthesame.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjE4aXz7yjy49pwKUIGEsItiQ6VnxOdrUnt3sRXpu3SfoeyQ_VeygcQL4I8HMGxe3-dyStcOU4HixxK_11Qu9CWkY1obk-djvwPb1kOYEapDEYgeL36urr-OhUMj607Q5KHGLfezRjvWr4/s1600/plotValues_almostthesame.bmp" height="175" width="400" /></a></div>
<div>
<ul>
<li>In the first case the polynomial takes its maximum value for <i>O_w1=8, O_w2=8, O_w3</i>=9 ...not a big surprise!</li>
<li>In the second case the polynomial takes its maximum value for <i>O_w1=15, O_w2=9, O_w3</i>=1. This result is more complicated to explain!</li>
</ul>
<b>Document classification comparative test</b></div>
<div>
<br /></div>
<div>
To test the accuracy of my method I performed a comparative test using a randomly generated document set.</div>
<div>
<ul>
<li>The training set was composed of 100 documents (used to train a Gaussian-kernel SVM and to estimate the parameters p_i for my method).</li>
<li>The test set was composed of (see the generation sketch after this list):</li>
<ul>
<li><i>Cl_1: </i>4000 documents randomly generated using a configuration of <i>{p1, p2, p3}.</i></li>
<li><i>Cl_2: </i>4000 documents randomly generated using a different configuration of <i>{p1, p2, p3}.</i></li>
</ul>
<li>The accuracy has been tested using different configurations of <i>{p1, p2, p3}</i> and considering different sizes of documents.</li>
</ul>
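This is roughly how such a test set can be generated (a Python sketch; that each position is an independent draw is an assumption of mine about the generation scheme):<br />
<pre>
import random

def generate_document(probs, size, alphabet=('w1', 'w2', 'w3')):
    # each of the `size` positions is an independent draw of one
    # word of the alphabet with probabilities (p1, p2, p3)
    return random.choices(alphabet, weights=probs, k=size)

cl1 = [generate_document((0.33, 0.33, 0.34), 25) for _ in range(4000)]
cl2 = [generate_document((0.40, 0.30, 0.30), 25) for _ in range(4000)]
</pre>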
<b>Results</b></div>
<div>
First experiment: </div>
<div>
only the first 25 words have been considered to estimate the parameters <i>{p1, p2, p3}, </i>to train the SVM, and to test the accuracy.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmO-ONn3ATDZo0LxaAeIcyjoyQ4QlMkL8w4m_LHMGdj6Rtbk6lI58g4CRba2ZzgUKij2b0RCo2hN0UBub8v6NtaD0ftqebT7aC8K8NxWRXLpZhyphenhyphenUI6IKEpF84NMXTruOMfnNUH4dSRxs/s1600/WTPvsSVMsize25.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmO-ONn3ATDZo0LxaAeIcyjoyQ4QlMkL8w4m_LHMGdj6Rtbk6lI58g4CRba2ZzgUKij2b0RCo2hN0UBub8v6NtaD0ftqebT7aC8K8NxWRXLpZhyphenhyphenUI6IKEpF84NMXTruOMfnNUH4dSRxs/s1600/WTPvsSVMsize25.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Accuracy results considering just the first 25 words.</td></tr>
</tbody></table>
<div>
Second experiment: </div>
<div>
Same test as above using the first 35 words of the documents.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0i21PHEytTaPOz03bVljT75GeMey_VV8cB-41UQMgiM4xOyhJzdbsJlKdFkezDZnTqCO2BhCsgIdL6oSRHui8SMt3ADo52qufVBaKmmay6rntiOrhqjKFWC49urKtf4gy2KfSs-DH5bc/s1600/WTPvsSVMsize35.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0i21PHEytTaPOz03bVljT75GeMey_VV8cB-41UQMgiM4xOyhJzdbsJlKdFkezDZnTqCO2BhCsgIdL6oSRHui8SMt3ADo52qufVBaKmmay6rntiOrhqjKFWC49urKtf4gy2KfSs-DH5bc/s1600/WTPvsSVMsize35.bmp" height="361" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">Accuracy results considering just the first 35 words.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
All the results shown here refer to the average accuracy over 10 different randomly generated test sets, with the SVM parameters tuned to maximize accuracy.</div>
<div>
</div>
<div>
<b>Considerations</b></div>
<div>
As you can see, the method I'm proposing, based on the definition of "Waiting Time Polynomials", performs significantly better than SVM.</div>
<div>
More comparative tests will be shown in the publication I'm writing on this topic.</div>
<div>
<br /></div>
<div>
<b>Further notes</b></div>
<div>
Processes based on waiting time or geometrically distributed random variables are extremely important for risk assessment and risk evaluation.</div>
<div>
I'll show you in another post some applications of such polynomials in this field.</div>
<div>
As usual, stay tuned</div>
<div>
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com5tag:blogger.com,1999:blog-7631116270195175228.post-64764179088623106392014-03-06T09:51:00.000-08:002014-07-30T02:09:42.709-07:00The Mind Map project: a new concept of Enterprise Knowledge Management<b>Abstract</b><br />
A project to build an innovative Information Management tool to extract, correlate and expose unstructured information.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB-JEd-tD_-Oq2uF1_lT1LcJkohW7AecEKfRxTE214_COTJTbe19CQtEebhPPxaOa4QeesIxDCBiB2V9JTXA08DdmBkqTeoUPX-w_f7gO7VNdklRVEpr7esIMvKNxwyBIYCxwJQ4UjdbQ/s1600/MM4.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB-JEd-tD_-Oq2uF1_lT1LcJkohW7AecEKfRxTE214_COTJTbe19CQtEebhPPxaOa4QeesIxDCBiB2V9JTXA08DdmBkqTeoUPX-w_f7gO7VNdklRVEpr7esIMvKNxwyBIYCxwJQ4UjdbQ/s1600/MM4.bmp" height="355" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A screen shot of one of the mindmap generated by the tool.<br />
Different colors have been used to depict sub-concepts. </td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b>The demo</b><br />
Some time ago I published on this blog some posts where I presented algorithms I designed to extract relevant information from documents and, more generally, unstructured content (such as tweets, blog posts, web pages).<br />
I don't want to spend too many words on it; I guess a demo better describes the idea I have in mind.<br />
It's still a prototype, and a lot of work is still required, but I hope the video conveys the idea.<br />
In the video I tested the prototype of the application using a Wikipedia <a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">page</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='680' height='400' src='https://www.youtube.com/embed/TFnD-BTctXI?feature=player_embedded' frameborder='0'></iframe></div>
<br />
PS<br />
<b>For the best experience, watch the video on YouTube with the "HD" quality option.</b><br />
<br />
...Looking forward to receiving feedback! <br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Stay Tuned</div>
<div class="separator" style="clear: both; text-align: left;">
cristian</div>
<br />Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com5tag:blogger.com,1999:blog-7631116270195175228.post-70948818181261003852014-03-03T23:15:00.001-08:002014-03-24T13:05:02.575-07:00Statistical Distribution to describe a Document - My Contribute Part II<b>Abstract</b><br />
<b><span class="Apple-style-span" style="color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; font-weight: normal; line-height: 18px;">it's presented the second step to generalize a probabilistic density function (based on geometrical distribution) to describe a document thru the underlying stochastic process.</span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7gJtUcRys9q-xoEcl6MzrMjkUNzIa6kw91wGVJkaQfV6HrhlCBbaOV8LyxWcLKk97eCPQUpxiac5BI41c6sXUqLX9bIKWLj5CGufL0EWCcITYx5swi2cI8130cQWsqBJ-b7KXIKp3qQ/s1600/03503503.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7gJtUcRys9q-xoEcl6MzrMjkUNzIa6kw91wGVJkaQfV6HrhlCBbaOV8LyxWcLKk97eCPQUpxiac5BI41c6sXUqLX9bIKWLj5CGufL0EWCcITYx5swi2cI8130cQWsqBJ-b7KXIKp3qQ/s1600/03503503.jpg" height="230" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Marginal PDF and CDF for document described by three words.<br />
In red has been depicted the distribution of a document having size = 35, in Yellow size=34, in Blue size 33.<br />
<br /></td></tr>
</tbody></table>
The marginal distribution that describes the density of a document of fixed size composed of three words is the following:<br />
<div>
<div style="text-align: center;">
<div style="text-align: -webkit-auto;">
<br /></div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRQm4CuVUf-PjrNctIpe0txqlkGw9Em0ARmXvKNx2XvWLJLnhhdDvXun7CvvXxUy7oZit09k3pNjfXQy_HvZdPVBG1TQVAOYlMV63lJbwolcxHVokqLz6ZrDiT_kNE9jCsy10dB6id3XQ/s1600/formulaCorrected.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRQm4CuVUf-PjrNctIpe0txqlkGw9Em0ARmXvKNx2XvWLJLnhhdDvXun7CvvXxUy7oZit09k3pNjfXQy_HvZdPVBG1TQVAOYlMV63lJbwolcxHVokqLz6ZrDiT_kNE9jCsy10dB6id3XQ/s1600/formulaCorrected.bmp" height="70" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis7jRMwgJROq6OOwjW9WqPA0IJAERTAPT7s5babQQ3bsQ9z0GVYgur7tSFIW0kJWI3Se0BiEl-kPBu9ubZY9lS2I_ZWKtm1UPAmrG7H-qbYdemnwhTqUCwN0Z4LF3nQdjtZ6uIR4w3w3Q/s1600/indexesCoeff.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis7jRMwgJROq6OOwjW9WqPA0IJAERTAPT7s5babQQ3bsQ9z0GVYgur7tSFIW0kJWI3Se0BiEl-kPBu9ubZY9lS2I_ZWKtm1UPAmrG7H-qbYdemnwhTqUCwN0Z4LF3nQdjtZ6uIR4w3w3Q/s1600/indexesCoeff.bmp" height="160" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of alpha, beta, gamma parameters determination. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b><br /></b>
<b>Introduction</b></div>
<div>
The bag-of-words approach and almost all the related techniques to extract features from a document are based on the manipulation of the frequencies associated with the words of the document. Such methodologies tend to fail when documents are characterized by exactly the same bag of words and the same frequencies.</div>
<div>
<br /></div>
<div>
The proposed approach bypasses this problem since it takes into account the waiting times of the words in a document.</div>
<div>
<br /></div>
<div>
<b>From 2 variables to 3 variables</b></div>
<div>
In the former post I presented the easiest case: a document depicted by two words. The generalization of the problem is not a painless process.</div>
<div>
To understand the tricks used to tackle the problem, I'm going to explain the passage from 2 to 3 variables.</div>
<div>
...To be continued.</div>
<div>
c.</div>
<div>
PS: I'm back!</div>
<div>
</div>
<div>
<br /></div>
<div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-86853900248329096892013-10-29T14:39:00.000-07:002013-10-29T14:39:55.647-07:00Document as Stochastic Process: My contribute - part 1<b>Abstract</b><br />
A statistical discrete distribution (based on geometric random variables) to depict a document. This post presents a probability density function to describe a document through the underlying stochastic process.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1u0VOVKEJFa8EOT_etAKZu5IsdKWuazeIBb6d39f57uky2SkoqjwdyOiRlSTo-QFhqJd3c2dPJvbZaPqdYHXDHig3IOFh_XVjZHfy43fvbQwke4l1oow9nhGeLiCD-uceKRDlKSmGd8g/s1600/abstract.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1u0VOVKEJFa8EOT_etAKZu5IsdKWuazeIBb6d39f57uky2SkoqjwdyOiRlSTo-QFhqJd3c2dPJvbZaPqdYHXDHig3IOFh_XVjZHfy43fvbQwke4l1oow9nhGeLiCD-uceKRDlKSmGd8g/s400/abstract.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
PDF-CDF charts and PDF analytical formula for document of length Z.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Results for a document of two words.</div>
</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<b>Introduction</b><br />
We have often discussed the importance of the feature extraction step in representing a document.<br />
<div>
So far, I have mostly presented techniques based on "bag of words" and some more sophisticated approaches to capture the relations among such features (graph entropy, mutual information, and so on).</div>
<div>
<br /></div>
<div>
In the last weeks I started to think about a different perspective in order to capture the essence of a document.</div>
<div>
<b><br /></b></div>
<div>
<b>The idea</b></div>
<div>
What is a document from a statistical point of view?</div>
<div>
I tried to give an answer, and I came to the following (expected) conclusions:</div>
<div>
<ol>
<li>A document is a stochastic process of random variables - the words of the document.</li>
<li>In a document, as in any statistical process, the occurrences of the random variables and functions over them don't provide a complete description of the process.</li>
<li>The joint distribution of the waiting times between two occurrences of the same variable (word) encloses all the significant links among the words. </li>
</ol>
The idea is to depict a document as a stochastic process of "waiting time" random variables.</div>
<div>
This formalization allows us to think of a document as a statistical distribution characterized by its own density and cumulative functions.<br />
The waiting time is very well depicted by the geometric distribution.</div>
<div>
<br /></div>
<div>
<b>Formalization </b></div>
<div>
A document is a process of geometrically distributed random variables:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYHHIn_MXPRT6IMmCvsJ276BOe3OtlXC6C5tBoT4dold32WKtDLeDx6n6mJV8mEL2ybsVvyNIdm5vMPBNvJtAAkfRVh01nvecmYlOUMq1r76AbnW_HQUwdhpgs5e8Jut-MCHxFBBc3c04/s1600/formalization.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="20" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYHHIn_MXPRT6IMmCvsJ276BOe3OtlXC6C5tBoT4dold32WKtDLeDx6n6mJV8mEL2ybsVvyNIdm5vMPBNvJtAAkfRVh01nvecmYlOUMq1r76AbnW_HQUwdhpgs5e8Jut-MCHxFBBc3c04/s400/formalization.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">N= size of the document, j = #words.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
where each random variable is described as follow:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh02gPNxbLKc20QfKm85_wvmc6cYuoQUe7wmaecfolzLWe7HbkF5pwZTSHAF_l2h1V2e8e5YoMGTAwd3sQbT-nwnxFk6ZYlc5Dvb6StPptFptOOdh9FW-miYkYTyDS0AKXl8sr8sTzhuMk/s1600/varXgeom.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="18" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh02gPNxbLKc20QfKm85_wvmc6cYuoQUe7wmaecfolzLWe7HbkF5pwZTSHAF_l2h1V2e8e5YoMGTAwd3sQbT-nwnxFk6ZYlc5Dvb6StPptFptOOdh9FW-miYkYTyDS0AKXl8sr8sTzhuMk/s400/varXgeom.bmp" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Let's consider the following document composed of a sequence of two words: "a" and "b".</div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: left;">
<b>The Problem</b></div>
<ul>
<li>What is the probability function associated with a document Dk?</li>
</ul>
Given the following document:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAaVb5osI1qXrdzxxoxTWgw6rpNko74w9Js4wlmrDqhw7xGN7hx3n_PHebOVW9U5S0Uiuxg96ypfhFckoXJJcJf8JJ2rzDfsDFcD6K7ZUcbK66Zq0ZebBRn5jDbQxHQbY8nnCSoPmsG50/s1600/document.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAaVb5osI1qXrdzxxoxTWgw6rpNko74w9Js4wlmrDqhw7xGN7hx3n_PHebOVW9U5S0Uiuxg96ypfhFckoXJJcJf8JJ2rzDfsDFcD6K7ZUcbK66Zq0ZebBRn5jDbQxHQbY8nnCSoPmsG50/s1600/document.bmp" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The Probability of Dk is given by:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSbyTF6kVkqPyfO8Ud4Ddzaytq4usvBA0hwSFJq6Ng0ylcDKd1Vtl2wlPXETrtpORa8U7pB4Br-kGbJrDsoXmkMDbgEC5vjSPqsW8KyaF2gbvsdRBG0FHLXn3uDOjF0aQPmr4p6uTNiQQ/s1600/DocProbability.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="15" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSbyTF6kVkqPyfO8Ud4Ddzaytq4usvBA0hwSFJq6Ng0ylcDKd1Vtl2wlPXETrtpORa8U7pB4Br-kGbJrDsoXmkMDbgEC5vjSPqsW8KyaF2gbvsdRBG0FHLXn3uDOjF0aQPmr4p6uTNiQQ/s320/DocProbability.bmp" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The image below should better explain what I mean:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMl1CC6YKBH7tPAlnrNY4mC9B2YGUzrLO6EBO01yWer949cdT98GjIrj1_uqLl0nI0YSpuPeKLLX9UJlL4Xo6o7ojb4c1Klt9QKe4Qynre5-uMAIy_aLSVaR79MPctFtqHLok8y0h9lMs/s1600/docProb.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMl1CC6YKBH7tPAlnrNY4mC9B2YGUzrLO6EBO01yWer949cdT98GjIrj1_uqLl0nI0YSpuPeKLLX9UJlL4Xo6o7ojb4c1Klt9QKe4Qynre5-uMAIy_aLSVaR79MPctFtqHLok8y0h9lMs/s400/docProb.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Probability of Document Dk<br />
<b>q = 1-p</b></td></tr>
</tbody></table>
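As a sanity check, the probability in the picture can be computed directly. A Python sketch for the two-word case, assuming, as in the figure, that each position is an independent draw of "a" with probability p_a and "b" with q_a = 1 - p_a:<br />
<pre>
def doc_probability(doc, p_a):
    # product of the per-position factors: p_a for every "a" and
    # q_a = 1 - p_a for every "b" (the waiting-time gaps)
    n_a = doc.count('a')
    return p_a ** n_a * (1.0 - p_a) ** (len(doc) - n_a)

print(doc_probability("abbabaab", 0.3))  # 0.3**4 * 0.7**4
</pre>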
<div class="separator" style="clear: both; text-align: left;">
<b>Probability Density Function </b></div>
<div class="separator" style="clear: both; text-align: left;">
<u>-> From now on, if not clearly specified, I will refer to a document of two words (as described above).</u></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Questions:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>What is the analytical representation of the above function?</li>
<li>Is the above function a probability function?</li>
</ol>
<br />
<div class="separator" style="clear: both; text-align: left;">
<b>1. Analytical formulation</b></div>
<div class="separator" style="clear: both; text-align: left;">
This is the fun part: I like combinatorics... the art of tallying!</div>
<div class="separator" style="clear: both; text-align: left;">
To solve the problem there are, I'm sure, several strategies; I used the one most aligned with my mind setting: I put all the options on the table and start gathering them together according to some criteria :)</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ul>
<li>The formulation of the "p" is a joke: the sum of the exponents must be equal to the size of the vector.</li>
<li>the formulation for the "q" requires some deeper thinking.</li>
</ul>
As you know, I don't like too many technicalities, so I'll spare you the theoretical aspects.<br />
I think it's more worthwhile to show how the solution comes about!<br />
<u><br /></u>
<u>Exponents:</u><br />
Let's put the <b>q_a</b> and <b>q_b</b> exponents, properly sorted, in a matrix:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHnH5fckzwc-DxeEKuLfA9z7G7ONHjCIihf5W7MxONA6BZ68Ewkl_BJl1ZXtPkBaBV3Lc-CY2Z3grtCprDKybrn-_QgbRiyhaodjGbrFZKfMIGPQcXesUf49ewXtPu9eCRn5KDidM4qJA/s1600/matrixExponents.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHnH5fckzwc-DxeEKuLfA9z7G7ONHjCIihf5W7MxONA6BZ68Ewkl_BJl1ZXtPkBaBV3Lc-CY2Z3grtCprDKybrn-_QgbRiyhaodjGbrFZKfMIGPQcXesUf49ewXtPu9eCRn5KDidM4qJA/s320/matrixExponents.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The split over the anti-diagonal makes the calculus easier.</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: left;">
<u>Coefficients:</u></div>
<div class="separator" style="clear: both; text-align: left;">
The same configuration of exponents can be related to different configurations of the words in the document:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLfxc_TvL_wadNdxzEPAU5Hm1JNd1_386laKTp6aBavMQjt13zLIqkDRVDfjgbSYwCkZZ4DROX3oOQmNgxjW-Y1aRl2hc6v5PSm2cs_gWMvgQU00TV2SA8x_w0FxdeX19aqmfpyWfpkpY/s1600/coefficients.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLfxc_TvL_wadNdxzEPAU5Hm1JNd1_386laKTp6aBavMQjt13zLIqkDRVDfjgbSYwCkZZ4DROX3oOQmNgxjW-Y1aRl2hc6v5PSm2cs_gWMvgQU00TV2SA8x_w0FxdeX19aqmfpyWfpkpY/s320/coefficients.jpg" width="224" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The different configurations (20) of two words in a document having length = 8. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
Let's put the occurrences of the different configurations of words for the same exponent configuration in a matrix:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCc1-fdsqscMUmlDgotgqy3wUK07xTmCnqregOza0pHuM_NjJlGENh1RQnJi3EKc9mC0MSrcPVo6pYiqL6isKxfgnCA-EyVoeeJxuwWp6wBmys6AkbH7HRXdHX22aeAWOwRw0opPOssf4/s1600/matrixCoefficient.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCc1-fdsqscMUmlDgotgqy3wUK07xTmCnqregOza0pHuM_NjJlGENh1RQnJi3EKc9mC0MSrcPVo6pYiqL6isKxfgnCA-EyVoeeJxuwWp6wBmys6AkbH7HRXdHX22aeAWOwRw0opPOssf4/s1600/matrixCoefficient.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
# o<span class="Apple-style-span" style="font-size: x-small;">ccurrences of the different configurations of words for the same exponents configuration</span><span class="Apple-style-span" style="font-size: small;">.</span></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-size: x-small;">Do you see the similarities</span><span class="Apple-style-span" style="font-size: small;"> </span><span class="Apple-style-span" style="font-size: x-small;">respect the two halves of the matrix?</span></div>
</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
Is it easier now find the "low" to build the above matrix?<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div>
... If you don't like playing around with combinatorics, you can ask for help from <a href="http://oeis.org/" target="_blank">The On-Line Encyclopedia of Integer Sequences®</a>!</div>
<div>
Ok, ok you are too lazy...: the solution is Binomial[k+i,k].</div>
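If you want to regenerate the half-matrix yourself, a couple of lines suffice (Python; the 0-based indices i and k are my convention):<br />
<pre>
from math import comb

size = 6
half = [[comb(k + i, k) for k in range(size)] for i in range(size)]
for row in half:
    print(row)   # Binomial[k + i, k]: 1 1 1 ..., 1 2 3 ..., 1 3 6 ...
</pre>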
<div>
<b><br /></b></div>
<div>
<b>The Analytical formula</b></div>
<div>
The above matrices suggest that the analytical formula is composed of two main addends: one for the orange side and another for the green side.<br />
For a document of two words, the probability density function is the following:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZKZpgpOtZagZkMmFhizh1q8iSXPBHCeePFEsAv6RQno-kYKB3IgJ26NqZ0FwHDwjaQyEFNeqSwqRTl9onz2m2K_DGjRVWDGP0zlLj5l9lhhZEkp9ce_XBU1CZ1phQOOmfKY0UusCRBjE/s1600/Formula.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZKZpgpOtZagZkMmFhizh1q8iSXPBHCeePFEsAv6RQno-kYKB3IgJ26NqZ0FwHDwjaQyEFNeqSwqRTl9onz2m2K_DGjRVWDGP0zlLj5l9lhhZEkp9ce_XBU1CZ1phQOOmfKY0UusCRBjE/s400/Formula.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PDF for a document composed by two words.<br />
qa = 1-pa = pb; pa+pb = 1.</td></tr>
</tbody></table>
I don't want to enter into theoretical details to prove that the above formula is a probability function (after all, this is a blog by practical means!).<br />
The only point I want to remark on is that the convergence to a PDF is guaranteed as the length of the document tends to infinity.<br />
The above formula can be further reduced and simplified (in the easy case of 2 words) if you consider the following conditions:<br />
<div style="text-align: center;">
<b>qa = 1-pa = pb => qb=pa (pa+pb = 1)</b><span class="Apple-style-span" style="font-size: 13px;">.</span></div>
Basically, in the above case the PDF is a function of 1 variable.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRH3G0qBPjlhgcPGZIGkfH28W20a-g-WSvIEWAiBSQ4WZ1pzVIkTCChL6EnZAQJe-sGhjOQt2xfgZVmQID3wPrrzw52_Z1Eve7UMTzrynqWmeeuprcQ9nwpcXQEQmCPpJbPPGjSr_-Yt4/s1600/abstract.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRH3G0qBPjlhgcPGZIGkfH28W20a-g-WSvIEWAiBSQ4WZ1pzVIkTCChL6EnZAQJe-sGhjOQt2xfgZVmQID3wPrrzw52_Z1Eve7UMTzrynqWmeeuprcQ9nwpcXQEQmCPpJbPPGjSr_-Yt4/s400/abstract.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PDF and CDF for a document of 2 words.</td></tr>
</tbody></table>
In the CDF, we sum the probability contributions of all documents composed of two words having length = 1, 2, ..., Infinity.<br />
As you can see, the contribution of documents having length >50 is negligible for whatever value of pa.<br />
<br />
These kinds of distributions based on geometric variables are often used in maintenance and reliability problems, so I encourage you to do some reading on that!<br />
<br />
<br />
<b>Next Steps</b><br />
In the next posts I'm going to explain<br />
<ul>
<li>some properties of the function</li>
<li>its extension to multiple variables </li>
<li>entropy</li>
<li>practical applications </li>
</ul>
<b>Last Consideration</b><br />
I did the above exercise just for fun, without any pretension of having discovered something new and/or original: if you have some paper/book/reference about identical or similar work, please don't hesitate to contact me! <br />
<br />
Don't rush, stay tuned<br />
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-19548337490549038632013-06-30T06:06:00.002-07:002013-06-30T06:06:55.830-07:00Image objects recognition: a basic approachTo complete the brief dissertation on the most popular problems of image analysis, I did a couple of experiments on a very intriguing topic: the identification of specific objects within an image.<br />
I didn't use "state of art" algorithms (...my knowledge on the field is not deep and I wont use techniques that I don't fully understand!).<br />
The target of this experiment isn't to show the best way, but just to show the rudiments of the matter.<br />
<br />
<b>Objective</b><br />
Given an image called the <i>target image</i>, and another image called the <i>feature</i>, the objective is to identify the <i>feature</i> within the <i>target image</i>.<br />
<i><br /></i>
<b>Technique </b><br />
I decided to play with a technique that I studied in a signal processing course at university: matrix correlation.<br />
<br />
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<span class="Apple-style-span" style="font-family: inherit;">The matrix correlation is typically used to detect similarities between two 2D signals, which are often saved in matrice<span class="Apple-style-span" style="font-family: inherit;">s. </span></span></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The magnitude of the correlation shows how similar the signals are. If the correlation is large, the two signals are considered very similar; if the correlation is zero, the two signals are considered independent. Analytical definitions and explanations are available in any signal processing book.</div>
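For the curious, the core of the method fits in a few lines. Below is a minimal Python sketch (my own, not the exact code used here), assuming the target and the feature are 0/1 numpy arrays (e.g. after the preprocessing described below):<br />
<pre>
import numpy as np
from scipy.signal import correlate2d

def match_feature(target, feature, threshold=0.9):
    # 2-D cross-correlation: high values where the feature pattern
    # overlaps a similar pattern in the target
    corr = correlate2d(target, feature, mode='same')
    corr = corr / corr.max()               # crude normalization
    return np.argwhere(corr >= threshold)  # candidate (row, col) peaks
</pre>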
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Experiment: Ball detection</b></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
In the first example I will try to detect the balls in the target image:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidF5fhNEC2MUiHvS7RBrfE2lTZt16Bst-5ScRHfg_YC8v3MdqOAEGawZ05v6YGCxoST3ytlAIkkCgAN7XFU2BkW6UX6m-IIjFwBS0FJEQ1eiJ94bDLJfCalDEIcU0vhlAjoh-YY_UivGc/s628/exp_balls.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidF5fhNEC2MUiHvS7RBrfE2lTZt16Bst-5ScRHfg_YC8v3MdqOAEGawZ05v6YGCxoST3ytlAIkkCgAN7XFU2BkW6UX6m-IIjFwBS0FJEQ1eiJ94bDLJfCalDEIcU0vhlAjoh-YY_UivGc/s320/exp_balls.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left the target image. On the right the feature (the ball!) to be identified in the image.</td></tr>
</tbody></table>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
To solve the problem, I tried to apply the algorithm directly to the original images, but the results were terrible, so I applied some image manipulation techniques to make the input more digestible for the main algorithm: I never tire of saying that preprocessing is crucial to solving a problem!</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The algorithms used here are edge detection and binarization. So I fed the correlation algorithm with the following input:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPm51TPuVX2K64xeUOJ-IqzNGspffMzzg4HyE5e0Brtl75dPNjtCODp0ncOcH_u56p7EeLezgzNeyEeHKI_IeUeDvPhbfVuzjzhPFVyhJjSKxJHd5pk-3JVO8DeYAVceAT0WA4ZRUZXCk/s905/binarization.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPm51TPuVX2K64xeUOJ-IqzNGspffMzzg4HyE5e0Brtl75dPNjtCODp0ncOcH_u56p7EeLezgzNeyEeHKI_IeUeDvPhbfVuzjzhPFVyhJjSKxJHd5pk-3JVO8DeYAVceAT0WA4ZRUZXCk/s320/binarization.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The input used to feed the correlation algorithm obtained with Binarization and eEge Detection techniques.</td></tr>
</tbody></table>
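<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
As a rough idea of this preprocessing step, here is a hedged sketch using OpenCV; the Otsu binarization and the Canny thresholds are illustrative assumptions of mine, since the post doesn't state which filters or parameters were used:</div>
<pre>
import cv2

def preprocess(path):
    # Grayscale, then Otsu binarization, then Canny edge detection.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 100, 200)
    return edges
</pre>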
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Results</b></div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
Finally we can look at the output and check whether the algorithm correctly identified the balls in the target image!</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
To highlight the results I played around with colors and image merging :)</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_M-Ex6VvwoSvjfXpflKwz_PxSjfbbPTybMuIGFuNOjV3iPxA62R2r3GfqI2bqqTvXc8SqQVwo8sXmFkJ2ia1YOzat0ZIQdvUhiF0yA0O7B8qSwmqAxwhN4fbOHAAVKFKoD-mUGJJS9I0/s1074/ballsRes.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_M-Ex6VvwoSvjfXpflKwz_PxSjfbbPTybMuIGFuNOjV3iPxA62R2r3GfqI2bqqTvXc8SqQVwo8sXmFkJ2ia1YOzat0ZIQdvUhiF0yA0O7B8qSwmqAxwhN4fbOHAAVKFKoD-mUGJJS9I0/s400/ballsRes.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Results: balls identified in the image </td></tr>
</tbody></table>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
As you can see the results are not bad at all, and I'm sure that by playing around with the preprocessing parameters it's possible to improve the accuracy.</div>
<div style="line-height: 19px; margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Experiment: Painting Recognition</b></div>
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
</div>
<div style="text-align: left;">
I applied the same technique to identify the paintings hung on a wall:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9AUas8xWk6hOPVLIbvwL_Qz65G0qhPBNZIUlfFVJc2wK2iJoUi0H2PfW_VGLz_mzNuuLLDCtuHLpX4vVtPajOaljaDgTGvActY4ZPR6y_3R0lFh0Lcrf5ulhFOSqcblf2-J_spA0-G4/s1600/quadri2.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9AUas8xWk6hOPVLIbvwL_Qz65G0qhPBNZIUlfFVJc2wK2iJoUi0H2PfW_VGLz_mzNuuLLDCtuHLpX4vVtPajOaljaDgTGvActY4ZPR6y_3R0lFh0Lcrf5ulhFOSqcblf2-J_spA0-G4/s320/quadri2.gif" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Target image: paintings hung on a wall</td></tr>
</tbody></table>
<b>Results:</b><br />
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
The techniques used are the same as in the above experiment, so let's go straight to the results:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH1IrZz7dY_xWsBTS4yJ9gdLCLfujWlmffM8YB0e6LrSfH6P4TFmjl4XVOQ73yqoAVvPz-t1VpMmzLir5K4pR3xif4oD5EY2BHmhwks_O8kIQSV1kBLkdfQ1eN9DqHyjwx-XKUlpy7TQE/s1600/res_paints2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH1IrZz7dY_xWsBTS4yJ9gdLCLfujWlmffM8YB0e6LrSfH6P4TFmjl4XVOQ73yqoAVvPz-t1VpMmzLir5K4pR3xif4oD5EY2BHmhwks_O8kIQSV1kBLkdfQ1eN9DqHyjwx-XKUlpy7TQE/s400/res_paints2.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Paintings detected</td></tr>
</tbody></table>
<div style="margin-bottom: 8pt; margin-left: 1pt; margin-right: 1pt; margin-top: 3pt;">
<b>Considerations</b></div>
<div>
The technique is quite rough, and the accuracy depends on:</div>
<div>
<ul>
<li>The preprocessing techniques applied </li>
<li>The measure used to compute the correlation (in both experiments I used cosine distance)</li>
</ul>
The algorithm per se doesn't seem general-purpose enough for this kind of task, but it's a good starting point for approaching the problem. If the feature to be detected is always the same (for instance, as in face recognition), the accuracy may be acceptable.</div>
<div>
Stay Tuned</div>
<div>
Cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-56902433978070678762013-05-15T13:39:00.000-07:002013-05-15T14:02:00.842-07:00Image comparison: SURF algorithm In the former post I showed a naive approach that is easy to implement and works quite well, provided that the image is not subject to distortions.<br />
I decided to dig a bit deeper into the topic, just for the sake of curiosity (curiosity and imagination are the skills I most appreciate at work!).<br />
<br />
SURF stands for "Speeded Up Robust Features"; it was published by Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool in 2008. The original paper is available <a href="ftp://ftp.vision.ee.ethz.ch/publications/articles/eth_biwi_00517.pdf" target="_blank">here</a>.<br />
<br />
I strongly recommend reading the paper: it's self-contained and clear enough to be read even by a neophyte like me.<br />
As usual, I'll go straight to the experiments and discuss the results.<br />
<br />
<b>Experiment</b><br />
I considered the same data set as in the last post, and I compared all possible pairs of images to check the accuracy of the algorithm.<br />
<br />
The first step of the algorithm is to determine the key features of an image.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw3cGjEzabUMrjjDbHRxUUyie89zM9bDt3Sh8e9taI2GyYmo0_0RA0NNwqs8YOfQ1rKDLl8X06cq0ELjtuIwfMHybtTTsE7KylUZba80w8dmIsLChpb1lnpDFV7B7BhucnvMY3-ba4DNw/s1600/keypoints.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw3cGjEzabUMrjjDbHRxUUyie89zM9bDt3Sh8e9taI2GyYmo0_0RA0NNwqs8YOfQ1rKDLl8X06cq0ELjtuIwfMHybtTTsE7KylUZba80w8dmIsLChpb1lnpDFV7B7BhucnvMY3-ba4DNw/s320/keypoints.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">key features detection</td></tr>
</tbody></table>
Once the features have been detected, it's possible to compare them through Euclidean measures or similar.<br />
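<div>
For the curious, here is a minimal sketch of this detect-and-match step with OpenCV. The file names and the Hessian threshold are placeholders of mine; also note that SURF is patented, so it requires an OpenCV build with the non-free xfeatures2d module enabled:</div>
<pre>
import cv2

img1 = cv2.imread("image_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("image_b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect key points and compute their SURF descriptors
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# Brute-force matching with Euclidean (L2) distance between descriptors
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.match(des1, des2)
print(len(matches), "common points")
</pre>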
<div>
<br /></div>
<div>
In the image below I plotted the common points found between two images and their positions:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjucM9XWe-V_mDRRLtcBE6GX2O4UCPTotpRhHyBmHobGISQhoDEAdCFgCFwpSB1wSySobSfKpc51JzY1woT6c1IhmqMgQgFl6hYR1VHBpjVes87Eq0r_cUYg9bosZ0cqMkBH66yf0EZlSg/s1600/SURF.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjucM9XWe-V_mDRRLtcBE6GX2O4UCPTotpRhHyBmHobGISQhoDEAdCFgCFwpSB1wSySobSfKpc51JzY1woT6c1IhmqMgQgFl6hYR1VHBpjVes87Eq0r_cUYg9bosZ0cqMkBH66yf0EZlSg/s320/SURF.jpg" width="226" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Common points found between two images. In yellow I depicted them and their position.</td></tr>
</tbody></table>
If you zoom in a bit you will notice that thousands of common points were found. The algorithm took around four minutes to find them.<br />
<div>
In a production environment this amount of time might not be acceptable. </div>
<div>
To improve the situation and reduce the computational time, I processed the images with binarization filters; after this preprocessing, the time dropped from 4 minutes to 25 seconds.</div>
<div>
<br /></div>
<div>
<b>Results</b></div>
<div>
<b><br /></b></div>
<div>
The matrix below lists all the comparisons done to check the accuracy.</div>
<div>
The correct pairs lie on the principal diagonal of the matrix.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGNXcZ42oEW_3DvuME40NbOQI2KTsyEJ6ICA1L5Ta2aZT53cZ7aLdZp63JMqUadDARfQtT0AGsoYIq2Z3zZbmK-ULYojMM0FI6E8_nkv8pWNW2hCzCv-VZf5MV3ekMeEqOUK5lSc5gNog/s1600/grid.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGNXcZ42oEW_3DvuME40NbOQI2KTsyEJ6ICA1L5Ta2aZT53cZ7aLdZp63JMqUadDARfQtT0AGsoYIq2Z3zZbmK-ULYojMM0FI6E8_nkv8pWNW2hCzCv-VZf5MV3ekMeEqOUK5lSc5gNog/s320/grid.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been listed the correct pairs</td></tr>
</tbody></table>
<div>
To show the results graphically, I plotted over the matrix the number of common points found for each pair; the chosen colors highlight that number. </div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXF8f1D3tM4J0nMVU0dhlhNqWQjTBwHF9Mmf7QHAUuJVy-ZimBFnONqcJm6JCs3EGWNJxBGo1Ti9vczhl5VTPKI0igAjS_Qtvxl8KnpohQWuYljqsGa4l2AtfD221uLWFSTSsA1_-5hwE/s1600/histog3D_1.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXF8f1D3tM4J0nMVU0dhlhNqWQjTBwHF9Mmf7QHAUuJVy-ZimBFnONqcJm6JCs3EGWNJxBGo1Ti9vczhl5VTPKI0igAjS_Qtvxl8KnpohQWuYljqsGa4l2AtfD221uLWFSTSsA1_-5hwE/s1600/histog3D_1.bmp" /></a></div>
<div>
<b><br /></b></div>
<div>
<b>Conclusions</b></div>
<div>
As you can see, for all the pairs on the principal diagonal the algorithm detected the highest number of common points.</div>
<div>
<br /></div>
<div>
In the next post I will show how to use this algorithm to identify logos in a document.</div>
<div>
Stay tuned</div>
<div>
Cristian<br />
<div>
<br /></div>
</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-76760608944605680562013-03-30T08:29:00.002-07:002013-03-30T13:31:20.384-07:00Image comparison for dummies: an easy and fast approach<div class="separator" style="clear: both; text-align: left;">
Since I started blogging, I have presented different methods, techniques and tricks to solve problems related to data mining and text mining (...I'm still puzzled about the difference between text mining and text analytics). </div>
Today I would like to present a different approach to solve a common problem in the text analytics area: the document comparison problem.<br />
<b><br /></b>
<b>The problem</b><br />
Usually, if you want to compare two documents to check whether they are pretty much the same, you can apply some easy tricks based on the frequency and position of the words.<br />
Sometimes the documents require OCR processing to obtain their "text rendition", and such processing is usually very expensive in terms of computational time (often more than 10 seconds per page): what if you have to perform many comparison checks and you are short on time?<br />
...In that case I suggest considering the documents as images!<br />
<br />
<b>Image processing approach</b><br />
This branch of computer science has evolved quite fast over the last decade and has reached an extremely high level of sophistication.<br />
I'm not an expert at all, but many libraries (often free, such as <a href="http://opencv.org/" target="_blank">opencv</a>) are available, offering several functionalities: image comparison, image recognition and so on.<br />
Unfortunately I'm stubborn and curious, and I like to do experiments :).<br />
<br />
So I decided to implement a very easy algorithm able to compare very similar documents with high accuracy.<br />
<b><br /></b>
<b>Algorithm:</b><br />
1. Resize the images (so that they have the same number of pixels)<br />
2. Binarize the images<br />
3. Compute the histogram distribution<br />
4. Perform the comparison using the Kullback distance between the two empirical distributions obtained from the histograms (see the sketch below).<br />
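<div>
Here is a minimal Python sketch of the four steps, under my own assumptions: I use the grayscale histogram rather than a strictly binary one (a two-level histogram would carry little information, and the post doesn't specify the exact filters), and I take the symmetrized Kullback-Leibler divergence as the "Kullback distance":</div>
<pre>
import numpy as np
from PIL import Image

def signature(path, size=(256, 256), eps=1e-9):
    # Steps 1-3: resize to a common size, reduce to gray levels,
    # and turn the histogram into an empirical distribution.
    img = Image.open(path).convert("L").resize(size)
    hist = np.bincount(np.asarray(img).ravel(), minlength=256) + eps
    return hist / hist.sum()

def kullback(p, q):
    # Step 4: symmetrized Kullback-Leibler divergence.
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Smaller value = more similar documents, e.g.:
# d = kullback(signature("doc1.png"), signature("doc2.png"))
</pre>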
<br />
Considerations:<br />
<br />
<ul>
<li>Such an algorithm is not robust to linear transformations/distortions of the image (but this is not the case when you compare documents...).</li>
<li>It's very fast compared to solutions that require OCR processing.</li>
<li>It could easily be improved by considering sub-blocks of the images (and giving higher weight to blocks in the middle of the image).</li>
<li>Through some assumptions based on eigenvectors, it could also be helpful in the case of linear transformations of the image.</li>
</ul>
<b>Experiment</b><br />
<br />
I selected 10 random images (pictures, medical analyses, text, screenshots):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_Qzsy4WTIND1cZ1puTSPLP4Jc0TkEwWKfVuVdFukUlbWzF_T7SY7D6kCtR8Y6FsMefvfZSZWHRGAVhNU4w6EEuezLFNy7W5aU2cM6LKM6KN1m4YG0jJ88PCwQ_t6ZovjxKfQFX7uDfoY/s1600/OriginalSampleSet.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_Qzsy4WTIND1cZ1puTSPLP4Jc0TkEwWKfVuVdFukUlbWzF_T7SY7D6kCtR8Y6FsMefvfZSZWHRGAVhNU4w6EEuezLFNy7W5aU2cM6LKM6KN1m4YG0jJ88PCwQ_t6ZovjxKfQFX7uDfoY/s400/OriginalSampleSet.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original sample set </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
For each document, after applying some image adjustment filters, I computed the histogram:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrR_FlQcab4ZHRfzBadexOLOydES1LKimQvuTxl5ASnHkxbQ20Q8uEFoESOPS_PNMhdrsgIEPnU69-ruyrUT1uL3_s3JMUkf-Grw8oymWsvyuO7jcDlGw2xBvP_TbSsYZLjqSM_ZvlVwI/s1600/histogramsSampleSets.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrR_FlQcab4ZHRfzBadexOLOydES1LKimQvuTxl5ASnHkxbQ20Q8uEFoESOPS_PNMhdrsgIEPnU69-ruyrUT1uL3_s3JMUkf-Grw8oymWsvyuO7jcDlGw2xBvP_TbSsYZLjqSM_ZvlVwI/s400/histogramsSampleSets.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Histograms for the above images</td></tr>
</tbody></table>
I then introduced some noise into each of the documents and computed the histograms again:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKHNPewniuHIVNIxGpRLwMKJZyAE6Mxs0hTrXyx1oah9B6nu1uMAkRhjvjV_b9xUQUz1fSYVHiEnFMEdTZuq-qLj08KJH7bpcLVmVt1QdmvaCJSaUafeWCg_QCoTagreELv0WwMxmk1Mo/s1600/NoisedSampleSet.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKHNPewniuHIVNIxGpRLwMKJZyAE6Mxs0hTrXyx1oah9B6nu1uMAkRhjvjV_b9xUQUz1fSYVHiEnFMEdTZuq-qLj08KJH7bpcLVmVt1QdmvaCJSaUafeWCg_QCoTagreELv0WwMxmk1Mo/s400/NoisedSampleSet.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Noised sample set.</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIqpRpzOk7ApQFZlDmYrNTefNRoqzNVe7_nzix3bKCQxxQ3HhwuNxuSq8j9DnEXpj6SleOW1dVgCWopszGgYNL8BQEv9LX4Ai4JkfPjn-giyElpbWfX11XIXHnTDJGRLu1YHBRJDduYD8/s1600/histogramsNoisedSets.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIqpRpzOk7ApQFZlDmYrNTefNRoqzNVe7_nzix3bKCQxxQ3HhwuNxuSq8j9DnEXpj6SleOW1dVgCWopszGgYNL8BQEv9LX4Ai4JkfPjn-giyElpbWfX11XIXHnTDJGRLu1YHBRJDduYD8/s400/histogramsNoisedSets.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Histograms for the noised images</td></tr>
</tbody></table>
As you can see, even though the noise introduced is quite strong, the characteristic histogram seems to preserve its shape quite well.<br />
<div>
<br /></div>
<div>
<b>Results:</b></div>
<div>
I compared every possible pair composed of an original image and a noised image, and I ranked the results of the comparison.</div>
<div>
...The test was passed with an accuracy of 100%.</div>
<div>
I have to say that the images are quite heterogeneous, but further tests (I'll publish them in one of the next posts) showed that the method is quite robust (with all the limitations explained above).</div>
<div>
Here are some of the best matches found:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijtqbD_5m82ZsV7v4a3AEVXbM0W3EE525LM-A66aYaw7FUaB1-d7FxT1NzqInpEqSJXqroTFZR5g_Jma_17FWhxNfzYmHg5cLwk5IgN7Wsyjm4hGBffwW8FgGLXdbZlzczhx6_0TYZzN4/s1600/res.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijtqbD_5m82ZsV7v4a3AEVXbM0W3EE525LM-A66aYaw7FUaB1-d7FxT1NzqInpEqSJXqroTFZR5g_Jma_17FWhxNfzYmHg5cLwk5IgN7Wsyjm4hGBffwW8FgGLXdbZlzczhx6_0TYZzN4/s400/res.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Some of the best matching found</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
...I have to say that image processing is a fascinating science!<br />
Stay tuned<br />
cristian<br />
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com6tag:blogger.com,1999:blog-7631116270195175228.post-71328875636834303012013-02-19T14:07:00.000-08:002013-02-19T14:16:37.526-08:00Mind maps of #textanalytics - automatic generationWhat I want to show you today is a "snippet" of a project I'm working on during my spare time.<br />
The project aims to gather, provide and represent knowledge in a different way.<br />
For the time being I don't want to disclose too much, but the main idea is to leverage my concept of graph entropy to capture the information in a document (or in a data set) and create a new kind of indexer.<br />
<br />
One of the outputs of the algorithm is the generation of a mind map.<br />
<br />
<b>Experiment</b><br />
Through the API provided by Twitter I downloaded the last 200 tweets from the hashtag #textanalytics.<br />
I processed the tweets, and here are the results:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMunbg9TkHjzpAJP5N-2kd8VuDTt2vsm-H9ATjVPcfmMpyS_Dz0v2LmGpIfmIys_nS43oJ4U1DAe33v0FtqlEtcz7PyhnYWKjGZ-sZTEQrbq3nQgE5_aG6VzZ_Ie3p3zCU2-0rv2pbe6U/s1600/textanalytics.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMunbg9TkHjzpAJP5N-2kd8VuDTt2vsm-H9ATjVPcfmMpyS_Dz0v2LmGpIfmIys_nS43oJ4U1DAe33v0FtqlEtcz7PyhnYWKjGZ-sZTEQrbq3nQgE5_aG6VzZ_Ie3p3zCU2-0rv2pbe6U/s400/textanalytics.bmp" width="395" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">#textanalytics mind map</td></tr>
</tbody></table>
<br />
The first 10 relevant words, extracted through the graph entropy criterion, are depicted in red.<br />
It's interesting to notice how intuitively the map shows the links among the words.<br />
Of course the map can be enriched with more links or more relevant words: it's up to the user.<br />
In the next posts I'll show you how the clustering techniques discussed in the former posts can be profitably used to create homogeneous groups of related words.<br />
What do you think?<br />
Stay tuned<br />
cristianCristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com7tag:blogger.com,1999:blog-7631116270195175228.post-82587561742016106322013-01-05T05:58:00.001-08:002013-01-05T13:06:30.637-08:00Graph Clustering through "prominent vertexes": some clarificationI received many emails about the last post, most of them asking for explanations of the technique I developed.<br />
I have to clarify that:<br />
<br />
<ol>
<li>This method is just a consideration about the role played by some vertexes in a generic (directed) graph.</li>
<li>The results posted in the former post can be improved a lot (I'm working on it, among the several streams I've opened around the concept of graph entropy!).</li>
<li>The approach I'm going to explain in this post is not exactly the one I used in the former, because I prefer to focus on the main idea rather than on the optimization (but essential) aspects.</li>
</ol>
<div>
Let's consider the following graph:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDC7RivvvWM9fq-_Uzm6b0LMaBxOUIIW4A-Yh4dch4KedMyHXzSE5WUs4RbcSWO7AJdiy5r-AebRd0af_Mvr2JkWrzmc9KDpY-1SY9GfcV0BddrQCyTw5xa6UOeU-IMmQr_gXN7n2ScI/s1600/gTest.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDC7RivvvWM9fq-_Uzm6b0LMaBxOUIIW4A-Yh4dch4KedMyHXzSE5WUs4RbcSWO7AJdiy5r-AebRd0af_Mvr2JkWrzmc9KDpY-1SY9GfcV0BddrQCyTw5xa6UOeU-IMmQr_gXN7n2ScI/s400/gTest.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Let's consider two nodes: "<b>cbr</b>" and "<b>zx</b>", then let's take all the paths that connect the two nodes:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg94CiV5wl0i0U-vu-8sjion9d0QZNuqLInnVjdcVNnjzlIthjr2JY3qoA1nQ5SKfhbEks4ledTDfytatFKLJtJYbn2uapOR0Py34zhdC2Ng2EIy68ORRtDlDygnO4Jv71c4IInof0bgBc/s1600/paths.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg94CiV5wl0i0U-vu-8sjion9d0QZNuqLInnVjdcVNnjzlIthjr2JY3qoA1nQ5SKfhbEks4ledTDfytatFKLJtJYbn2uapOR0Py34zhdC2Ng2EIy68ORRtDlDygnO4Jv71c4IInof0bgBc/s400/paths.jpg" width="335" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">All the possible paths between the node "<b>cbr</b>" and "<b>zx</b>"</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
We now have to consider the intersection (excluding the starting and ending points) of all the paths: </div>
<div class="separator" style="clear: both; text-align: left;">
<b>such nodes are essential to reach the node "zx" from "cbr", because if you exclude just one of them, you cannot reach the ending point! </b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyh1PzpJNUtBr7ioH1ysr_JV9hjybqlmiM74e9fNvOlbWtWlpS4DZ2B8eke28aa2T2gX6HbRkWaHaXxGgspT37s0Eso9v6KpP94Cbvsb0U-kFiLZ-8fBVXX4mwgsha57Yeo-OfvjtwHc/s1600/prominent+nodes.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyh1PzpJNUtBr7ioH1ysr_JV9hjybqlmiM74e9fNvOlbWtWlpS4DZ2B8eke28aa2T2gX6HbRkWaHaXxGgspT37s0Eso9v6KpP94Cbvsb0U-kFiLZ-8fBVXX4mwgsha57Yeo-OfvjtwHc/s400/prominent+nodes.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The prominent nodes for the pair ("cbr","zx"): </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
With the above procedure we found the prominent nodes for the pair ("cbr", "zx").</div>
<div class="separator" style="clear: both; text-align: left;">
Of course, to determine the prominent nodes of the entire graph you have to consider all the pairs of nodes: that's what justifies the complexity > $\binom{n}{2}$ of the procedure (n = order of the graph). This figure doesn't even include the cost of obtaining the paths between the nodes!</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The final result can be obtained by taking the intersection of the "prominent nodes" over all pairs of nodes!</div>
<div class="separator" style="clear: both; text-align: left;">
As you can imagine this is an expensive way to obtain the prominent nodes: I'll provide updates on improvements to it.</div>
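<div class="separator" style="clear: both; text-align: left;">
To make the per-pair step concrete, here is a minimal sketch with networkx (the toy graph is mine, not the one from the figures; for large graphs, enumerating all simple paths is exactly the expensive part mentioned above):</div>
<pre>
import networkx as nx

def prominent_nodes(G, a, b):
    # Nodes shared by every simple path from a to b (endpoints excluded):
    # removing any one of them makes b unreachable from a.
    paths = list(nx.all_simple_paths(G, a, b))
    if not paths:
        return set()
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    return common - {a, b}

# Toy graph: every route from 1 to 4 must pass through 3.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3), (3, 4), (3, 5), (5, 4)])
print(prominent_nodes(G, 1, 4))   # prints {3}
</pre>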
<div class="separator" style="clear: both; text-align: left;">
Stay tuned </div>
<div class="separator" style="clear: both; text-align: left;">
cristian</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
<br />
<br />Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-8070880853819980722012-12-18T15:23:00.001-08:002012-12-18T22:43:13.192-08:00Graph clustering: an approach based on "prominent vertexes"I was doing some experiments on the entropy graph, and I noticed that under some conditions it is a good marker of special vertexes.<br />
I'll try to explain the concept without formulas: that's really challenging because of my English :)<br />
These vertexes play a special role, and I decided (arbitrarily) to call them "Prominent Vertexes": basically, if you remove them you disconnect the graph, and the remaining connected components are the clusters that induce the graph partitioning. To be more precise, the connected components yield only N-1 clusters; the last cluster is obtained as the complement of those N-1 clusters with respect to the original graph.<br />
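<div>
As a small illustration of this partitioning step, here is a sketch with networkx; it assumes the prominent vertexes have already been computed, and the set operations simply mirror the description above:</div>
<pre>
import networkx as nx

def clusters_from_prominent(G, prominent):
    # Remove the prominent vertexes: the surviving connected
    # components are N-1 of the clusters...
    H = G.copy()
    H.remove_nodes_from(prominent)
    parts = [set(c) for c in nx.connected_components(H)]
    # ...and the last cluster is the complement of those clusters
    # with respect to the original graph.
    covered = set().union(*parts) if parts else set()
    parts.append(set(G.nodes) - covered)
    return parts
</pre>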
<br />
<b>The intuitive explanation</b><br />
<br />
<ul>
<li>Consider all the possible paths between two nodes <b>A </b>and <b>B </b>in a graph</li>
<li>Take the intersection of the above paths.</li>
<li>The result is a set of nodes that are essential to connect the two nodes: these nodes are the prominent nodes for the vertexes <b>A </b>and <b>B</b></li>
</ul>
<div>
The above schema can be applied (with some expedients) to all the Binomial[N,2] pairs of vertexes.</div>
<div>
<br /></div>
<div>
<b>Some results</b></div>
<div>
<b>1st example</b></div>
<div>
Let's consider the following graph and the respective "prominent vertexes":</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLNlgn7e37KovihR5Expqgq47pp7PpGeN7Y-LFfWfQrBuM7Y5dDkCZyCThWoJJPvnrmXQugWzZ6GQdTdm6iF_m8F8fncwOTAlduEOYH1xXB1uDxL17essnN_BRb0y1OOkBBBqsJrCv-I/s1600/influencingFactors_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLNlgn7e37KovihR5Expqgq47pp7PpGeN7Y-LFfWfQrBuM7Y5dDkCZyCThWoJJPvnrmXQugWzZ6GQdTdm6iF_m8F8fncwOTAlduEOYH1xXB1uDxL17essnN_BRb0y1OOkBBBqsJrCv-I/s400/influencingFactors_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In yellow have been depicted the prominent nodes for the graph</td></tr>
</tbody></table>
The clusters obtained considering the connected components are the following:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvEEtbwNvT_brCZSAG3uCaZtllGdnu8hK1ZPFajTMfaYu8SNex2jpO-lc_xzvyZyeVEcr_6f9p-FaHl_WD-BDbY5Lg4sQAoxWnPWnF78sjXqO2zGck41ARJECIhRWKAuC44AwrSCi_2eU/s1600/clustering_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvEEtbwNvT_brCZSAG3uCaZtllGdnu8hK1ZPFajTMfaYu8SNex2jpO-lc_xzvyZyeVEcr_6f9p-FaHl_WD-BDbY5Lg4sQAoxWnPWnF78sjXqO2zGck41ARJECIhRWKAuC44AwrSCi_2eU/s400/clustering_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 3 clusters obtained on the graph</td></tr>
</tbody></table>
<div>
As usual I compared my approach against another technique: <span class="Apple-style-span" style="font-family: inherit;">A. Clauset, "Finding Local Community Structure in Networks," <i>Physical Review E</i>, <b>72</b>, 026132, 2005.</span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCd7Vaa-ZHnYE7pkEP9pzaTVd4yJZGU7tZJnGL37gIwjntDEkzl2BV3r37wqpbqFBgQatwB5bwmzKPJvo_lMVB8dOrDk_E4MydqAdVzRgkvKjl9YYG6uZaosNvHqVYlROeKzHx3c8MD8A/s1600/community_1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCd7Vaa-ZHnYE7pkEP9pzaTVd4yJZGU7tZJnGL37gIwjntDEkzl2BV3r37wqpbqFBgQatwB5bwmzKPJvo_lMVB8dOrDk_E4MydqAdVzRgkvKjl9YYG6uZaosNvHqVYlROeKzHx3c8MD8A/s400/community_1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 6 clusters obtained through the "Clauset" method.</td></tr>
</tbody></table>
<div>
<span class="Apple-style-span" style="font-family: inherit;"><b>Comment:</b></span></div>
<div>
As you can see, the proposed approach produces a much less fragmented result than the Clauset method.</div>
<div>
<br /></div>
<div>
<b>2nd example</b></div>
<div>
Here is another example:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh3OdZdevPsHrQNjQE2gSQ03RJvCGE4z9HmD1BXpd85z3FMOIjor-yfEz-eMfDLZ68TQG0AdEbeQ54eJRqKGft3pEh69FGc2Hrlk9AefskaQkDr5cJqPkylQvIMcJxUMdrNNSCtzd2Uro/s1600/clustering_2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh3OdZdevPsHrQNjQE2gSQ03RJvCGE4z9HmD1BXpd85z3FMOIjor-yfEz-eMfDLZ68TQG0AdEbeQ54eJRqKGft3pEh69FGc2Hrlk9AefskaQkDr5cJqPkylQvIMcJxUMdrNNSCtzd2Uro/s400/clustering_2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 2 clusters obtained with the prominent vertexes method,</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7KOIx_zDtUD4G_1NEf19EishmGI803ePaHcxLnHodpfangF-FHu9Gl8Xwxy0e_6eEQqOkcDEPrL177kLMQafRbw5CKabiCcS4TOHs7xMgCB_FJLSAnREM1RCeW8GBioJ_MUUmIgROdP0/s1600/community_2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7KOIx_zDtUD4G_1NEf19EishmGI803ePaHcxLnHodpfangF-FHu9Gl8Xwxy0e_6eEQqOkcDEPrL177kLMQafRbw5CKabiCcS4TOHs7xMgCB_FJLSAnREM1RCeW8GBioJ_MUUmIgROdP0/s400/community_2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 6 clusters obtained through the Clauset method.</td></tr>
</tbody></table>
<div>
<b>Comment:</b></div>
<div>
As in the former example, the "prominent vertexes" method seems to behave better (even if the result is not optimal).</div>
<div>
<br /></div>
<div>
<b>Considerations</b></div>
<div>
<b><br /></b></div>
<div>
<b>PRO</b></div>
<div>
<ul>
<li>The proposed method seems to be a promising graph clustering algorithm.</li>
<li>The method doesn't require parameters (e.g. the number of clusters).</li>
<li>The method doesn't fragment the nodes too much. </li>
</ul>
<div>
<b>CONS</b></div>
</div>
<div>
<ul>
<li>The complexity is (for the current implementation) quite high, ~Binomial[Graph Order, 2].</li>
<li>The method requires improvements and a theoretical explanation.</li>
<li>The method has lower accuracy when the vertex degrees are very small.</li>
</ul>
<div>
The code will be released as soon as the formalization of the Entropy Graph is completed.</div>
<div>
Looking forward to receiving your comments.</div>
<div>
<div>
Stay tuned & Merry Christmas!</div>
</div>
<div>
<br /></div>
<div>
c. </div>
</div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com3tag:blogger.com,1999:blog-7631116270195175228.post-57490305619775557042012-11-14T15:10:00.002-08:002012-11-14T15:10:47.931-08:00Document Clustering and Graph Clustering: graph entropy as linkage function<div class="separator" style="clear: both; text-align: center;">
</div>
Let's continue our discussion about the applications of the graph entropy concept.<br />
Today I'm going to show how we can re-use the same concept for document clustering.<br />
What I want to highlight is that through such methodology it's possible to:<br />
<br />
<ol>
<li>extract from a document the relevant words (as discussed <a href="http://feedproxy.google.com/~r/TextDataMiningByPracticalMeans/~3/inxOX5Z6Vow/graph-entropy-to-extract-relevant-words.html" target="_blank">here</a>);</li>
<li>cluster the words of a document (as discussed <a href="http://feedproxy.google.com/~r/TextDataMiningByPracticalMeans/~3/_FdaTn6LYcs/key-words-through-graph-entropy.html" target="_blank">here</a>);</li>
<li>cluster sets of documents;</li>
<li>cluster a graph and assign a ranking score to each cluster by homogeneity criteria.</li>
</ol>
<div>
<b>The Experiment</b></div>
<div>
For this experiment I chose a subset of the standard dataset called "<a href="http://qwone.com/~jason/20Newsgroups/" target="_blank">the 20 Newsgroup dataset</a>".</div>
<div>
The documents of this data set are emails selected from different newsgroups.</div>
<div>
I selected some documents from the categories: Hockey, Motorcycle, Atheist and Electronics.</div>
<div>
You can download <a href="http://www.ziddu.com/download/20881027/tr2.zip.html" target="_blank">here</a> the documents used for this experiment.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
<div>
<b>The procedure </b></div>
<div>
<ul>
<li>For each document, build the respective graph by filtering with a stopword list and stemming the words (a sketch of this step follows the list). </li>
<li>For each graph, calculate the graph entropy value of each vertex (word).</li>
<li>For each graph, extract the first <b>k</b> relevant vertexes (sorted by graph entropy value).</li>
<li>Perform hierarchical clustering (in this case I used, once again, an approach based on simulated annealing).</li>
</ul>
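<div>
The post doesn't say exactly how the word graph is built; a common choice, and the one assumed in this sketch, is to link words that co-occur within a small sliding window (the stopword list here is trimmed, and stemming, e.g. with NLTK's PorterStemmer, is omitted for brevity). The entropy scoring of the vertexes relies on my own, not yet published, definition, so it isn't reproduced here.</div>
<pre>
import re
import networkx as nx

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}   # trimmed list

def document_graph(text, window=2):
    # Vertices are words; edges link words co-occurring within `window`.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    G = nx.Graph()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window + 1]:
            if v != w:
                G.add_edge(w, v)
    return G
</pre>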
<b>The results</b></div>
<div>
Here are the early results obtained through the clustering.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_V0Pi9Ir38PqLnss-g3Xl8hea38yOIq27uDEhB223bFlfJ99zwYsPr6Vdj1Swis-WPFq6V3kvx9ixcryNIVs6-JOrnrnlD_8FZ-xppeClNd57QMHGbeDcBB3Du324xdPzd4Jyeir-rY/s1600/clusters2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc_V0Pi9Ir38PqLnss-g3Xl8hea38yOIq27uDEhB223bFlfJ99zwYsPr6Vdj1Swis-WPFq6V3kvx9ixcryNIVs6-JOrnrnlD_8FZ-xppeClNd57QMHGbeDcBB3Du324xdPzd4Jyeir-rY/s400/clusters2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Each line depicts a cluster (the elements are the filename). The clusters have been sorted by homogeneity criteria assigned by the clustering algorithm. </td></tr>
</tbody></table>
<b>Considerations</b><br />
<div>
The results are promising but not perfect. The accuracy achieved is:</div>
<div>
<ul>
<li><b>75.0%</b> for the first cluster </li>
<li><b>62.5%</b> for the second cluster</li>
<li><b>50.0%</b> for the fifth cluster</li>
<li><b>37.5%</b> for the third and fourth clusters</li>
</ul>
</div>
<div>
The reasons for the bad aggregations are related to the class "atheist". This class contains documents characterized by high variety: very short documents, text not strictly pertinent to the topic, and a quite general topic.</div>
<div>
Moreover, the clustering algorithm hasn't been optimized/customized for this problem.</div>
<div>
Let's try to analyze the results from a different perspective:</div>
<div>
<b><br /></b></div>
<div>
<b>Something more...</b></div>
<div>
I mentioned before that this approach is built on the concept of graph entropy.</div>
<div>
Here is the graph representation of the words related to each cluster.</div>
<div>
For each cluster I selected the first 150 relevant words.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI1w12zrKWmNzKXPMyjeEsiCqAV_65meukRg-FgVhGUX1Sn7avaZr8tM5NFz5Wy66nmtXobqtCnF9pN5ITPtpxRPRw7mhTCVN4R3OxgnJ6OiqoJK8Nva41EiBvYJq6j6ehloINCkeVzes/s1600/cl1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI1w12zrKWmNzKXPMyjeEsiCqAV_65meukRg-FgVhGUX1Sn7avaZr8tM5NFz5Wy66nmtXobqtCnF9pN5ITPtpxRPRw7mhTCVN4R3OxgnJ6OiqoJK8Nva41EiBvYJq6j6ehloINCkeVzes/s320/cl1.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Red has been highlighted the first cluster</td></tr>
</tbody></table>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBVvTJOISzZ-uvZuOB50HDkXBOHnOt-rgnmyDxZWy2Us6p0h6dbcbnflIc0e9gjO2rUspxEaNYJcgVzxFmVxjYhoZmToPzIV2663GzgnlSWUiuzxJfCXkl8O2nbM84H1uBkyS1zgVhjpU/s1600/cl2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBVvTJOISzZ-uvZuOB50HDkXBOHnOt-rgnmyDxZWy2Us6p0h6dbcbnflIc0e9gjO2rUspxEaNYJcgVzxFmVxjYhoZmToPzIV2663GzgnlSWUiuzxJfCXkl8O2nbM84H1uBkyS1zgVhjpU/s320/cl2.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In dark Blue Has been Highlighted the second cluster</td></tr>
</tbody></table>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rDJgGQliMa18OQXJTJi_8GWyQwfylRg5UBFO9khwTg8tQz20H0psiLh2xjljmccVLW3z1ENzrZzT995TSWtgx0yZPZhmTnA9MrnwR4eBcQB3IoZani_shIezAHqUZEsR_vndhZ-8sDA/s1600/cl3.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rDJgGQliMa18OQXJTJi_8GWyQwfylRg5UBFO9khwTg8tQz20H0psiLh2xjljmccVLW3z1ENzrZzT995TSWtgx0yZPZhmTnA9MrnwR4eBcQB3IoZani_shIezAHqUZEsR_vndhZ-8sDA/s320/cl3.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Yellow has been depicted the third cluster</td></tr>
</tbody></table>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMvutQtm10ceuduVs891ES42OX1uB3Gm27DSaKtiocAHGbRc3kC7giqMpBhIKVsxj_3SrPpxBOsdZpTDro0bAlxNbDsYL9nSPoVz4hdVYIEtoFW9_ETvpzK3O1hIzrYddt96ydU8r_R6g/s1600/cl4.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMvutQtm10ceuduVs891ES42OX1uB3Gm27DSaKtiocAHGbRc3kC7giqMpBhIKVsxj_3SrPpxBOsdZpTDro0bAlxNbDsYL9nSPoVz4hdVYIEtoFW9_ETvpzK3O1hIzrYddt96ydU8r_R6g/s320/cl4.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Green has been depicted the fourth cluster.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHKsVLmzchvEfQu9uNwTCfcDKIEHTJXcT1bqQ2-uqeCHjuFuHaouYPRgbu1BOLDDiIv2NhfzU38ug7uWJQXPEhPM1c6Yh0PXGTWue18WGgzwfV9MhRtchY9rzpLtzbhCMinl6oYWQHXz4/s1600/cl5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHKsVLmzchvEfQu9uNwTCfcDKIEHTJXcT1bqQ2-uqeCHjuFuHaouYPRgbu1BOLDDiIv2NhfzU38ug7uWJQXPEhPM1c6Yh0PXGTWue18WGgzwfV9MhRtchY9rzpLtzbhCMinl6oYWQHXz4/s320/cl5.bmp" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In Brown has been depicted the fifth cluster.</td></tr>
</tbody></table>
<div>
<div>
<div>
<b>Notes</b></div>
<div>
<b>Such representation of the clustering highlights that the linkage function used in this experiment depicts quite well the different areas of the graph.</b></div>
<div>
<ul>
<li>The red cluster grouped together the right block of vertexes.</li>
<li>The brown cluster grouped the right side of the graph.</li>
<li>The blue cluster focused on the core of the graph.</li>
<li>The yellow cluster grouped together the corona around the core of the graph.</li>
<li>The green cluster grouped together the corona around the periphery of the graph.</li>
</ul>
<div>
<b>The next steps</b></div>
</div>
</div>
</div>
<div>
In the next post I'll try to refine the results, optimizing the clustering algorithm and trying different kinds of clustering algorithms.</div>
<div>
Stay Tuned</div>
<div>
cristian</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com4tag:blogger.com,1999:blog-7631116270195175228.post-2006952645078396562012-10-23T14:49:00.000-07:002012-10-24T10:28:50.870-07:00Key words through graph entropy Hierarchical clusteringIn the last post I showed how to extract key words from a text through a principle called graph entropy.<br />
Today I'm going to show another application of graph entropy: extracting clusters of key words.<br />
<b>Why</b><br />
The key words of a document depict the main topic of the content, but if the document is big there are often many different subtopics related to the main one.<br />
From this perspective, clusters of keywords should make it easier for the reader to identify the key points of a document.<br />
Moreover, imagine implementing a search engine based on clusters of relevant words instead of the common indexing of atomic words: it would enable document comparison, taxonomy definition, and much more!<br />
<b>How</b><br />
The definition of graph entropy I'm working on assigns to each word of the document a relevance score and a subgraph of words topologically close to it.<br />
The clustering should maximize the relevance score obtained by merging two words into the same cluster.<br />
It's easy to see that we are facing a combinatorial maximization problem.<br />
The idea is to take advantage of simulated annealing (a bit revisited and adapted to the purpose) to identify a sub-optimal merging solution at each step of the merging phase of the hierarchical clustering.<br />
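<div>
For reference, this is the generic simulated-annealing skeleton. The revisited variant used here, and in particular its energy function based on the graph entropy relevance score, isn't published, so everything below (parameter names, cooling schedule) is only an illustrative assumption:</div>
<pre>
import math
import random

def anneal(init, energy, neighbor, t0=1.0, cooling=0.995, steps=5000):
    # Generic simulated annealing: minimizes `energy` over states
    # produced by the `neighbor` move (e.g. alternative merges).
    cur, cur_e = init, energy(init)
    best, best_e = cur, cur_e
    t = t0
    for _ in range(steps):
        cand = neighbor(cur)
        cand_e = energy(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature decreases.
        if cand_e < cur_e or random.random() < math.exp((cur_e - cand_e) / t):
            cur, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = cur, cur_e
        t *= cooling                 # geometric temperature schedule
    return best
</pre>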
<b>Experiment</b><br />
As test document I adopted the complete version of the file we used in the last post: <a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">Nuclear_weapon</a>.<br />
Here are the clusters of the first 100 relevant words extracted:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKuP6hJi49LGuiv2h1NvsMH4Tmi1UNywXy6DLyq39GmljubQrW7yMRoWDYgJYEuWhvInoEXc6BTIuJYyRlNNfuJyOvdi2vZ2X72SiTxnWylsv1dARLwiwdd0lpLnUt8uv6uFykwN-QhR4/s1600/clusters.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKuP6hJi49LGuiv2h1NvsMH4Tmi1UNywXy6DLyq39GmljubQrW7yMRoWDYgJYEuWhvInoEXc6BTIuJYyRlNNfuJyOvdi2vZ2X72SiTxnWylsv1dARLwiwdd0lpLnUt8uv6uFykwN-QhR4/s400/clusters.jpg" width="233" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The three clusters obtained.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
It's worth highlighting the following considerations:<br />
<br />
<ul>
<li>The first cluster merged together words such as "<b>material, uranium, plutonium, isotope</b>" and "<b>war, attack, arm</b>", and also "<b>proliferation, movement, control, development</b>".</li>
<li>The second cluster (which has the lowest rank) aggregates words such as "<b>japan, japanese, place, israel, iraq, american</b>" and "<b>ton, tnt, yeld</b>". </li>
<li>The third cluster (which has the highest rank) describes the primary topic quite well, merging all the most important words of the document! </li>
</ul>
<br />
Of course, the procedure is still in its "incubator" phase, and the accuracy of the clusters rests on the performance of the annealing clustering (...maybe different algorithms would perform better in this context... but to show a rough solution I guess it's enough :D)<br />
This is the optimization process for the last merging stage (I presume the temperature schedule requires some adjustment):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlttbjYmDLAABOFfR4UUrVaq5HYG-afHLLXdOf3xjnuFxFEUP04bIwJnEwhYLHRxwqYCccQhKFzcoXMTNDIIFTol7lZaLjBIX_hJYZDLlOB1jrMM-ZvhlczE3B4GzQ2HdTMDrkjuXsfLA/s1600/HSAC.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="197" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlttbjYmDLAABOFfR4UUrVaq5HYG-afHLLXdOf3xjnuFxFEUP04bIwJnEwhYLHRxwqYCccQhKFzcoXMTNDIIFTol7lZaLjBIX_hJYZDLlOB1jrMM-ZvhlczE3B4GzQ2HdTMDrkjuXsfLA/s320/HSAC.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Optimization curve through Simulated Annealing Hierarchical Clustering (last merging stage)</td></tr>
</tbody></table>
<b><br /></b>
<b>Next steps:</b><br />
Looking forward to receiving comments and suggestions.<br />
...It would be interesting to use this methodology to create a new kind of full-text search engine, totally independent of word frequency and visit frequency.<br />
<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>The doc</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Here is the document parsed and colored according to the clustering assignment (only the first 100 relevant features, ranked through the graph entropy method, have been highlighted).<br />
<br />
<span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; color: #006a79; font-family: Verdana; font-size: 12px; font-weight: 800;"><a href="http://www.ziddu.com/download/20701582/nuclearWeapons.pdf.html" target="_blank">http://www.ziddu.com/download/20701582/nuclearWeapons.pdf.html</a></span></div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Stay tuned</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
cristian.<br />
<br />
<img src="webkit-fake-url://FC72FDF0-7156-4CD7-BC1B-3C5B91AF5858/application.pdf" /></div>
</div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-84409126064949759322012-09-24T13:26:00.002-07:002012-09-24T13:39:37.733-07:00Graph Entropy to extract relevant wordsI would like to share with you some early results of research I'm doing in the field of "graph entropy" applied to text mining problems.<br />
<br />
There are many definitions of graph entropy; my favorite is very well described in the work of J. Körner: "Coding of an information source having ambiguous alphabet and the entropy of graphs" (1973).<br />
<span class="Apple-style-span" style="font-family: inherit;"><br /></span>
<b>Why Graph Entropy is so important?</b><br />
Based on the main concept of entropy, the following assumptions hold:<br />
<br />
<ul>
<li>The entropy of a graph should be a functional of the stability of the structure (so that it depicts in some way the distribution of the edges of the graph).</li>
<li>Subsets of vertexes quite isolated from the rest of the graph are characterized by a high stability (low entropy).</li>
<li>It's quite easy to use entropy as a measure for graph clustering.</li>
</ul>
<div>
As you can imagine a smart definition of graph entropy can be helpful in many problems related to text mining.</div>
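<div>
To make the first assumption tangible, here is a toy stand-in: the Shannon entropy of a graph's degree distribution. To be clear, this is neither Körner's entropy nor the definition I'm studying (which isn't published yet); it only illustrates the idea of entropy as a functional of how the edges are distributed:</div>
<pre>
import math
import networkx as nx

def degree_entropy(G):
    # Shannon entropy of the normalized degree sequence: the more
    # unevenly the edges are spread over the vertexes, the lower it is.
    degs = [d for _, d in G.degree() if d]
    total = sum(degs)
    return -sum((d / total) * math.log2(d / total) for d in degs)

# A 10-cycle spreads its edges evenly; a 10-node star concentrates them.
print(degree_entropy(nx.cycle_graph(10)))   # ~3.32 (maximal)
print(degree_entropy(nx.star_graph(9)))     # ~2.58 (lower)
</pre>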
<div>
<br /></div>
<div>
Let's see an application of graph entropy to extract relevant words in a document.</div>
<div>
<br /></div>
<div>
The experiment has been done using the first section of the definition of "<a href="http://en.wikipedia.org/wiki/Nuclear_weapon" target="_blank">nuclear weapons</a>".</div>
<div>
<br /></div>
<div>
<b>Results</b></div>
<div>
<b>Graph Entropy:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9LppZJt8wV-LXteRj0y5QXOyidxGFDdWK5QhqRI7hM74mN2HtZhhniVvRhnLnhPtz2tG4TkHCtArN70qMd3kChBagiehZnGm6r0EI4V_OAX1kLx0pAddcSHebRgtiITLGuHsjbpBHBcE/s1600/grEntr.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9LppZJt8wV-LXteRj0y5QXOyidxGFDdWK5QhqRI7hM74mN2HtZhhniVvRhnLnhPtz2tG4TkHCtArN70qMd3kChBagiehZnGm6r0EI4V_OAX1kLx0pAddcSHebRgtiITLGuHsjbpBHBcE/s400/grEntr.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been highlighted the relevant words extracted through Graph Entropy</td></tr>
</tbody></table>
<div>
Here are the first 25 words extracted - in red I depicted the words that, in my opinion, shouldn't have been selected:</div>
<div>
weapon, reaction, nuclear release, <span class="Apple-style-span" style="color: red;">consider, acknowledge</span>, explosive, weaponsa, detonate, test, ton, bomb, energy, tnt, first, <span class="Apple-style-span" style="color: red;">possess</span>, small, device, <span class="Apple-style-span" style="color: red;">unite</span>, hiroshima, <span class="Apple-style-span" style="color: red;">chronologically</span>, thermonuclear, force, nagasaki</div>
<div>
<b><br /></b></div>
<div>
<b>Frequency based:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMvSffLIWow_IrF8yFrkVuNX9jLSS12Vcy-DQ-gPZJ0tbesA5TOjdY_0Uv-NVX7YDGsjZhEUsw-Xjg75GgQofjOtFTzgvC_2GwvhnBuimZGMWzh5oH7y7oRMNYiKvgYYc6nHjr4DS5PLE/s1600/freq.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMvSffLIWow_IrF8yFrkVuNX9jLSS12Vcy-DQ-gPZJ0tbesA5TOjdY_0Uv-NVX7YDGsjZhEUsw-Xjg75GgQofjOtFTzgvC_2GwvhnBuimZGMWzh5oH7y7oRMNYiKvgYYc6nHjr4DS5PLE/s400/freq.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">In red have been highlighted the relevant words extracted through Frequency relevance.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div>
Words extracted through the frequency relevance: </div>
<div>
nuclear, weapon, bomb, fission, test, <span class="Apple-style-span" style="color: red;">possess</span>, bombing, detonate, <span class="Apple-style-span" style="color: red;">state</span>, <span class="Apple-style-span" style="color: red;">unite</span>, tnt, ton, first, energy, release, <span class="Apple-style-span" style="color: red;">acknowledge</span>, weaponsa, <span class="Apple-style-span" style="color: red;">status</span>, japan, nagasaki, hiroshima, <span class="Apple-style-span" style="color: red;">japanese, name, code, type</span></div>
<div>
<span class="Apple-style-span" style="color: red;"><br /></span></div>
<div>
<b>Closeness Centrality:</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18RWsZ6LTYDF-h0iEfRBjp9_3PACBi2WXoPDSQqYmSbms-ipWI7tY12zrLtmYjaEgM8SSYlokEjXm1PZ-ttbYg2YqPKYhz9PexjUZ2yGu_hFk6Tgr5zrdukWkuyoxhyphenhyphenon64_jbOUPnUQ/s1600/closenessCentr.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18RWsZ6LTYDF-h0iEfRBjp9_3PACBi2WXoPDSQqYmSbms-ipWI7tY12zrLtmYjaEgM8SSYlokEjXm1PZ-ttbYg2YqPKYhz9PexjUZ2yGu_hFk6Tgr5zrdukWkuyoxhyphenhyphenon64_jbOUPnUQ/s400/closenessCentr.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In red have been highlighted the relevant words extracted through Closeness Centrality method.</td></tr>
</tbody></table>
<div>
<b></b><br />
<div style="font-weight: normal;">
<b>Words extracted through Closeness Centrality: </b></div>
<b>
</b>nuclear, weapon, detonate, <span class="Apple-style-span" style="color: red;">possess</span>, nagasaki, first, thermonuclear, small, force, <span class="Apple-style-span" style="color: red;">estimate</span>, <span class="Apple-style-span" style="color: red;">debut, fabricate, succeed</span>, radiation, tnt, <span class="Apple-style-span" style="color: red;">acknowledge, consider, believe</span>, hiroshima, <span class="Apple-style-span" style="color: red;">know, nation</span>, boy, explode, matte, <span class="Apple-style-span" style="color: red;">date.</span></div>
<div>
<b><br /></b></div>
<div>
<b>Considerations:</b></div>
<div>
<b><br /></b></div>
<div>
<b>
</b>
<li style="font-weight: normal;"><b>The method based on graph entropy seems provide the more accurate results (5 errors respect 9 and 11 of the other methods).</b></li>
<b>
<li style="font-weight: normal;">The graph entropy depicts better the core of the graph containing the relevant words.</li>
<li style="font-weight: normal;">I tried to expand the number of relevant features and the accuracy of the other two methods tends to worsen quickly:</li>
</b></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ1VMQoR_2DB_GeZn7_YJSW0GFO5C5YmW3bXWdziel8GcJag-QxPIP4-nvn2R9ZFrfvs-PfA7iinVDGhCFSJGMOq-BdBIlrLsrpKs286dD1QX96TSGtmzmX_kE6fsMClV19JKTXXLXacA/s1600/extracted.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ1VMQoR_2DB_GeZn7_YJSW0GFO5C5YmW3bXWdziel8GcJag-QxPIP4-nvn2R9ZFrfvs-PfA7iinVDGhCFSJGMOq-BdBIlrLsrpKs286dD1QX96TSGtmzmX_kE6fsMClV19JKTXXLXacA/s400/extracted.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<span class="Apple-style-span" style="font-size: x-small;">First 40 relevant words using Graph Entropy, Frequency method and Closeness Centrality.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<span class="Apple-style-span" style="font-size: x-small;">Notice how the graph Entropy preserves better the core of the graph respect the other two methods.</span></div>
<div>
<div>
<br /></div>
<div>
Stay tuned<br />
cristian</div>
<br /></div>
Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com2tag:blogger.com,1999:blog-7631116270195175228.post-82492852385792613762012-08-12T01:58:00.000-07:002012-08-12T01:58:04.382-07:00Function minimization: Simulated Annealing led by variance criteria vs Nelder Mead<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><b><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Most data mining problems can be reduced to a minimization/maximization problem.</span></span></span></b><br />
Whenever you are looking for the best trade-off between costs and benefits, you are solving a minimization problem.<br />
<div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Often the number of the variables that affects the cost function is high and the domain of these variables is in a dense set and, last but not least: in the real world problem often, you have no clue about the analytical form of the function.</span></span></span><br />
<span class="Apple-style-span" style="border-collapse: collapse; color: #222222;">Formally such problems belong to the multidimensional unconstrained optimization family.</span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><b><span lang="EN"><span style="font-family: inherit; font-size: small;">Examples</span></span></b></span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Let's consider easy scenarios where the function cost is conditionated just by two parameters.<span class="Apple-style-span" style="font-size: x-small;"><b></b></span></span></span></span></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi83hQhVAi8v-d7ATj3RsNortIgaXY2Rcfs4A0eW-fkwE2F3LHrlgwvKXeGFe0T9kc_WrXC9rTJzwKk1VNmNGUVkXvqpyiEHBPGEk11PVHfDVwYq5AwGR8VvcZQeJn1bS7KBGcwaeO64cs/s1600/f1.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi83hQhVAi8v-d7ATj3RsNortIgaXY2Rcfs4A0eW-fkwE2F3LHrlgwvKXeGFe0T9kc_WrXC9rTJzwKk1VNmNGUVkXvqpyiEHBPGEk11PVHfDVwYq5AwGR8VvcZQeJn1bS7KBGcwaeO64cs/s400/f1.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Schaffer f6 function: in the contour plot dark purple to depict local minimums.</td></tr>
</tbody></table><br />
<b><span class="Apple-style-span" style="border-collapse: collapse; color: #222222; font-weight: normal;"><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiKB7PlnS7NJO3Wwfu7T4qSXBZhLTakIu5ugExRPrq07N_hwmhYeDz6fEh2p-9lAowYNSok-dduqQoUO-ObO1Nm7yDyBlGkPqp-JP2fVLTjQuER2UhGgAL4Ry7yJbEonCP-xfT1e2zhdA/s1600/f2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiKB7PlnS7NJO3Wwfu7T4qSXBZhLTakIu5ugExRPrq07N_hwmhYeDz6fEh2p-9lAowYNSok-dduqQoUO-ObO1Nm7yDyBlGkPqp-JP2fVLTjQuER2UhGgAL4Ry7yJbEonCP-xfT1e2zhdA/s400/f2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Another function where minimum ischaracterized by two valleys </td></tr>
</tbody></table></span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEyPvRiyzEPhJQH_HP6KUFDkpW209G6xLXd_wDJIB7IaTNmrdwvabOR5BLELShKANsesnGC7IQZBcoieD0BctR-wbLfFU_VRwc-k9wTIxJUID1MNrE-0i4rXv1ehpKjt-n6dz-v3BSKzU/s1600/f3.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEyPvRiyzEPhJQH_HP6KUFDkpW209G6xLXd_wDJIB7IaTNmrdwvabOR5BLELShKANsesnGC7IQZBcoieD0BctR-wbLfFU_VRwc-k9wTIxJUID1MNrE-0i4rXv1ehpKjt-n6dz-v3BSKzU/s400/f3.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This function presents several local minimums well depicted in the contour plot</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG6SpZmfpBmmiEQ2tUGldcwh6rvVK6sWhTpCPt9nJU8dQJpnB9QYol5HZpZJa3YjPzZWF_iOvuHjXIu0XyuYS94XyjpTeyNhWF00ecf5fiB2M7Zb6iMtX1dknlGo9JlOpn5-Sg_fLk8oE/s1600/f5.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG6SpZmfpBmmiEQ2tUGldcwh6rvVK6sWhTpCPt9nJU8dQJpnB9QYol5HZpZJa3YjPzZWF_iOvuHjXIu0XyuYS94XyjpTeyNhWF00ecf5fiB2M7Zb6iMtX1dknlGo9JlOpn5-Sg_fLk8oE/s400/f5.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This function is characterized by a several local minimums having high range of values </td></tr>
</tbody></table><span class="Apple-style-span" style="border-collapse: collapse; color: #222222;"></span><br />
<div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span lang="EN"><span style="font-size: small;"><span style="font-family: inherit;">As you can see the above problems have in common a massive presence of local minimum :).</span></span></span></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span lang="EN"><span style="font-family: inherit; font-size: small;">Let's see how to handle these problems through an approach that I define hybrid, that is, obtained mixing different methods and properties.</span></span></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><b><span lang="EN"><span style="font-family: inherit; font-size: small;">Simulated Annealing (SA) "variance based" approach</span></span></b></div><div class="MsoNormal" style="font-size: 13px; margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;"><span style="font-size: small;"><span style="font-family: inherit;"><i><span lang="EN">Disclaimer: </span></i><span lang="EN">I had no the time to check whether this method has been already published somewhere. The underlying idea is quite simple, so I would assume that someone has already spent time in proofing convergence and better variations (and actually I don't think is a rocket science proof it).</span></span></span></div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">SA, belongs to the family of numerical method based on "search strategy" and its convergence requirements are related to the stationary conditions inducted by the underlying markovian process.</div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">In a former post I showed an its application in a discrete domain, where at each iteration we chose the next candidate solution by comparison with the current solution. The new solution was found "fishing randomly" a candidate in a discrete space.</div><div class="MsoNormal" style="margin-bottom: 10pt; margin-left: 0cm; margin-right: 0cm; margin-top: 0cm; text-align: justify;">In the above problems the solution space is continuos, so we need a smart strategy to extract the new candidate. Let's see my approach!</div><div><span class="Apple-style-span" style="font-family: inherit;"><b>Variance trick </b> </span></div>Let's consider a sampling in two different region of a function <i>f :</i><br />
<ul><li>take sampling_1 in a region having a smooth minimum and evaluate those points;</li>
<li>take sampling_2 in a region having a spiky minimum and evaluate those points.</li>
</ul><div>Now take the variance of these evaluated sets.<br />
How does the variance of <i>f(sampling_1)</i> compare with the variance of <i>f(sampling_2)</i>?<br />
Here you are the answer:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMr5fHIqUTZ5NYY6amJsSEaRle1zBvH4-ujFV1iqXBPuhm7tyt24L-DKXWihpQ44iNEiTr05PPpFP7ldKIIpYFpbrIkjmyqate5JiZkVGiPtbtoOC-pvt3SNgpplbJVuoixMCD1jgw7i4/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMr5fHIqUTZ5NYY6amJsSEaRle1zBvH4-ujFV1iqXBPuhm7tyt24L-DKXWihpQ44iNEiTr05PPpFP7ldKIIpYFpbrIkjmyqate5JiZkVGiPtbtoOC-pvt3SNgpplbJVuoixMCD1jgw7i4/s400/Slide1.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: justify;">As you can see the variance can be used as indicator of a minimum region of the cost function.</div><div class="separator" style="clear: both; text-align: justify;">Instead of explore randomly the solution space, the approach I propose is led by the variance used as above. </div><div class="separator" style="clear: both; text-align: center;"></div><div class="" style="clear: both; text-align: justify;">What happens if the smooth valley is a global minimum?</div><div class="" style="clear: both; text-align: justify;">There are two intuitive factors to avoid that the algorithm jams in a local minimum:</div><div class="" style="clear: both; text-align: justify;">1. The acceptance law admit also pejorative solutions.</div><div class="" style="clear: both; text-align: justify;">2. The sampling procedure moderate by variance (if the variance is chosen high enough) allows explorations of better regions.</div><b><br />
</b> <b>The algorithm</b><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtRQT-pgb61cDJ9kc6AyIyaKd62v8qu__ia-Fh4VRxmxwZbTfqeEnvKiNPZmwedw3YE4qChhiVdVB4h08GWNxiLQ-BdaN8g24hrETTGhEaabHqzjVjXh36wn7aA5_5ZBJwfhFzZMFz4Eg/s1600/Slide1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtRQT-pgb61cDJ9kc6AyIyaKd62v8qu__ia-Fh4VRxmxwZbTfqeEnvKiNPZmwedw3YE4qChhiVdVB4h08GWNxiLQ-BdaN8g24hrETTGhEaabHqzjVjXh36wn7aA5_5ZBJwfhFzZMFz4Eg/s400/Slide1.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The meta code of the Simulated Annealing based on variance criteria</td></tr>
</tbody></table>Many optimizations can be done to the algorithm.<br />
<div>For instance, we could trigger the annealing acceptance procedure only if<br />
<div style="text-align: center;">min(evaluate(f, newSet)) > min(evaluate(f, curSet))</div><br />
</b> <b>Tests</b><br />
Let's see how the approach works.<br />
As usual, all my experiments are compared with other techniques. In this case I chose the well-known Nelder Mead method as the comparative baseline.<br />
I tried to optimize the Nelder Mead method by playing with its parameter settings: shrink ratio, contraction ratio and reflection ratio.<br />
All experiments have been done using the same initial points.<br />
<u>For the Nelder Mead method I plotted just the unique solutions found.</u><br />
<u>The approach I propose has always been run with the same settings (3000 cycles, variance sampling = 0.75).</u><br />
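<div>As a reference for reproducing the baseline, Nelder Mead is available off the shelf, for instance in SciPy; a minimal sketch (the starting point is arbitrary, and schaffer_f6 is the test function defined in the sketch above):</div>
<pre>
from scipy.optimize import minimize

# Nelder Mead baseline on the same test function; result.x is the
# solution found and result.fun its cost.
result = minimize(schaffer_f6, x0=[40.0, -30.0], method="Nelder-Mead")
print(result.x, result.fun)
</pre>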
<span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b>Experiment 1. </b></span></span><b><span style="font-family: arial, sans-serif; font-size: x-small;"></span></b><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyXlIIQUF57HqLupArLiYa_1LBjwScnPy8VzbgMBEsdYgAqkVPL7kkfZPXLT8zXsBA9yNPjhhZaIJOu3K7VaAo7phIxvl3KkfIyjK-1QqRNtcKdeOHvpZynMPuVXJpiix1jtrhvwUJ0T0/s1600/schafferf6res2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyXlIIQUF57HqLupArLiYa_1LBjwScnPy8VzbgMBEsdYgAqkVPL7kkfZPXLT8zXsBA9yNPjhhZaIJOu3K7VaAo7phIxvl3KkfIyjK-1QqRNtcKdeOHvpZynMPuVXJpiix1jtrhvwUJ0T0/s400/schafferf6res2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.<br />
The last chart shows the space solution explored.</td></tr>
</tbody></table><span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;">Notice that method found a better solution respect Nelder Mead approach. </span></span><br />
<span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b><br />
</b></span></span> <span class="Apple-style-span" style="color: #222222;"><span class="Apple-style-span" style="border-collapse: collapse;"><b>Experiment 2</b></span></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXl8JKBZSVEfS7rv_NFP6UTSeomvWhlPW5FAPtmD__85k9fiKIEBo1EoWrEbA1vrAQCosW6jz_5yJUd1yDo8qjQxlH3UlW42SOVEA6mm1QzOTKWDWJH7BiJDfLUI8ZU4NGVcwD2Svw9po/s1600/f2res.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXl8JKBZSVEfS7rv_NFP6UTSeomvWhlPW5FAPtmD__85k9fiKIEBo1EoWrEbA1vrAQCosW6jz_5yJUd1yDo8qjQxlH3UlW42SOVEA6mm1QzOTKWDWJH7BiJDfLUI8ZU4NGVcwD2Svw9po/s400/f2res.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table>SA found a better solution in very few steps.</div><div><br />
</div><div><b>Experiment 3</b></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvO1yZiAeLL1FKI_7JfxVh80dD15W8PiMK0hqP_LDt_Vt1DxHWl7b8K03s2H7wimarhGJPKvV4XU5Zr7ArS3YKCAuU-N-nG0aFJN5lweDx78FVVByl_jUMuKsCEAOmdS7wiCKGzbFERr8/s1600/f3res.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvO1yZiAeLL1FKI_7JfxVh80dD15W8PiMK0hqP_LDt_Vt1DxHWl7b8K03s2H7wimarhGJPKvV4XU5Zr7ArS3YKCAuU-N-nG0aFJN5lweDx78FVVByl_jUMuKsCEAOmdS7wiCKGzbFERr8/s400/f3res.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table><div>This experiment shows once again the good performance of this approach, even on a very challenging problem!</div><div><br />
</div><div><b>Experiment 4</b></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMDFKcWIQ97eTXhAje5BNDdlenbFsQYyIB05Cp5uNgVSGntOWcW1nrMzoHqy_o0suz-g0Pn_qeykm_c7TMIpzkEGpqnos2yBhkOVKP9zaVePgzPyl5h5X6nQws9nQReqjU4PDeLInQHgw/s1600/f5Res2.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="343" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMDFKcWIQ97eTXhAje5BNDdlenbFsQYyIB05Cp5uNgVSGntOWcW1nrMzoHqy_o0suz-g0Pn_qeykm_c7TMIpzkEGpqnos2yBhkOVKP9zaVePgzPyl5h5X6nQws9nQReqjU4PDeLInQHgw/s400/f5Res2.bmp" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="font-size: medium; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;">On the left side it has been represented the solution found by Nelder Mead method.<br />
On the right side the solution found by the SA presented in this post: the blue point depicts the starting point and the red point depicts the solution found.</td></tr>
</tbody></table></td></tr>
</tbody></table>The convergence, as shown above, was reached very quickly.<br />
<div><br />
</div><div><b>Conclusion</b></div><div>The approach seems promising and works well in different contexts: I'm quite sure that many improvements can be implemented.</div><div>I'm looking forward to receiving your comments, ideas and other comparative tests.</div><div><br />
</div><div>Stay tuned!<br />
<div><b><br />
</b></div><div>cristian </div></div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com1tag:blogger.com,1999:blog-7631116270195175228.post-26828389041015328432012-07-10T11:09:00.000-07:002012-07-10T11:09:54.095-07:00Data Mining: Tools and CertificatesAs a member of many LinkedIn groups related to data mining & text mining, I read many threads about certificates that supposedly help both in job seeking and in consolidating a curriculum, and many other threads about miraculous tools able to solve whatever problem.<br />
<b>Is being certified really worth it?</b><br />
In my experience, a certificate in a specific data mining tool can be a plus on a curriculum, but it doesn't really help you improve your knowledge of the field.<br />
Let me explain better (<b>which is not easy with my bad English</b>): the certification system is a market, and its goal is to generate profit or to promote products.<br />
<br />
<b>Data mining tools</b><br />
My question is: do you really think a tool can exist that embraces all aspects of data mining?<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I guess that the number of problems data mining related are so high that maybe we could use the Cantor diagunalization to proof that are uncountable :)</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><br />
In my opinion, the common belief that by clicking here and there in some software you can obtain tangible benefits in mining your data is too naive.<br />
The "data mining" definition has been created by marketing industries just to summarize in a buzz word technics of applied statistics and applied mathematics to the data stored in your hard disk.<br />
I don't want to say that tools are useless, but it should be clear that a tool is only a means to solve a problem, not the solution itself.<br />
<ul><li>In the real world, problems are never standard, and you can really seldom take an algorithm as-is to solve them! ...Maybe I'm unlucky, but I have never solved a real problem through a standard method.</li>
<li>Tool X is able to load terabytes of data. So what? A good data miner should know that you don't need the entire population to analyze a phenomenon; you should be able to sample your population in order to ensure the required margin of accuracy! ...This technique is simply called statistics!</li>
<li>If you really want to claim "I know this approach very well", you must be able to implement it yourself: only by implementing it yourself can you deeply understand in which contexts the algorithm works, under which conditions it performs better than others, and so on. Don't rely on a single paper that compares a few techniques: if you change just one of the conditions, the results are terribly different.</li>
<li>Without theory you cannot go deep. Let's consider a tool like Mathematica or R: these tools give the user access to a large library of predefined algorithms and routines, they provide visualization functions to show results in a fancy way and, last but not least, they provide a complete programming language to code whatever you want. I love them, but I couldn't do anything without the theory behind the problem. Mathematica can provide me the algorithm to cluster a data set through k-means: but how can I be sure that it is the right algorithm for my problem? (<a href="http://textanddatamining.blogspot.it/2011/07/kmeans-easy-clustering-algorithm-maybe.html" target="_blank">click here to have a demo</a>).</li>
</ul><div>Actually, I would rather attend a course to deepen some aspects of multivariate statistics, or seminars on new methodologies for solving some problem, than pay plenty of money to learn every single detail of a tool that maybe won't be on the market in the next 5 years.<br />
<br />
</div><div>I know that companies often look for people certified on a famous tool just because they bought it and need to reduce the time to "integrate" a new resource into a team. Fair enough! ...but I think it is ridiculous to make certificates a strict requirement!</div><div>I'm really curious to hear about your experiences and opinions.<br />
cristian<br />
<br />
<br />
</div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com6tag:blogger.com,1999:blog-7631116270195175228.post-37497279246983918312012-07-03T11:46:00.001-07:002012-07-03T11:46:34.195-07:00Simulated Annealing: How to boost performance through Matrix Cost rescaling<div class="separator" style="clear: both; text-align: center;"></div>One of the most widely used algorithms in machine learning is simulated annealing (SA).<br />
The reasons for its popularity lie in:<br />
<ul><li>Simplicity of implementation</li>
<li>Broad spectrum of applicability</li>
</ul>There are many implementations of it available for almost all languages, and many books and papers on the topic; personally, the best book I can suggest is <a href="http://www.amazon.it/Simulated-Annealing-Boltzmann-Machines-Combinatorial/dp/0471921467/ref=sr_1_23?ie=UTF8&qid=1341308954&sr=8-23" target="_blank">Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing</a>, written by Korst & Aarts.<br />
<br />
<b>The problem</b><br />
In this post I'm going to describe a very canonical combinatorial problem (NP-hard) often approached through annealing methods: the <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem" target="_blank">complete Asymmetric Travelling Salesman Problem (ATSP)</a>.<br />
The ATSP, as you know, can be taken as a base to model several problems, for instance network routing, chemistry and physics problems. The ATSP is so charming (at least for me) because the nature of the problem is very easy and the way to find the best solution is extremely trivial: you just have to explore all the possible solutions!<br />
The only problem is the number of solutions you have to compare:<br />
<ul><li>with 10 towns you have to explore ~3.6*10^6 solutions (essentially 10!).</li>
<li>with 50 towns you have to explore ~3*10^64 solutions (essentially 50!).</li>
</ul>In the past (I'm not aware whether they are still running) there were many competitions to find the best solution of the asymmetric, complete TSP with more than 30000 towns.<br />
<br />
...Considering the time required to obtain the optimal solution, in many cases a sub-optimal solution obtained by exploring less than 1% of the solution space (as usually happens with SA) could be enough!<br />
<b><br />
</b><br />
<b>The experiments</b><br />
I considered two instances of the problem, the first one with 10 towns and the second one with 50 towns.<br />
Even if there are more efficient algorithms for this kind of TSP, I chose simulated annealing just to show how versatile it is and to show how <b>essential the rescaling operation is in order to maximize the performance of your algorithm. </b><br />
<br />
<b>Why rescaling? </b>(I'll try to be as little technical as possible)<br />
SA draws its power from the ability to overcome local minima by accepting, at time "t", a worse solution than the better one found at time "t-1". It assigns an acceptance probability, based on a sigmoidal function, that decreases as the cost of the new solution grows with respect to the current one (for instance, a Glauber-style rule $P_{accept} = 1/(1+e^{\Delta/T})$, with $\Delta$ the cost difference and $T$ the temperature, has this shape).<br />
So if the cost distance between the current solution and the new one is very high, the acceptance probability will be very small.<br />
This probability is mitigated by the temperature schedule, which allows a higher acceptance rate (of bad solutions) when the temperature of the system is high, and reduces it as the temperature decreases during the exploration of the solution space.<br />
The aim of the rescaling is to amplify the cost of the expensive legs between towns and to shrink the cost of the cheap, short ones.<br />
To obtain this, I apply a "quasi" n log n function to the cost matrix.<br />
(A publication on the topic should be ready by the end of the year).<br />
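<div>Since the exact rescaling function is not disclosed here (it is left for the upcoming publication), the Python sketch below only illustrates the mechanism with a plausible stand-in, c*log(1+c), which widens the gap between expensive and cheap legs. The matrix generator and the plain swap-based SA are my own illustrative reconstructions of the setup described under Results, not the original Mathematica code; note that the search runs on the rescaled costs while the final cost is reported on the original matrix.</div>
<pre>
import math
import random

import numpy as np

def rescale_costs(C):
    # Illustrative "quasi n log n" rescaling: amplifies large costs and
    # keeps small ones small. NOT the exact function used in the post.
    C = np.asarray(C, dtype=float)
    return C * np.log1p(C)

def make_cost_matrix(n, high, seed=0):
    # Random integer costs in [1, high]; a cheap "perimeter" cycle of
    # cost-1 legs is planted as a reference optimum (illustrative).
    rng = np.random.default_rng(seed)
    C = rng.integers(1, high + 1, size=(n, n)).astype(float)
    for a in range(n):
        C[a, (a + 1) % n] = 1.0
    np.fill_diagonal(C, 0.0)
    return C

def tour_cost(C, tour):
    return sum(C[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def sa_atsp(C, cycles=2000, beta=5.0, rescaled=True):
    C = np.asarray(C, dtype=float)
    S = rescale_costs(C) if rescaled else C   # search on the rescaled costs
    tour = list(range(len(C)))
    random.shuffle(tour)
    for t in range(1, cycles + 1):
        i, j = random.sample(range(len(C)), 2)
        cand = tour[:]
        cand[i], cand[j] = cand[j], cand[i]   # swap two towns
        delta = tour_cost(S, cand) - tour_cost(S, tour)
        # Acceptance of worse tours fades as the system cools (t grows).
        if delta <= 0 or random.random() < math.exp(-beta * delta * t / cycles):
            tour = cand
    return tour, tour_cost(C, tour)           # report cost on the ORIGINAL matrix

print(sa_atsp(make_cost_matrix(10, 10)))
</pre>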
<br />
<b>Results</b><br />
All experiments, as usual, have been done through a "home made" application written in Mathematica.<br />
<ul><li>The cost matrix has been built using a uniform distribution with integer values (1 to 10 for the first experiment and 1 to 100 for the second one).</li>
<li>Just to have a reference point, I assigned the minimum value (1) to the cost matrix entries along the external perimeter of the "town system".</li>
<li>The ATSP with 10 towns has optimal solution cost = 9. The ATSP with 50 towns has optimal solution cost = 49.</li>
</ul><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjctpDNSc4wOkah-yiivxpfv_g0ijy3DPc6ctjR3j0aogzYGu8FbOBVQbf8afHz5TEXG5AZhHC5fKysOJiMAGCElERbVNnpNpm6AxjU0aDX5Lg1UwtpFvqVmJTVcwpc7uwbq1wi8_TIYVg/s1600/sa_exp1_Nb5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjctpDNSc4wOkah-yiivxpfv_g0ijy3DPc6ctjR3j0aogzYGu8FbOBVQbf8afHz5TEXG5AZhHC5fKysOJiMAGCElERbVNnpNpm6AxjU0aDX5Lg1UwtpFvqVmJTVcwpc7uwbq1wi8_TIYVg/s400/sa_exp1_Nb5.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">After 2000 cycles, SA found a solution having cost = 13. (the graph on the II quadrant represents the acceptance prob.).<br />
The initial temperature beta has been set = 5.</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEOCOqeNGA7Xb-yZlNFnnBpsgkzaWUY7bWKIqI4zw9rLYWvqwI0N5aDpbiZN_OO6URKSBcRrZNy7Y7NBFtMIbkNqXzoewRSlgvq9k9bAHSfyyXs8ThX2UCWMq3bAKDTYNu_DiOhdHbKzs/s1600/sa_exp1_Mb5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEOCOqeNGA7Xb-yZlNFnnBpsgkzaWUY7bWKIqI4zw9rLYWvqwI0N5aDpbiZN_OO6URKSBcRrZNy7Y7NBFtMIbkNqXzoewRSlgvq9k9bAHSfyyXs8ThX2UCWMq3bAKDTYNu_DiOhdHbKzs/s400/sa_exp1_Mb5.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">After 1000 cycles, SA with MatrixCost rescaled found the optimal solution having cost 9.<br />
The initial temperature beta has been set = 5.</td></tr>
</tbody></table>I repeated the same experiment increasing the initial temperature to 15.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtqOf8NwedkWvW49oNlllpbo6PuSfknaR6rEIgNoQP76Dlml4W0hUQK3n9xK301-vFSguMxK7KD4gOG4OO5OqDvyY9W6d-jslH0LxFzjfDm_Vg8-dlN2BhMnONyXNpzGXIkRrHmkrANMc/s1600/sa_exp1_Nb15.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtqOf8NwedkWvW49oNlllpbo6PuSfknaR6rEIgNoQP76Dlml4W0hUQK3n9xK301-vFSguMxK7KD4gOG4OO5OqDvyY9W6d-jslH0LxFzjfDm_Vg8-dlN2BhMnONyXNpzGXIkRrHmkrANMc/s400/sa_exp1_Nb15.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SA launched with beta= 15. After 5000 cycles the sub optimal solution found has cost = 14<br />
<br />
</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvfgIdHejECWLt69zLRBPyVq8dJDlBg3w0_322jiuPQNpzyIA909Jfu77uY-PTFZrJLVaMkoh7k2qUf6fIJOK6LzKtD-RYt9wBTUgyp5zlykMhnk02AWckjnrxlnN7RBE8EwXrJu9IDUs/s1600/sa_exp1_Mb15_2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvfgIdHejECWLt69zLRBPyVq8dJDlBg3w0_322jiuPQNpzyIA909Jfu77uY-PTFZrJLVaMkoh7k2qUf6fIJOK6LzKtD-RYt9wBTUgyp5zlykMhnk02AWckjnrxlnN7RBE8EwXrJu9IDUs/s400/sa_exp1_Mb15_2.jpg" width="385" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SA with matrix cost rescaled after 2000 cycles found better solution (same initialization of the above experiment)</td></tr>
</tbody></table>A larger test has been done to understand the statistical behavior:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8NYojqaJVAPMe6gep1CvhhOY6Xu1nR6Ne9dktDCm7gIvam1Uiy7pKUkhEFWgRKssACcP7hH6fL9rOrxXeC_w38rafmVkidt65KfiYG6yCGyqprVCvpdQezqSSt934h0tTd1keYbXUzVY/s1600/comparisonbeta5.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8NYojqaJVAPMe6gep1CvhhOY6Xu1nR6Ne9dktDCm7gIvam1Uiy7pKUkhEFWgRKssACcP7hH6fL9rOrxXeC_w38rafmVkidt65KfiYG6yCGyqprVCvpdQezqSSt934h0tTd1keYbXUzVY/s400/comparisonbeta5.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The blue line represents the solution costs of the traditional SA.<br />
The red line represents the solution costs of the SA with matrix cost rescaled </td></tr>
</tbody></table><div><br />
<div>And to conclude, here are the results obtained with higher complexity (50 towns).</div><div><br />
</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHpbYmWmXiRMT1_QtfsQwFuBTXUFnzkNfv_jTFJ0CYBmtmY7yKe0szylHZcENhzGX7dHuPTOWp-6I8ruRIsqJUm5tH42jrcoAV9MOfjR7yG0Mh4d9at1MvkVCvoMwPR3tFQzS4LZfBPQ/s1600/comparisonbeta5_prbDim50.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaHpbYmWmXiRMT1_QtfsQwFuBTXUFnzkNfv_jTFJ0CYBmtmY7yKe0szylHZcENhzGX7dHuPTOWp-6I8ruRIsqJUm5tH42jrcoAV9MOfjR7yG0Mh4d9at1MvkVCvoMwPR3tFQzS4LZfBPQ/s400/comparisonbeta5_prbDim50.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ATSP (50 towns) comparison between traditional SA and SA with rescaled matrix cost.<br />
<br />
<br />
</td></tr>
</tbody></table>Let's try to increase the number of cycles from 1000 to 3000.</div><div class="separator" style="clear: both; text-align: center;"></div><div><br />
<div class="separator" style="clear: both; text-align: center;"></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMyi8-kFPsOwGco5Y_T2Alv7r44ztifZJ4yFLb_eQFSKOgyUetVuWoxWGOLJzXVwAobLQcOzp_fWbzfrbBtJP8lJquEH0YKzsBjTNBEZWQE6LPly2RpyriSb1NZ3CXRr4arZfJjZjUvlA/s1600/comparisonbeta5_prbDim50_3000.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMyi8-kFPsOwGco5Y_T2Alv7r44ztifZJ4yFLb_eQFSKOgyUetVuWoxWGOLJzXVwAobLQcOzp_fWbzfrbBtJP8lJquEH0YKzsBjTNBEZWQE6LPly2RpyriSb1NZ3CXRr4arZfJjZjUvlA/s400/comparisonbeta5_prbDim50_3000.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The % of success of the new strategy seems increase with the increase of the number of cycles.</td></tr>
</tbody></table><div><br />
Let's see what happens increasing once again the number of cycles:</div><div><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWkDfDcZFdnuveqhCtNlhAE7O1W6pdF1JUDCP_cbDakGQL4HdyhX1I4d-rn0a2CqxqCiN3g7w1b0YTLhPPdkOohyvr0t4tIjgHu_KjKXzEviQIna2SyAiT0AdMYuMmAmQoavT_0m-UEUY/s1600/comparisonbeta5_prbDim50_6000.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWkDfDcZFdnuveqhCtNlhAE7O1W6pdF1JUDCP_cbDakGQL4HdyhX1I4d-rn0a2CqxqCiN3g7w1b0YTLhPPdkOohyvr0t4tIjgHu_KjKXzEviQIna2SyAiT0AdMYuMmAmQoavT_0m-UEUY/s400/comparisonbeta5_prbDim50_6000.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The assumption seems to be confirmed: the new strategy performs better than the traditional SA even if the number of cycles is higher. </td></tr>
</tbody></table><b>Conclusion</b></div><div><ol><li>Often the accuracy/performance of an algorithm depends on how you feed it with data! Rescaling your data is almost always essential!</li>
<li>The strategy requires further tests and validation.</li>
</ol><div>...We will discuss the function used to rescale the cost matrix, and some other applications of SA to different problems, in the next posts.</div></div><div>cheers & stay tuned</div><div>P.S.</div><div>Special thanks to Tom V.: he is at the top of the list of contributors for the sponsorship program.</div></div>Cristian Mesianohttp://www.blogger.com/profile/04880057603671195464noreply@blogger.com7