Thursday, December 15, 2011

Clustering algorithm to approximate functions

In the last post we discussed function approximation through polynomials with real-valued exponents.
We also saw that obtaining a good approximation of the function requires a long training session; the reasons for this inefficiency lie on one hand in the gradient surface and on the other hand in the strategy adopted to train the system.
Let me spend a few words on the gradient surface:

The shape of the gradient surface is strictly related to the number of monomials used to approximate the function, so:

  • if you use few monomials, the approximation cannot be good, because your model has less complexity than the original function; the gradient will have a fairly regular, convex surface whose bottom corresponds to the best approximation achievable by the chosen linear combination of monomials.
  • if you use too many monomials, you are building a complex gradient function (because the number of its parameters grows), so you increase the risk of falling into a local minimum.

I mentioned that an annealing strategy can help overcome the barriers imposed by local minima, but for the time being I prefer to show you a slightly more sophisticated strategy: clustering-based learning.
As you know, clustering techniques are usually used to divide data into different categories, but I would like once again to deliver the message that in data mining (and in text mining as well) what really matters is not the product, the tool or the suite, but the knowledge, the creativity and the skill to abstract the problems and the algorithms and to adapt the latter to solve the former.

Divide et Impera (Latin motto)
... that is, divide the problem into sub-problems, find the sub-solutions and merge them to obtain the overall solution.
The strategy is very easy to describe (a minimal sketch follows the list):
1. Divide the domain of your function into k sub-intervals.
2. Initialize k monomials.
3. Consider the monomials as the centroids of your clustering algorithm.
4. Assign the points of the function to each monomial according to the clustering algorithm.
5. Use gradient descent to adjust the parameters of each monomial.
6. Go to 4 until the accuracy is good enough.
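For concreteness, here is a minimal Python sketch of the loop above. It is only an illustration, not the exact notebook code: the crisp membership simply assigns each sample to the monomial that currently predicts it best, the domain is restricted to x > 0 so that real exponents are well defined, and the function names, initialization and learning rates are my own choices.

```python
import numpy as np

def cluster_fit(xs, ys, k, steps=2000, eta_coeff=1e-3, eta_exp=1e-4):
    """Approximate ys = f(xs) with k monomials a_i * x**b_i via clustering + gradient descent."""
    rng = np.random.default_rng(0)
    # 2-3. initialize k monomials (the cluster "centroids") with random parameters
    a = rng.uniform(-1.0, 1.0, k)
    b = rng.uniform(0.0, 2.0, k)
    for _ in range(steps):
        # 4. crisp membership: each point goes to the monomial with the smallest error
        preds = a[:, None] * xs[None, :] ** b[:, None]          # shape (k, n)
        owner = np.argmin((preds - ys[None, :]) ** 2, axis=0)   # shape (n,)
        # 5. one gradient-descent step per monomial, using only its own points
        for j in range(k):
            m = owner == j
            if not m.any():
                continue
            x_j, y_j = xs[m], ys[m]
            err = a[j] * x_j ** b[j] - y_j
            grad_a = np.mean(2 * err * x_j ** b[j])
            grad_b = np.mean(2 * err * a[j] * x_j ** b[j] * np.log(x_j))
            a[j] -= eta_coeff * grad_a
            b[j] -= eta_exp * grad_b
    return a, b, owner

# toy usage: fit Sin(x) on (0, 2*pi]
xs = np.linspace(0.1, 2 * np.pi, 400)
a, b, owner = cluster_fit(xs, np.sin(xs), k=6)
```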

Basically in the former version we had this approximation:
Function Approximation through power real polynomials

In this approach, instead, we introduce the membership function, which assigns each point to a cluster (in this case a monomial):
Power Real Polynomials approximation with clustering approach
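The two captions above refer to the formula images; a plausible reconstruction of the two forms (my notation) is:

```latex
% former version: a single power-real polynomial over the whole domain
f(x) \;\approx\; \sum_{i=1}^{n} a_i \, x^{b_i}, \qquad a_i, b_i \in \mathbb{R}

% clustering version: a membership function m_i(x) assigns each point to a monomial
f(x) \;\approx\; \sum_{i=1}^{k} m_i(x)\, a_i \, x^{b_i}, \qquad
m_i(x) \in \{0,1\} \;\text{(crisp)} \quad\text{or}\quad m_i(x) \in [0,1] \;\text{(fuzzy)}, \qquad \sum_{i=1}^{k} m_i(x) = 1
```

In the crisp version each point contributes to exactly one monomial; in the fuzzy version the memberships act as soft weights.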

Let's see the result obtained with the sinusoid function (please compare with the former results obtained with the canonical gradient descent).
Clustering approach to fit the Sin(x) function
Let me show how this strategy works in more complex domains, for example in R^3 (of course you have to add an extra "y" variable to the monomial).
First Example


And here are the monomials found and the fitting obtained by applying the membership function (I chose a crisp version, but a fuzzy approach is also feasible):
The spheres represent the points belonging to the original function, while the planes represent the monomial surfaces found by clustering & gradient descent (approximation with 36 monomials).
And the fitting obtained:
Original function depicted with blue points; fitting depicted with orange points.
Second Example
The function chosen is the following:

On the left side, the original function; on the right side, the hyperplanes (monomials) found by the cluster-based gradient descent.
The fitting obtained applying the above monomials and the membership function:
The graph shows the plane described by the original function.
In blue, the points belonging to the original function.
In orange, the fitting obtained using the clustering-based procedure.
As usual: Stay Tuned 
Cristian

Thursday, December 8, 2011

Power Real Polynomial to approximate functions: The Gradient Method

In the real world a problem can rarely be solved using just a single algorithm; more often the solution is a chain of algorithms, where the output of one is the input of the next.
But quite often machine learning algorithms return extremely complex functions, which don't fit directly into the next step of your strategy.
In these conditions the trick of function approximation is really helpful: we reduce the complexity of our original model by replacing it with a new, simpler model.
The big advantage of function approximation is that you can impose the form of the new model used to describe your data.

One of the most common models is the polynomial approximation (also known as linearization).
As you can imagine, the polynomial approximation takes advantage of the linear combination of monomials, which makes analytical approaches easier.

First approach: Polynomial fitting (power-real) gradient descent
To apply this approach your simplified model must be differentiable!
I don't want to spend time explaining how to implement it, because it is so standard and common that it is available everywhere on the internet.

In my toolbox there is always a notebook with the gradient descent implementation: it is so general purpose that I couldn't live without it!!
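Just to fix the ideas, here is a minimal sketch in Python (my own illustration, not the Mathematica notebook): plain gradient descent on both the coefficients and the real exponents of the monomials, with the two separate learning rates mentioned in the considerations below, on a domain x > 0 where x^b is well defined.

```python
import numpy as np

def fit_power_poly(xs, ys, n_monomials=4, iters=30_000, eta_coeff=1e-3, eta_exp=1e-4):
    """Fit ys ~ sum_i a_i * xs**b_i by gradient descent on the mean squared error."""
    rng = np.random.default_rng(1)
    a = rng.uniform(-1.0, 1.0, n_monomials)
    b = rng.uniform(0.0, 2.0, n_monomials)
    for _ in range(iters):
        pred = (a[:, None] * xs[None, :] ** b[:, None]).sum(axis=0)
        err = pred - ys
        # gradients of the mean squared error w.r.t. a_i and b_i
        grad_a = np.mean(2 * err[None, :] * xs[None, :] ** b[:, None], axis=1)
        grad_b = np.mean(2 * err[None, :] * a[:, None] * xs[None, :] ** b[:, None]
                         * np.log(xs)[None, :], axis=1)
        a -= eta_coeff * grad_a      # small eta for the coefficients
        b -= eta_exp * grad_b        # even smaller eta for the exponents
    return a, b

# toy usage: approximate Sin(x) on (0, pi]
xs = np.linspace(0.05, np.pi, 300)
a, b = fit_power_poly(xs, np.sin(xs))
```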

Let me show a quick demonstration of it on two different cases:
In the first one I found an approximation of Sin[x] (…In a real scenario I would be really happy to have a “complex” model following a Sin function).

Polynomial approximation for Sin(x): a good approximation obtained after 30,000 iterations.
(In red the original function, in orange the approximation)
In the following case I chose a much more complex function:


And the result for this example is:
Polynomial approximation: here the convergence required six monomials and 70,000 iterations.
The final result was obtained via "fine-tuning", which took 5,000 iterations.
(In red the original function, in orange the approximation)
In both examples the convergence has been measured as the squared error with respect to the target function.


Considerations:

  • If you increase the number of monomials, you increase the risk of falling into a local minimum.
  • You should start with very small values of the learning rate eta (I usually use two different learning rates: one for the coefficients and one for the exponents).
  • To speed up the convergence you should incrementally increase the value of eta.
  • To refine the learning, "fine tuning" can be useful: present samples with an asymmetric distribution to focus the learning on the tricky regions.

Of course this strategy requires a bit of experience and a good dose of patience!
Be aware that the approximation has only local validity: if you move far away from the training region, the error increases dramatically.

In the next post I will show how to increase the accuracy of this technique by mixing gradient descent with simulated annealing.
Stay tuned.

Wednesday, November 23, 2011

Neural Nets Tips and Tricks: add recall Output neuron

After my former post about neural network frameworks, my email box has literally been flooded!
Let me clarify something: I didn't develop my "home-made" app to compete with the major companies in this field! My experiment was done just to prove that there are so many different algorithms to train neural nets, and so many models designed to work in specific domains, that in my opinion the best way to work with this elegant machine learning algorithm is to implement it in house. That's it!
Anyway, in these emails people suggested some really state-of-the-art models (dynamic momentum, entropy-based cost functions and so on): I'm really happy with the collection of papers I got for free :)

As mentioned, in many cases the customization of algorithms is the only way to achieve the target, but sometimes a few tricks can help improve the learning even without changing the learning strategy!
Consider for example our XOR problem solved through neural networks.
Let's see how we can considerably reduce the epochs required to train the net.
As you know the Back Propagation is based on the famous delta rule, and for the "hidden to output" layer the delta is equal to:
Delta rule for the "hidden to output" layer
How can we reinforce the correction, to speed up the learning process, without modifying the delta rule?
One possible answer is: duplicate the output neuron!!
So instead of changing the strategy, we slightly modify the topology of the net to obtain the following:
The modified network: notice how the delta for the neuron h1 changes

The neuron O2 has exactly the same target T as the neuron O1: basically, the heuristic consists in duplicating the target to reinforce the delta contribution.
In the above image you can see the new contribution provided by the delta of the neuron O2; when the network finds a good path in the gradient descent, the output of O1 will be similar to that of O2, and the delta for h1 will receive a double correction, because the delta of O1 will be pretty much the same as the delta of O2.
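In formulas (a common textbook form of the delta rule; the notation is mine), the intuition reads:

```latex
% standard deltas of back propagation
\delta_{O_k} = (T_k - O_k)\, f'(\mathrm{net}_{O_k}),
\qquad
\delta_{h} = f'(\mathrm{net}_{h}) \sum_{k} w_{h,O_k}\, \delta_{O_k}

% with the duplicated output O_2 sharing the same target T,
% once O_1 \approx O_2 we also have \delta_{O_1} \approx \delta_{O_2}, hence
\delta_{h_1} \;\approx\; f'(\mathrm{net}_{h_1})\, \big( w_{h_1,O_1} + w_{h_1,O_2} \big)\, \delta_{O_1}
```

So the hidden neuron receives roughly twice the correction it would get with a single output neuron, which is exactly the reinforcement described above.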
...I've done my best to explain this "by practical means"; from a theoretical perspective someone could turn up their nose at it (I apologize for that... but the soul of this blog is to privilege the intuitive explanation).
As usual let me show the effects:
I ran 10 tests comparing the original configuration and the network modified with the "recall neuron", using exactly the same number of hidden neurons and the same parameter configuration.
In 7 cases the new net reduced the learning phase by 210%.
In 2 cases the new net took the same time as the original net.
In 1 case the new net didn't find a solution (due to oscillation problems).
Here are some representative examples:
Original Net: convergence after 800 cycles.

Net with recall neuron: convergence after 120 cycles (notice how fast the error slumps) 
Here is another example:
Another trial with the original configuration (convergence after 600 cycles)

Another example of the effectiveness of the method: convergence obtained around cycle 200.
Just to conclude, let me show how quickly the error slumps from the synapses' perspective. Below is the most representative error surface for the last trial shown above.
Error surface plot for the most important connections (referring to the above net): notice again how quickly the error slumps.
Stay tuned
cristian

Sunday, November 13, 2011

Buy or build, a practical example to explain my point of view

I was working on the post about the relation between IFS fractals and analytical probability densities when an IT manager contacted me asking for an informal opinion about a tool for neural networks.
He told me: "we are evaluating two different tools to perform general purpose data mining tasks, and we are oriented toward the tool xxx because the neural network module seems more flexible".
My first reaction was: "sorry, but you have to solve a specific problem (the problem is a little bit complex to explain, but it is a classical function inversion problem) and you are looking for a generic suite: I'm sincerely confused!"
He justified himself saying: "If I buy a generic suite I can reuse it later for other problems!". Bloody managers! Always focused on saving $1 today to lose $10 tomorrow :)

As you can imagine the discussion fired me up, so I started to investigate these "flexibility capabilities" (for me this is the key point in evaluating a solution) and I began a long email exchange with this enterprising manager.
Let me sum up the main questions/answers about this module.
Q: What kind of Neural Network does the suite support?
A: Back Propagation (just for multilayer perceptron) and SOM.
The suite has a nice GUI where you can combine the algorithms like Lego bricks (it is very similar to WEKA's GUI).
Q: What kind of error function does it support (limited to BP nets)?
A: I don't know, but I suppose just the delta based on target vs. generated output.
Q: What kind of pruning algorithms does it support?
A: It supports a weight-based algorithm, but there is no control on it.
Q: How many learning algorithms does the BP net support?
A: It supports just BP with momentum.
Q: What kind of monitors does it support to understand the net behavior?
A: It has a monitor to follow the output error on the training set.
Q: What is the cost of this suite?
A: The suite (actually it is a pretty famous data mining suite... don't ask more about it...) costs around $10,000 for a limited number of users (this includes neither the support costs, nor the integration costs, nor dedicated hardware).
Q: Do you have the chance to visualize the error surface?
A: I’m a manager, what the hell does this question mean?

My recommendations: don’t buy it.
Here you are the key points:

1. No words, bring real evidence!
As usual, instead of a thousand words I prefer to get direct and concrete proof of my opinion, because my motto is: "don't say I can do it this way or that way, just do it!" (...hoping that makes sense in English!).
So during this weekend I decided to implement in Mathematica a custom tool to play with back propagation neural networks having at least the same features described above.
The general-purpose neural net application built with Mathematica: different colors for the neurons to depict different energy levels, a monitor to follow the learning process, full customization of the cost function, activation functions, ...
My "suite" (of course, it is not really useful for production systems) has the following features:

  • Graphical representation of the net;
  • Neurons dynamically colored by energy importance (useful to implement a pruning algorithm);
  • Dynamic cost function updating;
  • Learning coefficient decay (really useful but seldom present in common suites);
  • Activation function monitor;
  • Automatic setting of the input/output neuron layers in compliance with the training set.

2. General-purpose algorithm doesn't mean "everyone can use it to solve whatever problem"!
There is a misbelief that general-purpose algorithms like SVM, NN, simulated annealing, decision trees and so on can solve every kind of problem without human interaction, customization and expertise.
For the XOR problem maybe even a newbie can find the right settings for a neural net, but in the real world the problems are so complex that the overall accuracy is a combination of many factors, and sometimes the algorithm choice has a limited influence on the entire picture!


Let's start by showing a toy example, the XOR problem (as you know, it is the classical example used to illustrate problems without linear solutions).
I instantiated a network having 1 hidden layer with 2 hidden neurons. I fixed the beta parameter of the activation function to 1 and the learning rate eta to 0.1, and I randomized the weights in the range [-1,1] (as reported in the main tutorials).
Here is the network able to solve the problem and the related behavior of the cost function:
Neural network for the XOR problem: 2 input neurons, 2 hidden neurons, a threshold neuron and the output.
The squared error decreasing for the above network. As you can see, a stable configuration is achieved after 30,000 learning cycles!!
Notice that with the standard configuration the problem takes around 30,000 cycles to be solved!!!
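For reference, here is a minimal Python re-implementation of this baseline setup (2 hidden neurons, sigmoid with beta = 1, eta = 0.1, weights in [-1, 1], a bias acting as the threshold neuron). It is only an illustrative sketch, not the Mathematica tool used for the figures, so the exact cycle count will depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

beta, eta = 1.0, 0.1
sig = lambda x: 1.0 / (1.0 + np.exp(-beta * x))

# weights in [-1, 1]; one extra row for the threshold (bias) neuron
W1 = rng.uniform(-1, 1, (3, 2))   # (2 inputs + bias) -> 2 hidden
W2 = rng.uniform(-1, 1, (3, 1))   # (2 hidden + bias) -> 1 output

for cycle in range(30_000):
    # forward pass
    Xb = np.hstack([X, np.ones((4, 1))])
    H = sig(Xb @ W1)
    Hb = np.hstack([H, np.ones((4, 1))])
    O = sig(Hb @ W2)
    # backward pass (delta rule)
    delta_o = (T - O) * beta * O * (1 - O)
    delta_h = (delta_o @ W2[:2].T) * beta * H * (1 - H)
    W2 += eta * Hb.T @ delta_o
    W1 += eta * Xb.T @ delta_h
    if np.sum((T - O) ** 2) < 1e-3:
        print("converged at cycle", cycle)
        break
```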
With a little bit of experience the same result can be achieved in fewer than 300 cycles. Indeed, changing the learning rate and the network configuration, we obtain:

Another configuration to solve the XOR problem: 1 hidden layer and 7 hidden neurons. As you can see, after 300 learning cycles the error is negligible.
Observe the hidden neurons' energy bar: as expected, only one hidden neuron (H1,2) gives a concrete contribution to the network (this is in compliance with the theory that only one hidden neuron is required to solve the XOR problem!!).
A novice data miner could say: "It is easy: I increase the number of hidden layers and the number of neurons and the training will be faster!". TOTALLY FALSE!!
Indeed (especially for neural networks based on the classical Werbos delta rule), the complexity of the net often obstructs the learning, producing oscillatory behaviors due to neuron saturation and overfitting.
A neural network composed of 2 hidden layers and 8 hidden neurons trained to solve the XOR problem. As you can see, even though the network is more complex than the former example, it is not able to solve the problem after 1,000 cycles (the former net solved the problem after 300 cycles).

We have seen that even with a trivial problem there are many variables to consider to obtain a good solution. In the real world the problems are much more complex, and the expectation that a standard BP net can fit your needs out of the box is actually naive!

3. Benefits & thoughts.
How much time does an in-house implementation take?
I spent more or less 6 hours to replicate the standard BP with the "colored neurons" version... and in this slot of time my 4-year-old daughter tried to "help" me find the best color range and the fanciest GUI :) by randomly typing on the keys of my laptop...


  • The effort to replicate the model in a faster language (like Java) could take 4 times the effort of coding it in Mathematica: that is, no more than 1 week of development.
  • The customization of the standard BP for a specific problem, for example changing the objective function, could be really expensive and often not feasible with a standardized suite.
  • The integration of a bought solution almost always takes at least 50% of the development effort, so building in house can mitigate these costs, because your solution is developed directly on your infrastructure rather than adapted to work with it!
  • The advantage of developing in house a product requiring high competence lies in the deep knowledge that the company acquires, and in the deep control of the application under every aspect: integration, HW requirements, business requirements and so on. Knowledge is the key to winning the competitive challenges in this strange world. Of course, training an in-house team with deep scientific, problem-solving-oriented knowledge is initially expensive and risky, but in my belief the attempt to delegate these core components by totally outsourcing this expertise to external companies will impoverish the value of a company.

The BP simulator I implemented is totally free for students and/or academic institutions: contact me at cristian.mesiano@gmail.com.
...We will use this simulator to highlight interesting points in the amazing world of data mining.
Stay Tuned,
Cristian 

Monday, October 24, 2011

Think different part 1: IFS fractals to generate points in a convex polygon

Sometimes datasets assume strange shapes, and often, to make our computations easier, we simulate the set of points using approximations of well-known statistical distributions, mixture models, and so on; but as we know, the world is often more complex than expected, and an analytical density function is tricky to find.
Let me show a different approach to simulate points belonging to really irregular convex polygons.
The method I propose is based on a kind of Markovian process better known as an Iterated Function System (IFS). IFSs are famous thanks to the "collage theorem" of Prof. M. F. Barnsley, which allows amazing image compression and the description of complex phenomena (I really suggest a deep reading of the book "Fractals Everywhere", which you can buy by clicking on the Amazon panel).
Just to show the power of IFS fractals, I plotted the famous fern obtained with 4 affine transformations (I'll explain below how an IFS works).
The sequence of images shows how the fern is built by the underlying process.
IFS to generate the fern. A different color has been assigned to each transformation.
IFS: how it works
To generate an IFS fractal you just need a set of affine contractive transformations and, for each transformation, an associated probability (...of course the sum of the probabilities associated with the transformations must be equal to one!).
So, given a set of affine contractive transformations {a,b,c,d} and the respective probabilities {pr_a, pr_b, pr_c, pr_d}, you initialize the process from an arbitrary point of the space "P_t0", and you find the next point of the process by applying to it a transformation chosen from the set with the associated probability.
That's it!
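In Python the whole process fits in a few lines. This is a sketch of mine (the helper name is arbitrary); the four affine maps are the coefficients commonly quoted for Barnsley's fern.

```python
import numpy as np

def ifs_points(transforms, probs, n=50_000, p0=(0.0, 0.0)):
    """Chaos game: at each step apply one affine map (A, t), chosen with its probability."""
    rng = np.random.default_rng(0)
    pts = np.empty((n, 2))
    p = np.array(p0, dtype=float)
    for i in range(n):
        A, t = transforms[rng.choice(len(transforms), p=probs)]
        p = A @ p + t
        pts[i] = p
    return pts

# the four affine maps commonly quoted for Barnsley's fern
fern = [
    (np.array([[0.00, 0.00], [0.00, 0.16]]), np.array([0.0, 0.00])),
    (np.array([[0.85, 0.04], [-0.04, 0.85]]), np.array([0.0, 1.60])),
    (np.array([[0.20, -0.26], [0.23, 0.22]]), np.array([0.0, 1.60])),
    (np.array([[-0.15, 0.28], [0.26, 0.24]]), np.array([0.0, 0.44])),
]
points = ifs_points(fern, probs=[0.01, 0.85, 0.07, 0.07])
```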

Generate points in a convex polygon.
The process to generate a set of points in a convex polygon is based on the above process, but instead of using generic affine transformations we use only homothetic contractive transformations.
Procedure (a sketch follows):
  • Find a set of homothetic affine transformations of the polygon having their fixed points in the vertexes of the polygon.


Homothetic transformations with fixed points on the vertexes of the polygon
  • Assign equal probability to each transformation.
  • Trigger the process starting from a point inside the polygon.
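A minimal Python sketch of the procedure (my own illustration): each homothety with fixed point at a vertex v and contraction factor s maps p to v + s·(p − v), and a vertex is picked with equal probability at every step.

```python
import numpy as np

def polygon_points(vertices, s=0.5, n=20_000, start=None):
    """Generate points in a convex polygon via homothetic contractions toward its vertices."""
    rng = np.random.default_rng(0)
    V = np.asarray(vertices, dtype=float)
    p = V.mean(axis=0) if start is None else np.asarray(start, dtype=float)
    pts = np.empty((n, 2))
    for i in range(n):
        v = V[rng.integers(len(V))]        # vertex chosen with equal probability
        p = v + s * (p - v)                # homothety with fixed point v, contraction factor s
        pts[i] = p
    return pts

# example: an irregular convex pentagon; tweak s to change the coverage
pentagon = [(0, 0), (4, 0.5), (5, 3), (2.5, 5), (-0.5, 2.5)]
pts = polygon_points(pentagon, s=0.6)
```

Since each of these maps sends the polygon into itself, starting from an interior point keeps every iterate inside the polygon; this also explains the remark further below about points falling outside when the process is started from an external point.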
Actually, the above procedure also works properly with degenerate convex objects (like a straight line segment):
The above procedure applied to find points belonging to a segment
Here are some example point sets obtained with the above procedure.

The set of points obtained for the former polygon. (Changing the contraction factor, you can obtain different coverages).
Other examples:

As you can see, some points fall outside the polygons: why?
Because I initialized the process from a point outside the polygon; in any case the final effect doesn't change.


In the next post we will discuss the probability density function (PDF) of the points obtained with this procedure and the relation between the PDF and the contraction factor.
Stay Tuned!
cristian

Thursday, October 13, 2011

ADA boost: a way to improve a weak classifier

We have seen that often a classifier is not able to do its job properly.
Why? There are many reasons:
  • Even if many algorithms are in principle general purpose, we know that when testing an algorithm on different data sets (that is, on different problems) the accuracy can change a lot.
  • The training hasn't been done properly: consider for example datasets with high dimensionality; how can you be sure that your classifier hasn't been affected by overfitting?
  • The classifier itself cannot solve the problem: e.g. a perceptron cannot solve the XOR problem (more generally, it cannot solve non-linearly separable problems).

So the idea behind ADA is to linearly combine a set of weak classifiers to obtain a better classifier!
One of the reasons that makes ADA so powerful is its generality:
it is a meta-algorithm, that is, you can think of ADA as a strategy where different weak classifiers cooperate to improve the overall accuracy. So in the linear combination of classifiers you can have perceptrons mixed with neural networks, decision trees and so on.
Every single classifier is specialized to solve just a piece of the problem, so by combining them you obtain the overall solution of the problem.
How does it work?
ADA is an iterative strategy where, on each round, the weight of each incorrectly classified example is increased, so that the next weak classifier focuses more on those examples (a sketch of the loop follows below).
Further details are available everywhere on the internet, but I would suggest focusing on the function that it minimizes.
This is a crucial point: the function (at least in the original implementation) is convex, and the algorithm tends to fall into local minima because it basically performs a greedy process.
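As a compact illustration of the reweighting loop, here is a sketch of discrete AdaBoost in Python. It is not the Mathematica notebook used for the figures, and it uses decision stumps from scikit-learn as weak learners instead of weak SVMs; labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost: y must be in {-1, +1}; returns the weak learners and their weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # initial uniform distribution over the samples
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)    # weak classifier trained on the weighted set
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # increase the weight of misclassified samples, decrease the others
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def boosted_predict(learners, alphas, X):
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```

Any weak learner that accepts sample weights (including an SVM) could be plugged into the same loop.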
Again: when you use an algorithm focus on its limits!!
As usual I'm not going to report the pseudo-code of the algorithm here: there are hundreds of web pages with exactly the same thing and I don't want to add entropy to the universe :)
I prefer to show some graphical examples done in Mathematica: I believe that an image is a thousand times better than a formula for understanding the underlying concepts! (...I love the theoretical aspects of the algorithms, but after the theory you must be practical and able to implement the solution!!).

So let’s start with a 2-dimensional problem:
I ran many SVMs with different settings on the same (not linearly separable) data set; of course a well-trained SVM can easily solve this problem.
I deliberately chose a weak kernel (polynomial) with bad capacity factor values.
Here are the hyperplanes found by the SVMs. As you can see, none of the solutions found fits the problem properly.
  
The set of weak classifiers used to build the boosted classifier

Let me now show the output of ADA after a few iterations.

The output of ADA using the above weak classifiers. Notice how the linear combination produces a segmented classifier.
Below, an example with a 3-D dataset:
The set of weak hyperplanes found by the (mixed) weak SVMs
Here is the ADA output after just 2 iterations:
ADA output after two iterations


and the optimal solution found after 7 iterations:
The ADA output after 7 iterations: as you can see the optimal solution has been found.

Contact me if you are interested in the notebook with the Mathematica implementation.
Stay tuned!
cristian



Monday, September 26, 2011

Support Vector Clustering: An approach to overcome the limits of K-means

Some time ago I posted a banal case to show the limits of K-means clustering. A follower gave us a grid of different clustering techniques (calling Mathematica's internal routines) to solve the case discussed.
As you know, I like to write the algorithms myself and to show alternative paths, so I've decided to explain a powerful clustering algorithm based on SVM.
To understand the theory behind SVC (support vector clustering) I strongly recommend having a look at: http://jmlr.csail.mit.edu/papers/volume2/horn01a/rev1/horn01ar1.pdf . In this paper you will find all the technical details explained with extreme clarity.
As usual I leave the theory to the books and jump into the pragmatism of the real world.
Consider the problem of a set of points having an ellipsoidal distribution: we have seen in the past that K-means doesn't work in this scenario, and even trying different tweaks, changing the position and the number of the centroids, the final result is always unacceptable.
SVC is a clustering algorithm that takes as input just two parameters (C and q), both real numbers. C manages the outliers and q manages the number of clusters. Be aware that q is not directly related to the number of clusters!! By tuning q you can manage the "cluster granularity", but you cannot decide a priori the number of clusters returned by the algorithm.


How to implement SVC.
There are many implementations of SVC, but I would like to show different tools (I love broadening my horizons...), so the ingredients of today's recipe are: AMPL & SNOPT.
Both of them are commercial tools, but to play with small sets of points (no more than 300) you can use the student license for free!
AMPL is a comprehensive and powerful algebraic modeling language for linear and nonlinear optimization problems, in discrete or continuous variables, and SNOPT is a software package for solving large-scale optimization problems (linear and nonlinear programs).
AMPL allows the user to write the convex problem associated with SVC in an easy way:

The AMPL file for SVC
And SNOPT is one of the many solvers able to work with AMPL.
In the above image, after the statement "param x: 1  2   3   :=", there is the list of 3D points belonging to our data set.
One of the characteristics of SVC is its vector notation: it allows working in high dimensions without changes to the development of the algorithm.
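For reference, the convex problem that the AMPL model encodes is the dual derived in the cited paper: find the minimal enclosing sphere of the data in feature space, which with a Gaussian kernel governed by q reads:

```latex
\max_{\beta}\;\; W(\beta) \;=\; \sum_{j} \beta_j\, K(x_j, x_j) \;-\; \sum_{i,j} \beta_i \beta_j\, K(x_i, x_j)

\text{s.t.}\quad 0 \le \beta_j \le C, \qquad \sum_{j} \beta_j = 1,
\qquad K(x_i, x_j) = e^{-q\, \lVert x_i - x_j \rVert^2}
```

Points with 0 < beta_j < C lie on the cluster contours (the support vectors), while points with beta_j = C are the outliers controlled by C.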

2D Problem 
Let's show the application of SVC on our ellipsoidal data set.
300 points having an ellipsoidal distribution. The first contour found by SVC is depicted in black.
The above image shows the clusters (depicted as connected components of a graph... read further details in the mentioned paper) returned by SVC and plotted with Mathematica.

3D problem
Just to show the same algorithm working in 3D on the same problem:
3D points having an ellipsoidal distribution.
And here are the SVC results plotted with Mathematica:
SVC applied on the former data set
As you can see, in both scenarios SVC is able to solve the easy problem that K-means cannot manage.
PS
We will continue with the text categorization in the next post... From time to time I allow myself some digression.

Thursday, September 15, 2011

How the classifier accuracy is conditioned by the Feature Set (Bag of words)

In the last post we saw a naïve approach to classification via SVM and C5, and we noticed that neither SVM nor C5 achieved good accuracy.
Despite the unsatisfactory results, we tested the different algorithms under the same conditions, and we found out that (at least in this case) they have the same behavior. Of course we should repeat the test under different conditions to provide a correct benchmark: we will do that another time.
But right now I would like to perform a more useful (for our classification target) benchmark test:
Consider an SVM-based classifier trained using the same parameter tuning strategy: how does the accuracy change if we build the vectors using different bags of words?
Basically we are trying to understand the best strategy to select the feature set (that is, the bag of words) to classify our 4-class data set.
Let me clarify immediately an important point:

  • The "single bag of words" strategy isn't really useful in real business scenarios: it is too trivial.
So why are we doing such a test?
Because this kind of test can give us a lot of useful information about the data set and allows us to choose the best strategy!

Test description.
We have already seen that measuring the accuracy on the training set is extremely dangerous (sooner or later I should explain the pillars of error theory and the overfitting problem...).
So the first step is to assess the experiment with a validation set (...there are more sophisticated tests for that, like k-fold cross-validation, but for our case this should be enough). How can we do that?
...Easy: by splitting the training set into two parts: the first one will be used to train the algorithm, and the second one will be used to measure the accuracy.
Be aware: you can never measure the accuracy using samples already observed by the learning algorithm. And of course you cannot use the accuracy measured on the test set to retrain your system!! (please see the article by Dr. Sandro Saitta at: http://www.dataminingblog.com/how-to-cheat-with-data-mining).

Training Set splitting

The second step is to build the vectors of the same training set using different bags of words. For this experiment I used the "TF-DF", "TF-IDF", and "Closeness Centrality" functions to extract the features and build the vectors.
At this point we can assess the accuracy of the classifier on the same data set vectorized in these different ways (a sketch of the protocol follows).
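As an illustration of the protocol, here is a sketch under assumptions: it uses scikit-learn instead of the Mathematica/LIBSVM pipeline actually used for the tables below, and TF-IDF as the example bag of words; the TF-DF and closeness-centrality variants would only replace the vectorization step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

def evaluate_bag_of_words(docs, labels):
    """Split the labelled corpus, train a linear SVM, return the row-normalized accuracy matrix."""
    X_train, X_val, y_train, y_val = train_test_split(
        docs, labels, test_size=0.3, stratify=labels, random_state=0)
    vec = TfidfVectorizer()                      # swap this step for TF-DF / closeness centrality
    clf = LinearSVC().fit(vec.fit_transform(X_train), y_train)
    pred = clf.predict(vec.transform(X_val))
    # element [i, j]: fraction of documents of true class i assigned to class j
    return confusion_matrix(y_val, pred, normalize='true')
```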
Accuracy test using the TF-DF bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.
Accuracy test using the TF-IDF bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.
Accuracy test using the Closeness Centrality based bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.

The above tables show (in percentage) the answers provided by the classifier (the diagonal contains the true positives). The element [i,j] = x means that the real class was i and the classifier assigned x% of the documents belonging to i to class j.

For the "graph theory based" bag of words, we can also shown the corpus distribution in terms of the closeness function:
Graph of the Corpus generated by Closeness Centrality (has been plotted only the most relevant features)
Red color = high Closeness Centrality.
Purple = low Closeness Centrality.  


Results
There are three important things to highlight:

  1. For this data set the TF-DF feature set returns the worst accuracy (grain class TP = 43%) with respect to the other two methods.
  2. TF-IDF and Closeness Centrality give better results, and their accuracy is almost identical in absolute terms.
  3. The grain and oil classes are the tricky case: the algorithm is not able to separate them properly.

As you can imagine, the third finding is the most important piece of information returned by this experiment: it tells us that we have to spend more effort highlighting the differences between the oil and grain classes in order to increase the accuracy.
Notice that in the "TF-IDF" and "Closeness Centrality" cases the true positive distribution is mirrored! This supports our thesis that we are not properly describing the documents belonging to these two classes.
We will see how to solve the problem!
Stay Tuned.



Thursday, September 8, 2011

Classification Step: C 5.0 vs SVM

Finally, after many preambles, we have all the conditions to classify the documents. So far we have seen:
  1. How to translate a document into vectors understandable by the algorithms;
  2. How to analyze the data set with density analysis;
  3. How to reduce the dimension of the problem (feature reduction).

Well, now we have all the ingredients to classify the documents.
To show a couple of approaches, I extracted 4 classes from our REUTERS data set (choosing the "TOPICS" criterion and considering the easier condition of documents belonging to only one class): "GOLD", "GRAIN", "OIL", "TRADE".

At this stage I put myself in a "comfort zone", so I considered the easiest way to classify the documents, that is, I built the vectors considering just ONE "bag of words".
Basically I extracted the feature set from the entire training set without any distinction based on the specific class of the documents.
In this approach we are delegating the entire complexity of the task to the classifier!
As we can see below, being lazy never pays!! (...I hope that makes sense in English...).
Let's move on to the results.
As I always recommend, never play with just one toy! Always be curious and try different approaches to solve the problem.

In this spirit, for the time being I have done some tests using SVM (the LIBSVM implementation) and C5 (the free version).
I embedded in Wolfram Mathematica the scripts to run both LIBSVM and C5 (contact me at cristian.mesiano@gmail.com if you want the source); in this way all the tasks (data preparation, feature extraction, vector building and so on) can be done using a single interface.
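For completeness, here is a sketch of the glue such scripts need to produce (an illustrative helper of mine, not the Mathematica scripts mentioned above): LIBSVM expects one line per document in the sparse text format `label index:value ...` with 1-based feature indices, which can then be fed to the svm-train / svm-predict command-line tools.

```python
def write_libsvm(path, vectors, labels):
    """Write dense feature vectors in the sparse LIBSVM text format."""
    with open(path, "w") as f:
        for y, x in zip(labels, vectors):
            feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(x) if v != 0)
            f.write(f"{y} {feats}\n")

# write_libsvm("train.libsvm", vectors, labels)
# then, for example:  svm-train -t 0 -c 1 train.libsvm model   (-t 0 selects the linear kernel)
```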

Here is the accuracy for SVM:
Notice that the training set has been rescaled and weighted with different scores.

Accuracy For SVM Classifier: Best results with Linear Kernel
As you can see, the accuracy measured on the training set is not brilliant! The OIL class is misclassified in 50% of the cases. Moreover we have to consider that:
  • we are working with an easy data set and with few classes;
  • the accuracy has been measured only on the training set (...remember that the first error of the newbie is to focus too much on the training environment). The training set contains only points already observed by the classifier, and an analysis done only on it can hide overfitting problems!!

Even trying different tunings and different kernels, the accuracy doesn't improve.

The situation doesn't improve with C5: as you can see below, the accuracy is pretty much the same as that achieved via SVM.
The Decision Tree returned by C5 (plotted with Mathematica). 
Accuracy Matrix Provided by C5: The results are pretty similar to the SVM computations 
So do we give up?
NEVER.
In the next posts we will see how to improve the accuracy by introducing boosting techniques.
...There are thousands of details to consider and to explain, like:
linear kernels for SVM, how to measure the accuracy, how to improve the training set and avoid overfitting...
We will cover them in the next posts.
Stay Tuned
Cristian

Monday, August 29, 2011

Features Reduction part 2: C 5 vs PCA - Road Map

In the last post I showed how to reduce the complexity (dimension) of the problem through the PCA algorithm.
I mentioned that PCA works well under the condition that the data set has a homogeneous density.
Are we in this scenario?
I analyzed our data set to verify how the points are distributed; to do that I used the first three components suggested by PCA for the classes having TOPICS = "GOLD" and "GRAIN".
To better highlight the density I perturbed the 3D coordinates with weak Gaussian noise, just to avoid the presence of coincident points. Here are the results:
In green the "GRAIN" class, in violet the "GOLD" class. Both perturbed with weak Gaussian noise.
As you can see, both data sets are pretty heterogeneous! So the enthusiasm sprung from the strong reduction achieved by PCA is maybe too optimistic!
A countercheck of this analysis is provided by the second feature reduction algorithm we are testing: C5.

A C5 decision tree is constructed using GainRatio, a measure incorporating entropy; entropy measures how disordered the data set is (the formulas are sketched below). Further details about C5 are available at Quinlan's website: http://www.rulequest.com
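In the usual notation, the quantities involved are (C5.0 applies further refinements on top of these):

```latex
\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)

\mathrm{SplitInfo}(S, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```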
An extremely interesting book about applications of entropy is the famous "Cover & Thomas":

Elements of Information Theory, 2nd Edition

C5 is really easy to manage. In the default version you don't need to tune parameters, even though a lot of configuration options are available to improve the accuracy. An interesting feature (from my perspective) is the boosting mode... but we will discuss that another time.

I wrote a wrapper in Mathematica to:
  • prepare the configuration files (the .names and .data files);
  • adapt the data set from LIBSVM format to C5 format;
  • display the tree output.
Contact me if you would like the notebook.
Let's show the output of C5 using the default settings:


The decision tree returned by C5.
The leaves (depicted with red triangles) represent the class assigned by the algorithm for each decision path.
A = "GOLD", B = "GRAIN", C = "OIL".
The left-hand sides of the node inequalities represent the feature numbers.
The above tree represents the decision paths for a data set composed of the three classes having TOPICS = "GOLD", "GRAIN", "OIL".
To build the vectors I used the TF-DF feature set of the class "GOLD".
(In the next post I am going to explain why I'm building vectors of class X using the feature set of class Y :) )
As you can see in the graph, only a few features have been used:

  
Notice that C5 returns different features with respect to the PCA algorithm! Moreover, it suggests using only 7 features: an amazing complexity reduction!!
I used C5 just to analyze the important features, but C5 is itself a classifier! The image below shows the classification accuracy for the mentioned classes.




So C5 gives us a set of rules to classify documents belonging to three different classes (the rules are depicted by the tree paths from the "start" node to each leaf) with an overall error rate of less than 15%.
Is it a good result? Apparently yes, but... if you have a look at the class "GRAIN" (B) the results are not so brilliant!!
Don't worry, we will see how to dramatically improve this accuracy.
At this point we have discussed many algorithms, techniques and data sets.
So where are we, and where are we going?
Auto classification Road Map


Guys, Stay Tuned.
cristian