Thursday, October 13, 2011

AdaBoost: a way to improve weak classifiers

We have seen that a classifier is often not able to do its job properly.
Why? There are many reasons:
  • Even if many algorithms are in principle general purpose, we know that testing the same algorithm on different data sets (that is, on different problems) can change the accuracy a lot.
  • The training hasn’t been done properly: consider for example datasets with high dimensionality; how can you be sure that your classifier isn’t affected by overfitting?
  • The classifier itself cannot solve the problem: e.g. a perceptron cannot solve the XOR problem (more generally, it cannot solve non-linearly separable problems).

So the idea behind AdaBoost is to linearly combine a set of weak classifiers to obtain a better classifier!
One of the reasons that makes AdaBoost so powerful is its generality:
it is a meta-algorithm, that is, you can think of AdaBoost as a strategy where different weak classifiers cooperate to improve the overall accuracy. So in the linear combination of classifiers you can have perceptrons mixed with neural networks, decision trees and so on.
Every single classifier specializes on just a piece of the problem, so by combining them you obtain a solution to the whole problem.
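In formulas (this is the standard AdaBoost notation, not something specific to my notebook): if the $h_t$ are the weak classifiers and the $\alpha_t$ are the weights computed by the algorithm, the boosted classifier is just the sign of a weighted vote:

```latex
H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t\, h_t(x) \right)
```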
How does it work?
AdaBoost is an iterative strategy: on each round, the weights of the incorrectly classified examples are increased, so that the next weak classifier focuses more on those examples.
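To make the round-by-round re-weighting concrete, here is a minimal sketch in plain Python. It is an assumption made purely for illustration: it uses simple decision stumps on a toy 1-D dataset rather than the SVMs of my examples, but the boosting loop itself is the standard one.

```python
import math

# Toy 1-D dataset: +1 inside the interval [3, 6], -1 outside.
# A single threshold ("decision stump") cannot separate an interval,
# so every stump is a weak classifier here.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

def stump(x, thresh, s):
    """Threshold classifier: predict s if x > thresh, else -s."""
    return s if x > thresh else -s

def adaboost(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    candidates = [min(X) - 0.5] + [x + 0.5 for x in sorted(set(X))]
    ensemble = []                          # list of (alpha, thresh, s)
    for _ in range(n_rounds):
        # pick the stump with the smallest *weighted* error
        best = None
        for t in candidates:
            for s in (1, -1):
                eps = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, t, s) != yi)
                if best is None or eps < best[0]:
                    best = (eps, t, s)
        eps, t, s = best
        if eps <= 0 or eps >= 0.5:         # perfect or useless stump: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, t, s))
        # re-weight: boost the misclassified examples, then normalize
        w = [wi * math.exp(-alpha * yi * stump(xi, t, s))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump(x, t, s) for a, t, s in ensemble)
    return 1 if score > 0 else -1

ensemble = adaboost(X, y, n_rounds=3)
accuracy = sum(predict(ensemble, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(accuracy)  # 1.0 on this toy data
```

On this data the best single stump gets only 70% right, yet after three rounds the weighted vote classifies the interval perfectly: the same "segmented" behaviour you can see in the pictures below.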
Further details are available all over the Internet, but I would suggest focusing on the loss function that the algorithm minimizes.
This is a crucial point: the function (at least in the original formulation) is the convex exponential loss, and the algorithm minimizes it through a greedy, stagewise process: each round makes the locally best choice, which among other things makes AdaBoost sensitive to noisy examples and outliers.
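For reference, the function in question (again in the standard notation, not tied to my notebook) is the exponential loss of the current combination $F$; each round greedily adds the weak classifier and weight that reduce it the most:

```latex
L(F) = \sum_{i=1}^{n} e^{-y_i\, F(x_i)}, \qquad F(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x)
```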
Again: when you use an algorithm, focus on its limits!!
As usual I’m not going to report the pseudo-code of the algorithm here: there are hundreds of web pages with exactly the same thing, and I don’t want to add entropy to the Universe :)
I prefer to show some graphical examples done in Mathematica: I believe an image is a thousand times better than a formula for understanding the underlying concepts! (…I love the theoretical aspects of the algorithms, but after the theory you must be practical and able to implement the solution!!)

So let’s start with a 2-dimensional problem:
I ran many SVMs with different settings on the same (not linearly separable) data set (of course, a well-trained SVM can easily solve this problem).
I voluntarily chose a weak kernel (polynomial) with bad capacity factor values.
Here are the hyperplanes found by the SVMs. As you can see, no single solution fits the problem properly.
  
The set of weak classifiers used to build the boosted classifier

Let me now show the output of AdaBoost after a few iterations.

The output of AdaBoost using the above weak classifiers. Notice how the linear combination produces a segmented decision boundary.
Below is an example with a 3-D dataset.
The set of weak hyperplanes found by the (mixed) weak SVMs
Here is the AdaBoost output after just 2 iterations:
AdaBoost output after two iterations


and the optimal solution found after 7 iterations:
The AdaBoost output after 7 iterations: as you can see, the optimal solution has been found.

Contact me if you are interested in the notebook with the Mathematica implementation.
Stay tuned!
cristian



11 comments:

  1. I'm interested in the notebook.

    Would you please show it or send it?

    edsferr@uol.com.br

  2. Sorry, I am not quite a specialist on the topic. Is "weak classifier" a term, or does it mean that the classifier works poorly?

    If you mean the latter, I am here to disagree, because SVM with a polynomial kernel works pretty well for the case of learning to rank.

  3. Hi itman, as I wrote in the post, a properly trained SVM can solve the problem without effort (a polynomial kernel of degree 3 is enough). I overfitted the 6 SVMs just to simulate weak classifiers. Btw, thanks for your remark

  4. "I choose voluntarily a weak kernel (polynomial) with a bad capacity factor values"

    Thank you for the clarification, I see now, but I missed it when I skimmed the text. This seems to be a very interesting opportunity.

  5. Let me know if you are interested in the notebook.
    Cheers
    cristian

  6. Could you post the source code for this notebook, for the 2-D classifier as well as the 3-D classifier?
    Or could you send the notebook to giselagauss@web.de ??
    This looks very interesting to me :)

  7. Hi
    Happy New Year.
    I've read your comments. I'm a research scholar at Anna University, India. Right now I'm concentrating on the AdaBoost algorithm. I've applied AdaBoost with Random Tree and ID3 algorithms. It improved the performance of Random Tree but reduced that of ID3. What might be the reason? Can you give any suggestions related to this?

  8. My mail id: psshanthi@yahoo.com

  9. Hi, that was a useful read, thank you! Could you please send me the notebook at mt74.mana@gmail.com. I am currently pursuing research in text mining at IIITB Bangalore, India.

  10. I'm interested in the source code/notebook of this implementation. How can I obtain it?
