We have seen that often a classifier is not able to do its job properly.
Why? There are many reasons:
- Even if many algorithms are in principle general purpose, we know that testing the algorithm on different data set (that is on different problem) the accuracy can change a lot.
- The training hasn’t been done properly: consider for example datasets having high dimensionality, how can you be sure that your classifier hasn’t affected by overfitting?
- The classifier it self cannot solve such problem: e.g a perceptron cannot solve the xor problem (more generally cannot solve non linear problems).
So the idea behind ADA, is to linearly combine a set of weak classifiers to obtain a better classifier!
One of the reasons that makes ADA so powerful is its generality:
It is a meta-algorithm, that is, you can think ADA like a strategy where you can combine different weak classifiers to cooperate together to improve the overall accuracy. So in the linear combination of classifiers you can have perceptrons mixed with neural networks, decision trees and so on.
Every single classifier is specialized to solve just a piece of the problem, so combining them you obtain the overall solution of the problem.
How it works?
ADA is an iterative strategy where on each round, the weights of each incorrectly classified example are increased so that the new classifier focuses more on those examples.
Further details are available everywhere in internet, but I would suggest to focus on the distribution function that it minimizes.
This is a crucial point: the function (at least in the original implementation) is a convex distribution, and the algo tends to fall in local minimum because basically it performs a greedy process.
Again: when you use an algorithm focus on its limits!!
As usual I’m not going to report here the meta-code of algo: there are hundreds of web pages with exactly the same things and I don’t want add entropy to the Universe J
I prefer show the some graphical example done in Mathematica: I believe that an image is thousands times better than a formula to understand the underlying concepts! (…I love the theoretical aspects of the algorithms, but after the theory you must to be practical and able to implement the solution!!).
So let’s start with a 2-dimensional problem:
I ran on the same data set (not linearly separable) many SVM having different setting (of course an SVM well trained can easily solve the problem).
I choose voluntarily a weak kernel (polynomial) with a bad capacity factor values.
Here you are the hyperplanes found by the svm. As you can see, no one solution found fits properly problem.
|The set of weak classifier used to build the boosted classifier|
Let me show now what is the output of ADA after few iterations.
|The output of ADA using the above weak classifiers. Notice how the linear combination produce a segmented classifier.|
Below an of example with 3-D datasets
|The set of weak hyperplanes found by the weak svms (mixed)|
Here you are the ADA output after just 2 iterations:
|ADA output after two iterations|
and the optimal solution found after 7 iterations:
|The ADA output after 7 iterations: as you can see the optimal solution has been found.|
Contact me if you are interested to the notebook with the Mathematica implementation.