Monday, January 30, 2012

Document Classification: how to boost your classifier

Some time ago we discussed AdaBoost as a meta-algorithm useful for improving the accuracy of a weak classifier, and I showed an example on a boolean classification problem.
Freund and Schapire, in this excellent paper, proposed versions of this algorithm adapted to work with multi-class classifiers.

The two variants proposed in that paper are based on the trick of assigning a weight to each sample of the training set according to the error committed by the classifier, which forces the classifier to focus on the hard instances of the problem.
The improved classifier is finally obtained as a weighted combination of the versions of the weak classifier trained at each iteration.
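To make the weighting scheme concrete, here is a minimal sketch of a single AdaBoost.M1 round; the function name, the use of NumPy, and the normalization details are my own illustration, not code from the paper:

```python
import numpy as np

def adaboost_m1_round(weights, y_true, y_pred):
    """One AdaBoost.M1 round: score the weak classifier (beta) and
    re-weight the samples so the next round focuses on the errors."""
    wrong = (y_pred != y_true)
    eps = weights[wrong].sum() / weights.sum()   # weighted error of this round
    if eps >= 0.5:                               # M1 requires error < 1/2
        return None, weights
    beta = eps / (1.0 - eps)
    new_w = weights.copy()
    new_w[~wrong] *= beta                        # shrink weights of correct samples
    return beta, new_w / new_w.sum()             # renormalize
```

The smaller the error, the smaller beta, and the larger the weight log(1/beta) that the corresponding classifier receives in the final weighted vote.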

Adapting AdaBoost.M1 to an SVM-based classifier

AdaBoost.M1 tries to improve the accuracy of the classifier step by step by analyzing its behavior on the training set. (Of course you cannot improve the classifier by working on the test set!)
Here lies the problem: if we choose an SVM as the "weak algorithm", we know that it almost always returns excellent accuracy on the training set, with results close to 100% (in terms of true positives).
In this scenario, trying to improve the accuracy of the classifier by assigning different weights to the training samples doesn't work!

The idea of AdaBoost.M1 is still valid with SVM classifiers, but I noticed that instead of improving a single SVM classifier step by step, we can train different SVM classifiers (with different parameters) and assign a score to each classifier based on the error it commits on a part of the training set not used to train the SVMs.
Let me explain the procedure better:

  1. Split the original training set into two sets: a training set (80% of the volume of the original training set) and a boosting set (20% of the volume of the original training set).
  2. Train k SVM classifiers with different parameters.
  3. Test the above SVMs on the boosting set and calculate the beta score as done in the original implementation of AdaBoost.M1.
  4. The final answer of the boosted classifier is obtained as proposed in the original implementation of AdaBoost.M1 (see the sketch right after this list).
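Here is a minimal sketch of this procedure, assuming scikit-learn's SVC and a hypothetical list of parameter settings; the function names, the 80/20 split via train_test_split, and the numerical guards are my own choices, not the code used for the experiment below:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_boosted_svms(X, y, param_list, seed=0):
    """Hold out 20% of the training data as a 'boosting set', train one SVM
    per parameter setting, and score each SVM with the AdaBoost.M1 beta
    computed on the boosting set."""
    X_tr, X_boost, y_tr, y_boost = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    models, alphas = [], []
    for params in param_list:                           # e.g. different kernels / C values
        clf = SVC(**params).fit(X_tr, y_tr)
        eps = np.mean(clf.predict(X_boost) != y_boost)  # error on the boosting set
        eps = np.clip(eps, 1e-6, 1.0 - 1e-6)            # avoid degenerate beta
        beta = eps / (1.0 - eps)
        models.append(clf)
        alphas.append(np.log(1.0 / beta))               # the "trustability score"
    return models, np.array(alphas)

def predict_boosted(models, alphas, X, classes):
    """Final answer as in AdaBoost.M1: a weighted vote over the k SVMs."""
    votes = np.zeros((X.shape[0], len(classes)))
    for clf, alpha in zip(models, alphas):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]
```

For example, param_list could be something like [{"kernel": "linear", "C": 1}, {"kernel": "rbf", "C": 10, "gamma": 0.01}, ...]; these values are purely illustrative and not the settings used for the four classifiers below.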

As you can notice, the method basically assigns a "trustability score" to each of the k classifiers.


Real Case
Let's see what happens with our Reuters-21578 data set.
I tried to classify the documents belonging to the categories "GRAIN", "CRUDE", "TRADE", and "OIL".
I trained four different SVMs. Here are the results of each classifier:
The confusion matrix for Classifier I. The main diagonal represents the true positives; the circle marks the worst result for this classifier.

The confusion matrix for Classifier II. The main diagonal represents the true positives; the circle marks the worst result for this classifier.

The confusion matrix for Classifier III. The main diagonal represents the true positives; the circle marks the worst result for this classifier.

The confusion matrix for Classifier IV. The main diagonal represents the true positives; the circle marks the worst result for this classifier.
And after applying the meta-algorithm, here is the confusion matrix for the improved classifier:
The confusion matrix for the boosted classifier.
As you can see, boosting gives us really good results!
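As a side note on how the next figures are obtained: per-class precision and recall can be read directly from a confusion matrix whose main diagonal holds the true positives. The small helper below is my own addition, not part of the original experiment:

```python
import numpy as np

def precision_recall_per_class(cm):
    """Per-class precision and recall from a confusion matrix whose rows
    are the true classes and whose columns are the predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                                   # true positives per class
    precision = tp / np.maximum(cm.sum(axis=0), 1.0)   # TP / (TP + FP)
    recall = tp / np.maximum(cm.sum(axis=1), 1.0)      # TP / (TP + FN)
    return precision, recall
```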
Here are the same results shown in terms of precision and recall:
 
The Precision/Recall chart for the four weak classifiers.
(The ring shows the worst case for each classifier)
And here is the precision/recall chart for the boosted classifier:
As usual, stay tuned.
Cristian
