Finally, after all the preamble, we have everything we need to classify the documents. So far we have seen:
- How to translate a document into vectors understandable by algorithms;
- How to analyze the data set with density analysis;
- How to reduce the dimension of the problem (feature reduction).
Well, now we have all the ingredients to classify the documents.
To show a couple of approaches, I extracted four classes from our REUTERS data set (choosing by the “TOPICS” criterion, and considering the easier case of documents belonging to only one class): “GOLD”, “GRAIN”, “OIL”, “TRADE”.
At this stage I put myself in a “comfort zone”: I took the easiest way to classify the documents, that is, I built the vectors considering just ONE “bag of words”.
Basically, I extracted the feature set from the entire training set, without any distinction of the specific class of each document.
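To make this concrete, here is a minimal sketch in Python of a single, class-agnostic bag of words built over the whole training set (the post's own tooling is Mathematica, and the toy corpus and function names below are purely illustrative):

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect ONE shared vocabulary from all training documents,
    with no distinction between classes."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def to_vector(document, vocabulary):
    """Turn one document into a term-count vector over the shared vocabulary;
    words never seen at training time are simply ignored."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0)
            for word in sorted(vocabulary, key=vocabulary.get)]

# Toy three-document "training set":
docs = ["gold price rises", "grain exports fall", "oil price falls"]
vocab = build_vocabulary(docs)
vec = to_vector("oil price rises again", vocab)  # "again" is out of vocabulary
```

Every document, whatever its class, ends up as a vector over the same feature axes, which is exactly what pushes the whole discrimination burden onto the classifier.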
With this approach we are delegating the entire complexity of the task to the classifier!
As we will see below, being lazy never pays!
Let’s move on to the results.
As I always recommend, never play with just one toy! Always be curious and try different approaches to solve the problem.
In this spirit, for the time being I have run some tests using SVM (in the LIBSVM implementation) and C5.0 (the free version).
I embedded in Wolfram Mathematica the scripts to run both LIBSVM and C5.0 (contact me at email@example.com if you would like the source); in this way, all tasks (data preparation, feature extraction, vector building and so on) can be done through a single interface.
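For readers wiring this up themselves: LIBSVM's `svm-train` reads a plain-text sparse format, one sample per line, as `<label> <index>:<value> ...` with 1-based, ascending feature indices and zero entries omitted. A small stdlib sketch of the conversion (the label-to-class mapping here is just an illustrative choice):

```python
def to_libsvm_line(label, vector):
    """Format one training vector in LIBSVM's sparse text format:
    '<label> <index>:<value> ...', 1-based indices, zeros omitted."""
    feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(vector) if v != 0)
    return f"{label} {feats}"

# e.g. map classes to integers: GOLD=1, GRAIN=2, OIL=3, TRADE=4
line = to_libsvm_line(3, [0, 2, 0, 1])  # an OIL document's count vector
```

A file of such lines is exactly what `svm-train` consumes, whether it is invoked from a shell, from Mathematica, or from any other host environment.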
Here is the accuracy for SVM:
|Notice that the training set has been rescaled and weighted with different scores.|
|Accuracy For SVM Classifier: Best results with Linear Kernel|
As you can see, the accuracy measured on the training set is not brilliant! The OIL class is misclassified in 50% of the cases. Moreover, we have to consider that:
- we are working with an easy data set and with few classes;
- the accuracy has been measured only on the training set (…remember that the first mistake of a newbie is to focus too much on the training environment). The training set contains only points already observed by the classifier, and an analysis done only on it can hide overfitting problems!
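A simple remedy is to hold out part of the labelled data and measure accuracy there, on points the classifier has never seen. A minimal stdlib sketch (the 80/20 ratio, the seed, and the toy data are arbitrary choices, not from this series):

```python
import random

def holdout_split(samples, test_fraction=0.2, seed=42):
    """Shuffle the labelled samples and split them into a training part
    and a held-out test part never shown to the classifier."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of predictions matching the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

data = [(f"doc{i}", i % 4) for i in range(100)]  # toy (text, class) pairs
train, test = holdout_split(data)
```

A large gap between training accuracy and held-out accuracy is the classic symptom of overfitting; measuring only the former, as above, hides it entirely.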
Even trying different tuning and different kernels, the accuracy doesn’t improve.
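For the tuning itself, LIBSVM's `svm-train` exposes the kernel via `-t` (0 = linear, 1 = polynomial, 2 = RBF, 3 = sigmoid) and the soft-margin cost via `-c`. One way to sweep the combinations is simply to enumerate the command lines, as in this stdlib sketch (the file name and the grid values are illustrative):

```python
from itertools import product

# svm-train kernel codes: 0=linear, 1=polynomial, 2=RBF, 3=sigmoid
def grid_commands(costs, kernels, train_file="reuters.train"):
    """Enumerate one 'svm-train' invocation per (kernel, cost) pair."""
    return [f"svm-train -t {t} -c {c} {train_file}"
            for t, c in product(kernels, costs)]

cmds = grid_commands(costs=[0.1, 1, 10], kernels=[0, 2])
# run each command, score the resulting model on held-out data, keep the best
```

Each command trains one model; comparing them on a held-out set (not the training set!) tells you whether any kernel/cost pair actually helps.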
The situation doesn’t improve with C5.0: as you can see below, the accuracy is pretty much the same as that achieved via SVM.
|The Decision Tree returned by C5 (plotted with Mathematica).|
So, do we give up?
In the next posts we will see how to improve the accuracy by introducing boosting techniques.
…There are still thousands of details to consider and explain, such as:
linear kernels for SVM, how to measure the accuracy, how to improve the training set and avoid overfitting…
We will see that in the next posts.