We will build a classifier step by step, going through the following stages:
- Data set preparation.
- Feature extraction.
- Feature reduction, comparing different techniques such as PCA, C4.5, and other algorithms.
- Translation of the documents into the proper form to feed the classification algorithms.
- Classification via SVM: we will explore different strategies to adapt a binary classifier to a multi-class problem.
- Benchmarking against other classifiers such as C4.5 and the naive Bayes classifier.
As you can see, it will take plenty of time to cover every method, and I really hope you will take an active part and share your experiences.
After this preamble, let's start with the first step: "data set preparation".
For this experiment I decided to use a very common standard data set: the Reuters-21578 data set.
It is often used in academic contexts to benchmark different strategies.
It is organized in tagged files (.sgm) that are easy to handle when building an ad hoc data set, and you can download it at: http://www.daviddlewis.com/resources/testcollections/
Here is the composition of the data set:
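To give an idea of how easy the tagged files are to handle, here is a minimal Python sketch that pulls the TOPICS, PLACES, title and body out of each document. It is not the Mathematica notebook used for this series. The sample string is invented and only imitates the tag layout of the .sgm files, and the helper names (`parse_sgm`, `tag_text`) are hypothetical. For the real files you would read each one from disk first; I assume a permissive encoding such as latin-1 is the safer choice there.

```python
import re

# Invented sample that only imitates the tag layout of a Reuters-21578 .sgm file.
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D></PLACES>
<TEXT>
<TITLE>SAMPLE COCOA TITLE</TITLE>
<BODY>Sample body about cocoa.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">
<TOPICS></TOPICS>
<PLACES><D>canada</D></PLACES>
<TEXT>
<TITLE>SAMPLE UNLABELED TITLE</TITLE>
<BODY>Sample body without topics.</BODY>
</TEXT>
</REUTERS>
"""

DOC_RE = re.compile(r"<REUTERS(?P<attrs>[^>]*)>(?P<content>.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
D_RE = re.compile(r"<D>(.*?)</D>")


def tag_text(content, tag):
    """Return the text between <tag> and </tag>, or '' if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", content, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_sgm(raw):
    """Split raw SGML text into one dictionary per <REUTERS> document."""
    docs = []
    for match in DOC_RE.finditer(raw):
        content = match.group("content")
        docs.append({
            "attrs": dict(ATTR_RE.findall(match.group("attrs"))),
            "topics": D_RE.findall(tag_text(content, "TOPICS")),
            "places": D_RE.findall(tag_text(content, "PLACES")),
            "title": tag_text(content, "TITLE"),
            "body": tag_text(content, "BODY"),
        })
    return docs


docs = parse_sgm(SAMPLE)
for doc in docs:
    print(doc["attrs"]["NEWID"], doc["topics"], doc["places"])
```

Regular expressions are enough here because the tags are flat and regular; a proper SGML parser would be the more robust choice if you need every field.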
|The Reuters-21578 data set composition|
As you can see, it is a multi-dimensional data set; that is, you can perform classification tasks across different kinds of categories.
For example, you could build a data set from the "TOPICS" classes of the documents where "PLACES" = "Canada", and so on.
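As a rough illustration of such a slice, the Python sketch below keeps only the documents tagged with a given place and returns their TOPICS labels. The three records are toy data invented for the example, and `select_by_place` is a hypothetical helper, not part of any library.

```python
# Toy records invented for illustration; real ones would come from the .sgm files.
docs = [
    {"id": 1, "topics": ["grain", "wheat"], "places": ["canada"]},
    {"id": 2, "topics": ["crude"], "places": ["usa"]},
    {"id": 3, "topics": ["crude"], "places": ["canada", "usa"]},
    {"id": 4, "topics": [], "places": ["canada"]},
]


def select_by_place(docs, place):
    """Return (id, topics) pairs for documents tagged with `place` that have at least one topic."""
    return [
        (doc["id"], doc["topics"])
        for doc in docs
        if place in doc["places"] and doc["topics"]
    ]


canada = select_by_place(docs, "canada")
print(canada)
```

Documents without any TOPICS label are dropped here because they cannot serve as labeled examples; whether to keep them is a design choice that depends on the task.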
For our tests I decided to work with the TOPICS categories: there are documents belonging to 135 different TOPICS classes.
In a realistic scenario (from an enterprise perspective), building an efficient classifier for 135 classes is too expensive: in the effort needed to achieve good accuracy for each class, in maintenance, and in computation.
So, if there are no business constraints, it is important to perform a quantitative analysis to detect which classes are the most common.
|The most important classes for "TOPICS" categories|
How many classes do I have to consider to cover at least X% of all the documents?
|x-axis = classes (sorted by volume) --- y-axis = % of documents covered|
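The coverage question can be sketched in a few lines of Python. The label lists below are toy data invented for the example, not the real Reuters counts, and `classes_needed` is a hypothetical helper. I assume that a document counts as covered when at least one of its topics is among the selected classes, and that documents with no topic are left out of the denominator.

```python
from collections import Counter


def classes_needed(doc_topics, target):
    """Return (k, curve): the smallest number k of top classes by volume whose
    documents make up at least `target` (a fraction in [0, 1]) of the labeled
    documents, plus the coverage curve computed up to that point."""
    labeled = [set(topics) for topics in doc_topics if topics]
    if not labeled:
        return 0, []
    counts = Counter(topic for topics in labeled for topic in topics)
    ranked = [topic for topic, _ in counts.most_common()]

    selected = set()
    curve = []
    for k, topic in enumerate(ranked, start=1):
        selected.add(topic)
        # A document is covered if it has at least one selected topic.
        covered = sum(1 for topics in labeled if topics & selected)
        fraction = covered / len(labeled)
        curve.append((k, topic, fraction))
        if fraction >= target:
            return k, curve
    return len(ranked), curve


# Toy labels invented for illustration (one list of topics per document).
doc_topics = [
    ["earn"], ["earn"], ["earn"], ["earn"],
    ["acq"], ["acq"],
    ["crude"],
    ["grain", "wheat"],
    [],  # unlabeled document, ignored
]

k, curve = classes_needed(doc_topics, target=0.75)
print(k)
for rank, topic, fraction in curve:
    print(rank, topic, f"{fraction:.0%}")
```

Counting covered documents rather than summing class frequencies matters because a document can carry several topics, so simply adding up class sizes would count it more than once.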
If you are interested in playing with the Reuters data set and following the step-by-step classifier building, I have prepared a Wolfram Mathematica notebook that automatically extracts training/test sets using custom categories. Contact me directly at email@example.com