Despite the unsatisfactory results, we tested the different algorithms under the same conditions and found that (at least in this case) they behave the same way. Of course, we should repeat the test under different conditions to provide a proper benchmark; we will do that another time.
Right now, though, I would like to run a benchmark that is more useful for our classification target:
Consider an SVM-based classifier, trained with the same parameter-tuning strategy each time: how does the accuracy change if we build the vectors using different bags of words?
Basically, we are trying to understand which strategy is best for selecting the feature set (that is, the bag of words) to classify our 4-class data set.
Let me clarify an important point right away:
- The “single bag of words” strategy isn’t really useful in real business scenarios: it is far too trivial.
So why are we running such a test?
Because this kind of test can give us a lot of useful information about the data set and helps us choose the best strategy!
We have already seen that measuring the accuracy on the training set is extremely dangerous (sooner or later I should explain the pillars of error theory and the overfitting problem…).
So the first step is to assess your experiment with a validation set (there are more sophisticated techniques, such as k-fold cross-validation, but for our case this should be enough). How can we do that?
Easy: split the training set into two parts. The first is used to train the algorithm, and the second is used to measure the accuracy.
Be aware: you can never measure the accuracy using samples that the learning algorithm has already seen. And of course you cannot measure the accuracy on the test set and then use it to retrain your system! (Please see the article by Dr. Sandro Saitta at: http://www.dataminingblog.com/how-to-cheat-with-data-mining).
|Training Set splitting|
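To make the splitting step concrete, here is a minimal sketch in plain Python. The function name `holdout_split`, the 30% validation fraction, the fixed seed, and the toy documents are all my assumptions for illustration (the last two class labels are placeholders); they are not the values used in the experiment.

```python
import random

def holdout_split(samples, validation_fraction=0.3, seed=42):
    """Shuffle the samples and split them into disjoint (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(samples)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Toy labelled documents (hypothetical): 20 samples spread over 4 classes.
labels = ["grain", "oil", "class_c", "class_d"]
documents = [("doc_%d" % i, labels[i % 4]) for i in range(20)]

train, validation = holdout_split(documents)
```

The classifier is then trained only on `train`, and the accuracy is measured only on `validation`, so no sample is ever used for both.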
The second step is to build the vectors of the same training set using different bags of words. For this experiment, I used the “TF-DF”, “TF-IDF”, and “Closeness Centrality” functions to extract the features and build the vectors.
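As an illustration of what "building the vectors with a bag of words" can look like, here is a rough sketch for the TF-IDF case only. The scoring rule (keep, for each term, its highest TF-IDF weight over the corpus, then take the top k terms), the function names, and the toy documents are my assumptions for this example, not necessarily the exact procedure behind the results in this post.

```python
import math
from collections import Counter

def tfidf_bag_of_words(tokenized_docs, k):
    """Return the k terms with the highest TF-IDF weight seen in any document."""
    docs = [tokens for tokens in tokenized_docs if tokens]  # skip empty documents
    n_docs = len(docs)
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))
    scores = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / document_frequency[term])
            scores[term] = max(scores[term], tf * idf)
    # Sort by descending score; break ties alphabetically to stay deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

def vectorize(tokens, bag):
    """Map a document to a vector of raw term counts over the chosen bag."""
    counts = Counter(tokens)
    return [counts[term] for term in bag]

corpus = [
    ["wheat", "harvest", "price"],
    ["oil", "barrel", "price"],
    ["wheat", "export", "price"],
]
bag = tfidf_bag_of_words(corpus, k=4)
vector = vectorize(["oil", "barrel", "oil"], bag)
```

In this toy corpus "price" appears in every document, so its IDF is zero and it never makes it into the bag. Swapping `tfidf_bag_of_words` for another scoring function changes the feature set while the rest of the pipeline stays the same.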
At this point we can assess the accuracy of the classifier on the same data set, built in different ways.
|Accuracy test using TF-DF bag of words. |
In the accuracy matrices (in percentage format), the principal diagonal contains the true positives.
The tables above show the answers provided by the classifier. The element [i,j] = x means that the real class was i and the classifier assigned x% of the documents belonging to i to class j.
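For readers who want to reproduce this kind of table, here is a small sketch of how a row-normalised accuracy matrix can be computed. The function name `accuracy_matrix` and the labels below are invented toy data for illustration; they are not the results of the experiment.

```python
def accuracy_matrix(true_labels, predicted_labels, classes):
    """Row i, column j: % of documents of real class i assigned to class j."""
    index = {name: position for position, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for real, predicted in zip(true_labels, predicted_labels):
        counts[index[real]][index[predicted]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no documents gets a row of zeros instead of a division error.
        matrix.append([100.0 * cell / total if total else 0.0 for cell in row])
    return matrix

# Toy example with two classes and four documents each.
classes = ["grain", "oil"]
real = ["grain"] * 4 + ["oil"] * 4
predicted = ["grain", "grain", "grain", "oil", "oil", "oil", "grain", "grain"]

matrix = accuracy_matrix(real, predicted, classes)
```

Here the grain row is [75.0, 25.0] and the oil row is [50.0, 50.0]: each row sums to 100, and the diagonal holds the true positive percentages.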
For the graph-theory-based bag of words, we can also show the corpus distribution in terms of the closeness function:
|Graph of the corpus generated by Closeness Centrality (only the most relevant features have been plotted)|
Red color = high Closeness Centrality.
Purple = low Closeness Centrality.
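If you are not familiar with closeness centrality, the following sketch shows the idea on a tiny word graph. The graph construction (linking words that appear next to each other in a document) is only my assumption for the example, and the score is the plain definition: the number of reachable words divided by the sum of the shortest-path distances to them.

```python
from collections import defaultdict, deque

def build_word_graph(tokenized_docs):
    """Link words that appear next to each other (an assumed construction)."""
    graph = defaultdict(set)
    for tokens in tokenized_docs:
        for left, right in zip(tokens, tokens[1:]):
            if left != right:
                graph[left].add(right)
                graph[right].add(left)
    return graph

def closeness_centrality(graph):
    """Closeness of each node: reachable nodes / sum of shortest-path distances."""
    scores = {}
    for source in graph:
        distances = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives shortest paths in hops
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in distances:
                    distances[neighbour] = distances[node] + 1
                    queue.append(neighbour)
        total = sum(distances.values())
        reachable = len(distances) - 1
        scores[source] = reachable / total if total else 0.0
    return scores

corpus = [["oil", "price", "rise"], ["grain", "price", "fall"]]
closeness = closeness_centrality(build_word_graph(corpus))
ranked = sorted(closeness, key=lambda word: (-closeness[word], word))
```

In this toy graph "price" sits in the middle and gets the highest score, so it would be the first word selected for the bag; a real corpus needs a more careful graph construction.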
There are three important things to highlight:
- For this data set, the TF-DF feature set returns the worst accuracy (grain class TP = 43%) compared with the other two methods.
- TF-IDF and Closeness Centrality give better results, and their accuracy is almost identical in absolute terms.
- The grain and oil classes are the tricky case: the algorithm is not able to separate them properly.
As you can imagine, the third finding is the most important piece of information returned by this experiment: it tells us that we have to put more effort into highlighting the differences between the oil and grain classes in order to increase the accuracy.
Notice that in the “TF-IDF” and “Closeness Centrality” results the True Positive distribution is mirrored! This supports our thesis that we are not properly describing the documents belonging to these two classes.
We will see how to solve the problem!