Thursday, September 15, 2011

How the classifier accuracy is conditioned by the Feature Set (Bag of words)

In the last post we saw a naïve approach to classification with SVM and C5, and we noticed that neither SVM nor C5 achieved good accuracy.
Despite the unsatisfactory results, we tested the two algorithms under the same conditions and found that (at least in this case) they behave the same way. Of course, we should repeat the test under different conditions to provide a proper benchmark; we will do that in a future post.
Right now, though, I would like to run a benchmark that is more useful for our classification target:
Consider an SVM-based classifier, always trained with the same parameter-tuning strategy: how does the accuracy change if we build the vectors using different bags of words?
Basically, we are trying to understand which strategy is best for selecting the feature set (that is, the bag of words) to classify our 4-class data set.
Let me clarify an important point right away:

  • The “single bag of words” strategy is not really useful in real business scenarios: it is far too trivial.
So why are we running such a test?
Because this kind of test can give us a lot of useful information about the data set and helps us choose the best strategy!

Test description.
We have already seen that measuring the accuracy on the training set is extremely dangerous (sooner or later I should explain the pillars of error theory and the overfitting problem…).
So the first step is to assess your experiment with a validation set (there are more sophisticated techniques for this, such as k-fold cross-validation, but for our case a validation set should be enough). How can we do that?
Easy: split the training set into two parts. The first one is used to train the algorithm, and the second one is used to measure the accuracy.
Be aware: you can never measure the accuracy using samples already seen by the learning algorithm. And of course you cannot measure the accuracy on the test set in order to retrain your system! (Please see the article by Dr. Sandro Saitta at:

Training Set splitting
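
The split described above can be sketched in a few lines. This is only a minimal illustration: the function name, the 30% validation fraction, the fixed seed, and the toy documents are assumptions of mine, not values used in the experiment.

```python
import random


def split_training_set(samples, validation_fraction=0.3, seed=42):
    """Shuffle labeled samples and split them into (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(round(len(shuffled) * validation_fraction)))
    # The two parts never share a sample.
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy labeled documents, made up for illustration only.
documents = [("doc%d" % i, "grain" if i % 2 else "oil") for i in range(10)]
train, validation = split_training_set(documents)
print(len(train), "training samples,", len(validation), "validation samples")
```

The classifier is trained only on `train`; `validation` is touched only to measure the accuracy.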

The second step is to build the vectors for the same training set using different bags of words. For this experiment, I used the “TF-DF”, “TF-IDF”, and “Closeness Centrality” functions to extract the features and build the vectors.
At this point we can assess the accuracy of the classifier on the same data set, built in different ways.
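
To make the idea concrete, here is a rough sketch of how one bag of words could be selected and then used to build the vectors. It covers only the TF-IDF case, and the details are my assumptions for illustration: the post does not say how the words are ranked, so the code simply sums each word's TF-IDF over the corpus and keeps the top-scoring words. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_bag_of_words(tokenized_docs, size):
    """Return the `size` words with the highest TF-IDF summed over the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    scores = Counter()
    for doc in tokenized_docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:size]


def vectorize(doc, bag):
    """Map a tokenized document to term counts over the chosen bag of words."""
    counts = Counter(doc)
    return [counts[word] for word in bag]


# Toy corpus, made up for illustration only.
docs = [
    ["wheat", "corn", "price", "rise"],
    ["oil", "barrel", "price", "fall"],
    ["wheat", "harvest", "price", "corn"],
    ["oil", "opec", "price", "barrel"],
]
bag = tfidf_bag_of_words(docs, size=8)
vectors = [vectorize(doc, bag) for doc in docs]
```

In this toy corpus "price" appears in every document, so its IDF is zero and it is the one word left out of the bag. Swapping in a different scoring function changes the bag, and therefore the vectors, while the classifier stays the same.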
Accuracy test using TF-DF bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.
Accuracy test using TF-IDF bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.
Accuracy test using Closeness Centrality based bag of words.
In the accuracy matrix (% format) the principal diagonal contains the true positives.

The tables above show (in percentages) the answers provided by the classifier (the diagonal holds the true positives). The element [i,j] = x means that the real class was i and the classifier assigned x% of the documents belonging to i to class j.
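
As a small sketch of how such a matrix can be computed from the validation answers: each row is normalized by the number of documents of the real class. The function name and the toy labels are mine, and the numbers are not the results of this experiment.

```python
def accuracy_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix in percent.

    Element [i][j] is the percentage of documents of real class i
    that the classifier assigned to class j.
    """
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for real, predicted in zip(true_labels, predicted_labels):
        counts[index[real]][index[predicted]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


# Toy answers, made up for illustration only.
classes = ["grain", "oil"]
real = ["grain"] * 4 + ["oil"] * 4
predicted = ["grain", "grain", "grain", "oil", "oil", "oil", "grain", "grain"]
matrix = accuracy_matrix(real, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

Each row sums to 100, so the rows can be compared even when the classes have different sizes.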

For the "graph theory based" bag of words, we can also show the corpus distribution in terms of the closeness function:
Graph of the corpus generated by Closeness Centrality (only the most relevant features have been plotted)
Red = high Closeness Centrality.
Purple = low Closeness Centrality.
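
For readers who have not met closeness centrality before, here is a minimal sketch of the measure on a word graph. How the graph behind the picture was actually built is not described here, so the construction below (two words are linked when they appear in the same document) is purely my assumption, as are the function names and the toy corpus. The closeness of a word is the number of other words it can reach divided by the sum of its shortest-path distances to them.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(tokenized_docs):
    """Undirected graph: words are nodes, linked if they share a document."""
    graph = {}
    for doc in tokenized_docs:
        words = set(doc)
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def closeness_centrality(graph, node):
    """(reachable nodes) / (sum of shortest-path distances), via BFS."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Toy corpus, made up for illustration: it yields the chain
# wheat - price - oil - barrel.
docs = [["wheat", "price"], ["oil", "price"], ["oil", "barrel"]]
graph = cooccurrence_graph(docs)
ranking = sorted(graph, key=lambda w: (-closeness_centrality(graph, w), w))
```

Words in the middle of the chain ("oil", "price") score higher than the words at its ends, which is the intuition behind using the measure to pick the features.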

There are three important things to highlight:

  1. For this data set, the TF-DF feature set gives the worst accuracy (grain class TP = 43%) compared with the other two methods.
  2. TF-IDF and Closeness Centrality give better results, and their accuracy is almost identical in absolute terms.
  3. The grain and oil classes are the tricky case: the algorithm is not able to separate them properly.

As you can imagine, the third finding is the most important piece of information returned by this experiment: it tells us that we have to put more effort into highlighting the differences between the oil and grain classes in order to increase the accuracy.
Notice that with “TF-IDF” and “Closeness Centrality” the true positive distribution is mirrored! This supports our thesis that we are not properly describing the documents belonging to these two classes.
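
This kind of finding can also be read off an accuracy matrix automatically. The sketch below looks for the pair of classes with the largest mutual confusion, that is, the largest sum of the two off-diagonal elements [i,j] and [j,i]. The helper name is hypothetical, "class_a" is a placeholder class, and the percentages are invented for illustration: they are not the values measured in this experiment.

```python
def most_confused_pair(matrix, classes):
    """Return (class_i, class_j, confusion) for the pair with the largest
    matrix[i][j] + matrix[j][i], i.e. the two classes mixed up most often."""
    best = None
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            confusion = matrix[i][j] + matrix[j][i]
            if best is None or confusion > best[0]:
                best = (confusion, classes[i], classes[j])
    return best[1], best[2], best[0]


# Invented percentages, for illustration only.
toy_classes = ["grain", "oil", "class_a"]
toy_matrix = [
    [55.0, 40.0, 5.0],
    [45.0, 50.0, 5.0],
    [3.0, 2.0, 95.0],
]
print(most_confused_pair(toy_matrix, toy_classes))
```

The pair returned is where extra feature work is most likely to pay off.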
We will see how to solve the problem!
Stay Tuned.


  1. I wonder if you can improve the accuracy by using n-grams (tuples of n consecutive words) instead of simple words. It will surely make the feature space much bigger, but it will capture grammatical relationships as well, improving accuracy.

  2. For some reason, my comment appears as Kitten Lulu (an old blog name). My name is actually Roberto Lupi.

  3. Hi Roberto,
    Thank you for your contribution!
    There are many articles on n-grams and on semantic analysis claiming good accuracy.
    But that is not always true. Consider, for example, a corpus composed of documents like invoices or tables: in these cases the semantic techniques don't work! In addition, semantic methods are not well suited to documents with few words.
    From a corporate perspective, consider also that semantic methods (and n-grams) are a real pitfall, because these techniques are language dependent: if your corpus is composed of documents in different languages, you have to replicate the system many times!
    Latent semantic analysis seems to be more promising:
    it considers co-occurrences of consecutive words (which are not necessarily n-grams), and it is language agnostic (but it suffers in the case of polysemy).
    BTW, there is no method that fits every data set (IMHO)!
    I choose different strategies for different data sets: this is why a statistical and descriptive analysis of the data set is always recommended.

    1. As the "No free lunch" theorem teaches us... :)

      I love this blog!
      Thank you for your posts.

  4. This result is quite impressive. While reading the article I immediately thought that applying the Boosting philosophy could be interesting: using different BoW representations instead of totally different classifiers.

    I will look into it further. In the meantime, what do you think about it?


  5. Hi Michele,
    thanks for your interest in this blog!
    It is one of the tricks often used in real "document classification" scenarios. The core classifier is always the same, but you feed the set of classifiers with data sets built in different ways.
    BTW, I hope to see your name in the "followers list" on the right panel of the blog.