Tuesday, August 2, 2011

Document Classification part 1 - Data Set Analysis

As promised, in this post I'm going to discuss an important branch of text analytics: text categorization.
We will build a classifier step by step, as follows:

  • Data set preparation.
  • Feature extraction.
  • Feature reduction, comparing different techniques such as PCA, C4.5, and other algorithms.
  • Conversion of the documents into the proper form to feed the classification algorithms.
  • Classification via SVM: we will explore different strategies to adapt a boolean (binary) classifier to the multi-class problem.
  • Benchmarking against other classifiers such as C4.5 and the naive Bayes classifier.
As you can see, it will take plenty of time to show every method, and I really hope you will take an active part and share your experiences.

After the preamble, let's start with the first step: "data set preparation".
For this experiment I decided to use a very common standard data set: the Reuters-21578 data set.
It is often used in academic contexts to benchmark different strategies.
It is organized in tagged SGML files (.sgm) that are easy to process when building ad hoc data sets, and you can download it at: http://www.daviddlewis.com/resources/testcollections/
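
To give an idea of how the tagged files can be handled, here is a minimal Python sketch (not the Mathematica notebook mentioned at the end of this post). It assumes the usual Reuters-21578 layout, where each document is wrapped in a <REUTERS ...> element and the TOPICS and PLACES values appear as <D>...</D> entries. The function names parse_sgm and tag_values, and the SAMPLE string, are made up for illustration.

```python
import re

# One <REUTERS ...> ... </REUTERS> element per document (assumed layout).
DOC_RE = re.compile(r"<REUTERS(?P<attrs>[^>]*)>(?P<content>.*?)</REUTERS>", re.DOTALL)
NEWID_RE = re.compile(r'NEWID="(\d+)"')


def tag_values(content, tag):
    """Return the <D>...</D> entries found inside <TAG>...</TAG>, or [] if none."""
    block = re.search(rf"<{tag}>(.*?)</{tag}>", content, re.DOTALL)
    return re.findall(r"<D>(.*?)</D>", block.group(1)) if block else []


def parse_sgm(text):
    """Turn the raw text of one .sgm file into a list of small dictionaries."""
    docs = []
    for match in DOC_RE.finditer(text):
        content = match.group("content")
        newid = NEWID_RE.search(match.group("attrs"))
        body = re.search(r"<BODY>(.*?)</BODY>", content, re.DOTALL)
        docs.append({
            "id": int(newid.group(1)) if newid else None,
            "topics": tag_values(content, "TOPICS"),
            "places": tag_values(content, "PLACES"),
            "body": body.group(1).strip() if body else "",
        })
    return docs


# Hand-written sample in the assumed format (not real Reuters text).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D></PLACES>
<TEXT><TITLE>SAMPLE TITLE</TITLE>
<BODY>Sample body text.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">
<TOPICS></TOPICS>
<PLACES><D>canada</D></PLACES>
<TEXT><BODY>Another body.</BODY></TEXT>
</REUTERS>"""

docs = parse_sgm(SAMPLE)
for doc in docs:
    print(doc["id"], doc["topics"], doc["places"])
```

For the real files you would read each .sgm file from disk and pass its text to parse_sgm; the right file encoding is something to check on your copy of the collection, since a regex-based reader like this one is only a rough approximation of a proper SGML parser.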

Here is the composition of the data set:
[Figure: The Reuters-21578 data set composition]
As you can see, it is a multi-dimensional data set: you can perform classification tasks across different kinds of categories.
For example, you could build a data set from the "TOPICS" classes of only those documents where "PLACES" = "Canada", and so on.
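
A selection of this kind could be sketched in Python as below. This is only an illustration under stated assumptions: the documents are already available as dictionaries with "id", "topics" and "places" keys, place names are stored in lowercase (e.g. "canada"), and the function name topics_dataset and the toy records are hypothetical. The optional single_topic_only flag keeps only documents whose TOPICS tag contains exactly one class.

```python
def topics_dataset(docs, place=None, single_topic_only=False):
    """Return (doc_id, topics) pairs for documents that carry at least one topic.

    place             -- keep only documents whose PLACES list contains this value
    single_topic_only -- keep only documents with exactly one TOPICS entry
    """
    selected = []
    for doc in docs:
        if not doc["topics"]:
            continue  # skip documents with an empty TOPICS list
        if place is not None and place not in doc["places"]:
            continue  # skip documents from other places
        if single_topic_only and len(doc["topics"]) != 1:
            continue  # skip multi-topic documents when asked to
        selected.append((doc["id"], doc["topics"]))
    return selected


# Toy records, invented for the example.
docs = [
    {"id": 1, "topics": ["grain", "wheat"], "places": ["canada"]},
    {"id": 2, "topics": ["crude"], "places": ["canada", "usa"]},
    {"id": 3, "topics": ["earn"], "places": ["usa"]},
    {"id": 4, "topics": [], "places": ["canada"]},
]

canada = topics_dataset(docs, place="canada")
canada_single = topics_dataset(docs, place="canada", single_topic_only=True)
print(canada)
print(canada_single)
```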
For our tests I decided to work with the TOPICS categories, which comprise 135 different classes.
In a realistic scenario (from an enterprise perspective), building an efficient classifier for 135 classes is too expensive: in terms of the effort needed to achieve good accuracy for each class, in terms of maintenance, and in terms of computation.
So, if there are no business constraints, it is important to perform a quantitative analysis to detect which classes are the most common.

[Figure: The most important classes for the "TOPICS" categories]
The important question is:
How many classes do I have to consider to cover at least X% of all the documents?
[Figure: x-axis = classes (sorted by volume); y-axis = % of documents covered]
As you can see, by considering, for example, only 20 classes instead of 135, we can cover around 80% of all the documents in the data set: this is the "Pareto Principle"!
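
The coverage question can be answered with a few lines of code. The sketch below is one possible way to do it in Python, on invented toy data. It assumes that "covered" means "the document has at least one of the classes kept so far" and that documents without any topic are left out of the total; other definitions are possible, and the function names coverage_curve and classes_needed are hypothetical.

```python
from collections import Counter


def coverage_curve(doc_topics):
    """Return [(topic, cumulative_fraction_covered), ...], most frequent topic first.

    doc_topics -- one list of topics per document; documents with no topic are ignored.
    """
    labeled = [set(topics) for topics in doc_topics if topics]
    counts = Counter(topic for topics in labeled for topic in topics)
    curve = []
    kept = set()
    for topic, _ in counts.most_common():
        kept.add(topic)
        # A document counts as covered if it has at least one kept topic.
        covered = sum(1 for topics in labeled if topics & kept)
        curve.append((topic, covered / len(labeled)))
    return curve


def classes_needed(curve, target):
    """Smallest number of top classes whose coverage reaches the target fraction."""
    for k, (_, fraction) in enumerate(curve, start=1):
        if fraction >= target:
            return k
    return None


# Toy labels, invented for the example (the last document has no topic).
doc_topics = [
    ["earn"], ["earn"], ["earn"], ["earn"],
    ["acq"], ["acq"], ["acq"],
    ["crude"],
    ["grain", "wheat"],
    [],
]

curve = coverage_curve(doc_topics)
for topic, fraction in curve:
    print(f"{topic:6s} {fraction:.2f}")
print("classes needed for 75% coverage:", classes_needed(curve, 0.75))
```

On the real data set, doc_topics would be the list of TOPICS values of every document, and plotting the second column of the curve gives the chart described above.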

If you are interested in playing with the Reuters data set and following the classifier building step by step, I have prepared a Wolfram Mathematica notebook that automatically extracts training/test sets using custom categories. Contact me directly at cristian.mesiano@gmail.com


  1. It's fun to see that Reuters-21578 is still finding uses, at least pedagogically. A few comments:

    1. For most research purposes RCV1-v2 is today more useful: more documents, more categories, better understanding of the accuracy of the manual categorization, etc.

    2. It's important to mention that Reuters documents can belong to 0, 1, or several of the 135 TOPICS categories. Given that, I'm not clear on how you're defining coverage. Is it an upper bound on the proportion of documents having at least one category that would get at least one of their categories?

    3. Building and maintaining classifiers for 135 categories is not necessarily, or even usually, a problem operationally. The limiting factor would not be computation, but availability of people who have been trained to label training and validation data for those 135 categories.

  2. First of all, many thanks for taking part in this discussion. I'm very honored!
    About point 2: One of the aims of this blog-tutorial is to provide an overview of the techniques and methods used to classify texts. So, in the beginning I'm going to classify documents without overlap. Once we reach a more advanced level, I'll show some techniques for handling overlapping categories.

    Point 3: I totally agree with you!
    Thanks again.
    I really hope you will give us more valuable contributions!
    ...I cannot forget some of your papers that I read years ago during my university days!!

  3. Cristian - And thanks to you for making the effort to teach people.

    Re point 2, so is the idea that you will be working only with the subset of documents that have exactly one category?

  4. Exactly, David.
    I'm selecting documents in which the "TOPICS" tag contains exactly one class.
    I intend to classify 4-5 classes of documents.
    @blog community:
    Dr. David Lewis is a LEGEND of "Text Mining" and "Text Categorization".
    He is the father of the "Reuters Data Set" as well!