Monday, August 29, 2011

Feature Reduction part 2: C5 vs PCA - Road Map

In the last post I showed how to reduce the complexity (dimension) of the problem through the PCA algorithm.
I mentioned that PCA works well under the condition that the data set has a homogeneous density.
Are we in this scenario?
I analyzed our data set to verify how the points are distributed; to do that, I used the first three components suggested by PCA for the classes having TOPICS = "GOLD" and "GRAIN".
To better highlight the density, I perturbed the 3D coordinates with a weak Gaussian noise, just to avoid the presence of coincident points. Here are the results:
In green the "GRAIN" class, in violet the "GOLD" class. Both perturbed with weak Gaussian noise.
As you can see, both data sets are pretty heterogeneous! So the enthusiasm sprung from the strong reduction obtained with PCA was maybe too optimistic!
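For reference, here is a minimal Mathematica sketch of how such a jittered 3D plot can be produced; grainPts and goldPts are hypothetical n x 3 matrices holding the first three PCA coordinates of each class, and the noise level sigma is an arbitrary choice.

(* perturb the coordinates with weak Gaussian noise to break up coincident points *)
jitter[pts_, sigma_: 0.01] := pts + RandomVariate[NormalDistribution[0, sigma], Dimensions[pts]]

ListPointPlot3D[{jitter[grainPts], jitter[goldPts]}, PlotStyle -> {Green, Purple}]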
A countercheck of this analysis is provided by the second feature reduction algorithm we are testing: C5.

A C5 decision tree is constructed using GainRatio. GainRatio is a measure based on entropy, and entropy measures how disordered the data set is. Further details about C5 are available on Quinlan's website: http://www.rulequest.com
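Just to fix ideas, here is a minimal Mathematica sketch of entropy and GainRatio for a categorical split; it is only the textbook definition, not the actual C5 code.

(* entropy (in bits) of a list of class labels *)
entropy[labels_List] := -Total[(#/Length[labels])*Log[2, #/Length[labels]] & /@ (Last /@ Tally[labels])]

(* GainRatio of a categorical feature: information gain normalized by the split information *)
gainRatio[feature_List, labels_List] :=
 Module[{n = Length[labels], groups, gain, splitInfo},
  groups = GatherBy[Transpose[{feature, labels}], First];
  gain = entropy[labels] - Total[(Length[#]/n)*entropy[Last /@ #] & /@ groups];
  splitInfo = -Total[(Length[#]/n)*Log[2, Length[#]/n] & /@ groups];
  If[splitInfo == 0, 0, gain/splitInfo]]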
An extremely interesting book about applications of entropy is the famous "Cover - Thomas":

Elements of Information Theory, 2nd Edition

C5 is really easy to manage. In the default version you don't need to tune any parameter, even though a lot of configurations are available to improve the accuracy. An interesting feature (from my perspective) is the boosting mode... but we will discuss that next time...

I wrote a wrapper in Mathematica to:
  • prepare the configuration files (.names and .data);
  • adapt the data set from LIBSVM format to C5 format (sketched below);
  • display the tree output.
Contact me if you would like the notebook.
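Just to give an idea of what the wrapper does, here is a minimal sketch of the LIBSVM-to-C5 conversion, assuming dense vectors of a known length, classes encoded as 1, 2, 3 in the LIBSVM labels, and hypothetical file names; the exact .names syntax may need adjusting to your C5 release.

(* turn LIBSVM lines like "2 3:0.5 7:1.2" into dense rows ending with the class name *)
libsvmToC5[lines_List, nFeatures_Integer, classNames_List] :=
 Module[{parseLine},
  parseLine[line_String] :=
   Module[{tokens = StringSplit[line], vec = ConstantArray[0., nFeatures]},
    Scan[(vec[[ToExpression[#[[1]]]]] = ToExpression[#[[2]]]) &,
     StringSplit[#, ":"] & /@ Rest[tokens]];
    Append[vec, classNames[[ToExpression[First[tokens]]]]]];
  parseLine /@ lines]

rows = libsvmToC5[ReadList["gold.libsvm", String], 1000, {"A", "B", "C"}];
Export["gold.data", rows, "CSV"];
Export["gold.names",
 StringJoin["A,B,C.\n",
  StringJoin @@ Table["f" <> ToString[i] <> ": continuous.\n", {i, 1000}]], "Text"];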
Let's look at the output of C5 using the default settings:


The decision tree returned by C5
The leaves (depicted as red triangles) represent the class assigned by the algorithm along that decision path.
A = "GOLD", B = "GRAIN", C = "OIL".
The left member of each node inequality is the feature number.
The above tree represents the decision paths for a data set composed of the three classes having TOPICS = "GOLD", "GRAIN", "OIL".
To build the vectors I used the TF-DF feature set of the class "GOLD".
(In the next post I'll explain why I'm building vectors of class X using the feature set of class Y :) ).
As you can see in the graph, only a few features have been used:

  
Notice that C5 returns different features with respect to the PCA algorithm! Moreover, it suggests using only 7 features: an amazing complexity reduction!
I used C5 just to analyze the important features, but C5 is itself a classifier! The image below shows the classification accuracy for the mentioned classes.




So C5 gives us a set of rules to classify documents belonging to three different classes (the rules are depicted by the tree paths from the "start" node to each leaf), with an overall error rate of less than 15%.
Is it a good result? Apparently yes, but... if you have a look at the class "GRAIN" (B), the results are not so brilliant!
Don't worry, we will see how to improve this accuracy dramatically.
At this point we have discussed many algorithms, techniques and data sets.
So where are we, and where are we going?
Auto classification Road Map


Guys, Stay Tuned.
cristian

Wednesday, August 17, 2011

Build the vectors and Feature Reduction

Part I: building the vectors
In the previous post we created a feature set for the class GOLD; now we show how to use this feature set to build the vectors that feed the SVM.
The feature set is the fingerprint that characterizes the document class.
To build the vectors, you have to follow the same parsing and filtering process used to extract the features.
As described in the previous post, starting from the original file you have to follow the steps below (a minimal sketch follows the list):
  • split the file into words;
  • remove the words having length <= 2;
  • remove the stop words;
  • take the "root" of each word (e.g. "large" from "larger").
On the filtered file just obtained, you can compute any sort of statistic to associate a score with each word of the filtered file.
For our purpose the simple frequency could be enough.
The last step is to combine the score just calculated for each word of the filtered file with the score associated with each word of the feature set.
This step is illustrated below:
"FT score" stands for feature set score; < . , . > stands for a linear operator.
Notice that a vector component takes the value 0 when the filtered file doesn't contain any occurrence of the specific feature.
The grey box is the final real-valued vector.
As you can see, the vector has the same dimension as the feature set; in this case (built using TF-DF) the dimension is close to 1000.
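Here is a minimal sketch of this projection, assuming featureWords and featureScores hold the (sorted) feature set and its TF-DF weights, and reading the < . , . > operator simply as "frequency in the document times feature weight" (the exact operator may differ); "doc.txt" is a hypothetical file name.

buildVector[tokens_List, featureWords_List, featureScores_List] :=
 Module[{counts = Tally[tokens], countOf},
  countOf[w_] := Total[Cases[counts, {w, c_} :> c]];       (* 0 when the word is absent *)
  MapThread[#2*countOf[#1] &, {featureWords, featureScores}]]

(* usage: the vector has the same dimension as the feature set *)
vec = buildVector[filterTokens[Import["doc.txt", "Text"], stopWords],
   featureWords, featureScores];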
In an enterprise environment this could become a bottleneck in terms of computational capability, due to the strict relation between the vector dimension and the time needed to classify.

Part II: reducing the dimension
How can we reduce the dimension of the problem to speed up the entire classification process?
There are different methods; the first I would like to show is Principal Component Analysis (PCA).
As usual, I'm not going to go into technical and theoretical issues: there are thousands of websites, tutorials and papers where you can find every sort of detail on that.
Just to have an overview, PCA provides the most important linear components of a set of data. It is based on the concept of eigenvectors and eigenvalues: M.x = λx, where λ is the eigenvalue and x is the eigenvector.
PCA has two great features:
  1. it is completely parameter-free;
  2. it is extremely easy to implement.
In Mathematica the implementation takes no more than 4-5 lines of code:

the variable matrix is the data set, whatever its dimension

Basically you center the data (so that it has mean = 0), you compute the covariance matrix, and from that you extract the eigenvalues and eigenvectors.
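The actual notebook is available on request; a minimal sketch of those few lines could look like this, with matrix being the n x d data set:

pcaComponents[matrix_] := Module[{centered, vals, vecs},
  centered = # - Mean[matrix] & /@ matrix;               (* mean = 0 for every column *)
  {vals, vecs} = Eigensystem[N[Covariance[centered]]];   (* sorted by decreasing eigenvalue *)
  {vals, centered.Transpose[vecs]}]                      (* eigenvalues and projected data *)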
The positions corresponding to the eigenvectors with the highest eigenvalues are the most relevant components: those components are the features most important for classifying the documents with a minimum loss of accuracy!
Just to understand the role played by the eigenvectors, consider the following multivariate Gaussian data set:

Gaussian-distributed data set

The principal components are well represented by the red vectors:
Principal Components for the above data set 
Nevertheless, PCA is not useful on data having non-uniform density, so before discarding components (features) from your data set, I suggest analyzing the density!

What happens with our classes?
I computed the PCA for the GOLD class, but the question now is: what is the minimum threshold that discriminates the important components from the negligible ones?
...Sometimes a simple log plot helps the analyst:

The blue arrow marks the discontinuity point of the eigenvalues
As shown above, around component number 82 there is a discontinuity: after this point the magnitude of the eigenvalues is negligible. Here is your threshold!
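In Mathematica terms, a sketch of this step, with the eigenvalues vals already sorted in decreasing order and a cutoff chosen by eye from the plot:

ListLogPlot[vals, Joined -> True, AxesLabel -> {"component", "eigenvalue"}]
nKept = Count[vals, v_ /; v > cutoff]   (* components before the knee, here ~82 *)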
Remark
The vectors have been built using the sorted feature set list (see the previous post),
and the sorting criterion was based on TF-DF (or TF-IDF, or ...): from the most important, from the TF-DF perspective, to the least important.
The graph above gives us a sort of countercheck of the goodness of the criterion chosen to select the features, because there is a clear correspondence between the weights assigned by TF-DF and the eigenvalues (notice the monotonicity of the graph)!
So, for the GOLD class, we will consider only the first 82 components.
As you can see, we used PCA in two different ways:
1) to countercheck our feature extraction method;
2) to reduce the complexity of the problem from about 1000 to 82.

In the next post, we will see another feature reduction algorithm: C4.5.
...We are now introducing many different approaches to play with the data; in the next post I'll show you the graph of experiments we are doing, to keep track of every result and to choose the best strategy.
As usual: stay tuned
cristian

Tuesday, August 9, 2011

Feature Extraction - Different Approaches

When you build a classifier, and I'm not talking only about text categorization, you should take special care over this step.
If your feature extractor has been built regardless of the context, then whatever amazing classification algorithm you use... the accuracy will always be unsatisfactory.
In text categorization, you should consider at least the following points:
  • documents containing a lot of text or, at the opposite extreme, documents containing few words;
  • documents carrying semantics or, at the opposite extreme, documents without explicit semantics (for example invoices or tables);
  • the context dictionary, that is the set of words used in the domain you are working on (for example acronyms that have a meaning only in a specific context);
  • the overall environment where the classifier will work: are there different languages? Are there heterogeneous sources? Is the manual classification based only on the single document under processing?
  • business constraints, like the maximum time to process a document or the business process itself.
Bag of Words based on TF-(I)DF
The Term Frequency - Inverse Document Frequency (TF-IDF) function is based on a 1972 work by Karen Spärck Jones. It basically combines a local property, the frequency of a specific word in the document, with a global property based on the (inverse) number of documents containing that word. Further technical details are available on the internet (let me know if you are interested in the original published paper).
Even though this function is dated, we will show how to use it in a smart way to achieve good accuracy.
  • When does TF-DF work better than TF-IDF?
When the documents of your domain contain few words (e.g. invoices, spreadsheets, bank statements), the global component of IDF doesn't work well. In this context the simpler TF-DF works better!

In the classification example we are working on, I'm going to use K different bags of words (one bag of words for each class) with both methods... we will check which is the best strategy!
To be honest, I slightly modified the function in this way (the classical, unmodified scores are sketched after the definitions, just as a reference):

Modified TF (I)DF function
Where:
  • W_i[D_j] = number of occurrences of the word W_i in the document D_j;
  • DF[W_i] = number of documents containing the word W_i;
  • |TrSet| = number of documents for the specific class.
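The exact modification is the one shown in the figure above; as a reference point, the classical per-word, per-document scores, written with the same symbols, would look like the following (how these per-document scores are aggregated into the class-level weights listed below is part of the modification and is not reproduced here):

(* occ = W_i[D_j], df = DF[W_i], n = |TrSet| *)
tfidf[occ_, df_, n_] := occ*Log[2, n/df]   (* local frequency x inverse global frequency *)
tfdf[occ_, df_, n_]  := occ*df/n           (* local frequency x direct global frequency *)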
The feature set has been obtained by filtering the documents with the stop word list and stemming each word, keeping only its base form.
Here are the first 20 features extracted for the Reuters data set with TOPICS = "GOLD" (see the previous post for further details), with their respective weights:

TF DF                                     TF IDF
say           51.907757725859135          sigma         71.5899150352952
gold          49.86505578380171           police        64.30524039891371
ounce         22.146187073000476          model         64.30524039891371
ton            8.317113808183201          land          64.30524039891371
mine           7.7300297803876585         grant         64.30524039891371
company        7.112820339974604          china         64.30524039891371
pct            7.094497045719568          pope          64.30524039891371
year           5.215953054827821          rand          60.616126176742966
ore            3.48549104848714           placer        59.847379800543116
production     3.2688965028783508         foot          59.26642022035086
grade          3.059411347433722          say           58.04561519245233
silver         2.5460893593153946         warrant       57.23666647951021
price          2.205337527517457          shoji         55.38951920217253
reserve        2.123335553835266          client        55.38951920217253
drill          2.118861548885319          ridge         55.38951920217253
expect         2.0390821784153337         alligator     55.38951920217253
corp           2.0390821784153337         play          55.38951920217253
foot           2.0236289512856094         vehicle       55.38951920217253
short          2.0225200886354586         grand         55.38951920217253
average        1.9028247115440413         action        55.38951920217253

CLOSENESS CENTRALITY BASED
This is an attempt (I simplified my original version a lot, to highlight the main idea) to use graph theory to build a sort of "latent semantic" feature set.
After the filtering and stemming operations, I partitioned the document into pairs of contiguous words and built a directed graph:
Sample of the directed graph for a file of the class "GOLD"
To assign a score to a feature I used the closeness centrality function: the closeness centrality of a vertex u is defined as the inverse of the sum of the distances from u to all other vertices.
Assigning a color to the vertices of the above graph (warm colors for vertices with a high score) you obtain:
Closeness centrality assignment for the above graph
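A minimal Mathematica sketch of this scoring; tokens is assumed to be the list of filtered and stemmed words of the document, obtained as described above:

g = Graph[DeleteDuplicates[DirectedEdge @@@ Partition[tokens, 2, 1]]];   (* contiguous word pairs *)
scores = Transpose[{VertexList[g], ClosenessCentrality[g]}];
Take[Reverse[SortBy[scores, Last]], 20]                                  (* the 20 best-scored words *)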

The first 20 features extracted using this method are:
bureau ,0.02127659574468085
interior ,0.01639344262295082
import ,0.014705882352941176
check ,0.01135161135161135
revise ,0.01105263157894737
technique ,0.01020408163265306
period ,0.00985400705790957
change ,0.009708737864077669
sigma ,0.009616273693943597
significant ,0.009615384615384616
cabinet ,0.009345794392523364
tunnel ,0.009212305611899932
affect ,0.008928571428571428
viceroy ,0.008796296296296295
conservatively ,0.008284833960972269
milling ,0.008240857798379922
practice ,0.008149473265752336
amenable ,0.008
cove ,0.0078125
department ,0.00757930926707298
How many features do we have to consider to obtain good accuracy?
How can we analyze the differences among the different feature sets seen above?
We will discuss that in the next post.
Contact me at cristian.mesiano@gmail.com if you are interested in the Wolfram Mathematica source code.
Stay Tuned!!

Tuesday, August 2, 2011

Document Classification part 1 - Data Set Analysis

As promised, in this post I'm going to discuss an important branch of text analytics: text categorization.
Step by step, we will build a classifier going through the following steps:

  • data set preparation;
  • feature extraction;
  • feature reduction, comparing different techniques like PCA, C4.5 and other algorithms;
  • translation of the documents into the proper form to feed the classification algorithms;
  • classification via SVM... we will explore different strategies to adapt a boolean classifier into a multi-class algorithm;
  • benchmarking against other classifiers like C4.5 and the naive Bayes classifier.
As you can see, it will take plenty of time to show every method, and I really hope you will take part proactively to exchange experiences.

After this preamble, let's start with the first step: data set preparation.
For this experiment I decided to use a very common standard data set: the Reuters-21578 data set.
It is a data set often used in academic contexts to perform benchmark tests of different strategies.
It is organized in tagged files (sgm) that are easy to handle for building ad hoc data sets, and you can download it at: http://www.daviddlewis.com/resources/testcollections/

Here is the composition of the data set:
The Reuters-21578 data set composition
As you can see it is a multi-dimensional data set, that is, you can perform classification tasks across different kinds of categories.
For example, you could build a data set considering the "TOPICS" classes where "PLACES" = "Canada", and so on.
For our tests I decided to work with the TOPICS categories: there are documents belonging to 135 different TOPICS classes.
In a realistic scenario (from an enterprise perspective), building an efficient classifier for 135 classes is far too expensive: in terms of the effort needed to achieve good accuracy for each class, in terms of maintenance, and in terms of computation.
So, if there are no business constraints, it is important to perform a quantitative analysis to detect which classes are the most common.

The most important classes for the "TOPICS" category
The important question is:
how many classes do I have to consider to cover at least X% of the entire amount of documents?
x-axis = classes (sorted by volume) --- y-axis = % of documents covered
As you can see, considering for example only 20 classes instead of 135, we are able to cover around 80% of all the documents in the data set: this is the "Pareto principle"!
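A minimal sketch of this cumulative coverage check, where classCounts is a hypothetical list holding the number of documents per TOPICS class:

sorted = Reverse[Sort[classCounts]];              (* classes sorted by volume *)
coverage = N[Accumulate[sorted]/Total[sorted]];   (* cumulative fraction of documents *)
ListPlot[coverage, AxesLabel -> {"classes (by volume)", "fraction covered"}]
Count[coverage, c_ /; c < 0.8] + 1                (* classes needed to reach 80% coverage *)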

If you are interested to play with the Reuters data set and follow the step by step classifier building, I arranged a Wolfram Mathematica notebook to extract automatically training/test set, using custom categories. Contact me directly at cristian.mesiano@gmail.com