Wednesday, February 29, 2012

Features Extraction: Co-occurrences and Graph clustering

In the last two post we have discussed about co - occurrences analysis to extract features  in order to classify documents and extract "meta concepts" from the corpus.
We have also noticed that this approach doesn't return better than the traditional bag of words.
I would now explore some derivation of this approach, taking advantage of the graph theory.

In the following step I'm gonna to show you the alternative strategy applied to a corpus for the category "GRAIN" (referred always to our official data set REUTERS 21578).

Co - occurrences 
The first step is to extract the co occurrences as we already explained in the former posts.
The test has been performed considering just 100 documents, below the graph showing the score assigned to all (>5000) co occurrences extracted by the proc.

Sorted score assigned to the co occurrences extracted from the GRAIN category (100 docs).
Graph Reduction
In this step we build the graph of the co-occurrences of two words to obtain a representation of the links among the other co occurrences.
Here you are the graph obtained from this step:
The Graph of the co-occurrences linkage for the GRAIN category
 As you can see the graph is really huge and complex, how could we reduce its complexity without big information loss?
The Kcore algorithm could help in doing that. It gives groups of vertices that are connected to at least k others.
To build it you can remove repeatedly vertices of out-degree less than k.
Below the k core components from the original graph (the k chosen here is five).
k core components extracted from the original graph 
This reduction reduced the number of nodes from 5163 to 104.

The entities extracted
Form the k-core graph we could extract the shortest path for each combination of two vertexes.
And we can compare the paths obtained with the document in the corpus to select the paths really existing in the documents (this trick of course should be applied only on the training set!).
The procedure returned 526 entities.
Here you are the longest entities extracted:

1 rice farmer reach acre
2 department say last summer
3 contract trade  agricultural
4 month pct increase
5 summer grain harvest
6 board trade agricultural
7 record tonne previous
8 rice farmer reach
9 cash crop include
10 high rate national
11 pay crop loan
12 cotton payment reach
13 major farm group
14 future year ago
15 future trade area
16 total tonne intervention
17 offer  exporter issue
18 import tonne agricultural
19 bushel price  current
20 bushel corn producer

The graph clustering
Another interesting things we can do with such entities  is to isolate in the kcore graph the nodes that describe the entities and apply a clustering algorithm over the graph connections.
I decided to apply in this case as a clustering distance the "community modularity".
The clustering returns unstructured features.
Here you are the clustered graph and the unstructured features clustered:
The clustered k-core graph

The unstructured features clustered
As you can see last step tends to group together words semantically related:
Consider for example the second cluster, it contains almost all words related to period of time topic as "winter, year, month, ago", the fourth represents political aspects, the fifth describes quite precisely economic aspects.
The third cluster is gathering aspects strictly related with the main topic "grain", while the first one seems group words not related with the topic.
Of course it's not 100% accurate, and is still open the question how to use this clusters.
In the next post I'm gonna extend this experiment to other categories, and I would like to check if the entities extracted can be used to improve the accuracy of our classifier.
...Stay tuned

Monday, February 20, 2012

Document Classification: latent semantic vs bag of words. Who is the best?

We have seen few posts ago an approach to extract meta "concepts" from text based on latent semantic paradigm.
In this post we apply this approach to classify documents, and we do a comparison between this approach and the canonical bag of words.
The comparison test will be done through the ensemble method already showed in the last post.

The idea is to build the features set considering constructs of co-occurrences instead of single words as usually done with bag of words.

  • The prominent advantage of this approach is well rendered when the corpus contains structured features as for instance "bank account, invoice number" and so on. 
  • The biggest disadvantage lays in the length of the feature set and consequently in the length of the vectors built to represent each document.

Let me clarify the last point: consider for instance a corpus of 100 words; in this case the dimension of each vectors will be less or equal to 100. Let's try now to consider the co-occurrences of these 100 words, and just to simplify the process, let's focus on the co-occurrences of two words (so we consider the closeness of two words), in this case the length of the vector is 100*100 =10000!!.

The image below represents the graph of the co-occurrences extracted considered the highest ranking (the ranking has been assigned in the way described in this post applied to the data set used to explain the ensemble method based on ADA boost M1).

The most important features extracted through the LSA method.

To build the vectors I used the first 15000  structured features extracted by the latent semantic method and I considered just features composed by 2 words (co-occurrences).
To mitigate the complexity to crunch vectors with dimension so high, I built the training and test set using sparse matrixes of boolean values.
The advantage of the sparse matrix working with SVM is related to the small memory consumption required to select the support vectors.
Someone could argue that inner product of sparse array requires additional effort in the kernel evaluation, but if the vectors contain only boolean values, this extra complexity is negligible.

To perform a comparison test, I used the same sets and the same strategies of learning.
In both scenarios, I trained the system through the ensemble techniques described in the last post (based on the boos method of ADA boost M1).

Before to discuss the results let me show the confusion matrixes.
Confusion Matrix for LSA based classifier.
(How to read it?...look at the former post)

Confusion Matrix for "Bag of Words" based classifier.
(How to read it?...look at the former post)
In blue it has been represented the results for LSA based approach.
In red it has been represented the results for Bag of Words approach.

As you can see, even if the results obtained with the LSA classifier are not so bad, the traditional approach based on bag of words (in this experiment extracted through TD-DF) returns better results.

The training set used generated around 8.000 features (single words) and the vectors have been built using 500 features (6.25% of the total). 
The size of the features having relevant ranking in the LSA approach is around 15000, and to ensure the same conditions of the test I trained the system choosing again 500 features (3.33% of the total).
Basically the useful information to classify the docs in LSA approach is spread over a bigger feature set.

As we have already seen in the past, doesn't exist a technique that perform better than other approaches with every data set. With this data set (a sub set of the Reuters 21578) the LSA requires more physical resources and returns worse results.

If we scratch the surface of the results, we can observe that it detects different aspects of the corpus to classify the docs: for instance it performs better with "GRAIN" category. So I wonder if we can take advantage of this aspect to ensemble this approach with the bag of words to improve further the accuracy (as suggested by a follower of this blog) !
...We will see the results of this conjecture the next post.
As usual: Stay Tuned!!