Today I'm going to show how we can re-use the same concept on the document clustering.
What I want to highlight is that through such methodology it's possible to:
- extract from a document the relevant words (as discussed here);
- clustering of the words of a document (as discussed here);
- clustering set of documents;
- clustering a graph and assign a ranking score to each cluster by homogeneity criteria;
The Experiment
For this experiment I chose a subset of the standard dataset called "the 20 Newsgroup dataset".
The documents of this data set are email selected from different newsgroups.
I selected some documents from the categories: Hockey, Motorcycle, Atheist and Electronics.
You can download here the documents used for this experiment.
The procedure
- For each document build the respective graph (filtering by stopwords list, and stemming the words).
- For each graph calculate the entropy graph value of each vertex (word).
- For each graph extract the first k relevant vertex (sorting by entropy graph value)
- Perform hierarchical clustering (in this case I used once again an approach based on Simulated annealing).
Here you are the early results obtained through the clustering.
Each line depicts a cluster (the elements are the filename). The clusters have been sorted by homogeneity criteria assigned by the clustering algorithm. |
The results are promising but not perfect. The accuracy achieved is
- 75.0% for first cluster
- 62.5% for second cluster
- 50.0 % for the fifth cluster
- 37.5 % for the third and fourth cluster
The reasons of the bad aggregations are related to the class "atheist". Such class contains documents characterized by high variety: documents very short, text not strictly pertinent to the topic, topic quite general.
Moreover the clustering algorithm hasn't been optimized/customize for such problem.
Let's try to analyze the results under a different perspective:
Something more...
I mentioned before that such approach works over the concept of entropy graph.
Here you are the graph representation of the words related to each cluster.
For each cluster I selected the first relevant 150 words.
In Red has been highlighted the first cluster |
In dark Blue Has been Highlighted the second cluster |
In Yellow has been depicted the third cluster |
In Green has been depicted the fourth cluster. |
In Brown has been depicted the fifth cluster. |
Notes
Such representation of the clustering highlights that the linkage function used in this experiment depicts quite well the different areas of the graph.
- The Red cluster grouped together the right block of vertexes.
- The Brown cluster grouped the right side of the graph
- The Blue cluster focused on the core of the graph
- The yellow cluster grouped together the corona of the core of the graph
- The green cluster grouped together the corona of the peripheral of the graph.
The next steps
In the next post I'll try to refine the results, optimizing the clustering algorithm, and trying to use different kind of clustering algos.
Stay Tuned
cristian