I mentioned that PCA works well when the data set has homogeneous density.
Are we in this scenario?
I analyzed our data set to check how the points are distributed; to do that I used the first three components suggested by PCA for the classes having TOPICS = "GOLD" and "GRAIN".
To make the density easier to see, I perturbed the 3D coordinates with weak Gaussian noise, just to avoid coincident points. Here are the results:
|In green, the "GRAIN" class; in violet, the "GOLD" class. Both perturbed with weak Gaussian noise.|
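Here is a minimal sketch of the procedure behind this plot: project the documents onto the first three principal components, then add weak Gaussian noise so that coincident points become visible. The matrix `X`, the helper names `pca_project` and `jitter`, and the noise level `sigma` are all assumptions made for illustration; they are not the exact code or parameters I used.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def jitter(points, sigma=0.01, seed=0):
    """Add weak Gaussian noise so that coincident points can be told apart."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma, size=points.shape)

# Hypothetical stand-in for a document-term matrix (rows = documents).
rng = np.random.default_rng(42)
X = rng.integers(0, 3, size=(200, 20)).astype(float)
labels = np.array(["GOLD"] * 100 + ["GRAIN"] * 100)

coords = jitter(pca_project(X, 3))
gold_xyz = coords[labels == "GOLD"]    # points to draw in violet
grain_xyz = coords[labels == "GRAIN"]  # points to draw in green
```

The two arrays `gold_xyz` and `grain_xyz` can then be passed to any 3D scatter plot. Keep `sigma` small compared to the spread of the data, otherwise the noise hides the real density.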
A countercheck of this quantitative analysis is provided by the second feature reduction algorithm we are testing: C5.0.
A C5.0 decision tree is built using GainRatio, a measure that incorporates entropy. Entropy, in turn, measures how disordered the data set is. Further details about C5.0 are available at Quinlan's website: http://www.rulequest.com.
An extremely interesting book about the applications of entropy is the famous "Cover - Thomas":
Elements of Information Theory, 2nd Edition
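To make entropy and GainRatio a bit more concrete, here is a small sketch using the textbook definitions from Quinlan's C4.5 (information gain divided by split information). I am assuming these definitions only for illustration: the exact internals of the commercial tool may differ, and the function names and the toy labels below are made up.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of a split on the feature, divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after splitting on the feature.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features that take many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: does a (hypothetical) word occur in each document?
labels = ["GOLD", "GOLD", "GRAIN", "GRAIN"]
informative = [1, 1, 0, 0]    # separates the two classes perfectly
uninformative = [1, 0, 1, 0]  # tells us nothing about the class

best = gain_ratio(informative, labels)
worst = gain_ratio(uninformative, labels)
```

In this toy case the informative feature gets the highest possible score and the uninformative one gets zero, which is the kind of ranking the tree uses when it picks the attribute for the next split.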
So C5.0 gives us a set of rules to classify documents belonging to three different classes (each rule is a path in the tree from the "start" node to a leaf), with an overall error rate below 15%.
Is it a good result? Apparently yes, but... if you look at the class "GRAIN" (B), the results are not so brilliant!
Don't worry, we will see how to improve this accuracy dramatically.
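Why can a good overall error rate hide a weak class? A quick sketch with a confusion matrix shows the effect. Careful: the numbers below are invented purely to illustrate the arithmetic; they are not the results of my experiment, and the class names A, B, C are just placeholders.

```python
import numpy as np

# Invented confusion matrix: rows = true class, columns = predicted class.
classes = ["A", "B", "C"]
confusion = np.array([
    [90,  5,  5],   # true A
    [12, 20,  8],   # true B (the small class)
    [ 3,  2, 95],   # true C
])

total = confusion.sum()
correct = np.trace(confusion)
overall_error = 1.0 - correct / total

# Recall per class: correctly classified documents / documents of that class.
per_class_recall = np.diag(confusion) / confusion.sum(axis=1)

for name, recall in zip(classes, per_class_recall):
    print(f"class {name}: recall = {recall:.2f}")
print(f"overall error rate = {overall_error:.3f}")
```

With these made-up counts the overall error stays under 15%, yet only half of the class B documents are recognized, because the small class weighs little in the overall figure. That is why it is worth looking at each class separately.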
At this point we have discussed many algorithms, techniques and data sets.
So where are we, and where are we going?
|Auto classification Road Map|
Guys, stay tuned.