If your feature extractor has been built without regard to the context, then whatever amazing classification algorithm you use, the accuracy will always be unsatisfactory.
In text categorization, you should consider at least the following points:
- documents containing a lot of text or, on the contrary, documents containing few words.
- documents with semantic content or, on the contrary, documents without explicit semantics (for example invoices or tables).
- context dictionary: the set of words used in the domain you are working on (for example acronyms that have a meaning only in a specific context).
- the overall environment where the classifier will work: are there different languages? Are there heterogeneous sources? Is the manual classification based only on the single document being processed?
- business constraints: for example the maximum time to process a document, or the business process.
Bag Of Words based on TF - (I)DF
The Term Frequency - Inverse Document Frequency (TF IDF) function is based on the work of Karen Spärck Jones in 1972. It basically combines a local property, the frequency of a specific word in the document, with a global property based on the (inverse) number of documents containing this word. Further technical details are available on the internet (let me know if you are interested in the original paper).
Even if this function is dated, we will show how to use it in a smart way to achieve good accuracy.
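The combination of a local and a global property described above can be sketched in a few lines of Python. This is only a minimal illustration of the textbook TF IDF weighting (relative term frequency times the logarithm of the inverse document frequency); the function name `tf_idf` and the toy documents are assumptions made for this example, and it is not the Mathematica code used for this post.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(docs)
    # Global property: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Local relative frequency times log-scaled inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "gold price rises as gold demand grows".split(),
    "oil price falls".split(),
    "gold mine opens".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: -kv[1])[:3])
```

Note that a word appearing in every document gets a score of zero, because the logarithm of 1 is 0.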
- When does TF DF work better than TF IDF?
When the documents of your domain contain few words (e.g. invoices, spreadsheets, bank statements), the global component of IDF doesn't work well. In this context the simpler TF DF works better!
In the classification sample we are working on, I am going to use K different bags of words (one bag of words for each class) using both methods... we will check which strategy is best!
To be honest, I slightly modified the function, which is built from the following quantities:
- W_i[D_j] = number of occurrences of the word W_i in the document D_j;
- DF[W_i] = number of documents containing the word W_i;
- |TrSet| = number of documents in the training set for the specific class.
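To make the three quantities concrete, here is a small Python sketch that computes them for the documents of one class. The helper names `class_statistics` and `tf_df_scores` are hypothetical, and the way the quantities are combined in `tf_df_scores` (total occurrences times DF divided by |TrSet|) is only an assumption for illustration: the exact modified function does not appear in this text, so do not read this as the formula actually used.

```python
from collections import Counter

def class_statistics(class_docs):
    """Compute W_i[D_j], DF[W_i] and |TrSet| for the documents of one class."""
    # W_i[D_j]: occurrences of each word in each document.
    occurrences = [Counter(doc) for doc in class_docs]
    # DF[W_i]: number of documents of the class containing the word.
    df = Counter(word for doc in class_docs for word in set(doc))
    # |TrSet|: number of documents for the class.
    tr_set_size = len(class_docs)
    return occurrences, df, tr_set_size

def tf_df_scores(class_docs):
    """Assumed TF DF weighting: total occurrences times DF / |TrSet|."""
    occurrences, df, tr_set_size = class_statistics(class_docs)
    total = Counter()
    for counts in occurrences:
        total.update(counts)
    # Words that are both frequent and widespread in the class score highest.
    return {word: total[word] * df[word] / tr_set_size for word in total}

# Toy training set for one class (hypothetical data).
gold_docs = [
    ["invoice", "gold", "gold"],
    ["gold", "bar"],
    ["invoice", "gold"],
]
occurrences, df, tr_set_size = class_statistics(gold_docs)
scores = tf_df_scores(gold_docs)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Unlike IDF, the DF factor rewards words that appear in many documents of the class, which is why it can suit short documents such as invoices.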
Figure: Sample of a directed graph for a file of the class "GOLD"
The closeness centrality of a vertex u is defined as the inverse of the sum of the distances from u to all other vertices.
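The definition above can be sketched with a breadth-first search on an unweighted directed graph. This is a minimal illustration, not the Mathematica implementation: the function name and the toy graph are assumptions, and so is the choice to sum only over vertices reachable from u (in a directed graph some vertices may be unreachable, and the definition does not say how to treat them).

```python
from collections import deque

def closeness_centrality(graph, u):
    """Inverse of the sum of shortest-path distances from u to reachable vertices."""
    distances = {u: 0}
    queue = deque([u])
    # Breadth-first search gives shortest distances in an unweighted graph.
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in distances:
                distances[w] = distances[v] + 1
                queue.append(w)
    total = sum(distances.values())
    # Assumption: a vertex that reaches no other vertex gets a score of 0.
    return 1.0 / total if total > 0 else 0.0

# Toy directed graph of words (hypothetical data): vertex -> successors.
graph = {
    "gold": ["price", "market"],
    "price": ["market"],
    "market": ["gold"],
}
scores = {v: closeness_centrality(graph, v) for v in graph}
print(scores)  # "gold" scores 0.5, the other two vertices about 0.333
```

Ranking the vertices by this score and keeping the top ones is one way to turn the graph into a list of features.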
Assigning a color to the vertices of the above graph (warm colors for vertices with a high score), you obtain:
Figure: Closeness centrality assignment for the above graph
The first 20 features extracted using this method are:
How many features do we have to consider to obtain good accuracy?
How can we analyze the differences among the different feature sets seen before?
We will discuss that in the next post.
Contact me at firstname.lastname@example.org if you are interested in the Wolfram Mathematica source code.