Text & Data Mining by practical means: Feature Extraction

Tuesday, August 9, 2011

Feature Extraction - Different Approaches

When you build a classifier, and I'm not talking about only about the text categorization, you should have special care about this step.
If your features extractor has been made regardless the context, whatever amazing classification algorithm you gonna use ... the accuracy will be always unsatisfactory.
In the text categorization, you should consider at least the following points:

document containing a lot of text or in opposite documents containing few words.
document containing semantic or in opposite documents without explicit semantic (like for example invoices or table)
context dictionary: that is the set of words used in the domain your are working on (for example words like acronyms having meaning only in a specific context)
the overall environment where the classifier will work: are there different languages? are there heterogeneous sources? the manual classification is based only on the single document under processing?
business constraints: like max time to process a document or business process.

Bag Of Words based on TF - (I)DF

Term Frequency - Inverse document Frequency (TD IDF) function is based on a work of Spärck Jones and Karen in 1972. It basically combines local frequency property of a specific word in the document with a global property based on the number (inverse) of documents containing this word. Further technical details are available in internet (let me know if you are interested to the original paper published).

Even if this function is dated, we will show how to is it in smart way to achieve good accuracy.

When does the TF DF work better than TF IDF?

When the documents of your domain contain few words (e.g. invoices, spreadsheets, bank statements) the global component of IDF doesn't work well. In this context the easier TF DF works better!

In the classification sample we are working on, I gonna use K different Bag of Words (one bag of words for each class) using both methods... we will check what is the best strategy!

To be honest I slightly modified the function in this way

Modified TF (I)DF function

Where:

W_i[D_j] = Occurrences for the Word W_i in the document [D_j];
DF[W_i] = Number of Documents containing the Word W_i
|TrSet| = Number of Documents for the specific Class;

The Features set has been obtained filtering the documents by the StopWord list and stemming each word considering only the "Base Form".

Here you are The first 20 Features extracted for the Reuters data Set TOPICS = "GOLD" (see the previous post for further details) with the respective weight:

TF DF		TF IDF
say	,51.907757725859135	sigma	,71.5899150352952
gold	,49.86505578380171	police	,64.30524039891371
ounce	,22.146187073000476	model	,64.30524039891371
ton	,8.317113808183201	land	,64.30524039891371
mine	,7.7300297803876585	grant	,64.30524039891371
company	,7.112820339974604	china	,64.30524039891371
pct	,7.094497045719568	pope	,64.30524039891371
year	,5.215953054827821	rand	,60.616126176742966
ore	,3.48549104848714	placer	,59.847379800543116
production	,3.2688965028783508	foot	,59.26642022035086
grade	,3.059411347433722	say	,58.04561519245233
silver	,2.5460893593153946	warrant	,57.23666647951021
price	,2.205337527517457	shoji	,55.38951920217253
reserve	,2.123335553835266	client	,55.38951920217253
drill	,2.118861548885319	ridge	,55.38951920217253
expect	,2.0390821784153337	alligator	,55.38951920217253
corp	,2.0390821784153337	play	,55.38951920217253
foot	,2.0236289512856094	vehicle	,55.38951920217253
short	,2.0225200886354586	grand	,55.38951920217253
average	,1.9028247115440413	action	,55.38951920217253

CLOSENESS CENTRALITY BASED

This is an attempt (I simplified a lot my the original version to highlight the main idea) using the Graph Theory to build a sort of "latent semantic" feature set.

After the filtering and stemming operations, I partitioned the document in couple of contiguous words and I built a direct graph:

Sample of Direct Graph for a file of the Class "GOLD"

To assign a score to a feature I used a Closeness Centrality function: The closeness centrality of a vertex u is defined as the inverse of the sum of the distance from u to all other vertices.

Assigning a color to the vertexes of the above graph (warm color for vertex having high score) you obtain:

Closeness Centrality assignment for the above graph

The first 20 features extracted using this method are:

bureau	,0.02127659574468085
interior	,0.01639344262295082
import	,0.014705882352941176
check	,0.01135161135161135
revise	,0.01105263157894737
technique	,0.01020408163265306
period	,0.00985400705790957
change	,0.009708737864077669
sigma	,0.009616273693943597
significant	,0.009615384615384616
cabinet	,0.009345794392523364
tunnel	,0.009212305611899932
affect	,0.008928571428571428
viceroy	,0.008796296296296295
conservatively	,0.008284833960972269
milling	,0.008240857798379922
practice	,0.008149473265752336
amenable	,0.008
cove	,0.0078125
department	,0.00757930926707298