Monday, September 24, 2012

Graph Entropy to extract relevant words

I would share with you some early results about a research I'm doing in the field of "graph entropy" applied to text mining problem.

There are many definitions of graph entropy, my favorite is very well described in the work of J.Körner: Coding of an information source having ambiguous alphabet and the entropy of graphs (1973).

Why Graph Entropy is so important?
Based on the main concept of entropy the following assumptions are true:

  • The entropy of a graph should be a functional of the stability of the structure (so that it depicts in some way the distribution of the edges of the graph).
  • Sub sets of vertexes quite isolated from the rest of the graph are characterized by a high stability (low entropy).
  • It's quite easy use the entropy as a measure for graph clustering.
As you can imagine a smart definition of graph entropy can be helpful in many problems related to text mining.

Let's see an application of graph entropy to extract relevant words in a document.

The experiment as been done using the first section of the definition of "nuclear weapons".

Results
Graph Entropy:
In red have been highlighted the relevant words extracted through Graph Entropy
Here you are the words extracted (first 25th) - In red I depicted words that in my opinion shouldn't be selected:
weapon, reaction, nuclear release, consider, acknowledge, explosive, weaponsa, detonate, test, ton, bomb, energy, tnt, first, possess, small, device, unite, hiroshima, chronologically, thermonuclear, force, nagasaki

Frequency based:
In red have been highlighted the relevant words extracted through Frequency relevance.
Words extracted through the frequency relevance: 
nuclear, weapon, bomb, fission, test, possess, bombing, detonate, state, unite, tnt, ton, first, energy, release, acknowledge, weaponsa, status, japan, nagasaki, hiroshima, japanese, name, code, type

Closeness Centrality:
In red have been highlighted the relevant words extracted through Closeness Centrality method.

Words extracted through the frequency relevance: 
nuclear, weapon, detonate, possess, nagasaki, first, thermonuclear, small, force, estimatedebut, fabricate, succeed, radiation, tnt, acknowledge, consider, believe, hiroshima, know, nation, boy, explode, matte, date.

Considerations:

  • The method based on graph entropy seems provide the more accurate results (5 errors respect 9 and 11 of the other methods).
  • The graph entropy depicts better the core of the graph containing the relevant words.
  • I tried to expand the number of relevant features and the accuracy of the other two methods tends to worsen quickly:

  • First 40 relevant words using Graph Entropy, Frequency method and Closeness Centrality.
    Notice how the graph Entropy preserves better the core of the graph respect the other two methods.

    Stay tuned
    cristian