There are many definitions of graph entropy, my favorite is very well described in the work of J.Körner: Coding of an information source having ambiguous alphabet and the entropy of graphs (1973).
Why Graph Entropy is so important?
Based on the main concept of entropy the following assumptions are true:
- The entropy of a graph should be a functional of the stability of the structure (so that it depicts in some way the distribution of the edges of the graph).
- Sub sets of vertexes quite isolated from the rest of the graph are characterized by a high stability (low entropy).
- It's quite easy use the entropy as a measure for graph clustering.
As you can imagine a smart definition of graph entropy can be helpful in many problems related to text mining.
Let's see an application of graph entropy to extract relevant words in a document.
The experiment as been done using the first section of the definition of "nuclear weapons".
Results
Graph Entropy:
In red have been highlighted the relevant words extracted through Graph Entropy |
Here you are the words extracted (first 25th) - In red I depicted words that in my opinion shouldn't be selected:
weapon, reaction, nuclear release, consider, acknowledge, explosive, weaponsa, detonate, test, ton, bomb, energy, tnt, first, possess, small, device, unite, hiroshima, chronologically, thermonuclear, force, nagasaki
Frequency based:
|
Words extracted through the frequency relevance:
nuclear, weapon, bomb, fission, test, possess, bombing, detonate, state, unite, tnt, ton, first, energy, release, acknowledge, weaponsa, status, japan, nagasaki, hiroshima, japanese, name, code, type
Closeness Centrality:
In red have been highlighted the relevant words extracted through Closeness Centrality method. |
Words extracted through the frequency relevance:
nuclear, weapon, detonate, possess, nagasaki, first, thermonuclear, small, force, estimate, debut, fabricate, succeed, radiation, tnt, acknowledge, consider, believe, hiroshima, know, nation, boy, explode, matte, date.
Considerations:
First 40 relevant words using Graph Entropy, Frequency method and Closeness Centrality.
Notice how the graph Entropy preserves better the core of the graph respect the other two methods.
Stay tuned
cristian
cristian