Text & Data Mining by practical means: June 2011

Thursday, June 30, 2011

Full text search engines competition: ... the winner is... the stupid "occurrences counter"!!

I was reading some technical features on full text search engines.
To be honest I'm not an expert on this kind of technology, but I saw that it is very easy impress people with astonishing announcements like:
"forget the occurrences, it is time of concepts; our engine is able to really understand the meaning of the key search!".
Wow! At the glance seems that these engines are able to order autonomously a pizza with out any human interaction.
But. ...there is alway a "but":
1) is it always the best solution a "semantic seeker"?
2) is it so complex build a customer solution starting from our old search engine?
Well. For the first point my opinion is that for each domain exists a specific solution. For example if the set of documents indexed in your docbase is mainly composed by invoices, or table documents (like spreadsheets) or advice for new orders, I don't see many advantages using semantic engine.
May be for documents containing many phrases and free text, this kind of technology could bring some added value. ...But I would like to see some honest benchmark on that.
Even for the classification problem is not yet proved that a semantic classifier works better a unstructured classifier (I'll arrange a post on that with some real test!).
For the second point.... the only way to check it is.... to try!
So let start with a stupid key search like "turbojet airplane"
Our old full text search engine (FTSE) will retrieve docs containing both "turbojet" and "airplane".
we could enrich with our specific (business) requirements!
For Example, using many different dictionaries available in internet I arranged a recursive (2 loops) lookup on the first set of "synonyms" (extend in this trivial example with broader terms and morphological derivatives terms). After that I built the graph of the connections and to restrict the set of pertinent terms i clustered the graph using a stupid k-means. The final list of words is selected choosing the cluster containing the original word! ...No more chatting:
for the word "turbojet" the set of words to enrich our old FTSE is:

The cluster of words selected is painted in orange

and for the word "airplane":

Even in this case i painted the selected cluster of words in orange.

The method is very general and tunable because:

1) you can select or deselect you preferred dictionaries (depending on your data domain),

2) you can enrich/impoverish the set of words tuning the number of loops for the recursive search.

3) you can choose different clusters

Of course the latest FTSE are faster and the sophisticated, but we can extend the life of our old engine with few maintenance.

I love simplicity :) ...occam's razor.

cristian

Cristian welcomes you to our blog!

Hi All,
... Sorry, but it is my first blog... and I forgot to introduce my self...
I'm Cristian, an italian guy employed in a big multinational company in Zürich. As you will see, my English is pretty terrible despite I'm working on Text Mining projects in the Content Management department of my company :).
I really hope you will correct my mistakes... (the trial error strategy is a good paradigm to train many systems: also human brains to learn quickly foreign languages!!).
My contributes to this blog will be mainly summarized in three points:
1) specialization of general purpose algorithms;
2) visual example of real scenarios;
3) analysis of systems (understood as mix of cooperative algorithms) for complex problems.
Of course we get the focus on text mining and data mining problem, but as you know many techniques are so general that you can easily reuse in different context.
Our interests in the last years have been oriented on problems related to content auto classification, extraction of relevant information from text, and selective search enforcing standard capabilities of "full text search engine".
If you are attracted to these themes and you want share with us your experiences and points of view, well: STAY TUNED!

Wednesday, June 29, 2011

Linear classifier and non linear Classifier or rather: Input space and Feature Space

Sometimes during some meeting to explain a general classification problem I notice that many people don't understand the essential concept behind the classification.

In all fairness I discovered that many people confuse very often clustering with classification, but this is another story, and I promise to treat that in other post :).

What is the core concept to understand deeply the classification problem? ...the complexity of the problem itself!

Let me be more clear, and observe the following scenario:

Imagine that you are interested to divide the blue points from the red points, and more specifically you are interested to a black box taking in input a point, it is able to assign automatically the class (red or blue) to the point.

Well, Imagine that the only tool admitted to divide the plane to separate the points is the linear function.

Mhmm.... as you can see whatever line you try to draw is not able to split the points red from the blue points.

-> A linear function able to divide the plane splitting correctly the blue points from the red points is called linear classifier.

-> A non linear function able to that is called non linear classifier. (in this case the circle equation solve the problem.... but the real world is often more complex...)

Question: is it possible map the points of the original problem in another space more complex (that is having higher dimension) where is possible to divide the points using only a linear classifier?

The answer is YES. The trick is honor of famous mathematicians, but it is very very useful to make up a linear classifier (read easy to develop and easy to reuse in other context!!) and you can find the theory everywhere in internet.

Let me show the effect with the above problem:

I took every points in 2D and i mapped in 3D space using the function (called Kernel function):

[x,y]->[x^2, Sqrt[2]*x*y, y^2]

And I obtained the following:

The red points now are grouped together and with a linear plane (that is a linear function in three dimensional space) we are able to divide again our points!

Here you are the overall problem:

If you are interested to further details I can share the Mathematica notebook I coded to generate the above images.

You don't need the License, you can play also with the player available here:http://www.wolfram.com/products/player/.

Let me cleanup the 10 rows of code and I'll publish the source file.

See you soon!