Thursday, June 30, 2011

Full text search engines competition: ...and the winner is... the stupid "occurrence counter"!!

I was reading about some technical features of full text search engines.
To be honest I'm not an expert on this kind of technology, but I saw that it is very easy to impress people with astonishing announcements like:
"forget the occurrences, it is the time of concepts; our engine is able to really understand the meaning of the search key!".
Wow! At first glance it seems that these engines are able to order a pizza autonomously, without any human interaction.
But... there is always a "but":
1) is a "semantic seeker" always the best solution?
2) is it so complex to build a custom solution starting from our old search engine?
Well. For the first point, my opinion is that each domain has its own specific solution. For example, if the set of documents indexed in your docbase is mainly composed of invoices, tabular documents (like spreadsheets), or notices for new orders, I don't see many advantages in using a semantic engine.
Maybe for documents containing many phrases and free text this kind of technology could bring some added value. ...But I would like to see some honest benchmarks on that.
Even for the classification problem, it is not yet proved that a semantic classifier works better than an unstructured classifier (I'll arrange a post on that with some real tests!).
For the second point... the only way to check is... to try!
So let's start with a stupid search key like "turbojet airplane".
Our old full text search engine (FTSE) will retrieve docs containing both "turbojet" and "airplane".
We could enrich this with our specific (business) requirements!
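As a minimal sketch of that enrichment (in Python rather than the Mathematica of my notebook, and with made-up synonym sets), each original keyword becomes an OR-group while the AND between keywords is preserved:

```python
# Minimal sketch: expand each keyword into an OR-group of synonyms,
# keeping the AND between the original keywords. The synonym sets are
# illustrative, not taken from the real dictionaries.
def enrich_query(terms, synonyms):
    groups = []
    for term in terms:
        words = [term] + synonyms.get(term, [])
        groups.append("(" + " OR ".join(words) + ")")
    return " AND ".join(groups)

# Toy synonym sets (assumptions for the example):
synonyms = {
    "turbojet": ["turbofan", "jet"],
    "airplane": ["aircraft", "aeroplane"],
}

print(enrich_query(["turbojet", "airplane"], synonyms))
# → (turbojet OR turbofan OR jet) AND (airplane OR aircraft OR aeroplane)
```

The enriched query is still a plain boolean query, so any old FTSE can run it unchanged.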
For example, using many different dictionaries available on the internet, I arranged a recursive (2 loops) lookup of the first set of "synonyms" (extended, in this trivial example, with broader terms and morphological derivatives). After that I built the graph of the connections and, to restrict the set of pertinent terms, I clustered the graph using a stupid k-means. The final list of words is selected by choosing the cluster containing the original word! ...No more chatting:
For the word "turbojet", the set of words to enrich our old FTSE with is shown below.
The cluster of words selected is painted in orange.

and for the word "airplane":
Even in this case, I painted the selected cluster of words in orange.
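The whole pipeline can be sketched in a few lines of Python (the original notebook was in Mathematica). Everything below is illustrative: the toy thesaurus stands in for the real dictionaries, and a naive pruning of weak edges (by Jaccard neighbourhood overlap) followed by connected components stands in for the k-means clustering; the selected cluster is still the one containing the seed word:

```python
from collections import deque

# Toy thesaurus standing in for the real dictionaries (English, broader
# and narrower terms); every entry is illustrative only.
THESAURUS = {
    "turbojet": ["turbofan", "jet"],
    "turbofan": ["turbojet", "jet"],
    "jet":      ["turbojet", "turbofan", "aircraft"],
    "aircraft": ["jet", "plane"],
    "plane":    ["aircraft", "surface"],  # "plane" also has a geometric sense
    "surface":  ["plane"],
}

def synonym_graph(seed, loops=2):
    """Recursive (breadth-first) lookup up to `loops` levels; returns the
    undirected edge set of the synonym graph around `seed`."""
    seen, edges, frontier = {seed}, set(), deque([(seed, 0)])
    while frontier:
        word, depth = frontier.popleft()
        if depth == loops:
            continue
        for syn in THESAURUS.get(word, []):
            edges.add(frozenset((word, syn)))
            if syn not in seen:
                seen.add(syn)
                frontier.append((syn, depth + 1))
    return edges

def seed_cluster(seed, edges, threshold=0.5):
    """Very naive graph clustering (a stand-in for the k-means of the post):
    drop edges whose endpoints have low Jaccard overlap of their closed
    neighbourhoods, then return the component containing the seed word."""
    neigh = {}
    for e in edges:
        u, v = tuple(e)
        neigh.setdefault(u, {u}).add(v)
        neigh.setdefault(v, {v}).add(u)
    adj = {}
    for e in edges:
        u, v = tuple(e)
        if len(neigh[u] & neigh[v]) / len(neigh[u] | neigh[v]) >= threshold:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    cluster, frontier = {seed}, deque([seed])
    while frontier:
        w = frontier.popleft()
        for nb in adj.get(w, []):
            if nb not in cluster:
                cluster.add(nb)
                frontier.append(nb)
    return cluster

edges = synonym_graph("turbojet", loops=3)
print(sorted(seed_cluster("turbojet", edges)))
# → ['jet', 'turbofan', 'turbojet']  (the "orange" cluster; "plane" and
#    "surface" are cut away because the bridge edge is weak)
```

The number of loops and the pruning threshold play exactly the tuning roles discussed below.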
The method is very general and tunable because:
1) you can select or deselect your preferred dictionaries (depending on your data domain),
2) you can enrich/impoverish the set of words by tuning the number of loops of the recursive search,
3) you can choose different clusters.
Of course the latest FTSEs are faster and more sophisticated, but we can extend the life of our old engine with little maintenance.
I love simplicity :) ...Occam's razor.


  1. Sorry, Christian, I failed to get the point. Are you trying to say that a plain-vanilla text search can do synonyms? If so, it certainly can. However, this should be done carefully, with proper ranking adjustments. I bet that with a naive application synonyms will not improve ranking, but rather degrade it.

  2. Hi Itman, sorry if my post wasn't clear.

    What I have done is to enrich the native query using some trivial dictionaries and some notions of graph theory (the Mathematica code takes exactly 7 lines).
    When I decided to do the above experiment, the question was: what is the real benefit of the new FTSEs in terms of the "data quality" returned after a search?
    The results of my experiments say that you can obtain very good results (in terms of data quality) even with an old FTSE.
    Moreover, you can refine your query in agreement with your requirements.
    This kind of approach lies on top of the FTSE and is independent of it, so the ranking is not affected.
    The only operation is to enrich the original query with new words and logical conditions.

    Of course, when you enrich your query you are adding constraints to your search, so when these constraints are very strong you can consider the ranking function like a Heaviside function (all results returned have the max rank).

    BTW, my analysis doesn't concern the speed of returning data or other factors.

  3. The only operation is to enrich the original query with new words and logical conditions.

    It will most likely increase recall, but what about precision?

  4. The precision of the results depends on many factors, like the domain and the search scenario. In my case, for example, using the pure FTSE the set of docs retrieved had size = 0. After the "enrichment" it returned the correct documents.
    If you extend the search with many loops of recursive synonyms, I imagine that the number of false positives will increase.
    ...But to be honest, I have seen a lot of false positives also with the new FTSEs.
    However, I cannot say that my heuristic works better than the latest FTSE; I can only say that:
    1) under some conditions it works properly,
    2) some features of FTSEs are oversold,
    3) to select the "final terms" I used a stupid k-means clustering on the graph of synonyms: choosing another kind of clustering (e.g. fuzzy clustering), we could obtain better results.
    ...I love having total control in my hands: I hate black boxes!

  5. Here is the code of the experiment:

  6. I may look into it someday. I currently deal with a rather large web dataset, where you have plenty of everything. I saw examples when people improved both precision and recall using synonyms on this data set, but they were very careful. One group, e.g., used only Wikipedia synonyms.

  7. Hi Itman,
    First of all, thanks for your interest in my approach.
    The solution I proposed is only a conceptual prototype, not a complete solution.
    Regarding the dictionaries (as you can see in the notebook), I used the following:
    1) English Dictionary (it also contains proper names of towns, countries, and so on)
    2) "BroaderTerms" Dictionary: generalizations of a word (e.g. "shape" from "circle")
    3) "NarrowerTerms" Dictionary: specializations of a word (e.g. "circle" from "shape")
    BTW, I found this procedure useful to improve recall while minimizing the set of false positives.
    In all fairness, I don't know what can happen in very heterogeneous sets; however, the domains of a company are often pretty homogeneous, or at least well partitioned into homogeneous sets.
    Looking forward to receiving some feedback from your tests :) ...and maybe some suggestions to improve the strategy!

  8. Hi,
    I was playing with the TREC data set recently. Well, there is research showing that expansion in heterogeneous collections can improve the quality considerably. See, e.g.,
    However, I am not sure that a generic thesaurus would work well. I was once told by a researcher that using WordNet did not improve the quality in their system.
    Anyway, in a corporate or specialized search engine, query expansion should do a better job, IMHO.

  9. Hi, Cristian. Sorry for the maybe-stupid questions...

    1. What predictors did you use for the k-means segmentation?
    2. "extend in this trivial example with broader terms and morphological derivatives terms" - what did you use?


  10. Hi Dmitry, the clustering I used in this experiment is not the classical k-means.
    @1 I used the implementation of:
    A. Clauset, "Finding Local Community Structure in Networks," Physical Review E, 72, 026132, 2005.
    There are other kinds of graph clustering available that give good results (like voltage clustering, bi-components...).
    Have a look at the JUNG library.

    @2 I used the WordNet library.
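    For readers without JUNG at hand, a rough Python analogue (an assumption, not the code actually used) is the greedy modularity clustering in networkx, which implements the Clauset-Newman-Moore algorithm, a global relative of the local method cited above:

```python
# Rough stand-in for the JUNG-based clustering: networkx's greedy
# modularity communities (Clauset-Newman-Moore). The graph and its
# words are illustrative only.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    # two dense groups of "synonyms"...
    ("turbojet", "turbofan"), ("turbojet", "jet"), ("turbofan", "jet"),
    ("plane", "surface"), ("plane", "level"), ("surface", "level"),
    # ...joined by a single bridge edge (the two senses of "plane")
    ("jet", "plane"),
])

communities = greedy_modularity_communities(G)
# The community containing the seed word plays the role of the
# "orange" cluster in the figures above.
seed_cluster = next(c for c in communities if "turbojet" in c)
print(sorted(seed_cluster))
```

    Here networkx is a swap-in for JUNG; the bridge edge ends up between communities, so the geometric sense of "plane" is excluded from the seed's cluster.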