Tuesday, July 10, 2012

Data Mining: Tools and Certificates

As a member of many LinkedIn groups related to data mining and text mining, I read many threads about certificates that should help in job seeking or in strengthening a curriculum, and many other threads about miraculous tools able to solve whatever problem.
Is being certified really worth it?
In my experience, a certificate in a specific data mining tool can be a positive point on a curriculum, but it doesn't really improve your knowledge of the field.
Let me explain better (which is not easy with my bad English): the certification system is a market, and its target is to generate profit and to promote products.

Data mining tools
My question is: do you really think a tool can exist that embraces all aspects of data mining?

I guess the number of data mining problems is so high that maybe we could use Cantor diagonalization to prove they are uncountable :)

In my opinion, the common belief that clicking here and there in a piece of software will yield tangible benefits in mining your data is too naive.
The "data mining" label was created by the marketing industry just to summarize, in a buzzword, techniques of applied statistics and applied mathematics for the data stored on your hard disk.
I don't want to say that tools are useless, but it should be clear that tools are only a means to solve a problem, not the solution.
  • In the real world the problems are never standard, and very seldom can you take an algorithm as-is to solve them! Maybe I'm unlucky, but I have never solved a real problem through a standard method.
  • Tool X is able to load terabytes of data. So what? A good data miner should know that you don't have to consider the entire population to analyze a phenomenon: you should be able to sample your population in order to ensure the required margin of accuracy! This technique is simply called statistics!
  • If you really want to claim "I know this approach very well", you must be able to implement it yourself: only by implementing it yourself can you deeply understand in which contexts the algorithm works, under which conditions it performs better than others, and so on. Don't rely on a single paper that compares a few techniques: if you change just one of the conditions, the results are terribly different.
  • Without theory you cannot go deep. Consider a tool such as Mathematica or R: these tools give the user access to a large library of predefined algorithms and routines, they provide visualization functions to show results in a fancy way, and, last but not least, they provide a complete programming language to code whatever you want. I love them, but I couldn't do anything without the theory behind the problem. Mathematica can give me an algorithm to cluster a data set through k-means: but how can I be sure that it is the right algorithm for my problem?
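To make the k-means point concrete, here is a minimal sketch (in Python rather than Mathematica, and purely illustrative): the algorithm will always return exactly the k clusters you ask for, so only the theory behind your problem can tell you whether that k makes sense.

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm on 1-D points with user-supplied initial centers."""
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]        # clearly two groups

# With k = 2, the centers land on the true group means...
c2, _ = kmeans_1d(data, centers=[0.0, 10.0])
print(sorted(round(c, 1) for c in c2))       # -> [1.0, 9.0]

# ...but with k = 3, the algorithm happily splits one group in two:
c3, clusters3 = kmeans_1d(data, centers=[0.0, 8.0, 10.0])
print(sorted(len(cl) for cl in clusters3))   # -> [1, 2, 3]
```

Both runs terminate without errors: the tool never complains that the second k is wrong; that judgment has to come from you.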
Actually, I would rather attend a course to deepen some aspects of multivariate statistics, or a seminar on new methodologies to solve some problem, than pay plenty of money to learn every single detail of a tool that maybe won't be on the market in five years.

I know that companies often look for people certified in a famous tool just because they bought it and they need to reduce the time to "integrate" a new resource into a team. Fair enough! But I think it is ridiculous to impose certificates as a strict requirement!
I'm really curious to know your experiences and opinions.

Tuesday, July 3, 2012

Simulated Annealing: How to boost performance through Matrix Cost rescaling

One of the most widely used algorithms in machine learning is Simulated Annealing (SA).
The reasons for its celebrity lie in:
  • simplicity of implementation
  • broad spectrum of applicability
There are many implementations of it available for almost all languages, and many books and papers on the topic; personally, the book I would most like to suggest is Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, written by Aarts & Korst.

The problem
In this post I'm going to describe a very canonical combinatorial problem (NP-hard) often approached through annealing methods: the complete Asymmetric Travelling Salesman Problem (ATSP).
The ATSP, as you know, can be taken as a base to model several problems, for instance network routing problems and problems in chemistry and physics. The ATSP is so charming (at least for me) because the nature of the problem is very easy and the way to find the best solution is extremely trivial: you have to explore all the possible solutions!
The only problem is the number of solutions you have to compare:
  • with 10 towns you have to explore 10! ≈ 3.6*10^6 orderings.
  • with 50 towns you have to explore 50! ≈ 3*10^64 orderings.
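Just to make the combinatorial explosion tangible, here is a minimal brute-force ATSP solver (a Python sketch, not the Mathematica code used later in this post): it fixes a starting town and scores every permutation of the remaining ones.

```python
from itertools import permutations
from math import factorial, inf

def brute_force_atsp(cost):
    """cost[i][j] = cost of travelling i -> j (may differ from j -> i)."""
    n = len(cost)
    best_cost, best_tour = inf, None
    for perm in permutations(range(1, n)):      # town 0 is the fixed start
        tour = (0,) + perm
        c = sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if c < best_cost:
            best_cost, best_tour = c, tour
    return best_cost, best_tour

# A tiny asymmetric instance: the only cheap edges form the cycle 0-1-2-3.
toy_cost = [[0, 1, 9, 9],
            [9, 0, 1, 9],
            [9, 9, 0, 1],
            [1, 9, 9, 0]]
print(brute_force_atsp(toy_cost))    # -> (4, (0, 1, 2, 3))

# With the start fixed, 10 towns already mean 9! = 362880 tours to score.
print(factorial(9))
```

Every extra town multiplies the work, which is exactly why exhaustive search becomes hopeless beyond a dozen towns.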
In the past (I'm not aware whether they are still open) there were many competitions to find the best solution of the complete asymmetric TSP with more than 30,000 towns.

...Considering the time required to obtain the optimal solution, in many cases a sub-optimal solution obtained by exploring less than 1% of the solution space (as usually happens with SA) can be enough!

The experiments
I considered two instances of the problem, the first one with 10 towns and the second one with 50 towns.
Even if there are more efficient algorithms to face this kind of problem, I chose simulated annealing just to show how versatile it is, and to show how essential the rescaling operation is in order to maximize the performance of your algorithm.

Why rescaling? (I'll try to be as untechnical as possible)
SA owes its power to its ability to overcome local minima by accepting, at time "t", a solution worse than the one found at time "t-1". It assigns an acceptance probability, based on a sigmoid function, that decreases as the cost of the new solution rises above the cost of the current one.
So if the cost distance between the current solution and the new solution is very high, the acceptance probability will be very small.
This probability is modulated by the temperature schedule, which allows a higher acceptance rate (of bad solutions) when the temperature of the system is high, and reduces it as the temperature decreases during the exploration of the solution space.
The aim of the rescaling is to increase the distances between towns having high cost and to reduce the cost of the short distances between towns.
To obtain this, I apply a "quasi" n log n function to the cost matrix.
(A publication on the topic should be ready by the end of the year).
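Putting the pieces together, here is a rough Python sketch of the scheme described above (the original experiments were done in Mathematica, and the real rescaling function is unpublished, so the c·log(1+c) below is only a stand-in with a similar shape: it stretches large costs and compresses small ones):

```python
import math
import random

def accept(delta, beta):
    """Sigmoid (Boltzmann-machine style) acceptance rule: the probability of
    accepting a worse solution shrinks as delta grows or as beta cools."""
    x = delta / beta
    if x > 700:                      # avoid math.exp overflow; acceptance ~ 0
        return False
    return random.random() < 1.0 / (1.0 + math.exp(x))

def rescale(cost):
    """Stand-in 'quasi n log n' rescaling of the cost matrix.
    NOT the author's (unpublished) function, just an illustration."""
    return [[c * math.log(1.0 + c) for c in row] for row in cost]

def tour_cost(cost, tour):
    n = len(tour)
    return sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))

def sa_atsp(cost, beta=5.0, cycles=2000, cooling=0.999, seed=1):
    """Plain SA over tours: random two-town swaps, geometric cooling."""
    random.seed(seed)
    n = len(cost)
    tour = list(range(n))
    best = list(tour)
    for _ in range(cycles):
        i, j = random.sample(range(n), 2)
        cand = list(tour)
        cand[i], cand[j] = cand[j], cand[i]
        delta = tour_cost(cost, cand) - tour_cost(cost, tour)
        if delta < 0 or accept(delta, beta):
            tour = cand
            if tour_cost(cost, tour) < tour_cost(cost, best):
                best = list(tour)
        beta *= cooling              # temperature schedule
    return best, tour_cost(cost, best)

# Anneal on the rescaled matrix, then score the tour on the original costs:
cost = [[0, 9, 1, 9],
        [9, 0, 9, 1],
        [9, 1, 0, 9],
        [1, 9, 9, 0]]
best, _ = sa_atsp(rescale(cost), beta=5.0, cycles=2000)
print(tour_cost(cost, best))
```

The key design point is that the rescaled matrix is used only to drive the search; the quality of the final tour is always measured on the original costs.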

All experiments, as usual, have been done through a "home made" application written in Mathematica.
  • The cost matrix has been built using a uniform distribution with integer values (1 to 10 for the first experiment and 1 to 100 for the second one).
  • Just to have a reference point, I assigned the minimum value (1) to the cost matrix entries along the external perimeter of the "town system".
  • The ATSP with 10 towns has an optimal solution of cost = 9. The ATSP with 50 towns has an optimal solution of cost = 49.
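For readers who want to reproduce a similar benchmark, here is one possible way to build such a cost matrix in Python. This is my reading of the setup above, not the original Mathematica code: the "perimeter" is taken to be the edges i → i+1.

```python
import random

def make_cost_matrix(n, max_cost, seed=42):
    """n x n asymmetric cost matrix: uniform integer costs, with the minimum
    cost (1) planted along the perimeter edges i -> i+1 so that a good
    reference tour is known in advance (hypothetical reading of the setup)."""
    rng = random.Random(seed)
    cost = [[rng.randint(1, max_cost) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        cost[i][i] = 0                   # no cost for staying in a town
        cost[i][(i + 1) % n] = 1         # perimeter edge gets the minimum cost
    return cost

cost10 = make_cost_matrix(10, 10)    # first experiment: 10 towns, costs 1..10
cost50 = make_cost_matrix(50, 100)   # second experiment: 50 towns, costs 1..100
```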
After 2000 cycles, SA found a solution with cost = 13 (the graph in the II quadrant represents the acceptance probability).
The initial temperature beta has been set to 5.

After 1000 cycles, SA with the rescaled cost matrix found the optimal solution, with cost 9.
The initial temperature beta has been set to 5.
I repeated the same experiment, increasing the initial temperature to 15.
SA launched with beta = 15: after 5000 cycles, the sub-optimal solution found has cost = 14.

SA with the rescaled cost matrix found a better solution after 2000 cycles (same initialization as in the above experiment).
A larger test has been done to understand the statistical behavior:
The blue line represents the solution costs of the traditional SA.
The red line represents the solution costs of the SA with the rescaled cost matrix.

And to conclude, here are the results obtained with higher complexity (50 towns).

ATSP (50 towns): comparison between traditional SA and SA with the rescaled cost matrix.

Let's try to increase the number of cycles from 1000 to 3000.

The success rate of the new strategy seems to increase with the number of cycles.

Let's see what happens increasing once again the number of cycles:

The assumption seems to be confirmed: the new strategy performs better than the traditional SA even when the number of cycles is higher.
  1. Often the accuracy/performance of an algorithm depends on the way you feed it with data! Rescaling your data is almost always essential!
  2. The strategy requires further tests and validation.
...We will discuss the function used to rescale the cost matrix, and some other applications of SA to different problems, in the next posts.
cheers & stay tuned
Special thx to Tom V.: he is at the top of the list of contributors for the sponsorship program.