Wednesday, May 23, 2012

Outlier analysis: Chebyschev criteria vs approach based on Mutual Information

As often happens, I usually do many thing in the same time, so during a break while I was working for a new post on applications of mutual information in data mining, I read the interesting paper suggested by Sandro Saitta on his blog (dataminingblog)  related to the outlier detection. 
...Usually such behavior is not proficient to obtain good results, but this time I think that the change of prospective has been positive!

Chebyshev Theorem
In many real scenarios (under certain conditions) the Chebyshev Theorem provides a powerful algorithm to detect outliers.
The method is really easy to implement and it is based on the distance of Zeta-score values from k standard deviation.
...Surfing on internet you can find several explanations and theoretical explanation of this pillar of the Descriptive Statistic, so I don't want increase the Universe Entropy explaining once again something already available and better explained everywhere :)

Approach based on Mutual Information
Before to explain my approach I have to say that I have not had time to check in literature if this method has been already implemented (please drop a comment if someone find out a reference! ... I don't want take improperly credits).
The aim of the method is to remove iteratively the sorted Z-Scores till the mutual information between the Z-Scores and the candidates outlier I(Z|outlier) increases.
At each step the candidate outlier is the Z-score having the highest absolute value.

Basically, respect the Chebyschev method, there is no pre-fixed threshold.

I compared the two methods through canonical distribution, and at a glance it seems that results are quite good.
Test on Normal Distribution
As you can see in the above experiment the Mutual information criteria seems more performant in the outlier detection.
Test on Normal Distribution having higher variance

The following experiments have been done with Gamma Distribution and Negative Exponential
Results on Gamma seem comparable.

Experiment done using Negative Exponential distribution
...In the next days I'm going to test the procedure on data having multimodal distribution.
Stay Tuned


  1. Interesting. Would adding/using a leverage plot with an abline cut-off be possible so you can drop more than one at once.


  2. Hi George,
    thanks for your contribute: you are right: all the points above threshold imposed by the straight line between two consecutive outliers found could be removed without further processing.

    My iterative approach works removing one point at each step, but extending the conditioned normalized entropy to a bunch of outliers we should obtain the same results as in your proposal
    I mean using the chain rule: I(Z|o1,o2,...ok) instead of the iteration of each single I(Z|o1),I(Z|o2),....
    is it in your opinion correct?
    thanks again for the input!