Text & Data Mining by practical means: Outlier analysis: Chebyschev criteria vs approach based on Mutual Information

Wednesday, May 23, 2012

Outlier analysis: Chebyschev criteria vs approach based on Mutual Information

As often happens, I usually do many thing in the same time, so during a break while I was working for a new post on applications of mutual information in data mining, I read the interesting paper suggested by Sandro Saitta on his blog (dataminingblog) related to the outlier detection.

...Usually such behavior is not proficient to obtain good results, but this time I think that the change of prospective has been positive!

Chebyshev Theorem
In many real scenarios (under certain conditions) the Chebyshev Theorem provides a powerful algorithm to detect outliers.
The method is really easy to implement and it is based on the distance of Zeta-score values from k standard deviation.
...Surfing on internet you can find several explanations and theoretical explanation of this pillar of the Descriptive Statistic, so I don't want increase the Universe Entropy explaining once again something already available and better explained everywhere :)

Approach based on Mutual Information
Before to explain my approach I have to say that I have not had time to check in literature if this method has been already implemented (please drop a comment if someone find out a reference! ... I don't want take improperly credits).
The aim of the method is to remove iteratively the sorted Z-Scores till the mutual information between the Z-Scores and the candidates outlier I(Z|outlier) increases.
At each step the candidate outlier is the Z-score having the highest absolute value.

Basically, respect the Chebyschev method, there is no pre-fixed threshold.

Experiments
I compared the two methods through canonical distribution, and at a glance it seems that results are quite good.

Test on Normal Distribution

As you can see in the above experiment the Mutual information criteria seems more performant in the outlier detection.

Test on Normal Distribution having higher variance

The following experiments have been done with Gamma Distribution and Negative Exponential

Results on Gamma seem comparable.

Experiment done using Negative Exponential distribution

...In the next days I'm going to test the procedure on data having multimodal distribution.
Stay Tuned
Cristian

3 comments:

George HartMay 25, 2012 at 10:09 AM
Interesting. Would adding/using a leverage plot with an abline cut-off be possible so you can drop more than one at once.

George
ReplyDelete
Replies
Cristian MesianoMay 25, 2012 at 12:15 PM
Hi George,
thanks for your contribute: you are right: all the points above threshold imposed by the straight line between two consecutive outliers found could be removed without further processing.

My iterative approach works removing one point at each step, but extending the conditioned normalized entropy to a bunch of outliers we should obtain the same results as in your proposal
I mean using the chain rule: I(Z|o1,o2,...ok) instead of the iteration of each single I(Z|o1),I(Z|o2),....
is it in your opinion correct?
thanks again for the input!
cheers
c.
ReplyDelete
Replies
leoJune 27, 2019 at 1:23 AM
Amazing content.
Data Mining Process
ReplyDelete
Replies

Add comment

Pages

Wednesday, May 23, 2012

Outlier analysis: Chebyschev criteria vs approach based on Mutual Information

3 comments: