Text & Data Mining by practical means: Waiting Time Polynomials Document Classifier

Tuesday, March 18, 2014

Waiting Time Polynomials Document Classifier - Part III

Abstract
In the post is presented an innovative definition of polynomials associated to waiting time processes (analyzed in the former posts). Such polynomials are here successfully used as document classifier. Comparative tests with SVM show significant accuracy improvements.

Boolean Classification tests based on test set of 8k randomly generated documents composed using an alphabet of three words.

Introduction
To encourage who found quite intricate the formula I presented a couple of posts ago, I'm going to present you an its practical application that might be a good incentive to analyze the formal aspects with more attention :)
What I show to you today is one of several application of such approach: a document classifier having higher accuracy than traditional methods as SVM (trained with gaussian kernel) and Neural Networks back propagated.

Characteristics of the classifier

It's a supervised learning algorithm
It's completely non parametric
It can be used natively to classify multiple classes datasets.

The Algorithm
Let's assume to have a training set composed by two classes of documents: Cl_1, Cl_2.

Learning Phase: Estimation of Geometrically distributed Random Variables.

Define an alphabet of three words {w1,w2,w3} using frequencies criteria or more sophisticated techniques.
For each class of training set:

estimate parameters {p1, p2, p3} of the respective {X1(w1),X2(w2),X3(w3)} geometrically distributed random variables.

Calculate the polynomials associated to {X1(w1),X2(w2),X3(w3)} using:

Testing Phase: document classification

for each document Di of the test set:

Identify the number of occurrences of {w1,w2,w3}: {O_w1,O_w2,O_w3}
Select the polynomial for which:
{O_w1,O_w2,O_w3} =p1^O_w1 p2^O_w2 p3^O_w3.
Calculate the value of the polynomials P_Cl_1, P_Cl_2 using:

{p1, p2, p3} estimated for Cl_1
{p1, p2, p3} estimated for Cl_2

Classify the document:

If (P_Cl_1>P_Cl_2) Di belongs to Cl_1
Else Di belongs to Cl_2

Examples
1. How to select the polynomial

Let's consider the polynomials calculated using the aforementioned formula and assume that in document Di the word w1 is repeated 2 times, the word w2 is repeated 2 times and w3 is repeated 1 time.
Then, In the step 1 of the testing phase we have to choose the polynomial boxed in the below list:

Polynomials generated for O_w1+ O_w2+O_w3=5

2. Same polynomial, different values of p1, p2, p3.

- How many polynomials are generated for O_w1+ O_w2+O_w3 = 25?

The answer is straightforward: 276 that is, all the possible configurations of 3 addends for which the sum =25. In the Encyclopedia of Integer Sequences, there are exciting explanation on such series.

- How does the polynomial change for different values of p1, p2, p3?

It's quite impressive how two polynomials can be different despite the setting of {p1, p2, p3} is almost identical.

Look at this example for which:

On the left chart I plotted a polynomial with p1= 0.33, p2= 0.33, p3=0.34.
On the right chart I plotted a polynomial with p1= 0.4, p2= 0.3, p3=0.3.

In the first case the polynomial takes the maximum value for O_w1=8, O_w2= 8, O_w3 =9 ...not a big surprise!
In the second case the polynomial takes the maximum value for O_w1=15, O_w2= 9, O_w3 =1. In this case the result is more complicated to be explained!

Document classification comparative test

To test the accuracy of my method I performed a comparative test using a randomly generated document set.

The training set was composed by 100 documents (used to train a Gaussian Kernel SVM and to estimate params p_i for my method).
The test set was composed by:

Cl_1: 4000 documents randomly generated using a configuration of {p1, p2, p3}.
Cl_2: 4000 documents randomly generated using a different configuration of {p1, p2, p3}.

The accuracy has been tested using different configurations of {p1, p2, p3}, and considering different size of documents.

Results

First experiment:

just the first 25 words have been considered to estimate the params {p1, p2, p3}, to train the SVM and to test the the accuracy.

Accuracy results considering just the first 25 words.

Second experiment:

Same test as above using the first 35 words of the documents.

Accuracy results considering just the first 35 words.

All the results showed there are referred to accuracy achieved on average using 10 different randomly generated test set and trying to configure SVM params to maximize the accuracy.

Considerations

As you can see the method based on definition of "Waiting Time Polynomials" that I'm proposing, performs significantly better than SVM.

More comparative tests will be shown in the publication I'm writing on such topic.

Further notes

Processes based on waiting time or geometrically distributed random variables are extremely important for risk assessment and risk evaluation.

I'll show you in another post some application of such polynomials in this field.

As usual, stay tuned

cristian

5 comments:

Unique Computer CasesMarch 18, 2014 at 8:21 PM
Extremely informative. I wanted to thank you for this excellent post.
ReplyDelete
Replies
used machineryOctober 13, 2016 at 10:20 PM
Great blog! Interesting + knowledgeable post.
ReplyDelete
Replies
used machineryOctober 17, 2016 at 1:48 AM
Excellent information
ReplyDelete
Replies
leoJune 27, 2019 at 1:07 AM
Amazing content.
Data Mining Process
ReplyDelete
Replies
sriApril 11, 2022 at 5:03 AM

Thanks for sharing this.,
Scrum master certification online
csm certification online
ReplyDelete
Replies

Add comment

Pages

Tuesday, March 18, 2014

Waiting Time Polynomials Document Classifier - Part III

5 comments: