Tuesday, March 18, 2014

Waiting Time Polynomials Document Classifier - Part III

Abstract
In the post is presented an innovative definition of polynomials associated to waiting time processes (analyzed in the former posts). Such polynomials are here  successfully used as document classifier. Comparative tests with SVM show significant accuracy improvements.
Boolean Classification tests based on test set of 8k randomly generated documents composed using an alphabet of three words.
Introduction
To encourage who found quite intricate the formula I presented a couple of posts ago, I'm going to present you an its practical application that might be a good incentive to analyze the formal aspects with more attention :)
What I show to you today is one of several application of such approach: a document classifier having higher accuracy than traditional methods as SVM (trained with gaussian kernel) and Neural Networks back propagated.

Characteristics of the classifier
  • It's a supervised learning algorithm
  • It's completely non parametric
  • It can be used natively to classify multiple classes datasets.

The Algorithm
Let's assume to have a training set composed by two classes of documents: Cl_1, Cl_2.

    Learning Phase: Estimation of Geometrically distributed Random Variables.
  1. Define an alphabet of three words {w1,w2,w3} using frequencies criteria or more sophisticated techniques.
  2. For each class of training set:
    • estimate parameters {p1, p2, p3} of the respective {X1(w1),X2(w2),X3(w3)} geometrically distributed random variables.
  3. Calculate the polynomials associated to {X1(w1),X2(w2),X3(w3)} using:



    Testing Phase: document classification
  1.  for each document Di of the test set:
    • Identify the number of occurrences of  {w1,w2,w3}: {O_w1,O_w2,O_w3}
    • Select the polynomial for which:
    •  {O_w1,O_w2,O_w3} =p1^O_w1 p2^O_w2 p3^O_w3.
    • Calculate the value of the polynomials P_Cl_1, P_Cl_2 using:
      1. {p1, p2, p3} estimated for Cl_1
      2. {p1, p2, p3} estimated for Cl_2
  2. Classify the document:
    1. If (P_Cl_1>P_Cl_2) Di belongs to Cl_1
    2. Else Di belongs to Cl_2
Examples
1. How to select the polynomial

Let's consider the polynomials calculated using the aforementioned formula and assume that in document Di the word w1 is repeated 2 times, the word w2 is repeated 2 times and w3 is repeated 1 time.
Then, In the step 1 of the testing phase we have to choose the polynomial boxed in the below list:
Polynomials generated for O_w1+ O_w2+O_w3=5
2. Same polynomial, different values of  p1, p2, p3.

- How many polynomials are generated for O_w1+ O_w2+O_w3 = 25?
The answer is straightforward: 276 that is, all the possible configurations of 3 addends for which the sum =25. In the Encyclopedia of Integer Sequences, there are exciting explanation on such series.

- How does the polynomial change for different values of p1, p2, p3?
It's quite impressive how two polynomials can be different despite the setting of {p1, p2, p3} is almost identical.
Look at this example for which:
  • On the left chart I plotted a polynomial with p1= 0.33, p2= 0.33, p3=0.34.
  • On the right chart I plotted a polynomial with p1= 0.4, p2= 0.3, p3=0.3.
  • In the first case the polynomial takes the maximum value for O_w1=8, O_w2= 8, O_w3 =9 ...not a big surprise!
  • In the second case the polynomial takes the maximum value for O_w1=15, O_w2= 9, O_w3 =1. In this case the result is more complicated to be explained!
Document classification comparative test

To test the accuracy of my method I performed a comparative test using a randomly generated document set.
  • The training set was composed by 100 documents (used to train a Gaussian Kernel SVM and to estimate params p_i for my method).
  • The test set was composed by:
    • Cl_1: 4000 documents randomly generated using a configuration of {p1, p2, p3}.
    • Cl_2: 4000 documents randomly generated using a different configuration of {p1, p2, p3}.
  • The accuracy has been tested using different configurations of {p1, p2, p3}, and considering different size of documents.
Results
First experiment: 
just the first 25 words have been considered  to estimate the params {p1, p2, p3}, to train the SVM and to test the the accuracy.
Accuracy results considering just the first 25 words.
Second experiment: 
Same test as above using the first 35 words of the documents.

Accuracy results considering just the first 35 words.
All the results showed there are referred to accuracy achieved on average using 10 different randomly generated test set and trying to configure SVM params to maximize the accuracy.
Considerations
As you can see the method based on definition of "Waiting Time Polynomials" that I'm proposing, performs significantly better than SVM.
More comparative tests will be shown in the publication I'm writing on such topic.

Further notes
Processes based on waiting time or geometrically distributed random variables are extremely important for risk assessment and risk evaluation.
I'll show you in another post some application of such polynomials in this field.
As usual, stay tuned
cristian

5 comments: