Monday, March 3, 2014

Statistical Distribution to describe a Document - My Contribute Part II

it's presented the second step to generalize a probabilistic density function (based on geometrical distribution) to describe a document thru the underlying stochastic process.
Marginal PDF and CDF for document described by three words.
In red has been depicted the distribution of a document having size = 35, in Yellow size=34, in Blue size 33.

The marginal distribution that describes the density of a document for a fixed size composed by three words is the following:

Example of alpha, beta, gamma parameters determination. 

The bag of words approach and almost all the related techniques to extract features from a document are based on the manipulation of frequencies associated to the words of the document. Such methodologies tend to fail when documents are characterized exactly by the same bag of words, and by the same frequency.

The proposed approach bypasses such problem since it takes in account the waiting time of the words in a document.

From 2 variables to 3 variables
In the former post I presented the easiest case: a document depicted by two words. The generalization of the problem it's not painless process.
To understand the tricks used to tackle the problem I'm going to explain the passage from 2 to 3 variables.
...To be continued.
PS: I'm back!


1 comment:

  1. Great to have you back!
    I learned a lot from your posts, keep it going!