This post presents the second step in generalizing a probability density function (based on the geometric distribution) that describes a document through its underlying stochastic process.
Marginal PDF and CDF for a document described by three words.
The red curve shows the distribution of a document of size 35, the yellow curve size 34, and the blue curve size 33.
The bag of words approach, and almost all the related techniques for extracting features from a document, are based on manipulating the frequencies associated with the words of the document. Such methodologies tend to fail when documents are characterized by exactly the same bag of words with the same frequencies.
The proposed approach bypasses this problem because it takes into account the waiting time of the words in a document.
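To make the point concrete, here is a minimal Python sketch of two toy documents that share the same bag of words and the same frequencies, yet differ in their waiting times. The helper name waiting_times and the definition used here (the gap, in token positions, between consecutive occurrences of the same word) are assumptions made for illustration; the post does not spell out the exact definition used in the model.

```python
from collections import Counter


def waiting_times(tokens, word):
    """Gaps, in token positions, between consecutive occurrences of `word`.

    Illustrative definition only (an assumption, not taken from the post).
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Two toy documents: same words, same frequencies, different ordering.
doc_a = "a b a b a b".split()
doc_b = "a a a b b b".split()

# A bag of words representation cannot tell them apart.
print(Counter(doc_a) == Counter(doc_b))

# The waiting times of the word "a" do differ between the two documents.
print(waiting_times(doc_a, "a"))
print(waiting_times(doc_b, "a"))
```

In this toy case the frequency counts are identical, while the gaps between occurrences of "a" are not, which is the kind of information a waiting-time model can exploit.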
From 2 variables to 3 variables
In the previous post I presented the simplest case: a document described by two words. Generalizing the problem is not a painless process.
To explain the tricks used to tackle it, I'm going to walk through the step from 2 to 3 variables.
...To be continued.
PS: I'm back!