Text & Data Mining by practical means: Statistical Distribution to describe a Document

Monday, March 3, 2014

Statistical Distribution to describe a Document - My Contribute Part II

Abstract
This post presents the second step in generalizing a probability density function (based on the geometric distribution) that describes a document through its underlying stochastic process.

Marginal PDF and CDF for a document described by three words.
The distribution of a document of size 35 is shown in red, size 34 in yellow, and size 33 in blue.

The marginal distribution that describes the density of a fixed-size document composed of three words is the following:

Example of determining the alpha, beta, and gamma parameters.

Introduction

The bag-of-words approach, and almost all the related techniques for extracting features from a document, are based on manipulating the frequencies associated with the words of the document. Such methodologies tend to fail when two documents share exactly the same bag of words with the same frequencies, because they then become indistinguishable.

The proposed approach avoids this problem because it takes into account the waiting time of the words in a document.
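To make the point concrete, here is a minimal Python sketch of two toy documents with identical bags of words that differ only in word order. It assumes that "waiting time" means the number of token positions between consecutive occurrences of the same word. That definition, the helper name `waiting_times`, and the toy documents are illustrative assumptions, not taken from the post.

```python
from collections import Counter, defaultdict


def waiting_times(tokens):
    """Map each word to the gaps (in token positions) between its
    consecutive occurrences. This definition of waiting time is an
    assumption made for illustration."""
    last_seen = {}
    gaps = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(position - last_seen[word])
        last_seen[word] = position
    return dict(gaps)


# Two hypothetical documents: same words, same frequencies, different order.
doc_a = "a b a b a b".split()
doc_b = "a a a b b b".split()

# The bag-of-words view cannot tell them apart.
print(Counter(doc_a) == Counter(doc_b))  # True

# The waiting times can.
print(waiting_times(doc_a))  # {'a': [2, 2], 'b': [2, 2]}
print(waiting_times(doc_b))  # {'a': [1, 1], 'b': [1, 1]}
```

The word counts are equal, yet the gaps between repeated words differ, and that is the kind of information a frequency-only representation discards.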

From 2 variables to 3 variables

In the previous post I presented the easiest case: a document described by two words. Generalizing the problem is not a painless process.

To show the tricks used to tackle the problem, I'm going to explain the passage from 2 to 3 variables.

...To be continued.

PS: I'm back!

3 comments:

Unknown, March 4, 2014 at 5:02 PM
Great to have you back!
I learned a lot from your posts, keep it going!
leo, June 27, 2019 at 1:09 AM
Amazing content.
Data Mining Process
