**Part I: building the vectors**

In the previous post we created a feature set for the GOLD class; now we show how to use that feature set to build the vectors that feed the SVM.

The feature set is the fingerprint that characterizes the document class.

To build the vectors, you have to follow the same parsing and filtering process used to extract the features.

As described in the previous post, starting from the original file you follow these steps:

- Split the file into words;
- Remove the words having length <= 2;
- Remove the stop words;
- Take the "root" of each word (e.g. "large" from "larger").
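The four steps above can be sketched as follows. This is a minimal sketch, not the author's actual code: the stop-word list is a toy one, and the suffix-stripping `stem` function is a naive stand-in for a real stemmer such as Porter's.

```python
import re

# Toy stop-word list; a real pipeline would use a complete one.
STOP_WORDS = {"the", "and", "for", "with", "this", "that"}

def stem(word):
    # Naive stand-in for a real stemmer: strips a trailing "r" from
    # comparatives and a trailing "s" from plurals
    # ("larger" -> "large", "stocks" -> "stock").
    if word.endswith("er") or word.endswith("s"):
        return word[:-1]
    return word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())        # 1. split into words
    words = [w for w in words if len(w) > 2]           # 2. drop length <= 2
    words = [w for w in words if w not in STOP_WORDS]  # 3. drop stop words
    return [stem(w) for w in words]                    # 4. take the root
```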

On the filtered file just obtained, you can compute all sorts of statistics to associate a score with each word.

For our purpose the simple word frequency can be enough.

The last step is to match each word of the filtered file against the feature set, combining the score just calculated with the score stored in the feature set for that word.
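A sketch of this step follows. How exactly the two scores are combined is a design choice the post leaves open; as an assumption, this sketch multiplies the document's relative frequency by the feature's score, and gives 0 to features absent from the document.

```python
from collections import Counter

def build_vector(filtered_words, feature_set):
    """feature_set: the sorted list of (word, score) pairs built in the
    previous post. Returns one input vector for the SVM."""
    freq = Counter(filtered_words)      # simple frequency as document score
    total = sum(freq.values()) or 1
    # Combine the document score (relative frequency) with the score
    # stored in the feature set; features absent from the document get 0.
    return [score * freq[word] / total for word, score in feature_set]
```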

This step is illustrated in the illustration below:

In an enterprise environment this step can become a bottleneck in terms of computational capabilities, because the time needed to classify grows with the vector dimension.

**Part II: reducing the dimension**

How can we reduce the dimension of the problem to speed up the entire classification process?

There are different methods; the first I will show is Principal Component Analysis (PCA).

As usual, I'm not going to go into technical and theoretical issues: there are thousands of websites, tutorials and papers where you can find every sort of detail on that.

Just to have an overview: PCA provides the most important linear components in a set of data. It is based on the concept of eigenvectors and eigenvalues:

**M · x = λ x**

where **λ** is the eigenvalue and **x** is the eigenvector.

PCA has two great features:

- it is completely unparameterized;
- it is extremely easy to implement;

*The variable matrix is the dataset, of arbitrary dimension*
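The eigen-equation above can be checked numerically on a small symmetric matrix (standing in for a covariance matrix):

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(M)  # eigh: for symmetric matrices
lam = eigvals[-1]                     # largest eigenvalue (here 3.0)
x = eigvecs[:, -1]                    # its eigenvector

# By definition, M @ x equals lam * x.
```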

Basically, you standardize the data (to have mean = 0), compute the covariance matrix, and extract its eigenvalues and eigenvectors.
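The procedure just described can be sketched with NumPy (a sketch, not the author's implementation):

```python
import numpy as np

def pca_components(data):
    """data: array of shape (n_documents, n_features).
    Returns the eigenvalues in descending order and the matching
    eigenvectors as columns."""
    centered = data - data.mean(axis=0)     # standardize to mean = 0
    cov = np.cov(centered, rowvar=False)    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]
```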

The eigenvectors with the highest eigenvalues mark the most relevant components: those components are the features that matter most for classifying the documents with a minimum loss of accuracy!

Just to understand the role played by eigenvectors, consider the following multivariate Gaussian data set:

*Gaussian-distributed data set*

The principal components are well represented by the red vectors:

*Principal components for the above data set*

What happens in our classes?

I computed the PCA for the GOLD class, but the question now is: "What is the minimum threshold to discriminate the important components from the negligible ones?"

...Sometimes a simple log plot helps the analyst:

*The blue arrow marks the discontinuity point for the eigenvalues*

As shown above, around component number 82 there is a discontinuity: after this point the magnitude of the eigenvalues is negligible. Here is our threshold!
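Once the threshold is fixed, reducing a set of document vectors is a single projection. A sketch, assuming the eigenvector columns are already sorted by descending eigenvalue; the cut k = 82 is the value read off the log plot:

```python
import numpy as np

def reduce_dimension(vectors, eigvecs, k=82):
    """Project the document vectors onto the first k principal
    components (k = 82 for the GOLD class, read off the log plot).
    eigvecs: eigenvector columns sorted by descending eigenvalue."""
    centered = vectors - vectors.mean(axis=0)
    return centered @ eigvecs[:, :k]        # shape: (n_documents, k)
```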

**Remark**

The vectors have been built using the sorted feature set list (see the previous post), and the sorting criterion was based on TF-DF (or TF-IDF, or ...): from the most important from the TF-DF perspective to the least important.

The graph above gives us a countercheck of the goodness of the criterion chosen to select the features, because there is a close correspondence between the weights assigned by TF-DF and the eigenvalues (notice the monotonicity of the graph)!

So for the GOLD class we will consider only the **first 82 components**.

As you can see, we used PCA in two different ways:

1) to countercheck our features extraction method

2) to reduce the complexity of the problem from 1000 features to 82.

In the next post, we will see another algorithm of feature reduction: C4.5.

...We are now introducing many different approaches to play with the data; in the next post I'll show you the graph of the experiments we are doing to keep track of every result and to choose the best strategy.

as usual: stay tuned

cristian
