Comments on Text & Data Mining by practical means: Document Classification part 1 - Data Set Analysis

2011-08-24T10:12:59.309-07:00

Exactly, David.
I'm selecting documents in which the "TOPICS" tag contains exactly one class.
I intend to classify 4-5 classes of documents.
@blog community:
Dr. David Lewis is a LEGEND of text mining and text categorization.
He is the father of the Reuters data set as well!
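
The single-label selection described above (keeping only documents whose TOPICS tag lists exactly one class) could be sketched roughly as follows. This is a minimal illustration, not the code used in the tutorial: the sample text is made up, the function names `parse_topics` and `single_label` are hypothetical, and regular expressions stand in for a proper SGML parser. It only assumes the Reuters-21578 layout in which each document is a REUTERS element with a NEWID attribute and its classes appear as D elements inside TOPICS.

```python
import re
from collections import Counter

# Tiny made-up sample in the Reuters-21578 SGML layout (not real corpus text).
SAMPLE = """
<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT><BODY>Cocoa prices rose.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><BODY>Wheat exports fell.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" NEWID="3">
<TOPICS></TOPICS>
<TEXT><BODY>No topic here.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" NEWID="4">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT><BODY>Cocoa harvest update.</BODY></TEXT>
</REUTERS>
"""

DOC_RE = re.compile(r'<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)</REUTERS>', re.S)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.S)
LABEL_RE = re.compile(r"<D>(.*?)</D>")


def parse_topics(sgml):
    """Map each document's NEWID to the list of classes in its TOPICS tag."""
    docs = {}
    for newid, inner in DOC_RE.findall(sgml):
        match = TOPICS_RE.search(inner)
        docs[int(newid)] = LABEL_RE.findall(match.group(1)) if match else []
    return docs


def single_label(docs):
    """Keep only documents with exactly one class; map NEWID to that class."""
    return {newid: topics[0] for newid, topics in docs.items() if len(topics) == 1}


docs = parse_topics(SAMPLE)
single = single_label(docs)

# Pick the most frequent classes among single-label documents (up to 5 here).
top_classes = [cls for cls, _ in Counter(single.values()).most_common(5)]
print(single)
print(top_classes)
```

On the real corpus the same functions would be applied to the contents of each .sgm file; documents with zero or several classes are simply dropped by `single_label`.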

2011-08-24T05:15:19.036-07:00

Cristian - And thanks to you for making the effort to teach people.

Re point 2, so is the idea that you will be working only with the subset of documents that have exactly one category?

2011-08-22T11:31:39.277-07:00

First of all, many thanks for taking part in this discussion. I'm very honored by that!
About point 2: One of the aims of this blog tutorial is to provide an overview of the techniques and methods for classifying texts. So, in the beginning, I'm going to classify documents without overlap. Once the material is more mature, I'll show some techniques for analyzing the overlap.

Point 3: totally with you!
Thanks again.
I really hope you will give us other precious contributions!
...I cannot forget some of your papers that I read years ago during my university days!!

2011-08-22T09:58:13.222-07:00

It's fun to see that Reuters-21578 is still finding uses, at least pedagogically. A few comments:

1. For most research purposes RCV1-v2 is today more useful: more documents, more categories, better understanding of the accuracy of the manual categorization, etc.

2. It's important to mention that Reuters documents can belong to 0, 1, or several of the 135 TOPICS categories. Given that, I'm not clear on how you're defining coverage. Is it an upper bound on the proportion of documents with at least one category that get at least one of their categories assigned?

3. Building and maintaining classifiers for 135 categories is not necessarily, or even usually, a problem operationally. The limiting factor would not be computation, but availability of people who have been trained to label training and validation data for those 135 categories.
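
One possible reading of the coverage question in point 2 can be made concrete with a small sketch. It assumes coverage means: among documents that have at least one TOPICS category, the fraction that have at least one category inside a chosen subset of classes. The function name `coverage`, the toy labels, and the chosen subset are all hypothetical, and the post itself may define coverage differently.

```python
def coverage(doc_topics, selected):
    """Fraction of labeled documents with at least one class in `selected`.

    doc_topics: dict mapping a document id to its list of classes
                (possibly empty, possibly several).
    selected:   set of classes a classifier would be built for.
    Documents with no class at all are excluded from the denominator.
    """
    labeled = [topics for topics in doc_topics.values() if topics]
    if not labeled:
        return 0.0
    covered = sum(1 for topics in labeled if set(topics) & selected)
    return covered / len(labeled)


# Toy labels, not taken from the corpus.
toy = {
    1: ["cocoa"],
    2: ["grain", "wheat"],
    3: [],
    4: ["earn"],
    5: ["earn", "acq"],
}

print(coverage(toy, {"earn", "grain"}))
```

Under this reading the number is an upper bound on how many labeled documents a perfect classifier restricted to the selected classes could get at least one correct category for; restricting to single-label documents, as discussed above, changes the denominator.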