TEXT CATEGORIZATION USING ONLY FRAGMENTS OF DOCUMENTS

Pilaszy, Istvan; Dobrowiecki, Tadeusz

ISSN 0046-5518

2007

Abstract

In this paper we present a series of experiments that examine how particular parts of documents contribute to the performance of a classifier. We evaluated text classifiers on two very different text corpora. We conclude that some parts of the text matter more than others for text classification performance, and that giving higher weights to the more important parts can increase the performance of the classifier. Which parts are more or less important depends on the nature of the documents in the corpus.

Some tasks remain to be done:
− More text corpora should be investigated.
− In Section 6.4 we optimized the number of features to keep independently of the section; it could instead be optimized for each section.
− Documents could be split into parts of 50 words, to examine what happens when the parts are of equal size not only within a document but also across documents.
− When splitting documents into k equal parts, the classifiers resulting from different k values may be combined.
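
The idea of splitting a document into k equal parts and weighting the more important parts more heavily can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example document, and the weights are hypothetical and chosen only for illustration. The paper's actual feature selection and classifiers are not reproduced here.

```python
from collections import Counter


def split_into_parts(tokens, k):
    """Split a token list into k contiguous parts of near-equal length."""
    n = len(tokens)
    # Integer boundaries spread any remainder across the parts.
    bounds = [(i * n) // k for i in range(k + 1)]
    return [tokens[bounds[i]:bounds[i + 1]] for i in range(k)]


def weighted_term_counts(tokens, weights):
    """Sum per-part term counts, scaling each part by its weight.

    The number of parts is taken from len(weights).
    """
    parts = split_into_parts(tokens, len(weights))
    totals = Counter()
    for part, weight in zip(parts, weights):
        for term, count in Counter(part).items():
            totals[term] += weight * count
    return totals


# Hypothetical document and weights: the first third counts double.
doc = "budget vote passed senate after long debate on budget".split()
features = weighted_term_counts(doc, weights=[2.0, 1.0, 1.0])
```

The resulting weighted counts could then serve as input features for any bag-of-words classifier. Setting a part's weight to zero corresponds to classifying from the remaining fragments only.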