РЕВЕ 2010 - The Impact of Semi-Supervised Clustering on Text Classification.

The objective of the project is to investigate the use of clustering as a preliminary step to text classification, with the aim of improving the performance of the classifier.

In traditional supervised classification, an inductive learner is first trained on a training set and is then asked to classify a testing set about which it has no prior knowledge. Ideally, the classifier would have information about the distribution of the testing examples before it classifies them. This observation motivated the proposed research. Its goal is to address the problem of learning from training sets of different sizes by exploiting information derived from clustering the whole dataset (both training and testing examples) and embedding that information in the dataset in the form of meta-information.

An algorithm that combines semi-supervised clustering and (supervised) classification is proposed. Semi-supervised clustering aims to extract a kind of “structure” from a given sample of objects and to learn a concise representation of these data, guided by knowledge of the given category structure, i.e. the class labels of the training examples. Its goal is to identify class-uniform clusters that have high probability density with respect to a single class. The reasoning behind this is that if some structure exists in the objects, this information can be exploited to find a short description of the data.

In our approach, given a classification problem, both the training and the testing examples will be clustered before the classification step in order to extract the “structure” of the whole dataset, exploiting the dependence or association between index terms and documents as well as the prior knowledge of the class labels of the training set. The structure extracted from the dataset will be “translated” in such a way that each cluster is represented by a single representative. This concise representation of the whole dataset will be incorporated into the existing data representation: each object will be assigned its cluster id by means of appropriate artificial meta-features. It is expected that this prior knowledge about the nature of the testing set will help in building a more effective classifier for this set.
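
The approach described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes seeded k-means (one cluster per class, initialized from the labeled training documents) as a stand-in for the semi-supervised clustering step, one-hot cluster-id columns as the meta-features, and a nearest-centroid rule as a stand-in classifier. All function names and the toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[float(Counter(d.split())[w]) for w in vocab] for d in docs]

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))

def seeded_kmeans(vectors, train_labels, n_iter=10):
    """Cluster ALL vectors (training first, then testing).

    Assumption: one cluster per class, seeded with the mean of that
    class's labeled training vectors.
    """
    n_train = len(train_labels)
    classes = sorted(set(train_labels))
    centroids = [
        mean([v for v, y in zip(vectors[:n_train], train_labels) if y == c])
        for c in classes
    ]
    assign = []
    for _ in range(n_iter):
        assign = [nearest(v, centroids) for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean(members)
    return assign, len(centroids)

def add_meta_features(vectors, assign, n_clusters):
    """Append a one-hot encoding of the cluster id to every vector."""
    return [
        v + [1.0 if a == k else 0.0 for k in range(n_clusters)]
        for v, a in zip(vectors, assign)
    ]

def classify(train_x, train_y, test_x):
    """Stand-in classifier: assign the class of the nearest class centroid."""
    classes = sorted(set(train_y))
    centroids = [
        mean([x for x, y in zip(train_x, train_y) if y == c]) for c in classes
    ]
    return [classes[nearest(x, centroids)] for x in test_x]

# Hypothetical toy data: four labeled training documents, two unlabeled tests.
train_docs = ["goal match team", "team score goal",
              "stock market price", "market trade price"]
train_labels = ["sport", "sport", "finance", "finance"]
test_docs = ["team match score", "stock trade market"]

vectors = bag_of_words(train_docs + test_docs)           # whole dataset
assign, n_clusters = seeded_kmeans(vectors, train_labels)
augmented = add_meta_features(vectors, assign, n_clusters)

n_train = len(train_docs)
predictions = classify(augmented[:n_train], train_labels, augmented[n_train:])
print(predictions)
```

The point of the sketch is the data flow: the clustering step sees both training and testing documents, and the classifier then learns from the original term features plus the cluster-id meta-features. Any clustering method or classifier could be substituted for the stand-ins used here.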

To this end, an extended experimental study will be carried out. The proposed classification algorithm will be implemented, studied and applied to many real-world classification tasks, and it will be compared with other state-of-the-art classifiers. We also aim to discover the most effective combinations of clustering and classification under different views of the text classification problem. We will try to answer various questions that can be important to learning, such as how much training data can help, what the effect of independence or dependence among features is, and more.

The outcomes of the project are expected to have an impact on both research areas, clustering and classification.