Cluster analysis (or clustering) studies algorithms and methods for grouping or classifying “similar” objects into clusters. It is widely used in many areas of computer science, such as data mining, information retrieval, machine learning and image segmentation. The notion of “similarity” varies depending on the purpose of the study, the domain-specific assumptions and the prior knowledge of the problem. The objects (individuals, cases, or subjects) that comprise the raw data in cluster analysis are described either by a set of measurements or by the relationships between them.
The unsupervised clustering methods are classified into:
Word clustering aims to produce groups of words with similar characteristics, such as meaning or syntactic and grammatical information.
Several hierarchical clustering algorithms have been implemented and applied in the field of word clustering:
Our contribution is twofold: long-distance bigram models are used in word clustering by minimizing the loss of mutual information, and the clustering algorithm is improved by using robust estimates of the mean and the covariance matrix.
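As a rough illustration of the mutual-information criterion, the sketch below counts word pairs that are d positions apart and measures how much average mutual information is lost when two word classes are merged. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, every pair is counted at a single fixed distance d >= 1, and the interpolation across distances and the robust mean and covariance estimates are left out.

```python
from collections import Counter
from math import log2


def distanced_bigram_counts(tokens, d):
    """Count pairs (w_i, w_{i+d}); d = 1 gives ordinary bigrams (assumes d >= 1)."""
    return Counter(zip(tokens, tokens[d:]))


def class_mutual_information(pair_counts, word_to_class):
    """Average mutual information between the classes of the two words in a pair."""
    joint = Counter()
    for (w1, w2), n in pair_counts.items():
        joint[(word_to_class[w1], word_to_class[w2])] += n

    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in joint.items():
        left[c1] += n
        right[c2] += n

    mi = 0.0
    for (c1, c2), n in joint.items():
        p = n / total
        # p(c1, c2) / (p(c1) * p(c2)) written in terms of raw counts
        mi += p * log2(n * total / (left[c1] * right[c2]))
    return mi


def merge_loss(pair_counts, word_to_class, a, b):
    """Mutual information lost if class b is merged into class a."""
    merged = {w: (a if c == b else c) for w, c in word_to_class.items()}
    before = class_mutual_information(pair_counts, word_to_class)
    after = class_mutual_information(pair_counts, merged)
    return before - after


if __name__ == "__main__":
    tokens = "a b a b a b".split()
    counts = distanced_bigram_counts(tokens, 2)
    classes = {"a": "A", "b": "B"}
    print(merge_loss(counts, classes, "A", "B"))
```

A greedy hierarchical procedure would evaluate `merge_loss` for every pair of current classes and merge the pair with the smallest loss, repeating until the desired number of clusters remains.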
Representative experiments have been conducted on a subset of the Reuters corpus collection.
Cluster validity has been assessed by using:
while the performance of the language models employed has been evaluated by means of perplexity.
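For reference, perplexity is 2 raised to the negative average log2-probability that a model assigns to held-out text, so lower values indicate a better model. The sketch below computes it for a plain bigram model. The add-alpha smoothing and the function names are assumptions made for illustration and are not taken from the experiments described here.

```python
from collections import Counter
from math import log2


def train_bigram(tokens, vocab, alpha=1.0):
    """Return prob(prev, word) for a bigram model with add-alpha smoothing (assumed)."""
    context_counts = Counter(tokens[:-1])          # how often each word is a left context
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    v = len(vocab)

    def prob(prev, word):
        return (bigram_counts[(prev, word)] + alpha) / (context_counts[prev] + alpha * v)

    return prob


def perplexity(prob, tokens):
    """2 ** (negative mean log2-probability) over consecutive word pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(log2(prob(prev, word)) for prev, word in pairs)
    return 2 ** (-log_sum / len(pairs))


if __name__ == "__main__":
    train = "a b a b".split()
    model = train_bigram(train, vocab={"a", "b"})
    print(perplexity(model, "a b a b a".split()))
```

A model that spreads its probability uniformly over k words has perplexity k, which gives a useful sanity check when comparing language models.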
N. Bassiou and C. Kotropoulos, "Interpolated Distanced Bigram Language Models for Robust Word Clustering", in Proc. of IEEE Int. Workshop on Nonlinear Signal and Image Processing (NSIP 2005), Sapporo, Japan, 18-20 May, 2005.
N. Bassiou, C. Kotropoulos and I. Pitas, "Hierarchical word clustering for relevance judgements in information retrieval", in Proc. of Workshop on Pattern Recognition in Information Systems (PRIS'01), pp. 139-148, Setubal, Portugal, 6-7 July, 2001.
MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)