Word Clustering

Cluster analysis (or clustering) studies algorithms and methods for grouping or classifying "similar" objects into clusters. It is widely used in many areas of computer science, such as data mining, information retrieval, machine learning, and image segmentation. The notion of "similarity" varies with the purpose of the study, the domain-specific assumptions, and the prior knowledge of the problem. The objects (individuals, cases, or subjects) that comprise the raw data in cluster analysis are described either by a set of measurements or by the relationships between them.

Unsupervised clustering methods are classified into:

  • Partitional: the data set is directly decomposed into a set of disjoint clusters by iteratively optimizing a certain criterion function.
  • Hierarchical: a nested sequence of partitions is produced, resulting in a tree of clusters (dendrogram) that shows the inter-relationships between clusters.
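
The hierarchical case can be illustrated with a minimal sketch of agglomerative clustering. It is not the implementation used in this work: the function name `agglomerate`, the 1-D toy points, and the absolute-difference distance are assumptions made only for illustration. The `linkage` argument switches between the single-link and complete-link criteria listed under Our Method.

```python
from itertools import combinations

def agglomerate(points, linkage="single"):
    """Return the merge history (a flat view of the dendrogram) for 1-D points.

    linkage="single" uses the closest pair of members between two clusters,
    any other value uses the farthest pair (complete-link).
    """
    pick = min if linkage == "single" else max
    clusters = [[p] for p in points]      # start with one cluster per object
    merges = []                           # (left members, right members, distance)

    while len(clusters) > 1:
        def dist(pair):
            a, b = pair
            return pick(abs(x - y) for x in clusters[a] for y in clusters[b])

        # Merge the two clusters that are closest under the chosen linkage.
        a, b = min(combinations(range(len(clusters)), 2), key=dist)
        d = dist((a, b))
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]

    return merges

single = agglomerate([1.0, 2.0, 6.0, 7.0, 20.0], linkage="single")
complete = agglomerate([1.0, 2.0, 6.0, 7.0, 20.0], linkage="complete")
```

Each entry of the returned list is one level of the nested sequence of partitions; cutting the list after k merges yields a partition into n - k clusters.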

Word clustering aims to produce groups of words with similar characteristics, such as meaning or syntactic and grammatical properties.


Our Method

We have implemented several hierarchical clustering algorithms and applied them to word clustering:

  • Single-link
  • Complete-link
  • Minimization of the loss of mutual information, where mutual information measures the information gained about a particular word when prior information is known. The prior information is captured by appropriate statistical language models:
    • Bigrams (or trigrams): the current word is predicted from the one (or two) immediately preceding words.
    • Long-distance bigrams: bigrams collected at different distances and interpolated either over the component probabilities of the models or over the full models. They capture long-distance dependencies with a small number of free parameters.
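
The mutual-information criterion can be made concrete with a small sketch. It is a simplified illustration under stated assumptions, not the algorithm from the publications below: the helper names (`distanced_bigrams`, `average_mutual_information`, `merge_loss`) and the toy sentence are hypothetical, probabilities are plain relative frequencies, and no interpolation across distances is performed. A distance-d bigram is taken here to be the pair formed by a word and the word d positions before it, so d = 1 gives ordinary bigrams.

```python
import math
from collections import Counter

def distanced_bigrams(tokens, d=1):
    """Count pairs (w[i-d], w[i]); d=1 gives ordinary bigrams."""
    return Counter(zip(tokens, tokens[d:]))

def average_mutual_information(pair_counts, word_to_class):
    """Average mutual information (in bits) between the classes of paired words."""
    joint = Counter()
    for (w1, w2), n in pair_counts.items():
        joint[(word_to_class[w1], word_to_class[w2])] += n
    total = sum(joint.values())
    left, right = Counter(), Counter()    # marginal counts of each class
    for (c1, c2), n in joint.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )

def merge_loss(pair_counts, word_to_class, a, b):
    """Mutual information lost when class b is merged into class a."""
    merged = {w: (a if c == b else c) for w, c in word_to_class.items()}
    return (average_mutual_information(pair_counts, word_to_class)
            - average_mutual_information(pair_counts, merged))

tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = distanced_bigrams(tokens, d=2)
classes = {w: w for w in set(tokens)}     # start with one class per word
loss = merge_loss(counts, classes, "cat", "dog")
```

A greedy hierarchical procedure would evaluate `merge_loss` for every pair of classes and merge the pair with the smallest loss, repeating until the desired number of classes remains.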

The use of long-distance bigram models in word clustering by minimizing the loss of mutual information constitutes the novelty of our approach, together with an improvement of the clustering algorithm that uses robust estimates of the mean and covariance matrix.

Representative experiments have been conducted on a subset of the Reuters corpus collection.

Cluster validity assessment has been performed using:

  • external indices: Rand, Jaccard
  • measurements based on mutual information: entropy, average mutual information, normalized average mutual information, and variation of information

while the performance of the language models employed has been tested by means of perplexity.
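
As a rough illustration of the measures above, the following sketch computes the Rand and Jaccard indices by pair counting and the perplexity of a sequence of word probabilities. The function names and the toy labels are assumptions for illustration only; the indices follow their standard definitions, and `jaccard_index` assumes that at least one pair of objects shares a cluster in one of the two partitions.

```python
import math
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Classify every pair of objects by agreement of the two partitions."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1          # together in both partitions
        elif same_true:
            sd += 1          # together in the reference only
        elif same_pred:
            ds += 1          # together in the clustering only
        else:
            dd += 1          # separated in both partitions
    return ss, sd, ds, dd

def rand_index(labels_true, labels_pred):
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(labels_true, labels_pred):
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    return ss / (ss + sd + ds)

def perplexity(probs):
    """Perplexity of a model that assigned probability probs[i] to the i-th word."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

reference = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]
r = rand_index(reference, clustering)
j = jaccard_index(reference, clustering)
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

Both indices equal 1 when the clustering matches the reference partition exactly, and a lower perplexity indicates a language model that predicts the test text better.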





Relevant Publications

N. Bassiou and C. Kotropoulos, "Interpolated Distanced Bigram Language Models for Robust Word Clustering", in Proc. of IEEE Int. Workshop on Nonlinear Signal and Image Processing (NSIP 2005), Sapporo, Japan, 18-20 May, 2005.

N. Bassiou, C. Kotropoulos and I. Pitas, "Hierarchical word clustering for relevance judgements in information retrieval", in Proc. of Workshop on Pattern Recognition in Information Systems (PRIS'01), pp. 139-148, Setubal, Portugal, 6-7 July, 2001.


Research Projects

MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)


© 2006