Project Rationale


MUSCLE is a pan-European Network of Excellence that aims to foster close collaboration between research groups in multimedia data mining on the one hand and machine learning on the other, in order to make breakthrough progress towards the following objectives:

  • Harnessing the full potential of machine learning and cross-modal interaction for the (semi-) automatic generation of metadata with high semantic content for multimedia documents;
  • Applying machine learning for the creation of expressive, context-aware, self-learning, and human-centered interfaces that will be able to effectively assist users in the exploration of complex and rich multimedia content; 
  • Improving interoperability and exchangeability of heterogeneous and distributed (meta)data by enabling data descriptions of high semantic content (e.g. ontologies, MPEG7 and XML schemata) and inference schemes that can reason about these at the appropriate levels;
  • Contributing, through dissemination, training and industrial liaison, to the distribution and uptake of the technology by relevant end-users such as industry, education, and the service sector. In particular, close interactions with other IPs and NOEs in this and related activity fields are planned;
  • Facilitating, by accomplishing the above, broad and democratic (i.e. obviating the need for special expertise) access to information and knowledge for all European citizens (e.g. e-Education, enriched cultural heritage).




  • Advanced Computer Vision
  • Aristotle University of Thessaloniki
  • Bilkent University
  • Commissariat à l'Energie Atomique, France
  • Consiglio Nazionale delle Ricerche
  • Centre National de la Recherche Scientifique
  • Centre for Mathematics and Computer Science
  • Ecole Nationale Supérieure de l'Electronique et de ses Applications
  • European Research Consortium for Informatics and Mathematics
  • Foundation for Research and Technology Hellas
  • France Telecom SA
  • Ecole Nationale Supérieure des Télécommunications
  • Institut für Bildverarbeitung und angewandte Informatik e.V.
  • National Technical University of Athens
  • INRIA-Ariana
  • INRIA-Imedia
  • INRIA-Parole
  • INRIA-Texmex
  • INRIA-Vista
  • Royal Institute of Technology
  • LTU Technologies
  • Computer and Automation Research Institute of the Hungarian Academy of Sciences
  • Austrian Research Centers, Seibersdorf Research GmbH
  • Tel Aviv University
  • Trinity College Dublin, Ireland
  • Israel Institute of Technology
  • Technical University of Crete, Greece
  • Graz University of Technology
  • Vienna University of Technology
  • Cambridge University
  • University College London
  • Albert-Ludwigs-Universitaet Freiburg
  • University of Surrey
  • Universitat Politecnica de Catalunya
  • Institute of Information Theory and Automation
  • University of Ulster
  • University of Amsterdam
  • Technical Research Centre of Finland



Our Research Objectives


The research performed by AUTH within the framework of the MUSCLE NOE includes the following objectives:

  • Recognition of human emotions in video sequences using facial expressions
  • Emotional classification of speech
  • Language modelling
  • Indexing and fingerprinting of videos using semantic information
  • Medical diagnosis from the analysis of voice characteristics



Contributions of AUTH

Emotion Recognition from Speech based on gender information


Emotional speech recognition aims to automatically classify speech units (e.g., utterances) into emotional states, such as anger, happiness, neutral, sadness and surprise. The major contribution of this work is to rate the discriminating capability of a set of features for emotional speech recognition when gender information is taken into consideration. A total of 87 features have been calculated over 500 utterances of the Danish Emotional Speech database. The class pdfs of the mean value of the pitch contour for the five emotions under study are plotted below. Note that the pdf curves are splines fitted to the discrete pdf of each class.

In order to study the classification ability of each feature, a rating method has been implemented. Each feature is evaluated by the ratio of the between-class variance $\sigma_b^2$ to the within-class variance $\sigma_w^2$. The between-class variance measures the distance between the class means, whereas the within-class variance measures the dispersion within each class. The best features should be characterized by a large $\sigma_b^2$ and a small $\sigma_w^2$. The 15 features with the highest ratio $\sigma_b^2/\sigma_w^2$ are shown below, where both $\sigma_b^2$ and $\sigma_w^2$ are depicted.
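The rating by the ratio of between-class to within-class variance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `variance_ratio`, the weighting of each class by its sample count, and the toy data are all assumptions.

```python
import numpy as np

def variance_ratio(X, y):
    """Rate each feature by between-class variance / within-class variance.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) class labels.
    Weighting each class by its sample count is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        # Spread of the class means around the overall mean.
        between += len(Xc) * (class_mean - overall_mean) ** 2
        # Dispersion of the samples around their own class mean.
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (between / len(X)) / (within / len(X))


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[0.0, 1.0], [0.2, 3.0], [5.0, 1.0], [5.2, 3.0]])
y_toy = np.array([0, 0, 1, 1])
ratios = variance_ratio(X_toy, y_toy)
ranking = np.argsort(ratios)[::-1]  # indices of the features, best first
```

Sorting the ratios in decreasing order gives a ranking from which the top-rated features can be read off.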

The Sequential Forward Selection (SFS) method has been used to find, for each gender, the 5-10 features that classify the samples best. The criterion used in SFS is the cross-validated correct classification rate of a Bayes classifier, where the class probability density functions (pdfs) are approximated via Parzen windows or modeled as Gaussians.
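A minimal sketch of sequential forward selection driven by the cross-validated correct classification rate of a Bayes classifier with Gaussian class pdfs is given below. Several details are assumptions made for illustration only: the diagonal covariance matrices, the 5-fold split, the function names and the synthetic data. The Parzen-window variant is not shown.

```python
import numpy as np

def gaussian_bayes_predict(X_train, y_train, X_test):
    """Bayes classifier with one diagonal-covariance Gaussian per class (assumed)."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-6  # small floor avoids division by zero
        log_prior = np.log(len(Xc) / len(X_train))
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
        scores.append(log_prior + log_lik)
    # Pick the class with the largest log posterior for every test sample.
    return classes[np.argmax(np.vstack(scores), axis=0)]

def cv_rate(X, y, n_folds=5, seed=0):
    """Cross-validated correct classification rate (fold count is an assumption)."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = gaussian_bayes_predict(X[train], y[train], X[test])
        correct += np.sum(pred == y[test])
    return correct / len(y)

def sfs(X, y, n_select):
    """Greedily add the feature that maximizes the criterion at each step."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        rates = [cv_rate(X[:, selected + [f]], y) for f in remaining]
        best = int(np.argmax(rates))
        history.append(rates[best])
        selected.append(remaining.pop(best))
    return selected, history


# Synthetic two-class data (hypothetical): only feature 2 carries class information.
rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 40)
X_demo = rng.normal(size=(80, 4))
X_demo[:, 2] += 4.0 * y_demo
selected, history = sfs(X_demo, y_demo, n_select=2)
```

Here `selected` lists the chosen feature indices in the order they were added, and `history` holds the criterion value reached after each addition.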

When a Bayes classifier with Gaussian pdfs is employed, a correct classification rate of 61.1% is obtained for male subjects and a corresponding rate of 57.1% for female ones. In the same experiment, a random classification would result in a correct classification rate of 20%, since five emotional states are considered. When gender information is not considered, a correct classification rate of 50.6% is obtained. The partial correct classification rate for each class is shown in the following figure.


[Figure: correct classification rate for each class, Bayes classifier with Gaussian pdfs]




The rates reported in Tables 3 and 4 can be further improved by analyzing the properties of the above-mentioned two-class problems. The features that separate two classes may differ from those that separate five classes. By designing proper decision fusion algorithms, several two-class classifiers may be combined so that the overall system outperforms the five-class classifiers.




Automatic Detection Of Vocal Fold Paralysis and Edema


In this paper we propose a combined scheme of linear prediction analysis for feature extraction along with linear projection methods for feature reduction, followed by known pattern recognition methods, for the purpose of discriminating between normal and pathological voice samples. Two different cases of speech under vocal fold pathology are examined: vocal fold paralysis and vocal fold edema. Three known classifiers are tested and compared in both cases, namely the Fisher linear discriminant, the K-nearest neighbor classifier, and the nearest mean classifier. The performance of each classifier is evaluated in terms of the probabilities of false alarm and detection, that is, the receiver operating characteristic. The datasets used are part of a database of disordered speech developed by the Massachusetts Eye and Ear Infirmary. The experimental results indicate that vocal fold paralysis and edema can easily be detected by any of the aforementioned classifiers.
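The probabilities of false alarm and detection can be estimated by sweeping a decision threshold over a scalar classifier output, which traces the receiver operating characteristic. The sketch below assumes that a higher score means "more likely pathological"; the function name `roc_points` and the toy scores are hypothetical.

```python
import numpy as np

def roc_points(scores, is_pathological):
    """Return (P_fa, P_d) for every decision threshold taken from the scores.

    A sample is declared pathological when its score is >= the threshold.
    P_fa is the fraction of normal samples declared pathological,
    P_d the fraction of pathological samples declared pathological.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_pathological, dtype=bool)
    thresholds = np.unique(scores)[::-1]  # from strictest to most permissive
    p_fa = np.array([(scores[~truth] >= t).mean() for t in thresholds])
    p_d = np.array([(scores[truth] >= t).mean() for t in thresholds])
    return p_fa, p_d


# Hypothetical classifier outputs for three pathological and three normal samples.
toy_scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.1])
toy_truth = np.array([1, 1, 1, 0, 0, 0])
p_fa, p_d = roc_points(toy_scores, toy_truth)
```

Plotting `p_d` against `p_fa` gives the operating characteristic; a curve that rises quickly towards a detection probability of 1 at low false alarm rates indicates a good detector.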

In the first experiment, the dataset contains recordings from 21 males aged 26 to 60 years who were medically diagnosed as normal and 21 males aged 20 to 75 years who were medically diagnosed with vocal fold paralysis. In the second experiment, 21 females aged 22 to 52 years who were medically diagnosed as normal and 21 females aged 18 to 57 years who were medically diagnosed with vocal fold edema served as subjects. The subjects might suffer from other diseases too, such as hyperfunction, ventricular compression, atrophy, etc. Two different kinds of recordings were made in each session: in the first recording the patients were asked to articulate the sustained vowel "Ah" (/a/) and in the second one to read the Rainbow Passage. Only the former is considered in the present work. Therefore, all procedures were applied to voiced speech frames far away from transition periods.

The feature vectors are extracted via short-term linear prediction of order 14. An LP model of order 14 is regarded as a good choice: it has been reported that using more than 14 linear prediction coefficients (LPCs) does not significantly improve the discrimination of laryngeal diseases. The dimensionality of the feature space is then reduced by principal component analysis.
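The two steps, linear prediction of order 14 followed by principal component analysis, can be sketched as below. The autocorrelation method, the Hamming window, the frame length of 400 samples and the synthetic test signal are assumptions chosen for illustration; the source does not state how the prediction coefficients were estimated.

```python
import numpy as np

def lpc(frame, order=14):
    """Linear prediction coefficients via the autocorrelation method (assumed).

    Returns a_1..a_order such that s[n] is predicted by sum_k a_k * s[n - k].
    """
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    full = np.correlate(frame, frame, mode="full")
    r = full[len(frame) - 1 : len(frame) + order]  # autocorrelation at lags 0..order
    # Solve the Toeplitz normal equations R a = r[1:], with R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:])

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in for voiced speech: a stable second-order autoregressive signal.
rng = np.random.default_rng(0)
noise = rng.normal(size=4000)
signal = np.zeros(4000)
for n in range(2, 4000):
    signal[n] = 1.3 * signal[n - 1] - 0.4 * signal[n - 2] + noise[n]

frames = signal.reshape(10, 400)                        # 10 short-term frames
features = np.array([lpc(f, order=14) for f in frames])  # one 14-D vector per frame
projected = pca_project(features, n_components=2)        # reduced to 2-D
```

Each row of `projected` is one frame in the 2-D feature space, which is the space in which the classifiers operate.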

Figure: The whole 2-D feature space for (a) the first experiment, concerning vocal fold paralysis, and (b) the second experiment, concerning vocal fold edema. (Each normal feature vector is represented by an 'o', while each pathological feature vector is represented by a '*'.)



The first classifier is based on the K-nearest neighbor (K-NN) method, applied as follows: for each feature vector of the test set, we pick the feature vectors of the training set within a circle around it, whose radius is increased until at least K training feature vectors are enclosed, the K nearest ones. The test sample is assigned to the class to which the majority of these training feature vectors belong. The second classifier relies on the class-dependent mean vectors computed from the training samples: it measures the distance of each test feature vector from the mean vector of each class and assigns the test sample to the class of the nearest mean vector.
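Both decision rules are short enough to sketch directly. The example below assumes Euclidean distance and takes exactly the K closest training vectors, which is a simplification when several vectors lie at the same distance; the function names and the toy 2-D data are hypothetical.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the k training vectors closest to x."""
    dist = np.linalg.norm(X_train - x, axis=1)
    # Taking the k smallest distances matches growing a circle around x
    # until k training vectors are enclosed (ties aside).
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def nearest_mean_classify(X_train, y_train, x):
    """Assign x to the class whose training mean vector is closest."""
    labels = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(means - x, axis=1))]


# Toy 2-D feature vectors: label 0 = normal, label 1 = pathological.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.8, 0.9])
knn_label = knn_classify(X_toy, y_toy, query, k=3)
mean_label = nearest_mean_classify(X_toy, y_toy, query)
```

An odd K avoids tied votes in a two-class problem such as normal versus pathological.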

Experiments have demonstrated that efficient detection of voice disorders can be achieved by Fisher's linear discriminant, K-NN, and the nearest mean classifier for vocal fold paralysis. Slightly worse results have been reported for vocal fold edema detection. The spectral characteristics extracted by linear prediction analysis of order 14, combined with principal component analysis retaining 2 components for feature reduction, have proved to be very efficient for the aforementioned classification tasks.




Related Group Publications


I. Kotsia and I. Pitas, "Real time facial expression recognition from video sequences using Support Vector Machines", in Proc. of Visual Communications and Image Processing (VCIP 2005), Beijing, China, 12-15 July, 2005


C. I. Cotsaces, N. Nikolaidis and I. Pitas, "The use of face indicator functions for video indexing and fingerprinting", in Proc. of International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia, 21-23 June, 2005


D. Ververidis and C. Kotropoulos, "Sequential Forward Feature Selection with Low Computational Cost", in Proc. of European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 4-8 September, 2005


D. Ververidis and C. Kotropoulos, "Automatic Speech Classification to five emotional states based on gender information", in Proc. of 12th European Signal Processing Conference (EUSIPCO '04), pp. 341-344, Vienna, Austria, September 2004


M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic detection of vocal fold paralysis and edema", in Proc. of 8th Int. Conf. Spoken Language Processing (INTERSPEECH 2004), Jeju, Korea, October 2004


N. Bassiou and C. Kotropoulos, "Interpolated Distanced Bigram Language Models for Robust Word Clustering", in Proc. of IEEE International Workshop on Nonlinear Signal and Image Processing (NSIP 2005), Sapporo, Japan, 18-20 May, 2005