MUSCLE Web Page
MUSCLE is a European Network of Excellence (NoE) that aims to create and support a pan-European network fostering close collaboration between research groups in multimedia data mining on the one hand and machine learning on the other, in order to make breakthrough progress towards the following objectives:
Our Research Objectives
The research performed by AUTH within the framework of the MUSCLE NoE includes the following objectives:
Contributions of AUTH
Emotion Recognition from Speech based on gender information
Emotional speech recognition aims to automatically classify speech units (e.g., utterances) into emotional states, such as anger, happiness, neutral, sadness and surprise. The major contribution of this work is to rate the discriminating ability of a set of features for emotional speech recognition when gender information is taken into consideration. A total of 87 features has been calculated over 500 utterances of the Danish Emotional Speech database. The class pdfs of the mean value of the pitch contour for the five emotions under study are plotted below. Note that the pdf curves are splines fitted to the discrete pdf of each class.
In order to study the classification ability of each feature, a rating method has been implemented. Each feature is evaluated by the ratio between the between-class variance $\sigma_b^2$ and the within-class variance $\sigma_w^2$. The between-class variance measures the distance between the class means, whereas the within-class variance measures the dispersion within each class. The best features should therefore be characterized by a large $\sigma_b^2$ and a small $\sigma_w^2$. The 15 features with the highest ratio $\sigma_b^2/\sigma_w^2$ are shown below, where both $\sigma_b^2$ and $\sigma_w^2$ are depicted.
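A minimal sketch of this rating follows. It assumes that both variances weight each class by its empirical prior (class size over total number of samples), which the text does not specify; the function names `fisher_ratio` and `rank_features` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def fisher_ratio(values, labels):
    """Between-class over within-class variance of one feature.

    Assumption of this sketch: both variances weight each class by its
    empirical prior (class size / total number of samples).
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n = len(values)
    overall_mean = fmean(values)
    var_between = 0.0
    var_within = 0.0
    for members in groups.values():
        prior = len(members) / n
        class_mean = fmean(members)
        var_between += prior * (class_mean - overall_mean) ** 2
        var_within += prior * fmean([(v - class_mean) ** 2 for v in members])
    return float("inf") if var_within == 0 else var_between / var_within


def rank_features(X, labels, top=15):
    """(feature index, ratio) pairs for the `top` features, best first."""
    scores = [(j, fisher_ratio([row[j] for row in X], labels)) for j in range(len(X[0]))]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top]


# Toy data: feature 0 separates the two classes well, feature 1 barely does.
X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.2, 6.0]]
labels = ["anger", "anger", "sadness", "sadness"]
print(rank_features(X, labels))  # feature 0 first (ratio about 100), then feature 1 (0.25)
```

With 87 features per utterance, `rank_features(X, labels, top=15)` would return the 15 highest-rated features in the sense defined above.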
The Sequential Forward Selection (SFS) method has been used to discover the 5-10 features that best classify the samples for each gender. The criterion used in SFS is the cross-validated correct classification rate of a Bayes classifier in which the class probability distribution functions (pdfs) are either approximated via Parzen windows or modeled as Gaussians.
When a Bayes classifier with Gaussian pdfs is employed, a correct classification rate of 61.1% is obtained for male subjects and a corresponding rate of 57.1% for female subjects. In the same experiment, random classification would yield a correct classification rate of 20%. When gender information is not considered, a correct classification rate of 50.6% is obtained. The correct classification rate for each class is shown in the following figure.
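The selection loop can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation used in the experiments: the Gaussian pdfs use diagonal covariance matrices, leave-one-out is assumed as the cross-validation scheme, the Parzen-window variant is omitted, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean


def fit_gaussian_bayes(X, y, feats):
    """Class priors, means and variances on the chosen features.

    Assumption of this sketch: diagonal covariance (features treated as independent).
    """
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append([row[j] for j in feats])
    model = {}
    for label, rows in groups.items():
        cols = list(zip(*rows))
        means = [fmean(c) for c in cols]
        # a small floor keeps the variances positive for (near-)constant features
        variances = [max(fmean([(v - m) ** 2 for v in c]), 1e-6) for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict(model, x):
    """Class with the largest log posterior (up to a shared constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            score -= 0.5 * (math.log(2 * math.pi * s2) + (v - m) ** 2 / s2)
        return score
    return max(model, key=log_posterior)


def loo_accuracy(X, y, feats):
    """Leave-one-out correct classification rate using only the features in `feats`."""
    correct = 0
    for i in range(len(X)):
        model = fit_gaussian_bayes(X[:i] + X[i + 1:], y[:i] + y[i + 1:], feats)
        if predict(model, [X[i][j] for j in feats]) == y[i]:
            correct += 1
    return correct / len(X)


def sfs(X, y, n_select):
    """Greedily add the feature that most improves the cross-validated rate."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda j: loo_accuracy(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

`X` and `y` are plain Python lists (one row and one emotion label per utterance); running `sfs` separately on the male and on the female utterances mirrors the gender-dependent selection described above.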
Figure: Correct classification rate for each class.
The rates reported in Tables 3 and 4 can be further improved by analyzing the properties of the above-mentioned two-class problems. The features that separate two classes may differ from those that separate all five classes. By designing proper decision fusion algorithms, several two-class classifiers could be combined, and the overall system could outperform the five-class classifiers.
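One simple decision fusion rule, given here only as a hypothetical illustration since the text does not fix a fusion algorithm, is majority voting over all one-versus-one (two-class) classifiers. The toy pairwise classifiers below, which pick the nearer of two made-up class centers on a single feature, stand in for trained two-class classifiers.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_vote(x, classes, pairwise):
    """Fuse two-class decisions by majority voting.

    pairwise[(a, b)] is any callable returning either a or b for input x.
    Ties are broken by the order of `classes`, an arbitrary choice of this sketch.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    best = max(votes.values())
    return next(c for c in classes if votes[c] == best)


# Hypothetical two-class classifiers on one feature: pick the nearer made-up center.
centers = {"anger": 3.0, "neutral": 1.0, "sadness": 0.0}
classes = list(centers)
pairwise = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - centers[a]) <= abs(x - centers[b]) else b)
    for a, b in combinations(classes, 2)
}
print(one_vs_one_vote(2.8, classes, pairwise))  # anger
print(one_vs_one_vote(0.1, classes, pairwise))  # sadness
```

Because each pairwise classifier may use its own feature subset, this structure can exploit the observation that the features separating two classes differ from those separating all five.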
Automatic Detection Of Vocal Fold Paralysis and Edema
In this paper we propose a combined scheme of linear prediction analysis for feature extraction and linear projection methods for feature reduction, followed by known pattern recognition methods, for the purpose of discriminating between normal and pathological voice samples. Two different cases of speech under vocal fold pathology are examined: vocal fold paralysis and vocal fold edema. Three known classifiers are tested and compared in both cases, namely the Fisher linear discriminant, the K-nearest neighbor classifier, and the nearest mean classifier. The performance of each classifier is evaluated in terms of the probabilities of false alarm and detection, or the receiver operating characteristic (ROC). The datasets used are part of a database of disordered speech developed by the Massachusetts Eye and Ear Infirmary. The experimental results indicate that vocal fold paralysis and edema can easily be detected by any of the aforementioned classifiers.
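For two classes, the Fisher linear discriminant projects each feature vector onto $w = S_w^{-1}(m_1 - m_0)$, where $m_0$ and $m_1$ are the class means and $S_w$ is the pooled within-class scatter matrix; sweeping a threshold over the projected scores traces the ROC as (probability of false alarm, probability of detection) pairs. The sketch below assumes 2-D feature vectors, as obtained after principal component analysis of order 2 in this work, and a nonsingular $S_w$; the function names are hypothetical.

```python
from statistics import fmean


def mean_and_scatter(points):
    """Mean (mx, my) and scatter entries (sxx, sxy, syy) of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (mx, my), (sxx, sxy, syy)


def fisher_direction(normal, pathological):
    """w = S_w^{-1} (m1 - m0) for 2-D features; S_w is assumed nonsingular."""
    (m0, s0), (m1, s1) = mean_and_scatter(normal), mean_and_scatter(pathological)
    a, b, c = (s0[i] + s1[i] for i in range(3))  # pooled scatter [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    return ((c * dx - b * dy) / det, (a * dy - b * dx) / det)


def roc_points(normal, pathological, w):
    """(P_fa, P_d) pairs from sweeping a threshold over the projected scores.

    A sample is declared pathological when its score w . x reaches the threshold.
    """
    s0 = [w[0] * p[0] + w[1] * p[1] for p in normal]
    s1 = [w[0] * p[0] + w[1] * p[1] for p in pathological]
    points = []
    for t in sorted(set(s0 + s1)) + [float("inf")]:
        p_fa = sum(s >= t for s in s0) / len(s0)
        p_d = sum(s >= t for s in s1) / len(s1)
        points.append((p_fa, p_d))
    return points
```

The sweep starts at (1, 1), where every sample is declared pathological, and ends at (0, 0); a point at $P_{fa} = 0$, $P_d = 1$ indicates perfect separation along $w$.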
In the first experiment, the dataset contains recordings from 21 males aged 26 to 60 years who were medically diagnosed as normal and 21 males aged 20 to 75 years who were medically diagnosed with vocal fold paralysis. In the second experiment, 21 females aged 22 to 52 years who were medically diagnosed as normal and 21 females aged 18 to 57 years who were medically diagnosed with vocal fold edema served as subjects. The subjects might also suffer from other diseases, such as hyperfunction, ventricular compression, atrophy, etc. Two different kinds of recordings were made in each session: in the first recording the patients were asked to articulate the sustained vowel Ah (/a/), and in the second one to read the Rainbow Passage. Only the sustained vowel recordings are used in the present work. Therefore, all procedures were applied to voiced speech frames far away from transition periods.
Feature vector extraction is performed via short-term linear prediction (LP) of order 14. The LP model of order 14 is regarded as a good choice, since it has been reported that using more than 14 LP coefficients (LPCs) does not significantly improve the discrimination of laryngeal diseases. The dimensionality of the feature space is then reduced by principal component analysis.
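A sketch of the LP feature extraction for one voiced frame follows. It assumes the autocorrelation method with a Hamming window, solved by the Levinson-Durbin recursion; the text states only the order (14), so the windowing and solver are assumptions, and `lpc` and `lpc_features` are hypothetical names.

```python
import math


def autocorrelation(frame, max_lag):
    """r[k] = sum_n x[n] x[n + k] for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(frame, order=14):
    """LP coefficients a_1..a_p of the predictor x[n] ~ sum_k a_k x[n - k].

    Autocorrelation method solved with the Levinson-Durbin recursion.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)   # a[0] is unused
    error = r[0]              # prediction error power of the order-0 predictor
    for i in range(1, order + 1):
        # reflection coefficient of stage i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def lpc_features(frame, order=14):
    """Hamming-window a voiced frame, then return its `order` LP coefficients."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    return lpc([x * w for x, w in zip(frame, window)], order)
```

The recursion solves the order-14 normal equations $\sum_{j=1}^{14} a_j r[|i-j|] = r[i]$, $i = 1, \dots, 14$, without an explicit matrix inversion, so each frame yields a 14-dimensional feature vector.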
Figure: The whole 2-D feature space for (a) the first experiment, concerned with vocal fold paralysis, and (b) the second experiment, concerned with vocal fold edema. Each normal feature vector is represented by an 'o', while each pathological feature vector is represented by a '*'.
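The reduction of the LP feature vectors to a 2-D space such as the one in the figure can be sketched with principal component analysis. This pure-Python version estimates the two leading eigenvectors of the sample covariance matrix by power iteration with deflation, a choice made here for self-containment rather than taken from the paper; the function names are hypothetical.

```python
import math
import random


def covariance(X):
    """Sample mean vector and covariance matrix of the rows of X."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (n - 1)
          for j in range(d)] for i in range(d)]
    return mean, C


def leading_eigenvectors(C, k=2, iters=500, seed=0):
    """Top-k eigenvectors of a symmetric PSD matrix by power iteration with deflation."""
    d = len(C)
    rng = random.Random(seed)
    A = [row[:] for row in C]
    vectors = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient = eigenvalue estimate; subtract lam * v v^T (deflation)
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(d)) for i in range(d))
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
        vectors.append(v)
    return vectors


def pca_project(X, k=2):
    """Project the mean-centered rows of X onto the k leading principal axes."""
    mean, C = covariance(X)
    axes = leading_eigenvectors(C, k)
    return [[sum((x - m) * a for x, m, a in zip(row, mean, axis)) for axis in axes] for row in X]
```

Applied to the 14-dimensional LP vectors of all normal and pathological frames, `pca_project(X, 2)` returns one 2-D point per frame, which is the kind of data plotted with 'o' and '*' markers.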
The first classifier is based on the K-nearest neighbor (K-NN) method, applied as follows: for each feature vector of the test set, we pick the feature vectors of the training set within a circle around it whose radius is increased until at least K training feature vectors are enclosed, i.e., the K nearest ones. The test sample is assigned to the class to which the majority of these training feature vectors belong. The second classifier, the nearest mean classifier, computes the mean vector of each class from the training samples, measures the distance of each test feature vector from the mean vector of each class, and assigns the test sample to the class of the nearest mean vector.
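Both rules can be sketched as follows. Sorting the training vectors by distance and keeping the first K is equivalent to growing the circle until K vectors are enclosed; the Euclidean distance, the tie-breaking behavior, and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def knn_classify(x, train_X, train_y, k=3):
    """Majority label among the k training vectors nearest to x.

    Sorting by Euclidean distance and keeping the first k is equivalent to growing
    a circle around x until k training vectors are enclosed. Ties in the vote go to
    the label encountered first among the neighbors (Counter ordering).
    """
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def nearest_mean_classify(x, train_X, train_y):
    """Label of the class whose training mean vector is nearest to x."""
    sums, counts = {}, Counter(train_y)
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    means = {label: [s / counts[label] for s in acc] for label, acc in sums.items()}
    return min(means, key=lambda label: math.dist(x, means[label]))
```

With two classes (normal and pathological), an odd K avoids ties in the majority vote.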
Experiments have demonstrated that vocal fold paralysis can be detected efficiently by Fisher's linear discriminant, the K-NN classifier, and the nearest mean classifier. Slightly worse results have been reported for vocal fold edema detection. The spectral characteristics extracted by linear prediction analysis of order 14, combined with principal component analysis of order 2 for feature reduction, have proved very efficient for these classification tasks.
Related Group Publications
I. Kotsia and I. Pitas, "Real time facial expression recognition from video sequences using Support Vector Machines", in Proc. of Visual Communications and Image Processing (VCIP 2005), Beijing, China, 12-15 July, 2005
C. I. Cotsaces, N. Nikolaidis and I. Pitas, "The use of face indicator functions for video indexing and fingerprinting", in Proc. of International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia, 21-23 June 2005
D. Ververidis and C. Kotropoulos, "Sequential Forward Feature Selection with Low Computational Cost", in Proc. of European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 4-8 September, 2005
D. Ververidis and C. Kotropoulos, "Automatic Speech Classification to five emotional states based on gender information", in Proc. of 12th European Signal Processing Conference (EUSIPCO '04), pp. 341-344, 2004
M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic detection of vocal fold paralysis and edema", in Proc. of 8th Int. Conf. Spoken Language Processing (INTERSPEECH 2004), Jeju, Korea, October 2004