MUSCLE Web Page
Project's Rationale
MUSCLE is a pan-European Network of Excellence that aims to foster close
collaboration between research groups in multimedia data mining on the one
hand and machine learning on the other, in order to make breakthrough progress
towards the following objectives:
Partners:
Our Research Objectives
The research performed by AUTH
within the framework of the MUSCLE NOE includes the following objectives:
Contributions of AUTH
Emotion Recognition from Speech Based on Gender Information
Emotional speech recognition aims to automatically classify speech units (e.g.,
utterances) into emotional states, such as anger, happiness, neutral, sadness,
and surprise. The major contribution of this work is to rate the discriminating
capability of a set of features for emotional speech recognition when gender
information is taken into consideration. A total of 87 features have been
calculated over 500 utterances of the Danish Emotional Speech database.
The class pdfs of the mean value of the pitch contour for the five emotions
under study are plotted below. Note that the pdf curves are splines fitted
to the discrete pdf of each class.
In order to study the classification ability of each feature, a rating method
has been implemented. Each feature is evaluated by the ratio of the
between-class variance σ²_b to the within-class variance σ²_w. The
between-class variance measures the distance between the class means, whereas
the within-class variance measures the dispersion within each class. The best
features should be characterized by a large σ²_b and a small σ²_w. The 15
features with the highest ratio (σ²_b/σ²_w) are shown below, where both σ²_b
and σ²_w are depicted.
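The rating method described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `variance_ratio` and `rank_features`, the toy feature values, and the exact normalization of the two variances (weighted by class size and divided by the total number of samples) are all assumptions, since the page does not give the formula.

```python
from statistics import fmean

def variance_ratio(values, labels):
    """Return (between, within, ratio) for one scalar feature.

    Assumption: both variances are weighted by class size and divided by
    the total number of samples; the source does not state the formula.
    """
    grand_mean = fmean(values)
    between = within = 0.0
    for c in set(labels):
        members = [v for v, y in zip(values, labels) if y == c]
        class_mean = fmean(members)
        # Spread of the class means around the overall mean.
        between += len(members) * (class_mean - grand_mean) ** 2
        # Spread of the samples around their own class mean.
        within += sum((v - class_mean) ** 2 for v in members)
    n = len(values)
    between, within = between / n, within / n
    return between, within, between / within

def rank_features(columns, labels):
    """Sort feature names by decreasing between/within variance ratio."""
    scores = {name: variance_ratio(vals, labels)[2]
              for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical values, not from the Danish Emotional Speech database).
labels = ["anger", "anger", "anger", "sadness", "sadness", "sadness"]
columns = {
    "mean_pitch": [5.0, 5.2, 4.8, 1.0, 1.2, 0.8],   # class means differ
    "mean_energy": [1.0, 5.0, 3.0, 1.0, 5.0, 3.0],  # identical class means
}
print(rank_features(columns, labels))
```

A feature whose class means coincide gets a ratio of zero and falls to the bottom of the ranking. The sketch does not guard against a zero within-class variance, which would need handling on real data.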
The Sequential Forward Selection (SFS) method has been used to discover the
5-10 features that classify the samples best for each gender. The criterion
used in SFS is the cross-validated correct classification rate of a Bayes
classifier, where the class probability density functions (pdfs) are
approximated via Parzen windows or modeled as Gaussians.
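As a rough illustration of the selection loop, the sketch below greedily adds the feature that most improves the cross-validated correct classification rate of a Gaussian Bayes classifier. Several choices here are assumptions made for brevity and are not stated in the source: a diagonal covariance per class, interleaved folds, a small variance floor, and the hypothetical names `fit_gaussians`, `predict`, `cv_rate`, and `sfs`. The Parzen-window variant is not shown.

```python
import math
from statistics import fmean

def fit_gaussians(rows, labels):
    """Per-class mean, variance (diagonal covariance, an assumption) and prior."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        dims = range(len(members[0]))
        means = [fmean(m[d] for m in members) for d in dims]
        # Small floor keeps the variance strictly positive.
        varis = [fmean((m[d] - means[d]) ** 2 for m in members) + 1e-6
                 for d in dims]
        model[c] = (means, varis, len(members) / len(rows))
    return model

def predict(model, row):
    """Pick the class with the largest log posterior."""
    def log_post(c):
        means, varis, prior = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                 for x, m, v in zip(row, means, varis))
        return ll + math.log(prior)
    return max(model, key=log_post)

def cv_rate(X, y, feats, folds=5):
    """Cross-validated correct classification rate on the chosen features."""
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, len(X), folds))  # interleaved folds
        train_rows = [[X[i][f] for f in feats]
                      for i in range(len(X)) if i not in test_idx]
        train_lab = [y[i] for i in range(len(X)) if i not in test_idx]
        model = fit_gaussians(train_rows, train_lab)
        correct += sum(predict(model, [X[i][f] for f in feats]) == y[i]
                       for i in test_idx)
    return correct / len(X)

def sfs(X, y, n_select, folds=5):
    """Greedy forward selection driven by the cross-validated rate."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: cv_rate(X, y, selected + [f], folds))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
y = ["anger" if i % 2 else "neutral" for i in range(40)]
X = [[4.0 * (i % 2) + 0.1 * (i % 3), float((2 * i) % 5)] for i in range(40)]
print(sfs(X, y, 1), cv_rate(X, y, [0]))
```

Because each candidate feature is scored by retraining the classifier on every fold, the cost grows quickly with the number of features, which is why the criterion evaluation dominates the run time of this kind of search.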
When a Bayes classifier with Gaussian pdfs is employed, a correct
classification rate of 61.1% is obtained for male subjects and 57.1% for
female subjects. In the same experiment, random classification would yield a
correct classification rate of 20%. When gender information is not considered,
a correct classification rate of 50.6% is obtained. The per-class correct
classification rates are shown in the following figure.
| | Both | Males | Females |
| Correct classification rate | 50.6% | 61.1% | 57.1% |
The rates reported in Tables 3 and 4 can be further improved by analyzing the
properties of the above-mentioned two-class problems. The features that
separate two classes may differ from those that separate five classes. By
designing proper decision fusion algorithms, several two-class classifiers may
be combined so that the overall system outperforms the rates obtained by the
five-class classifiers.
Automatic Detection Of Vocal Fold Paralysis and Edema
In this paper we propose a combined scheme of linear prediction analysis for
feature extraction and linear projection methods for feature reduction,
followed by known pattern recognition methods, for the purpose of
discriminating between normal and pathological voice samples. Two different
cases of speech under vocal fold pathology are examined: vocal fold paralysis
and vocal fold edema. Three known classifiers are tested and compared in both
cases, namely the Fisher linear discriminant, the K-nearest neighbor
classifier, and the nearest mean classifier. The performance of each
classifier is evaluated in terms of the probabilities of false alarm and
detection, or the receiver operating characteristic. The datasets used are
part of a database of disordered speech developed by the Massachusetts Eye and
Ear Infirmary. The experimental results indicate that vocal fold paralysis and
edema can easily be detected by any of the aforementioned classifiers.
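To make the evaluation measure concrete, the following sketch computes pairs of (probability of false alarm, probability of detection) by sweeping a decision threshold over classifier scores. It assumes that a higher score means "more likely pathological"; the function name `roc_points` and the toy scores are hypothetical and do not come from the experiments.

```python
def roc_points(scores, is_pathological):
    """Return (P_false_alarm, P_detection) pairs, one per decision threshold.

    Assumption: a sample is declared pathological when its score is
    greater than or equal to the threshold.
    """
    n_pos = sum(is_pathological)
    n_neg = len(is_pathological) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is detected
    for thr in sorted(set(scores), reverse=True):
        detections = sum(s >= thr and p
                         for s, p in zip(scores, is_pathological))
        false_alarms = sum(s >= thr and not p
                           for s, p in zip(scores, is_pathological))
        points.append((false_alarms / n_neg, detections / n_pos))
    return points

# Toy scores (hypothetical): two pathological and two normal samples.
scores = [0.9, 0.7, 0.6, 0.2]
is_pathological = [True, True, False, False]
print(roc_points(scores, is_pathological))
```

Plotting the second coordinate against the first gives the receiver operating characteristic; a curve that reaches a detection probability of 1 while the false alarm probability is still 0 corresponds to perfectly separable scores.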
In the first experiment, the dataset contains recordings from 21 males aged 26
to 60 years who were medically diagnosed as normal and 21 males aged 20 to 75
years who were medically diagnosed with vocal fold paralysis. In the second
experiment, 21 females aged 22 to 52 years who were medically diagnosed as
normal and 21 females aged 18 to 57 years who were medically diagnosed with
vocal fold edema served as subjects. The subjects might also suffer from other
diseases, such as hyperfunction, ventricular compression, atrophy, etc. Two
different kinds of recordings were made in each session: in the first
recording the patients were asked to articulate the sustained vowel "Ah" (/a/),
and in the second one to read the Rainbow Passage. Only the former is used in
the present work. Therefore, all procedures were applied to voiced speech
frames far away from transition periods.
The feature vector extraction is performed via short-term linear prediction of
order 14. The LP model of order 14 is regarded as a good choice: it has been
reported that the use of more than 14 LPCs does not significantly improve the
discrimination of laryngeal diseases. The dimensionality of the feature space
is then reduced by principal component analysis.
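For readers unfamiliar with linear prediction, the sketch below extracts order-14 LP coefficients from a single frame. The source only states that short-term linear prediction of order 14 is used; the autocorrelation method, the Levinson-Durbin recursion, the Hamming window, and the names `autocorrelation`, `levinson_durbin`, and `lpc` are assumptions chosen for illustration.

```python
import math

def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns (a, err) with a[0] = 1, so that the prediction is
    x[n] ~ -sum(a[k] * x[n - k] for k in 1..order), and err is the
    residual prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def lpc(frame, order=14):
    """Window one frame (Hamming, an assumption) and return (a, err)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return levinson_durbin(autocorrelation(windowed, order), order)

# Sanity check on an ideal first-order process with r[k] = 0.5 ** k.
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, err)
```

In a setup like the one described, the 14 coefficients `a[1]` to `a[14]` of each voiced frame would form the feature vector passed on to the dimensionality reduction step.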
The whole 2-D feature space for (a) the first experiment, concerning vocal fold
paralysis, and (b) the second experiment, concerning vocal fold edema. (Each
normal feature vector is represented by an `o', while each pathological
feature vector is represented by a `*'.)
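A 2-D feature space of this kind is obtained by projecting each feature vector onto the two leading principal components. The sketch below shows one way to do that with power iteration and deflation; it is an illustrative assumption, not the procedure used by the authors, and the name `pca_project` and the toy data are hypothetical.

```python
import math

def pca_project(rows, n_components=2, iters=500):
    """Project rows onto their leading principal components.

    Uses power iteration with deflation on the sample covariance matrix,
    which is adequate for a sketch but not a robust eigen-solver.
    """
    n, dims = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - mean[d] for d in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    components = []
    for _comp in range(n_components):
        # Asymmetric start vector, normalized to unit length.
        v = [1.0 / (d + 1) for d in range(dims)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dims))
                 for i in range(dims)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(dims) for j in range(dims))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dims)]
               for i in range(dims)]
    return [[sum(c[d] * comp[d] for d in range(dims)) for comp in components]
            for c in centered]

# Toy data (hypothetical): variance is largest along the first axis.
rows = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
projected = pca_project(rows)
print(projected)
```

Applied to 14-dimensional LP coefficient vectors, the two output coordinates per sample are what a scatter plot of normal versus pathological voices would display.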
The first classifier is based on the K-nearest neighbor (K-NN) method, applied
as follows: for each feature vector of the test set, we pick the feature
vectors of the training set within a circle around it, whose radius is
increased until at least K training feature vectors are enclosed, i.e., the K
nearest ones. The test sample is assigned to the class to which the majority
of these training feature vectors belong. The second classifier computes a
class-dependent mean vector from the training samples, measures the distance
of each test feature vector from the mean vector of each class, and assigns
the test sample to the class of the nearest mean vector.
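Both decision rules are short enough to sketch directly. The code below assumes Euclidean distance and uses hypothetical names (`knn_classify`, `nearest_mean_classify`) and toy 2-D points. Sorting the training vectors by distance and keeping the first K is equivalent to growing a circle until K vectors are enclosed.

```python
import math
from collections import Counter

def knn_classify(train_vectors, train_labels, x, k=3):
    """Majority vote among the k training vectors nearest to x."""
    by_distance = sorted(zip(train_vectors, train_labels),
                         key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def nearest_mean_classify(train_vectors, train_labels, x):
    """Assign x to the class whose mean vector is closest."""
    means = {}
    for c in set(train_labels):
        members = [v for v, y in zip(train_vectors, train_labels) if y == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
    return min(means, key=lambda c: math.dist(means[c], x))

# Toy 2-D training set (hypothetical values).
train_vectors = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
                 (3.0, 3.0), (3.4, 2.8), (2.9, 3.5)]
train_labels = ["normal"] * 3 + ["pathological"] * 3

print(knn_classify(train_vectors, train_labels, (2.8, 3.1)))
print(nearest_mean_classify(train_vectors, train_labels, (0.2, 0.1)))
```

An odd K avoids tied votes in a two-class problem such as normal versus pathological. The nearest mean rule needs only one distance per class at test time, whereas K-NN compares the test vector with every training vector.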
Experiments have demonstrated that efficient detection of voice disorders can
be achieved by Fisher's linear discriminant, K-NN, and the nearest mean
classifier for vocal fold paralysis. Slightly worse results have been reported
for vocal fold edema detection. The spectral characteristics extracted by
linear prediction analysis of order 14, combined with principal component
analysis of order 2 for feature reduction, have proved to be very efficient
for the aforementioned classification tasks.
Related Group Publications
I. Kotsia and I. Pitas, "Real time facial expression recognition from video sequences using Support Vector Machines", in Proc. of Visual Communications and Image Processing (VCIP 2005), Beijing, China, 12-15 July 2005
C. I. Cotsaces, N. Nikolaidis, and I. Pitas, "The use of face indicator functions for video indexing and fingerprinting", in Proc. of International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia, 21-23 June 2005
D. Ververidis and C. Kotropoulos, "Sequential Forward Feature Selection with Low Computational Cost", in Proc. of European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 4-8 September 2005
D. Ververidis and C. Kotropoulos, "Automatic Speech Classification to five emotional states based on gender information", in Proc. of 12th European Signal Processing Conference (EUSIPCO '04), pp. 341-344
M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic detection of vocal fold paralysis and edema", in Proc. of 8th Int. Conf. Spoken Language Processing (INTERSPEECH 2004), Jeju, Korea, October 2004