Emotional Speech Recognition
Emotion is an important factor in communication. For example, a simple text dictation that conveys no emotion fails to capture the full semantics of the text. An emotional speech synthesizer could address such a communication problem. Speech emotion recognition systems can be used by disabled people for communication, by actors for checking the consistency of emotional speech, in interactive TV, in constructing virtual teachers, in the study of human brain malfunctions, and in the advanced design of speech coders. Until recently, many voice synthesizers could not faithfully reproduce emotional human speech, which results in unnatural and unattractive output. Nowadays, the major speech processing laboratories worldwide are trying to develop efficient algorithms for both emotional speech synthesis and emotion recognition. To achieve such ambitious goals, the collection of emotional speech databases is a prerequisite.
Our purpose is to design a useful tool for psychology that automatically classifies utterances into five emotional states: anger, happiness, neutral, sadness, and surprise. The major contribution of our investigation is to rate the discriminating capability of a set of features for emotional speech recognition. A total of 87 features have been calculated over 500 utterances from the Danish Emotional Speech database. The Sequential Forward Selection (SFS) method has been used to discover a subset of 5 to 10 features that classifies the utterances best.
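To make the selection procedure concrete, the following is a minimal sketch of greedy SFS in Python. It assumes a feature matrix X, labels y, and a caller-supplied score_fn (such as a cross-validated classification rate); the function name and signature are illustrative, not taken from the paper.

```python
import numpy as np

def sequential_forward_selection(X, y, score_fn, n_features=5):
    """Greedy SFS sketch: at each step, add the single feature that
    most improves the classification score of the current subset.

    X        : (n_samples, n_total_features) feature matrix
    y        : (n_samples,) emotion labels
    score_fn : callable(X_subset, y) -> correct classification rate
    """
    selected = []                           # indices of chosen features
    remaining = list(range(X.shape[1]))     # candidates not yet picked
    while len(selected) < n_features:
        best_score, best_idx = -np.inf, None
        for idx in remaining:               # try each unused feature
            score = score_fn(X[:, selected + [idx]], y)
            if score > best_score:
                best_score, best_idx = score, idx
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

Note that SFS is greedy: a feature, once added, is never removed, so the returned subset is not guaranteed to be globally optimal.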
The criterion used in SFS is the cross-validated correct classification score of one of the following classifiers: the nearest mean classifier, or the Bayes classifier with class probability density functions (pdfs) either approximated via Parzen windows or modelled as Gaussians. After selecting the 5 best features, we reduce the dimensionality to two by applying principal component analysis (PCA). The result is a correct classification rate of 51.6% ± 3% at the 95% confidence level for the five aforementioned emotions, whereas random classification would yield a rate of 20%.
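As one hedged illustration of this criterion and the subsequent projection, the sketch below uses scikit-learn's NearestCentroid as the nearest mean classifier and PCA for the reduction to two dimensions. The helper names and the 10-fold split are assumptions for the example; the paper does not specify the exact cross-validation setup, and the Parzen-window and Gaussian Bayes variants are omitted for brevity.

```python
from sklearn.neighbors import NearestCentroid
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

def cv_score(X_subset, y, n_folds=10):
    """Cross-validated correct classification rate of the nearest
    mean classifier; usable as score_fn in the SFS sketch above."""
    clf = NearestCentroid()
    return cross_val_score(clf, X_subset, y, cv=n_folds).mean()

def project_to_2d(X_selected):
    """Reduce the 5 selected features to the two leading
    principal components for visualization or final classification."""
    return PCA(n_components=2).fit_transform(X_selected)
```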
Furthermore, we identify the two-class emotion recognition problems whose error rates contribute most heavily to the average error, and we indicate that the error rates reported in this paper could be reduced by employing two-class classifiers and combining their decisions.
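One common way to realize such a combination, shown below as a sketch rather than the paper's method, is a one-vs-one scheme: train a binary classifier for each emotion pair and combine the pairwise decisions by majority vote. The function name is hypothetical, and NearestCentroid again stands in for the nearest mean classifier.

```python
from itertools import combinations
import numpy as np
from sklearn.neighbors import NearestCentroid

def one_vs_one_predict(X_train, y_train, X_test, classes):
    """Train one binary classifier per emotion pair and combine
    their decisions by majority vote over the pairwise outcomes."""
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    class_to_col = {c: i for i, c in enumerate(classes)}
    for a, b in combinations(classes, 2):
        mask = np.isin(y_train, [a, b])   # keep only the two classes
        clf = NearestCentroid().fit(X_train[mask], y_train[mask])
        for row, pred in enumerate(clf.predict(X_test)):
            votes[row, class_to_col[pred]] += 1   # pairwise vote
    return [classes[i] for i in votes.argmax(axis=1)]
```

The appeal of this decomposition is that each binary problem can use its own best feature subset and classifier, so the hardest emotion pairs no longer dominate the average error of a single five-class model.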