Human Centered Interfaces

Facial Expression Recognition
Emotional Speech Recognition
Face Detection and Tracking
Face Clustering

Facial Expression Recognition

Computers nowadays try to interpret certain human characteristics, such as facial expressions, eye gaze, body gait and speech, so as to react better to their users. Many applications, such as virtual reality, videoconferencing, user profiling, customer satisfaction studies for broadcast and web services, and interfaces for people with special needs, require efficient facial expression recognition in order to achieve the desired results.

Six basic facial expressions are defined: anger, disgust, fear, happiness, sadness and surprise. A set of muscle movements (Facial Action Units, FAUs) was defined to produce these facial expressions, forming the Facial Action Coding System (FACS).


An example of each facial expression for a poser from the Cohn-Kanade database

Facial expressions are generally hard to recognize because:

  • Every person expresses emotions in a different way; no universal patterns are available.
  • The conditions must be ideal, meaning that a full frontal pose of the poser has to be available.
  • The neutral state has to be found in a video in order to define the fully expressive video frame and thus perform facial expression recognition.
  • Few proper databases are available, and creating a new one is difficult, as supervision by psychologists is required.


Our Method

A novel method has been developed that performs facial expression recognition using a fusion of texture and shape information:

  • Extracts the texture information from the difference images of a video (calculated using the first and the last frame of the video) using the Discriminant Non-negative Matrix Factorization (DNMF) algorithm.
  • Extracts the shape information as geometric deformation features, classified using Support Vector Machines (SVMs).
  • Fuses the two kinds of information using either SVMs or Radial Basis Function (RBF) Neural Networks (NNs).

By introducing fusion, certain confusions are resolved. For example, in cases where only the gaze has changed, the geometrical information is not informative, while the texture information can capture the change properly.
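
As a rough illustration of the fusion step, the sketch below concatenates a texture feature vector with a shape feature vector and trains an SVM on the result. It substitutes scikit-learn's standard NMF for the DNMF algorithm, and all data, dimensions and variable names are illustrative assumptions rather than our actual implementation.

    # Minimal fusion sketch: texture features from an NMF decomposition
    # (standing in for DNMF) are concatenated with shape features and
    # classified by an SVM. Data and dimensions are illustrative.
    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_videos, n_pixels, n_shape = 200, 64 * 64, 40

    # Difference images (last minus first frame), made non-negative for NMF.
    diff_images = np.abs(rng.normal(size=(n_videos, n_pixels)))
    shape_feats = rng.normal(size=(n_videos, n_shape))  # e.g. deformation features
    labels = rng.integers(0, 6, size=n_videos)          # six basic expressions

    # Texture stream: project difference images onto a learned NMF basis.
    texture_feats = NMF(n_components=30, max_iter=500).fit_transform(diff_images)

    # Feature-level fusion: concatenate both streams, classify with an SVM.
    fused = np.hstack([texture_feats, shape_feats])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(fused, labels)
    print("training accuracy:", clf.score(fused, labels))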

The achieved accuracy is 94.5% for facial expression recognition and 92.1% for FAU detection.


System architecture for the fusion system




Relevant Publications

I. Kotsia and I. Pitas, "Facial expression recognition using shape and texture information", in Proc. of Int. Federation for Information Processing Conf. on Artificial Intelligence (IFIP AI 2006), Santiago, Chile, 21-24 August, 2006.

I. Kotsia, N. Nikolaidis and I. Pitas, "Fusion of Geometrical and Texture Information for Facial Expression Recognition", in Proc. of Int. Conf. on Image Processing (ICIP 2006), Atlanta, GA, USA, 8-11 October, 2006.


Research Projects

SIMILAR - The European research taskforce creating human-machine interfaces SIMILAR to human-human communication, IST, FP6


Emotional Speech Recognition

Affect recognition aims at automatically identifying the emotional or physical state of a human being from his or her face and voice. The emotional and physical states of a speaker are known as the emotional aspects of speech and belong to the so-called paralinguistic aspects. Although the emotional state does not alter the linguistic content, it is an important factor in human communication, because it provides feedback in many applications.

Affect Recognition is related to the following tasks:

  • Data collection procedures: the kind of speech (natural, simulated, or elicited), the content, and other physiological signals that may accompany emotional speech.
  • Short-term features (i.e., features extracted on a speech-frame basis) that are related to the emotional content of speech; emotions affect the characteristics of the feature contours, such as their statistics and trends.
  • Emotion classification techniques that exploit timing information, as well as techniques that ignore the time context.

 

Our Method

  • A data collection is under construction. The subjects are a) children trying to imitate an actor, and b) children immersed in a VR environment. Two cameras, a professional condenser microphone, a sweat sensor, and a heartbeat sensor were used.
  • Two databases were obtained, namely: a) Danish Emotional Speech (DES), and b) Speech Under Simulated and Actual Stress (SUSAS).
  • Feature extraction algorithms were developed for: a) the fundamental frequency (pitch), b) the formants, estimated from the reflection coefficients of the linear prediction model, and c) the cepstral coefficients.
  • Feature selection algorithms were improved; in particular, the Sequential Floating Forward Selection (SFFS) algorithm was accelerated by statistical comparisons between feature sets, using confidence intervals of the prediction error achieved by each feature set.
  • Several classifiers were developed: a) a Bayes classifier using mixtures of Gaussian densities (GMMs), b) support vector machines, c) a Bayes classifier using Parzen windows, and d) neural networks (self-organizing maps); e) the Brunswik model for emotion perception is under development. A rough sketch of classifier a) combined with cepstral features is given after this list.
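
A minimal sketch of classifier a) with cepstral features follows, assuming librosa for the audio front end; the file paths, the MFCC configuration and the mixture sizes are illustrative assumptions, not the exact setup used in our experiments.

    # Sketch of cepstral feature extraction plus a GMM Bayes classifier:
    # one GaussianMixture per emotion class, prediction by maximum likelihood.
    # librosa is an assumed front end; parameters are illustrative.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def utterance_features(path, sr=16000):
        """Mean and standard deviation of MFCCs over one utterance."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # frame-level features
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def train_gmm_bayes(features, labels, n_components=4):
        """Fit one GMM per emotion class (features and labels are numpy arrays)."""
        return {c: GaussianMixture(n_components, covariance_type="diag")
                   .fit(features[labels == c])
                for c in np.unique(labels)}

    def classify(models, x):
        """Bayes rule with equal priors: pick the class of maximum likelihood."""
        return max(models, key=lambda c: models[c].score(x.reshape(1, -1)))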


Relevant Publications

D. Ververidis and C. Kotropoulos, "Fast Sequential Floating Forward Selection applied to emotional speech features estimated on DES and SUSAS data collections", in Proc. of European Signal Processing Conf. (EUSIPCO 2006), Florence, Italy, 4-8 September, 2006.

M. Haindl, P. Somol, D. Ververidis and C. Kotropoulos, "Feature Selection Based on Mutual Correlation", in Proc. of 11th Iberoamerican Congress on Pattern Recognition (CIARP 2006), Mexico, 2006.

V. Moschou, D. Ververidis and C. Kotropoulos, "On the Variants of the Self-Organizing Map That Are Based on Order Statistics", in Proc. of 2006 Int. Conf. on Artificial Neural Networks (ICANN 2006), Athens, Greece, September, 2006.

D. Ververidis and C. Kotropoulos, "Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm", in Proc. of 2005 IEEE Int. Conf. on Multimedia and Expo (ICME 2005), Amsterdam, The Netherlands, 6-8 July, 2005.

D. Ververidis and C. Kotropoulos, "Emotional speech classification using Gaussian mixture models", in Proc. of 2005 IEEE Int. Symposium on Circuits and Systems (ISCAS 2005), pp. 2871-2874, Kobe, Japan, May, 2005.

D. Ververidis, C. Kotropoulos and I. Pitas, "Automatic emotional speech classification", in Proc. of ICASSP 2004, vol. I, pp. 593-596, Montreal, Canada, May, 2004.

D. Ververidis and C. Kotropoulos, "Automatic Speech Classification to five emotional states based on gender information", in Proc. of 12th European Signal Processing Conf. (EUSIPCO 2004), pp. 341-344, Vienna, Austria, September, 2004.

D. Ververidis and C. Kotropoulos, "A Review of Emotional Speech Databases", in Proc. of 9th Panhellenic Conf. on Informatics (PCI 2003), pp. 560-574, Thessaloniki, Greece, 21-23 November, 2003.

D. Ververidis and C. Kotropoulos, "A State of the Art Review on Emotional Speech Databases", in Proc. of 1st Richmedia Conf., pp. 109-119, Lausanne, Switzerland, October, 2003.

D. Ververidis and C. Kotropoulos, "Emotional Speech Recognition: Resources, features and methods", Speech Communication, vol. 48, no. 9, pp. 1162-1181, September, 2006.

I. Kotsia and I. Pitas, "Facial Expression Recognition in Image Sequences using Geometric Deformation Features and Support Vector Machines", IEEE Transactions on Image Processing, December, 2006.

Research Projects

PENED 2003 - “Use of Virtual Reality for training pupils to deal with earthquakes” (01ED312)

MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)


Face Detection and Tracking

Achieving a good localization of faces in video frames is of high importance for applications such as video indexing. Face localization in movies is a challenging task due to variations in scale, pose and lighting conditions.


Our Method

A novel deterministic approach has been developed that applies face detection, forward tracking and backward tracking using a set of predefined rules. From all the extracted candidates, a dynamic programming algorithm selects those that minimize a cost function.

Face detection:

Haar-like features are used to provide the first face candidates. For tracking purposes, a post-processing step is added to reduce the number of false alarms and remove part of the background. A candidate is rejected if the number of pixels fulfilling the criteria below falls under a certain threshold.

0 < h < 1 and 0.23 < s < 0.68 and v > 0.27 (h, s and v denoting hue, saturation and value in the HSV color space)

The remaining candidates are replaced by the smallest bounding box containing the skin-like pixels. The detection is performed every 5 frames.
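
A rough sketch of this post-processing step is given below, assuming OpenCV; h, s and v are rescaled to [0, 1] to match the rule above, and the pixel-count threshold is an illustrative parameter.

    # Sketch of the skin-pixel post-processing: keep a face candidate only if
    # enough pixels satisfy the h/s/v rule, then shrink the bounding box to
    # the skin-like pixels. The min_skin_pixels threshold is illustrative.
    import cv2
    import numpy as np

    def refine_candidate(bgr_patch, min_skin_pixels=200):
        hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV).astype(np.float32)
        h = hsv[..., 0] / 179.0          # OpenCV hue range is 0..179 for uint8
        s = hsv[..., 1] / 255.0
        v = hsv[..., 2] / 255.0
        mask = (0 < h) & (h < 1) & (0.23 < s) & (s < 0.68) & (0.27 < v)
        if mask.sum() < min_skin_pixels:
            return None                  # reject the candidate as a false alarm
        ys, xs = np.nonzero(mask)        # smallest box containing skin pixels
        return xs.min(), ys.min(), xs.max(), ys.max()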

Forward tracking process:

  • The Morphological Elastic Graph Matching (MEGM) tracking algorithm is used.
  • The tracking is initialized by the detection output.
  • While the detection is efficient for frontal faces, the tracking provides candidates for other possible poses.

Backward tracking process:

  • If a face is detected in the middle of a shot, no information is available about its localization in the previous video frames.
  • The same tracker (MEGM) is applied backwards.
  • The backward tracker is initialized with the labeled detections.
  • Detected, forward and backward tracked faces are now grouped in actor appearances (labels).

Costs

  • The node cost C is the distance between the center of the bounding box and the centroid of the skin-like pixels.
  • The transition cost combines:
    • the overlap between two bounding boxes
    • the ratio of the bounding box areas, to penalize large changes in area during tracking

Structure of the trellis

  • A trellis is built over the face candidates; the path with the minimum total cost is selected using dynamic programming (a rough sketch is given below).
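
A minimal sketch of this selection follows, using the node and transition costs described above; the candidate representation and the equal weighting of the two transition terms are illustrative assumptions.

    # Sketch of the minimum-cost path over the trellis of face candidates.
    # frames[t] is a list of boxes (x0, y0, x1, y1); node_costs[t][i] is the
    # node cost of box i at frame t. Cost weights are illustrative.
    import numpy as np

    def area(b):
        return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

    def transition_cost(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        overlap = area((ix0, iy0, ix1, iy1)) / max(area(a), area(b), 1)
        ratio = min(area(a), area(b)) / max(area(a), area(b), 1)
        return (1 - overlap) + (1 - ratio)   # low overlap and size jumps cost more

    def best_path(frames, node_costs):
        cost = [np.asarray(node_costs[0], dtype=float)]
        back = []
        for t in range(1, len(frames)):
            trans = np.array([[transition_cost(a, b) for a in frames[t - 1]]
                              for b in frames[t]])
            total = trans + cost[-1][None, :]    # arrival cost from each predecessor
            back.append(total.argmin(axis=1))
            cost.append(total.min(axis=1) + np.asarray(node_costs[t], dtype=float))
        path = [int(cost[-1].argmin())]          # backtrack from the cheapest end
        for bp in reversed(back):
            path.append(int(bp[path[-1]]))
        return path[::-1]                        # candidate index per frame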

Experimental results




Relevant Publications

I. Cherif, V. Solachidis and I. Pitas, "A Tracking Framework for Accurate Face Localization", in Proc. of Int. Federation for Information Processing Conf. on Artificial Intelligence (IFIP AI 2006), Santiago, Chile, 21-24 August, 2006.


Research Projects

NM2 - “New media for a new millennium” (IST-004124), FP6


Face Clustering

Clustering can be considered as a form of unsupervised classification imposed over a finite set of objects. Its goal is to group objects into classes, such that similar objects are placed in the same cluster, while dissimilar objects are placed in different clusters.

Human faces are among the most important and frequently encountered entities in videos and can be considered as high-level semantic features. Face clustering in videos can be used in many applications, such as video indexing and content analysis, as a preprocessing step for face recognition, or even as a basic step for extracting the principal cast of a feature-length movie.

 

Our Methods

A) Clustering based on Mutual Information: The capabilities of joint entropy and mutual information are exploited in order to cluster face images extracted by a Haar-based face detector. We use intensity images and define, for every image, a probability density function as the histogram of its intensities, normalized to sum to one. In order to calculate the joint entropy between two images, we construct a 2D histogram with 256 bins per dimension that takes into account the relative positions of intensities, so that two images are similar when the same intensities are located at the same spatial locations.
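
A minimal sketch of the mutual information computation and of the resulting similarity matrix, assuming grayscale images of equal size; the 256-bin joint histogram follows the description above.

    # Sketch of the pairwise mutual information used to build the similarity
    # matrix: a 256x256 joint histogram over co-located pixel intensities.
    # Images are assumed grayscale (uint8) and resized to a common shape.
    import numpy as np

    def mutual_information(img_a, img_b, bins=256):
        joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(),
                                     bins=bins, range=[[0, 256], [0, 256]])
        pxy = joint / joint.sum()                  # joint pdf of intensity pairs
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginal pdfs
        nz = pxy > 0                               # avoid log(0)
        return (pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum()

    def similarity_matrix(images):
        n = len(images)
        s = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                s[i, j] = s[j, i] = mutual_information(images[i], images[j])
        return s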

Problem Definition

  • Face clustering is the task of partitioning a set of face images A into n subsets A1, ..., An.
  • We exploit the mutual information between the face images to create a similarity matrix, which is afterwards clustered.
  • The mutual information is shown to be a good similarity measure between face images under varying lighting conditions and poses.
  • Applying such an algorithm to movies adds a new dimension to the problem, since no calibrated images are used as input.
  • Purpose of such an algorithm: identifying the principal actors, automatic (not manually annotated) database search, registration and content analysis.

Clustering Process:

  • The clustering process is based on the Fuzzy-C Means (FCM) algorithm.
  • We provide the number of classes and the similarity matrix to the algorithm.
  • In order to use this algorithm we define every row of the aforementioned similarity matrix as a different vector in an M-dimensional L2-normed vector space over R.

Darker regions belong to the first actor and lighter ones to the second. The video sequence has four consecutive shots in the order FA-FA-SA-SA, where FA and SA denote the first and second actor, respectively.
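
A rough sketch of the clustering step above, using the scikit-fuzzy implementation of FCM (an assumption; any FCM implementation would do); the fuzzifier m and the stopping tolerances are typical values, not necessarily those of our experiments.

    # Sketch of the FCM step: each row of the MxM similarity matrix is an
    # M-dimensional vector; the rows are grouped into n_classes fuzzy clusters.
    # scikit-fuzzy is assumed; m=2.0 and the tolerances are typical defaults.
    import skfuzzy as fuzz

    def cluster_faces(similarity, n_classes):
        # skfuzzy expects data as (features, samples), hence the transpose.
        cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
            similarity.T, c=n_classes, m=2.0, error=1e-5, maxiter=1000)
        return u.argmax(axis=0), fpc   # hard labels and fuzzy partition coefficient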

 

B) Hierarchical Clustering: An algorithm is proposed to cluster face images found in feature-length movies and, more generally, in video sequences. A novel method for creating a dissimilarity matrix using SIFT image features is introduced. This dissimilarity matrix is used as input to a hierarchical average-linkage clustering algorithm, which yields the final clustering result.

The final result is found to be quite robust to the significant scale, pose and illumination variations encountered in facial images.
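
A minimal sketch of this pipeline, assuming OpenCV's SIFT and SciPy's average-linkage clustering; the match-ratio dissimilarity below is one plausible choice, not necessarily the exact measure introduced in the paper.

    # Sketch of SIFT-based hierarchical face clustering: a dissimilarity matrix
    # built from SIFT descriptor matches is fed to average-linkage clustering.
    # face_images are assumed to be grayscale uint8 arrays.
    import cv2
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def dissimilarity(da, db, ratio=0.75):
        """1 minus the fraction of descriptors passing Lowe's ratio test."""
        if da is None or db is None:
            return 1.0
        pairs = matcher.knnMatch(da, db, k=2)
        good = [p for p in pairs if len(p) == 2
                and p[0].distance < ratio * p[1].distance]
        return 1.0 - len(good) / min(len(da), len(db))

    def cluster(face_images, n_clusters):
        descs = [sift.detectAndCompute(img, None)[1] for img in face_images]
        n = len(face_images)
        dm = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dm[i, j] = dm[j, i] = dissimilarity(descs[i], descs[j])
        z = linkage(squareform(dm), method="average")  # average-linkage tree
        return fcluster(z, t=n_clusters, criterion="maxclust")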

Clusters 1, 2 and 4 contained only facial images of the same person. The third cluster contained the false face detections (non-facial images), as expected, but it also included certain instances of the actor of cluster 1, due to a significant change in the person's pose.




Relevant Publications

N. Vretos, V. Solachidis and I. Pitas, "A Mutual Information Based Algorithm for Face Clustering", in Proc. of Int. Conf. on Multimedia and Expo (ICME 2006), Toronto, Ontario, Canada, 9-12 July, 2006.

P. Antonopoulos, N. Nikolaidis and I. Pitas, "Hierarchical Face Clustering Using SIFT Image Features", submitted to Proc. of IEEE Symposium on Computational Intelligence in Image and Signal Processing (CIISP 2007), Honolulu, HI, USA.


Research Projects

NM2 - “New media for a new millennium” (IST-004124), FP6

Pythagoras II - Funded by the Hellenic Ministry of Education in the framework of the Pythagoras II program


© 2006