Semantic Media Analysis


Anthropocentric Video Content Description

MPEG-7 has emerged as the standard for multimedia content description. As it is still at an early stage, it is evolving in a direction in which semantic content description can be implemented. Although many descriptors (Ds) and description schemes (DSs) provided by the MPEG-7 standard can help to express the semantics of a medium, grouping several MPEG-7 classes together can provide better results in video production and video analysis tasks.

Our Method

We provide a set of classes that extend the MPEG-7 standard so that it can handle video media data in a more uniform way. Several classes are proposed in this context, and we show that schemes of this kind can provide more flexible tools.

With these new descriptors we achieve:

An anthropocentric perspective for movies.
We introduce Descriptors and Description Schemes in order to handle low-level information in a better way and thus provide semantic entities.
The relations between objects within a movie are very informative for high-level information extraction.
Information such as “This actor is in this shot and smiling” can be stored in the proposed profile, and post-processing this information allows one to extract semantics for the shot.

Main Characteristics

Descriptors (Ds) and Description Schemes (DSs) that gather low-level information within their tags.
Tags that are selected to suit research areas such as face detection, object tracking, motion detection and facial expression extraction.
Organizing this low-level information produces meaningful objects from which high-level information can be extracted.

The Descriptors and Description Schemes

Class Name	Characterization
Movie Class	Container Class
Version Class	Container Class
Scene Class	Container Class
Shot Class	Container Class
Take Class	Container Class
Frame Class	Object Class
Sound Class	Container Class
Actor Class	Object Class
Object Appearance Class	Event Class
High Order Semantic Class	Container Class
Camera Class	Object Class
Camera Use Class	Event Class
Lens Class	Object Class
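
The container/object/event split in the table can be illustrated with a small data-structure sketch. This is not the proposed MPEG-7 profile itself: the field names (`expression`, `first_frame`, `last_frame`) and the helper `actors_with_expression` are hypothetical, chosen only to show how low-level tags attached to a shot might be post-processed into a statement such as “this actor is in this shot and smiling”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Actor:                      # Object Class in the table
    name: str


@dataclass
class ObjectAppearance:           # Event Class: an actor visible over a frame range
    actor: Actor
    first_frame: int              # hypothetical field
    last_frame: int               # hypothetical field
    expression: str = "neutral"   # hypothetical low-level tag


@dataclass
class Shot:                       # Container Class
    shot_id: int
    appearances: List[ObjectAppearance] = field(default_factory=list)


@dataclass
class Scene:                      # Container Class
    scene_id: int
    shots: List[Shot] = field(default_factory=list)


def actors_with_expression(shot: Shot, expression: str) -> List[str]:
    """Post-process low-level tags of one shot into a simple semantic query."""
    return sorted({a.actor.name for a in shot.appearances
                   if a.expression == expression})


# Illustrative data only.
shot = Shot(shot_id=12)
shot.appearances.append(ObjectAppearance(Actor("Actor A"), 100, 180, "smiling"))
shot.appearances.append(ObjectAppearance(Actor("Actor B"), 120, 180))
scene = Scene(scene_id=3, shots=[shot])
```

Here `actors_with_expression(shot, "smiling")` answers the example query from the text for a single shot.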

Anthropocentric View

Simple annotation can provide information about everything, but it cannot be generated automatically.
The annotation process is therefore subjective, as in all manual annotation, and also labor-intensive.
The proposed profile aims at providing support for combining computed low-level features into higher-level semantic entities.


Relevant Publications

N. Vretos, V. Solachidis and I. Pitas, "An Anthropocentric Description Scheme For Movies Content Classification And Indexing" , in Proc. of European Signal Processing Conf. (EUSIPCO 2005) , Antalya, Turkey, 4-8 September, 2005.

N. Vretos, V. Solachidis and I. Pitas, "An MPEG-7 Based Description Scheme For Video Analysis using Anthropocentric Video Content Descriptors", in Lecture Notes in Computer Science, Advances in Informatics: 10th Panhellenic Conf. on Informatics, PCI 2005 , vol. 3746 / 2005, pp. 725 - 734, Volos, Greece, 11-13 November, 2005.

Research Projects

NM2 - “New media for a new millennium” (IST-004124), FP6


Face Detection and Tracking

Accurate localization of faces in video frames is of high importance for applications such as video indexing. Face localization in movies is a difficult task because of variations in scale, pose and lighting conditions.

Our Method

A novel deterministic approach has been developed that applies face detection, forward tracking and backward tracking, using some predefined rules. From all the extracted candidates, a dynamic programming algorithm selects those that minimize a cost function.

Face detection:

Haar-like features are used to provide the first face candidates. For tracking purposes, a post-processing step is added to reduce the number of false alarms and to remove some of the background. A candidate is rejected if the number of its pixels fulfilling the criteria below on hue (h), saturation (s) and value (v) is under a certain threshold.

0 < h < 1 and 0.23 < s < 0.68 and 0.27 < v

The remaining candidates are replaced by the smallest bounding box containing the skin-like pixels. The detection is performed every 5 frames.
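
The post-processing step can be sketched as follows. This is a minimal illustration, assuming that h, s and v are normalized to [0, 1] (as produced by Python's standard `colorsys` module) and that a frame is a nested list `image[y][x]` of 8-bit RGB tuples. The function names and the `min_skin_pixels` parameter are hypothetical; the actual threshold value is not given in the text.

```python
import colorsys


def is_skin_like(r, g, b):
    """True if an 8-bit RGB pixel satisfies the HSV criteria quoted above.

    Assumes h, s, v are all normalized to [0, 1].
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return 0.0 < h < 1.0 and 0.23 < s < 0.68 and 0.27 < v


def refine_candidate(image, box, min_skin_pixels):
    """Reject or tighten one face candidate.

    image: nested list image[y][x] of (r, g, b) tuples.
    box:   (x0, y0, x1, y1), inclusive pixel coordinates.
    Returns the smallest box containing the skin-like pixels,
    or None if too few skin-like pixels are found.
    """
    x0, y0, x1, y1 = box
    skin = [(x, y)
            for y in range(y0, y1 + 1)
            for x in range(x0, x1 + 1)
            if is_skin_like(*image[y][x])]
    if len(skin) < min_skin_pixels:
        return None  # candidate rejected as a likely false alarm
    xs = [x for x, _ in skin]
    ys = [y for _, y in skin]
    return (min(xs), min(ys), max(xs), max(ys))
```

In a full pipeline this function would be applied to every candidate returned by the detector on each processed frame.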

Forward tracking process:

The Morphological Elastic Graph Matching (MEGM) tracking algorithm is used.
The tracking is initialized by the detection output.
While the detection is efficient for frontal faces, the tracking provides candidates for other possible poses.

Backward tracking process:

If a face is detected in the middle of a shot, no information is available about its localization in the previous video frames.
The same tracker (MEGM) is applied backwards.
The backward tracker is initialized with the labeled detections.
Detected, forward and backward tracked faces are now grouped in actor appearances (labels).

Costs

The node cost C is the distance between the center of the bounding box and the centroid of the skin-like pixels.
The transition cost combines:
- the overlap between two bounding boxes
- the ratio of the bounding box areas, to penalize large changes in area during tracking

Structure of the trellis

The path with the minimum cost through the trellis is selected using dynamic programming.
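
The selection over the trellis can be sketched as a Viterbi-style dynamic programming pass. The node cost follows the definition above. How the overlap and the area ratio are combined into one transition cost is not specified in the text, so the sum used in `transition_cost` is an assumption, as are all function names. Boxes are `(x0, y0, x1, y1)` tuples.

```python
import math


def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap(a, b):
    """Intersection over union of two boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def node_cost(box, skin_centroid):
    """Distance between the box center and the centroid of skin-like pixels."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    return math.hypot(cx - skin_centroid[0], cy - skin_centroid[1])


def transition_cost(a, b):
    """Assumed combination: low overlap and large area changes are penalized."""
    big = max(area(a), area(b))
    ratio = min(area(a), area(b)) / big if big > 0 else 0.0
    return (1.0 - overlap(a, b)) + (1.0 - ratio)


def best_path(frames):
    """frames[t] is a list of (box, skin_centroid) candidates at time t.

    Returns, for each t, the index of the candidate on the minimum-cost path.
    """
    cost = [node_cost(box, cen) for box, cen in frames[0]]
    back = []
    for t in range(1, len(frames)):
        new_cost, pointers = [], []
        for box, cen in frames[t]:
            options = [cost[i] + transition_cost(prev_box, box)
                       for i, (prev_box, _) in enumerate(frames[t - 1])]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[i_best] + node_cost(box, cen))
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final node.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]
```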



Relevant Publications

I. Cherif, V. Solachidis and I. Pitas, "A Tracking Framework for Accurate Face Localization", in Proc. of Int. Federation for Information Processing Conf. on Artificial Intelligence (IFIP AI 2006), Santiago, Chile, 21-24 August, 2006.

Research Projects

NM2 - “New media for a new millennium” (IST-004124), FP6


Face Clustering

Clustering could be considered as a form of unsupervised classification imposed over a finite set of objects. Its goal is to group sets of objects into classes, such that similar objects are placed in the same cluster, while dissimilar objects are placed in different clusters.

Human faces are some of the most important and frequently encountered entities in videos and can be considered as high-level semantic features. Face clustering in videos can be used in many applications, such as video indexing and content analysis, as a preprocessing step for face recognition, or as a basic step for extracting the principal cast of a feature-length movie.

Our Methods

A) Clustering based on Mutual Information: Joint entropy and mutual information are exploited in order to cluster face images produced by a Haar detector. We use the intensity images and define, for every image, the probability density function as the histogram of its intensities, normalized to sum to one. To calculate the joint entropy between two images, we construct a 2D histogram of 256 bins that takes into account the relative positions of intensities, so that two images are similar when the same intensities occur at the same spatial locations.
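
As a rough sketch of the quantities involved, the mutual information of two images can be computed from their marginal and joint intensity histograms as I(A;B) = H(A) + H(B) - H(A,B). The code below assumes that both face images have already been resized to the same dimensions and flattened into sequences of 8-bit intensities; that preprocessing, and the function names, are assumptions rather than details from the text.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in bits) of a histogram given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def mutual_information(img_a, img_b):
    """Mutual information between two equally sized intensity images.

    The joint histogram pairs the intensities found at the same pixel
    position, so images score high when the same intensities occur at the
    same spatial locations.
    """
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    n = len(img_a)
    h_a = entropy(Counter(img_a).values(), n)
    h_b = entropy(Counter(img_b).values(), n)
    h_ab = entropy(Counter(zip(img_a, img_b)).values(), n)
    return h_a + h_b - h_ab


def similarity_matrix(images):
    """Pairwise mutual information between all images."""
    return [[mutual_information(a, b) for b in images] for a in images]
```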

Problem Definition

Face clustering is the task of creating n subsets from a set of face images A.
We exploit the mutual information between the face images to create a similarity matrix, which is afterwards clustered.
Mutual information is shown to be a good measure of similarity between face images under varying lighting conditions and poses.
The movie context adds a new dimension to the problem, since no calibrated images are used as input.
Purpose of such an algorithm: defining the principal actors, automatic (not manually annotated) database search, registration, content analysis.

Clustering Process:

The clustering process is based on the Fuzzy C-Means (FCM) algorithm.
We provide the number of classes and the similarity matrix to the algorithm.
To use this algorithm, we treat every row of the aforementioned similarity matrix as a vector in an M-dimensional vector space over R equipped with the L2 norm.
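
A textbook Fuzzy C-Means iteration over the rows of a similarity matrix might look as follows. This is a generic sketch, not the implementation used in the work: the fuzzifier `m = 2`, the fixed iteration count, the random initialization and the function name are all assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Standard FCM with Euclidean (L2) distance.

    points: list of equal-length vectors (here, rows of the similarity matrix).
    Returns (memberships, centers); memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to one.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized per point.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])
    centers = []
    for _ in range(n_iter):
        # Update centers as membership-weighted means.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])
        # Update memberships from distances to the centers.
        for i in range(n):
            dists = [max(math.dist(points[i], c), 1e-12) for c in centers]
            for k in range(n_clusters):
                u[i][k] = 1.0 / sum((dists[k] / dj) ** (2.0 / (m - 1.0))
                                    for dj in dists)
    return u, centers
```

A hard assignment, if needed, can be obtained by taking the cluster with the largest membership for each row.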

Darker regions belong to the first actor and lighter ones to the second actor. The video sequence has four consecutive shots in the order FA-FA-SA-SA, where FA and SA denote the first and second actor, respectively.

B) Hierarchical Clustering: An algorithm is proposed for clustering face images found in feature-length movies and, more generally, in video sequences. A novel method for creating a dissimilarity matrix using SIFT image features is introduced. This dissimilarity matrix is used as input to a hierarchical average linkage clustering algorithm, which yields the final clustering result.
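
The average linkage step can be sketched on a precomputed dissimilarity matrix. The SIFT-based construction of that matrix is not reproduced here, and stopping at a given number of clusters is an assumption, since the stopping criterion is not stated in the text.

```python
def average_linkage(dissim, n_clusters):
    """Naive agglomerative clustering with average linkage.

    dissim: symmetric matrix (list of lists) of pairwise dissimilarities.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dissim))]

    def average_distance(a, b):
        # Mean dissimilarity over all cross-cluster pairs.
        return sum(dissim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair of clusters.
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

This version recomputes all cluster distances at every merge, which is adequate for a small number of detected faces but slow for large sets.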

The final result is found to be quite robust to significant scale, pose and illumination variations, encountered in facial images.

Clusters 1, 2 and 4 contained only facial images from the same person. The third cluster contained the false face detections (non-facial images) as we expected, but it also included certain instances of the actor in cluster 1, due to a significant change in the person's pose.


Relevant Publications

N. Vretos, V. Solachidis and I. Pitas, "A Mutual Information Based Algorithm for Face Clustering", in Proc. of Int. Conf. on Multimedia and Expo (ICME 2006), Toronto, Ontario, Canada, 9-12 July, 2006.

P. Antonopoulos, N. Nikolaidis and I. Pitas, “Hierarchical Face Clustering Using SIFT Image Features”, submitted in Proc. of IEEE Symposium on Computational Intelligence in Image and Signal Processing (CIISP 2007), Honolulu, HI , USA.

Research Projects

NM2 - “New media for a new millennium” (IST-004124), FP6

Pythagoras II - Funded by the Hellenic Ministry of Education in the framework of the program


Dialog Detection

Digital movie archives have become commonplace. Research on movie content analysis has been very active. A dialogue scene can be defined as a set of consecutive shots that contain conversations between people. However, a dialogue scene may include shots that do not contain any conversation or even any person.

Our Method

Our lab is active in dialogue detection. We investigate a novel framework for dialogue detection that is based on error-free indicator functions. An indicator function indicates, at each time instant, whether a particular actor is present.

Two dialogue detection rules are developed:

The first rule relies on the value of the cross-correlation function at zero time lag, which is compared to a threshold.
The second rule is based on the cross-power in a particular frequency band, which is also compared to a threshold.
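
The two statistics behind these rules can be sketched for a pair of binary indicator functions sampled at regular time instants. The normalizations, the use of a plain DFT, and the function names are assumptions. The thresholds and the direction of the comparison are learned from data and are not given in the text, so they are left out.

```python
import cmath


def zero_lag_cross_correlation(ind_a, ind_b):
    """Cross-correlation of two indicator sequences at zero time lag."""
    return sum(a * b for a, b in zip(ind_a, ind_b)) / len(ind_a)


def dft(x):
    """Naive discrete Fourier transform (non-negative frequencies only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def band_cross_power(ind_a, ind_b, f_low, f_high):
    """Cross-spectrum magnitude summed over a frequency band.

    Frequencies are in cycles per sample; bin k corresponds to k / n.
    """
    n = len(ind_a)
    spec_a, spec_b = dft(ind_a), dft(ind_b)
    return sum(abs(spec_a[k].conjugate() * spec_b[k])
               for k in range(len(spec_a))
               if f_low <= k / n <= f_high) / n
```

For two actors who alternate on screen with a regular period, the zero-lag value is low while the cross-power concentrates around the alternation frequency.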

A total of 25 dialogue scenes and another 8 non-dialogue scenes have been extracted from 6 movies: “Analyze That”, “Cold Mountain”, “Jackie Brown”, “Lord of the Rings I”, “Platoon”, and “Secret Window”. The total duration of the 33 recordings is 31 min and 7 sec. The probabilities of false alarm and detection are estimated by cross-validation, where 70% of the available scenes are used to learn the thresholds employed in the dialogue detection rules and the remaining 30% of the scenes are used for testing. Almost perfect dialogue detection is reported for every distinct threshold.

Database

Our lab has developed a dialogue database. In total, 33 recordings were extracted from the following six movies: “Analyze That”, “Cold Mountain”, “Jackie Brown”, “Lord of the Rings I”, “Platoon”, and “Secret Window”. The total duration of the 33 recordings is 31 min and 7 sec. The audio track was digitized in two-channel PCM at a sampling rate of 48 kHz with 16 bits per sample. 25 out of the 33 recordings correspond to dialogue scenes, while the remaining 8 do not contain any dialogue. For each recording, the ground truth information, that is, the actors that appear in the scene, is determined.


Relevant Publications

M. Kotti, C. Kotropoulos, B. Ziółko, I. Pitas, and V. Moschou, "A Framework for Dialogue Detection in Movies", in Int. Workshop on Multimedia Content Representation, Classification, and Security, Istanbul, Turkey, 2006.

Research Projects

MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)


Shot Boundary Detection

Indexing and retrieval of digital video is a very active research area. Temporal video segmentation is an important step in many video processing applications. The growing amount of digital video footage is driving the need for more effective methods for shot classification, summarization, efficient access, retrieval, and browsing of large video databases. Shot boundary detection is the first step towards further analysis of the video content.

Our Method

Two methods for shot boundary detection have been developed.

The first approach, which we developed for shot transition detection in the uncompressed image domain, is based on the mutual information and the joint entropy between two consecutive video frames.

Mutual information (MI) is a measure of the information transported from one frame to the next.
MI is used within the context of this method for detecting abrupt cuts, where the image intensity or color changes abruptly, leading to a low mutual information value.
Joint entropy is used for detecting fades.
- Fade-out: the visual intensity usually decreases to a black image, so the decreasing inter-frame joint entropy is used for detection.
- Fade-in: the increasing joint entropy is used for detection.
The entropy measure produces good results because it exploits the inter-frame information flow in a more compact way than frame subtraction.
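
A toy version of the cut detector can be written directly from these definitions. It assumes that frames are equally sized flat sequences of intensities and that a single fixed MI threshold is used; the actual thresholding strategy and the fade detector of the method are not reproduced. Function names are illustrative.

```python
import math
from collections import Counter


def _entropy(counts, total):
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def frame_pair_measures(prev, curr):
    """Return (mutual information, joint entropy) of two consecutive frames."""
    n = len(prev)
    h_prev = _entropy(Counter(prev).values(), n)
    h_curr = _entropy(Counter(curr).values(), n)
    h_joint = _entropy(Counter(zip(prev, curr)).values(), n)
    return h_prev + h_curr - h_joint, h_joint


def detect_cuts(frames, mi_threshold):
    """Indices t at which the MI between frames t-1 and t drops below the threshold."""
    cuts = []
    for t in range(1, len(frames)):
        mi, _ = frame_pair_measures(frames[t - 1], frames[t])
        if mi < mi_threshold:
            cuts.append(t)
    return cuts
```

Note that a pair of completely black frames has zero joint entropy, which is consistent with the joint entropy falling during a fade-out. It also has zero mutual information, so a practical detector has to treat such frames separately from cuts.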


Time series of the MI from the “ABC news” video sequence, showing abrupt cuts and one fade.
The joint entropy signal from the “CNN news” video sequence, showing a fade-out and a fade-in to the next shot.

The detection technique was tested on the TRECVID2003 video test set, which contains different types of shots as well as significant object and camera motion inside the shots. These entropy-based techniques for shot cut detection were experimentally shown to be very effective, since they produce false acceptance rates very close to zero.

The second approach to automated shot boundary detection uses singular value decomposition (SVD). We use SVD for its ability to derive a refined low-dimensional feature space from a high-dimensional raw feature space, in which pattern similarity can easily be detected.

The method relies on performing SVD on a matrix created from 3D color histograms of single frames.
After performing SVD, only the 10 largest singular values are preserved.
In order to detect the video shots, the feature vectors from SVD are processed using a dynamic clustering method.
To avoid false detections, every two consecutive clusters obtained by the clustering procedure are tested in a second phase for possible merging.
Merging is performed in two steps applied consecutively.
- The first step uses a ratio cosine similarity measure between clusters.
- The second step is based on statistical hypothesis testing using the von Mises-Fisher distribution, which can be considered as the equivalent of the Gaussian distribution for directional data.
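
The dimensionality reduction at the start of this pipeline can be sketched with NumPy. The number of histogram bins per channel, the choice of one row per frame, and the function names are assumptions. The dynamic clustering and the two merging tests are not implemented here; a plain cosine similarity between consecutive feature vectors is shown only to illustrate how the reduced features behave across a shot change.

```python
import numpy as np


def color_histogram(frame, bins=4):
    """Flattened, normalized 3D color histogram of an (H, W, 3) uint8 frame."""
    pixels = frame.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / pixels.shape[0]


def svd_features(frames, k=10, bins=4):
    """Project frame histograms onto the subspace of the k largest singular values."""
    a = np.stack([color_histogram(f, bins) for f in frames])  # one row per frame
    u, s, _ = np.linalg.svd(a, full_matrices=False)
    k = min(k, s.size)
    return u[:, :k] * s[:k]


def consecutive_cosine(features):
    """Cosine similarity between each pair of consecutive feature vectors."""
    a, b = features[:-1], features[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, 1e-12)
```

A sharp drop in the similarity between consecutive frames suggests a cut, whereas gradual transitions trace a smoother trajectory in the projected space.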


Projected frame histograms on the subspace defined by the fifth and sixth singular vectors reveal a dissolve pattern between two shots.
Fade detection in the sequence “basketball” visualized on the subspace defined by the first and second left singular vectors.

The method can detect cuts and gradual transitions, such as dissolves, fades and wipes. The detection technique was tested on TV video sequences having various types of shots and significant object and camera motion inside the shots. The experiments demonstrated that, by using the projected feature space we can efficiently differentiate between gradual transitions and cuts, pans, object or camera motion, while most of the methods based on histograms fail to characterize these types of video transitions.


Relevant Publications

Z. Cernekova, I. Pitas and C. Nikou, "Information theory-based shot cut/fade detection and video summarization", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82-91, January 2006.

Z. Cernekova, C. Kotropoulos and I. Pitas, "Video Shot Segmentation using Singular Value Decomposition", in Proc. of 2003 IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. III, pp. 181-184, Hong Kong, April 2003 (appears also in Proc. IEEE Multimedia and Expo 2003 (ICME), pp. 301-304, Baltimore, July 2003).

Z. Cernekova, C. Kotropoulos and I. Pitas, "Video Shot Boundary Detection using Singular Value Decomposition", in Proc. of 4th European Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS-2003), London, April 2003.

Research Projects

MOUMIR - "Models for Unified Multimedia Information Retrieval", RTN, EC

MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)

VISNET - European Network of Excellence, funded under the European Commission IST FP6 programme

COST211 - "Redundancy Reduction Techniques and Content Analysis for Multimedia Services"


Audio-visual Scene Change Detection

The ever-growing amount of digital information has created a critical need for the development of assisting data management algorithms. Scene change detection is employed in order to manage large volumes of audio-visual data. Typically it is a tool aiming to group audio-visual data into meaningful categories and thus provide fast browsing and retrieval capabilities.

Video shot and scene detection is essential to automatic content-based video segmentation. A video shot is a collection of video frames obtained through a continuous camera recording. Similar background and motion patterns typify the set of frames within a shot. Video shots usually lead to a far too fine segmentation in terms of the semantic audio-visual data representation. To obtain effective non-linear access to video information, the data are grouped into scenes, where scenes are defined as sequences of related shots chosen according to certain semantic rules.

Our Method

A novel scene change detection method has been developed in which:

audio and video information are processed and fused.
audio frames are projected onto a set of enhanced eigenframes that ‘discover’ the variations of background noise.
scene changes are found by comparison to a reference noise frame.
video information is used to align the audio-detected scene changes, reduce the false alarm rate and identify fading effects, which are typically used to separate scenes.

Audio and video information are integrated as follows:

if an audio scene change indication is ‘near’ a shot change, a scene cut is set; the remaining indications are rejected as false.
the valid indications are further validated by comparing various acoustic features.
the qualified scene cut is set to the location of the relevant shot change in order to correct audio-visual asynchrony.
video fade effects independently indicate scene changes.
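
The alignment rule can be sketched as follows, assuming that all events are given as timestamps in seconds and that ‘near’ means within a fixed tolerance. The tolerance value and the function name are hypothetical, and the additional validation with acoustic features is omitted.

```python
from bisect import bisect_left


def fuse_scene_changes(audio_changes, shot_changes, tolerance, fades=()):
    """Combine audio scene-change indications with video shot changes.

    An audio indication is kept only if a shot change lies within
    `tolerance` of it; the scene cut is then placed at that shot change.
    Video fades are added as scene changes on their own.
    """
    shots = sorted(shot_changes)
    scene_cuts = set(fades)
    for t in audio_changes:
        i = bisect_left(shots, t)
        neighbours = shots[max(0, i - 1):i + 1]  # closest shot change on each side
        if not neighbours:
            continue
        nearest = min(neighbours, key=lambda s: abs(s - t))
        if abs(nearest - t) <= tolerance:
            scene_cuts.add(nearest)  # snap to the shot change
    return sorted(scene_cuts)
```

Snapping the cut to the shot change, rather than keeping the audio timestamp, is what compensates for the audio-visual asynchrony mentioned above.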

The method has been tested on the well-established TRECVID2003 database. The results are very promising: the method attained higher recall and precision rates than all the contemporary algorithms it was compared against.

Example of a detected scene change


Relevant Publications

M. Kyperountas, Z. Cernekova, C. Kotropoulos, M. Gavrielides, and I. Pitas, “Audio PCA in a novel multimedia scheme for scene change detection”, in Proc. of ICASSP 2004, Montreal, May 2004.

M. Kyperountas, Z. Cernekova, C. Kotropoulos, M. Gavrielides, and I. Pitas, “Scene change detection using audiovisual clues”, in Proc. of Norwegian Conference on Image Processing and Pattern Recognition (NOBIM 2004), Stavanger, Norway, 27-28 May 2004.

M. Kyperountas, C. Kotropoulos and I. Pitas, “Enhanced eigen-audioframes for audiovisual scene change detection”, IEEE Transactions on Multimedia, accepted in 2006.

Research Projects

MOUMIR - "Models for Unified Multimedia Information Retrieval", RTN, EC

MUSCLE - “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752)

VISNET - European Network of Excellence, funded under the European Commission IST FP6 programme
