Speech Segmentation

Speech segmentation is the determination of the beginning and ending boundaries of acoustic units. It is an important subproblem of speech recognition and often overlaps with related problems such as voice activity detection (VAD), endpoint detection, and text segmentation. Generally, segmentation is divided into two levels:

  • Lexical segmentation: decomposition of spoken language into smaller lexical segments (e.g. paragraphs, sentences, phrases, words, syllables)
  • Phonemic segmentation: breaking down and classifying the sound signal at the lowest level into a string of acoustic elements (phones), each of which represents a distinct target configuration of the vocal tract (regarding articulation as well as form of excitation)

Research at AIIA focuses on context-free automatic segmentation. By discounting linguistic constraints (context, grammar, semantic information), it is possible to develop or assist multilingual applications, concatenative speech synthesis systems, speaker segmentation, speech transcription in computer-aided systems, etc.

Conventional speech detection and segmentation systems that follow energy-based approaches work relatively well only at high signal-to-noise ratios (SNR) and for known stationary noise, so they are ineffective on real-world recordings, where speakers tend to leave artefacts such as breathing/sighing, mouth clicks, teeth chatters, and echoes. Statistical methods are more robust, but there are issues when working with small sample sizes.

|                | Energy methods | Statistical methods |
|----------------|----------------|---------------------|
| Implementation | Rely on energy thresholds and heuristics to identify acoustic changes | Speech and noise statistics. Use a decision rule. |
| Requirements   | Fast, computationally efficient. Online - real time processing. | Slow, computationally intensive. Offline applications. |
| Performance    | Misclassify non-stationary noise as speech activity, can not identify unvoiced speech segments like fricatives satisfactorily | Good precision even in low SNRs, robust in non-stationary noise. |
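
To make the energy-based column concrete, the following is a minimal Python sketch of a threshold detector. It is an illustration only, not the system evaluated on this page. The function names, the 160-sample frame length, the 6 dB margin, and the assumption that the first few frames contain only noise are all hypothetical choices.

```python
import math

def frame_energies_db(samples, frame_len):
    """Short-time log energy (dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # floor avoids log(0)
    return energies

def energy_vad(samples, frame_len=160, margin_db=6.0, noise_frames=5):
    """Flag frames whose energy exceeds the noise floor by margin_db.

    Assumption: the first `noise_frames` frames contain no speech, so their
    mean energy can serve as the noise-floor estimate.
    """
    energies = frame_energies_db(samples, frame_len)
    head = energies[:noise_frames]
    if not head:
        return []
    noise_floor = sum(head) / len(head)
    return [e > noise_floor + margin_db for e in energies]
```

Because the threshold is tied to a fixed noise-floor estimate, a detector of this kind cannot tell a loud breath or click from speech, which is the weakness noted in the comparison above.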

 

Our Method

A novel automatic acoustic change detection algorithm based on Bayesian statistics has been developed at AIIA. Common statistical methodology in speech segmentation embraces binary decision-making strategies. Under certain assumptions, a time instant t, the boundary between sub-windows X and Y, is a change point if the window Z=X+Y is better modelled with two separate distributions, one for each of its sub-windows X and Y, than with a single distribution.
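
As a rough illustration of this binary decision, the sketch below compares the two hypotheses through a BIC difference. It rests on simplifying assumptions: each window is modelled with a univariate Gaussian rather than the distribution used in our method, and the names `gaussian_loglik`, `delta_bic` and the penalty weight `lam` are hypothetical.

```python
import math

def gaussian_loglik(x):
    """Maximised log-likelihood of samples x under a univariate Gaussian."""
    n = len(x)
    mean = sum(x) / n
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)  # ML variance, floored
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def delta_bic(z, t, lam=1.0):
    """BIC difference between 'two models split at t' and 'one model' for z.

    Positive values favour a change point at index t.
    """
    x, y = z[:t], z[t:]
    # Likelihood gained by fitting X and Y separately instead of Z as a whole.
    gain = gaussian_loglik(x) + gaussian_loglik(y) - gaussian_loglik(z)
    n_params = 2  # mean and variance of one univariate Gaussian
    penalty = 0.5 * n_params * math.log(len(z))  # cost of the extra model
    return gain - lam * penalty
```

A window made of a quiet half followed by a loud half gives a large positive value at the midpoint, while a homogeneous window gives a negative value because only the penalty term remains.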

Figure: Hypothesis testing

We model speech samples with a two-sided generalised Gamma distribution (GΓD). Using a computationally inexpensive maximum likelihood (ML) approach, we employ the Bayesian Information Criterion (BIC) to identify phoneme boundaries in noisy speech.
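
The density itself is not written out on this page. One common parameterisation of the two-sided GΓD, assumed here, is f(x) = γ β^η / (2 Γ(η)) · |x|^(ηγ-1) · exp(-β |x|^γ), which reduces to a Gaussian for γ=2, η=0.5 and to a Laplacian for γ=1, η=1. The sketch below evaluates the log-likelihood of a window under that assumed form; the function names are hypothetical, and it does not perform the ML parameter estimation mentioned above.

```python
import math

def ggd_logpdf(x, gamma, eta, beta):
    """Log-density of the (assumed) two-sided generalised Gamma distribution.

    f(x) = gamma * beta**eta / (2 * Gamma(eta))
           * |x|**(eta*gamma - 1) * exp(-beta * |x|**gamma)
    """
    ax = max(abs(x), 1e-12)  # floor keeps log() defined for zero samples
    return (math.log(gamma) + eta * math.log(beta) - math.log(2.0)
            - math.lgamma(eta)
            + (eta * gamma - 1.0) * math.log(ax)
            - beta * ax ** gamma)

def ggd_loglik(samples, gamma, eta, beta):
    """Log-likelihood of a window of samples for fixed shape/scale parameters."""
    return sum(ggd_logpdf(s, gamma, eta, beta) for s in samples)
```

With the parameters fixed, `ggd_loglik` could take the place of a Gaussian log-likelihood inside a BIC comparison of adjacent windows.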

Figure: Generalised Gamma distribution

The method is based on DISTBIC (Delacourt & Wellekens, 2000), an offline multi-pass algorithm that first uses a distance measure (e.g. the Kullback-Leibler (KL) distance or the Generalised Likelihood Ratio (GLR)) to identify candidate change points, and then applies BIC to adjacent windows whose extents are determined dynamically by the candidate points of the first step.
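
The two passes can be sketched as follows. This is a simplified illustration under stated assumptions, not a faithful reimplementation of DISTBIC: windows are modelled with univariate Gaussians, a plain local-maximum rule stands in for the peak selection of the first pass, and the function names, window length, step and penalty weight `lam` are hypothetical.

```python
import math

def gaussian_loglik(x):
    """Maximised log-likelihood of x under a univariate Gaussian."""
    n = len(x)
    mean = sum(x) / n
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def glr_distance(x, y):
    """Log generalised likelihood ratio between two adjacent windows."""
    return gaussian_loglik(x) + gaussian_loglik(y) - gaussian_loglik(x + y)

def candidate_points(signal, win=100, step=10):
    """Pass 1: local maxima of the GLR distance between sliding window pairs."""
    centres, dists = [], []
    for t in range(win, len(signal) - win + 1, step):
        centres.append(t)
        dists.append(glr_distance(signal[t - win:t], signal[t:t + win]))
    return [centres[i] for i in range(1, len(dists) - 1)
            if dists[i] > dists[i - 1] and dists[i] >= dists[i + 1]]

def validate_with_bic(signal, candidates, lam=1.0):
    """Pass 2: keep a candidate only if BIC prefers two models around it.

    The left window starts at the last accepted point, so rejecting a
    candidate merges its window with the next one.
    """
    kept = []
    bounds = [0] + candidates + [len(signal)]
    for i in range(1, len(bounds) - 1):
        left = kept[-1] if kept else 0
        t, right = bounds[i], bounds[i + 1]
        x, y = signal[left:t], signal[t:right]
        # Penalty for one extra Gaussian (2 parameters): 0.5 * 2 * log(N).
        delta = glr_distance(x, y) - lam * math.log(len(x) + len(y))
        if delta > 0:
            kept.append(t)
    return kept
```

Calling `validate_with_bic(signal, candidate_points(signal))` returns the accepted change points as sample indices. The second pass works on windows of variable length, which is why small-sample behaviour of the model selection criterion matters.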

Figure: DISTBIC process

Instead of a Gaussian distribution (GD), we model adjacent signal segments with a GΓD, which offers better representation power for both speech and silence/noise. We also explore alternative criteria such as BICC and ABF2, which perform better with small sample sizes.

 

Early experiments on the M2VTS and TIMIT speech corpora give evidence that using ABF2 instead of BIC and modelling the signal with GΓD instead of GD yield better segmentation accuracy.

Results
Figure: Overall system evaluation using the F1 measure for the M2VTS dataset
Figure: Overall system evaluation using the F1 measure for the TIMIT dataset
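
For readers unfamiliar with the measure, F1 is the harmonic mean of precision and recall. The sketch below shows one way it could be computed for boundary detection. The matching rule is an assumption, since the page does not describe the evaluation protocol: each detected boundary is greedily paired with the closest unmatched reference boundary within a tolerance. The function name `boundary_f1` and the tolerance value are hypothetical.

```python
def boundary_f1(detected, reference, tolerance):
    """F1 score for boundary detection with one-to-one greedy matching."""
    unmatched = sorted(reference)
    hits = 0
    for d in sorted(detected):
        # Pair with the closest reference boundary that is still unmatched.
        best = min(unmatched, key=lambda r: abs(r - d), default=None)
        if best is not None and abs(best - d) <= tolerance:
            hits += 1
            unmatched.remove(best)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, with boundaries in milliseconds, `boundary_f1([100, 520, 900], [110, 500, 700, 1200], 20)` counts two hits, giving a precision of 2/3, a recall of 1/2 and an F1 of 4/7.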

 

Relevant Publications

G. Almpanidis and C. Kotropoulos, "Voice activity detection using the generalized Gamma distribution", in Proc. 4th Panhellenic Artificial Intelligence Conf. (SETN-06), Heraklion, Greece, May 19-20, 2006.

G. Almpanidis and C. Kotropoulos, "Voice activity detection with generalized Gamma distribution", in Proc. 2006 IEEE Int. Conf. on Multimedia and Expo (ICME 2006), Toronto, Ontario, Canada, July 9-12, 2006.

G. Almpanidis and C. Kotropoulos, "Phoneme segment boundary detection based on the generalized Gamma distribution", in Proc. 2006 Int. Symposium on Industrial Electronics (ISIE 2006), Montreal, Canada, July 9-13, 2006.

Research Projects

Herakleitos - (Operational programme for Education and Initial Vocational Training - 3rd Community Support Framework): Processing of Multimedia Signals

 

© 2006