A Comparative Study of Robust Feature Coefficients with Applications to Automatic Speaker Recognition


 

Presentation of feature extraction techniques

The selection of the best parameter representation of acoustic data is an important task in designing a speaker recognition
system. The usual objectives in selecting a representation of acoustic data are to compress the speech data by eliminating information not
pertinent to the phonetic analysis and to enhance those aspects of the signal that contribute significantly to the detection of the
phonetic differences.

In these experiments we use two kinds of signal representation: cepstral coefficients and mel-frequency cepstral coefficients. The experiments
were conducted on the M2VTS database, which comprises 37 adult male and female speakers and provides 5 shots for each
person. During each shot, the speakers were asked to count from '0' to '9' in their native language (most of them are French-speaking).

Input utterance: this waveform shows an example of a speech signal versus time for
shot 01 of speaker BS. The speech sequence is split into 10 voiced
segments of variable duration.
 

The processing sequence used to extract the cepstral coefficients is shown in the block diagram.
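As an illustration only, the following minimal Python sketch shows one common way to obtain cepstral coefficients from a single frame: the FFT-based real cepstrum (window, FFT, log magnitude, inverse FFT). Whether the system described here uses this route or an LPC-derived cepstrum is not stated, so the function name real_cepstrum, the Hamming window, and the choice to skip the zeroth coefficient are all assumptions. The defaults of 512 FFT bins and 12 coefficients follow the experimental parameters.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512, n_coeffs=12):
    """First n_coeffs real-cepstrum coefficients of one frame (c1..cN, c0 skipped).

    Assumed pipeline (not confirmed by the source): Hamming window -> FFT ->
    log magnitude -> inverse FFT.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n=n_fft)      # zero-pads the frame to n_fft
    log_mag = np.log(np.abs(spectrum) + 1e-10)     # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_mag, n=n_fft)
    return cepstrum[1:n_coeffs + 1]

# Usage on a synthetic 30 msec frame at 12000 Hz (360 samples).
rng = np.random.default_rng(0)
example_frame = rng.standard_normal(360)
coeffs = real_cepstrum(example_frame)
```

Skipping the zeroth coefficient makes the features insensitive to a constant gain applied to the frame, since a gain only shifts the log spectrum by a constant.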
 
The processing sequence used to extract the mel-frequency cepstral features includes the following steps:
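A commonly used sequence, given here as an assumption that may differ from the original diagram, is: windowing, FFT power spectrum, triangular mel-spaced filterbank, logarithm, and discrete cosine transform. The Python sketch below follows that sequence with the parameter values of these experiments (12000 Hz processing frequency, 512 FFT bins, 40 triangular filters, 12 coefficients). The mel-scale formula, the Hamming window, the filter edge placement, and all function names are illustrative choices rather than details taken from the source.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel-scale approximation (an assumption here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, fs=12000):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):              # rising slope
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):             # falling slope, peak at centre
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank

def mfcc(frame, fs=12000, n_fft=512, n_filters=40, n_coeffs=12):
    """Mel-frequency cepstral coefficients c1..cN of one frame (c0 skipped)."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_energies = np.log(energies + 1e-10)        # small offset avoids log(0)
    # DCT-II of the log filterbank energies, keeping orders 1..n_coeffs.
    n = np.arange(n_filters)
    k = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return basis @ log_energies

# Usage on a synthetic 30 msec frame at 12000 Hz (360 samples).
rng = np.random.default_rng(0)
example_frame = rng.standard_normal(360)
mel_coeffs = mfcc(example_frame)
```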
 



 

Performance Results and Conclusions

To estimate the performance of the feature extraction techniques, we use the experimental parameters shown in the table:
 
 

Parameter                              Value
Sampling Frequency                     48000 Hz
Processing Frequency                   12000 Hz
Frame Length                           30 msec
Overlap                                20 msec
Number of FFT bins                     512
Number of Cepstral Coefficients        12
Number of Mel-Cepstral Coefficients    12
Number of triangular filters           40
Size of Codebooks                      32
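The frame parameters in the table translate into sample counts as sketched below. This is a minimal Python example that assumes "Overlap" means the duration shared by consecutive frames (so the frame shift is 30 - 20 = 10 msec) and that framing is carried out at the 12000 Hz processing frequency; the helper names are illustrative only.

```python
def frame_layout(fs_hz=12000, frame_ms=30, overlap_ms=20):
    """Return (frame length, frame shift) in samples."""
    frame_len = fs_hz * frame_ms // 1000
    overlap = fs_hz * overlap_ms // 1000
    hop = frame_len - overlap          # shift between consecutive frame starts
    return frame_len, hop

def num_frames(n_samples, frame_len, hop):
    """Number of complete frames that fit in a signal of n_samples samples."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop

decimation = 48000 // 12000            # sampling -> processing frequency
frame_len, hop = frame_layout()
frames_per_second = num_frames(12000, frame_len, hop)

print(decimation, frame_len, hop, frames_per_second)  # prints: 4 360 120 98
```

Under these assumptions each 360-sample frame is zero-padded to the 512-point FFT, and one second of speech yields 98 complete frames.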
 

During the experiments, two kinds of classification error rate were measured. The first is the identification error rate, expressed as a percentage,
in a closed set of speakers. The table below shows the exact identification error percentage for 8 different shot combinations.
 
 

                                  (%) Recognition Error
Training Shots    Testing Shot    Cepstral Coefficients    Mel-Cepstral Coefficients
1,2,3             4               13.513514                8.108108
1,2,3             5               5.405405                 5.405405
1,3,4             2               0.000000                 5.405405
1,3,4             5               5.405405                 2.702703
2,3,4             1               5.405405                 5.405405
2,3,4             5               8.108108                 2.702703
1,2,4             3               0.000000                 2.702703
1,2,4             5               8.108108                 2.702703
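Every entry in the table is a multiple of 100/37, which is consistent with one identification trial per speaker for each testing shot. That reading is an assumption, not something the text states explicitly. Under it, the Python sketch below converts each percentage back into a number of misidentified speakers and pools the counts over the 8 combinations; the variable and function names are illustrative.

```python
N_SPEAKERS = 37  # speakers in the M2VTS database, per the text

# (training shots, testing shot): (cepstral error %, mel-cepstral error %)
ERRORS = {
    ("1,2,3", 4): (13.513514, 8.108108),
    ("1,2,3", 5): (5.405405, 5.405405),
    ("1,3,4", 2): (0.000000, 5.405405),
    ("1,3,4", 5): (5.405405, 2.702703),
    ("2,3,4", 1): (5.405405, 5.405405),
    ("2,3,4", 5): (8.108108, 2.702703),
    ("1,2,4", 3): (0.000000, 2.702703),
    ("1,2,4", 5): (8.108108, 2.702703),
}

def misidentified(error_percent, n_trials=N_SPEAKERS):
    """Number of wrong identifications implied by an error percentage,
    assuming n_trials trials (one per speaker)."""
    return round(error_percent * n_trials / 100.0)

cepstral_counts = [misidentified(c) for c, _ in ERRORS.values()]
mel_counts = [misidentified(m) for _, m in ERRORS.values()]

# Pooled error rate over all combinations, in percent.
total_trials = len(ERRORS) * N_SPEAKERS
cepstral_mean = 100.0 * sum(cepstral_counts) / total_trials
mel_mean = 100.0 * sum(mel_counts) / total_trials
```

With 37 trials per combination, a single misidentified speaker changes the error rate by about 2.7 percentage points, so differences of one step between the two feature sets should be read with that granularity in mind.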
 
 
 
 
 
The second kind of recognition error consists of the False Acceptance (FA) and False Rejection (FR) rates in an open set of speakers, based on the
Brussels protocol training and testing procedures. For both cepstral and mel-cepstral parameters, the Receiver Operating Characteristic (ROC) curves are plotted
in Figures I and II.
 

 Figure I                  Figure II
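For reference, FA and FR rates at a given decision threshold, and the list of operating points that make up an ROC curve, can be computed as in the minimal Python sketch below. It assumes a similarity score where higher means "more likely the claimed speaker"; with a distance-like score, such as a codebook distortion, the comparisons would be reversed. The scores used in the example are made up for illustration and are not results from these experiments.

```python
import numpy as np

def fa_fr_rates(client_scores, impostor_scores, threshold):
    """False acceptance and false rejection rates at one threshold.

    A claim is accepted when its score is >= threshold (assumed convention).
    """
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    fr = float(np.mean(client < threshold))      # true claims rejected
    fa = float(np.mean(impostor >= threshold))   # false claims accepted
    return fa, fr

def roc_points(client_scores, impostor_scores):
    """(threshold, FA, FR) for every distinct score used as a threshold."""
    thresholds = np.unique(np.concatenate([client_scores, impostor_scores]))
    return [(float(t), *fa_fr_rates(client_scores, impostor_scores, t))
            for t in thresholds]

# Hypothetical scores, for illustration only.
client = [0.9, 0.8, 0.4]
impostor = [0.7, 0.3, 0.2, 0.1]
fa, fr = fa_fr_rates(client, impostor, threshold=0.5)
points = roc_points(client, impostor)
```

Sweeping the threshold trades one error for the other: a higher threshold lowers FA and raises FR, which is what the ROC plot visualises.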
 
Observing both performance results in the Brussels protocol case leads to the following concluding remarks: