ESCA WORKSHOP ON AUTOMATIC SPEAKER RECOGNITION, IDENTIFICATION AND VERIFICATION
Neural Autoassociators for Phoneme-Based Speaker Verification
L. Lastrucci, M. Gori, and G. Soda
Dipartimento di Sistemi e Informatica, Università di Firenze
Via di Santa Marta 3 - 50139 Firenze - Italy
Tel. +39 (55) 479.6265 - Fax +39 (55) 479.6363
e-mail : [email protected]
Abstract. In the last few years, connectionist models, mainly based on multilayered perceptrons, have been used for speaker identification. To the best of our knowledge, however, no significant results have been obtained for speaker verification.
In this paper, we propose a connectionist phoneme-based speaker verification model and give experimental results for assessing its performance. Neural autoassociators are suggested for capturing the speaker's identity. They are trained to reproduce, at the output layer, the speech frames presented at the input. Adequate threshold criteria are proposed for performing rejection. Verification performance was evaluated on the DARPA-TIMIT database for the /ae/ and /aa/ phonemes in continuous speech, using different thresholds and preprocessing schemes, with very promising results.
1. Introduction
There is not much knowledge in the literature about which aspects of a speech signal identify a speaker. However, machines can exceed humans on short test utterances and large numbers of speakers. Commonly, a machine's accuracy decreases when mimics act as impostors, while humans appear to handle mimics better than machines do, easily perceiving when a voice is being mimicked. Certain voices are more easily mimicked than others, which lends further evidence to the theory that different acoustic cues are used to distinguish different voices. For speaker recognition, the acoustic aspects that characterize the differences between voices are obscure and difficult to separate from signal aspects that reflect segment recognition. Unlike speech recognition, where decisions are made for every phone or word, speaker recognition requires only one decision, based on part or all of a test utterance, and there is no simple set of acoustic cues that reliably distinguishes speakers. Furthermore, even if only one decision is necessary, the set of choices can vary widely. Speaker recognizers typically utilize long-term statistics averaged over whole utterances, or exploit analyses of specific sounds.
In this paper, we use connectionist models for speaker authentication using only phonemes. First attempts at using connectionist models in this field were oriented to automatic speaker identification (ASI). Multilayered perceptrons turn out to be adequate architectures for this task. Basically, they have to perform class discrimination by exploiting large speech databases collecting utterances of different speakers. There is both theoretical and practical evidence that multilayered perceptrons perform very well in pure discrimination problems. In these networks, class discrimination is carried out by suitable separation surfaces in the pattern space, which are drawn by the learning algorithm under the sole requirement of discriminating the given examples. It turns out that the resulting separation surfaces are not closed and do not "envelop" the examples by capturing their probability distribution. As a result, multilayered networks (MLNs) are not adequate for speaker verification, where the speaker's identity must be checked without knowing all the speakers in advance. It seems very difficult to implement a reliable rejection criterion using these networks. Rejection criteria based on the error with respect to the target are not very meaningful since, because of the open separation surfaces, cases can be found where the error is very low whereas the associated speaker utterance may give clear evidence of a "false" speaker.
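The problem with open separation surfaces can be made concrete with a small numerical sketch. The example below (hypothetical 2-D features and a plain logistic discriminator trained with numpy, standing in for an MLN classifier) shows that a point lying far from every training cluster still receives an extreme, confident class score, so the output error gives no basis for rejecting an impostor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "speaker" clusters in a hypothetical 2-D feature space.
A = rng.normal([-2.0, 0.0], 0.5, size=(100, 2))
B = rng.normal([+2.0, 0.0], 0.5, size=(100, 2))
X = np.vstack([A, B])
y = np.array([0] * 100 + [1] * 100)

# Logistic discriminator trained by gradient descent: its separation
# surface is an open hyperplane, not a closed region around the data.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(X)
    b -= 0.1 * g.mean()

def score(x):
    """Posterior-like score for class B (speaker 2)."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(x) @ w + b)))

print(score([2.0, 0.0]))    # genuine speaker B: close to 1
print(score([50.0, 50.0]))  # impostor far from BOTH clusters: also close to 1
```

Since the decision surface extends to infinity, any pattern on the "speaker B" side of the hyperplane, however implausible, is classified with near-zero error: exactly the failure mode the text describes.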
In order to overcome this problem, in this paper we suggest using MLNs as autoassociators. In this learning mode, the output is forced to reproduce the input, and rejection criteria can be based on how well the input is approximated by the output. As pointed out in the following section, autoassociators are very well-suited to rejecting never-seen patterns and, therefore, are adequate for speaker verification.
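The reconstruction-error rejection criterion can be sketched in a few lines. The example below is a simplified stand-in for the trained network: instead of backpropagation it fits the optimal linear autoassociator in closed form (projection onto the top principal directions, which is what a linear bottleneck network converges to), on hypothetical synthetic "speech frames" for a client speaker. Frames from the client reconstruct well; frames off the client's subspace do not, so a simple threshold on the reconstruction error performs verification:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical client frames: 8-D features concentrated near a 2-D
# subspace (a stand-in for real cepstral frames of one speaker).
basis = rng.normal(size=(2, 8))
client = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 8))

# Closed-form linear autoassociator with a 2-unit bottleneck: the best
# rank-2 linear map reproducing the input is the projection onto the
# top 2 principal directions of the client data.
mean = client.mean(axis=0)
_, _, Vt = np.linalg.svd(client - mean, full_matrices=False)
P = Vt[:2].T @ Vt[:2]  # projector onto the learned client subspace

def recon_error(frames):
    """Mean squared input/output discrepancy of the autoassociator."""
    x = np.asarray(frames) - mean
    return float(np.mean((x @ P - x) ** 2))

# Accept if the reconstruction error stays below a threshold fixed on
# client material (the factor 3 here is purely illustrative).
threshold = 3 * recon_error(client)

impostor = rng.normal(size=(100, 8))      # frames off the client subspace
print(recon_error(client) < threshold)    # client accepted
print(recon_error(impostor) > threshold)  # impostor rejected
```

Unlike the open hyperplane of a discriminatively trained MLN, the low-error region here is tied to where the client's data actually lies, so never-seen patterns naturally produce large errors and can be rejected.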
We give experimental evidence for these claims on the DARPA-TIMIT speech database for the /ae/ and /aa/ phonemes in continuous speech. We experimented with different thresholds and preprocessing schemes, with very promising results.
2. Multilayered Autoassociators for Speaker Verification
2.1 The Multilayered Autoassociator
Multilayered networks working as autoassociators are forced to reproduce the input at the output during the training phase. Autoassociators have been suggested mainly for image compression: the compression is performed at the hidden layer, and the information represented by the hidden units can subsequently be reproduced by the computation carried out at the last layer. In this paper we exploit another nice theoretical property of autoassociators, which can easily be understood by analyzing the separation surfaces that are drawn