Research on biological voice print recognition based on convolution-delay neural network and multi-head attentional statistical pool

Hongbing Zhang1, Maolin Ma1*

  1. School of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang, Liaoning, 110854, China

Abstract: Currently, neural network-based biometric voiceprint recognition technology has gradually matured, yet it still struggles to extract deep-level features that fully characterize speaker identity. Moreover, mainstream biometric voiceprint recognition models underperform in complex scenarios such as noisy environments and short-duration speech. To address these issues, this study proposes a text-independent biometric voiceprint recognition system based on a Convolutional-Time Delay Deep Neural Network (CNN-TDNN) and investigates its robustness in challenging acoustic environments, including additive background noise and short-duration speech scenarios. Experimental results on Chinese speech corpus test data demonstrate that the system achieves an accuracy of 95.382%, an equal error rate (EER) of 0.86, and a minimum detection cost function (minDCF) value of 0.0686, validating the effectiveness of the proposed algorithm.

Keywords: biological voice print recognition; background noise; multi-head attention mechanism; convolutional-delay neural network

  1. Introduction

With the rapid development of artificial intelligence technology, traditional identity verification methods such as character passwords and ID cards, which have weaknesses in privacy, uniqueness, and convenience, will eventually be phased out. Research into new identity verification methods has therefore become necessary [1]. As related studies have progressed, new identity authentication methods in the field of biometric recognition, such as fingerprint, iris, vein, and facial recognition, have gradually emerged. These tools have been widely applied in areas such as national defense security, financial insurance, and public protection, and have been well received [2]. However, because the hardware is costly and the user must be close to the device, these biometric technologies are not accessible to users who are constrained by specific scenarios [3].

In this context, research into more convenient and secure identity verification methods has become extremely necessary. In daily life, the most direct form of communication is voice interaction. Voice not only conveys semantic information but also carries characteristic information of each speaker. These identity details vary from person to person and are unique biological traits that people are born with. The human vocal system, consisting of the lungs, vocal cords, and vocal tract, produces distinct acoustic patterns that are influenced by both physiological factors (e.g., vocal tract length, nasal cavity shape) and behavioral characteristics (e.g., speaking rate, intonation). This biological uniqueness forms the foundation of voiceprint recognition technology. The biometric recognition technology based on these characteristics is called voiceprint recognition.

Biometric voiceprint recognition, also known as speaker recognition, is a technology that uses raw human speech for identity authentication. It rests on the fact that speech carries both semantic content and acoustic features. Unlike fingerprints, which have visible and tangible ridges, voiceprints have no physical texture; the term is borrowed from fingerprints by analogy. From a biological perspective, the uniqueness of voiceprints stems from the complex interplay between anatomical structures (e.g., vocal fold thickness, mouth cavity volume) and learned speaking habits, making each individual’s voice highly distinctive.

According to its role in practical application scenarios, biometric voiceprint recognition can be divided into two tasks: speaker verification (SV) and speaker identification (SI) [5]. The former determines whether a given voice matches a specified speaker, and the latter identifies which registered voice in the system best matches the given voice. The biological basis for this distinction lies in the fact that SV focuses on verifying consistent vocal traits over time, while SI must discriminate between subtle physiological differences among multiple speakers.

According to whether the speaker’s speech has the same text during training and testing, biometric voiceprint recognition can be divided into text-independent and text-dependent recognition [6]. Text-independent recognition does not depend on what the speaker says and aims to identify the speaker from arbitrary speech. In contrast, text-dependent recognition relies on fixed texts, requiring that the enrolled speech and the speech to be verified have identical textual or even phonetic content. From a biological standpoint, text-independent systems must extract more fundamental vocal characteristics that remain stable across different utterances, such as formant frequencies and glottal pulse shapes, whereas text-dependent systems can leverage consistent articulatory patterns associated with specific phrases.

In the application of biometric voiceprint recognition, whether during speaker voice input or while waiting for speaker verification, various complex acoustic scenarios may be encountered. This exposes numerous significant research issues, such as speaker voice information being mixed with substantial background noise, the speech segments provided by speakers being relatively short, and channel mismatches caused by different recording devices for training and testing voices. These problems brought about by complex acoustic scenarios pose significant obstacles to the research of biometric voiceprint recognition, making it difficult to adapt to a wide range of real-life scenarios[7]. Biological factors compound these challenges, as vocal characteristics may vary with health conditions (e.g., colds), emotional states, or aging, requiring robust algorithms that can distinguish between permanent physiological traits and temporary variations.

Regarding system robustness under background noise, previous research has mainly focused on speech enhancement techniques and on extracting robust acoustic features. The former primarily includes spectral subtraction (SS), minimum mean-square error (MMSE) estimation, and Wiener filtering. For the latter, Satya Dharannipragada et al. proposed a minimum variance distortionless response parameter [8], which better models the envelope of short-term vowel spectra and thereby enhances the robustness of biometric voice recognition. Gao et al. constructed biometric voice recognition features based on normalized linear prediction power spectra, the core of which is the logarithmic subband energy obtained by passing the spectral envelope of the speech signal through a gamma filter bank. Simulation results show that, compared to traditional voice features, these features have a certain degree of robustness to noisy environments [9]. In 2021, Chen et al. started from speaker feature space denoising and used the partial least squares algorithm to directly derive the mapping relationship between noisy and clean signals, proposing a spatially noise-adaptive robust biometric voice recognition algorithm [10]. Experiments have shown that this algorithm has good compensation effects for various signal-to-noise ratios and types of noise. These approaches must account for biological realities, such as the fact that certain vocal frequencies are more susceptible to masking by noise than others due to the natural energy distribution of human speech.

In recent years, the academic community has carried out a series of studies on the short-speech problem. D. Snyder used a time-delay neural network (TDNN) to construct the x-vector [11], capturing long-term speaker features through a temporal pooling layer that aggregates over the input speech. F A Rezaur Rahman Chowdhury, Quan Wang, and others considered that attention-based models can summarize relevant information across the entire length of an input sequence. They analyzed the use of attention mechanisms for sequence summarization in end-to-end text-dependent speaker recognition systems, explored different topologies and variants of attention layers, and compared various methods of aggregating attention weights [12]. Heinrich Dinkel studied the training of deep convolutional and long short-term memory (LSTM) networks on raw waveforms, analyzed their applicability to speech of varying lengths, and considered the impact of frame size, number of output neurons, and sequence length. A joint convolutional LSTM deep neural network (CLDNN) was proposed, and experiments demonstrated that it outperformed previous attempts on the BTAS2016 dataset [13]. These advancements are particularly important from a biological perspective, as they must overcome the natural variability in human speech production, where even the same speaker may exhibit different acoustic characteristics when uttering short phrases versus extended speech.

From the above overview, it can be seen that biometric voiceprint recognition technology based on neural network models is gradually maturing. However, it still struggles to extract the deep features that characterize a speaker. Moreover, current mainstream biometric voiceprint recognition models perform poorly in complex scenarios such as noise and short speech segments. To address these issues, the main work of this paper is to design a text-independent biometric voiceprint recognition system based on a convolutional time-delay deep neural network. Additionally, the recognition performance of this system in complex acoustic environments (including additive background noise and short speech segments) is studied. The biological underpinnings of this work involve developing models that can isolate the invariant physiological features of human vocal production while remaining adaptable to the dynamic nature of real-world speech.

  2. Related work

2.1 Basic characteristics of speech signal

Short-term steady-state analysis shows that a speech signal has two basic attributes. The first is the local attribute, expressed by the structural characteristics of the speech signal itself; the second is the temporal correlation attribute, which is determined by the mechanism of the speaker’s vocal organs.

Specifically, due to the speaker’s physiological structure and language traits, speech signals repeatedly exhibit the same pronunciation patterns across the time-frequency space. These pronunciation patterns combine and superimpose to form the speech spectrum. For example, the same phonemes and syllables recur in a speech spectrum, and similar formant (resonant peak) shapes recur during articulation. Essentially, these characteristics arise because the speaker’s vocal organs consistently produce specific pronunciation patterns on certain syllables, with variations among different speakers on the same syllable. Figures 1(a) and (b) illustrate the local attributes exhibited in the time-frequency domain by two speech signals from different individuals: similar pronunciation patterns appear in different segments of the same speech, and certain segments of different speeches share similar pronunciation patterns.


Figure 1 Local properties of speech signals in the time-frequency domain

The temporal correlation property of speech signals refers to how the pronunciation pattern at the current moment may be influenced by the pronunciation patterns at multiple previous and subsequent moments. This property arises because the speaker’s vocal organs cannot change shape significantly in a short period; pronunciation is therefore a gradual process, and the pronunciation pattern is always affected to varying degrees by context. As shown in Figure 2, the same pronunciation pattern exhibits different characteristics in different contextual states.

2.2 Representation of speaker information in speech signal

Speech signals not only contain textual information but also retain the characteristic information of the speaker. The two fundamental attributes of speech are essentially determined by the mechanisms of the speaker’s vocal organs; therefore, the characteristic information of different speakers is also reflected in speech signals. For local attributes, the pronunciation patterns for a specific phoneme or syllable are basically fixed for each speaker. Thus, even if two speakers have very similar pronunciation patterns, their distributions in time-frequency space will inevitably differ. Therefore, the pronunciation pattern itself contains a wealth of speaker-specific information.

Figure 2 Time correlation properties of speech signals

For temporal correlation, due to the diversity of speech organs and the variability of environments among different speakers, their speaking habits, especially in the transition between different phonemes, can vary. This variability reflects a wealth of speaker-specific information. Clearly, this transitional information is long-lasting and repetitive, requiring context to manifest, thus possessing a certain degree of temporal correlation.

  3. Structure of biological voice print recognition model

Analysis of the basic characteristics of speech signals and of how speaker information is represented in them shows that the structural information of speech signals carries part of the speaker’s characteristic information. This prior knowledge can be fully utilized when designing a voiceprint feature learning model. The model design should take into account both fundamental attributes of speech signals and the prior knowledge related to speaker information, aiming at a deep neural network that can describe structural characteristics and represent temporal correlations. Thus, it is advisable to construct a voiceprint feature learning model that combines convolutional and time-delay deep neural networks. The model can then be trained using loss functions and relevant optimization algorithms to achieve biometric voiceprint recognition for speakers.

Based on the analysis of the time-domain and frequency-domain characteristics of speech signals, this paper proposes a deep neural network that combines convolutional neural networks and time-delay neural networks to deeply extract features from the original speech audio sequences of speakers, obtaining speaker-specific voiceprint embedding codes and completing biometric speaker recognition. The overall framework of the biometric speaker recognition system is shown in Figure 3. First, the speaker’s voice is preprocessed according to Section 2.1, and the logarithmic filter bank (log Fbank) features of the audio sequence are extracted using the audio spectrum feature extraction model described in Section 2.2, serving as input to the deep neural network model. Then, the frame-level spectral features are deeply extracted through the deep neural network, and attentional statistical pooling aggregates them into utterance-level speaker voiceprint features. Finally, scale normalization is applied to generate a voiceprint embedding code that represents the speaker’s feature information; the embedding vector has 192 dimensions. Because the speaker recognition task is essentially a multi-class classification problem, the additive angular margin loss (AAMLoss) is adopted during model training.
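To make the training objective concrete, the following is a minimal NumPy sketch of an additive angular margin softmax loss. It is an illustration under stated assumptions rather than the implementation used in this paper: the function name `aam_softmax_loss` and the `margin` and `scale` values are hypothetical, and the sketch omits the safeguard that practical implementations apply when the target angle plus the margin exceeds π.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss, averaged over the batch.

    embeddings:    (N, D) speaker embeddings
    class_weights: (K, D) one weight vector per training speaker
    labels:        (N,)   integer speaker indices
    margin, scale: hypothetical values, not taken from the paper
    """
    # Cosine similarity between L2-normalised embeddings and class weights.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                       # (N, K)

    # Add the angular margin to the target-class angle only.
    rows = np.arange(len(labels))
    theta = np.arccos(cos[rows, labels])
    logits = scale * cos
    logits[rows, labels] = scale * np.cos(theta + margin)

    # Standard cross-entropy on the modified logits.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

The margin lowers the target-class logit, so the network must pull embeddings of the same speaker closer together in angle to reach the same loss value.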

The deep neural network model in this paper uses log Fbank features, which are obtained by processing the initial audio sequence through pre-emphasis, frame division (25 ms frame length, 10 ms frame shift), and windowing (Hanning window). Each audio frame is represented by an 80-dimensional log Fbank feature vector. Therefore, an audio sequence containing T frames forms an 80×T two-dimensional feature spectrogram.
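The feature extraction steps can be sketched in NumPy as follows. The 25 ms frame length, 10 ms shift, Hanning window, and 80 filters come from the text; the 16 kHz sample rate, the 512-point FFT, the pre-emphasis coefficient 0.97, and the function name `log_fbank` are assumptions made for this illustration.

```python
import numpy as np

def log_fbank(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
              n_mels=80, n_fft=512, preemph=0.97):
    """Return an (n_mels, T) log Mel filter-bank spectrogram."""
    # Pre-emphasis: first-order high-pass filter.
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # Framing: 25 ms frames with a 10 ms shift, then a Hanning window.
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_len)                 # (T, frame_len)

    # Power spectrum of every frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft  # (T, n_fft//2+1)

    # Triangular filters equally spaced on the Mel scale.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_pts = np.linspace(0.0, mel_max, n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)

    # Log filter-bank energies, transposed to (n_mels, T).
    return np.log(power @ fbank.T + 1e-10).T
```

Under these assumptions, one second of audio yields 1 + (16000 − 400) // 160 = 98 frames, so the feature spectrogram has shape 80×98.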

Figure 3 Overall structure of the biological voice print recognition system

3.1 Convolutional neural network

Due to the structural characteristics of speech audio sequences (the phonemes and syllables in an audio sequence follow certain rules of combination and arrangement), and considering the local properties exhibited by speech signals in both the time domain and frequency domain as shown in Figure 1, it is evident that speech sequences, like structured data such as images, may exhibit the same pronunciation patterns in different parts. This characteristic is referred to as the structural property of speech signals.

In response to the structural characteristics exhibited by speech signals, convolutional neural networks (CNNs) use convolutional layers that maintain the original shape of the data and preserve the inherent structural information within speech signals. With the advantages of local connections and parameter sharing, a CNN can learn local features that correspond to the pronunciation patterns present in speech signals.

The core function of a CNN is to perform convolution operations on the feature map of speech signals; Figure 4 illustrates this operation. Specifically, the convolution layer uses a convolution kernel (essentially a filter matrix) to map the original data within a local area of the feature map into a new feature space. Since spatial relationships in speech signals are localized, the local feature perception mechanism of CNNs matches the local properties of speech signals well. Additionally, increasing the number of convolution kernels increases the number of feature planes, allowing different feature planes to learn different features from various angles. Together, these features cover the entire feature map more comprehensively.
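The convolution operation can be illustrated with a small NumPy sketch. The function name `conv2d_valid`, the toy 4×5 input, and the two example kernels are hypothetical; as in most deep learning libraries, the kernel is applied without flipping (cross-correlation), with no padding and a stride of 1.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Slide one kernel over a 2-D feature map (no padding, stride 1)."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value depends only on a local kh x kw patch,
            # and the same kernel weights are reused at every position.
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

spec = np.arange(20, dtype=float).reshape(4, 5)    # toy 4 x 5 "spectrogram"
kernels = [np.ones((3, 3)) / 9.0,                  # local average
           np.array([[-1.0, 0.0, 1.0]] * 3)]       # difference along time
# One feature plane per kernel: shape (2, 2, 3).
feature_maps = np.stack([conv2d_valid(spec, k) for k in kernels])
```

Each kernel produces its own feature plane, which is how adding kernels lets the layer describe the same local region from several angles.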

Figure 4 Schematic diagram of convolution operation of convolutional neural network

3.2 Time delay neural network

From the temporal correlation of speech signals, it is known that although each speaker’s phonetic mechanism is relatively fixed, the pronunciation pattern at a given moment may be influenced by the pronunciation patterns at multiple previous and subsequent moments. This means that the speaker’s pronunciation pattern evolves gradually as the context changes. This temporal correlation therefore contains rich speaker feature information, which permeates the entire audio sequence and is considered to have a certain degree of delay. Additionally, stacking first-order and second-order differences [14-15] on top of the original acoustic features helps improve model recognition performance. Thus, to effectively learn the delay characteristics of speaker feature information, a time-delay neural network (TDNN) with good descriptive ability for temporal correlation is chosen. The TDNN structure used in this paper is the ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in TDNN) model [16]. Figure 5 illustrates the basic structure of ECAPA-TDNN. Here, Conv1d performs one-dimensional convolution on the feature map, followed by ReLU activation and normalization, while the SE-Res2Block module is shown in Figure 7.

Figure 5 Basic structure of ECAPA-TDNN

In TDNN, the one-dimensional convolution used to describe the temporal correlation of speech signals is illustrated in Figure 6. Each set of feature maps represents one frame, and the number of feature maps in the set is the number of features of that frame. Each feature map is therefore not two-dimensional but a single value. The first-layer feature maps are obtained through one-dimensional convolution with a kernel size of 5, corresponding to the closed interval [t-2, t+2] (frames t-2 to t+2). That is, the t-th frame of the first layer covers the 5 input frames centered on t, denoted as {t-2, t-1, t, t+1, t+2}. The second-layer feature maps are obtained through one-dimensional dilated convolution with a kernel size of 3 and a dilation of 2. The t-th frame of the second layer covers 3 frames of the first-layer feature map, {t-2, t, t+2}, so the dilated convolution increases the receptive field of the output nodes.
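The growth of the receptive field can be checked with a few lines of Python. The helper `receptive_field` is hypothetical and assumes odd kernel sizes centered on t. The two-layer configuration follows the text; the five-layer list is the commonly cited x-vector frame-level configuration and is an assumption here, since this paper does not list it.

```python
def receptive_field(layers):
    """Input-frame offsets (relative to t) seen by one output frame.

    layers: list of (kernel_size, dilation) pairs with odd kernel sizes.
    """
    offsets = {0}
    for kernel_size, dilation in layers:
        half = kernel_size // 2
        taps = [dilation * k for k in range(-half, half + 1)]
        # Every offset reached so far is widened by this layer's taps.
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

# Kernel 5 (dilation 1) followed by kernel 3 (dilation 2), as in the text.
two_layers = receptive_field([(5, 1), (3, 2)])

# Assumed x-vector frame-level layers: contexts 5, 3 (dil. 2), 3 (dil. 3), 1, 1.
xvector = receptive_field([(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)])
```

With these settings the second layer sees frames t-4 to t+4 (9 frames), and the assumed x-vector stack sees t-7 to t+7 (15 frames).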

Figure 6 Schematic diagram of one-dimensional convolution principle of TDNN

In ECAPA-TDNN, to model the relationships between channels in each audio frame sequence, an SE-Res2Block module that emphasizes channel attention, propagation, and aggregation is introduced; its basic structure is shown in Figure 7. This module combines Res2Net[9] with SE-Net[17]. Res2Net divides the feature maps along the channel direction into several smaller groups (the parameter scale defines the number of groups and reflects the multi-scale nature of Res2Net). The first group reuses the features from the previous layer directly. Starting from the second group, every group undergoes a 3×3 convolution, and the convolution result of the current group is added, as a residual connection, to the next group before that group is convolved. Finally, all results are concatenated along the feature channel direction.
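The channel-splitting scheme described above can be sketched as follows. The function `res2_block` is a hypothetical illustration: each entry of `transforms` stands in for the convolution applied to one group, and a simple doubling function is used in the example so that the data flow is easy to follow.

```python
import numpy as np

def res2_block(x, transforms):
    """x: (C, T) feature map; transforms: one callable per group after the first."""
    scale = len(transforms) + 1
    groups = np.split(x, scale, axis=0)         # split along the channel axis
    outputs = [groups[0]]                       # first group is passed through
    previous = None
    for group, transform in zip(groups[1:], transforms):
        # From the third group on, add the previous group's output first.
        inp = group if previous is None else group + previous
        previous = transform(inp)
        outputs.append(previous)
    return np.concatenate(outputs, axis=0)      # back to (C, T)

x = np.ones((8, 5))                             # 8 channels, 5 frames
out = res2_block(x, [lambda g: 2.0 * g] * 3)    # scale = 4, stand-in "conv"
```

Because each group receives the output of the one before it, later groups effectively pass through more convolutions and therefore see a wider context, which is the multi-scale effect the text refers to.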

Figure 7 Basic structure of SE-Res2Block

In the x-vector system, the temporal context is 15 frames. Considering the temporal correlation of speech, it may be beneficial to rescale frame-level features along the feature channel dimension to expand the context. For this purpose, a one-dimensional SE-Block is introduced in ECAPA-TDNN. SE-Net weights the feature channels of the feature map, since feature channels differ in importance. The neural network learns the weight of each feature channel on its own, which makes SE-Net an attention mechanism. As shown in Figure 8, which illustrates the basic principle of the SE module, the first component is the Squeeze (compression) part, and the second is the Excitation part.

(1) For the compression part, the algorithm is as follows:

Step1, the dimension of the original feature map is H*W*C, where H is the height, W is the width, and C is the number of channels;

Step2, compress H*W*C into 1*1*C, that is, reduce the H*W plane of each channel to a single value;

Step3, the real number obtained for each channel summarizes the whole H*W plane and therefore carries a global view of that channel. The whole compression part is shown in formula (1), where $u_c$ is the c-th channel of the feature map and $z_c$ is its compressed value.

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j) \tag{1}$$

Figure 8 Compression and excitation parts of SE module

(2) For the excitation part, the algorithm is as follows:

The first step is to add a fully connected layer after the 1*1*C representation z produced by Squeeze, to predict the importance of each channel. The dimension of W1 is C/r*C, where r is a scaling parameter that reduces the number of channels, so multiplying by W1 transforms the 1*1*C feature into 1*1*C/r; a ReLU layer follows. The result is then multiplied by W2, which has a dimension of C*C/r, giving an output dimension of 1*1*C. Finally, the sigmoid function is applied to obtain the channel importance s, as shown in formula (2), where σ is the sigmoid function and δ is the ReLU function.

$$s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right) \tag{2}$$

The second step is to perform the reweighting (Reweight) operation: after the importance of the different channels is obtained, each importance value is applied as a scale factor to the corresponding channel of the previous feature map, as shown in formula (3). The channel importance is $s_c$, the channel of the feature map is $u_c$, and the feature map after excitation is $\tilde{x}_c$.

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c\, u_c \tag{3}$$
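The squeeze, excitation, and reweighting steps can be put together in a short NumPy sketch. The function name `se_block` is hypothetical, the weights are passed in rather than learned, and bias terms are omitted for brevity.

```python
import numpy as np

def se_block(u, w1, w2):
    """u: (H, W, C) feature map; w1: (C // r, C); w2: (C, C // r)."""
    z = u.mean(axis=(0, 1))                     # squeeze: H*W*C -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)            # FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # FC + sigmoid: (C,)
    return u * s, s                             # reweight every channel
```

For example, with C = 8 and r = 4, `w1` has shape (2, 8) and `w2` has shape (8, 2), so the bottleneck keeps the number of extra parameters small while still producing one weight in (0, 1) per channel.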

Figure 9 shows the basic structure of the SE-Block in ECAPA-TDNN. The input to the compression part is a feature of shape [N, C, L], where N is the batch size, C is the number of channels, and L is the number of feature frames. A global average pooling layer then compresses the feature into [N, C, 1].

The excitation part uses the descriptor in formula (1) to define the subsequent excitation operations, as shown in formulas (4) and (5).

(4)

(5)

Where σ is the sigmoid function, f represents a nonlinearity, W1 is an R*C matrix, W2 is a C*R matrix, and C is the number of channels in the feature map. The resulting vector s contains the channel weights, which are then applied to the input feature map as shown in formula (6).

(6)

Figure 9 SE-Block module of ECAPA-TDNN

3.3 Statistical pooling method based on multi-head attention mechanism

Traditional utterance-level speaker embedding codes are typically obtained by averaging all frame-level features over a single audio sequence. This method overlooks the differences in speaker information between frames within the audio sequence. Therefore, the ECAPA-TDNN model adopts a statistical pooling method based on the self-attention mechanism (attentive statistics pooling, ASP) [18], which combines statistical pooling with attention. The ASP method uses the attention mechanism to assign different weights to different frames and generates a weighted mean and a weighted standard deviation from all frames. This allows the model to capture long-term changes in speaker characteristics more effectively. The multi-head attention-based statistical pooling method proposed in this paper is developed step by step below.

(1) For general statistical pooling methods (statistics pooling), the pooling layer computes the mean vector of all frame feature vectors along the time dimension of the audio sequence. Additionally, since the second-order statistics (i.e., the standard deviation vector) contain features related to temporal correlations in the speaker’s audio sequence context, the pooling layer also computes the standard deviation vector, as shown in formula (7).

$$\mu = \frac{1}{T}\sum_{t=1}^{T} h_t, \qquad \sigma = \sqrt{\frac{1}{T}\sum_{t=1}^{T} h_t \odot h_t - \mu \odot \mu} \tag{7}$$

Among them, $h_t$ represents the feature of the t-th frame, T is the number of frames, and $\odot$ represents the multiplication of corresponding items of two vectors.
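A minimal NumPy sketch of plain statistics pooling is given below. The function name `statistics_pooling` and the (T, C) layout of the input are assumptions made for this illustration.

```python
import numpy as np

def statistics_pooling(h):
    """h: (T, C) frame-level features -> (2C,) utterance-level vector."""
    mean = h.mean(axis=0)
    # Variance as mean(h * h) - mean * mean; clip tiny negatives from rounding.
    var = np.clip((h * h).mean(axis=0) - mean * mean, 0.0, None)
    return np.concatenate([mean, np.sqrt(var)])

pooled = statistics_pooling(np.array([[1.0, 2.0], [3.0, 6.0]]))
```

In this toy case the two frames give a mean of (2, 4) and a standard deviation of (1, 2), so the pooled vector has twice the channel dimension regardless of the number of frames.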

(2) In most cases, certain frames in a given audio sequence carry more distinctive and important features for speaker discrimination than other frames. Therefore, attention mechanisms have been applied in speaker recognition to automatically calculate the importance of each frame and emphasize those that retain more speaker feature information. The attention mechanism is embedded into the original model and calculates a scalar score $e_t$ for each frame-level feature $h_t$, as shown in formula (8).

$$e_t = v^{\top} f(W h_t + b) + k \tag{8}$$

Among them, f(·) is a nonlinear function, such as the tanh function or the ReLU function, and W, b, v, and k are learnable parameters of the attention layer.

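Then, the scalar scores of all frames are normalized by the softmax function, and the normalized scores $\alpha_t$ are used as weights in the pooling layer to calculate the weighted mean vector $\tilde{\mu}$ of the frame-level features $h_t$, as shown in formula (9).

$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T}\exp(e_\tau)}, \qquad \tilde{\mu} = \sum_{t=1}^{T}\alpha_t h_t \tag{9}$$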

Therefore, the utterance-level features obtained from the weighted mean vector pay more attention to the relatively important frame-level features, and the voiceprint embedding code obtained from them distinguishes speakers more effectively.

(3) The statistical pooling method combined with attention mechanisms considers that higher-order statistics (the standard deviation as an utterance-level feature) are also highly effective for speaker recognition. Therefore, the attention mechanism is integrated with the statistical pooling method, using both the weighted mean vector and the weighted standard deviation vector as utterance-level features. This method is called the attentional statistical pooling method. The weighted mean vector is shown in formula (9), and the weighted standard deviation vector is shown in formula (10), where $\alpha_t$ is the normalized weight of the t-th frame feature $h_t$ and $\tilde{\mu}$ is the weighted mean vector.

$$\tilde{\sigma} = \sqrt{\sum_{t=1}^{T}\alpha_t\, h_t \odot h_t - \tilde{\mu} \odot \tilde{\mu}} \tag{10}$$

The weighted standard deviation is believed to combine the advantages of statistical pooling and attention, namely feature representation based on the long-term variability of the audio sequence and on the differing importance of frames, thus bringing higher speaker discrimination to utterance-level features.
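The attentional statistical pooling steps can be sketched in NumPy as follows. The function name `attentive_statistics_pooling`, the parameter names `W`, `b`, `v`, and `k`, and the choice of tanh as the nonlinearity are assumptions made for this illustration; in a trained model these parameters would be learned.

```python
import numpy as np

def attentive_statistics_pooling(h, W, b, v, k=0.0):
    """h: (T, C) frames; W: (A, C); b: (A,); v: (A,) -> ((2C,) vector, (T,) weights)."""
    e = np.tanh(h @ W.T + b) @ v + k            # one scalar score per frame
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                        # softmax over the T frames
    mean = alpha @ h                            # weighted mean: (C,)
    var = np.clip(alpha @ (h * h) - mean * mean, 0.0, None)
    return np.concatenate([mean, np.sqrt(var)]), alpha
```

When every frame receives the same score, the weights become uniform and the result reduces to the unweighted mean and standard deviation, which is a convenient sanity check for an implementation.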

(4) In ECAPA-TDNN, ASP is improved. First, the outputs of the previous three SE-Res2Blocks are concatenated along the feature dimension. Then, a CRB (conv1d + ReLU + BN) structure fixes the output dimension. Following this, the ASP method from (3) is applied to calculate the mean vector and standard deviation vector for each channel of the feature map along the time direction of the audio sequence. The improvement of ASP in ECAPA-TDNN lies in repeating the mean vector and the standard deviation vector along the time direction and concatenating the feature map with these repeated vectors in the feature channel direction to form a new feature map H. Finally, the attention scores for each channel are calculated from the new feature map H using the ASP formula, resulting in the improved utterance-level feature vector of ECAPA-TDNN.
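The channel-wise, context-aware variant can be sketched as follows. This is an illustration under assumptions rather than the ECAPA-TDNN implementation: the function name `channel_context_asp` and the parameters `W`, `b`, `V`, and `k` are hypothetical, the global mean and standard deviation that form the context are computed without weights, and tanh is used as the nonlinearity.

```python
import numpy as np

def channel_context_asp(h, W, b, V, k):
    """h: (C, T); W: (A, 3C); b: (A,); V: (C, A); k: (C,) -> (2C,) vector."""
    C, T = h.shape
    mean = h.mean(axis=1, keepdims=True)
    std = h.std(axis=1, keepdims=True)
    # Repeat the global statistics along time and stack them on the channels.
    H = np.concatenate([h, np.repeat(mean, T, axis=1),
                        np.repeat(std, T, axis=1)], axis=0)    # (3C, T)
    e = V @ np.tanh(W @ H + b[:, None]) + k[:, None]           # (C, T) scores
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)                  # softmax over time
    w_mean = (alpha * h).sum(axis=1)
    w_var = np.clip((alpha * h * h).sum(axis=1) - w_mean ** 2, 0.0, None)
    return np.concatenate([w_mean, np.sqrt(w_var)])
```

Unlike the single score per frame used in plain ASP, each channel here has its own set of frame weights, so different channels can attend to different parts of the utterance.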

(5) In ECAPA-TDNN, the self-attention mechanism used in attentional pooling calculates the correlation between the current frame vector and other frame vectors. However, this self-attention mechanism tends to over-concentrate on the position of the current frame vector. The paper [19] points out that, given different query vectors and key-value pairs, a model can learn different correlations with the same attention mechanism. The model should therefore allow the attention mechanism to combine queries, keys, and values from different subspaces to enhance its expressive power. That work proposes the multi-head self-attention mechanism for this purpose.

Two “heads” are used below to illustrate the principle of the multi-head attention mechanism adopted in this paper.

For a vector at a position i in the feature dimension direction of a frame, its query vector, key and value are shown in equations (11), (12) and (13).

(11)

(12)

(13)

After obtaining the query vector at position i, two linear mapping matrices are introduced to calculate the corresponding multi-head query vectors, as shown in equations (14) and (15). The multi-head mappings of the key and value vectors are obtained in the same way.

(14)

(15)

After obtaining the q, k, and v vectors of the multiple heads, the attention scores between positions i and j are calculated separately within each “head.” Then, the v vectors belonging to the same “head” are weighted by these scores and summed to produce the output of that “head.” Performing this operation on each “head” yields the multi-head output, as shown in equations (16) and (17).

(16)

(17)

For this purpose, this paper designs a method in ECAPA-TDNN that independently learns multiple sets of linear mappings to obtain distinct sets of query, key, and value vectors, rather than using a single attention pool. These queries, keys, and values are then pooled in parallel, mapping the current frame vector and the other frame vectors into different linear subspaces. Finally, the outputs of all attention pools are concatenated and transformed into the final utterance-level features through a linear mapping (such as a fully connected layer). This design is known as the multi-head attention mechanism, where each attention pool output is referred to as a “head.” In this way, this paper improves the attention statistical pooling method in ECAPA-TDNN into a multi-head attention statistical pooling method.
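The multi-head computation described in this subsection can be sketched in NumPy as follows. This is a hedged illustration, not the exact layer used in this paper: the function and parameter names are hypothetical, the scores are scaled by the square root of the head dimension as in the standard formulation of [19], and the way the multi-head output is subsequently combined with statistics pooling is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(h, Wq, Wk, Wv, head_maps, Wo):
    """h: (T, C) frames.  Wq, Wk, Wv: (C, D).
    head_maps: one (Mq, Mk, Mv) triple per head, each of shape (D, Dh).
    Wo: (n_heads * Dh, C_out) final linear mapping."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv                      # (T, D) each
    heads, weights = [], []
    for Mq, Mk, Mv in head_maps:
        qh, kh, vh = q @ Mq, k @ Mk, v @ Mv               # per-head (T, Dh)
        # Attention between positions i and j inside the same head.
        a = softmax(qh @ kh.T / np.sqrt(qh.shape[1]), axis=1)   # (T, T)
        heads.append(a @ vh)                              # weighted sum of v
        weights.append(a)
    # Concatenate all heads and apply the output linear mapping.
    return np.concatenate(heads, axis=1) @ Wo, weights
```

Because every head has its own mapping matrices, two heads can assign different weights to the same pair of frames, which is what allows the model to avoid concentrating only on the current frame position.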

3.4 Convolution-delay deep neural network

Based on the structural characteristics and temporal correlation of speech signals, this paper designs a biometric voiceprint recognition model using a convolutional time-delay deep neural network (CT-DNN). The model consists of two main parts: the convolutional module and the delay module. Because the convolutional module focuses on learning local features from the input feature maps and does not alter the temporal order of the frames in the audio sequence, its output can be passed on to the delay module, and the two modules are connected through a bottleneck layer.

The convolutional module learns the structural characteristics of speaker information in speech signals, so a CNN is applied to stacked frames. Specifically, the CNN consists of two convolutional layers, each followed by a downsampling layer. The delay module learns the temporal correlations that speaker information exhibits across the entire audio sequence, so an ECAPA-TDNN-based structure is adopted as the time-delay neural network encoder. In addition, the statistics pooling of ECAPA-TDNN is replaced by a new statistics pooling based on the multi-head attention mechanism. The resulting CT-DNN model can effectively extract speaker feature information from speech signals.

First, to enhance the model’s learning capability, data augmentation is applied to the spectrogram: time shifting (shifting the spectrogram left or right along the time axis), frequency masking (randomly removing consecutive rows of the spectrogram along the frequency axis), and time masking (randomly removing consecutive columns of the spectrogram along the time axis). Second, during training, the AAM-Softmax loss (AAMLoss) is used to maximize the separation between different speakers, and the model is optimized with stochastic gradient descent (SGD). Third, the feature maps produced by multi-head attention pooling are normalized to produce high-dimensional speaker embedding codes, which are passed through a fully connected layer and normalized again to generate a fixed 192-dimensional speaker embedding code. Finally, once training is complete, the model’s recognition accuracy is evaluated on the speakers in the test set.
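The frequency and time masking steps can be sketched as follows. This is an illustrative version under assumptions that the paper does not specify: one mask of each kind per spectrogram, masked cells set to zero, and hypothetical maximum widths `max_freq_width` and `max_time_width`.

```python
import random

def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, seed=0):
    """Return a copy of spec (rows = frequency bins, columns = frames) with
    one frequency mask and one time mask set to zero."""
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: zero out f_width consecutive rows.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero out t_width consecutive columns.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out

spec = [[1.0] * 10 for _ in range(8)]  # toy 8-bin, 10-frame spectrogram
masked = mask_spectrogram(spec)
print(len(masked), len(masked[0]))  # prints: 8 10
```

The shape of the spectrogram is preserved, so the augmented sample can be fed to the network in place of the original.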

  4. Experimental design

4.1 Experimental data and its preprocessing

This experiment uses the zhvoice Chinese speech corpus and the VoxCeleb1 English data set, which was built by the University of Oxford from YouTube videos.

(1) zhvoice data set

The zhvoice dataset consists of 8 open-source corpus data sets (see Table 1), obtained after noise reduction and silence removal. The experimental data has a sampling rate of 16,000 Hz and includes 3,242 speakers, with approximately 1.13 million speech samples. Compared with the original data, the zhvoice dataset is free from noise interference, has much clearer audio quality, and reduces the unnaturalness caused by disorganized speech sequences.

(2) VoxCeleb1 data set

To further validate the effectiveness of the proposed model, the foreign-language dataset VoxCeleb1, which is based on real-world scenarios, was also used (see Table 2). All of its audio data is sourced from YouTube. It includes 1,251 speakers and approximately 150,000 audio clips, covers various ethnicities, accents, and ages, and has a roughly balanced gender ratio (55% male, 45% female).

Table 1 Division of the zhvoice data set for the biometric voiceprint recognition task

Subset          Number of speakers    Number of audio files
Training set    3242                  1124566+
Test set        3242                  5434
Total           6484                  1130000+

Table 2 Division of the VoxCeleb1 data set for the biometric voiceprint recognition task

Subset          Number of speakers    Number of audio files
Training set    3242                  1124566+
Test set        3242                  5434
Total           6484                  1130000+

Preprocessing of the above data sets mainly involves format conversion and the generation of a data list. Because audio in MP3 format is very slow to read, all files are converted to WAV format. Then, for easier access by the model, a data list is created with the format <file path\tcategory label>, where the category label is the speaker’s identity tag. (Note: the training set and test set use different speakers.)
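The data list step can be sketched as follows. The directory layout (one sub-directory per speaker containing WAV files), the function name `build_data_list`, and the use of consecutive integers as labels are assumptions for illustration; the paper only specifies the <file path\tcategory label> line format.

```python
import os
import tempfile

def build_data_list(root_dir):
    """Return 'path<TAB>label' lines, assuming one sub-directory per speaker."""
    lines = []
    speakers = sorted(d for d in os.listdir(root_dir)
                      if os.path.isdir(os.path.join(root_dir, d)))
    for label, speaker in enumerate(speakers):
        speaker_dir = os.path.join(root_dir, speaker)
        for name in sorted(os.listdir(speaker_dir)):
            if name.lower().endswith(".wav"):
                lines.append(f"{os.path.join(speaker_dir, name)}\t{label}")
    return lines

# Demo on a throwaway directory with two hypothetical speakers.
with tempfile.TemporaryDirectory() as root:
    for speaker in ("spk_a", "spk_b"):
        os.makedirs(os.path.join(root, speaker))
        open(os.path.join(root, speaker, "utt1.wav"), "wb").close()
    demo_lines = build_data_list(root)

print(len(demo_lines))  # prints: 2
```

Writing the returned lines to a text file, one per line, gives a list that a data loader can read without scanning the directory tree again.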

4.2 System and parameter configuration

In this experiment, the Gaussian mixture model i-vector system, which is based on statistical learning, and the ECAPA-TDNN model, which is based on deep learning, are selected as the baseline biometric voiceprint recognition algorithms. The parameter configurations of the three systems are described in (1)-(3).

(1) For the i-vector system, the input spectral feature is 60-dimensional: 19 Mel-frequency cepstral coefficients plus the logarithmic energy of the frame, together with the first-order and second-order differences of these 20 features calculated from adjacent frames. The number of Gaussian components in the universal background model is set to 512, so the corresponding Gaussian mean supervector has 512*60 = 30,720 dimensions. The T matrix maps this high-dimensional Gaussian mean vector to a low-dimensional i-vector space. Here the i-vector has 400 dimensions, which LDA (Linear Discriminant Analysis) reduces to 150 dimensions, and PLDA (Probabilistic Linear Discriminant Analysis) is then used for scoring.

(2) For the ECAPA-TDNN model, the input feature map size is 80*T: each frame’s feature vector has 80 dimensions, and T is the total number of frames in the sequence. The output speaker embedding code has 192 dimensions. For each layer of the model, the number of convolution kernels, kernel size, stride, padding, dilation, and the controlling parameter that divides each frame’s features into equal groups along the feature dimension are shown in Table 3.

Table 3 Parameter settings of the ECAPA-TDNN model

Layer           Number of kernels    Kernel size    Stride    Padding    Dilation    Controlling parameter
Conv1dReLUBN    512                  5              /         2          1           /
SE-Res2Block    512                  3              1         2          2           8
SE-Res2Block    512                  3              1         3          3           8
SE-Res2Block    512                  3              1         4          4           8

(3) For the ECAPA-CTDNN system proposed in this paper, a bottleneck layer connects the convolutional module and the time-delay module. The input acoustic features are 80-dimensional filter-bank (Fbank) features, and 98 frames are selected from the time series to form an 80*98 feature map. The convolutional layers of the convolutional module use one-dimensional dilated convolution, with the parameter configuration shown in Table 4. The hyperparameter settings used by the model are listed in Table 5.

Table 4 Parameter settings of the convolutional module

Layer               Number of kernels    Kernel size    Stride    Padding    Dilation
Conv1 layer         128                  4              1         2          2
Conv2 layer         256                  2              1         2          3
Conv3 layer         512                  2              1         3          4
Bottleneck layer    80                   1              1         1          /
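As a worked check of how these settings affect the sequence length, the sketch below applies the standard output-length formula for a one-dimensional convolution (the one documented for PyTorch's `Conv1d`) to the Table 4 values. It assumes that the convolution runs along the 98-frame time axis, that the bottleneck dilation shown as "/" equals 1, and it ignores the downsampling layers; these are assumptions for illustration only.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (same formula as PyTorch's Conv1d)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# (name, kernel size, stride, padding, dilation) as listed in Table 4.
# The bottleneck dilation is listed as "/" and is assumed to be 1 here.
layers = [
    ("Conv1", 4, 1, 2, 2),
    ("Conv2", 2, 1, 2, 3),
    ("Conv3", 2, 1, 3, 4),
    ("bottleneck", 1, 1, 1, 1),
]

length = 98  # number of input frames
lengths = []
for name, kernel, stride, padding, dilation in layers:
    length = conv1d_out_len(length, kernel, stride, padding, dilation)
    lengths.append(length)
    print(name, length)
# prints: Conv1 96, Conv2 97, Conv3 99, bottleneck 101 (one pair per line)
```

Under these assumptions the time axis stays close to its original length, which is consistent with the statement that the convolutional module does not disturb the temporal order of the frames.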

This paper adopts the CosineAnnealingLR (cosine annealing learning rate) method to control the decay of the learning rate. Cosine annealing adjusts the learning rate with the cosine function: as training progresses, the learning rate first decreases slowly, then decreases faster, and finally slows down again. This decay pattern suits the later stage of training, when smaller steps are needed as the loss approaches its minimum. The schedule is given by Equation (18).

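$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}}\pi\right)\right) \qquad (18)$$

Here $\eta_t$ represents the current learning rate, $\eta_{\min}$ and $\eta_{\max}$ represent the minimum and maximum learning rates, respectively, and $T_{cur}$ and $T_{\max}$ represent the current epoch and the maximum epoch.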
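The cosine annealing schedule can be sketched as a small function. The maximum learning rate of 1e-3 and the 30 training rounds are taken from Table 5; the minimum learning rate of 0 and the function name `cosine_annealing_lr` are assumptions for illustration.

```python
import math

def cosine_annealing_lr(epoch, max_epoch, lr_max=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing."""
    cosine = math.cos(math.pi * epoch / max_epoch)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)

# Learning rate at every epoch of a 30-epoch run.
schedule = [cosine_annealing_lr(e, 30) for e in range(31)]
print(schedule[0], schedule[-1])  # prints: 0.001 0.0
```

The schedule starts at the maximum value, passes through the midpoint between the two bounds halfway through training, and ends at the minimum value.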

Table 5 Hyperparameter settings of the model

Hyperparameter         Meaning                      Initial value / initial method
learning_rate          Learning rate                1e-3
audio_duration         Audio sequence length        3s
min_duration           Minimum audio length         0.5s
num_epoch              Number of training rounds    30
feature_method         Feature extraction method    melspectrogram/spectrogram
augment_conf_method    Data augmentation methods    noise/spec/speed/volume

4.3 Model training

During training, the AAM-Softmax function is used to calculate the loss, and the weights are updated with the SGD optimizer. Training runs for 10 rounds, each consisting of 16,542 iterations on the zhvoice dataset. The loss value is recorded locally every 100 iterations, giving about 1,650 recorded loss values in total. After training, the loss curve of the biometric speaker recognition system based on the convolutional time-delay deep neural network and multi-head attention pooling on the zhvoice dataset was plotted, as shown in Figure 10.
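For reference, the AAM-Softmax loss for a single sample can be sketched as below. This is a simplified version: the margin of 0.2 and scale of 30 are assumed values that the paper does not report, and the special handling that practical implementations apply when the angle plus the margin exceeds pi is left out.

```python
import math

def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosines, with an additive angular margin
    applied to the target class only."""
    logits = []
    for idx, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if idx == target:
            c = math.cos(math.acos(c) + margin)  # penalize the true class
        logits.append(scale * c)
    # Log-sum-exp computed in a numerically stable way.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]

# Cosine similarities between one embedding and three speaker centers.
cosines = [0.6, 0.5, -0.2]
loss_with_margin = aam_softmax_loss(cosines, target=0)
loss_without_margin = aam_softmax_loss(cosines, target=0, margin=0.0)
print(loss_with_margin > loss_without_margin)  # prints: True
```

Because the margin lowers the target logit, the same embedding incurs a larger loss than under plain softmax, which pushes embeddings of the same speaker closer together during training.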

Figure 10 Loss value variation curve

The loss curve shows that the loss decreases very sharply in the initial stage of training. After 1000 iterations, the rate of change of the loss begins to decrease. This is because, in the later stages of training, the learning rate continues to decrease and the model gradually approaches its optimum. The decay of the learning rate is controlled by the CosineAnnealingLR method, and the learning rate curve is shown in Figure 11.

Figure 11 Learning rate variation curve

In addition, the recognition accuracy on the training data is calculated and recorded locally every 100 iterations, and at the end of each training round the model’s recognition accuracy is also calculated on the test dataset and recorded. The training accuracy and test accuracy are shown in Figures 12 and 13, respectively.

Figure 12 Accuracy curve of training set

Figure 13 Accuracy change curve of test set

On the training set, the accuracy rises sharply until iteration 600 and then gradually stabilizes at 0.96 by iteration 1600, indicating that the model fits the distribution of the training data well. On the test set, the model reaches an accuracy of 0.95382 after the 9th training round, which is close to the training accuracy. This suggests that the model adapts well to the test set without obvious overfitting and has a certain level of generalization ability.

4.4 Experimental results and analysis

To test the performance of the proposed model, this section conducts comparative experiments on the clean zhvoice and VoxCeleb1 speech datasets for the ECAPA-CTDNN model, the i-vector baseline, and the ECAPA-TDNN baseline. The comparisons cover long-duration speech (audio length of 3s) and short-duration speech (audio length of 0.5s-1.5s), as shown in Table 6 and Table 7, respectively. A series of ablation experiments on the ECAPA-CTDNN model is also conducted on the zhvoice dataset. Both the comparative and the ablation experiments evaluate recognition performance on the speaker embedding codes using two metrics, the equal error rate (EER) and the minimum detection cost function (minDCF). The false-rejection (FR) and false-acceptance (FA) risk coefficients CFR and CFA are set to 10 and 1, respectively, as defined in NIST SRE 2008, and the prior probabilities PT and PI of true speakers and impostors are set to 0.95 and 0.05, respectively [20].
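The two metrics can be sketched as follows for a list of verification scores. The threshold sweep over observed scores, the unnormalized form of the detection cost, and the function names are assumptions for illustration; the cost coefficients and priors default to the values stated above.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False-rejection and false-acceptance rates when a trial is accepted
    if its score is greater than or equal to the threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far

def eer_and_min_dcf(target_scores, impostor_scores,
                    c_fr=10.0, c_fa=1.0, p_target=0.95, p_impostor=0.05):
    """Sweep thresholds; return the EER estimate and the minimum cost."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    best_gap, eer, min_dcf = None, None, None
    for th in thresholds:
        frr, far = error_rates(target_scores, impostor_scores, th)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            # EER is taken where the two error rates are closest.
            best_gap, eer = gap, (frr + far) / 2
        dcf = c_fr * p_target * frr + c_fa * p_impostor * far
        if min_dcf is None or dcf < min_dcf:
            min_dcf = dcf
    return eer, min_dcf

# Toy scores: higher means "more likely the same speaker".
eer, min_dcf = eer_and_min_dcf([0.9, 0.4], [0.6, 0.1])
print(eer)  # prints: 0.5
```

With real trial lists the sweep uses many more thresholds, and the EER is usually reported as a percentage.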

(1) Comparison of experimental results and analysis

Table 6 Comparison results for long-duration test speech

Dataset              Model                 EER     minDCF
zhvoice (clean)      i-vector              1.27    0.1446
zhvoice (clean)      ECAPA-TDNN            0.94    0.0695
zhvoice (clean)      ECAPA-CTDNN (Ours)    0.86    0.0686
VoxCeleb1 (clean)    i-vector              1.42    0.1488
VoxCeleb1 (clean)    ECAPA-TDNN            1.01    0.1274
VoxCeleb1 (clean)    ECAPA-CTDNN (Ours)    0.97    0.1265

The results in Table 6 show that, in the long-duration test on the zhvoice dataset, the error rate of the proposed ECAPA-CTDNN model, which is based on multi-head attention statistical pooling, is lower than that of the i-vector model and the ECAPA-TDNN model (EER of 0.86 versus 1.27 and 0.94). This indicates that the joint factors of the i-vector model still cannot fully separate speaker information from channel information. In contrast, the proposed model leverages the strong ability of deep learning to represent hidden-layer features, enabling more accurate identification of speaker identity. In addition, the ECAPA-CTDNN model captures the structural information of the speaker’s voice through the convolutional network and uses the learned local features as input to the subsequent delay network. The delay network extracts contextual relevance from the speech information to form the final acoustic feature, which contains richer speaker characteristics and is sufficient to distinguish different speakers.

Table 7 Comparison results for short-duration test speech

Dataset              Model                 EER     minDCF
zhvoice (clean)      i-vector              5.71    0.1295
zhvoice (clean)      ECAPA-TDNN            1.18    0.0765
zhvoice (clean)      ECAPA-CTDNN (Ours)    1.16    0.0861
VoxCeleb1 (clean)    i-vector              8.23    0.1446
VoxCeleb1 (clean)    ECAPA-TDNN            1.47    0.1316
VoxCeleb1 (clean)    ECAPA-CTDNN (Ours)    1.32    0.1277

The results in Table 7 show that the performance of the ECAPA-CTDNN model, as well as that of the i-vector and ECAPA-TDNN models, declines on both datasets in the short-duration scenario. Short-duration speech inherently contains little speaker information, which makes it difficult to extract sufficiently discriminative spectral features during preprocessing and directly lowers the final recognition performance. Nevertheless, the ECAPA-CTDNN model still achieves the lowest EER on both datasets, although its minDCF on zhvoice (0.0861) is higher than that of ECAPA-TDNN (0.0765). This is because the model employs a multi-head attention mechanism to assign different levels of attention to different channels of the feature maps, enabling it to select the relatively important channels when only limited short-duration speech is available.

For both long-duration and short-duration speech, all models perform worse to varying degrees on the VoxCeleb1 dataset than on zhvoice, and the decline of the ECAPA-CTDNN model is also clearly visible. This is because the smaller scale of this dataset prevents the model from fully learning multi-scale feature information. It also indicates that the proposed model depends on large amounts of training data, which is one of its limitations.

(2) Results and analysis of ablation experiments

To study the contribution of each component of the proposed ECAPA-CTDNN model to recognition accuracy, this section carries out a series of ablation experiments covering two aspects: multi-layer feature aggregation summation and the multi-head attention mechanism.

In the ECAPA-CTDNN model, starting from the second SE-Res2Block, the input of each SE-Res2Block includes the residual output of the preceding SE-Res2Block, which is known as multi-layer feature aggregation summation. To investigate the impact of this design on model performance, two variants were set up: one without residual connections (No Res Connections) and one without multi-layer feature aggregation summation (No Sum Resconnections). The experimental results are shown in Table 8. To verify the impact of the proposed multi-head attention mechanism on the CTDNN model, single-head and multi-head attention mechanisms were compared, with the results presented in Table 9.

Table 8 Results of the multi-layer feature aggregation ablation experiment

System                   EER     minDCF
ECAPA-CTDNN (Ours)       0.86    0.0686
No Res Connections       1.08    0.1310
No Sum Resconnections    1.02    0.1217

Residual connections and multi-layer feature aggregation both have a clear impact on model performance: removing either one raises the EER and the minimum detection cost. The residual connection alleviates vanishing gradients, enabling the model to learn more abstract speaker features at higher levels. The benefit of multi-layer feature aggregation suggests that, while deeper features are more relevant to speaker identity, shallow features contribute to a more robust representation of speaker identity.

Table 9 Results of the multi-head attention ablation experiment

System         EER     minDCF
Single Head    1.03    0.1288
Multi Head     0.86    0.0686

The experimental results show that the multi-head attention mechanism, which is computed by multiple independent attention modules, helps the network capture richer speaker feature information from multiple perspectives and avoids the overfitting that single-head attention can cause, thereby improving the recognition performance of the model.

  5. Conclusions

This paper first analyzes the time-domain and frequency-domain characteristics of speech signals from the perspective of the speaker’s voice. The short-term stationary analysis of speech signals, combined with the way speaker information is represented in them, shows that speech signals have two fundamental attributes: local attributes and temporal correlation attributes. Based on these two attributes, a deep neural network combining a convolutional neural network and a time-delay neural network is constructed to extract deep features from the original speech audio sequences and obtain an acoustic embedding code for each speaker, thus completing the biometric recognition of speakers. Furthermore, a statistical pooling method based on the multi-head attention mechanism is adopted in the model. Finally, comparative experiments and ablation experiments demonstrate that the biometric recognition system based on the convolutional time-delay neural network and multi-head attention statistical pooling achieves better recognition performance, meeting the goals of this paper.

Funding

This work was supported by the Basic Scientific Research Project of Colleges and Universities of the Liaoning Provincial Department of Education, "Research on feature pattern extraction and detection method of forged and mutated speech" (project number: JYTZD2023150), and by the 2024 Graduate Innovation Ability Improvement Project (project number: 2024YCYB34).

Author’s Profile

Hongbing Zhang was born in Wuyang, Henan, P.R. China, in 1979. He obtained a bachelor’s degree from Shaanxi Normal University, Xi’an, China. He is currently a Professor at the School of Police Information Technology and Intelligence, Criminal Investigation Police University of China. His main research direction is speaker recognition and voice anti-spoofing.

Maolin Ma was born in Qingdao, Shandong, P.R. China, in 1998. He obtained a bachelor’s degree from Guilin University of Electronic Technology, China. He is currently studying at the School of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China. His main research direction is speaker recognition and voice anti-spoofing.

References

[1] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification[J]. 2020. DOI:10.21437/Interspeech.2020-2650.

[2] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification[J]. 2020. DOI:10.21437/Interspeech.2020-2650.

[3] Hu J, Shen L, Albanie S, et al. Squeeze-and-Excitation Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8):2011-2023.

[4] Tan Z, Mak M W, Mak K W. DNN-Based Score Calibration With Multitask Learning for Noise Robust Speaker Verification[J]. IEEE/ACM Transactions on Audio Speech & Language Processing, 2018, 26(4):700-712.

[5] Jin M, Yoo C D. Speaker Verification and Identification[J]. Behavioral Biometrics for Human Identification: Intelligent Applications, 2010.

[6] Furui S. An overview of speaker recognition technology[J]. Automatic Speech & Speaker Recognition, 1996, 355:31-56.

[7] Tan Z, Mak M W, Mak K W. DNN-Based Score Calibration With Multitask Learning for Noise Robust Speaker Verification[J]. IEEE/ACM Transactions on Audio Speech & Language Processing, 2018, 26(4):700-712.

[8] Dharanipragada S, Yapanel U H, Rao B D. Robust Feature Extraction for Continuous Speech Recognition Using the MVDR Spectrum Estimation Method[J]. IEEE Transactions on Audio Speech & Language Processing, 2007, 15(1):224-234.

[9] Gao S H, Cheng M M, Zhao K, et al. Res2Net: A New Multi-Scale Backbone Architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021(2):43.

[10] Chen Zhuang, Yu Yibiao. Robust biological voice print recognition algorithm with noise adaptive fitting compensation[J]. Journal of Acoustics, 2022, 47(1):10.

[11] Snyder D, Garcia-Romero D, Povey D, et al. Deep Neural Network Embeddings for Text-Independent Speaker Verification[C]//Interspeech 2017. 2017.

[12] Chowdhury F A R R, Wang Q, Moreno I L, et al. Attention-Based Model for Text-Dependent Speaker Verification[J]. IEEE, 2018. DOI:10.1109/ICASSP.2018.8461587.

[13] Heinrich D, Qian Y, Kai Y. Investigating Raw Wave Deep Neural Networks for End-to-End Speaker Spoofing Detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, PP:1-1.

[14] Huang X, Acero A, Hon H W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development[J]. Prentice Hall PTR, 2001.

[15] Furui S. Cepstral analysis technique for automatic speaker verification[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(2):254-272.

[16] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification[J]. 2020. DOI:10.21437/Interspeech.2020-2650.

[17] Hu J, Shen L, Albanie S, et al. Squeeze-and-Excitation Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8):2011-2023.

[18] Okabe K, Koshinaka T, Shinoda K. Attentive Statistics Pooling for Deep Speaker Embedding[J]. 2018. DOI:10.21437/Interspeech.2018-993.

[19] Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[J]. arXiv, 2017. DOI:10.48550/arXiv.1706.03762.

[20] Sun H, Ma B, Huang C L, et al. The IIR NIST SRE 2008 and 2010 summed channel speaker recognition systems[C]//Interspeech, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September. DBLP, 2010.
