Improved Bimodal Speech Recognition Study Based on Product Hidden Markov Model
Improved Bimodal Speech Recognition Study Based on Product Hidden Markov Model
International Journal of Fuzzy Logic and Intelligent Systems. 2013. Sep, 13(3): 164-170
Copyright ©2013, Korean Institute of Intelligent Systems
This is an Open Access article distributedunder the terms of the CreativeCommons Attribution Non-Commercial License( which permits unrestricted noncommercialuse, distribution, and reproductionin any medium, provided the originalwork is properly cited.
  • Received : May 30, 2013
  • Accepted : September 14, 2013
  • Published : September 25, 2013
Export by style
Cited by
About the Authors
Su Mei Xi
School of Science, Qilu University of Technology, Jinan, China
Young Im Cho
College of Information Technology, The University of Suwon, Hwaseong, Korea

Recent years have been higher demands for automatic speech recognition (ASR) systems that are able to operate robustly in an acoustically noisy environment. This paper proposes an improved product hidden markov model (HMM) used for bimodal speech recognition. A two-dimensional training model is built based on dependently trained audio-HMM and visual-HMM, reflecting the asynchronous characteristics of the audio and video streams. A weight coefficient is introduced to adjust the weight of the video and audio streams automatically according to differences in the noise environment. Experimental results show that compared with other bimodal speech recognition approaches, this approach obtains better speech recognition performance.
1. Introduction
Speech recognition technology has made great progress in recent decades, and automatic speech recognition (ASR) systems have become increasingly widespread. Since an ASR system is vulnerable to speech noise, and since almost all voice signals contain noise, ASR identification performance using only audio information cannot meet the need. Therefore, developing a robust speech recognition system in a noisy environment is an urgent problem. Developing an integration strategy for audio and visual information is one of the many challenges facing an audio-visual (bimodal) ASR system. From the point of view of perception, video information corresponding to audio information can improve a person’s understanding of a speaker’s voice. In a noisy environment or for hearing-impaired listeners, video information is a useful complement.
Generally, the audio-visual ASR (AVSR) systems work by the following procedures. First, the acoustic and the visual signals of speech are recorded by a microphone and a camera, respectively. Then, each signal is converted into an appropriate form of compact features. Finally, the two modalities are integrated for recognition of the given speech. Integration of acoustic and visual information aims at obtaining as good recognition results as possible in noisy circumstances. It can take place either before the two information sources are processed by a recognizer early integration (EI) or after they are classified independently late integration (LI). LI has been shown to be preferable because of its better performance and robustness than EI [1] , and psychological supports [2] .
Considering the asynchronous nature of speech signal and video signal, we put forward an improved product hidden Markov model (HMM) in this paper, used for implementing the bimodal voice recognition of Chinese words, which formulate an improved HMM as a multi-stream HMM. Moreover, we control the stream weights of the audio-visual HMM by the generalized Pareto distribution (GPD) algorithm [1 , 3] , in order to adaptively optimize the audio-visual ASR. According to the corresponding relation between the weight coefficient and instantaneous signal-to-noise ratio (SNR), it can adjust the weight ratio of audio stream and video stream adaptively.
Experimental results demonstrate that the importance of audio features is far higher than that of video features in a quiet environment, while in the presence of noise, video features make an important contribution to speech recognition [4 - 6] .
In Section 2, we introduce some related research works about speech recognition technology. In Section 3, we introduce the classical speech feature parameter extraction approach. In Section 4, we put forward our improved HMM, and, in Section 5, we present our weight optimization approach based on the GDP algorithm. In Section 6, we implement our improved bimodal product HMM system, and, show some experiment results between other ASR systems and the improved HMM. Section 7 is conclusion and our future work.
2. Related Works
In speaker recognition, features are extracted from speech signals to form feature vectors, and statistical patter n recognition methods are applied to model the distribution of the feature vectors in the feature space. Speakers are recognized by pattern matching of the statistical distribution of their feature vectors with target models. Speaker verification (SVR) is the task of deciding, upon receiving tested feature vectors, whether to accept or reject a speaker hypothesis, according to the speaker’s model. Mel-frequency cepstral coefficients (MFCC) [7] are a popular feature-extraction method for speech signal processing, and Gaussian mixture models (GMM) have become a dominant approach for statistical modeling of speech feature vectors for text-independent SVR [8] .
A recently developed method for overcoming model mismatch is to use a reverberant speech database for training target models [9] . This method was tested on an adaptive-GMM (AGMM)-based SVR system [10] with reverberant speech, with various values of reverberation time (RT). Matching of RT between training and testing data was reported to reduce the equalerror rate (EER) from 16.44% to 9.9%, on average, when using both Z-norm and T-norm score normalizations. However, the study in [9] did not investigate the effect of GMM order on SVR performance under reverberation conditions. In fact, it may be difficult to find such research studies on this effect in the literature.
The audio and visual fusion techniques investigated in previous work include feature fusion, model fusion, or decision fusion. In feature fusion, the combined audio-visual feature vectors are obtained by the concatenation of the audio and visual features, followed by a dimensionality reduction transform [11] . The resulting observation sequences are then modeled using one HMM [12] . A model fusion system based on multi-stream HMM was proposed in [13] . The multi-stream HMM assumes that audio and video sequences are state synchronous but allows the audio and video components to have different contribution to the overall observation likelihood. However, it is well known that the acoustic features of speech are delayed from the visual features of speech, and assuming state synchronous models can be inaccurate. We proposed an audio visual bimodal that uses a product HMM. The audio visual product HMM can be seen as an extension of the multi-stream HMM that allows for audio-video state asynchrony. Decision fusion systems model independently the audio and video sequences using two HMMs, and combine the likelihood of each observation sequence based on the reliability of each modality [11] .
3. Speech Feature Parameter Extraction
- 3.1 Audio Feature Extraction
The normalized energy, MFCC and linear predictive cepstrum coefficients (LPCC) of speech describe the prosodic features, timbre features and perceived features, respectively, so they are selected as audio feature parameters in this paper.
The MFCC computation formula is as follows:
Lager Image
where N is the number of triangular filters; E mel (t,j) is the output energy for the j th filter at t time, {MFCC(t,i)} i=1,2,...,p is the corresponding MFCC parameters at t time, and P of {MFCC(t,i)} i=1,2,...,p is order.
The LPCC computation formula is as follows:
Lager Image
Mouth parameters.
LPCC(t; i) =LPC(t; i)
Lager Image
where LPC (t, k) is the k –th linear prediction coefficient at time t, {LPCC(t,i)} i=1,2,...,p is the corresponding LPCC parameters at time t, and P of {LPCC(t,i)} i=1,2,...,p is order.
- 3.2 Video Feature Extraction
We select lip parameters as video features, a segment video image using a two-dimensional fast thresholding segmentation algorithm, and lip feature extraction parameters [14] , as shown in Figure 1 , where x is the distance from the labial center line to the edge, u 0 and d 0 are the heights of the upper and lower halves, respectively, of the mouth centerline, and u 1 ,d 1 ,u 2 ,d 2 are the respective heights of the corresponding x-third point. The original lip parameter is v t = [x, u 0 , d 0 , u 1 ,d 1 ,u 2 ,d 2 ] for the image frame at t time.
4. Improved Product HMM
Auditory and visual features have some synchronicity, with some asynchrony within a certain range. When people talk, mouth movement has already begun before the voice, and it takes time to close the mouth and return to the natural state after the voice, so visual information is usually ahead of auditory information by about 120 ms [15] , which is close to the average duration of a phoneme.
Therefore, asynchrony can be permitted in the auditory and visual training model. This paper proposes an improved product HMM model based on the product HMM method. For Chinese words recognition, which usually corresponds to five or six states, only one state migration is allowed between the audio and video streams as a result of the presence of asynchrony. Figure 2 shows theHMMmodel for the audio and video streams, and Figure 3 shows the topology of the corresponding product
Lager Image
Double stream hidden Markov model model for wordspeech recognition (⊗ is synchronization point).
Lager Image
Topological structure of the proposed product hiddenMarkov model.
HMM model.
We extract the video features and audio features during the training stage. Generally speaking, the frame rate of the phonetic features is higher than the frame rate of the video features, so we use interpolation for them after video-feature extraction in order to ensure training synchronicity in the data flow.
A bimodal speech feature vector consists of the observation vectors of audio features and video features. According to Bayes’ theorem, the classification result for the maximum posterior probability is
Lager Image
where W denotes some word and O a and O υ denote the vector sequence of audio and video features, respectively; if they are independent of each other, the joint output probability is P(O a ,O υ )= P(O a │ W)P(O υ │ W).
A multiple-model method needs to combine the audio and video streams in terms of formulas. For the improved product HMM method, we assume that the audio and video streams are conditionally independent, so observation vector and transition probabilities at t time are
Lager Image
Lager Image
where a a ij is the transition probability from state i to state j among the audio HMM and a υ mn is the transition probability from state m to state n in the video HMM. The output probability of state ij is
Lager Image
The weight coefficient ⋋ (0 < ⋋ < 1) reflects the different weights of the two modes, which depend on the recognition performance of each mode under the different noise conditions. Zhao et al.’s experiments [16] showed the following linear relationship.
Lager Image
The weight coefficient is greater (⋋ > 0:825) when the speech signal noise is smaller (SNR>25 dB), illustrating that audio information plays a larger role in the decision-making. The weight coefficient decreases with an increase of noise, illustrating that the proportion of video information increases gradually in the decision-making. When SNR = 5 dB, ⋋ ≈ 0.5, illustrating that they have the same importance at that moment.
5. Weight Optimization Based on the GPD Algorithm
Using the existing training data, based on formula (7), and calculating the weight coefficient according to the GDP algorithm [17] , this training algorithm defines a misclassification distance that provides the correct distance between the class information and other information. The misclassification distance is computed using a smooth loss function and is minimized.
For a well-trained product HMM model, according to the Nbest recognized hypotheses, supposing x as the unknown word vector,
Lager Image
(⋋) as the logarithmic likelihood values of correctly identifying the x in the model,
Lager Image
(⋋) as the logarithmic likelihood values of the N-best candidate vector of misrecognized words, the misclassification distance is
Lager Image
Where η is a smoothing parameter and N is the total candidate number. The total loss function of x after smoothing is
Lager Image
The purpose of training is to minimize Lost(⋋) so as to minimize the error. The recursive formula of weight is
Lager Image
The condition is
Lager Image
is a finite positive matrix sequence. The algorithm converges as k→ ∞. The recursion stops, and the final weight is obtained when the difference of the recursive value is smaller than a given threshold.
6. Experiment Results and Analysis
- 6.1 Experiment Dataset
We constructed a bimodal corpus and selected from seven people (five for male and two for female). The corpus contains 50 Chinese words, totaling 750 words for the seven people, including 550 words for training and the others for recognition. As needed, we added some noises of different intensity for recognition speech words. The sampling rate of the speech signal is 22.05 kHz. The quantitative value is 16 bits. The frame length of the speech frame is 28 ms. The frame shift is 14 ms, using a Hamming window as the window function. In order to ensure the synchronization of the video and audio streams after the extraction of video and audio features, we interpolated the video features and input these feature parameters into the improved product HMM, shown as in Figure 4 . The video features were the lip parameter v t and the dynamic parameter v t , totaling 14 dimensions. To determine the final scheme of the audio features, we preselected three sets of features as follows:
(1) MFCC feature: MFCC (14 dimensions) + ΔMFCC (14 dimensions) + normalized audio energy, totaling 29 dimensions;
(2) LPCC feature: LPCC (14 dimensions) + ΔLPCC (14 dimensions) + normalized audio energy, totaling 29 dimensions;
(3) MFCC-LPCC joint feature: MFCC (14 dimensions) + ΔMFCC (14 dimensions) + LPCC (14 dimensions) + ΔLPCC (14 dimensions) + normalized audio energy, totaling 57 dimensions.
Lager Image
Proposed bimodal speech recognition process.
Bimodal speech recognition rate of audio feature parameters for different SNRSNR, signal-to-noise ratio; MFCC, Mel-frequency cepstral coefficient; LPCC, linear predictive cepstrum coefficient.
Lager Image
Bimodal speech recognition rate of audio feature parameters for different SNR SNR, signal-to-noise ratio; MFCC, Mel-frequency cepstral coefficient; LPCC, linear predictive cepstrum coefficient.
- 6.2 Experiment Results
Under different SNR conditions, according to the recognition result ( Table 1 ), we selected the MFCC-LPCC joint feature to train modal and recognize speech.
The convergence performance of the GPD algorithm depends on the choice of the feature parameters. After many experiments, we set N = 2 in formula (8), α = 0:1 in formula (9), ϵ k = 50=k in formula (10), and convergence threshold T c = 0.01. The recursion stopped when the recursive interpolation ▽W=W k+1 –W k < T c .
Four speech recognition schemes were proposed for comparing the bimodal speech recognition performance of the different methods.
(1) For the single-modal speech recognition of the audio, the single audio model adopts the classical left-to-right no-cross HMM modal [18] . Considering that the training objects are Chinese words, five or six states are selected and the output probability density function is a four-dimension mixed Gaussian density distribution. The audio parameters are the aforementioned 57 dimension MFCC-LPCC joint feature parameters.
(2) Based on the EI model [19] , the joint feature vector is composed of the audio feature vector and video feature vectors, which are inputted to the single HMM model for training, with the same modal architecture as in solution (1).
Lager Image
Speech recognition rate comparison of four solutions fordifferent signal-to-noise ratios (SNRs).
(3) The multiple HMM model, mixing audio and video features on the state layer, assigning different weights to the audio and video streams according to the principle of formula (6) [5 , 6] , requires special restrictions so as to maintain synchronization between the video and voice stream.
(4) This solution achieves word recognition through the improved product HMM model ( Figure 3 ) proposed here. Based on the assumption that the audio and video streams are independent of each other, it allows for a step-state deviation between them.
To facilitate the performance comparison, the audio and video feature parameters of solutions (2) and (3) are the same as that of solution (4), and the recognition results are as shown in Figure 5 .
- 6.3 Result Comparison and Analysis
In a low-noise environment, the difference in the recognition rate between the single-mode audio recognition method (solution 1) and the bimodal recognition method (solution 2-solution 4) is not large. When noise increases, the gap in recognition rate between single-mode and double-mode recognition will increase. This shows that in high-noise environments, the video information contributes more to the speech recognition rate.
For the bimodal recognition method, the EI model-based method did not assign the weights of video and audio information dynamically, leading to the lowest recognition rate for this method among all the bimodal recognition methods. The solution 3 in [5] only considered the speech recognition of isolated words, while we consider recognizing Chinese phrases, increasing the recognition length. In contrast to our method, [5] used a neural network model, which was slower in actual speech recognition. Furthermore, [5] did not consider the asynchronous nature of the video and audio signals, but simply used the weighted fusion of the speech and video streams. Different from other methods, the method proposed in this paper considered the asynchrony, as shown in Figure 5 , the speech recognition rate was slightly higher than in solution 3.
7. Conclusion
In this paper, with the aim of achieving effective speech recognition in noisy environments, a product HMM-based bimodal speech model allowing a one-step state offset to adapt to the asynchronous nature of the video signal and audio signal is proposed.
According to the corresponding relation between the weight coefficient and instantaneous SNR, the model can adjust the weight ratio of the audio and video streams adaptively. We selected a 50 Chinese 2-digit word corpuses as training and identification data, in contrast to other types of programs. The result showed that our proposed model can ensure the accuracy and robustness of speech recognition in a noisy environment.
Although we have shown effectiveness of the proposed bimodal HMM method on the Chinese 2-digit word recognition tasks, this scheme can be extended for multiword or continuous speech recognition tasks. In such cases, it would be a problem that, from the two modalities, we have unmanageably many possible word or phoneme sequence hypotheses to be considered for weighted integration. Also, more complicated interactions between the modalities can be modeled by using cross-modal associations and influences, where we still can use the proposed integration method for adaptive robustness. With these considerations, further investigation of applying the proposed system to complex tasks such as multiword or continuous speech recognition is in progress.
- Conflict of Interest
No potential conflict of interest relevant to this article was reported.
This work was supported by three projects of the ShandongProvince Higher Educational Science and Technology Program(J12LN09), China, Ji’nan Youth Science and Technology Star Project (No.20120104), China, and by a Natural Science FundationProject of Shandong Province, China (No. zr2011fm028).
Dasarathy B. V. 1997 “Sensor fusion potential exploitation:innovative architectures and illustrative applications” Proceedings of the IEEE 85 (1) 24 - 38    DOI : 10.1109/5.554206
Massaro D. W. 1987 “Speech Perception by Ear and Eye:A Paradigm for Psychological Inquiry” Lawrence Erlbaum Hillsdale, NJ
Lee J. S. , Park C. H. 2006 “Training hidden Markov modelsby hybrid simulated annealing for visual speech recognition” in Proceedings of 2006 IEEE International Conference on Systems, Man and Cybernetics Taipei Oct. 198 - 202
Kumatani K. , Nakamura S. , Shikano K. 2001 “An adaptiveintegration based on product HMM for audio-visualspeech recognition” in Proceedings of 2001 IEEE International Conference on Multimedia and Expo Tokyo 813 - 816    DOI : 10.1109/ICME.2001.1237846
Lee J. S. , Park C. H. 2008 “Robust audio-visual speechrecognition based on late integration” IEEE Transactions on Multimedia 10 (5) 767 - 779    DOI : 10.1109/TMM.2008.922789
Dupont S. , Luettin J. 2000 “Audio-visual speech modelingfor continuous speech recognition” IEEE Transactions on Multimedia 2 (3) 141 - 151    DOI : 10.1109/6046.865479
Davis S. , Mermelstein P. 1980 “Comparison of parametricrepresentations for monosyllabic word recognition incontinuously spoken sentences” IEEE Transactions on Acoustics, Speech and Signal Processing 28 (4) 357 - 366    DOI : 10.1109/TASSP.1980.1163420
Reynolds D. A. , Quatieri T. F. , Dunn R. B. 2000 “Speakerverification using adapted Gaussian mixture models” Digital Signal Processing 10 (1) 19 - 41
Peer I. , Rafaely B. , Zigel Y. 2008 “Reverberation matchingfor speaker recognition” in Proceedings of 2008 IEEE International Conference on Acoustics, Speech and Signal Processing Las Vegas, NV 4829 - 4832    DOI : 10.1109/ICASSP.2008.4518738
Bimbot F. , Bonastre J. F. , Fredouille C. , Gravier G. , Magrin-Chagnolleau I. , Meignier S. , Merlin T. , Ortega-Garcia J. , Petrovska-Delacretaz D. , Reynolds D. A. 2004 “A tutorial on text-independent speaker verification” EURASIP Journal on Advances in Signal Processing 2004 (4) 430 - 451    DOI : 10.1155/S1110865704310024
Neti C. , Potamianos G. , Luettin J. , Matthews I. , Glotin H. , Vergyri D. , Sison J. , Mashari A. , Zhou J. 2000 “Audio visualspeech recognition,” in Final Workshop 2000 Report Center for Language and Speech Processing Baltimore
Rabiner L. R. , Juang B. H. 1993 Fundamentals of SpeechRecognition PTR Prentice Hall Englewood Cliffs, NJ
Luettin J. , Potamianos G. , Neti C. 2001 “Asynchronousstream modeling for large vocabulary audio-visual speechrecognition” in Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing Salt Lake City, UT 169 - 172    DOI : 10.1109/ICASSP.2001.940794
Zhao H. , Tang C. , Yu T. 2008 “Fast thresholding segmentationfor image with high noise” in Proceedings of 2008 International Conference on Information and Automation Changsha 290 - 295    DOI : 10.1109/ICINFA.2008.4608013
Xie Lei , Jiang D. 2004 “Audio-visual synthesis and synchronousasynchronous experimental research for bimodalspeech recognition” Journal of Northwestern Polytechnical University 22 (2) 171 - 175
Zhao H. , Gu Y. , Tang C. 2009 “Research of relationshipbetween weight coefficient of product HMM and instantaneousSNR in bimodal speech recognition” Journal of Computer Application 29 279 - 285
Adjoudani A. , Benot C. 1995 “On the integration of auditoryand visual parameters in an HMM-based ASR” in Proceedings NATO ASI Conference on Speechreading by Man and Machine: Models, Systems and Applications 461 - 471
Rabiner L. 1989 “A tutorial on hidden Markov models andselected applications in speech recognition” Proceedings of the IEEE 77 (2) 257 - 286    DOI : 10.1109/5.18626
Bregler C. , Omohundro S. M. 1995 “Nonlinear manifoldlearning for visual speech recognition” in Proceedings of 1995 5th International Conference on Computer Vision Cambridge, MA 494 - 499    DOI : 10.1109/ICCV.1995.466899