We present a robust speaker identification algorithm that uses novel features based on a soft bag-of-words representation and a simple Naive Bayes classifier. The bag-of-words (BoW) based histogram feature descriptor is typically constructed by identifying representative prototypes that summarize the low-level spectral features extracted from training data. In this paper, we define a generalization of the standard BoW. In particular, we define three types of BoW based on crisp voting, fuzzy memberships, and possibilistic memberships. We analyze our mapping with three common classifiers: the Naive Bayes classifier (NB), the K-nearest neighbor classifier (KNN), and support vector machines (SVM). The proposed algorithms are evaluated using large datasets that simulate medical crises. We show that the proposed soft bag-of-words feature representation achieves a significant improvement over state-of-the-art methods.
The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatric Critical Care Medicine at the University of Louisville makes extensive use of simulation in training teams of nurses, medical students, residents, and attending physicians. These simulation sessions involve trained actors simulating family members in various crisis scenarios. Sessions involve 4 to as many as 9 people and last from approximately 20 minutes to one hour. They are scheduled approximately twice per week and are recorded as video data. After each session, the physician/instructor must manually review and annotate the recording and then debrief the trainees on the session. The goal is to enhance the care of children and strengthen interdisciplinary and clinician-patient interactions.
The physician responsible for the simulation has recorded hundreds of sessions, and has realized that the manual process of review and annotation is labor intensive and that retrieval of specific video segments (based on speaker or what was said) is not trivial. Using machine learning methods, we have developed a speaker segmentation and identification system that can provide the physician with automated and efficient methods to semantically index and retrieve specific segments from the large collections of simulation sessions.
The architecture of this system is illustrated in the figure below. It has two main components: offline training and online testing. In the offline training, audio streams are first extracted from the training videos. Then, a divide-and-conquer based speaker segmentation method is used to partition the speech sequence into homogeneous segments. Finally, a classifier is trained to discriminate between segments that correspond to different speakers. In the online testing, the input consists of an unlabeled video recording. First, the audio component is extracted and segmented. Then, each segment is labeled by the classifier. As a result, our system identifies “who spoke and when”. In this paper, we focus on developing an efficient and accurate algorithm for speaker identification.
Overview of the proposed speaker segmentation and identification system
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed soft bag-of-words feature representation method. Section 4 discusses our experiments. Finally, concluding remarks are given in Section 5.
2. Related Works
- 2.1 Speaker Recognition
Speaker identification, classifying speech utterances into different speaker classes, and speaker verification, verifying a person’s claimed identity from his/her voice, are generally referred to as speaker recognition. Several features and classification methods have been proposed for this task. For instance, Mel frequency cepstral coefficients (MFCC), which take into account how humans perceive the difference between sounds of different frequencies, are among the most commonly used features. Perceptual linear predictions (PLP) and linear prediction cepstral coefficients (LPCC) are two other common features that rely on psychophysically based spectral transformations and linear prediction. These features may not always achieve good performance, especially in noisy environments, and many other features have been proposed. For instance, Wang proposed combining MFCC features and phase information for speaker identification. Li proposed an auditory-based feature extraction algorithm using a set of cochlear filter bank modules. In other work, Gabor filtering was applied to speech spectrum features, and a nonnegative tensor factorization method was used to extract more robust features. Additional feature representations can be found in the literature.
Most of the above methods extract features from small overlapping windows. Thus, each speech segment can be represented by a large number of features. Moreover, since speech segments can have different durations, they will be represented by different numbers of features. To overcome this limitation, the above features are usually summarized by a small fixed number of representatives. For instance, the Gaussian mixture model (GMM), with a universal background model (UBM) for speaker adaptation, has been widely used for speaker recognition. In this approach, GMM adaptation is applied to the UBM to learn each speaker, and log-likelihood scores with nonlinear normalization are used for speaker discrimination. In other work, the GMM mean supervector, which represents variable-size segments with a fixed dimension by concatenating all adapted Gaussian mean vectors, was combined with a support vector machines (SVM) classifier. This approach has proven to be one of the most effective methods for speaker recognition.
Another alternative approach, called possibilistic histogram features (PHF), has also been proposed. The PHF is inspired by the “bag of words” concept used in information retrieval. It identifies a fixed set of representative prototypes, and each audio segment is mapped to the closest prototype. The relative frequency of occurrence of each prototype is used as the feature vector of the audio segment. The PHF has been used with a KNN classifier, and its performance was limited. On one hand, a reduced vocabulary cannot represent all variations within the features. On the other hand, a larger vocabulary can improve the feature representation, but can also degrade the KNN classifier.
- 2.2 Feature Representation with Bag-of-Words (BoW)
The bag-of-words model has been widely used in various applications, such as document classification, computer vision, and speech and speaker recognition. In document classification, the feature is constructed based on the frequency of occurrence of each word. Generally, there are two different models to represent a document. One model uses a vector of binary attributes to indicate whether a word occurs in the document; this representation can be modeled as a multivariate Bernoulli distribution. The other model takes the number of word occurrences into account and represents the document by a sparse histogram of word frequencies; this representation can be modeled as a multinomial model. For both models, the Naive Bayes classifier is commonly used for classification.
In computer vision, a bag of words is a vector of frequency counts over a vocabulary of local image features. It has been used mainly in image/video scene classification and retrieval. For instance, a “bag of keypoints” method was proposed based on vector quantization of affine invariant descriptors of image patches, with two different classifiers, Naive Bayes and SVM, applied to semantic visual category classification. Similarly, a set of viewpoint invariant region descriptors was extracted to search for and localize all occurrences of a given query object in a video. In this approach, a visual vocabulary was built by vector quantizing the descriptors into clusters. Using the standard indexing method of text retrieval, the term frequency-inverse document frequency (TF-IDF) was computed and the cosine similarity was used for retrieval.
The BoW has also been used for the analysis of speech data. In one approach, high-frequency keywords were selected by computing frequent reflexive words and word pairs and modeling them via word-based HMM models. Integrating this advantage of text-dependent modeling into traditional GMM-based text-independent speaker recognition was shown to improve performance. In another approach, a bag-of-words (BoW)-style feature representation, which quantizes the observed direction of arrival (DOA) powers into discrete “word” samples, was developed to solve the speaker-clustering problem. A time-varying probabilistic model was combined with the DOA information calculated from a microphone array to estimate the number and locations of the speakers.
- 2.3 BoW Feature Representation with Naive Bayes Classifier
Assume that we have a set of labeled speech segments and representative vocabularies (i.e. codebook or cluster centers) $W = \{w_1, \ldots, w_K\}$. Each segment is represented by a histogram $h = [h(w_1), \ldots, h(w_K)]$, where $h(w_k)$ denotes the relative frequency of the occurrence of word $w_k$. To classify a new test sample with histogram $h$, Bayes’ rule is applied and the maximum a posteriori score is used for prediction:

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{k=1}^{K} P(w_k \mid c)^{h(w_k)} \tag{1}$$

where $P(c)$ is the a priori probability of class $c$, and the class-conditional probability $P(w_k \mid c)$ denotes the probability of word $w_k$ occurring in class $c$ and can be estimated using:

$$P(w_k \mid c) = \frac{\sum_{i \in c} h_i(w_k)}{\sum_{l=1}^{K} \sum_{i \in c} h_i(w_l)} \tag{2}$$

In order to avoid zero probability estimates in (2), Laplace smoothing is frequently used, and (2) can be replaced with:

$$P(w_k \mid c) = \frac{1 + \sum_{i \in c} h_i(w_k)}{K + \sum_{l=1}^{K} \sum_{i \in c} h_i(w_l)} \tag{3}$$
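For illustration, the maximum a posteriori decision with Laplace-smoothed word probabilities described above can be sketched as follows. This is a minimal Python/NumPy sketch, not the paper’s implementation; the function names are ours, and the computation is done in log space for numerical stability.

```python
import numpy as np

def train_nb(histograms, labels, n_classes):
    """Estimate log-priors and Laplace-smoothed log word probabilities per class."""
    K = histograms.shape[1]
    log_prior = np.zeros(n_classes)
    log_pwc = np.zeros((n_classes, K))
    for c in range(n_classes):
        Hc = histograms[labels == c]
        log_prior[c] = np.log(len(Hc) / len(histograms))
        counts = Hc.sum(axis=0)
        # Laplace smoothing: add 1 to each word count, K to the total.
        log_pwc[c] = np.log((1.0 + counts) / (K + counts.sum()))
    return log_prior, log_pwc

def predict_nb(h, log_prior, log_pwc):
    """Maximum a posteriori class for one BoW histogram h."""
    return int(np.argmax(log_prior + log_pwc @ h))
```

In log space the product in (1) becomes a dot product between the histogram and the per-class log word probabilities, so prediction is a single matrix-vector operation.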
3. Soft BoW Audio Feature Representation
In this paper, we propose a generalization of the BoW feature representation. In addition to the standard binary voting, where each sample contributes to each keyword with a binary value (1 if the keyword is the closest one to the sample and 0 otherwise), we propose a generalization that uses soft voting. We discuss the advantages and disadvantages of each voting scheme. We also show that the soft BoW representations with a Naive Bayes classifier outperform existing methods for speaker identification.
- 3.1 Visual Vocabulary Construction
Assume that each speaker $s$, $s = 1, \ldots, S$, has a training set of $N_s$ low-level features, that is, $X^s = \{x^s_1, \ldots, x^s_{N_s}\}$, where each $x^s_j$ is a $D$-dimensional feature vector extracted from the $j$-th segment of the $s$-th speaker. The first step consists of summarizing each $X^s$ by a set of $C$ representative prototypes $\{w^s_1, \ldots, w^s_C\}$. This quantization step is achieved by partitioning $X^s$ into $C$ clusters and letting $w^s_i$ be the centroid of the $i$-th partition. Any clustering algorithm can be used for this task. In this paper, we report the results using the Fuzzy C-Means (FCM) algorithm. The FCM partitions the $N_s$ samples into $C$ clusters by minimizing the sum of within-cluster distances, i.e.,

$$J = \sum_{i=1}^{C} \sum_{j=1}^{N_s} u_{ij}^{m} \, d^2_{ij} \tag{4}$$

where $d_{ij}$ refers to the Euclidean distance between $x^s_j$ and $w^s_i$, and $u_{ij} \in [0, 1]$ represents the membership of feature vector $x^s_j$ in cluster $i$ and satisfies the constraints:

$$\sum_{i=1}^{C} u_{ij} = 1 \;\; \forall j, \qquad 0 < \sum_{j=1}^{N_s} u_{ij} < N_s \;\; \forall i \tag{5}$$

Each prototype, $w^s_i$, is a representative of cluster $i$ that summarizes a group of similar speech segments. Let $\sigma^2_i$ be the variance of all features assigned to cluster $i$. After clustering, the $C$ prototypes obtained by partitioning the data of each speaker $s$ are all combined to form a dictionary or a codebook with $K = S \times C$ words, where $S$ is the number of speakers.
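The per-speaker quantization and codebook concatenation described above can be sketched as follows. This is a minimal Fuzzy C-Means in Python/NumPy; the random initialization, fixed iteration budget, and function names are our illustrative choices, not the paper’s exact implementation.

```python
import numpy as np

def fcm(X, C, m=2.0, n_iter=50, seed=0):
    """Minimal Fuzzy C-Means: returns (C x D) prototypes and (C x N) memberships."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=C, replace=False)]   # initialize prototypes
    for _ in range(n_iter):
        # Squared Euclidean distances between every prototype and every sample (C x N).
        d2 = ((X[None, :, :] - W[:, None, :]) ** 2).sum(-1) + 1e-12
        U = d2 ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)              # memberships sum to 1 per sample
        Um = U ** m
        W = (Um @ X) / Um.sum(axis=1, keepdims=True)   # membership-weighted centroids
    return W, U

def build_codebook(per_speaker_features, C):
    """Cluster each speaker's features separately, then stack all prototypes."""
    return np.vstack([fcm(X, C)[0] for X in per_speaker_features])
```

With $S$ speakers and $C$ prototypes each, `build_codebook` returns the $K = S \times C$ codebook words used by the mappings in the next subsections.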
Instead of using the original feature space, the bag-of-words based histogram feature descriptor (BoW-HFD) approach maps each segment to a new space characterized by the $K$ clusters that capture the characteristics of the training data. Formally, this mapping is defined as

$$h(x) = [\mu_1(x), \ldots, \mu_K(x)] \tag{6}$$

where $\mu_k(x) \in [0, 1]$ is a measure of the belongingness of feature $x$ to the region represented by prototype $w_k$. This measure could be crisp, fuzzy, or possibilistic. These different mappings are described in the following subsections.
- 3.2 Crisp Mapping
In crisp mapping, each feature vector $x$ is assigned a binary membership value to each “word” $w_k$ based on the distance between them. This mapping considers only the closest word (i.e. prototype) to $x$ and is defined as:

$$\mu^{crisp}_k(x) = \begin{cases} 1 & \text{if } k = \arg\min_{l} d_l(x) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

This mapping is used in the standard BoW approach and considers only the closest word. Thus, it is reasonable if $x$ is close to one word and far from the other words. However, if $x$ is close to multiple words (i.e., $x$ is located close to the clusters’ boundaries), then crisp mapping will not preserve this information.
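The crisp voting scheme can be sketched as follows (Python/NumPy; the function name is illustrative): each frame of a segment casts a single vote for its nearest codebook word, and the votes are normalized into a relative-frequency histogram.

```python
import numpy as np

def crisp_bow(frames, codebook):
    """Crisp BoW histogram: one vote per frame for the nearest prototype."""
    # Squared Euclidean distance from each frame to each codebook word (N x K).
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                                 # index of closest word
    h = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return h / h.sum()                                          # relative frequencies
```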
- 3.3 Fuzzy Mapping
Instead of using binary voting (as in eq. (7)), fuzzy mapping uses soft labels to allow for partial or gradual membership values. This type of labeling offers a richer representation of belongingness and can handle uncertain cases. In particular, a sample $x$ votes for each word $w_k$ in the codebook with a membership degree $\mu_k(x) \in [0, 1]$ satisfying

$$\sum_{k=1}^{K} \mu_k(x) = 1 \tag{8}$$

Many clustering algorithms use this type of labels to obtain a fuzzy partition. In the proposed fuzzy BoW (F-BoW) approach, we use the memberships derived within the Fuzzy C-Means (FCM):

$$\mu^{fuzzy}_k(x) = \frac{1}{\sum_{l=1}^{K} \left( d_k(x) / d_l(x) \right)^{2/(m-1)}} \tag{9}$$

where $m \in (1, \infty)$ is a constant that controls the degree of fuzziness. In (9), $d_k(x)$ is the distance between feature vector $x$ and the “word” summarizing cluster $k$. To take into account the shape of the clusters, we use

$$d^2_k(x) = \sum_{t=1}^{D} \frac{(x_t - w_{kt})^2}{\sigma^2_{kt}} \tag{10}$$

where $\sigma^2_{kt}$ is the variance of feature $t$ in cluster $k$ and $D$ is the dimensionality of the feature space.
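The fuzzy mapping can be sketched as follows (Python/NumPy; names are illustrative). Each frame distributes a unit vote across all codebook words with FCM-style memberships computed from variance-normalized distances, and the soft votes are accumulated over the segment.

```python
import numpy as np

def fuzzy_bow(frames, codebook, variances, m=2.0):
    """Fuzzy BoW: FCM-style memberships accumulated over all frames of a segment."""
    # Variance-normalized squared distance to each word (accounts for cluster shape).
    d2 = (((frames[:, None, :] - codebook[None, :, :]) ** 2)
          / variances[None, :, :]).sum(-1) + 1e-12               # N x K
    U = d2 ** (-1.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)   # each frame's memberships sum to 1
    h = U.sum(axis=0)                   # accumulate the soft votes
    return h / h.sum()
```

A frame equidistant from two words contributes 0.5 to each, which is exactly the boundary information the crisp mapping discards.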
- 3.4 Possibilistic Mapping
The fuzzy membership in (9) is a relative number that depends on the distances of $x$ to all prototypes. Thus, it does not distinguish between samples that are equally close to multiple prototypes and samples that are equally far from all prototypes.

An alternative approach to generate soft labels is based on possibility theory. Possibilistic labeling relaxes the constraint in (8) that the memberships across all words must sum to one. It assigns “typicality” values, $\mu^{poss}_k(x)$, that do not consider the relative position of the point with respect to all clusters. As a result, if $x$ is a noise point, then $\sum_{k} \mu^{poss}_k(x) \ll 1$, and if $x$ is typical of more than one cluster, we can have $\sum_{k} \mu^{poss}_k(x) > 1$. Many robust partitional clustering algorithms use this type of labeling in each iteration. In this paper, we use the membership function derived within the Possibilistic C-Means (PCM):

$$\mu^{poss}_k(x) = \frac{1}{1 + \left( d^2_k(x) / \eta_k \right)^{1/(m-1)}} \tag{11}$$

where $\eta_k$ is a cluster-dependent resolution/scale parameter and $m \in (1, \infty)$.

Robust statistical estimators, such as M-estimators and W-estimators, use this type of memberships to reduce the effect of noise and outliers.
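The possibilistic mapping can be sketched as follows (Python/NumPy; names are illustrative). For simplicity, the scale parameters $\eta_k$ are passed in directly; in practice they would be estimated during clustering. Note that, unlike the fuzzy case, a frame’s typicalities need not sum to one, so noise frames contribute little to the histogram.

```python
import numpy as np

def pcm_typicality(frames, codebook, variances, eta, m=2.0):
    """PCM-style typicality of each frame to each word; rows need not sum to 1."""
    d2 = (((frames[:, None, :] - codebook[None, :, :]) ** 2)
          / variances[None, :, :]).sum(-1)                       # N x K
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))

def possibilistic_bow(frames, codebook, variances, eta, m=2.0):
    """Possibilistic BoW histogram: accumulate typicalities, then normalize."""
    h = pcm_typicality(frames, codebook, variances, eta, m).sum(axis=0)
    return h / h.sum()
```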
4. Experimental Results and Discussion
- 4.1 Data Collection
Multiple datasets are used to validate and compare our proposed soft BoW-based audio feature representation with the Naive Bayes classifier for speaker identification. In particular, we use 15 medical simulation videos. We only use the audio information for speaker identification, as it contains most of the conversational information; the video resolution is low and carries little additional information (people are sitting with little movement, just talking). As shown in the table below, each simulation has four speakers (patient, patient’s friend, doctor, and nurse). Videos are recorded in different rooms and have different quality, with different levels of background noise and frequent interruptions. The content of the conversations involves similar topics. For all experiments reported in this paper, we use k-fold cross validation with k = 5. That is, for each video, we keep 80% of the data for training and use the remaining 20% for testing. We repeat this process 5 times, testing a different subset each time, and report the average of the 5 numbers.
Data collections used to validate the proposed speaker identification approach
- 4.2 Preprocessing
First, the audio component is extracted from the video. All speech files are single-channel data sampled at 22.05 kHz. Then, since silence segments provide no information about the speakers and may actually reduce the correct speaker identification rate, each audio stream is processed to identify and remove silence segments. We use a trainable support vector machine (SVM) classifier based on 3 low-level audio features (short-time energy, zero crossing rate, and spectral centroid) to discriminate between speech and nonspeech audio.
The remaining speech segments are decomposed into small frames using a 25 ms analysis window with 10 ms overlap. From each window, we extract MFCC, PLP, LPCC, and GFCC features. For GFCC, instead of using tensor decomposition, we simply average all Gabor filtered spectrum features along the scales and phases to reduce the computational complexity.
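The framing step can be sketched as follows (Python/NumPy; names are illustrative). We read the 10 ms figure as the frame shift (hop) between successive 25 ms windows, which is the common convention; this interpretation, and the tail-dropping behavior, are our assumptions rather than details stated in the paper.

```python
import numpy as np

def frame_signal(x, sr=22050, win_ms=25.0, hop_ms=10.0):
    """Slice a mono signal into fixed-length analysis frames (ragged tail dropped)."""
    win = int(sr * win_ms / 1000.0)   # 25 ms -> 551 samples at 22.05 kHz
    hop = int(sr * hop_ms / 1000.0)   # 10 ms shift between successive frames
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])
```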
For each extracted feature, we use the BIC algorithm to identify change points within the audio stream and partition it into homogeneous segments. Each segment is then processed by our proposed algorithm to identify the speaker. The number of speech segments identified by BIC for each video is reported in the table above. The average length of the segments is short due to the frequent interruptions during the conversations. We should note here that each video is processed independently since it involves different speakers. The reported results are the average over the 15 datasets.
- 4.3 Evaluation and Discussion
First, the same low-level features used to segment the audio stream (MFCC, PLP, LPCC, and GFCC) are also used for speaker identification. Next, bag-of-words features (C-BoW, F-BoW, and P-BoW) are constructed for each feature as described in Section 3. The initial number of prototypes is set to 100 per speaker, i.e. C = 100, resulting in a codebook with K = 400 words.
For each low-level feature, we evaluate the performance of the proposed mappings using 3 different classifiers: K-NN, Naive Bayes, and SVM. K-NN has the advantage of incorporating various distance measures. Naive Bayes is a simple and efficient classifier that has proven effective in classification problems that use the bag-of-words feature representation. SVM is one of the most commonly used classifiers. For each classifier, we compare the performance of the 3 proposed feature mapping methods.
For the K-NN classifier, we first experiment with several measures to compute the dissimilarity between two histogram features (i.e. vectors mapped to histograms using the bag-of-words representation). In particular, we use chi-square statistics (CS), histogram intersection (HI), Jensen-Shannon divergence (JS), Kolmogorov-Smirnov distance (KS), Kullback-Leibler divergence (KL), match distance (MD), diffusion distance (DD), and cosine distance (CD). The speaker recognition accuracies, averaged over the 15 datasets, using the MFCC features with a K-NN classifier (K = 7), are displayed in the table below. As can be seen, the cosine distance has the best performance for the crisp, fuzzy, and possibilistic bag-of-words representations. Similar results are obtained for the PLP, LPCC, and GFCC features. Thus, for the remaining experiments, the cosine distance is used within the K-NN classifier when comparing it to other classifiers.
Classification rate of the K-NN classifier using the proposed soft bag of words representation of MFCC features and various distance measures
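The cosine-distance K-NN used in this comparison can be sketched as follows (Python/NumPy; function names are ours): the test histogram is compared to every training histogram and the label is chosen by majority vote among the K nearest.

```python
import numpy as np

def cosine_distance(h1, h2):
    """Cosine distance (CD) between two BoW histograms: 1 - cosine similarity."""
    return 1.0 - (h1 @ h2) / (np.linalg.norm(h1) * np.linalg.norm(h2))

def knn_predict(h, train_H, train_y, k=7):
    """Majority vote among the k training histograms closest in cosine distance."""
    d = np.array([cosine_distance(h, t) for t in train_H])
    votes = train_y[np.argsort(d)[:k]]
    return int(np.bincount(votes).argmax())
```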
Next, we compare the speaker identification accuracy of the proposed soft BoW feature mappings using MFCC features with the K-NN, NB, and SVM classifiers. First, we notice that the NB classifier outperforms the K-NN and SVM classifiers for the crisp, fuzzy, and possibilistic cases. Second, on average, the soft (fuzzy and possibilistic) feature mappings outperform the crisp mapping. Similar results were obtained for the PLP, LPCC, and GFCC features.
Performance of the crisp, fuzzy, and possibilistic BoW using MFCC features with the KNN, SVM, and NB classifiers
In a second experiment, we compare our methods to 3 existing speaker identification algorithms: GMM-UBM, the GMM mean supervector with a K-NN classifier (SV-KNN) and an SVM classifier (SV-SVM), and PHF with a KNN classifier (PHF-KNN). For the GMM-UBM-based speaker identification, the UBM is estimated using all training features, while GMM adaptation is applied to the UBM to obtain each training speaker model. Then, log-likelihood scores are used for classification. For both the GMM-UBM and GMM mean supervector methods, we experiment with several values for the number of Gaussian components and set this parameter to 10. The results are reported in the comparison table below. As can be seen, for all 4 features, the proposed soft feature mappings coupled with the NB classifier outperform the state-of-the-art methods.
Comparison of the classification accuracy of the proposed soft BoW feature mappings using the NB classifier with GMM-UBM, GMM mean supervector with K-NN (SV-KNN) and SVM (SV-SVM), and PHF with KNN.
5. Conclusions
We proposed a soft feature mapping approach for speaker identification. Our approach uses the bag-of-words model to extract robust histogram descriptors from low-level spectral features. We formulated three kinds of feature mapping methods using crisp, fuzzy, and possibilistic membership functions.
Using 15 datasets, we showed that Naive Bayes is the best classifier to use with our soft mappings. We also showed that the proposed approach outperforms commonly used methods.
The proposed mappings provide more accurate speaker identification results. This allows physicians to analyze the simulation sessions more easily and to identify and retrieve speech segments for a given speaker more accurately.
In our future work, we will focus on fusing multiple histograms that map different features (e.g. MFCC, PLP, LPCC, and GFCC) and on applying ensemble learning approaches to further improve the accuracy of speaker identification.
Conflict of Interest No potential conflict of interest relevant to this article was reported.
References
- “Semantic indexing of video simulations for enhancing medical care during crises,” 11th International Conference on Machine Learning and Applications (ICMLA).
- “BIC-based speaker segmentation using divide-and-conquer strategies with application to speaker diarization,” IEEE Trans. on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2009.2024730
- “Speaker recognition: a tutorial,” Proceedings of the IEEE. DOI: 10.1109/5.628714
- S. S. Stevens and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch,” The Journal of the Acoustical Society of America. DOI: 10.1121/1.1915888
- “An overview of text-independent speaker recognition: from features to supervectors.” DOI: 10.1016/j.specom.2009.08.009
- “Speaker diarization: a review of recent research,” IEEE Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2011.2125954
- “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am. DOI: 10.1121/1.399423
- “Spoken language processing: a guide to theory, algorithm, and system development.”
- “Speaker identification by combining MFCC and phase information in noisy environment.”
- “Robust speaker identification using an auditory-based feature.”
- “Robust feature extraction for speaker recognition based on constrained nonnegative tensor,” J. Comput. Sci. Technol.
- “An overview of automatic speaker diarization systems,” IEEE Trans. Audio, Speech and Language Processing. DOI: 10.1109/TASL.2006.878256
- “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing. DOI: 10.1006/dspr.1999.0361
- “Text-independent speaker identification using GMM-UBM and frame level likelihood normalization,” International Symposium on Chinese Spoken Language Processing.
- “Feature mapping and fusion for music genre classification.”
- “A comparison of event models for naive Bayes text classification,” AAAI-98 Workshop on Learning for Text Categorization.
- “Visual categorization with bags of keypoints,” Proc. of ECCV International Workshop on Statistical Learning in Computer Vision.
- “Efficient visual search of videos cast as text retrieval,” IEEE Trans. on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2008.111
- “Text-constrained speaker recognition on a text-independent task.”
- “Probabilistic speaker diarization with bag-of-words representations of speaker angle information,” IEEE Trans. on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2011.2151858
- J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA.
- “MembershipMap: data transformation based on granulation and fuzzy membership aggregation,” IEEE Trans. Fuzzy Systems. DOI: 10.1109/TFUZZ.2006.879981
- Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons.
- F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions.
- An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
- S. S. Chen and P. S. Gopalakrishnan, “Speaker, environment and channel change detection and clustering via the Bayesian information criterion,” Proc. DARPA Broadcast News Transcription and Understanding Workshop.
- C. M. Bishop, Pattern Recognition and Machine Learning.
- “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision. DOI: 10.1023/A:1026543900054