Text-independent Speaker Identification Using Soft Bag-of-Words Feature Representation
International Journal of Fuzzy Logic and Intelligent Systems. 2014. Dec, 14(4): 240-248
Copyright © 2014, Korean Institute of Intelligent Systems
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : November 02, 2014
  • Accepted : December 11, 2014
  • Published : December 25, 2014
About the Authors
Shuangshuang Jiang
Multimedia Research Lab, CECS Dept., University of Louisville, Louisville, KY 40292, USA
Hichem Frigui
Multimedia Research Lab, CECS Dept., University of Louisville, Louisville, KY 40292, USA
Aaron W. Calhoun
Pediatrics Dept., University of Louisville, Louisville, KY 40202, USA

Abstract
We present a robust speaker identification algorithm that uses novel features based on a soft bag-of-words representation and a simple Naive Bayes classifier. The bag-of-words (BoW) based histogram feature descriptor is typically constructed by summarizing and identifying representative prototypes from low-level spectral features extracted from training data. In this paper, we define a generalization of the standard BoW. In particular, we define three types of BoW that are based on crisp voting, fuzzy memberships, and possibilistic memberships. We analyze our mapping with three common classifiers: Naive Bayes classifier (NB), K-nearest neighbor classifier (KNN), and support vector machines (SVM). The proposed algorithms are evaluated using large datasets that simulate medical crises. We show that the proposed soft bag-of-words feature representation approach achieves a significant improvement when compared to the state-of-the-art methods.
1. Introduction
The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatric Critical Care Medicine at the University of Louisville makes extensive use of simulation in training teams of nurses, medical students, residents, and attending physicians. These simulation sessions involve trained actors simulating family members in various crisis scenarios. Sessions involve 4 to as many as 9 people and last approximately 20 minutes to one hour. They are scheduled approximately twice per week and are recorded as video data. After each session, the physician/instructor must manually review and annotate the recording and then debrief the trainees on the session. The goal is to enhance the care of children and strengthen interdisciplinary and clinician-patient interactions [1] .
The physician responsible for the simulation has recorded hundreds of sessions and has found that the manual process of review and annotation is labor intensive and that retrieval of specific video segments (based on the speaker or on what was said) is not trivial. Using machine learning methods, we have developed a speaker segmentation and identification system that can provide the physician with automated and efficient methods to semantically index and retrieve specific segments from the large collections of simulation sessions.
The architecture of this system is illustrated in Figure 1. It has two main components. The first one is for offline training, and the second one is for online testing. In the offline training, first audio streams are extracted from the training videos. Then, the divide-and-conquer ( DAC 3) based speaker segmentation method [2] is used to partition the speech sequence into homogeneous segments. Finally, a classifier is trained to discriminate between segments that correspond to different speakers. In the online testing, the input consists of an unlabeled video recording. First, the audio component is extracted and segmented. Then, each segment is labeled by the classifier. As a result, our system will identify “who spoke and when”. In this paper, we focus on developing an efficient and accurate algorithm for speaker identification.
Figure 1. Overview of the proposed speaker segmentation and identification system
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed soft bag-of-words feature representation method. Section 4 discusses our experiments. Finally, concluding remarks are given in Section 5.
2. Related Works
- 2.1 Speaker Recognition
Speaker identification (classifying speech utterances into different speaker classes) and speaker verification (verifying a person’s claimed identity from his/her voice) are generally referred to as speaker recognition [3] . Several features and classification methods have been proposed for this task. For instance, Mel frequency cepstral coefficients (MFCC) [4] , which take into account how humans perceive the difference between sounds of different frequencies, are among the most commonly used features [5 , 6] . Perceptual linear predictions (PLP) [7] and linear prediction cepstral coefficients (LPCC) [8] are two other common features that rely on psychophysically based spectral transformations and linear prediction. These features may not always achieve good performance, especially in noisy environments, and many other features have been proposed. For instance, in [9] , Wang proposed combining MFCC features and phase information for speaker identification. In [10] , Li proposed an auditory-based feature extraction algorithm that uses a set of modules based on cochlear filter banks. In [11] , Gabor filtering was applied to speech spectrum features, and a nonnegative tensor factorization method was used to extract more robust features. Other feature representations can be found in [3 , 5 , 12] .
Most of the above methods extract features from small overlapping windows. Thus, each speech segment can be represented by a large number of features. Moreover, since speech segments can have different durations, they will be represented by different numbers of features. To overcome this limitation, the above features are usually summarized by a small fixed number of representatives. For instance, the Gaussian mixture model (GMM), with a universal background model (UBM) for speaker adaptation [13] , has been widely used for speaker recognition [5 , 14] . In [14] , GMM adaptation was applied to the UBM to learn each speaker. Then, log-likelihood scores with nonlinear normalization were used for speaker discrimination. In [5] , the GMM mean supervector, which represents variable-size segments with a fixed dimension by concatenating all adapted Gaussian mean vectors, was combined with a support vector machines (SVM) classifier. This approach was proven to be one of the most effective methods for speaker recognition.
An alternative approach, called possibilistic histogram features (PHF), was proposed in [1 , 15] . The PHF is inspired by the “bag of words” concept used in information retrieval. It identifies a fixed set of representative prototypes, and each audio segment is mapped to the closest prototype. The relative frequency of occurrence of each prototype is used as the feature vector of each audio segment. The PHF has been used with a KNN classifier and its performance was constrained. On one hand, a reduced vocabulary cannot represent all variations within the features. On the other hand, a larger vocabulary can improve the feature representation, but can also degrade the KNN classifier.
- 2.2 Feature Representation with Bag-of-Words (BoW)
The bag-of-words model has been widely used in various applications, such as document classification, computer vision, and speech and speaker recognition. In document classification, the feature is constructed based on the frequency of occurrence of each word [16] . Generally, there are two different models to represent the document. One model uses a vector of binary attributes to indicate whether a word occurs or does not occur in the document. This representation can be modeled as a multivariate Bernoulli distribution. Another model takes the number of word occurrences into account, and represents the document by a sparse histogram of word frequencies. This representation can be modeled as a multinomial model. For both models, the Naive Bayes classifier is commonly used for classification.
In computer vision, a bag of visual words is a vector of frequency counts of a vocabulary of local image features. It has been used mainly in image/video scenes classification and retrieval [17 , 18] . In [17] , a “bag of key points” method was proposed based on vector quantization of affine invariant descriptors of image patches. Two different classifiers, Naive Bayes and SVM, were applied for semantic visual categories classification. Similarly, in [18] , a set of viewpoint invariant region descriptors were extracted to search and localize all the occurrences of a given query object in a video. In this approach, a visual vocabulary was built through vector quantizing the descriptors into clusters. Using the standard indexing method used in text retrieval, the term frequency-inverse document frequency (TF-IDF) was computed and the cosine similarity was used for retrieval.
The BoW has also been used for the analysis of speech data. In [19] , the high-frequency keywords (e.g. you know , um , right , etc.) were selected by computing the frequent, reflexive words and word pairs, and modeling them via word-based HMM models. Integrating this advantage of text-dependent modeling into the traditional GMM-based text-independent speaker recognition was shown to improve the performance. In [20] , a bag-of-words (BoW)-style feature representation, which quantizes the observed direction of arrival (DOA) powers into discrete “word” samples, was developed to solve the speaker-clustering problem. In this approach, a time-varying probabilistic model was combined with the DOA information calculated from a microphone array to estimate the number and locations of the speakers.
- 2.3 BoW Feature Representation with Naive Bayes Classifier
Assume that we have a set of labeled speech segments X = { Xi }, C classes [ S 1 , …, Sj , …, SC ], and a vocabulary of representative words (i.e., codebook entries or cluster centers) V = { vt }. Let ft ( Xi ) denote the relative frequency of occurrence of word vt in segment Xi . To classify a new test sample, Xs , Bayes’ rule is applied and the maximum a posteriori score is used for prediction:
$$\hat{S}(X_s) = \arg\max_{S_j} P(S_j \mid X_s) = \arg\max_{S_j} \; P(S_j) \prod_{t=1}^{|V|} P(v_t \mid S_j)^{f_t(X_s)} \qquad (1)$$
In (1), P ( Sj ) is the a priori probability of class Sj , and the class-conditional probability P ( vt | Sj ) denotes the probability of word vt occurring in class Sj and can be estimated using:
$$P(v_t \mid S_j) = \frac{\sum_{X_i \in S_j} f_t(X_i)}{\sum_{s=1}^{|V|} \sum_{X_i \in S_j} f_s(X_i)} \qquad (2)$$
In order to avoid the zero probability estimation in (2), the Laplace smoothing is frequently used, and (2) can be replaced with:
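$$P(v_t \mid S_j) = \frac{1 + \sum_{X_i \in S_j} f_t(X_i)}{|V| + \sum_{s=1}^{|V|} \sum_{X_i \in S_j} f_s(X_i)} \qquad (3)$$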
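As a rough illustration of the classifier in this section, the sketch below trains and applies a Naive Bayes model on BoW histograms. It assumes the standard multinomial form, in which each word's log-probability is weighted by its relative frequency in the test segment, and it uses add-one (Laplace) smoothing. The function names `train_nb` and `predict_nb` and the `smoothing` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def train_nb(histograms, labels, smoothing=1.0):
    """Estimate log P(S_j) and smoothed log P(v_t | S_j) from labelled histograms."""
    n_words = len(histograms[0])
    class_totals = defaultdict(lambda: [0.0] * n_words)
    class_counts = defaultdict(int)
    for hist, label in zip(histograms, labels):
        class_counts[label] += 1
        totals = class_totals[label]
        for t, freq in enumerate(hist):
            totals[t] += freq  # accumulate f_t(X_i) over segments of this class
    model = {}
    for label, totals in class_totals.items():
        # Assumption: Laplace smoothing adds `smoothing` to every word total.
        denom = smoothing * n_words + sum(totals)
        log_prior = math.log(class_counts[label] / len(labels))
        log_cond = [math.log((smoothing + total) / denom) for total in totals]
        model[label] = (log_prior, log_cond)
    return model


def predict_nb(model, histogram):
    """Return the class with the highest (log) maximum a posteriori score."""
    def score(label):
        log_prior, log_cond = model[label]
        return log_prior + sum(f * lc for f, lc in zip(histogram, log_cond))
    return max(model, key=score)
```

Working in the log domain avoids numerical underflow when the codebook is large (for example, several hundred words).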
3. Soft BoW Audio Feature Representation
In this paper, we propose a generalization of the BoW feature representation. In addition to the standard binary voting, where each sample contributes to each keyword with a binary value (1 if the keyword is the closest one to the sample and 0 otherwise), we propose a generalization that uses soft voting. We discuss the advantages and disadvantages of each voting scheme. We also show that the soft BoW representations with a Naive Bayes classifier outperform existing methods for speaker identification.
- 3.1 Visual Vocabulary Construction
Assume that each speaker i has a training set of Ni low-level features, that is, $X_i = \{ x_j^i \mid j = 1, \ldots, N_i \}$, where $x_j^i \in \mathbb{R}^D$ is a D-dimensional feature vector extracted from the jth segment of the ith speaker.
The first step consists of summarizing each $X_i$ by a set of representative prototypes $P_i = \{ p_1^i, \ldots, p_{K_i}^i \}$. This quantization step is achieved by partitioning $X_i$ into Ki clusters and letting $p_k^i$ be the centroid of the kth partition. Any clustering algorithm can be used for this task. In this paper, we report the results using the Fuzzy C-means (FCM) [21] algorithm. The FCM partitions the Ni samples into Ki clusters by minimizing the sum of within-cluster distances, i.e.,
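$$J = \sum_{t=1}^{K_i} \sum_{j=1}^{N_i} \mu_{tj}^{m} \, d^2\left(x_j^i, p_t^i\right) \qquad (4)$$
In (4), d refers to the Euclidean distance between $x_j^i$ and $p_t^i$, m ∈ (1, ∞) is the fuzzifier, and U = [ μtj ] represents the membership of feature vector $x_j^i$ in cluster t [22] and satisfies the constraints:
$$\mu_{tj} \in [0, 1] \quad \text{and} \quad \sum_{t=1}^{K_i} \mu_{tj} = 1 \quad \text{for all } j \qquad (5)$$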
Each prototype, pk , is a representative of cluster ck that summarizes a group of similar speech segments. Let σk be the variance of all features xj assigned to cluster ck . After clustering, the Ki prototypes obtained by partitioning the data of speaker i , Xi , are all combined to form a dictionary or a codebook with $K = \sum_{i=1}^{N_{sp}} K_i$ words, where Nsp is the number of speakers.
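The codebook construction described above can be sketched as follows. This is a minimal, unoptimized Fuzzy C-means written for illustration only; it assumes the standard FCM alternating updates with squared Euclidean distances, a fixed number of iterations, and a deterministic initialization from evenly spaced samples. None of these choices, nor the names `fcm` and `build_codebook`, come from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update; each row sums to one."""
    U = []
    for x in X:
        inv = [(1.0 / max(sq_dist(x, c), eps)) ** (1.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U


def fcm(X, k, m=2.0, iters=50):
    """Cluster X into k fuzzy clusters; returns (centers, memberships)."""
    # Assumption: initialise centers from evenly spaced samples of X.
    centers = [list(X[(i * len(X)) // k]) for i in range(k)]
    dim = len(X[0])
    for _ in range(iters):
        U = fcm_memberships(X, centers, m)
        for t in range(k):
            weights = [u[t] ** m for u in U]
            total = sum(weights)
            centers[t] = [
                sum(w * x[c] for w, x in zip(weights, X)) / total for c in range(dim)
            ]
    return centers, fcm_memberships(X, centers, m)


def build_codebook(features_by_speaker, k_per_speaker, m=2.0):
    """Cluster each speaker's features separately and pool all prototypes."""
    codebook = []
    for speaker in sorted(features_by_speaker):
        centers, _ = fcm(features_by_speaker[speaker], k_per_speaker, m)
        codebook.extend(centers)
    return codebook
```

With four speakers and 100 prototypes per speaker, `build_codebook` would return 400 words, matching the codebook size used in the experiments.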
Instead of using the original feature space X , the bag-of-words based histogram feature descriptor (BoW-HFD) approach maps it to a new space H characterized by the K clusters that capture the characteristics of the training data. Formally, this mapping is defined as
$$\mathcal{X} \rightarrow \mathcal{H}, \qquad x_j \mapsto \left[ f_1(x_j), \ldots, f_K(x_j) \right] \qquad (6)$$
In (6), fi ( xj ) ∈ [0, 1] is a measure of belongingness of feature xj to cluster i represented by prototype pi . This measure could be crisp , fuzzy , or possibilistic [22] . These different mappings are described in the following subsections.
- 3.2 Crisp Mapping
In crisp mapping, each feature vector xj is assigned a binary membership value to each “word” i based on the distance between them. This mapping considers only the word (i.e., prototype) closest to xj and is defined as:
$$f_i(x_j) = \begin{cases} 1 & \text{if } i = \arg\min_{k \in \{1, \ldots, K\}} d(x_j, p_k) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
This mapping is used in the standard BoW approach [17] and considers only the closest word. Thus, it is reasonable if xj is close to one word and far from the other words. However, if xj is close to multiple words (i.e., xj is located close to the clusters’ boundaries), then, crisp mapping will not preserve this information.
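A minimal sketch of the crisp mapping is given below. It assumes squared Euclidean distance to the prototypes and summarizes a segment by the relative frequency of the winning words over its frames; the function name `crisp_bow` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def crisp_bow(frames, codebook):
    """Relative frequency of the nearest codebook word over a segment's frames."""
    counts = [0.0] * len(codebook)
    for x in frames:
        # Each frame casts a single binary vote for its closest prototype.
        nearest = min(range(len(codebook)), key=lambda t: sq_dist(x, codebook[t]))
        counts[nearest] += 1.0
    return [c / len(frames) for c in counts]
```

In this sketch a frame that lies almost midway between two prototypes, such as `[4.9, 4.9]` between `[0, 0]` and `[10, 10]`, still gives its entire vote to one word, which is the information loss near cluster boundaries discussed above.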
- 3.3 Fuzzy Mapping
Instead of using binary voting (as in eq. (7)), fuzzy mapping uses soft labels to allow for partial or gradual membership values. This type of labeling offers a richer representation of belongingness and can handle uncertain cases. In particular, a sample xj votes for each word i in the codebook with a membership degree $f_i(x_j)$ such that:
$$f_i(x_j) \in [0, 1] \quad \text{and} \quad \sum_{i=1}^{K} f_i(x_j) = 1 \qquad (8)$$
Many clustering algorithms use this type of labels to obtain a fuzzy partition. In the proposed fuzzy BoW (F-BoW) approach, we use the memberships derived within the Fuzzy C-Means (FCM) [21] algorithm, i.e.,
$$f_t(x_j) = \frac{\left( 1 / D_{jt} \right)^{1/(m-1)}}{\sum_{k=1}^{K} \left( 1 / D_{jk} \right)^{1/(m-1)}} \qquad (9)$$
where m ∈ (1, ∞) is a constant that controls the degree of fuzziness. In (9), Djt is the distance between feature vector xj and the “word” summarizing cluster t . To take into account the shape of the clusters, we use
$$D_{jt} = \sum_{k=1}^{M} \frac{\left( x_{jk} - p_{tk} \right)^2}{\sigma_{tk}} \qquad (10)$$
where $\sigma_{tk}$ is the variance of feature k of cluster t and M is the dimensionality of the feature space.
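The fuzzy mapping can be sketched as below. The code assumes the standard FCM membership form applied to a variance-scaled squared distance, and it assumes that a segment's histogram is the average of its frames' membership vectors. These assumptions, and the names `scaled_sq_dist`, `fuzzy_memberships`, and `fuzzy_bow`, are illustrative rather than taken from the paper.

```python
def scaled_sq_dist(x, prototype, variances):
    """Squared distance with each feature divided by the cluster's variance."""
    return sum((xk - pk) ** 2 / vk for xk, pk, vk in zip(x, prototype, variances))


def fuzzy_memberships(x, codebook, variances, m=2.0, eps=1e-12):
    """Fuzzy membership of one frame in every word; the values sum to one."""
    dists = [max(scaled_sq_dist(x, p, v), eps) for p, v in zip(codebook, variances)]
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_bow(frames, codebook, variances, m=2.0):
    """Assumption: the segment histogram is the mean membership over frames."""
    hist = [0.0] * len(codebook)
    for x in frames:
        for t, u in enumerate(fuzzy_memberships(x, codebook, variances, m)):
            hist[t] += u
    return [h / len(frames) for h in hist]
```

For example, with prototypes at 0 and 4, unit variances, and m = 2, a frame at 1 has scaled distances 1 and 9 and therefore memberships 0.9 and 0.1, while a frame at 2 splits its vote equally.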
- 3.4 Possibilistic Mapping
The fuzzy membership in (9) is a relative number that depends on the distance of xj to all prototypes. Thus, it does not distinguish between samples that are equally close to multiple prototypes and samples that are equally far from all prototypes.
An alternative approach to generate soft labels is based on possibility theory [22] . Possibilistic labeling relaxes the constraint in (8) that the memberships across all words must sum to one. It assigns “typicality” values, $f_i(x_j)$, that do not consider the relative position of the point to all clusters. As a result, if xj is a noise point, then $\sum_{i=1}^{K} f_i(x_j) \ll 1$, and if xj is typical of more than one cluster, we can have $\sum_{i=1}^{K} f_i(x_j) > 1$. Many robust partitional clustering algorithms [23 , 24] use this type of labeling in each iteration. In this paper, we use the membership function derived within the Possibilistic C-Means [22] , i.e.,
$$f_t(x_j) = \frac{1}{1 + \left( D_{jt} / \eta_t \right)^{1/(m-1)}} \qquad (11)$$
In (11), $\eta_t$ is a cluster-dependent resolution/scale parameter [22] and m ∈ (1, ∞).
Robust statistical estimators, such as M-estimators and W-estimators [25] , use this type of memberships to reduce the effect of noise and outliers.
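The behavior of possibilistic labels can be illustrated with the sketch below. It assumes the usual Possibilistic C-Means typicality function applied to a variance-scaled squared distance, with one scale parameter per word supplied by the caller in the hypothetical `etas` argument. How the scale parameters are estimated is not covered here.

```python
def scaled_sq_dist(x, prototype, variances):
    """Squared distance with each feature divided by the cluster's variance."""
    return sum((xk - pk) ** 2 / vk for xk, pk, vk in zip(x, prototype, variances))


def possibilistic_memberships(x, codebook, variances, etas, m=2.0):
    """Typicality of one frame with respect to each word, computed independently."""
    values = []
    for prototype, var, eta in zip(codebook, variances, etas):
        d = scaled_sq_dist(x, prototype, var)
        # Assumption: standard PCM form; no normalisation across words.
        values.append(1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0))))
    return values
```

Because each typicality is computed independently, a frame far from every prototype receives values whose sum is close to zero, while a frame close to two nearby prototypes can receive values whose sum exceeds one. This is the distinction that the fuzzy mapping cannot make.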
4. Experimental Results and Discussion
- 4.1 Data Collection
Multiple data sets are used to validate and compare our proposed soft BoW-based audio feature representation with the Naive Bayes classifier for speaker identification. In particular, we use 15 medical simulation videos. We use only the audio information for speaker identification because it carries most of the conversation information: the video resolution is low and adds no further information (people are sitting with little movement and just talking). As shown in Table 1 , each simulation has four speakers (patient, patient’s friend, doctor, and nurse). Videos are recorded in different rooms, and have different quality with different levels of background noise and frequent interruptions. The content of the conversations involves similar topics. For all experiments reported in this paper, we use k-fold cross validation with k = 5. That is, for each video, we keep 80% of the data for training and use the remaining 20% for testing. We repeat this process 5 times, testing on a different subset each time, and report the average of the 5 results.
Table 1. Data collections used to validate the proposed speaker identification approach
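The 5-fold protocol described in Section 4.1 can be sketched as follows. The paper does not state how segments are assigned to folds, so this example assumes a simple interleaved assignment without shuffling; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Assumption: sample i goes to fold i mod k (interleaved, no shuffling).
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each segment is used for testing exactly once, and the reported accuracy would be the mean of the k per-fold accuracies.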
- 4.2 Preprocessing
First, the audio component is extracted from the video. All speech files are single-channel data sampled at 22.05 kHz. Then, since silence segments provide no information about the speakers and may actually reduce the correct speaker identification rate, each audio stream is processed to identify and remove silence segments. We use a trainable support vector machines (SVM) classifier [26] based on 3 low-level audio features (short-time energy, zero crossing rate, and spectral centroid) to discriminate between speech and nonspeech audio.
The remaining speech segments are decomposed into small frames using a 25 ms analysis window with 10 ms overlap. From each window, we extract MFCC [4] , PLP [7] , LPCC [8] , and GFCC [11] features. For GFCC, instead of using tensor decomposition as proposed in [11] , we simply average all Gabor filtered spectrum features along the scales and phases to reduce the computational complexity.
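The framing step can be sketched as below. The example assumes that "overlap" means the portion shared by consecutive windows, so that the hop between window starts is 15 ms; if the paper instead intends a 10 ms shift, `hop` would change accordingly. The function name `frame_signal` is hypothetical.

```python
def frame_signal(samples, sample_rate=22050, window_s=0.025, overlap_s=0.010):
    """Split a 1-D sample sequence into fixed-length, overlapping frames."""
    win = int(window_s * sample_rate)        # 551 samples at 22.05 kHz
    overlap = int(overlap_s * sample_rate)   # 220 samples at 22.05 kHz
    hop = win - overlap                      # assumption: 15 ms between starts
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]
```

Trailing samples that do not fill a complete window are dropped in this sketch. A feature extractor (for example, MFCC) would then be applied to each frame.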
For each extracted feature, we use the BIC algorithm [27] to identify changing points within the audio stream and partition it into homogeneous segments. Each segment will then be processed by our proposed algorithm to identify the speaker. Table 1 displays the number of speech segments identified by BIC for each video. The average length of all segments is short due to the frequent interruption during the conversation. We should note here that each video segment is processed independently since it involves different speakers. The reported results are the average over the 15 datasets.
- 4.3 Evaluation and Discussion
First, the same low-level features used to segment the audio stream (MFCC, PLP, LPCC, and GFCC) are also used for speaker identification. Next, bag of words features (C-BoW, F-BoW, and P-BoW) are constructed for each feature as described in Section 3. The initial number of prototypes is set to 100 per speaker, i.e. Ki = 100, resulting in a codebook with K = 400 words.
For each low-level feature, we evaluate the performance of the proposed mapping using 3 different classifiers: K-NN, Naive Bayes, and SVM [28] . K-NN has the advantage of incorporating various distance measures. Naive Bayes is a simple and efficient classifier that has proved to be effective in classification problems that use the bag-of-words feature representation [16 , 17] . SVM is one of the most commonly used classifiers. For each classifier, we compare the performance of the 3 proposed feature mapping methods.
For the K-NN classifier, we first experiment with several measures, as discussed in [29] , to compute the dissimilarity between two histogram features (i.e., vectors mapped to histograms using the bag-of-words representation). In particular, we use chi-square statistics (CS), histogram intersection (HI), Jensen-Shannon divergence (JS), Kolmogorov-Smirnov distance (KS), Kullback-Leibler divergence (KL), match distance (MD), diffusion distance (DD), and cosine distance (CD). The speaker recognition accuracies, averaged over the 15 datasets, using the MFCC features with a K-NN classifier ( K = 7), are displayed in Table 2 . As can be seen, the cosine distance has the best performance for the crisp, fuzzy, and possibilistic bag-of-words representations. Similar results are obtained for the PLP, LPCC, and GFCC features. Thus, for the remaining experiments, the cosine distance will be used within the K-NN classifier to compare it to other classifiers.
Table 2. Classification rate of the K-NN classifier using the proposed soft bag of words representation of MFCC features and various distance measures
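Two of the histogram dissimilarity measures listed above, together with a K-NN vote, can be sketched as follows. The chi-square form shown is one common variant and may differ in scaling from the one used in the experiments; all function names are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(h1, h2):
    """One minus the cosine similarity of two histograms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    if n1 == 0.0 or n2 == 0.0:
        return 1.0  # convention chosen here for empty histograms
    return 1.0 - dot / (n1 * n2)


def chi_square_distance(h1, h2):
    """Assumption: symmetric chi-square, 0.5 * sum((a - b)^2 / (a + b))."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def knn_predict(query, train_hists, train_labels, k=7, distance=cosine_distance):
    """Majority vote among the k training histograms closest to the query."""
    ranked = sorted(range(len(train_hists)),
                    key=lambda i: distance(query, train_hists[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Cosine distance ignores the overall scale of the histograms, which may matter for possibilistic histograms, whose entries are not constrained to sum to one.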
In Figure 2 , we compare the speaker identification accuracy of the proposed soft BoW feature mappings using MFCC features with the K-NN, NB, and SVM classifiers. First, we notice that the NB classifier outperforms the K-NN and SVM classifiers for the crisp, fuzzy, and possibilistic cases. Second, on average, the soft (fuzzy and possibilistic) feature mappings outperform the crisp mapping. Similar results were obtained for the PLP, LPCC, and GFCC features.
Figure 2. Performance of the crisp, fuzzy, and possibilistic BoW using MFCC features with the KNN, SVM, and NB classifiers
In a second experiment, we compare our methods to 3 existing speaker identification algorithms: GMM-UBM [14] , GMM mean supervector [5] with a K-NN classifier (SV-KNN) and an SVM classifier (SV-SVM), and PHF [1] with a KNN classifier (PHF-KNN). For the GMM-UBM-based speaker identification, the UBM is estimated using all training features, while GMM adaptation is applied to the UBM to get each training speaker model. Then, log-likelihood scores are used for the classification. For both the GMM-UBM and GMM mean supervector methods, we experiment with several values for the number of Gaussian components and set this parameter to 10. The results are reported in Figure 3 . As can be seen, for all 4 features, the proposed soft feature mapping coupled with the NB classifier outperforms the state-of-the-art methods.
Figure 3. Comparison of the classification accuracy of the proposed soft BoW feature mappings using the NB classifier with GMM-UBM, GMM mean supervector with K-NN (SV-KNN) and SVM (SV-SVM), and PHF with KNN.
5. Conclusions
We proposed a soft feature mapping approach for speaker identification. Our approach uses a bag-of-words model to extract robust histogram descriptors from low-level spectral features. We formulated three kinds of feature mapping methods using crisp, fuzzy, and possibilistic membership functions.
Using 15 datasets, we showed that Naive Bayes is the best classifier to use with our soft mapping. We also showed that the proposed approach outperforms commonly used methods.
The proposed mappings provide more accurate speaker identification results. This allows physicians to analyze the simulation sessions more easily and to identify and retrieve speech segments for a given speaker more accurately.
In our future work, we will focus on the fusion of multiple histograms that map different features (e.g. MFCC, PLP, LPCC, and GFCC) and applying ensemble learning approaches to further improve the accuracy of the speaker identification.
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
References
[1] Jiang S., Frigui H., Calhoun A. 2012 “Semantic indexing of video simulations for enhancing medical care during crises,” 11th International Conference on Machine Learning and Applications (ICMLA) 520 - 525
[2] Cheng S., Wang H., Fu H. 2010 “BIC-based speaker segmentation using divide-and-conquer strategies with application to speaker diarization,” IEEE Trans. on Audio, Speech, and Language Processing 18 (1) 141 - 157    DOI : 10.1109/TASL.2009.2024730
[3] Campbell J. 1997 “Speaker recognition: a tutorial,” Proceedings of the IEEE 85 (9) 1437 - 1462    DOI : 10.1109/5.628714
[4] Stevens S. S., Volkmann J., Newman E. B. 1937 “A scale for the measurement of the psychological magnitude pitch,” The Journal of the Acoustical Society of America 8 (3) 155 - 210    DOI : 10.1121/1.1915888
[5] Kinnunen T., Li H. 2010 “An overview of text-independent speaker recognition: from features to supervectors,” Speech Communication 52 12 - 40    DOI : 10.1016/j.specom.2009.08.009
[6] Miro X., Bozonnet S., Evans N., Fredouille C., Friedland G., Vinyals O. 2012 “Speaker diarization: A review of recent research,” IEEE Transactions on Audio, Speech, and Language Processing 20 (2) 356 - 370    DOI : 10.1109/TASL.2011.2125954
[7] Hermansky H. 1990 “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am. 87 (4) 1738 - 1752    DOI : 10.1121/1.399423
[8] Huang X., Acero A., Hon H. 2001 “Spoken language processing: a guide to theory, algorithm, and system development,” Prentice-Hall New Jersey
[9] Wang L., Minami K., Yamamoto K., Nakagawa S. 2010 “Speaker identification by combining MFCC and phase information in noisy environment,” ICASSP
[10] Li Q., Huang Y. 2010 “Robust speaker identification using an auditory-based feature,” ICASSP
[11] Wu Q., Zhang L., Shi G. 2010 “Robust feature extraction for speaker recognition based on constrained nonnegative tensor,” J. Comput. Sci. Technol. 25 (4) 745 - 754
[12] Tranter S., Reynolds D. 2006 “An overview of automatic speaker diarization systems,” IEEE Trans. Audio, Speech and Language Processing 14 1557 - 1565    DOI : 10.1109/TASL.2006.878256
[13] Reynolds D., Quatieri T., Dunn R. 2000 “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing 10 19 - 41    DOI : 10.1006/dspr.1999.0361
[14] Zheng R., Zhang S., Xu B. 2004 “Text-independent speaker identification using GMM-UBM and frame level likelihood normalization,” International Symposium on Chinese Spoken Language Processing 289 - 292
[15] Balti H., Frigui H. 2012 “Feature mapping and fusion for music genre classification,” ICMLA 306 - 310
[16] McCallum A., Nigam K. 1998 “A comparison of event models for naive Bayes text classification,” AAAI-98 workshop on learning for text categorization 41 - 48
[17] Csurka G., Dance C., Fan L., Willamowski J., Bray C. 2004 “Visual categorization with bags of keypoints,” Proc. of ECCV International Workshop on Statistical Learning in Computer Vision
[18] Sivic J. 2009 “Efficient visual search of videos cast as text retrieval,” IEEE Trans. on Pattern Analysis and Machine Intelligence 31 (4) 591 - 605    DOI : 10.1109/TPAMI.2008.111
[19] Boakye K., Peskin B. 2004 “Text-constrained speaker recognition on a text-independent task,” ODYS-2004 129 - 134
[20] Ishiguro K., Yamada T., Araki S., Nakatani T., Sawada H. 2012 “Probabilistic speaker diarization with bag-of-words representations of speaker angle information,” IEEE Trans. on Audio, Speech, and Language Processing 20 (2) 447 - 460    DOI : 10.1109/TASL.2011.2151858
[21] Bezdek J. C. 1981 Pattern Recognition with Fuzzy Objective Function Algorithms Kluwer Academic Publishers Norwell, MA, USA
[22] Frigui H. 2006 “MembershipMap: Data transformation based on granulation and fuzzy membership aggregation,” IEEE Trans. Fuzzy Systems 14 885 - 896    DOI : 10.1109/TFUZZ.2006.879981
[23] Duda R., Hart P., Stork D. 2000 “Pattern classification,” 2nd edition John Wiley & Sons New York
[24] Kaufman L., Rousseeuw P. 1990 “Finding groups in data: An introduction to cluster analysis,” Wiley New York
[25] Hampel F. R., Ronchetti E. M., Rousseeuw P. J., Stahel W. A. 1986 “Robust statistics: the approach based on influence functions,” Wiley New York
[26] Cristianini N., Shawe-Taylor J. 2000 An Introduction to Support Vector Machines and Other Kernel-based Learning Methods first ed. Cambridge University Press
[27] Chen S. S., Gopalakrishnan P. S. 1998 “Speaker, environment and channel change detection and clustering via the Bayesian information criterion,” Proc. DARPA Broadcast News Transcription Understanding Workshop (Landsdowne, VA)
[28] Bishop C. M. 2006 “Pattern recognition and machine learning,” Springer
[29] Rubner Y., Tomasi C., Guibas L. 2000 “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision 40 (2) 99 - 121    DOI : 10.1023/A:1026543900054