Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification
Online Selective-Sample Learning of Hidden Markov Models for Sequence Classification
International Journal of Fuzzy Logic and Intelligent Systems. 2015. Sep, 15(3): 145-152
Copyright © 2015, Korean Institute of Intelligent Systems
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : August 14, 2015
  • Accepted : September 24, 2015
  • Published : September 25, 2015
Export by style
Cited by
About the Authors
Minyoung Kim

We consider an online selective-sample learning problem for sequence classification, where the goal is to learn a predictive model using a stream of data samples whose class labels can be selectively queried by the algorithm. Given that there is a limit to the total number of queries permitted, the key issue is choosing the most informative and salient samples for their class labels to be queried. Recently, several aggressive selective-sample algorithms have been proposed under a linear model for static (non-sequential) binary classification. We extend the idea to hidden Markov models for multi-class sequence classification by introducing reasonable measures for the novelty and prediction confidence of the incoming sample with respect to the current model, on which the query decision is based. For several sequence classification datasets/tasks in online learning setups, we demonstrate the effectiveness of the proposed approach.
1. Introduction
Sequences or time-series data are parts of our everyday life in diverse forms such as videos of image frames, speech signals, financial asset prices and meteorological records, to name a few. Sequence classification involves automatically assigning class labels to these instances in a meaningful way. Owing to the increasing demand for efficient indexing, search, and organization of a huge amount of sequence data, it has recently received significant attention in machine learning, data mining, and related fields. Unlike conventional classification of non-sequential vectorial data, sequence classification entails inherent difficulty originating from potentially variable-length and non-stationary structures.
Recent sequence classification approaches broadly adopt one of the two different frameworks. In the first framework, one estimates the similarity (or distance) measure between pairs of sequences, from which any kernel machines (e.g., support vector machines) or exemplarbased classifiers (e.g., nearest neighbor) can be employed. As the sequence lengths and sampling rates can differ from instance to instance, one often resorts to warping or alignment of sequences (e.g., dynamic time warping), alternatively, incorporating a specially-tailored similarity measure (e.g., spectral or string kernels [1] ).
The second framework is based on probabilistic sequence models, essentially aiming to represent an underlying generative model for sequence data. The hidden Markov model (HMM) is considered to be one of the most popular sequence models, and has previously shown broad success in various application areas including automatic speech recognition [2 , 3] , computer vision [4 6] , and bioinformatics [7] . y-based one in several aspects: certain statistical constraints such as Markov dependency structures can be easily imposed. In addition, the model-based approaches are typically computationally less demanding for training, specifically linear time in the number of training samples, while similarity-based methods require quadratic. Thus, throughout the study we focus on HMM-based sequence classification: the details of the model are thoroughly described in Section 2.
Unlike the traditional learning setup where all the labeled training samples are stored and available to a learning algorithm (i.e., batch learning ), the learning setup we deal with in this study is quite different. We consider the online selective-sample learning : it is basically a streaming data scenario where we receive an input sequence (without its class label) one at a time, and the technique is assumed to have no capability of storing the sample for future use. The learning algorithm has an option of either querying the class label or not, and this decision has to be made on the fly based solely on the current data sample (i.e., unable to look into previous samples). The model can then be updated with the sequence sample and the class label (if queried). Of course, there is a limit to the total number of queries that can be made, and therefore the learner’s main objective is to select samples for queries that are the most salient and important for learning an accurate model.
Considering the cost of obtaining class labels, which typically requires human experts endeavor, the online learning algorithms can be more favorable. Moreover, they are better suited for various applications, most notably the mobile computing environments, in which there are massive amount of data collected and observed whereas the computing platforms (e.g., mobile devices) have minimal storage and limited computing power. It is worth noting that the online selective-sample learning setup is different from the well-known active learning [8 , 9] in machine learning. In the active learning, the algorithm has full access to entire data samples (thus requiring data storage capabilities), and the goal is to output a subset for which the class labels are queried. In this sense, the online selective-sample learning can be more realistic and challenging than the active learning.
Recently there have been attempts to tackle the online selective-sample learning problem. In particular, the main idea of the latest aggressive algorithms (e.g., [10 , 11] ) is to decide to query labels for samples that appear to be novel (compared to the previous samples) with respect to the current model. Moreover, it is desirable to ask for labels for those data that have less confident class prediction under the current model, which is also intuitively appealing. However, most approaches are limited to vectorial (non-sequential) data and binary classification setups with the underlying classification model confined to the simple linear model.
We extend the idea to the multi-class sequence data classification problem within the HMM classification model. Specifically, we deal with the negative data log-likelihood of the incoming sequence as a measure of data novelty. Furthermore, the entropy of the class conditional distribution given the incoming sequence is considered as a measure of strength/confidence in class prediction. The proposed approach is not only intuitively appealing, but also shown to yield superior performance to the baseline approaches that make queries randomly or greedily. These results have been verified for several sequence classification datasets/problems.
The paper is organized as follows: after introducing a few notations and a formal problem setup, we describe the HMM-based sequence classification model that we deal with throughout the study in Section 2. Our main approach is described in Section 3, and the empirical evaluations are demonstrated in Section 4.
- 1.1 Notations and Problem Setup
We consider a K -way sequence classification problem, where we let c ∈ {1, ..., K } be the class variable and x = x 1 . . . x T be the sequence observation (the length T can vary from instance to instance). We assume each time slice x t ∈ ℝ d is a realvalued d -dimensional vector.
In the online selective-sample learning setup, the learner receives a sequence one at a time, and decides whether to query its class label or not. More formally, at the i -th stage, the algorithm receives a new sequence
PPT Slide
Lager Image
(length Ti ), and outputs the class prediction
PPT Slide
Lager Image
for xi from the current model. It then decides whether to query the true class label ci or not, and the learning algorithm may update the classification model using the observed data sample (either xi or ( ci, xi ), latter only when the query is made). In general, there are two (complementary) goals for the learner: to yield an accurate class prediction model and to make as few queries as possible. One possible way of enforcing the goals, which we adopt in this paper, is to introduce the budget B , the upper bound of the number of queries to be made, and devise an algorithm yielding the smallest classification error with the budget constraint.
2. HMM-based Sequence Classification Model
We consider a probabilistic sequence classification model P ( c , x) = P ( c P (x| c ), where P ( c ) is the class prior (modeled by the multinomial distribution over {1, . . . , K }), and P (x| c ) is the c -th HMM model (with c = 1, . . . , K ) that is responsible for generation of a sequence x under the class c . In this paper we use the Gaussian-emission HMM models, for which we provide formal descriptions below.
The HMM is composed of two generative components: (i) a (hidden) state sequence s = s 1 . . . sT is generated where each hidden state st takes a discrete value from {1, ..., S }) conforming to the 1st-order Markov dependency, (ii) at each time slice t , the observation x t is generated whose distribution (Gaussian) is determined by st . More formally, we have the following local conditional models and associated parameters:
PPT Slide
Lager Image
The parameters of the HMM are denoted by
PPT Slide
Lager Image
In typical cases, the state sequence s is not observed, and one has to deal with the observation likelihood which is marginalization of the full joint likelihood over all possible hidden sequences, namely P (x) = Σ s P (s, x). This marginalization can be done very efficiently (time linear in sequence length T ) using the dynamic programming [12] . Given an HMM model and the observation sequence x, the task of computing the hidden state posteriors (i.e., P (s|x)) is very important. Often referred to as probabilistic inference , it can be done efficiently using the forward/backward recursions [12] . Specifically, we denote two key quantities by: γt ( j ) := P ( st = j |x) and ξt ( j , l ) := P ( s t−1 = j , st = l |x).
While the parameter learning of the HMM can typically done by the EM algorithm [13] , one can directly optimize the log-likelihood log P (x) by standard gradient ascent. This is indeed exploited in our online selective-sample learning algorithm since we need to update a model with a single sample within the stochastic gradient optimization framework [14] . The gradient of the observation log-likelihood can be derived as follows:
PPT Slide
Lager Image
Returning to the classification model, we deal with K HMMs and treat each one as a class conditional density P (x| c ) for each class c . The class prior P ( c ) is modeled by a multinomial with a K -dimensional paraemter vector p where pk = P ( c = k ). Overall the model has parameters
PPT Slide
Lager Image
, and
PPT Slide
Lager Image
is the parameters of the k -th HMM. For simplicity, the hidden state cardinalities (i.e., S ) are assumed to be the same across all the component HMMs. We refer to this HMM-based classification model as SC-HMM. In the SC-HMM model, class prediction for an unseen test sample x = x 1 . . . x T is done by the maximum-a-posteriori (MAP) decision: c = arg max c P ( c |x) = arg max c P ( c , x).
3. Online Selective-Sample Learning for SC-HMM
Our online selective-sample learning algorithm for the SC-HMM model is motivated from the aggressive algorithm of [11] , and we briefly review their approach here. However, it should be noted that their approach is based on the simple linear model for binary classification of vectorial (non-sequential) data, hence should be differentiated from our extension.
In [11] a linear classification model y = sgn(w z) is considered where z ∈ ℝ d is the input observation vector, w is the parameter vector of the same dimension, and sgn(·) returns the sign (±1) of its argument. The so-called score w z is real valued, where its magnitude indicates the confidence in prediction (i.e., a larger score implies that z lies farther away from the decision boundary, and vice versa). We denote by zi the i -th sample received by the algorithm, and by δi a binary variable indicating whether i -th sample is queried (1) or not (0).
The model w is basically estimated by L2-regularized linear regression, which gives rise to the estimate w = A −1 b at stage i , where
PPT Slide
Lager Image
Note that (3) can be computed/updated in an online fashion. Then the model’s score is computed as:
PPT Slide
Lager Image
. Whether to query yi or not is then based on this confidence, namely we query if p is smaller than some threshold, and this decision strategy is employed in [10 , 11] . In [11] , it is further considered the novelty of the sample z i , which is measured by
PPT Slide
Lager Image
. This is based on the 0-mean Gaussian assumption for z, and we query yi if r is larger than certain threshold.
In their algorithm, due to the simple linear regression model and Gaussianity assumption, the query decision rule is easily derived and the update equation becomes simple. For sequence classification, the underlying model is a rather complex SC-HMM, and extension of the idea requires further endeavor. At each stage i , upon receiving a new sequence xi , our online selective-sample learning algorithm performs two main steps: 1) decide whether to query the true class label ci or not, and 2) update the model with the current sample, either ( ci , x i ) if queried or x i if not. We provide detailed derivations for each step in the subsequent sections.
- 3.1 Decision for Query
First, decision to query is made if the incoming sample meets either of two conditions with respect to the current model Θ. (Cond-1) − log P (x i ; Θ) ≥ τNLL indicates the current data sample attains high negative log-likelihood, or equivalently, the sample appears to be novel for the current model. The threshold τNLL is properly chosen. (Cond-2) − H ( p ( c |x i ; Θ)) ≤ τNCE implies that the entropy of the posterior class distribution for the current sample is high, meaning that the con- fidence in prediction is not very strong. Here
PPT Slide
Lager Image
is the entropy of the class posterior. Again we choose the threshold parameter τNCE adequately.
- 3.2 Model Update
The second step is to update the model Θ using the current data sample. The true class label ci is either available or not depending on the decision in the above step, and we let the binary variable δi record it (i.e., δi = 1(0) if queried (not)). To present our model update algorithm, we begin with the batch data learning formulation assuming all labeled and unlabeled data can be accessed. We then derive the stochastic gradient update rule from the batch formulation.
In the batch learning case with n data samples, the learning problem can be formulated with partially labeled data (sine we have selectively labeled samples) as follows:
PPT Slide
Lager Image
Here note that we can exploit even the unlabeled samples (the second term) by maximizing the marginal log-likelihood where the two objectives are balanced by the hyperparameter λ ≥ 0.
In the stochastic gradient ascent [14] , instead of computing the full gradient of (4), it only updates the model using the gradient for a single sample, say i , that is,
PPT Slide
Lager Image
which can be seen as an unbiased sample for the total gradient. The constant η > 0 is a learning rate which we fix throughout the iterations (e.g., η = 0.001).
The nice thing is that (5), although derived from a batch setup, can be computed in an online fashion, and it exactly forms our model update equations. To be more specific to the SC-HMM, the gradients can be computed by (for each k = 1, . . . , K ):
PPT Slide
Lager Image
where I (·) is the 1/0 indicator function, and the gradient in the latter exactly follows the HMM gradient (2). For the marginal log-likelihood term, since
PPT Slide
Lager Image
The initial model is chosen randomly and blindly, thus tends to select first a few samples for query, which is a desirable strategy considering that model is inaccurate with little labeled data received at initial stages. We stop the algorithm when the number of queries made reaches the budget constraint B . The proposed online selective-sample learning algorithm is summarized in Alg. 1.
PPT Slide
Lager Image
4. Empirical Evaluation
In this section we evaluate the classification performance of the proposed online selective-sample learning algorithm on several real-world time-series datasets including human gait/activity recognition and facial emotion prediction. For each dataset we form online learning setups by feeding a learning algorithm one sample at each stage randomly drawn from the data pool, where we restrict the algorithm to make queries up to B times. We test with two different values of budget B : 10% and 30% of the entire data samples. Once the algorithm uses up the budget, from then on it can only exploit the incoming sequences only (without class labels) to update the model.
Our algorithm is compared with two fairly basic online learning strategies: the first is to make queries at the first B stages (greedy approach; denoted by GREEDY), and the second one makes random queries (denoted by RANDOM). The former is reasonable in the sense that the model is initially incorrect, thus trying to learn with labeled data as early as possible. The related hyperparameters (e.g., the learning rate η and the impact of the marginal log-likelihood term λ in model update) are chosen identical across competing models for fair comparison. As a reference, we also contrast with the online learner that is not restricted by the budget constraint (i.e., it can make query every stage), which perhaps provides an upper bound for the classification accuracy (denoted by QRYALL).
As a performance measure, we accumulates the test errors up to n stages. That is, the prediction error of an online algorithm is:
PPT Slide
Lager Image
. We provide detailed descriptions of the datasets, experimental setups, and test results in the subsequent sections.
- 4.1 Human Gait Recognition
The classification task we deal with is to identify a person based on his/her gait sequence. We collect data from the speed-control human gait database [15 , 16] , where we focus on samples from 6 different subjects to form a K = 6-way multi-class classification setup. Each subject performs walking motion with four different speeds (0.7m/s, 1.0m/s, 1.3m/s, 1.6m/s), and the task is to predict a subject regardless of the walking speed. For the sequence representation, we take 3D motion captures of some marker points at the lower body parts (refer to [15] for details). We then select sub-sequences of lengths around 80 with random starting/ending positions.
For n = 648 sequences, we form online learning setups with two budget setups, B = 0.1 n and B = 0.3 n , and it is repeated 5 times to report the average test errors. The results are summarized in Table 1 . The reference QRYALL that makes unlimited queries, as expected, attains the lower bound for test errors of all competing methods. For both budget setups, the proposed approach consistently outperforms the other two methods, which can be attributed to its more sophisticated decision strategy based on sample novelty and/or model’s certainty on class prediction, rather than simple greedy or random querying strategies.
Accumulated test prediction errors (%) on human gait recognition data
PPT Slide
Lager Image
Accumulated test prediction errors (%) on human gait recognition data
It is also interesting to see that when the budget B increases from 10% to 30%, the difference between GREEDY and RANDOM becomes small, which can be explained as follows: as we have more labeled samples, it is less critical when the labels are asked. Also, for smaller B , the differences between our method and two competing ones are more significant, which emphasizes the effectiveness of the proposed method under a more restricted budget scenario, namely that when only a few queries are allowed to be made, the careful decision strategy in our proposed approach yields outstanding performance.
- 4.2 Facial Emotion Classification
Next we consider the problem of recognizing facial emotion from a video sequence comprised of facial image frames that undergo changes in facial expression. From the Cohn-Kanade facial expression video dataset [17] , we collect videos of two emotions, fear and happiness. Differentiating these two emotions are recognized as a hard problem in that visually they are very similar to each other. We tackle this binary classification task.
For the sequence representation, we use certain image features extracted from images as follows. We first estimate the tight bounding boxes for faces using the standard face detector (e.g., Viola and Jones [18] ), then normalize the sizes of the cropped face images. Then the Haar-like features are extracted for each frame where we follow the procedure similar to that of [19] . The last step is to reduce the dimensionality by principal component analysis. We obtain n = 139 sequences with lengths varying from 10 to 30 for about 90 different subjects, and there are 54 sequences for the fear class and 85 for happiness.
The test errors are depicted in Table 2 , where the reference QRYALL again gives lower bound on the test errors. Our proposed method outperforms the competing methods significantly and consistently for all budget setups, while for B = 0.3 n it attains error rate nearly close to the lower bound. The results again signify the superiority of the proposed query decision strategy.
Accumulated test prediction errors (%) on facial emotion classification data
PPT Slide
Lager Image
Accumulated test prediction errors (%) on facial emotion classification data
- 4.3 Human Activity Recognition
Last but not least, we tackle the problem of human activity recognition from a sequence of motion localization records. We obtain the dataset from the UCI machine learning repository [20] , where the original goal was to reconstruct locations and postures from the localization data recorded from wearable tags at articulation points of subjects performing actions [21] . In this experiment, we focus on the task of recognizing the activity type. For this purpose, we collect from three different subjects about 360 sequences of three activities: walking , lying , and sitting . For the sequence representation, we use the 3-dim velocity information evaluated from 3D coordinates data obtained from the left-ankle tag.
For this 3-way classification problem, we form online learning setups similarly as previous experiments. The test results are summarized in Table 3 . We again have similar behavior as the former two experiments. The performance of the proposed approach is outstanding: in particular, it even attains more accurate prediction with the smaller budget constraint. In this dataset, due to the rather limited feature information, the overall prediction performance is low, and under such scenarios, the sophisticated query decision in our algorithm is shown to yield viable solutions.
Accumulated test prediction errors (%) on human activity recognition data
PPT Slide
Lager Image
Accumulated test prediction errors (%) on human activity recognition data
5. Conclusion
In this paper we have proposed a novel online selective-sample learning algorithm for multi-class sequence classification problems. Under the HMM-based sequence classification model, we devise reasonable criteria for decision for query based on the data novelty (likelihood of the incoming data) and the class prediction confidence (entropy of the class posterior). For several sequence classification datasets/tasks in online learning setups, we have shown that the proposed algorithm yields superior prediction accuracy to greedy or random approaches even with a limited query budget. One current issue may be how to choose the threshold parameters adequately, and we leave it as our future work to investigate a systematic way to tune those.
Conflict of InterestNo potential conflict of interest relevant to this article was reported.
This study was supported by the Research Program funded by the Seoul National University of Science and Technology.
Minyoung Kim received his BS and MS degrees both in Computer Science and Engineering from Seoul National University, South Korea. He earned a PhD degree in Computer Science from Rutgers University in 2008. From 2009 to 2010 he was a postdoctoral researcher at the Robotics Institute of Carnegie Mellon University. He is currently an Assistant Professor in the Department of Electronics and IT Media Engineering at Seoul National University of Science and Technology in Korea. His primary research interest is machine learning and computer vision. His research focus includes graphical models, motion estimation/tracking, discriminative models/learning, kernel methods, and dimensionality reduction.
Leslie C. , Eskin E. , Noble W. S. 2002 “The spectrum kernel: A string kernel for SVM protein classification,” Pacific Symposium on Biocomputing 7 566 - 575
Juang B. H. , Rabiner L. R. 1985 “A probabilistic distance measure for hidden Markov models,” AT&T Technical Journal
Sha F. , Saul L. K. 2007 “Large margin hidden Markov models for automatic speech recognition,” Advances in Neural Information Processing Systems 19
Starner T. , Pentland A. 1995 “Real-time American sign language recognition from video using hidden Markov models,” International Symposium on Computer Vision
Wilson A. D. , Bobick A. F. 1999 “Parametric hidden Markov models for gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (9) 884 - 900    DOI : 10.1109/34.790429
Alon J. , Sclaroff S. , Kollios G. , Pavlovic V. 2003 “Discovering clusters in motion time-series data,” Computer Vision and Pattern Recognition
Durbin R. , Eddy S. , Krogh A. , Mitchenson G. 2002 Biological Sequence Analysis Cambridge University Press
Dasgupta S. , Kalai A. T. , Monteleoni C. 2005 “Analysis of perceptron-based active learning,” Conference on Learning Theory
Beygelzimer A. , Dasgupta S. , Langford J. 2009 “Importance weighted active learning,” International Conference on Machine Learning
Cesa-Bianchi N. , Gentile C. , Zaniboni L. 2006 “Worst-case analysis of selective sampling for linear classification,” Journal of Machine Learning Research 7 1205 - 1230
Crammer K. 2014 “Doubly aggressive selective sampling algorithms for classification,” International Conference on AI & Statistics
Rabiner L. R. 1989 “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE 77 (2) 257 - 286    DOI : 10.1109/5.18626
Dempster A. P. , Laird N. M. , Rubin D. B. 1977 “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society B 39 185 - 197
Duchi J. , Hazan E. , Singer Y. 2011 “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research 12 2121 - 2159
Tanawongsuwan R. , Bobick A. 2003 “Characteristics of time-distance gait parameters across speeds,” Graphics, Visualization, and Usability Center, Georgia Institute of Technology Atlanta, GA
Tanawongsuwan R. , Bobick A. 2003 “Performance analysis of time-distance gait parameters under different speeds,” International Conference on Audio and Video Based Biometric Person Authentication
Lien J. , Kanade T. , Cohn J. , Li C. 1999 “Detection, tracking, and classification of action units in facial expression,” Journal of Robotics and Autonomous Systems
Viola P. , Jones M. 2001 “Robust real-time object detection,” International Journal of Computer Vision 57 (2) 137 - 154
Yang P. , Liu Q. , Metaxas D. N. 2009 “Rankboost with l1 regularization for facial expression recognition and intensity estimation,” International Conference on Computer Vision
Hettich S. , Bay S. D. “The UCI KDD Archive,” University of California, Information and Computer Science Irvine ()
Kaluza B. , Mirchevska V. , Dovgan E. , Lustrek M. , Gams M. 2010 “An agent-based approach to care in independent living,” International Joint Conference on Ambient Intelligence Malaga, Spain