We pose pattern classification as a density estimation problem in which we consider mixtures of generative models under partially labeled data setups. Unlike traditional approaches that estimate the density everywhere in the data space, we focus on the density along the decision boundary, which can yield more discriminative models with superior classification performance. We extend our earlier work on the recursive estimation method for discriminative mixture models to semi-supervised learning setups where some of the data points lack class labels. Our model exploits the mixture structure in the functional gradient framework: it searches for the base mixture component model in a greedy fashion, maximizing the conditional class likelihoods for the labeled data and at the same time minimizing the uncertainty of class label prediction for the unlabeled data points. The objective can be effectively imposed as individual mixture component learning on weighted data; hence our mixture learning typically becomes highly efficient for popular base generative models such as Gaussians or hidden Markov models. Moreover, in contrast to the expectation-maximization algorithm, the proposed recursive estimation has several advantages, including no need for a predetermined mixture order and robustness to the choice of initial parameters. We demonstrate the benefits of the proposed approach on a comprehensive set of evaluations consisting of diverse time-series classification problems in semi-supervised scenarios.
1. Introduction
In a number of data-driven modeling tasks, a generative probabilistic model such as a Bayesian network (BN) is an attractive choice, advantageous in various aspects including the ability to easily incorporate domain knowledge, factorize complex problems into self-contained models, handle missing data and latent factors, and offer interpretability to results, to name a few [1, 2]. While such models are implicitly employed for joint density estimation, over the last few decades they have gained significant attention as classifiers. A model of this class, the Bayesian network classifier (BNC) [3], has been used in a wide range of applications, including speech recognition and motion time-series classification [4–9], and has been shown to yield performance comparable to dedicated discriminative classifiers such as support vector machines (SVMs).
A BNC model represents a density $P(c, \mathbf{x})$ over the class variable $c$ and observation $\mathbf{x}$. Learning its parameters with fully labeled data is traditionally posed as joint likelihood maximization (ML). However, as ML learning aims to fit the density for all points in the training data, it may not be directly compatible with the ultimate goal of class prediction. Instead, discriminative learning, typically conditional likelihood maximization (CML), optimizes the conditional distribution of the class given the observation, i.e., $P(c \mid \mathbf{x})$, achieving better classification performance than ML learning in a variety of situations [10–12]. Unfortunately, CML optimization is, in general, complex with non-unique solutions. Typical CML learning methods are based on gradient search, which can be computationally intensive.
A mixture model, as a rich density estimator, can potentially yield more accurate class prediction than a single BNC model. Mixture models have received significant attention in related fields and achieved success in diverse application areas [13–15]. In our earlier work [16] we proposed an efficient approach to the discriminative density estimation of mixture models. Here we briefly describe the algorithm introduced in [16]. The main goal is to exploit the properties of a mixture to alleviate the complexity of the learning task. This can be done in a greedy fashion, where a mixture component is added recursively to the current mixture with the objective of maximizing the conditional likelihood. Formulated within the functional gradient boosting framework [17], the procedure yields a weight distribution on the data with which a new mixture component can be learned. The derived weighting scheme effectively emphasizes the data points at the decision boundary, a desirable property similarly observed in SVMs.
The method is particularly efficient and easy to implement in that searching for a new mixture component can be done by ML learning with weighted data, and hence is suited to domains with complex component models such as hidden Markov models (HMMs) in time-series classification, for which parametric gradient search is usually computationally intensive. Compared to the conventional expectation-maximization (EM) algorithm, the recursive estimation approach has the crucial advantages of ease of model selection (i.e., estimating the mixture order) and robustness to the choice of initial model parameters.
Although our earlier approach was limited to fully supervised settings, in this paper we extend it to semi-supervised learning setups where we can make use of a large portion of unlabeled data points in conjunction with a few labeled data points. We incorporate the minimum entropy principle of [18] into our recursive mixture estimation framework, where the unlabeled data points are exploited in such a way that the model's uncertainty in class prediction is maximally reduced. This leads to an objective function comprising the conditional log-likelihoods on the labeled data and the negative entropy terms for the unlabeled data. Within the functional gradient boosting framework, we derive the stage-wise data weight distribution for this semi-supervised objective.
The paper is organized as follows. In the next two sections, we formally set up the problem and review our earlier approach to the discriminative learning of mixtures in a fully supervised setup. Our proposed semi-supervised discriminative mixture learning algorithm is described in Section 4. In the experimental evaluation in Section 5, we demonstrate the benefits of the proposed algorithm on an extensive set of time-series classification problems drawn from real-world datasets in semi-supervised scenarios.
2. Problem Setup and Notation
Consider a classification problem where a class label is denoted by $c \in \{1, \ldots, K\}$ for the observation/feature $\mathbf{x} \in \mathcal{X}$. The input feature $\mathbf{x}$ is either vector-valued or structured, such as a sequence or time series. Let $f(c, \mathbf{x})$ denote a BNC^{1} with a class variable $c$ and the input attribute variables $\mathbf{x}$. A BNC can usually be factorized into a (multinomial) class prior $f(c)$ and the class conditional densities $f(\mathbf{x} \mid c) = f_c(\mathbf{x})$. For example, $f_c(\mathbf{x})$ could be a class-($c$)-specific Gaussian when $\mathbf{x}$ is a real-valued vector. Often, $f_c(\mathbf{x})$ may also contain latent variables (e.g., in sequence classification where $\mathbf{x}$ is a sequence of measurements, $f_c(\mathbf{x})$ can be modeled as an HMM with hidden state variables).
As a classifier, the class prediction of a new observation $\mathbf{x}$ can be accomplished by the decision rule $c^* = \arg\max_c f(c \mid \mathbf{x}) = \arg\max_c f(c, \mathbf{x})$. Given the (fully supervised) training data

$$D = \{(c^i, \mathbf{x}^i)\}_{i=1}^{n},$$

we learn a joint density $f(c, \mathbf{x})$ that minimizes the prediction error. The traditional ML learning optimizes the joint likelihood of the data,

$$\sum_{i=1}^{n} \log f(c^i, \mathbf{x}^i).$$

However, ML learning does not necessarily yield optimal prediction performance unless we are given not only the correct model structure but also a large number of training samples.
The discriminative learning of BNCs effectively represents the class boundaries, and exhibits classification performance superior to that of ML learning, which merely focuses on fitting the density to all points in the training data. CML learning, one of the most popular discriminative estimators, maximizes the conditional likelihood of $c$ given $\mathbf{x}$, an objective directly related to the goal of accurate class prediction. The conditional log-likelihood objective for the training data $D$ is defined as

$$\sum_{i=1}^{n} \log f(c^i \mid \mathbf{x}^i) = \sum_{i=1}^{n} \Big[ \log f(c^i, \mathbf{x}^i) - \log \sum_{c} f(c, \mathbf{x}^i) \Big].$$

CML optimization in general does not admit closed-form solutions for most generative models. One typically maximizes it using gradient search. Although it has been shown that CML outperforms ML when the model structure is suboptimal [6, 10, 11, 19], the computational overhead demanded by gradient-based approaches is high, especially for complex models such as HMMs and general BN structures.
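To make the decision rule and the two training objectives concrete, the following minimal sketch evaluates them for a BNC with one-dimensional Gaussian class conditionals. The function names and the toy parameter values are illustrative assumptions and do not come from the original work.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian; stands in for the class conditional f_c(x)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(c, x, priors, means, variances):
    """f(c, x) = f(c) * f_c(x)."""
    return priors[c] * gaussian_pdf(x, means[c], variances[c])

def posterior(x, priors, means, variances):
    """f(c | x) for every class c, obtained by normalizing the joint over c."""
    joints = [joint(c, x, priors, means, variances) for c in range(len(priors))]
    total = sum(joints)
    return [j / total for j in joints]

def predict(x, priors, means, variances):
    """Decision rule c* = argmax_c f(c | x) = argmax_c f(c, x)."""
    post = posterior(x, priors, means, variances)
    return max(range(len(post)), key=post.__getitem__)

def joint_log_likelihood(data, priors, means, variances):
    """ML objective: sum_i log f(c^i, x^i)."""
    return sum(math.log(joint(c, x, priors, means, variances)) for c, x in data)

def conditional_log_likelihood(data, priors, means, variances):
    """CML objective: sum_i log f(c^i | x^i)."""
    return sum(math.log(posterior(x, priors, means, variances)[c]) for c, x in data)

# Toy two-class model and data (illustrative values only).
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
data = [(0, -1.2), (0, -0.4), (1, 0.9), (1, 0.2)]
```

The conditional objective differs from the joint one only by the marginal term $\log \sum_c f(c, \mathbf{x}^i)$, which is why fitting the joint density well does not by itself guarantee good class posteriors.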
^{1}We use the notation f(c, x) interchangeably to represent either a BNC or a likelihood at data point (c, x).
3. Previous Recursive Mixture Estimation in Fully Supervised Setups
Motivated by the fact that a single BNC can be insufficient for modeling complex decision boundaries (e.g., Gaussian class conditionals merely represent ellipsoidal clusters), one can enlarge the representational capacity by forming a mixture. Let $F(c, \mathbf{x})$ denote a mixture of BNCs, that is,

$$F(c, \mathbf{x}) = \sum_{m=1}^{M} \alpha_m f_m(c, \mathbf{x}), \qquad \alpha_m \ge 0, \quad \sum_{m=1}^{M} \alpha_m = 1.$$

Note^{1} that each component of the mixture is a BNC $f_m(c, \mathbf{x})$. Instead of the usual EM learning for mixture models, a greedy recursive approach was proposed in [16]. At each stage, we add a new BNC component $f(c, \mathbf{x})$ to the current mixture so that it optimizes a certain criterion.
Within the functional gradient optimization framework [17], one considers how to maximize a given objective functional $J(F)$ with respect to the (mixture) function $F(\mathbf{z})$ where $\mathbf{z} \in \mathcal{Z}$. In the classification setting, $\mathbf{z} = (c, \mathbf{x})$, and $\mathcal{Z}$ is the class-measurement joint input domain for the BNC likelihood function $f(c, \mathbf{x})$. The greedy optimization proceeds as follows: for the current mixture estimate $F$, we seek a new component $f$ such that when $F$ is locally varied as $(1 - \epsilon) F + \epsilon f$ for some small positive $\epsilon$, $J((1 - \epsilon) F + \epsilon f)$ is maximally increased. The update equation is:

$$F \leftarrow (1 - \epsilon) F + \epsilon f = F + \epsilon (f - F). \qquad (1)$$
Maximizing $J(F)$ can be done by gradient ascent (in function space) described by the update rule:

$$F \leftarrow F + \delta \, \nabla_F J(F), \qquad (2)$$

where $\delta$ is the step size and $\nabla_F J(F) = \partial J(F) / \partial F(\mathbf{z})$ is the functional gradient of $J(F)$, which is itself a function obtained by a pointwise partial derivative.
Contrasting (2) with the greedy mixture update rule of (1), the optimal $f$ would be the one that attains the maximal alignment between $(f - F)$ and $\nabla_F J(F)$, namely

$$f^* = \arg\max_{f} \, \langle f - F, \, \nabla_F J(F) \rangle. \qquad (3)$$
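In the case of a finite number of samples $D = \{(c^i, \mathbf{x}^i)\}_{i=1}^{n}$, we estimate (3) as

$$f^* = \arg\max_{f} \sum_{i=1}^{n} w(c^i, \mathbf{x}^i) \, f(c^i, \mathbf{x}^i), \qquad (4)$$

where $w(c, \mathbf{x}) = \nabla_{F(c, \mathbf{x})} J(F) = \partial J(F) / \partial F(c, \mathbf{x})$, and the term involving $F$ is dropped because it does not depend on $f$. Thus, $\nabla_{F(c, \mathbf{x})} J(F)$ serves as a weight for the data point $(c, \mathbf{x})$ with which the new $f$ will be learned. The optimization in (4) can be accomplished using a generic gradient-ascent-based approach; however, a more efficient recursive EM-like lower-bound maximization was suggested in [16].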
Once the optimal component $f^*$ is selected, its optimal contribution to the mixture, $\alpha^*$, is obtained as

$$\alpha^* = \arg\max_{\alpha \in [0, 1]} J\big( (1 - \alpha) F + \alpha f^* \big).$$

This optimization can easily be done with any line search algorithm.
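It is important to discuss the choice of the objective functional $J(F)$. For discriminative mixture learning, the conditional log-likelihood is employed in [16]:

$$J(F) = \sum_{i=1}^{n} \log F(c^i \mid \mathbf{x}^i) = \sum_{i=1}^{n} \Big[ \log F(c^i, \mathbf{x}^i) - \log \sum_{c} F(c, \mathbf{x}^i) \Big].$$

In this case, the functional gradient becomes:

$$\nabla_{F(c^i, \mathbf{x}^i)} J(F) = \frac{1}{F(c^i, \mathbf{x}^i)} - \frac{1}{F(\mathbf{x}^i)}, \qquad F(\mathbf{x}^i) = \sum_{c} F(c, \mathbf{x}^i),$$

yielding the discriminative data weight:

$$w(c^i, \mathbf{x}^i) = \frac{1 - F(c^i \mid \mathbf{x}^i)}{F(c^i, \mathbf{x}^i)}.$$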
The discriminative weight indicates that the new component $f$ is learned from the weighted data where the weights are directly proportional to $1 - F(c \mid \mathbf{x})$ and inversely proportional to $F(c, \mathbf{x})$. Hence the data points that are unexplained by the model, i.e., $F(c, \mathbf{x}) \to 0$, and those that are incorrectly classified by the current mixture, i.e., $(1 - F(c \mid \mathbf{x})) \to 1$, are focused on in the next stage. This is an intuitively desirable strategy for improving the classification performance.
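The behavior of the discriminative weight can be checked numerically. The sketch below assumes the weight takes the form described above, $(1 - F(c \mid \mathbf{x})) / F(c, \mathbf{x})$; the helper name and the toy joint values are hypothetical.

```python
def discriminative_weight(joint_values, label):
    """Weight of a labeled point under the conditional log-likelihood objective.

    joint_values[c] holds F(c, x) for the current mixture; the result is
    (1 - F(label | x)) / F(label, x), i.e. 1 / F(label, x) - 1 / F(x).
    """
    marginal = sum(joint_values)                      # F(x) = sum_c F(c, x)
    posterior = joint_values[label] / marginal        # F(label | x)
    return (1.0 - posterior) / joint_values[label]

# A point with F(0, x) = 0.3 and F(1, x) = 0.1 (illustrative values).
joint_values = [0.3, 0.1]
w_if_label_0 = discriminative_weight(joint_values, 0)  # correctly classified
w_if_label_1 = discriminative_weight(joint_values, 1)  # misclassified
```

For the same observation, the weight is several times larger when the true label is the one the current mixture gets wrong, which is how the next component is steered toward the decision boundary.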
The time complexity of discriminative mixture learning with $M$ components is of the order $O(M \cdot (N_{ML} + N_{LS}))$, where $N_{ML}$ stands for the complexity of the ML learning and $N_{LS}$ is the complexity of the line search. Hence, the complexity of the discriminative mixture learning algorithm is a constant factor of that of simple generative learning of the base model on weighted data.
^{1}It is also worth noting that, viewed from the generative perspective, this corresponds to modeling each class with the same number ($M$) of mixture components (i.e., $F(\mathbf{x} \mid c)$ has the same mixture order for all $c$).
4. Semi-Supervised Recursive Discriminative Mixture Estimation
So far, we have considered the case where the data is fully labeled. In the semi-supervised setting, we are given the labeled set

$$L = \{(c^i, \mathbf{x}^i)\}_{i=1}^{n_L}$$

and the unlabeled data

$$U = \{\mathbf{x}^j\}_{j=1}^{n_U}.$$

Among several known semi-supervised classification approaches, an effective way to exploit the unlabeled data is the entropy minimization method proposed by [18]. The main idea is that we minimize the classification error for the labeled data (e.g., by maximizing the conditional likelihood), while forcing the model to have minimal uncertainty in predicting class labels for the unlabeled data. This minimum entropy principle is motivated by minimization of the Kullback-Leibler divergence between the model-induced distribution and the empirical distribution on the unlabeled data, and it has been shown to effectively partition the unlabeled data into clusters.
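Having the negative entropy term for the unlabeled data, the semi-supervised discriminative (SSD) objective can be defined as

$$J_{SSD}(F) = \sum_{(c^i, \mathbf{x}^i) \in L} \log F(c^i \mid \mathbf{x}^i) + \gamma \sum_{\mathbf{x}^j \in U} \sum_{c} F(c \mid \mathbf{x}^j) \log F(c \mid \mathbf{x}^j),$$

where $\gamma \ge 0$ is a controllable parameter that balances the loss term against the negative entropy term. The functional gradient for the new objective is now

$$\nabla_F J_{SSD}(F) = \sum_{(c^i, \mathbf{x}^i) \in L} \nabla_F \log F(c^i \mid \mathbf{x}^i) + \gamma \sum_{\mathbf{x}^j \in U} \nabla_F \sum_{c} F(c \mid \mathbf{x}^j) \log F(c \mid \mathbf{x}^j).$$

Notice that for the labeled data we have a functional gradient identical to that of supervised discriminative mixture learning. For the unlabeled data, however, the gradient terms require further consideration. The main difficulty is that for the unlabeled data $\mathbf{x}^j$ ($\in U$), we have no assigned class labels. We next consider two different approaches for treating this latent label.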
 4.1 Marginalization Over Full Label Set
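A possible treatment is to assume that all $K$ class labels are attached to the unlabeled data point $\mathbf{x}^j$. That is, for each data point $\mathbf{x}^j$, we pretend that all $K$ possible pairs $\{(c^j, \mathbf{x}^j)\}_{c^j = 1}^{K}$ are observed in the training data. Then it follows that:

$$\frac{\partial}{\partial F(c^j, \mathbf{x}^j)} \sum_{c} F(c \mid \mathbf{x}^j) \log F(c \mid \mathbf{x}^j) = \frac{\log F(c^j \mid \mathbf{x}^j) + H(F(c \mid \mathbf{x}^j))}{F(\mathbf{x}^j)},$$

where $H(\cdot)$ is the entropy function, i.e., $H(F(c \mid \mathbf{x}^j)) = -\sum_{c} F(c \mid \mathbf{x}^j) \log F(c \mid \mathbf{x}^j)$, and $F(\mathbf{x}^j) = \sum_{c} F(c, \mathbf{x}^j)$.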
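Hence, the unlabeled data point $\mathbf{x}^j$ induces $K$ data weights:

$$w(c^j, \mathbf{x}^j) = \gamma \, \frac{\log F(c^j \mid \mathbf{x}^j) + H(F(c \mid \mathbf{x}^j))}{F(\mathbf{x}^j)}, \qquad c^j = 1, \ldots, K.$$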
The unlabeled data weight can be interpreted as follows: (i) The denominator $F(\mathbf{x}^j)$ implies the need to focus, in the next stage, on the samples that are less highlighted by the current model (regardless of their class labels). (ii) The first term in the numerator, $\log F(c^j \mid \mathbf{x}^j)$, encourages the model to keep attending to its current decision ($c^j$) on $\mathbf{x}^j$. (iii) The entropy term in the numerator assigns more weight to the unlabeled samples $\mathbf{x}^j$ that have a higher prediction uncertainty under the current model. So, by (ii) and (iii), one can achieve entropy minimization for the unlabeled data.
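A small numerical sketch of the all-label weights follows. It assumes the weight form $\gamma (\log F(c \mid \mathbf{x}) + H) / F(\mathbf{x})$ implied by the three-part interpretation above, and strictly positive joint values; the function name and toy numbers are hypothetical.

```python
import math

def unlabeled_weights(joint_values, gamma=1.0):
    """All-label weights for one unlabeled point.

    joint_values[c] holds F(c, x) > 0 under the current mixture. Returns one
    weight per class: gamma * (log F(c | x) + H) / F(x), where H is the
    entropy of the class posterior F(. | x).
    """
    marginal = sum(joint_values)                          # F(x)
    post = [v / marginal for v in joint_values]           # F(c | x)
    entropy = -sum(p * math.log(p) for p in post)         # H(F(. | x))
    return [gamma * (math.log(p) + entropy) / marginal for p in post]

# Illustrative point: the current model prefers class 0 (posterior 0.75).
weights = unlabeled_weights([0.3, 0.1])
```

In this example the preferred label receives a positive weight and the other label a negative one, which illustrates the sign issue that arises with this weighting scheme.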
Despite this intuitive interpretation, one practical issue with this weighting scheme is that the weights can potentially be negative, in which case the optimization in (4) may not be tackled by the lower-bound maximization technique. In this case, one can directly optimize it using a parametric gradient search. Alternatively, the pseudo-label-based technique presented next can circumvent the negative weight issue.
 4.2 Pseudo Labels
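For the current model, we define the pseudo label for $\mathbf{x}^j$ as:

$$\hat{c}^j = \arg\max_{c} F(c \mid \mathbf{x}^j).$$

Instead of dealing with all $K$ possible labels for $\mathbf{x}^j$, we consider only a single pseudo-labeled pair $(\hat{c}^j, \mathbf{x}^j)$. That is, the unlabeled $\mathbf{x}^j$ is assumed to be accompanied by $\hat{c}^j$, having the following weight:

$$w(\hat{c}^j, \mathbf{x}^j) = \gamma \, \frac{\log F(\hat{c}^j \mid \mathbf{x}^j) + H(F(c \mid \mathbf{x}^j))}{F(\mathbf{x}^j)}. \qquad (11)$$

The intuition discussed in Section 4.1 follows immediately; however, we can now guarantee that the weights for the pseudo-labeled data points (11) are always non-negative, because the entropy is bounded below by $-\log \max_{c} F(c \mid \mathbf{x}^j)$.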
Although dealing with only the best predicted label is convenient for optimization and is a rational strategy to pursue, it is important to note that, unlike the all-label approach in Section 4.1, the pseudo-label approach is suboptimal from the perspective of the objective, essentially amounting to ignoring the negative-weight (pushing-away) effects enforced by the non-best labels.
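The pseudo-label variant can be sketched in the same way. The code below assumes the weight form $\gamma (\log F(\hat{c} \mid \mathbf{x}) + H) / F(\mathbf{x})$ with $\hat{c}$ the most probable label; the helper name and inputs are hypothetical.

```python
import math

def pseudo_label_weight(joint_values, gamma=1.0):
    """Pseudo label and its weight for one unlabeled point.

    joint_values[c] holds F(c, x) > 0. The pseudo label is the most probable
    class under the current mixture; its weight is
    gamma * (log F(pseudo | x) + H) / F(x).
    """
    marginal = sum(joint_values)
    post = [v / marginal for v in joint_values]
    pseudo = max(range(len(post)), key=post.__getitem__)
    entropy = -sum(p * math.log(p) for p in post)
    weight = gamma * (math.log(post[pseudo]) + entropy) / marginal
    return pseudo, weight

# Weights stay non-negative because H >= -log(max_c F(c | x)).
examples = [[0.3, 0.1], [0.2, 0.2], [0.05, 0.6, 0.1]]
results = [pseudo_label_weight(j) for j in examples]
```

The weight vanishes (up to rounding) when the posterior is uniform or fully confident, and is largest for points with an intermediate level of uncertainty relative to how well the current mixture explains them.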
5. Evaluation
We evaluated the performance of the proposed recursive mixture learning in semi-supervised learning settings. We focused on the structured data classification task of classifying sequences or time series. This is, in general, more difficult than static multivariate data classification. We used Gaussian-emission HMMs (GHMMs)^{1} to model the class conditional densities $f_c(\mathbf{x})$ for the real multivariate sequence $\mathbf{x}$. In our recursive mixture learning, we need to learn GHMMs with weighted data samples, and this can be done by a fairly straightforward extension of the regular EM-based GHMM learning. The detailed EM steps can be found in [20]. The competing approaches whose performance will be contrasted are summarized in Section 5.1, while in Section 5.2 we describe the datasets and report the results.
 5.1 Competing Methods
A simple and straightforward approach to dealing with partially labeled data is to ignore the unlabeled data points. In this section, we first summarize the fully supervised classification algorithms with which we compare our approach. These algorithms are then extended to handle unlabeled data by the well-known and generic adaptive semi-supervised method called self-training.
The first approaches we describe are model-based, where we use (single) BNC models trained by ML and CML. In CML, the gradient search starts with the ML estimate as the initial iterate. Related to the proposed recursive discriminative mixture learning, we compare the proposed method with the boosted Bayesian network (BBN) of [21], an ensemble-based discriminative learning method for BNCs that treats $f(c, \mathbf{x})$ as a (weak) hypothesis, namely $c = h(\mathbf{x}) = \arg\max_c f(c \mid \mathbf{x})$, within a boosting [22] framework. For each stage, AdaBoost's weights $w$ on the data $(c, \mathbf{x})$ are used to learn the next hypothesis (BNC) via weighted ML learning:

$$f^* = \arg\max_{f} \sum_{i=1}^{n} w(c^i, \mathbf{x}^i) \log f(c^i, \mathbf{x}^i).$$
This approach has been shown to inherit certain benefits from AdaBoost such as good generalization by maximizing the margin. However, the resulting ensemble cannot be simply interpreted as a generative model since the learned BNCs are just weak classifiers to be combined for the classification task.
In addition to the model-based approaches, we also consider two alternative similarity-based approaches that have exhibited good performance in the past, especially on sequence classification problems: dynamic time warping (DTW) and the Fisher kernel [23]. DTW is a dynamic programming algorithm that searches for the globally best warping path. Imposing certain constraints on the feasible warping paths has often been empirically shown to improve the classification performance [24–26]. For instance, the Sakoe-Chiba band constraint [24] restricts the maximum deviation of matching slices from the diagonal to $p$% of the sequence length. Thus $p = 0$ and $p = \infty$ correspond to the naive Euclidean distance (defined only if the lengths of the two sequences are equal) and the standard (unconstrained) DTW, respectively. Recently, [26] proposed an adaptive band approach that estimates the function spaces of time warping paths. In this setting, class-specific warping-path constraints are learned for each class that reflect the warping variations of the samples within it.
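The effect of the band constraint can be illustrated with a minimal DTW sketch for one-dimensional sequences. The squared-difference local cost, the square root at the end, and the way the percentage is converted into a band width are assumptions made for this illustration, not details taken from [24].

```python
import math

def dtw_distance(a, b, band_percent=None):
    """DTW distance between 1-D sequences with an optional Sakoe-Chiba band.

    band_percent = None means unconstrained DTW; band_percent = 0 on
    equal-length sequences reduces to the Euclidean distance.
    """
    n, m = len(a), len(b)
    if band_percent is None:
        width = max(n, m)
    else:
        # The band must be at least |n - m| wide for a path to exist.
        width = max(round(band_percent / 100.0 * max(n, m)), abs(n - m))
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - width), min(m, i + width) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# Two sequences with the same shape but shifted timing (illustrative values).
a = [0.0, 0.0, 1.0, 2.0]
b = [0.0, 1.0, 2.0, 2.0]
euclidean_like = dtw_distance(a, b, band_percent=0)
unconstrained = dtw_distance(a, b)
```

Here the unconstrained alignment absorbs the timing shift completely, whereas the zero-width band charges for every misaligned sample.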
The Fisher kernel between two sequences $\mathbf{x}$ and $\mathbf{x}'$ is defined as the radial basis function (RBF) evaluated on the distance between their Fisher scores with respect to the underlying generative model. More specifically, in binary classification, $k(\mathbf{x}, \mathbf{x}') = e^{-\| U_{\mathbf{x}} - U_{\mathbf{x}'} \|^2 / (2 \sigma^2)}$, where $U_{\mathbf{x}} = \nabla_{\theta} \log P_{c=+}(\mathbf{x})$. Here $P_{c=+}(\mathbf{x})$ indicates the likelihood of the HMM, usually learned by ML from the examples of the positive class only. The RBF scale $\sigma^2$ is determined as the median distance between the Fisher scores corresponding to the training sequences in the positive class and the closest Fisher score from the negative class in the training data [23]. The multiclass extension is made using a set of one-vs-rest binary problems.
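The structure of the Fisher kernel can be sketched with a much simpler generative model than an HMM. The example below swaps in a one-dimensional Gaussian, whose Fisher score is available in closed form; this substitution, the function names, and the fixed RBF scale are assumptions for illustration only.

```python
import math

def fisher_score(x, mean, var):
    """Gradient of log N(x; mean, var) with respect to (mean, var)."""
    d_mean = (x - mean) / var
    d_var = -0.5 / var + 0.5 * (x - mean) ** 2 / var ** 2
    return (d_mean, d_var)

def fisher_kernel(x, y, mean, var, sigma2):
    """RBF kernel on the distance between the Fisher scores of x and y."""
    ux = fisher_score(x, mean, var)
    uy = fisher_score(y, mean, var)
    sq_dist = sum((p - q) ** 2 for p, q in zip(ux, uy))
    return math.exp(-sq_dist / (2.0 * sigma2))

# Generative model fitted to the positive class (illustrative parameters).
mean, var, sigma2 = 0.0, 1.0, 1.0
k_same = fisher_kernel(0.5, 0.5, mean, var, sigma2)
k_near = fisher_kernel(0.5, 0.6, mean, var, sigma2)
k_far = fisher_kernel(0.5, 3.0, mean, var, sigma2)
```

With an HMM as the generative model, the score would instead be the gradient of the sequence log-likelihood with respect to the HMM parameters, but the kernel evaluation on top of the scores is unchanged.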
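As a baseline, we also consider a static classifier (e.g., an SVM) that treats fixed-length (window) segments from a sequence as iid multivariate samples. Specifically, for a window of size $r$, the class-sequence data pair $(c, \mathbf{x})$ with $\mathbf{x} = \mathbf{x}_1 \cdots \mathbf{x}_T$ is converted to $rd$-dimensional iid samples,

$$\big\{ \big( c, \, [\mathbf{x}_t^\top, \mathbf{x}_{t+1}^\top, \ldots, \mathbf{x}_{t+r-1}^\top]^\top \big) \big\}_{t=1}^{T-r+1}.$$

At the test stage, the class label is determined by majority voting over the predicted segment labels.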
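The window conversion and the voting step can be sketched as follows. The sliding stride of one frame and the function names are assumptions; the segment classifier itself (e.g., an SVM) is left out.

```python
from collections import Counter

def window_segments(sequence, r):
    """Turn a length-T sequence of d-dim frames into T - r + 1 vectors of size r * d.

    Each segment concatenates r consecutive frames (stride of one frame assumed).
    """
    return [
        [value for frame in sequence[t:t + r] for value in frame]
        for t in range(len(sequence) - r + 1)
    ]

def majority_vote(segment_labels):
    """Sequence-level label: the most frequent predicted segment label."""
    return Counter(segment_labels).most_common(1)[0][0]

# A toy sequence with T = 5 frames of dimension d = 2.
sequence = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
segments = window_segments(sequence, r=3)   # 3 segments, each of size 6
label = majority_vote([1, 2, 2, 0])         # hypothetical segment predictions
```

Every segment inherits the class label of its source sequence during training, so a single sequence contributes $T - r + 1$ training samples.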
The competing methods are summarized below:

ML: ML learning of $f(c, \mathbf{x})$.

CML: CML learning of $f(c, \mathbf{x})$.

RDM: Recursive discriminative mixture learning [16].

BBN: Boosted Bayesian networks [21].

NNDTW (B%): The nearest neighbor classifier based on the DTW distance measure, where B is the best Sakoe-Chiba band constraint selected by cross validation over the candidate set {∞%, 30%, 10%, 3%}.

FSDTW: The function-space DTW learning [26].

SVMFSK: The SVM classifier based on the Fisher kernel. The SVM hyperparameters are selected by cross validation. To handle multiclass settings, we perform binarization in the one-vs-rest manner. We then employ the winner-takes-all (WTA) strategy, which predicts the multiclass labels by majority voting from the outputs of the one-vs-others binarized problems^{1}.

SVMWin (R%): An SVM classifier that treats fixed-length window segments as iid multivariate samples, where R is the relative window size with respect to the sequence length (R = 100r/T). We use the RBF kernel in the $rd$-dimensional vector space. We report the best (relative) window size R selected by cross validation over the candidate set {0% (window size r = 1), 10%, 20%, 30%, 50%}.

SSRDM: Semi-supervised recursive discriminative mixture learning (proposed approach).
In the experiments, we split the data three ways: labeled training data, unlabeled training data, and test data. All the other approaches listed above are fully supervised, making use of only the labeled data for training. On the other hand, our semi-supervised discriminative mixture learning algorithm (denoted by SSRDM) exploits the unlabeled training data in conjunction with the labeled data. Throughout the evaluation we make use of the all-label strategy in (11) as it consistently demonstrated performance superior to the pseudo-label alternative.
We not only demonstrate the improvement in prediction performance achieved by SSRDM compared to supervised methods that ignore the unlabeled data, but also contrast it with the self-training algorithm, a generic and popular method for extending fully supervised classifiers to semi-supervised setups that is often very successful. We apply the self-training algorithm to each of the supervised methods listed above. The self-training algorithm is described in pseudocode in Algorithm 5.1.
Algorithm 5.1. Self-training.
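As a rough guide to what a generic self-training loop looks like, the sketch below wraps an arbitrary supervised learner. It is a common textbook variant written under stated assumptions, not a transcription of Algorithm 5.1: the confidence threshold, the stopping rule, the `fit` / `predict_proba` interface, and the toy nearest-mean classifier are all hypothetical.

```python
import math

def self_train(fit, predict_proba, labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Generic self-training: repeatedly absorb confidently predicted points.

    fit(pairs) -> model, predict_proba(model, x) -> list of class probabilities.
    """
    labeled, pool = list(labeled), list(unlabeled)
    model = fit(labeled)
    for _ in range(max_rounds):
        newly_labeled, remaining = [], []
        for x in pool:
            probs = predict_proba(model, x)
            c = max(range(len(probs)), key=probs.__getitem__)
            if probs[c] >= threshold:
                newly_labeled.append((c, x))   # trust the confident prediction
            else:
                remaining.append(x)
        if not newly_labeled:
            break                              # nothing confident left to add
        labeled.extend(newly_labeled)
        pool = remaining
        model = fit(labeled)                   # retrain on the enlarged set
    return model, labeled

# Toy supervised learner: 1-D nearest class mean with soft scores.
def fit_means(pairs):
    sums = {}
    for c, x in pairs:
        total, count = sums.get(c, (0.0, 0))
        sums[c] = (total + x, count + 1)
    return {c: total / count for c, (total, count) in sums.items()}

def proba_means(means, x):
    scores = [math.exp(-(x - means[c]) ** 2) for c in sorted(means)]
    norm = sum(scores)
    return [s / norm for s in scores]

model, final_labeled = self_train(
    fit_means, proba_means,
    labeled=[(0, -2.0), (1, 2.0)],
    unlabeled=[-2.5, -1.5, 1.8, 2.4, 0.1],
)
```

In the toy run, the four clearly separated points are absorbed in the first round, while the ambiguous point near the boundary never reaches the confidence threshold and stays unlabeled.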
Unless stated otherwise, for the mixture/ensemble approaches (i.e., BBN, RDM, and SSRDM), the maximum number of iterations (i.e., the number of BNC components) was set to ten. The test errors for the datasets (described in the next section) are shown in Table 1.
 5.2 Datasets and Results
 5.2.1 Gun/Point dataset
This is a binary class dataset that contains 200 sequences (100 per class) of gun draw (class 1) and finger point (class 2). The sequences are all 1-D vectors of length 150, representing the x-coordinate of the centroid of the right hand^{1}. This time-series dataset is a typical example where an NN approach with either a simple Euclidean distance or a DTW with a small Sakoe-Chiba band size constraint works very well.
We randomly form five folds for cross validation, with 10%/40% labeled/unlabeled training data and the remaining 50% for the test data. The sequences were preprocessed by Z-normalization so that the mean is 0 and the standard deviation is 1. The GHMM order was chosen to be ten, as it is also meaningful for describing 2–3 states for delicate movements around the subject's side, 2–3 states for hand movement from/to the side to/from the target, 1–2 states at the target, and 2–3 states for returning to the gun holster.
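Z-normalization of a single sequence can be sketched as follows. The use of the population standard deviation and the handling of constant sequences are assumptions, since the source does not specify them.

```python
import math

def z_normalize(sequence):
    """Shift and scale a 1-D sequence to zero mean and unit standard deviation.

    Uses the population standard deviation (divide by n); a constant
    sequence is mapped to all zeros to avoid division by zero.
    """
    n = len(sequence)
    mean = sum(sequence) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sequence) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in sequence]

normalized = z_normalize([1.0, 2.0, 3.0, 4.0])
```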
The test errors (means and standard deviations) are shown in Table 1. NNDTW with a properly chosen Sakoe-Chiba band size (10%) outperforms ML, CML, and the sequence-kernel-based SVM, while it is comparable to RDM. The semi-supervised learning results indicate that SSRDM outperforms the other semi-supervised methods and significantly improves on supervised RDM by taking advantage of the large number of unlabeled data.

Table 1. Test errors (%) for the semi-supervised settings. The proposed SSRDM, located at the bottom with the boldfaced title, is compared with supervised classifiers that simply ignore unlabeled data and their self-training extensions (depicted in parentheses). ML, likelihood maximization; CML, conditional likelihood maximization; BBN, boosted Bayesian network; NN, nearest neighbor; DTW, dynamic time warping; FS, function space; SVM, support vector machine; FSK, Fisher kernel; RDM, recursive discriminative mixture; SSRDM, semi-supervised RDM.
 5.2.2 Australian Sign Language (ASL)
This UCI KDD dataset contains about 100 signs generated by five signers with different levels of skill [31]. In this experiment, we considered 10 selected signs (“hello,” “sorry,” “love,” “eat,” “give,” “forget,” “know,” “exit,” “yes,” and “no”), forming a $K = 10$-way classification problem. In the original ASL dataset, each time slice of a sequence consists of 15 features corresponding to the hand position, hand orientation, finger flexion, and so on. As recommended, we ignored the 5th, 6th, and 11th–15th features. To suppress occasional noisy spikes in the original sequences, we additionally preprocessed them with a median filter. In contrast to the Gun/Point dataset, DTW is not very effective here because the lengths of the sequences in the dataset are diverse, ranging from 17 to 196. We split the data randomly into 60% labeled and 20% unlabeled training data with 20% test data in five folds. For the HMM-based models, the GHMM order was chosen to be three by cross validation.
Results in Table 1 show the test errors averaged over the five test folds. The DTW with the best-chosen band constraint (B = 30) exhibits rather poor performance, statistically indistinguishable from ML, as expected due to the large deviation in sequence lengths. Compared to ML, the discriminative approaches like CML and RDM improve the prediction accuracy considerably. Despite the small number of unlabeled data, the proposed SSRDM effectively takes advantage of them, yielding the lowest test error, significantly below the random guess error rate of 90%.
 5.2.3 Georgia Tech speed-control gait database
We next tested the proposed mixture learning algorithms on the human gait recognition problem. The data of interest is the speed-control gait data collected by the Human Identification at a Distance (HID) project at Georgia Tech. The database was originally intended for studying distinctive characteristics (e.g., stride length or cadence) of human gait over different speeds [32, 33]. For 15 subjects and four different walking speeds (0.7 m/s, 1.0 m/s, 1.3 m/s, 1.6 m/s), 3D motion capture data of 22 marked points (as depicted in [32]) were recorded for nine repeated sessions. The data was sampled at 120 Hz evenly for exactly one walking cycle, meaning that slower sequences were longer than faster ones. The sequence length ranged from approximately 100 to 200 samples. Each marked point had a 3D coordinate, yielding 66 (= 22 × 3) dimensional sequences.
Apart from the original purpose of the data, we were interested in recognizing subjects regardless of their walking speeds. Taking only the first five subjects into consideration without distinguishing their walking speeds, we formulated a 5-class problem where each class consisted of 36 (= 4 speeds × 9 sessions) sequences. The original dataset provided high-quality 3D motion capture features on which most of the competing methods performed equally well. To make the classification task more challenging, we considered two modifications: (1) From the original 1-cycle gait sequence, we took subsequences whose starting positions were chosen uniformly at random and whose lengths were around 100. (2) Only the features related to the lower body part were used: the joint angles of the torso-femur, femur-tibia, and tibia-foot.
After this manipulation, we randomly partitioned the data five times into 20% labeled and 50% unlabeled training data with the remaining 30% as test data. The GHMM order was chosen to be three, and the maximum number of mixture learning iterations was set to 20. As Table 1 demonstrates, the proposed SSRDM again attains the lowest errors.
 5.2.4 USF human ID gait dataset
The USF human ID gait dataset consists of about 100 subjects periodically walking in elliptical paths in front of a set of cameras. We considered the task of motion-based subject identification, where the motion videos were recorded in diverse circumstances: the subject walking on grass or concrete, with or without a briefcase. From the processed human silhouette video frames, we computed the 7th-order Hu moments, which are translation- and rotation-invariant descriptors of binary images. The extracted features were then Z-normalized, yielding 7-dimensional sequences of duration ~200. While the original investigation of the set focused on how well the classifiers adapted to new circumstances (i.e., a different combination of covariates), we concentrated on identifying humans regardless of the covariates. For this, we chose seven humans from the database (a 7-class problem), each of which had 16 associated sequences containing all combinations of circumstances.
After randomly splitting the 112 sequences into 50% labeled training, 25% unlabeled training, and 25% test sets five times, we recorded the average test errors in Table 1. The GHMM order was chosen to be three by cross validation. The maximum number of iterations for recursive mixture models was set to 20. Again, RDM and SSRDM have the lowest test errors with small variances, reaffirming the importance of the recursive estimation of discriminative mixture models when combined with the use of unlabeled data.
 5.2.5 Traffic dataset
We next tackle a video classification problem that has demonstrated the utility of dynamic texture methods [34, 35] in the computer vision community. Dynamic texture is a generative model that represents a video as a sample from a linear dynamical system. Dynamic texture can extract the visual or spatial components in the image measurements using PCA while capturing the temporal correlation by the latent linear dynamics. Hence, a video, potentially of varying length, can be succinctly represented by two matrices $(A, C)$, where $A$ is the dynamics matrix on the low-dimensional latent space, and $C$ is the emission matrix that maps the latent state to the image observation. To apply dynamic texture to video classification problems, the Martin distance [34] is often employed. It defines the similarity measure (or kernel) between a pair of videos based on the principal angles between the subspaces represented by their matrix parameters. Once the distance measure is estimated, one can readily employ standard classifiers such as nearest neighbors or SVMs.
The dataset we used in this experiment is traffic data (also used in [36]) that contains videos of highway traffic taken over two days from a stationary camera. The videos were labeled manually as light, medium, and heavy traffic, posing a 3-class problem. The videos are around 50 frames long, where each image frame is of size 48 × 48, yielding a 2304-dimensional vector.
For the dynamic texture approach, we set the latent space dimension to be eight. We used the same dimension for our GHMMbased competing approaches, where we used the PCA dimensionreduced observation in the GHMMs. We collected 131 videos (with a nearly equal number of videos for each class), and randomly split them into 60% labeledtraining, 10% unlabeledtraining, and 30% test sets five times. The GHMM order was chosen to be two, and the maximum number of iterations for recursive mixture models was set to 20. The SVM classifier with the Martin distance measure estimated from the learned dynamic texture recorded a test error of 16.67 ± 1.84%, outperforming many of the competing approaches as shown in
Table 1
. However, the proposed SSRDM (and RDM) achieved still higher prediction accuracies than the dynamic texture model.
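The repeated random partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the split is not stratified by class, the subset sizes are obtained by rounding, and the spread is reported as the population standard deviation. The text does not specify any of these details, and the function names are hypothetical.

```python
import numpy as np

def split_indices(n_items, fractions=(0.6, 0.1, 0.3), seed=0):
    """Randomly split item indices into labeled, unlabeled, and test subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_items)
    n_lab = int(round(fractions[0] * n_items))     # rounding is an assumption
    n_unl = int(round(fractions[1] * n_items))
    labeled = perm[:n_lab]
    unlabeled = perm[n_lab:n_lab + n_unl]
    test = perm[n_lab + n_unl:]                    # remainder goes to the test set
    return labeled, unlabeled, test

def mean_and_std(errors):
    """Summarize per-split test errors as (mean, standard deviation)."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), errors.std()

# Five independent random splits of the 131 videos, one seed per repetition.
splits = [split_indices(131, seed=s) for s in range(5)]
```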
 5.2.6 Behavior recognition
Finally, we deal with a behavior recognition task, a very important problem in computer vision. We used the facial expression and mouse behavior datasets from the UCSD vision group
^{1}
. The face data are composed of video clips of two individuals, each displaying six different facial expressions (anger, disgust, fear, joy, sadness, and surprise) under two different illumination settings. Each expression was repeated eight times, yielding a total of 192 video clips. We used the 96 clips from one subject (regardless of illumination conditions) as training data, and predicted the emotions of the other subject in the remaining video clips (a 6-class problem). We further randomly partitioned the training data into 50% labeled and 50% unlabeled sets five times. The mouse data contained videos of five different behaviors (drink, eat, explore, groom, and sleep). From the original dataset, we formed a smaller set comprising 75 video clips (15 videos for each behavior). We then randomly split the data into 25% labeled training, 40% unlabeled training, and 35% test sets five times.
In both cases, from the raw videos, we extracted the
cuboid
features of
[37]
that are spatiotemporal 3D interest point features. Similar to
[37]
, we constructed a finite dictionary of descriptors, and replaced each cuboid descriptor by a corresponding word in the dictionary. More specifically, we collected cuboid features from all training videos, clustered them into
C
centers using the k-means algorithm, and replaced each cuboid by its closest center ID.
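The dictionary construction described above can be sketched as follows. This is a minimal illustration, assuming plain Lloyd iterations with random initialization and squared Euclidean distance; the text does not specify the k-means variant, and the function names are hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Return, for each descriptor in X, the index of its closest center."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, n_centers, n_iters=50, seed=0):
    """Cluster descriptors X of shape (n, d) into n_centers codewords."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with distinct, randomly chosen descriptors.
    centers = X[rng.choice(len(X), size=n_centers, replace=False)].copy()
    for _ in range(n_iters):
        labels = assign(X, centers)
        for k in range(n_centers):
            members = X[labels == k]
            if len(members) > 0:                   # keep an empty cluster's center
                centers[k] = members.mean(axis=0)
    return centers
```

Here `kmeans` would be run once on the cuboid descriptors pooled from all training videos, and `assign` would then map the descriptors of any video to their word IDs.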
For the classification, we first ran the static mixture approach in
[37]
as a baseline, where they represented a video as a histogram of cuboid types, essentially forming a bag-of-words representation. They then applied the nearest neighbor prediction using the
χ
^{2}
distance measure over the histogram space. Setting
C
= 50 with other cuboid parameters properly chosen, we obtained test errors of 68.75 ± 2.95% for the face dataset and 52.36 ± 0.81% for the mouse dataset. (Note that random guessing would yield 83.33% and 80.00% error rates, respectively.)
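The baseline and the chance-level figures quoted above can be sketched as follows. This is an illustration only: the symmetric form of the χ² distance with a factor of 1/2 and the normalization of the histograms are assumptions, since several variants are in use, and the function names are hypothetical.

```python
import numpy as np

def word_histogram(word_ids, n_words):
    """Normalized histogram of cuboid word IDs for one video."""
    counts = np.bincount(np.asarray(word_ids, dtype=int), minlength=n_words)
    counts = counts.astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def chi2_distance(h1, h2, eps=1e-12):
    """Symmetric chi-square distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def nearest_neighbor_label(query, train_hists, train_labels):
    """Predict the label of the training histogram closest to the query."""
    dists = [chi2_distance(query, h) for h in train_hists]
    return train_labels[int(np.argmin(dists))]

def chance_error(n_classes):
    """Expected error of uniform random guessing with balanced classes."""
    return 1.0 - 1.0 / n_classes
```

For six and five balanced classes, `chance_error` gives 5/6 ≈ 83.33% and 4/5 = 80.00%, which are the random-guessing rates stated in the text.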
Instead of representing the video as a single histogram, we considered a sequence representation for our GHMM-based sequence models. For each time frame
t
, we collected all cuboids that spread over
t
and formed a histogram of cuboid types for it. Hence, we formed a
C
-dimensional histogram feature vector for each time slice
t
, where we used GHMMs to model the nonnegative quantities (histograms). Note that some time slices did not contain any cuboids, in which case the feature vector was a zero vector. To avoid a large number of parameters in GHMM learning, we further reduced the dimensionality of the features to five dimensions with PCA. The test errors of the competing approaches for this sequence representation are recorded in
Table 1
. Here the best GHMM orders were three for the face dataset and four for the mouse dataset. In both cases, our discriminative recursive mixture learning algorithms (RDM and SSRDM) consistently exhibited the best performance within the margin of significance, outperforming
[37]
’s baseline method.
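The per-time-slice representation used for the sequence models can be sketched as follows. This is a minimal illustration, assuming each cuboid is given as a hypothetical tuple `(t_start, t_end, word_id)` with an inclusive frame range, and that the PCA projection is fitted on the stacked frame histograms of the training videos; the function names are hypothetical.

```python
import numpy as np

def frame_histograms(cuboids, n_frames, n_words):
    """Count, for every frame t, the cuboid words whose extent covers t."""
    H = np.zeros((n_frames, n_words))
    for t_start, t_end, word in cuboids:
        first = max(t_start, 0)
        last = min(t_end, n_frames - 1)
        H[first:last + 1, word] += 1.0             # frames without cuboids stay zero
    return H

def fit_pca(X, n_components):
    """Return the mean and the leading principal directions of the rows of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project the rows of X onto the principal directions."""
    return (X - mean) @ components.T
```

With C = 50 words, `frame_histograms` yields one 50-dimensional vector per frame, and `apply_pca` with five components produces the five-dimensional observation sequence passed to the GHMMs.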
 5.3 Discussion
The experimental results for the semisupervised classification settings indicate that SSRDM is significantly better than self-training. This can be attributed to SSRDM’s effective and discriminative weighting scheme, which discovers the unlabeled data points that are most important for classification. By comparison, the performance improvement achieved by the self-training algorithm is small, and self-training can sometimes even degrade classification accuracy.
^{1}Selecting the number of hidden states in GHMMs is an important task of model selection that we accomplish using cross-validation.
^{1}Alternatively, one can directly tackle multiclass problems via multiclass SVM [27]. The other possibility in binarization is the one-vs-one treatment [28,29]. In our evaluation, however, WTA in one-vs-others settings slightly outperforms these two alternatives almost all the time, hence we only report the results of WTA.
^{1}For further details about the data, please refer to [30].
^{1}Available for download at http://vision.ucsd.edu.
6. Conclusion
In this paper we have introduced a novel semisupervised discriminative method for learning mixtures of generative BNC models. Under semisupervised settings, we utilized the minimum entropy principle, which leads to stagewise data weight distributions for both labeled and unlabeled data. Unlike traditional approaches to discriminative learning, the proposed recursive algorithm is computationally as efficient as learning a single BNC model while achieving significant improvement in classification performance. Our recursive mixture learning does not require a predetermined mixture order and is robust to the choice of initial parameters.
 Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgements
This study was supported by Seoul National University of Science & Technology.
References
[1] Ko S., Kim D. W., Kang B. Y. (2011) “A matrix-based genetic algorithm for structure learning of Bayesian networks,” International Journal of Fuzzy Logic and Intelligent Systems, 11(3), 135-142. DOI: 10.5391/IJFIS.2011.11.3.135
[2] Cho H. C., Fadali M. S., Lee K. S. (2007) “Online parameter estimation and convergence property of dynamic Bayesian networks,” International Journal of Fuzzy Logic and Intelligent Systems, 7(4), 285-294. DOI: 10.5391/IJFIS.2007.7.4.285
[3] Friedman N., Geiger D., Goldszmidt M. (1997) “Bayesian network classifiers,” Machine Learning, 29, 131-163.
[4] Starner T., Pentland A. (1995) “Real-time American sign language recognition from video using hidden Markov models,” in Proceedings of 1995 International Symposium on Computer Vision, Coral Gables, FL, 265-270. DOI: 10.1109/ISCV.1995.477012
[5] Wilson A. D., Bobick A. F. (1999) “Parametric hidden Markov models for gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 884-900. DOI: 10.1109/34.790429
[6] Woodland P. C., Povey D. (2002) “Large scale discriminative training of hidden Markov models for speech recognition,” Computer Speech & Language, 16(1), 25-47. DOI: 10.1006/csla.2001.0182
[7] Alon J., Sclaroff S., Kollios G., Pavlovic V. (2003) “Discovering clusters in motion time-series data,” in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Wisconsin, 375-381.
[8] Lee S. Y., Lee K. J. (2011) “Pattern classification model design and performance comparison for data mining of time series data,” Journal of Korean Institute of Intelligent Systems, 21(6), 730-736. DOI: 10.5391/JKIIS.2011.21.6.730
[9] Bang Y. K., Lee C. H. (2009) “Design of fuzzy system with hierarchical classifying structures and its application to time series prediction,” Journal of Korean Institute of Intelligent Systems, 19(5), 595-602. DOI: 10.5391/JKIIS.2009.19.5.595
[10] Greiner R., Zhou W. (2002) “Structural extension to logistic regression: discriminative parameter learning of belief net classifiers,” in Proceedings of the 18th National Conference on Artificial Intelligence, Edmonton, AB, 167-173.
[11] Pernkopf F., Bilmes J. (2005) “Discriminative versus generative parameter and structure learning of Bayesian network classifiers,” in Proceedings of the 22nd International Conference on Machine Learning, Bonn, 657-664.
[12] Salojarvi J., Puolamaki K., Kaski S. (2005) “On discriminative joint density modeling,” in Proceedings of the 16th European Conference on Machine Learning, Berlin, 341-352.
[13] Dinh Q. N., Lee C. H. (2013) “Model-based clustering of DOA data using von Mises mixture model for sound source localization,” International Journal of Fuzzy Logic and Intelligent Systems, 13(1), 59-66. DOI: 10.5391/IJFIS.2013.13.1.59
[14] Lee J., Cho S., Kim J., Chung S. T. (2008) “Layered object detection using adaptive Gaussian mixture model in the complex and dynamic environment,” Journal of Korean Institute of Intelligent Systems, 18(3), 387-391. DOI: 10.5391/JKIIS.2008.18.3.387
[15] Kim S. S., Kwak K. C., Ryu J. W., Chun M. G. (2003) “A Neuro-Fuzzy Modeling using the Hierarchical Clustering and Gaussian Mixture Model,” Journal of Korean Institute of Intelligent Systems, 13(5), 512-519. DOI: 10.5391/JKIIS.2003.13.5.512
[16] Kim M., Pavlovic V. (2007) “Recursive method for discriminative mixture learning,” in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 409-416. DOI: 10.1145/1273496.1273548
[17] Friedman J. H. (1999) “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, 29(5), 1189-1232. DOI: 10.1214/aos/1013203451
[18] Grandvalet Y., Bengio Y. (2004) “Semisupervised learning by entropy minimization,” in Proceedings of Advances in Neural Information Processing Systems, Vancouver, BC.
[19] Nadas A. (1983) “A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood,” IEEE Transactions on Acoustics, Speech and Signal Processing, 31(4), 814-817. DOI: 10.1109/TASSP.1983.1164173
[20] Pavlovic V. (2004) “Model-based motion clustering using boosted mixture modeling,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, 811-818.
[21] Jing Y., Pavlovic V., Rehg J. M. (2005) “Efficient discriminative learning of Bayesian network classifier via boosted augmented naive Bayes,” in Proceedings of the 22nd International Conference on Machine Learning, Bonn, 369-376. DOI: 10.1145/1102351.1102398
[22] Freund Y., Schapire R. E. (1995) “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” in Proceedings of the 2nd European Conference, Barcelona, 23-37. DOI: 10.1007/3-540-59119-2_166
[23] Jaakkola T., Diekhans M., Haussler D. (1999) “Using the Fisher kernel method to detect remote protein homologies,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, Heidelberg, 149-158.
[24] Sakoe H., Chiba S. (1978) “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), 43-49. DOI: 10.1109/TASSP.1978.1163055
[25] Ratanamahatana C. A., Keogh E. (2004) “Making time-series classification more accurate using learned constraints,” in Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, FL, 11-21.
[26] Veeraraghavan A., Chellappa R., Roy-Chowdhury A. K. (2006) “The function space of an activity,” in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, 959-968. DOI: 10.1109/CVPR.2006.304
[27] Crammer K., Singer Y. (2001) “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of Machine Learning Research, 2, 265-292.
[28] Hastie T., Tibshirani R. (1997) “Classification by pairwise coupling,” in Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems, Denver, CO, 507-513.
[29] Duan K. B., Keerthi S. S. (2005) “Which is the best multiclass SVM method? An empirical study,” in Proceedings of the 6th International Conference on Multiple Classifier Systems, Seaside, CA, 278-285. DOI: 10.1007/11494683_28
[30] Keogh E., Folias T. (2002) “The UCR time series data mining archive,” Department of Computer Science & Engineering, University of California, Riverside, CA.
[31] Hettich S., Bay S. D. (2009) “The UCI KDD archive,” Department of Information and Computer Science, University of California, Irvine, CA.
[32] Tanawongsuwan R., Bobick A. F. “Characteristics of time-distance gait parameters across speeds.” Available: https://smartech.gatech.edu/bitstream/handle/1853/85/0301.pdf?sequence=1
[33] Tanawongsuwan R., Bobick A. (2003) “Performance analysis of time-distance gait parameters under different speeds,” in Proceedings of the 4th International Conference on Audio and Video-Based Biometric Person Authentication, Guildford, 715-724. DOI: 10.1007/3-540-44887-X_83
[34] Saisan P., Doretto G., Wu Y. N., Soatto S. (2001) “Dynamic texture recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, 58-63.
[35] Doretto G., Chiuso A., Wu Y. N., Soatto S. (2003) “Dynamic textures,” International Journal of Computer Vision, 51(2), 91-109. DOI: 10.1023/A:1021669406132
[36] Chan A. B., Vasconcelos N. (2005) “Probabilistic kernels for the classification of autoregressive visual processes,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, 846-851. DOI: 10.1109/CVPR.2005.279
[37] Dollar P., Rabaud V., Cottrell G., Belongie S. (2005) “Behavior recognition via sparse spatio-temporal features,” in Proceedings of 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, 65-72. DOI: 10.1109/VSPETS.2005.1570899