Recently, sparse coding has achieved remarkable success in image representation tasks. In practice, the performance of clustering can be significantly improved if limited label information is incorporated into sparse coding. To this end, in this paper, a novel semisupervised algorithm, called constrained sparse concept coding (CSCC), is proposed for image representation. CSCC considers limited label information into graph embedding as additional hard constraints, and hence obtains embedding results that are consistent with label information and manifold structure information of the original data. Therefore, CSCC can provide a sparse representation which explicitly utilizes the prior knowledge of the data to improve the discriminative power in clustering. Besides, a kernelized version of our proposed CSCC, namely kernel constrained sparse concept coding (KCSCC), is developed to deal with nonlinear data, which leads to more effective clustering performance. The experimental evaluations on the MNIST, PIE and Yale image sets show the effectiveness of our proposed algorithms.
1. Introduction
I
n realworld applications, such as face recognition, image retrieval, and data clustering
[1
,
2
,
3
,
4
,
5
,
13
,
14
,
15]
, data representation of highdimensional space is a challenging problem. Generally speaking, the high dimensional data leads to the computational time and memory requirements more expensive. Moreover, traditional methods can perform well in lowdimensional space, but may degrade in highdimensional space. To solve these issues, many researchers try to seek a representation of the data in a latent semantic “concept” space instead of the original space. Therefore, matrix factorization methods based on different criterions have attracted considerable attention in the last decades.
Principal component analysis (PCA)
[2]
and linear discriminant analysis (LDA)
[3]
are the most popular matrix factorization methods. PCA is completely unsupervised learning method, which searches for a projection axis of maximal variance. In contrast with PCA, LDA is a supervised learning method, which aims to seek a transformation that maximizes the betweenclass scatter and simultaneously minimizes the withinclass scatter. The linear methods, however, fail to deal with the nonlinear distribution data. To alleviate this problem, the kernelbased methods are developed to discover the essential structures of nonlinear data. The representative methods are kernel principal component analysis (KPCA)
[4]
and kernel Fisher discriminant analysis (KFDA)
[5]
, which are the kernel extensions of PCA and LDA, respectively. Extensive experiments have shown the effectiveness of KPCA and KFDA in many realworld applications.
A common problem of the previously mentioned methods is that they fail to discover the underlying intrinsic manifold structure. To overcome this deficiency, manifoldbased learning methods are straightforward in detecting the nonlinear structures, which have been of wide concern. Yan
et al
.
[6]
proposed a general framework, called graph embedding, for dimensionality reduction. Many manifoldbased learning methods, such as isometric feature mapping (ISOMAP)
[7]
, locally linear embedding (LLE)
[8]
, laplacian eigenmaps (LE)
[9]
and locality preserving projection (LPP)
[10]
can be integrated into this framework. To consider the label information of labeled samples, He
et al
.
[11]
presented a novel semisupervised learning algorithm, called constrained graph embedding (CGE) for feature extraction and data representation. CGE incorporates the limited label information into graph embedding as additional constraints. Experimental results on real data sets have illustrated the effectiveness of CGE.
Different from all the aforementioned methods, there is psychological and physiological evidence for partsbased representation in the cognitive process of human brain. One of the wellknown partsbased methods is nonnegative matrix factorization (NMF)
[12]
that tries to decompose the original date matrix into the product of two nonnegative matrices. In particular, due to the nonnegative limitation, NMF allows only additive, not subtractive, operation, which leads to a partsbased representation of the original data. Concept factorization (CF)
[14]
is a variant of NMF in that each cluster is linearly represented by a few data, and each data is linearly represented by the cluster centers. The major advantage of CF over NMF is that it can be performed on positive as well as negative data and simultaneously kernelized to further improve performance. Until recently, massive other works have been done
[16
,
17
,
18
,
19
,
20
,
21
,
22]
on extensions of NMF.
However, the coefficient matrix of the above methods is usually dense. This is contrary to our understanding that the coefficient matrix is sparse since each sample is represented by a linear combination of only a few concepts. Inspired by biological visual systems, sparse coding (SC) is recently proposed for data representation and has been widely applied in many fields
[23
,
24
,
25
,
26
,
27]
. Traditional SC algorithm, however, fails to take consideration of the geometrical structure information, which can significantly improve the discriminant ability in realworld applications
[16
,
17
,
28
,
29]
. Consequently, a variety of extensions of SC have been developed to explore the geometrical structure of the data by adding some constraints. Wang
et al
.
[30]
proposed a novel technique, called localityconstrained linear coding (LLC), for image representation and classification. LLC preserves the local geometric structure in feature coding process. Mairalet
et al
.
[31]
developed simultaneous sparse coding as a framework where groups of similar signals are jointly decomposed by adding group sparsity regularization term. Gao
et al
.
[32]
presented a laplacian sparse coding (LSC) framework to solve the data classification and tagging problems. LSC incorporates the laplacian regularization into the mode of SC to preserve the consistence of similar local features. Similar to LSC, Zhang
et al
.
[33]
proposed a graph regularization sparse coding (GSC) approach for image representation. GSC captures the intrinsic manifold structure with resort to laplacian graph regularization. Experimental results on image databases have shown the effectiveness of GSC. However, one evident drawback of all the aforementioned SC methods is computationally expensive to optimize their models. To overcome this limitation, Cai
et al
.
[34]
proposed a very efficient algorithm, namely sparse concept coding (SCC), for visual analysis. One of the major advantages of SCC is very efficient because it only solves a sparse eigenvalue problem and two regression problems. For each sample, SCC seeks a sparse representation of basis vectors that are embedded the semantic structure information of the data.
Unfortunately, SCC is completely unsupervised without regard to label information. Some previous research efforts reveal that the simultaneous use of the labeled data and unlabeled data can further improve performance in clustering
[35
,
36]
. Thus, we propose a novel semisupervised learning algorithm, called
constrained sparse concept coding (CSCC)
, for data representation. CSCC considers limited label information and the intrinsic manifold structure of data, simultaneously. Our empirical study on benchmark data sets shows the promising results of our proposed algorithm. The contributions of this paper are as follows:
(1) Our proposed CSCC preserves some merits of SCC. For example, it exploits the manifold structure of the data, and simultaneously only solves a sparse eigenvalue problem and two regression problems. Therefore, CSCC is also computationally efficient in comparison with other sparse coding methods.
(2) Compared with SCC, CSCC is a semisupervised learning algorithm, and thus considers limited label information as additional hard constraints. Moreover, CSCC incorporates the label information into graph embedding in a parameterfree manner. As a result, CSCC not only respects the intrinsic manifold structure of the data, but also takes advantage of the label information of the labeled data.
(3) We also propose another algorithm, called
kernel constrained sparse concept coding
(KCSCC), based on our proposed CSCC. With the kernel trick, nonlinear relationships among data are transformed into linear relationships in the highdimensional kernel space. Therefore, we may obtain more effective performance in most cases.
The remainder of this paper is organized as follows: Section 2 gives a brief review of the related work. Section 3 introduces our proposed algorithms. Section 4 provides the experimental results to demonstrate the effectiveness of the proposed algorithms. Finally, we provide some concluding remarks and suggestions for future work in Section 5.
2. Related Work
In the past few years, sparse coding is proposed based on matrix factorization for data representation, which aims to seek a sparse linear combination of basis vectors for each sample. Therefore, the model of SC can be defined based on matrix factorization model by imposing sparse constraints on representation coefficient.
Given a sample set
X
=[
x
_{1}
,⋯,
x
_{n}
]∈ℜ
^{m×n}
, where
x_{i}
stands for a sample. Let
U
=[
u
_{1}
,⋯,
u
_{k}
]∈ℜ
^{m×k}
be the basis matrix, where
u_{i}
can be regarded as a basis in the new representation space. Let
A
=[
a
_{1}
,⋯,
a
_{n}
]∈ℜ
^{k×n}
be the coefficient matrix, where
a_{i}
denotes the representation coefficient of
x_{i}
. Thus, the objective function of SC can be formally expressed as:
where 
A

_{0}
denotes the ℓ0norm and enforces the sparsity on
A
, and
β
is a regularization parameter.
Unfortunately, ℓ0norm minimization problem is not convex. Therefore, finding the sparsest solution of Eq. (2) is a NPhard problem and computationally expensive. Recent studies have shown that if the solution is sparse enough, the sparsest solution of ℓ1norm minimization problem is equal to the solution of ℓ0norm minimization problem
[37
,
38]
. Thus, an alternative formulation for Eq. (1) is to replace ℓ0norm regularization by ℓ1norm regularization to enforce sparsity constraints:
where 
A

_{1}
denotes ℓ1norm of the coefficient matrix
A
. Here, we can employ standard linear programming methods to solve the ℓ1norm minimization problem in Eq. (2).
3. Our Proposed Methods
 3.1 Matrix Factorization
Generally, factorization of matrices may be nonunique, and hence varieties of matrix factorization methods have been developed by imposing different constraints. Specifically, given a data set
X
=[
x
_{1}
,⋯,
x
_{n}
]∈ℜ
^{m×n}
, matrix factorization method tries to seek two matrices
U
∈ ℜ
^{m×k}
and
A
∈ ℜ
^{k×n}
satisfying:
where
U
denotes the basis vectors and
A
is the coefficient matrix of the samples under the basis vectors
U
. Thus, the objective function of matrix factorization can be formalized as:
where
denotes the Frobenius norm of a matrix.
The models of various matrix factorization methods, such as PCA, LDA, Graph Embedding, NMF, CF, and SC can be constructed by adding different constraints on Eq. (3) based on different purposes. In this way, we can impose some constraints on Eq. (3) that the basis vectors
U
should be embedded the semantic structure of the data, and simultaneously the representation coefficient
A
should be sparse. In the next subsection, we first introduce the CSCC algorithm in detail.
 3.2 CSCC algorithm
CSCC is a threestep algorithm for data representation. The first step is
concept extraction
. CSCC takes the label information as additional constraints into graph embedding. Therefore, the low dimensional
concepts
exploit both limited label information and the manifold structure information. The second step is
basis learning
. CSCC aims to learn a basis that can best fit the concepts. As a result, the basis are encoded the semantic structure of the original data. The final step is
sparse representation learning
. We can employ LARs
[39]
algorithm to learn a sparse representation for each data.
 3.2.1 Concept extraction
Given a data set
X
=[
x
_{1}
,⋯,
x
_{n}
]∈ℜ
^{m×n}
, an adjacency graph
G
={
X
,
W
} can be constructed with data set
X
, where
W
denotes a weighted matrix. The elements of the weighted matrix
W
can be usually defined as:
where N
_{p}
(
x_{j}
) is the set of
p
nearest neighbors of
x_{j}
,
L
=
D

W
is the Laplacian matrix,
D
is a diagonal matrix and
D_{ii}
=∑
_{j}
W_{ij}
. Let
y
=[
y
_{1}
,⋯,
y_{n}
]
^{T}
be the map from the graph to the real line. According to the reference
[13]
, we will introduce how to incorporate label information of labeled data into graph embedding.
Assume that the first
l
samples
x
_{1}
,⋯
x_{l}
belong to
c
classes as labeled set and the rest
n

l
samples
x
_{l}
_{+1}
,⋯,
x_{n}
are unlabeled. We first construct an indicator matrix
M
∈ℜ
^{l×c}
where
m_{ij}
= 1 if
x_{i}
is labeled with the
j
th class;
m_{ij}
= 0 otherwise. Once obtaining the indicator matrix
M
, we define the label constraint matrix S∈ℜ
^{n×(nl+c)}
as follows:
where
I
_{nl}
is a (
n

l
)×(
n

l
) identity matrix. For example, given
n
samples among which
x
_{1}
is from the first class,
x
_{2}
,
x
_{3}
and
x
_{4}
are from the second class,
x
_{5}
and
x
_{6}
are from the third class, and the rest
n
6 samples are unlabeled. Thus, the label constraint matrix
S
can be defined:
where
I
_{n6}
denotes a (
n
6)×(
n
6) identity matrix. Note that concept extraction can map each image
x_{i}
to
y_{i}
from the original feature space to the new concept space. To take advantage of limited label information, we can impose the label constraints by introducing an auxiliary matrix
z
:
From the Eq. (5), it can be found that if
x_{i}
and
x_{j}
share the same label, then
y_{i}
=
y_{j}
. Thus, we have:
And
Thus, the minimization problem reduces to:
We can obtain the optimal vector z of Eq. (6) by solving the following generalized eigenvalue problem:
Let
Z
=[
z
_{1}
,⋯,
z_{d}
],
z_{i}
’s are the eigenvectors of the generalized eigenvalue problem in Eq. (7) corresponding to the smallest eigenvalue. After we obtain
z
, the
y
can be derived by Eq. (5). Let
Y
=[
y
_{1}
,⋯,
y_{d}
], each row of
Y
is called a “
concept
” which embeds the semantic structure information for each sample. If there is no labeled data, we can obtain
S
=
I_{n}
. In this case, the CSCC method reduces to SCC method.
 3.2.2 Basis learning
Considering the label information and the manifold structure at the same time, we aim to find a basis
U
that can best fit
Y
. That is to say, the basis
U
needs to satisfy
X^{T}U
=
Y
. Unfortunately, the system is underdetermined, such
U
does not exist. A feasible way to solve this problem is to impose a penalty on the norm of
U
as follows:
where
α
is the nonnegative constant parameter and
α
∥
U
∥
^{2}
can avoid overfitting. In statistical learning, the model in Eq. (8) is called
Ridge Regression
problem
[41]
.
By taking the derivative of Eq. (8) with respect to
U
and setting it to zero, the optimal solution
U
^{*}
of Eq. (8) can be expressed as follows:
Actually, the dimensionality of the images is so high that it is extremely timeconsuming to solve (
XX^{T}
+
αI
)
^{1}
. Fortunately, some iterative algorithms, such as LSQR
[40]
, are used to directly find the solution of the regression problem in Eq. (8).
 3.2.3 Sparse Representation Learning
Suppose that
a_{i}
and
x_{i}
denote the
i
th column vector of
A
and
X
, respectively. Once we obtain the basis
U
, the coefficient
a_{i}
of the sample
x_{i}
can be solved through the following minimization problem:
where 
a_{i}
 indicates the ℓ1norm of
a_{i}
and
β
> 0 is a constant parameter. The ℓ1norm minimization problem in Eq. (10) is called
LASSO
in statistical learning
[41]
.
Note that the minimization optimization problem in Eq. (10) can be reformulated as follows:
Fortunately, the Least Angel Regression (LARs)
[39]
algorithm can be used to solve the minimization optimization problem in (11). Thus, we need to set the cardinality (the number of nonzero entries) of
a_{i}
without the parameter
γ
. In this way, it is easy to control the sparseness of representation coefficient
a_{i}
.
 3.2.4 Computational Complexity of CSCC
Let
q
be the average number of nonzero entries in each data, and
q
≤
m
.
p
denotes the size of nearest neighbors, and
d
represents the dimensionality of the concept. The total computational complexity of CSCC algorithm includes three parts:
(1) In concept extraction phase, we need
O
(
n
^{2}
q
+
n
^{2}
p
) operations to calculate the nearest neighbor graph and
O
(
dnp
+2
n
) operations to calculate
Y
. Considering
d
<<
n
, we need
O
(
n
^{2}
q
+
n
^{2}
p
) operations to extract the concepts.
(2) In basis learning phase, the time complexity of calculating the
U
^{*}
in Eq. (9) with LSQR method is
O
(
dnq
).
(3) In sparse representation phase, the time complexity of calculating the coefficient matrix
A
in Eq. (11) with LARs method is
O
(
d
^{3}
+
md
^{2}
).
In summary, the total complexity of our CSCC algorithm is
O
(
n
^{2}
q
+
n
^{2}
p
+
d
^{3}
+
md
^{2}
). From the reference
[34]
, it is easy to check that the total computational cost of CSCC is equal to that of SCC.
 3.3 KCSCC algorithm
To our knowledge, the kernel trick is proposed in SVM to handle classification tasks that cannot be linearly separable in the original space. Once the samples in original input space are mapped to the high dimensional feature space via kernel trick, the linear algorithms in pattern recognition can be employed to handle the data in kernel space.
Motivated by the fact that kernel methods can deal with the nonlinear structure data, we propose a nonlinear extension of our CSCC, namely KCSCC, which seeks a sparse representation of nonlinear data. Similar to CSCC, our proposed KCSCC is also a threestep method including concept extraction, basis learning and sparse representation learning.
The concept extraction phase of our proposed KCSCC is the same as that of CSCC. Consequently, we only introduce the basis learning and sparse representation learning of KCSCC.
 3.3.1 Basis learning
Suppose that there exists a nonlinear mapping function
Φ
:ℜ
^{n}
→F which can map the original feature to the kernel feature space:
Meanwhile, define the
K
(·,·) as a Gram matrix, with elements
where
κ
(·,·) is a positive semidefinite kernel function.
Similar to basis learning phase of CSCC, KCSCC needs to find a basis
in high dimensional space to satisfy:
where
Y
stands for embedding results in Eq. (4). In fact, the system in Eq. (13) is also underdetermined, such
does not exist. A natural approach is to approximate
by solving:
where
I
denotes an identity matrix and
α
> 0 is a parameter. Thus, the optimal solution
of Eq. (14) is expressed as follows:
Similarly, it is computational expensive to solve (
K
+
αI
)
^{1}
. Fortunately, finding the solution of Eq. (14) can be converted into solving a regression problem. First, we need to define a projective function in the kernel space as follows:
It is easy to show that the optimal solution
of Eq. (14) is the same as that of the following regularized regression problem:
where
f
(
x_{i}
) and
y_{i}
are the ith element of
f
(
x
) and
Y
, respectively;
F
is the RKHS associated with Mercer kernel
κ
and ∥∥
_{κ}
is the corresponding norm.
Similarly, the regression problem in Eq. (17) can be directly solved by using the LSQR
[40]
algorithm.
 3.3.2 Sparse representation learning
Suppose that
and
x_{i}
are the
i
th column of
and
X
, respectively. After we get the basis
, the coefficient
of the sample
x_{i}
can be computed as follows:
where
enforces the sparsity on
,
K
(:,
x_{i}
)=[
K
(
x
_{1}
,
x_{i}
),⋯,
K
(
x_{n}
,
x_{i}
)]
^{T}
and
β
> 0 is a regularization parameter.
Note that the ℓ1norm minimization problem in Eq. (18) can be reformulated as:
Similarly, we can employ LARs
[39]
algorithm to directly solve the Eq. (19). Therefore, we need to specify the cardinality of
without the parameter
γ
.
 3.3.3 Computational complexity of KCSCC
Similar to CSCC, the consuming time of the concept extraction phase is
O
(
n
^{2}
q
+
n
^{2}
p
) in our proposed KCSCC.
In basis learning phase, we need
O
(
mn
^{2}
) operations to map the original data to the kernelspace in Eq. (12). In addition, we also need
O
(
n
^{3}
+
n
^{2}
d
) to solve the basis vectors by using the LSQR algorithm. Considering
d
<<
n
, the total complexity of this phase is
O
(
mn
^{2}
+
n
^{3}
).
In sparse representation learning phase, we need
O
(
mn
^{2}
) operations to map the original data to the kernel space. Besides, we need
O
(
d
^{3}
+
md
^{2}
) operations to obtain the coefficient matrix
by LARs algorithm. This total complexity of sparse representation learning can be written as
O
(
mn
^{2}
+
d
^{3}
).
In summary, the total complexity of our proposed KCSCC is
O
(
n
^{2}
q
+
n
^{2}
p
+
n
^{3}
+
mn
^{2}
).
4. Experimental Classification Results and Analysis
Recent studies have shown that matrix factorization methods are very powerful in clustering tasks. We carry out several experiments to evaluate the performance of our proposed algorithms on MNIST, PIE and Yale image databases. In our experiments, the two evaluation metrics of the clustering involve accuracy (AC) and normalized mutual information (NMI). The detail definitions of AC and NMI are found in
[14]
.
 4.1 Performance Evaluation and Comparisons
In this subsection, we will systematically conduct the evaluations on some image datasets and compare the performances of CSCC and KCSCC with some other algorithms such as Kmeans, CF, NMF, PCA and SCC.
In these experiments, we randomly choose
K
(=3, 4,···,10) categories from the image data sets for clustering. For our proposed semisupervised algorithm (CSCC and KCSCC), we randomly pick up some date samples for each subject to provide the available label information, and other data samples are unlabeled. For each given cluster number
K
, the experiment process is repeated 10 times and the average performance is recorded as the final result.
 4.1.1 Experiments on MNIST handwritten database
The MNIST handwritten database includes a training set of 60 000 examples, and a test set of 10 000 examples. In this experiment, 500 samples are selected for clustering. These digit images have been normalized to 28×28 gray scale images.
Fig. 1
shows some handwritten samples from MNIST database.
Images from the MNIST database
In this experiment, we randomly pick up 50 samples for each subject and randomly select 20% from these samples as available label information. The rest digit images are left as unlabeled data. This is, for each subject, the labeled data are 10 images and the rest unlabeled data are 40 images, which are mixed as a whole for clustering. The experimental results are summarized in
Table 1
. It is noticeable that our proposed methods consistently outperform other competitors. As shown in
Table 1
, compared with the best algorithm, i.e., SCC, CSCC and KCSCC achieve 6.6 % and 9% improvement in AC, respectively. For NMI, CSCC and KCSCC algorithms achieve 7.3% and 7.6% improvement, respectively.
Clustering performance on MNIST database
Clustering performance on MNIST database
 4.1.2 Experiments on PIE face database
The PIE face database consists of 41,368 images of 68 individuals. The face images were captured by 13 synchronized cameras and 21 flashes under different pose, illumination and expression. In this experiment, we choose the frontal pose (C27) with varying lighting and illumination which leave us about 46 images per subject. The gray face images are normalized to 32×32 pixels. Some sample images from the PIE database are shown in
Fig. 2
.
Face examples from the PIE database
In this experiment, we randomly choose 46 samples per subject for clustering. For each category, we randomly pick up 9 face images to provide label information as labeled set, the rest images without label information as unlabeled set. Then the labeled set and unlabeled set are mixed for clustering. Shown in
Table 2
are the comparison results of all matrix factorization methods. As we can see, KCSCC achieves the highest average AC 68.7% and the highest average NMI 67.5%. The average AC of CSCC achieves nearly 3.7% improvement and the average NMI achieves 6.2% improvement over the SCC, respectively. Compared to the CSCC, the average AC of KCSCC is better 1.8% and the average NMI is slightly better 1.1%.
Clustering performance on PIE database
Clustering performance on PIE database
 4.1.3 Experiments on Yale database
The Yale face database contains 15 subjects each providing 11 different images, thus 165 face images in total. For some subjects, the face images were taken under various lighting conditions and facial expressions. All face images have been normalized to 32×32 pixels.
Fig. 3
shows some face images of two subjects from the Yale database.
Face examples from the Yale database
In this experiment, we randomly pick up the 20% images as labeled data, the rest images as unlabeled data for each category. In other words, since each category includes 11 images, we randomly choose 2 images to provide the label information as additional constraints, and consider the rest 9 images as the unlabeled data. Finally, the labeled data and unlabeled data are mixed as a whole for clustering. The experimental results are summarized in
Table 3
. In general, our proposed methods are always better than other algorithms on Yale database. Compared with the SCC algorithm, CSCC and KCSCC algorithms achieve 3.9% and 4.5% improvement in AC, respectively. For NMI, CSCC and KCSCC achieve 4.2% and 5.6% improvement, respectively. Compared with the best NMF algorithm, CSCC and KCSCC algorithms achieve 2.7% and 3.3% improvement in AC, respectively. For NMI, CSCC and KCSCC algorithm achieve 3.5% and 4.9% improvement, respectively.
Clustering performance on Yale database
Clustering performance on Yale database
 4.2 Parameters Discussion
In our experiments, the size of neighborhood
p
and the regularization parameter
α
are empirically set to 5 and 0.1, respectively. We implement our proposed KCSCC algorithm with degree 2 polynomial kernel. Suppose that the cardinality denotes the nonzero number of representation coefficient and is empirically specified as half of the number of the basis. In addition, the number of basis vectors is empirically set as the number of clusters.
Fig. 4
shows how the performances of our proposed algorithms vary with the cardinality parameter on all image databases. In each experiment, we simply use the first 10 categories samples for clustering. Since the number of basis vectors is empirically set to the number of clusters, there are 10 basis vectors in new concept space. In other words, we can use a 10 dimension vector to represent each data on concept space, where the vector is generally called representation coefficient. From
Fig. 4
we can see that each high dimensional data, such as 1024 dimensionality, can be represented by the coefficient with only 3 nonzero entries in new concept space. Therefore, it indicates that the representation for each image is very sparse in concept space. This observation suggests that we can use a linear combination of only a few concepts to represent each image. This is consistent with our common knowledge since most of the images contain only a few concepts.
The performance with varied the cardinality
Fig. 5
shows the performances of our proposed algorithms with the increasing of the number of labeled data. Kmeans, CF, NMF, PCA and SCC are unsupervised learning algorithms without regard to label information of the data. CSCC and KCSCC, however, are semisupervised learning algorithms, and thus the performances have a close relation to the number of the labeled data. In each experiment, we randomly choose 10 categories from image database and randomly pick up different ratios of data samples per class as labeled data for clustering. Then 10 independent experiments are taken to calculate the average performance and the results are shown in
Fig. 5
. We carry out the experiments with the ratio of the labeled data ranging from 10% to 50%. As can be seen in
Fig. 5
, CSCC and KCSCC outperform other unsupervised learning algorithms. Moreover, it can be found that the performances of our semisupervised learning algorithms, such as CSCC and KCSCC, become better as the number of labeled data increases. It implies that label information plays an important role for image representation. This is also consistent with our understanding for the role of label information in semisupervised learning algorithm.
The performance with varied the number of labeled data
Fig. 6
displays the results of all comparative algorithms with varied the number of basis vectors on three image datasets. For various matrix factorization algorithms, the selection of the number of basis vectors is a fundamental topic. Throughout above experiments, the number of basis vectors is empirically set to the number of the clusters. In this experiment, we randomly choose 10 categories data from the image database for clustering and repeat the experiment process 10 times. Then the average performances of all methods are shown in
Fig. 6
. As we can see, when the number of basis vectors ranges from 5 to 250, it is easy to check that the performances of CSCC and KCSCC are superior to other matrix factorization algorithms on all databases. It can be found that our proposed algorithms achieve the highest performance when the number of basis vectors is equal to the number of classes. Besides, we can see that the performances of CSCC and KCSCC are relatively stable with varied the number of basis vectors.
The performance with varied the number of basis vectors
 4.3 Observations
These experiments on MNIST, PIE and Yale image data sets reveal some interesting points:
(1) As we can see, SCC is superior to Kmeans, NMF, CF and PCA on MNIST and PIE databases. Unfortunately, the performance of SCC is inferior to NMF and PCA on Yale database. Thus, we can see that SCC cannot achieve the best performance compared with NMF and PCA in all data sets. The reason may be that SCC only explores the manifold structure of the data.
(2) CSCC can obtain the better performance than SCC on all image databases. The reason is that SCC is completely unsupervised. Our proposed CSCC, however, is a semisupervised learning algorithm, and thus incorporates limited label information into graph embedding. The embedding results of CSCC are consistent with the prior knowledge such that the images from the same class are merged together and simultaneously preserve the manifold structure information of data. The experimental results demonstrate that limited label information, when used in conjunction with geometric manifold structure information, can improve the performance in clustering.
(3) Regardless of the database, we can see that the average performance of KCSCC consistently outperforms CSCC. The main reason is that KCSCC not only preserves the advantages of CSCC, but also can handle the nonlinear structure data via kernel trick. Therefore, KCSS provides more discriminative power than other competitors, such as SCC and CSCC.
(4) As we can see, when the minimum for cardinality is 3, our proposed CSCC and KCSCC algorithms can maintain the stable performances on three image databases. Therefore, the representation coefficient for each image is very sparse in new concept space. It is consistent with our understanding that each image can be presented by a linear combination only a few concepts.
5. Conclusion
In this paper, a novel semisupervised method, called CSCC, is proposed for image representation, which has many advantages over traditional sparse coding techniques. CSCC explores the geometric structure information among the original data, and simultaneously takes advantage of the limited label information in a parameterfree manner.
Subsequently
, the kernel extension of CSCC, named KCSCC, is proposed to deal with the nonlinear distribution data. KCSCC has more discriminative power than its linear method in most cases.
Finally
, similar to SCC, our proposed algorithms only solve an eigenvalue problem and two regression problems. In comparison with traditional sparse coding algorithms, CSCC and KCSCC are also very efficient. Experimental results demonstrate that our algorithms can provide a better representation than stateoftheart matrix factorization algorithms.
However, there are still several drawbacks existed in CSCC and KCSCC to be considered in the future work. First, our proposed algorithms cannot provide a mechanism for noise removing, thus they are not robust methods for image representation. Therefore, we can replace the ℓ2 norm by the ℓ2,1 norm for noisy data in future work. Second, the number of basis vectors and the cardinality play a very important role in improving the performances of our proposed algorithms. How to effectively set them is an interesting topic.
BIO
Zhenqiu Shu received the B.Sc. degree from University of South China, Hengyang, China, in 2008 and the M.S. degree Kunming University of Science and Technology, Kunming, China, in 2011. From 2011 to now, he is working toward his Ph.D. degree in Pattern Recognition and Intelligent Systems from Nanjing University of Science and Technology (NUST), Jiangsu, China. His research interests include machine learning, data mining, and pattern recognition.
Chunxia Zhao received the B.E., M.S. and Ph.D. degrees from Harbin Institute of Technology, Harbin, China, in 1985, 1988, 1998, respectively, both in the Department of Electrical Engineering and Computer. She is a professor in the Department of Computer Science, Nanjing University of Science and Technology. Her current interests are in the areas of robots, computer vision, and pattern recognition.
Pu Huang received his B.S. and M.S. degrees in computer applications from Yangzhou University, PR China, in 2007 and 2010, respectively. He is currently pursuing the Ph.D. degree in Pattern Recognition and Intelligent Systems at Nanjing University of Science and Technology (NUST), China. His research interests include pattern recognition, computer vision and machine learning.
Zhao W.
,
Chellappa R.
,
Phillips P. J.
2003
“Face recognition: A literature survey”
ACM Computing Surveys (CSUR)
Article (CrossRef Link)
35
(4)
399 
458
DOI : 10.1145/954339.954342
Turk M.
,
Pentland A.
1991
“Eigenfaces for recognition”
Journal of Cognitive Neuroscience
Article (CrossRef Link)
3
(1)
71 
86
DOI : 10.1162/jocn.1991.3.1.71
Belhumeur P. N.
,
Hespanha J. P.
,
Kriegman D. J.
1997
“Eigenfaces vs. fisherfaces: recognition using class specific linear projection”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
19
(7)
711 
720
DOI : 10.1109/34.598228
Lee J. M.
,
Yoo C. K.
,
Choi S. W.
2004
“Nonlinear process monitoring using kernel principal component analysis”
Chemical Engineering Science
Article (CrossRef Link)
59
(1)
223 
234
DOI : 10.1016/j.ces.2003.09.012
Mika S.
,
Ratsch G.
,
Weston J.
2003
“Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link).
25
(5)
623 
628
DOI : 10.1109/TPAMI.2003.1195996
Yan S.
,
Xu D.
,
Zhang B.
2007
“Graph embedding and extensions: A general framework for dimensionality reduction”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link).
29
(1)
40 
51
DOI : 10.1109/TPAMI.2007.250598
Tenenbaum J. B.
,
de Silva V.
,
Langford J. C.
2000
“A global geometric framework for nonlinear dimensionality reduction”
Science
Article (CrossRef Link)
290
(5500)
2319 
2323
DOI : 10.1126/science.290.5500.2319
Roweis S. T.
,
Saul L. K.
2000
“Nonlinear dimensionality reduction by locally linear embedding”
Science
Article (CrossRef Link)
290
(5500)
2323 
2326
DOI : 10.1126/science.290.5500.2323
Belkin M.
,
Niyogi P.
2003
“Laplacian eigenmaps for dimensionality reduction and data representation”
Neural Comput.
Article (CrossRef Link)
15
(6)
1373 
1396
DOI : 10.1162/089976603321780317
He X.
,
Yan S.
,
Hu Y.
2005
“Face recognition using Laplacianfaces”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
27
(3)
328 
340
DOI : 10.1109/TPAMI.2005.55
He X.
,
Ji M.
,
Bao H.
2009
“Graph embedding with constraints”
in Proc. of t the 21st International Joint Conference on Artificial Intelligence (IJCAI)
Pasadena, USA
vol. 9, Article (CrossRef Link)
1065 
1070
Lee D.
,
Seung H.
1999
“Learning the parts of objects by nonnegative matrix factorization”
Nature
Article (CrossRef Link)
401
788 
791
DOI : 10.1038/44565
Qian J.
,
Yang J.
2013
“Discriminative Histograms of Local Dominant Orientation (DHLDO) for Biometric Image Feature Extraction”
Pattern Recognition
Article (CrossRef Link)
46
(10)
2724 
2739
DOI : 10.1016/j.patcog.2013.03.005
Xu W.
,
Gong Y.
2004
“Document Clustering by Concept Factorization”
in Proc. of ACM SIGIR ’04
Article (CrossRef Link)
202 
209
Qian J.
,
Yang J.
,
Xu Y.
2013
“Local Structurebased Image Decomposition for Feature Extraction with Applications to Face Recognition”
IEEE Transactions on Image Processing
Article (CrossRef Link)
22
(9)
3591 
3603
DOI : 10.1109/TIP.2013.2264676
Chen Y
,
Zhang J.
,
Cai D.
2013
“Nonnegative local coordinate factorization for image representation”
IEEE Transactions on Image Processing
Article (CrossRef Link)
22
(3)
969 
979
DOI : 10.1109/TIP.2012.2224357
Cai D.
,
He X.
,
Han J.
2011
“Graph regularized nonnegative matrix factorization for data representation”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
33
(8)
1548 
1560
DOI : 10.1109/TPAMI.2010.231
Liu H.
,
Wu Z.
,
Li X.
2012
“Constrained nonnegative matrix factorization for image representation”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
4
(7)
1299 
1311
DOI : 10.1109/TPAMI.2011.217
He Y.
,
Lu H.
,
Huang L.
2014
“Pairwise constrained concept factorization for data representation”
Neural Networks
Article (CrossRef Link)
52
1 
17
DOI : 10.1016/j.neunet.2013.12.007
Li Z.
,
Liu J.
,
Lu H.
2013
“Structure preserving nonnegative matrix factorization for dimensionality reduction”
Computer Vision and Image Understanding
Article (CrossRef Link)
117
(9)
1175 
1189
DOI : 10.1016/j.cviu.2013.04.003
Yang J.
,
Yang S.
,
Fu Y.
2008
“Nonnegative graph embedding”
in Proc. of IEEE Conference on Computer Vision and Pattern Recognition
Alaska, USA
Article (CrossRef Link)
1 
8
Wang J.
,
Bensmail H.
,
Gao X.
2013
“Multiple graph regularized nonnegative matrix factorization”
Pattern Recognition
Article (CrossRef Link)
46
(10)
2840 
2847
DOI : 10.1016/j.patcog.2013.03.007
Wright J.
,
Yang A.
,
Sastry S.
2009
“Robust face recognition via sparse representation”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
31
(2)
210 
227
DOI : 10.1109/TPAMI.2008.79
Qian Y.
,
Jia S.
,
Zhou J.
2011
"Hyperspectral unmixing via L1/2 sparsityconstrained nonnegative matrix factorization"
IEEE Transactions on Geoscience and Remote Sensing
Article (CrossRef Link)
49
(11)
4282 
4297
DOI : 10.1109/TGRS.2011.2144605
Li M.
,
Tang J.
,
Zhao C.
2012
“Active Learning on Sparse Graph for Image Annotation”
KSII Transactions on Internet & Information Systems
Article (CrossRef Link)
6
(10)
2650 
2662
Tang J.
,
Hong R.
,
Yan S.
2011
“Image annotation by kNN sparse graphbased label propagation over noisily tagged web images”
ACM Transactions on Intelligent Systems and Technology
Article (CrossRef Link)
2
(2)
1 
15
DOI : 10.1145/1899412.1899418
Zheng H.
,
Ye Q.
,
Jin Z.
2014
“A Novel Multiple Kernel Sparse Representation based Classification for Face Recognition”
KSII Transactions on Internet & Information Systems
Article (CrossRef Link).
8
(4)
1483 
1480
Cai D.
,
Wang X.
,
He X.
2009
“Probabilistic dyadic data analysis with local and global consistency”
in Proc. of the 26th Annual International Conference on Machine Learning
Article (CrossRef Link)
105 
112
Cai D.
,
He X.
,
Han J.
2011
“Locally consistent concept factorization for document clustering”
IEEE Transactions on Knowledge and Data Engineering
Article (CrossRef Link)
23
(6)
902 
913
DOI : 10.1109/TKDE.2010.165
Wang J.
,
Yang J.
,
Yu K
2010
“Localityconstrained linear coding for image classification”
in Proc. of IEEE Conference on Computer Vision and Pattern Recognition
San Francisco
Article (CrossRef Link)
3360 
3367
Mairal J.
,
Bach F.
,
Ponce J.
,
Sapiro G.
,
Zisserman A.
2009
“Nonlocal sparse models for image restoration”
in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Kyoto
Article (CrossRef Link)
2272 
2279
Gao S
,
Tsang I W H
,
Chia L T
2013
“Laplacian sparse coding, hypergraph laplacian sparse coding, and applications”
IEEE Transactions on Pattern Analysis and Machine Intelligence
Article (CrossRef Link)
35
(1)
92 
104
DOI : 10.1109/TPAMI.2012.63
Zheng M
,
Bu J
,
Chen C
2011
“Graph regularized sparse coding for image representation”
IEEE Transactions on Image Processing
Article (CrossRef Link)
20
(5)
1327 
1336
DOI : 10.1109/TIP.2010.2090535
Cai D.
,
Bao H.
,
He X.
2011
“Sparse concept coding for visual analysis”
in Proc. of IEEE Conference on Computer Vision and Pattern Recognition
Colorado, USA
Article (CrossRef Link)
2905 
2910
Zhou D.
,
Bousquet O.
,
Lal T. N.
,
Weston J.
2004
“Learning with local and global consistency”
in Proc. of Advances in Neural Information Processing Systems
Vancouver, Canada
Article (CrossRef Link)
321 
328
Zhu X.
,
Ghahramani Z.
,
Lafferty J.
2003
“Semisupervised learning using gaussian fields and harmonic functions”
in Proc. of the twentieth Internation Conference on Machine Learning
Washington, DC, United states
Article (CrossRef Link)
912 
919
Donoho D. L.
2006
“For most large underdetermined systems of linear equations the minimal L1norm solution is also the sparsest solution”
Communications on pure and applied mathematics
Article (CrossRef Link)
59
(6)
797 
829
DOI : 10.1002/cpa.20132
Candes E J
,
Romberg J K
,
Tao T
2006
“Stable signal recovery from incomplete and inaccurate measurements”
Communications on pure and applied mathematics
Article (CrossRef Link)
59
(8)
1207 
1223
DOI : 10.1002/cpa.20124
Efron B.
,
Hastie T.
,
Johnstone I.
,
Tibshirani R.
2004
“Least angle regression”
Annals of Statistics
Article (CrossRef Link)
32
(2)
407 
499
DOI : 10.1214/009053604000000067
Paige C. C.
,
Saunders M. A.
1982
“LSQR: An algorithm for sparse linear equations and sparse least squares”
ACM Transactions on Mathematical Software
Article (CrossRef Link)
8
(1)
43 
71
DOI : 10.1145/355984.355989
Friedman J.
2001
“The Elements of Statistical Learning: Data Mining, Inference, and Prediction”
The Mathematical Intelligencer
Article (CrossRef Link)
27
(2)
83 
85