Advanced
Constrained Sparse Concept Coding algorithm with application to image representation
Constrained Sparse Concept Coding algorithm with application to image representation
KSII Transactions on Internet and Information Systems (TIIS). 2014. Sep, 8(9): 3211-3230
Copyright © 2014, Korean Society For Internet Information
  • Received : April 14, 2014
  • Accepted : August 08, 2014
  • Published : September 28, 2014
Download
PDF
e-PUB
PubReader
PPT
Export by style
Share
Article
Author
Metrics
Cited by
TagCloud
About the Authors
Zhenqiu Shu
Chunxia Zhao
Pu Huang

Abstract
Recently, sparse coding has achieved remarkable success in image representation tasks. In practice, the performance of clustering can be significantly improved if limited label information is incorporated into sparse coding. To this end, in this paper, a novel semi-supervised algorithm, called constrained sparse concept coding (CSCC), is proposed for image representation. CSCC considers limited label information into graph embedding as additional hard constraints, and hence obtains embedding results that are consistent with label information and manifold structure information of the original data. Therefore, CSCC can provide a sparse representation which explicitly utilizes the prior knowledge of the data to improve the discriminative power in clustering. Besides, a kernelized version of our proposed CSCC, namely kernel constrained sparse concept coding (KCSCC), is developed to deal with nonlinear data, which leads to more effective clustering performance. The experimental evaluations on the MNIST, PIE and Yale image sets show the effectiveness of our proposed algorithms.
Keywords
1. Introduction
I n real-world applications, such as face recognition, image retrieval, and data clustering [1 , 2 , 3 , 4 , 5 , 13 , 14 , 15] , data representation of high-dimensional space is a challenging problem. Generally speaking, the high dimensional data leads to the computational time and memory requirements more expensive. Moreover, traditional methods can perform well in low-dimensional space, but may degrade in high-dimensional space. To solve these issues, many researchers try to seek a representation of the data in a latent semantic “concept” space instead of the original space. Therefore, matrix factorization methods based on different criterions have attracted considerable attention in the last decades.
Principal component analysis (PCA) [2] and linear discriminant analysis (LDA) [3] are the most popular matrix factorization methods. PCA is completely unsupervised learning method, which searches for a projection axis of maximal variance. In contrast with PCA, LDA is a supervised learning method, which aims to seek a transformation that maximizes the between-class scatter and simultaneously minimizes the within-class scatter. The linear methods, however, fail to deal with the nonlinear distribution data. To alleviate this problem, the kernel-based methods are developed to discover the essential structures of nonlinear data. The representative methods are kernel principal component analysis (KPCA) [4] and kernel Fisher discriminant analysis (KFDA) [5] , which are the kernel extensions of PCA and LDA, respectively. Extensive experiments have shown the effectiveness of KPCA and KFDA in many real-world applications.
A common problem of the previously mentioned methods is that they fail to discover the underlying intrinsic manifold structure. To overcome this deficiency, manifold-based learning methods are straightforward in detecting the nonlinear structures, which have been of wide concern. Yan et al . [6] proposed a general framework, called graph embedding, for dimensionality reduction. Many manifold-based learning methods, such as isometric feature mapping (ISOMAP) [7] , locally linear embedding (LLE) [8] , laplacian eigenmaps (LE) [9] and locality preserving projection (LPP) [10] can be integrated into this framework. To consider the label information of labeled samples, He et al . [11] presented a novel semi-supervised learning algorithm, called constrained graph embedding (CGE) for feature extraction and data representation. CGE incorporates the limited label information into graph embedding as additional constraints. Experimental results on real data sets have illustrated the effectiveness of CGE.
Different from all the aforementioned methods, there is psychological and physiological evidence for parts-based representation in the cognitive process of human brain. One of the well-known parts-based methods is non-negative matrix factorization (NMF) [12] that tries to decompose the original date matrix into the product of two non-negative matrices. In particular, due to the non-negative limitation, NMF allows only additive, not subtractive, operation, which leads to a parts-based representation of the original data. Concept factorization (CF) [14] is a variant of NMF in that each cluster is linearly represented by a few data, and each data is linearly represented by the cluster centers. The major advantage of CF over NMF is that it can be performed on positive as well as negative data and simultaneously kernelized to further improve performance. Until recently, massive other works have been done [16 , 17 , 18 , 19 , 20 , 21 , 22] on extensions of NMF.
However, the coefficient matrix of the above methods is usually dense. This is contrary to our understanding that the coefficient matrix is sparse since each sample is represented by a linear combination of only a few concepts. Inspired by biological visual systems, sparse coding (SC) is recently proposed for data representation and has been widely applied in many fields [23 , 24 , 25 , 26 , 27] . Traditional SC algorithm, however, fails to take consideration of the geometrical structure information, which can significantly improve the discriminant ability in real-world applications [16 , 17 , 28 , 29] . Consequently, a variety of extensions of SC have been developed to explore the geometrical structure of the data by adding some constraints. Wang et al . [30] proposed a novel technique, called locality-constrained linear coding (LLC), for image representation and classification. LLC preserves the local geometric structure in feature coding process. Mairalet et al . [31] developed simultaneous sparse coding as a framework where groups of similar signals are jointly decomposed by adding group sparsity regularization term. Gao et al . [32] presented a laplacian sparse coding (LSC) framework to solve the data classification and tagging problems. LSC incorporates the laplacian regularization into the mode of SC to preserve the consistence of similar local features. Similar to LSC, Zhang et al . [33] proposed a graph regularization sparse coding (GSC) approach for image representation. GSC captures the intrinsic manifold structure with resort to laplacian graph regularization. Experimental results on image databases have shown the effectiveness of GSC. However, one evident drawback of all the aforementioned SC methods is computationally expensive to optimize their models. To overcome this limitation, Cai et al . [34] proposed a very efficient algorithm, namely sparse concept coding (SCC), for visual analysis. One of the major advantages of SCC is very efficient because it only solves a sparse eigenvalue problem and two regression problems. For each sample, SCC seeks a sparse representation of basis vectors that are embedded the semantic structure information of the data.
Unfortunately, SCC is completely unsupervised without regard to label information. Some previous research efforts reveal that the simultaneous use of the labeled data and unlabeled data can further improve performance in clustering [35 , 36] . Thus, we propose a novel semi-supervised learning algorithm, called constrained sparse concept coding (CSCC) , for data representation. CSCC considers limited label information and the intrinsic manifold structure of data, simultaneously. Our empirical study on benchmark data sets shows the promising results of our proposed algorithm. The contributions of this paper are as follows:
(1) Our proposed CSCC preserves some merits of SCC. For example, it exploits the manifold structure of the data, and simultaneously only solves a sparse eigenvalue problem and two regression problems. Therefore, CSCC is also computationally efficient in comparison with other sparse coding methods.
(2) Compared with SCC, CSCC is a semi-supervised learning algorithm, and thus considers limited label information as additional hard constraints. Moreover, CSCC incorporates the label information into graph embedding in a parameter-free manner. As a result, CSCC not only respects the intrinsic manifold structure of the data, but also takes advantage of the label information of the labeled data.
(3) We also propose another algorithm, called kernel constrained sparse concept coding (KCSCC), based on our proposed CSCC. With the kernel trick, nonlinear relationships among data are transformed into linear relationships in the high-dimensional kernel space. Therefore, we may obtain more effective performance in most cases.
The remainder of this paper is organized as follows: Section 2 gives a brief review of the related work. Section 3 introduces our proposed algorithms. Section 4 provides the experimental results to demonstrate the effectiveness of the proposed algorithms. Finally, we provide some concluding remarks and suggestions for future work in Section 5.
2. Related Work
In the past few years, sparse coding is proposed based on matrix factorization for data representation, which aims to seek a sparse linear combination of basis vectors for each sample. Therefore, the model of SC can be defined based on matrix factorization model by imposing sparse constraints on representation coefficient.
Given a sample set X =[ x 1 ,⋯, x n ]∈ℜ m×n , where xi stands for a sample. Let U =[ u 1 ,⋯, u k ]∈ℜ m×k be the basis matrix, where ui can be regarded as a basis in the new representation space. Let A =[ a 1 ,⋯, a n ]∈ℜ k×n be the coefficient matrix, where ai denotes the representation coefficient of xi . Thus, the objective function of SC can be formally expressed as:
PPT Slide
Lager Image
where | A | 0 denotes the ℓ0-norm and enforces the sparsity on A , and β is a regularization parameter.
Unfortunately, ℓ0-norm minimization problem is not convex. Therefore, finding the sparsest solution of Eq. (2) is a NP-hard problem and computationally expensive. Recent studies have shown that if the solution is sparse enough, the sparsest solution of ℓ1-norm minimization problem is equal to the solution of ℓ0-norm minimization problem [37 , 38] . Thus, an alternative formulation for Eq. (1) is to replace ℓ0-norm regularization by ℓ1-norm regularization to enforce sparsity constraints:
PPT Slide
Lager Image
where | A | 1 denotes ℓ1-norm of the coefficient matrix A . Here, we can employ standard linear programming methods to solve the ℓ1-norm minimization problem in Eq. (2).
3. Our Proposed Methods
- 3.1 Matrix Factorization
Generally, factorization of matrices may be non-unique, and hence varieties of matrix factorization methods have been developed by imposing different constraints. Specifically, given a data set X =[ x 1 ,⋯, x n ]∈ℜ m×n , matrix factorization method tries to seek two matrices U ∈ ℜ m×k and A ∈ ℜ k×n satisfying:
PPT Slide
Lager Image
where U denotes the basis vectors and A is the coefficient matrix of the samples under the basis vectors U . Thus, the objective function of matrix factorization can be formalized as:
PPT Slide
Lager Image
where
PPT Slide
Lager Image
denotes the Frobenius norm of a matrix.
The models of various matrix factorization methods, such as PCA, LDA, Graph Embedding, NMF, CF, and SC can be constructed by adding different constraints on Eq. (3) based on different purposes. In this way, we can impose some constraints on Eq. (3) that the basis vectors U should be embedded the semantic structure of the data, and simultaneously the representation coefficient A should be sparse. In the next subsection, we first introduce the CSCC algorithm in detail.
- 3.2 CSCC algorithm
CSCC is a three-step algorithm for data representation. The first step is concept extraction . CSCC takes the label information as additional constraints into graph embedding. Therefore, the low dimensional concepts exploit both limited label information and the manifold structure information. The second step is basis learning . CSCC aims to learn a basis that can best fit the concepts. As a result, the basis are encoded the semantic structure of the original data. The final step is sparse representation learning . We can employ LARs [39] algorithm to learn a sparse representation for each data.
- 3.2.1 Concept extraction
Given a data set X =[ x 1 ,⋯, x n ]∈ℜ m×n , an adjacency graph G ={ X , W } can be constructed with data set X , where W denotes a weighted matrix. The elements of the weighted matrix W can be usually defined as:
PPT Slide
Lager Image
where N p ( xj ) is the set of p nearest neighbors of xj , L = D - W is the Laplacian matrix, D is a diagonal matrix and Dii =∑ j Wij . Let y =[ y 1 ,⋯, yn ] T be the map from the graph to the real line. According to the reference [13] , we will introduce how to incorporate label information of labeled data into graph embedding.
Assume that the first l samples x 1 ,⋯ xl belong to c classes as labeled set and the rest n - l samples x l +1 ,⋯, xn are unlabeled. We first construct an indicator matrix M ∈ℜ l×c where mij = 1 if xi is labeled with the j th class; mij = 0 otherwise. Once obtaining the indicator matrix M , we define the label constraint matrix S∈ℜ n×(n-l+c) as follows:
PPT Slide
Lager Image
where I n-l is a ( n - l )×( n - l ) identity matrix. For example, given n samples among which x 1 is from the first class, x 2 , x 3 and x 4 are from the second class, x 5 and x 6 are from the third class, and the rest n -6 samples are unlabeled. Thus, the label constraint matrix S can be defined:
PPT Slide
Lager Image
where I n-6 denotes a ( n -6)×( n -6) identity matrix. Note that concept extraction can map each image xi to yi from the original feature space to the new concept space. To take advantage of limited label information, we can impose the label constraints by introducing an auxiliary matrix z :
PPT Slide
Lager Image
From the Eq. (5), it can be found that if xi and xj share the same label, then yi = yj . Thus, we have:
PPT Slide
Lager Image
And
PPT Slide
Lager Image
Thus, the minimization problem reduces to:
PPT Slide
Lager Image
We can obtain the optimal vector z of Eq. (6) by solving the following generalized eigenvalue problem:
PPT Slide
Lager Image
Let Z =[ z 1 ,⋯, zd ], zi ’s are the eigenvectors of the generalized eigenvalue problem in Eq. (7) corresponding to the smallest eigenvalue. After we obtain z , the y can be derived by Eq. (5). Let Y =[ y 1 ,⋯, yd ], each row of Y is called a “ concept ” which embeds the semantic structure information for each sample. If there is no labeled data, we can obtain S = In . In this case, the CSCC method reduces to SCC method.
- 3.2.2 Basis learning
Considering the label information and the manifold structure at the same time, we aim to find a basis U that can best fit Y . That is to say, the basis U needs to satisfy XTU = Y . Unfortunately, the system is under-determined, such U does not exist. A feasible way to solve this problem is to impose a penalty on the norm of U as follows:
PPT Slide
Lager Image
where α is the nonnegative constant parameter and α U 2 can avoid over-fitting. In statistical learning, the model in Eq. (8) is called Ridge Regression problem [41] .
By taking the derivative of Eq. (8) with respect to U and setting it to zero, the optimal solution U * of Eq. (8) can be expressed as follows:
PPT Slide
Lager Image
Actually, the dimensionality of the images is so high that it is extremely time-consuming to solve ( XXT + αI ) -1 . Fortunately, some iterative algorithms, such as LSQR [40] , are used to directly find the solution of the regression problem in Eq. (8).
- 3.2.3 Sparse Representation Learning
Suppose that ai and xi denote the i th column vector of A and X , respectively. Once we obtain the basis U , the coefficient ai of the sample xi can be solved through the following minimization problem:
PPT Slide
Lager Image
where | ai | indicates the ℓ1-norm of ai and β > 0 is a constant parameter. The ℓ1-norm minimization problem in Eq. (10) is called LASSO in statistical learning [41] .
Note that the minimization optimization problem in Eq. (10) can be reformulated as follows:
PPT Slide
Lager Image
Fortunately, the Least Angel Regression (LARs) [39] algorithm can be used to solve the minimization optimization problem in (11). Thus, we need to set the cardinality (the number of non-zero entries) of ai without the parameter γ . In this way, it is easy to control the sparseness of representation coefficient ai .
PPT Slide
Lager Image
- 3.2.4 Computational Complexity of CSCC
Let q be the average number of non-zero entries in each data, and q m . p denotes the size of nearest neighbors, and d represents the dimensionality of the concept. The total computational complexity of CSCC algorithm includes three parts:
(1) In concept extraction phase, we need O ( n 2 q + n 2 p ) operations to calculate the nearest neighbor graph and O ( dnp +2 n ) operations to calculate Y . Considering d << n , we need O ( n 2 q + n 2 p ) operations to extract the concepts.
(2) In basis learning phase, the time complexity of calculating the U * in Eq. (9) with LSQR method is O ( dnq ).
(3) In sparse representation phase, the time complexity of calculating the coefficient matrix A in Eq. (11) with LARs method is O ( d 3 + md 2 ).
In summary, the total complexity of our CSCC algorithm is O ( n 2 q + n 2 p + d 3 + md 2 ). From the reference [34] , it is easy to check that the total computational cost of CSCC is equal to that of SCC.
- 3.3 KCSCC algorithm
To our knowledge, the kernel trick is proposed in SVM to handle classification tasks that cannot be linearly separable in the original space. Once the samples in original input space are mapped to the high dimensional feature space via kernel trick, the linear algorithms in pattern recognition can be employed to handle the data in kernel space.
Motivated by the fact that kernel methods can deal with the nonlinear structure data, we propose a nonlinear extension of our CSCC, namely KCSCC, which seeks a sparse representation of nonlinear data. Similar to CSCC, our proposed KCSCC is also a three-step method including concept extraction, basis learning and sparse representation learning.
The concept extraction phase of our proposed KCSCC is the same as that of CSCC. Consequently, we only introduce the basis learning and sparse representation learning of KCSCC.
- 3.3.1 Basis learning
Suppose that there exists a nonlinear mapping function Φ :ℜ n →F which can map the original feature to the kernel feature space:
PPT Slide
Lager Image
Meanwhile, define the K (·,·) as a Gram matrix, with elements
PPT Slide
Lager Image
where κ (·,·) is a positive semi-definite kernel function.
Similar to basis learning phase of CSCC, KCSCC needs to find a basis
PPT Slide
Lager Image
in high dimensional space to satisfy:
PPT Slide
Lager Image
where Y stands for embedding results in Eq. (4). In fact, the system in Eq. (13) is also under-determined, such
PPT Slide
Lager Image
does not exist. A natural approach is to approximate
PPT Slide
Lager Image
by solving:
PPT Slide
Lager Image
where I denotes an identity matrix and α > 0 is a parameter. Thus, the optimal solution
PPT Slide
Lager Image
of Eq. (14) is expressed as follows:
PPT Slide
Lager Image
Similarly, it is computational expensive to solve ( K + αI ) -1 . Fortunately, finding the solution of Eq. (14) can be converted into solving a regression problem. First, we need to define a projective function in the kernel space as follows:
PPT Slide
Lager Image
It is easy to show that the optimal solution
PPT Slide
Lager Image
of Eq. (14) is the same as that of the following regularized regression problem:
PPT Slide
Lager Image
where f ( xi ) and yi are the ith element of f ( x ) and Y , respectively; F is the RKHS associated with Mercer kernel κ and ∥∥ κ is the corresponding norm.
Similarly, the regression problem in Eq. (17) can be directly solved by using the LSQR [40] algorithm.
- 3.3.2 Sparse representation learning
Suppose that
PPT Slide
Lager Image
and xi are the i th column of
PPT Slide
Lager Image
and X , respectively. After we get the basis
PPT Slide
Lager Image
, the coefficient
PPT Slide
Lager Image
of the sample xi can be computed as follows:
PPT Slide
Lager Image
where
PPT Slide
Lager Image
enforces the sparsity on
PPT Slide
Lager Image
, K (:, xi )=[ K ( x 1 , xi ),⋯, K ( xn , xi )] T and β > 0 is a regularization parameter.
Note that the ℓ1-norm minimization problem in Eq. (18) can be reformulated as:
PPT Slide
Lager Image
Similarly, we can employ LARs [39] algorithm to directly solve the Eq. (19). Therefore, we need to specify the cardinality of
PPT Slide
Lager Image
without the parameter γ .
PPT Slide
Lager Image
- 3.3.3 Computational complexity of KCSCC
Similar to CSCC, the consuming time of the concept extraction phase is O ( n 2 q + n 2 p ) in our proposed KCSCC.
In basis learning phase, we need O ( mn 2 ) operations to map the original data to the kernelspace in Eq. (12). In addition, we also need O ( n 3 + n 2 d ) to solve the basis vectors by using the LSQR algorithm. Considering d << n , the total complexity of this phase is O ( mn 2 + n 3 ).
In sparse representation learning phase, we need O ( mn 2 ) operations to map the original data to the kernel space. Besides, we need O ( d 3 + md 2 ) operations to obtain the coefficient matrix
PPT Slide
Lager Image
by LARs algorithm. This total complexity of sparse representation learning can be written as O ( mn 2 + d 3 ).
In summary, the total complexity of our proposed KCSCC is O ( n 2 q + n 2 p + n 3 + mn 2 ).
4. Experimental Classification Results and Analysis
Recent studies have shown that matrix factorization methods are very powerful in clustering tasks. We carry out several experiments to evaluate the performance of our proposed algorithms on MNIST, PIE and Yale image databases. In our experiments, the two evaluation metrics of the clustering involve accuracy (AC) and normalized mutual information (NMI). The detail definitions of AC and NMI are found in [14] .
- 4.1 Performance Evaluation and Comparisons
In this subsection, we will systematically conduct the evaluations on some image datasets and compare the performances of CSCC and KCSCC with some other algorithms such as K-means, CF, NMF, PCA and SCC.
In these experiments, we randomly choose K (=3, 4,···,10) categories from the image data sets for clustering. For our proposed semi-supervised algorithm (CSCC and KCSCC), we randomly pick up some date samples for each subject to provide the available label information, and other data samples are unlabeled. For each given cluster number K , the experiment process is repeated 10 times and the average performance is recorded as the final result.
- 4.1.1 Experiments on MNIST handwritten database
The MNIST handwritten database includes a training set of 60 000 examples, and a test set of 10 000 examples. In this experiment, 500 samples are selected for clustering. These digit images have been normalized to 28×28 gray scale images. Fig. 1 shows some handwritten samples from MNIST database.
PPT Slide
Lager Image
Images from the MNIST database
In this experiment, we randomly pick up 50 samples for each subject and randomly select 20% from these samples as available label information. The rest digit images are left as unlabeled data. This is, for each subject, the labeled data are 10 images and the rest unlabeled data are 40 images, which are mixed as a whole for clustering. The experimental results are summarized in Table 1 . It is noticeable that our proposed methods consistently outperform other competitors. As shown in Table 1 , compared with the best algorithm, i.e., SCC, CSCC and KCSCC achieve 6.6 % and 9% improvement in AC, respectively. For NMI, CSCC and KCSCC algorithms achieve 7.3% and 7.6% improvement, respectively.
Clustering performance on MNIST database
PPT Slide
Lager Image
Clustering performance on MNIST database
- 4.1.2 Experiments on PIE face database
The PIE face database consists of 41,368 images of 68 individuals. The face images were captured by 13 synchronized cameras and 21 flashes under different pose, illumination and expression. In this experiment, we choose the frontal pose (C27) with varying lighting and illumination which leave us about 46 images per subject. The gray face images are normalized to 32×32 pixels. Some sample images from the PIE database are shown in Fig. 2 .
PPT Slide
Lager Image
Face examples from the PIE database
In this experiment, we randomly choose 46 samples per subject for clustering. For each category, we randomly pick up 9 face images to provide label information as labeled set, the rest images without label information as unlabeled set. Then the labeled set and unlabeled set are mixed for clustering. Shown in Table 2 are the comparison results of all matrix factorization methods. As we can see, KCSCC achieves the highest average AC 68.7% and the highest average NMI 67.5%. The average AC of CSCC achieves nearly 3.7% improvement and the average NMI achieves 6.2% improvement over the SCC, respectively. Compared to the CSCC, the average AC of KCSCC is better 1.8% and the average NMI is slightly better 1.1%.
Clustering performance on PIE database
PPT Slide
Lager Image
Clustering performance on PIE database
- 4.1.3 Experiments on Yale database
The Yale face database contains 15 subjects each providing 11 different images, thus 165 face images in total. For some subjects, the face images were taken under various lighting conditions and facial expressions. All face images have been normalized to 32×32 pixels. Fig. 3 shows some face images of two subjects from the Yale database.
PPT Slide
Lager Image
Face examples from the Yale database
In this experiment, we randomly pick up the 20% images as labeled data, the rest images as unlabeled data for each category. In other words, since each category includes 11 images, we randomly choose 2 images to provide the label information as additional constraints, and consider the rest 9 images as the unlabeled data. Finally, the labeled data and unlabeled data are mixed as a whole for clustering. The experimental results are summarized in Table 3 . In general, our proposed methods are always better than other algorithms on Yale database. Compared with the SCC algorithm, CSCC and KCSCC algorithms achieve 3.9% and 4.5% improvement in AC, respectively. For NMI, CSCC and KCSCC achieve 4.2% and 5.6% improvement, respectively. Compared with the best NMF algorithm, CSCC and KCSCC algorithms achieve 2.7% and 3.3% improvement in AC, respectively. For NMI, CSCC and KCSCC algorithm achieve 3.5% and 4.9% improvement, respectively.
Clustering performance on Yale database
PPT Slide
Lager Image
Clustering performance on Yale database
- 4.2 Parameters Discussion
In our experiments, the size of neighborhood p and the regularization parameter α are empirically set to 5 and 0.1, respectively. We implement our proposed KCSCC algorithm with degree 2 polynomial kernel. Suppose that the cardinality denotes the non-zero number of representation coefficient and is empirically specified as half of the number of the basis. In addition, the number of basis vectors is empirically set as the number of clusters.
Fig. 4 shows how the performances of our proposed algorithms vary with the cardinality parameter on all image databases. In each experiment, we simply use the first 10 categories samples for clustering. Since the number of basis vectors is empirically set to the number of clusters, there are 10 basis vectors in new concept space. In other words, we can use a 10 dimension vector to represent each data on concept space, where the vector is generally called representation coefficient. From Fig. 4 we can see that each high dimensional data, such as 1024 dimensionality, can be represented by the coefficient with only 3 non-zero entries in new concept space. Therefore, it indicates that the representation for each image is very sparse in concept space. This observation suggests that we can use a linear combination of only a few concepts to represent each image. This is consistent with our common knowledge since most of the images contain only a few concepts.
PPT Slide
Lager Image
The performance with varied the cardinality
Fig. 5 shows the performances of our proposed algorithms with the increasing of the number of labeled data. K-means, CF, NMF, PCA and SCC are unsupervised learning algorithms without regard to label information of the data. CSCC and KCSCC, however, are semi-supervised learning algorithms, and thus the performances have a close relation to the number of the labeled data. In each experiment, we randomly choose 10 categories from image database and randomly pick up different ratios of data samples per class as labeled data for clustering. Then 10 independent experiments are taken to calculate the average performance and the results are shown in Fig. 5 . We carry out the experiments with the ratio of the labeled data ranging from 10% to 50%. As can be seen in Fig. 5 , CSCC and KCSCC outperform other unsupervised learning algorithms. Moreover, it can be found that the performances of our semi-supervised learning algorithms, such as CSCC and KCSCC, become better as the number of labeled data increases. It implies that label information plays an important role for image representation. This is also consistent with our understanding for the role of label information in semi-supervised learning algorithm.
PPT Slide
Lager Image
The performance with varied the number of labeled data
Fig. 6 displays the results of all comparative algorithms with varied the number of basis vectors on three image datasets. For various matrix factorization algorithms, the selection of the number of basis vectors is a fundamental topic. Throughout above experiments, the number of basis vectors is empirically set to the number of the clusters. In this experiment, we randomly choose 10 categories data from the image database for clustering and repeat the experiment process 10 times. Then the average performances of all methods are shown in Fig. 6 . As we can see, when the number of basis vectors ranges from 5 to 250, it is easy to check that the performances of CSCC and KCSCC are superior to other matrix factorization algorithms on all databases. It can be found that our proposed algorithms achieve the highest performance when the number of basis vectors is equal to the number of classes. Besides, we can see that the performances of CSCC and KCSCC are relatively stable with varied the number of basis vectors.
PPT Slide
Lager Image
The performance with varied the number of basis vectors
- 4.3 Observations
These experiments on MNIST, PIE and Yale image data sets reveal some interesting points:
(1) As we can see, SCC is superior to K-means, NMF, CF and PCA on MNIST and PIE databases. Unfortunately, the performance of SCC is inferior to NMF and PCA on Yale database. Thus, we can see that SCC cannot achieve the best performance compared with NMF and PCA in all data sets. The reason may be that SCC only explores the manifold structure of the data.
(2) CSCC can obtain the better performance than SCC on all image databases. The reason is that SCC is completely unsupervised. Our proposed CSCC, however, is a semi-supervised learning algorithm, and thus incorporates limited label information into graph embedding. The embedding results of CSCC are consistent with the prior knowledge such that the images from the same class are merged together and simultaneously preserve the manifold structure information of data. The experimental results demonstrate that limited label information, when used in conjunction with geometric manifold structure information, can improve the performance in clustering.
(3) Regardless of the database, we can see that the average performance of KCSCC consistently outperforms CSCC. The main reason is that KCSCC not only preserves the advantages of CSCC, but also can handle the nonlinear structure data via kernel trick. Therefore, KCSS provides more discriminative power than other competitors, such as SCC and CSCC.
(4) As we can see, when the minimum for cardinality is 3, our proposed CSCC and KCSCC algorithms can maintain the stable performances on three image databases. Therefore, the representation coefficient for each image is very sparse in new concept space. It is consistent with our understanding that each image can be presented by a linear combination only a few concepts.
5. Conclusion
In this paper, a novel semi-supervised method, called CSCC, is proposed for image representation, which has many advantages over traditional sparse coding techniques. CSCC explores the geometric structure information among the original data, and simultaneously takes advantage of the limited label information in a parameter-free manner. Subsequently , the kernel extension of CSCC, named KCSCC, is proposed to deal with the nonlinear distribution data. KCSCC has more discriminative power than its linear method in most cases. Finally , similar to SCC, our proposed algorithms only solve an eigenvalue problem and two regression problems. In comparison with traditional sparse coding algorithms, CSCC and KCSCC are also very efficient. Experimental results demonstrate that our algorithms can provide a better representation than state-of-the-art matrix factorization algorithms.
However, there are still several drawbacks existed in CSCC and KCSCC to be considered in the future work. First, our proposed algorithms cannot provide a mechanism for noise removing, thus they are not robust methods for image representation. Therefore, we can replace the ℓ-2 norm by the ℓ-2,1 norm for noisy data in future work. Second, the number of basis vectors and the cardinality play a very important role in improving the performances of our proposed algorithms. How to effectively set them is an interesting topic.
BIO
Zhenqiu Shu received the B.Sc. degree from University of South China, Hengyang, China, in 2008 and the M.S. degree Kunming University of Science and Technology, Kunming, China, in 2011. From 2011 to now, he is working toward his Ph.D. degree in Pattern Recognition and Intelligent Systems from Nanjing University of Science and Technology (NUST), Jiangsu, China. His research interests include machine learning, data mining, and pattern recognition.
Chunxia Zhao received the B.E., M.S. and Ph.D. degrees from Harbin Institute of Technology, Harbin, China, in 1985, 1988, 1998, respectively, both in the Department of Electrical Engineering and Computer. She is a professor in the Department of Computer Science, Nanjing University of Science and Technology. Her current interests are in the areas of robots, computer vision, and pattern recognition.
Pu Huang received his B.S. and M.S. degrees in computer applications from Yangzhou University, PR China, in 2007 and 2010, respectively. He is currently pursuing the Ph.D. degree in Pattern Recognition and Intelligent Systems at Nanjing University of Science and Technology (NUST), China. His research interests include pattern recognition, computer vision and machine learning.
References
Zhao W. , Chellappa R. , Phillips P. J. 2003 “Face recognition: A literature survey” ACM Computing Surveys (CSUR) Article (CrossRef Link) 35 (4) 399 - 458    DOI : 10.1145/954339.954342
Turk M. , Pentland A. 1991 “Eigenfaces for recognition” Journal of Cognitive Neuroscience Article (CrossRef Link) 3 (1) 71 - 86    DOI : 10.1162/jocn.1991.3.1.71
Belhumeur P. N. , Hespanha J. P. , Kriegman D. J. 1997 “Eigenfaces vs. fisherfaces: recognition using class specific linear projection” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 19 (7) 711 - 720    DOI : 10.1109/34.598228
Lee J. M. , Yoo C. K. , Choi S. W. 2004 “Nonlinear process monitoring using kernel principal component analysis” Chemical Engineering Science Article (CrossRef Link) 59 (1) 223 - 234    DOI : 10.1016/j.ces.2003.09.012
Mika S. , Ratsch G. , Weston J. 2003 “Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link). 25 (5) 623 - 628    DOI : 10.1109/TPAMI.2003.1195996
Yan S. , Xu D. , Zhang B. 2007 “Graph embedding and extensions: A general framework for dimensionality reduction” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link). 29 (1) 40 - 51    DOI : 10.1109/TPAMI.2007.250598
Tenenbaum J. B. , de Silva V. , Langford J. C. 2000 “A global geometric framework for nonlinear dimensionality reduction” Science Article (CrossRef Link) 290 (5500) 2319 - 2323    DOI : 10.1126/science.290.5500.2319
Roweis S. T. , Saul L. K. 2000 “Nonlinear dimensionality reduction by locally linear embedding” Science Article (CrossRef Link) 290 (5500) 2323 - 2326    DOI : 10.1126/science.290.5500.2323
Belkin M. , Niyogi P. 2003 “Laplacian eigenmaps for dimensionality reduction and data representation” Neural Comput. Article (CrossRef Link) 15 (6) 1373 - 1396    DOI : 10.1162/089976603321780317
He X. , Yan S. , Hu Y. 2005 “Face recognition using Laplacianfaces” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 27 (3) 328 - 340    DOI : 10.1109/TPAMI.2005.55
He X. , Ji M. , Bao H. 2009 “Graph embedding with constraints” in Proc. of t the 21st International Joint Conference on Artificial Intelligence (IJCAI) Pasadena, USA vol. 9, Article (CrossRef Link) 1065 - 1070
Lee D. , Seung H. 1999 “Learning the parts of objects by non-negative matrix factorization” Nature Article (CrossRef Link) 401 788 - 791    DOI : 10.1038/44565
Qian J. , Yang J. 2013 “Discriminative Histograms of Local Dominant Orientation (D-HLDO) for Biometric Image Feature Extraction” Pattern Recognition Article (CrossRef Link) 46 (10) 2724 - 2739    DOI : 10.1016/j.patcog.2013.03.005
Xu W. , Gong Y. 2004 “Document Clustering by Concept Factorization” in Proc. of ACM SIGIR ’04 Article (CrossRef Link) 202 - 209
Qian J. , Yang J. , Xu Y. 2013 “Local Structure-based Image Decomposition for Feature Extraction with Applications to Face Recognition” IEEE Transactions on Image Processing Article (CrossRef Link) 22 (9) 3591 - 3603    DOI : 10.1109/TIP.2013.2264676
Chen Y , Zhang J. , Cai D. 2013 “Nonnegative local coordinate factorization for image representation” IEEE Transactions on Image Processing Article (CrossRef Link) 22 (3) 969 - 979    DOI : 10.1109/TIP.2012.2224357
Cai D. , He X. , Han J. 2011 “Graph regularized nonnegative matrix factorization for data representation” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 33 (8) 1548 - 1560    DOI : 10.1109/TPAMI.2010.231
Liu H. , Wu Z. , Li X. 2012 “Constrained nonnegative matrix factorization for image representation” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 4 (7) 1299 - 1311    DOI : 10.1109/TPAMI.2011.217
He Y. , Lu H. , Huang L. 2014 “Pairwise constrained concept factorization for data representation” Neural Networks Article (CrossRef Link) 52 1 - 17    DOI : 10.1016/j.neunet.2013.12.007
Li Z. , Liu J. , Lu H. 2013 “Structure preserving non-negative matrix factorization for dimensionality reduction” Computer Vision and Image Understanding Article (CrossRef Link) 117 (9) 1175 - 1189    DOI : 10.1016/j.cviu.2013.04.003
Yang J. , Yang S. , Fu Y. 2008 “Nonnegative graph embedding” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition Alaska, USA Article (CrossRef Link) 1 - 8
Wang J. , Bensmail H. , Gao X. 2013 “Multiple graph regularized nonnegative matrix factorization” Pattern Recognition Article (CrossRef Link) 46 (10) 2840 - 2847    DOI : 10.1016/j.patcog.2013.03.007
Wright J. , Yang A. , Sastry S. 2009 “Robust face recognition via sparse representation” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 31 (2) 210 - 227    DOI : 10.1109/TPAMI.2008.79
Qian Y. , Jia S. , Zhou J. 2011 "Hyperspectral unmixing via L-1/2 sparsity-constrained nonnegative matrix factorization" IEEE Transactions on Geoscience and Remote Sensing Article (CrossRef Link) 49 (11) 4282 - 4297    DOI : 10.1109/TGRS.2011.2144605
Li M. , Tang J. , Zhao C. 2012 “Active Learning on Sparse Graph for Image Annotation” KSII Transactions on Internet & Information Systems Article (CrossRef Link) 6 (10) 2650 - 2662
Tang J. , Hong R. , Yan S. 2011 “Image annotation by k-NN sparse graph-based label propagation over noisily tagged web images” ACM Transactions on Intelligent Systems and Technology Article (CrossRef Link) 2 (2) 1 - 15    DOI : 10.1145/1899412.1899418
Zheng H. , Ye Q. , Jin Z. 2014 “A Novel Multiple Kernel Sparse Representation based Classification for Face Recognition” KSII Transactions on Internet & Information Systems Article (CrossRef Link). 8 (4) 1483 - 1480
Cai D. , Wang X. , He X. 2009 “Probabilistic dyadic data analysis with local and global consistency” in Proc. of the 26th Annual International Conference on Machine Learning Article (CrossRef Link) 105 - 112
Cai D. , He X. , Han J. 2011 “Locally consistent concept factorization for document clustering” IEEE Transactions on Knowledge and Data Engineering Article (CrossRef Link) 23 (6) 902 - 913    DOI : 10.1109/TKDE.2010.165
Wang J. , Yang J. , Yu K 2010 “Locality-constrained linear coding for image classification” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition San Francisco Article (CrossRef Link) 3360 - 3367
Mairal J. , Bach F. , Ponce J. , Sapiro G. , Zisserman A. 2009 “Non-local sparse models for image restoration” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Kyoto Article (CrossRef Link) 2272 - 2279
Gao S , Tsang I W H , Chia L T 2013 “Laplacian sparse coding, hypergraph laplacian sparse coding, and applications” IEEE Transactions on Pattern Analysis and Machine Intelligence Article (CrossRef Link) 35 (1) 92 - 104    DOI : 10.1109/TPAMI.2012.63
Zheng M , Bu J , Chen C 2011 “Graph regularized sparse coding for image representation” IEEE Transactions on Image Processing Article (CrossRef Link) 20 (5) 1327 - 1336    DOI : 10.1109/TIP.2010.2090535
Cai D. , Bao H. , He X. 2011 “Sparse concept coding for visual analysis” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition Colorado, USA Article (CrossRef Link) 2905 - 2910
Zhou D. , Bousquet O. , Lal T. N. , Weston J. 2004 “Learning with local and global consistency” in Proc. of Advances in Neural Information Processing Systems Vancouver, Canada Article (CrossRef Link) 321 - 328
Zhu X. , Ghahramani Z. , Lafferty J. 2003 “Semi-supervised learning using gaussian fields and harmonic functions” in Proc. of the twentieth Internation Conference on Machine Learning Washington, DC, United states Article (CrossRef Link) 912 - 919
Donoho D. L. 2006 “For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution” Communications on pure and applied mathematics Article (CrossRef Link) 59 (6) 797 - 829    DOI : 10.1002/cpa.20132
Candes E J , Romberg J K , Tao T 2006 “Stable signal recovery from incomplete and inaccurate measurements” Communications on pure and applied mathematics Article (CrossRef Link) 59 (8) 1207 - 1223    DOI : 10.1002/cpa.20124
Efron B. , Hastie T. , Johnstone I. , Tibshirani R. 2004 “Least angle regression” Annals of Statistics Article (CrossRef Link) 32 (2) 407 - 499    DOI : 10.1214/009053604000000067
Paige C. C. , Saunders M. A. 1982 “LSQR: An algorithm for sparse linear equations and sparse least squares” ACM Transactions on Mathematical Software Article (CrossRef Link) 8 (1) 43 - 71    DOI : 10.1145/355984.355989
Friedman J. 2001 “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” The Mathematical Intelligencer Article (CrossRef Link) 27 (2) 83 - 85