Human Action Recognition Using Pyramid Histograms of Oriented Gradients and Collaborative Multi-task Learning
KSII Transactions on Internet and Information Systems (TIIS). 2014. Feb, 8(2): 483-503
Copyright © 2014, Korean Society For Internet Information
  • Received : November 06, 2013
  • Accepted : January 09, 2014
  • Published : February 28, 2014
About the Authors
Zan Gao
Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, P.R. China
Hua Zhang
Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, P.R. China
An-An Liu
School of Electronic Information Engineering, Tianjin University, Tianjin, 300172, P.R. China
Yan-bing Xue
Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, P.R. China
Guang-ping Xu
Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, 300384, P.R. China

Abstract
In this paper, human action recognition using pyramid histograms of oriented gradients and collaborative multi-task learning is proposed. First, we accumulate global activities and construct motion history images (MHI) for the RGB and depth channels respectively to encode the dynamics of one action in different modalities, and then different action descriptors are extracted from the depth and RGB MHIs to represent the global textural and structural characteristics of these actions. Specifically, average value in hierarchical blocks, GIST and pyramid histograms of oriented gradients descriptors are employed to represent human motion. To demonstrate the superiority of the proposed method, we evaluate these descriptors with KNN, SVM with linear and RBF kernels, SRC and CRC models on the DHA dataset, a well-known dataset for human action recognition. Large-scale experimental results show that our descriptors are robust, stable and efficient, and outperform the state-of-the-art methods. In addition, we further investigate the performance of our descriptors by combining them on the DHA dataset, and observe that the performance of the combined descriptors is much better than that of any single descriptor. With multimodal features, we also propose a collaborative multi-task learning method for model learning and inference based on transfer learning theory. The main contributions lie in four aspects: 1) the proposed encoding scheme can filter the stationary parts of the human body and reduce noise interference; 2) different kinds of features and models are assessed, and the neighboring gradient information and pyramid layers are very helpful for representing these actions; 3) the proposed model can fuse the features from different modalities regardless of the sensor types, the ranges of the values, and the dimensions of different features; 4) the latent common knowledge among different modalities can be discovered by transfer learning to boost the performance.
Keywords
1. Introduction
Recently, human action recognition has become a research hotspot in the computer vision and machine learning domains, and it is widely applied in many areas, such as surveillance video analysis, man-machine interaction and video semantic retrieval. In the past decades, many action recognition algorithms [1 - 6] have been proposed. Among these approaches, motion history image [1 - 2] and spatio-temporal interest point methods [3 - 6] have been extensively used for representing and recognizing actions. In order to obtain an accurate MHI, the target needs to be segmented by various detection methods, otherwise the background pixels will affect the construction of the MHI. Similarly, when we extract spatio-temporal interest points, some interest points will be located in the background and act as noise with respect to the target interest points; thus researchers still want to obtain the target contour to filter out these noise interest points. However, in traditional videos, it is nontrivial to quickly and reliably detect, segment and track the human body, especially at night when the visual contents are unrecognizable or even invisible, and most state-of-the-art approaches will fail. Moreover, after obtaining the MHI, different descriptors are borrowed to represent human motions, such as the 7 Hu moments [7] and Gabor features [8], but which kinds of descriptors are suitable and robust for representing human motions? In addition, these state-of-the-art methods are assessed on well-known benchmarks, such as the Weizmann [2] and KTH [4] datasets, where they obtain satisfying recognition accuracy, but these datasets only include RGB information, and it is unclear how these descriptors will perform when applied to the depth channel.
With the development of imaging technology, such as the Microsoft Kinect and Leap Motion sensors, it has become possible to capture both color images and depth sequences simultaneously with a cheap device. Fig. 1 shows the outputs of the Kinect sensor, in which an RGB image and depth information are given. From it, we can observe that the depth sequence supplies much more information, such as additional human body shape and motion information; at the same time, the background pixels can be filtered easily by distance, so the target can be segmented and located accurately. Thus, the depth channel is very helpful for our task and is a powerful supplement to the RGB channel. In fact, some researchers [9 - 10] have tried to adopt depth information for human action recognition. For example, space-time volumes in depth and simple descriptors [9] were constructed and extracted to represent actions, and then approximate string matching (ASM) was used to recognize the action in depth; a bag-of-3D-points (3D silhouettes) method representing postures by sampling 3D points from depth maps was proposed by Li et al. [10], and an action graph was then borrowed to model these points and perform action recognition. Although these descriptors obtain good performance, they focus on the depth channel captured by the Kinect sensor and ignore the RGB information.
Fig. 1. RGB images and depth maps of different actions
Thus, in this paper, we first design descriptors that are suitable for both the RGB and depth modalities, and then evaluate these descriptors with different kinds of classification models on both modalities; at the same time, we compare them with classical descriptors that have been utilized on the KTH and Weizmann datasets. In addition, with multi-modality features, we also propose a collaborative multi-task learning method for model learning and inference based on transfer learning theory. For this purpose, we first design two kinds of motion history maps for the depth and RGB channels: 1) the depth motion history image ( DMHI ) is generated by searching the maximum and minimum motion energy of each pixel over consecutive frames, and their difference is taken as the DMHI ; 2) the RGB-depth motion history image ( RDMHI ), which is filtered and limited by depth information, is produced to characterize the corresponding action categories. After that, average value in hierarchical blocks, GIST and pyramid histograms of oriented gradients descriptors are proposed to represent human motion for the depth and RGB channels respectively. In order to evaluate and analyze the proposed descriptors, KNN, SVM with linear and RBF kernels, SRC and CRC models are utilized; at the same time, the public and challenging DHA action dataset, which includes both depth and RGB information, is employed. Fig. 2 displays the general framework of our approach.
Fig. 2. The overall framework of the proposed scheme
Large-scale experimental results disclose that in our proposed DMHI_PHOG and RDMHI_PHOG descriptors, the neighboring gradient information and pyramid layers are very useful for our task, and their accuracies are the best on both the depth and RGB channels. In addition, we further investigate the performance of our descriptors by combining them on the DHA dataset, and observe that the performance of the combined descriptors is much better than that of a single descriptor. What is more, with multi-modality features, collaborative multi-task learning is very helpful for our task. Large-scale comparison experiments on the public DHA dataset show the superiority of the proposed method, which outperforms the state-of-the-art methods.
The rest of the paper is structured as follows. Sections 2 and 3 present the related work and the motion history maps respectively, and then the feature representation and collaborative multi-task learning are given in detail. The experimental results are detailed in Section 6. The conclusions are given at last.
2. Related Work
Human action recognition is a challenging problem because of the high variability of appearances, potential occlusions and shapes. Recently it has obtained increasing attention owing to its wide range of applications in surveillance video analysis, man-machine interaction, video semantic retrieval, etc. In the last few decades, many feature representation methods have been developed for recognizing actions from video sequences based on color/RGB cameras. For example, sequences of human silhouettes are employed to model both the spatial and temporal characteristics of human actions [1]. In [1], motion energy images (MEIs) and motion history images (MHIs) are formed by temporally accumulated silhouettes, and then the seven Hu moments [7] and Gabor [8] features are extracted from both MEIs and MHIs to serve as action descriptors. Gaussian mixture models (GMM) are utilized in [11] to capture the distribution of the moments of silhouette sequences. In addition, motion flow patterns are proposed to represent human actions in [12 - 14]: optical flow [13] is calculated for the entire image by matching consecutive video frames, and then the motion patterns [12] or the estimated motion parameters [14] are utilized for action representation. However, in the real world, projecting three-dimensional motion onto the two-dimensional image plane makes these representations ambiguous.
Recently, in object recognition, local interest point approaches, which are much more robust to posture, illumination, occlusion, deformation and cluttered backgrounds than global appearance descriptions, have become very popular; thus researchers have also designed and developed spatio-temporal interest points for action recognition in video, which achieve state-of-the-art performance in activity recognition. The success of spatio-temporal local feature-based methods comes from these features being much more distinctive and descriptive. These methods include Cuboid [3], Harris3D [5], MoSIFT [6] and HOG3D [15]. In [3], Dollar et al. focus on the temporal domain and relax the spatial constraints, so that periodic frequency components can be detected by employing Gabor filters on the temporal dimension. Laptev et al. [5] proposed a 3D Harris corner detector, extending the 2D Harris corner detector, to detect compact and distinctive interest points that have high intensity variations in both the temporal and spatial dimensions. Chen et al. [6] utilized the well-known local interest point algorithm to detect visually remarkable areas in the spatial domain, and then these candidate interest points were retained under a motion constraint, which requires a sufficient amount of optical flow around each distinctive point. Although slightly different from each other, these methods share a common feature extraction and representation framework, which involves detecting local extrema of the image gradients and describing the point using the histogram of oriented gradients (HOG) and the histogram of optic flow (HOF).
Although many algorithms have been proposed for this task, the related works above mainly focus on analyzing video sources captured by RGB cameras, where they have achieved good performance; but how will these descriptors perform when applied to the depth channel? Thus, some researchers have paid attention to action recognition on depth datasets. For example, Lin et al. [9] constructed the space-time volume on the depth channel and proposed different kinds of descriptors, after which approximate string matching was employed as the classifier; Wang et al. [16] employed the depth and skeleton point information and constructed an actionlet ensemble model to recognize the actions; a bag-of-3D-points was proposed in [10] to represent postures by sampling 3D points from depth maps, and an action graph was then employed to model these points and realize action recognition. Their experimental results on the MSR Action3D dataset demonstrated that 3D silhouettes from depth sequences are much more helpful for action recognition than 2D silhouettes. Megavannan et al. [17] also recorded a depth action dataset and constructed different motion history images, after which different features were cascaded and an SVM classifier was trained and employed. Although these algorithms were evaluated on depth action datasets, there is still confusion about how to employ depth information, and about which kinds of descriptors are suitable for the depth channel or for both the RGB and depth channels. Secondly, since RGB and depth images represent one scene in different modalities, they are complementary to each other, and fusing both for discriminative feature representation and model construction will benefit human action recognition. In fact, in different research domains, the fusion of multi-modality features or multi-view features has attracted the attention of many researchers. For example, in web image search [18 - 20], video semantic annotation or tagging [21 - 24], 3D object retrieval [25 - 28], target tracking [29] and multi-view object classification [30 - 34], the authors have discussed the importance of fusing multi-modality or multi-view features, and experiments also showed that it was very helpful for the tasks in those research domains. Thus, we will first assess the performance when the descriptors in the RGB and depth channels are combined. Further, with features from multiple modality resources, we also propose collaborative multi-task learning based on transfer learning for human action recognition to assess the importance of the fusion of multi-modality features.
In addition, regarding algorithm evaluation, most of the above algorithms are assessed with only one kind of classification model, which is not adequate. For example, after extracting different kinds of features, the researchers in [3 - 6, 17] adopt SVM models to recognize human actions; approximate string matching [9] and a graph model [10] are employed to identify human motions; and in Bobick and Davis [1], similarity matching schemes are employed. What is worse, most current methods are highly dependent on the dataset, and therefore their generalization ability is severely constrained. To solve this problem, some authors have proposed model-free methods for human action recognition via sparse representation. For example, the authors of [35 - 42] extracted different kinds of features for each action and then employed sparse representation based classification algorithms directly without any change. SRC [41] was first proposed for face recognition, in which a testing sample is reconstructed and represented by all the training samples; after that, an impulse function is designed for each class and representation, and the minimum representation error is adopted to classify the testing sample. Similar to SRC, the philosophy of the methods proposed in [35 - 40] is to decompose each video sample containing one kind of human action as an ℓ1-sparse linear combination of several video samples containing multiple kinds of human actions, and this has achieved good performance. The reason for this success is that each point's neighborhood structure is utilized fully, which supplies better similarity measures between the testing data and all the training samples. After that, Zhang et al. [41] discussed the roles of the ℓ1-norm and ℓ2-norm respectively, and concluded that the sparsity in SRC is not so important, while collaborative representation plays a much more important role. Thus, what will happen when these descriptors are assessed by model-free models and by traditional classification algorithms constrained by the dataset?
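To make the sparse-representation classification idea above concrete, the following is a minimal sketch (not any of the cited authors' implementations) of SRC-style classification by class-wise reconstruction residual; the ℓ1 coding step is delegated to scikit-learn's Lasso, and all variable names and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(train_feats, train_labels, test_feat, alpha=0.01):
    """SRC-style sketch: l1-code the test sample over all training samples,
    then assign it to the class with the smallest reconstruction residual.

    train_feats: (d, n) matrix whose columns are training samples;
    train_labels: length-n array of class ids; test_feat: (d,) vector.
    """
    Phi = train_feats / (np.linalg.norm(train_feats, axis=0, keepdims=True) + 1e-12)
    y = test_feat / (np.linalg.norm(test_feat) + 1e-12)
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    coder.fit(Phi, y)                       # sparse coefficients over all atoms
    w = coder.coef_
    residuals = {}
    for c in np.unique(train_labels):       # keep only coefficients of class c
        mask = (train_labels == c)
        residuals[c] = np.linalg.norm(y - Phi[:, mask] @ w[mask])
    return min(residuals, key=residuals.get)
```

Replacing the Lasso step with a ridge (ℓ2) solve gives the CRC variant discussed by Zhang et al., which trades sparsity for a closed-form, much faster coding step.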
3. Motion History Image for RGB and Depth Modalities
In order to represent human motion, the human silhouette of each frame needs to be accumulated and encoded first; thus, we construct human motion maps for the RGB and depth channels respectively, and the details are given as follows.
- 3.1 MHI for RGB Modality
To describe human motion, the motion history image ( MHI ) [1], in which moving human silhouettes are accumulated and encoded, has been widely employed and has achieved good performance. However, Bobick and Davis [1] first detected or segmented the targets in the RGB video, which is difficult to realize in real conditions. Thus, in the construction of the MHI, all pixels in the RGB image are often treated without discrimination; in this way, a lot of noise is introduced, which affects and disturbs the motion shape. The second column in Fig. 3 displays the corresponding MHI. From it, we can see that the MHI does not describe the motion shape clearly, which makes it difficult for descriptors to represent the motion. Fortunately, depth information is very helpful for detecting the target, and we can employ it to detect the target and filter out the static noise pixels; in this way, we can obtain an almost precise MHI.
Fig. 3. Two groups of four columns. From left to right: RGB image, traditional MHI, RGB filtered by depth, and RDMHI of the jack and tai-chi actions respectively
In detail, the RGB image is first limited and filtered by the depth image, and then we compute the maximum and minimum values of each pixel over the video sequence; after that, the difference between the maximum and minimum pixel values is calculated to obtain the MHI for the RGB channel, called the RDMHI. The third and fourth columns in Fig. 3 show the results respectively. The definition of this processing is as follows:
$$rd(i,j,t)=\begin{cases} r(i,j,t), & \text{if } d(i,j,t) \text{ lies within the foreground depth range} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

$$RDMHI(i,j)=\max_{1\le t\le N} rd(i,j,t)\;-\;\min_{1\le t\le N} rd(i,j,t) \qquad (2)$$
where i and j are the pixel indices, t is the frame index, N is the total number of frames in an action sequence, and d ( i , j , t ) and r ( i , j , t ) are the depth value and RGB value of the pixel in frame t respectively. r ( i , j , t ) is filtered by d ( i , j , t ) to obtain rd ( i , j , t ), and RDMHI ( i , j ) is the MHI on the RGB modality. Fig. 3 demonstrates that the motion silhouette of the RDMHI is much clearer than that of the MHI; at the same time, a lot of static pixels are filtered out in the RDMHI.
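As an illustration of this construction, the following is a minimal NumPy sketch (not the authors' code) of computing an RDMHI from aligned RGB and depth frames; the depth gate (near, far) and the use of grayscale frames are assumptions made for the example.

```python
import numpy as np

def compute_rdmhi(rgb_frames, depth_frames, near=500, far=2500):
    """RDMHI sketch: depth-gated RGB max-min motion history.

    rgb_frames: (N, H, W) grayscale frames; depth_frames: (N, H, W) depth
    maps in the same units as `near`/`far` (assumed millimetres here).
    """
    rgb = np.asarray(rgb_frames, dtype=np.float32)
    depth = np.asarray(depth_frames, dtype=np.float32)
    # Keep RGB pixels only where the depth lies in the foreground range.
    foreground = (depth > near) & (depth < far)
    rd = np.where(foreground, rgb, 0.0)        # rd(i, j, t)
    # Difference between per-pixel maximum and minimum over the sequence.
    return rd.max(axis=0) - rd.min(axis=0)     # RDMHI(i, j)
```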
- 3.2 MHI for Depth Modality
In the construction of the MHI for the RGB modality, detecting and locating the targets is difficult in real conditions. However, the depth image, whose pixel values encode distance information, can be utilized to detect the targets, because the object usually lies within a certain distance from the background. Thus, according to the depth information, we set a suitable depth threshold to remove the background pixels in the depth image, whose depth is much larger than the threshold, while the foreground pixels are kept. Motion history images for action representation have achieved good performance on the RGB channel [1], and Megavannan et al. [17] also constructed different kinds of MHIs, extracted different features and concatenated them, achieving satisfying results. Thus, inspired by these works, we also develop a depth motion history image over the filtered depth image sequence to represent the spatial and temporal information of an action.
Suppose a depth image sequence $\{d(i,j,t)\}_{t=1}^{N}$. We first compute the maximum and minimum motion energy of each pixel over the length-N depth image sequence, and then calculate the difference between the maximum and minimum images. The definition is as follows:
$$DMHI(i,j)=\max_{1\le t\le N} d(i,j,t)\;-\;\min_{1\le t\le N} d(i,j,t) \qquad (3)$$
where i and j are the pixel indices, t is the frame index, N is the total number of frames in an action sequence, and d ( i , j , t ) is the depth value of the pixel in frame t. DMHI images not only convey important shape and motion cues of a human movement, but also filter out most of the static pixels. Fig. 4 shows the corresponding RGB images, depth maps and depth motion history images respectively. From it, we can see that the DMHI also demonstrates the motion silhouettes clearly.
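A corresponding sketch for the depth channel is given below (again illustrative rather than the authors' code); the background-removal threshold is an assumed parameter.

```python
import numpy as np

def compute_dmhi(depth_frames, background_threshold=2500):
    """DMHI sketch: per-pixel max-min motion energy on filtered depth.

    depth_frames: (N, H, W) depth maps; pixels farther than the
    threshold are treated as background and zeroed out.
    """
    depth = np.asarray(depth_frames, dtype=np.float32)
    filtered = np.where(depth < background_threshold, depth, 0.0)
    return filtered.max(axis=0) - filtered.min(axis=0)   # DMHI(i, j)
```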
Fig. 4. Two groups of three columns. From left to right: RGB image, depth map and DMHI of the jack and boxing actions respectively
4. Visual Representation
After constructing the motion history images, the dynamic sequence of a motion becomes a static image of the moving human silhouette; to recognize the motion, we first need to describe it. Thus, three different kinds of descriptors are proposed, and the following subsections present them in detail.
- 4.1 Hierarchical Blocks Descriptor
After obtaining the DMHI and RDMHI, we need to design suitable descriptors to represent the motion. As spatial information is very helpful for action recognition, the average value in hierarchical blocks ( AHB ) descriptor is proposed. In detail, the DMHI and RDMHI are divided into 8*8, 4*4, 2*2, 2*1 and 1*2 hierarchical blocks respectively, and then the average of the nonzero values in each block is calculated; in total, the feature dimension is 88 (64+16+4+2+2), and these descriptors are called DMHI_GAHB and RDMHI_GAHB respectively. In addition, in order to extract more accurate features, we also first find a rectangular bounding box of the DMHI and RDMHI images and then apply the same scheme to split the image and extract the feature; we name these DMHI_BAHB and RDMHI_BAHB respectively. Experimental results show that the performance of DMHI_BAHB and RDMHI_BAHB is much better than that of DMHI_GAHB and RDMHI_GAHB. Thus, in the following sections, we always adopt the rectangular bounding box scheme.
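The block-averaging step can be sketched as follows (an illustrative NumPy implementation under the 8*8/4*4/2*2/2*1/1*2 grid described above, not the authors' code); the bounding-box cropping simply uses the nonzero pixels of the motion map.

```python
import numpy as np

GRIDS = [(8, 8), (4, 4), (2, 2), (2, 1), (1, 2)]   # 64+16+4+2+2 = 88 dims

def ahb_descriptor(motion_map, use_bounding_box=True):
    """Average value in hierarchical blocks (AHB) over an MHI-like map."""
    img = np.asarray(motion_map, dtype=np.float32)
    if use_bounding_box:
        ys, xs = np.nonzero(img)
        if len(ys):                        # crop to the motion bounding box
            img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    feats = []
    for rows, cols in GRIDS:
        for r in range(rows):
            for c in range(cols):
                block = img[r * img.shape[0] // rows:(r + 1) * img.shape[0] // rows,
                            c * img.shape[1] // cols:(c + 1) * img.shape[1] // cols]
                nonzero = block[block > 0]
                feats.append(nonzero.mean() if nonzero.size else 0.0)
    return np.array(feats)                 # 88-dimensional descriptor
```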
- 4.2 Multi-Scale and Multi-Orient Descriptor
Although spatial information is useful, orientation and scale information are also helpful, as subjects often perform the same actions with different orientations and at different distances. Thus, we believe the descriptors should capture not only the spatial structure but also scale and orientation information. Through a large number of perceptual experiments, the authors of [42] proposed that a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) is very important for scene representation; they employed filters at different scales and orientations to compute these perceptual dimensions, in which each dimension depicts a meaningful property of the space of the scene. Large-scale experiments on scene recognition showed that its accuracy was excellent. Inspired by this, we believe that the GIST descriptor is also very helpful for our task. In fact, DMHI and RDMHI images can also be considered as scenes with different characteristics, in which naturalness, openness, roughness, expansion and ruggedness differ considerably. For example, the roughness dimension of two-hand waving will be much larger than that of one-hand waving, and the openness dimension of running will be larger than that of jumping. Thus, after obtaining the DMHI and RDMHI, the GIST descriptor is adopted, yielding the DMHI_GIST and RDMHI_GIST descriptors respectively. In these descriptors, filters at four scales and eight orientations are employed, each filtered DMHI and RDMHI is divided into 4*4 blocks, and the average value of each block is computed; thus the dimension of the DMHI_GIST and RDMHI_GIST descriptors is 512 (4*8*16). Experiments show that these descriptors outperform some state-of-the-art descriptors, such as the 7 Hu moment shape features, which are translation- and scale-invariant.
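The following sketch illustrates a GIST-style feature of the kind described here (4 scales * 8 orientations * a 4*4 grid = 512 dimensions); it uses OpenCV Gabor kernels and should be read as an approximation of the standard GIST pipeline, with all filter parameters chosen purely for illustration.

```python
import numpy as np
import cv2

def gist_like_descriptor(motion_map, scales=(3, 5, 9, 17), n_orients=8, grid=4):
    """512-D GIST-style descriptor: Gabor filter bank + 4x4 block averaging."""
    img = np.asarray(motion_map, dtype=np.float32)
    h, w = img.shape
    feats = []
    for ksize in scales:                                   # 4 scales
        for k in range(n_orients):                         # 8 orientations
            theta = np.pi * k / n_orients
            kern = cv2.getGaborKernel((ksize, ksize), sigma=0.5 * ksize,
                                      theta=theta, lambd=float(ksize),
                                      gamma=0.5, psi=0)
            resp = np.abs(cv2.filter2D(img, cv2.CV_32F, kern))
            for r in range(grid):                          # 4x4 grid averages
                for c in range(grid):
                    block = resp[r * h // grid:(r + 1) * h // grid,
                                 c * w // grid:(c + 1) * w // grid]
                    feats.append(block.mean())
    return np.array(feats)                                 # 4*8*16 = 512 dims
```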
- 4.3 Pyramid Histogram of Orientated Gradients Descriptor
In the above two kinds of descriptors, although spatial information and multi-scale, multi-orientation information are employed, only the average value of each block is calculated, which ignores neighborhood information and limits the descriptive ability. We therefore argue that we should not only adopt spatial information but also employ more robust descriptors that account for the neighborhood of each pixel. The histogram of oriented gradients ( HOG ) [43], in which the edge or gradient distribution of a local region is extracted so that the edge or gradient structure of the target can be represented well, was proposed to describe human shape information and has achieved good performance. Although the gradient orientations in HOG implicitly carry spatial location information, the effect of different spatial partitions on classification performance is neglected; thus the pyramid histogram of oriented gradients ( PHOG ) [44] was proposed. PHOG is a spatial shape descriptor that represents not only the whole shape but also local shape and spatial relationships, and it has attained satisfying performance on object classification.
Based on the above analysis, we believe that PHOG will also be very useful for our task; thus PHOG is adopted to represent the human motion maps. In our task, PHOG describes both the shape information of the human action and its spatial layout, and both are very helpful. PHOG extraction in our task proceeds as follows: 1) the DMHI and RDMHI motion maps are constructed, their rectangular bounding boxes are located, and background noise pixels are filtered out; 2) on the basis of the rectangular bounding box, PHOG features are extracted for the DMHI and RDMHI motion maps respectively, which we call DMHI_PHOG and RDMHI_PHOG. In calculating PHOG, a three-layer pyramid is constructed over the image, the range of gradient directions is 0~360 degrees in each layer, and the gradient directions of all pixels in each block of each layer are accumulated into 20 bins weighted by the pixel gradient magnitudes; 3) after that, the features of all layers are concatenated into the PHOG feature, whose dimension is 1700 ((1+4+16+64)*20).
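A minimal sketch of such a PHOG computation is given below (illustrative, not the authors' code): gradients are binned into 20 orientation bins over 0~360 degrees, weighted by magnitude, over pyramid levels 0-3 (1, 4, 16 and 64 cells), giving 85*20 = 1700 dimensions.

```python
import numpy as np

def phog_descriptor(motion_map, n_bins=20, levels=4):
    """PHOG sketch: magnitude-weighted orientation histograms over a pyramid."""
    img = np.asarray(motion_map, dtype=np.float32)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0           # 0~360 degrees
    h, w = img.shape
    feats = []
    for level in range(levels):                            # levels 0..3
        cells = 2 ** level                                  # 1, 2, 4, 8 cells per side
        for r in range(cells):
            for c in range(cells):
                rs, re = r * h // cells, (r + 1) * h // cells
                cs, ce = c * w // cells, (c + 1) * w // cells
                hist, _ = np.histogram(ang[rs:re, cs:ce], bins=n_bins,
                                       range=(0, 360),
                                       weights=mag[rs:re, cs:ce])
                feats.append(hist)
    feats = np.concatenate(feats)                           # (1+4+16+64)*20 = 1700
    return feats / (np.linalg.norm(feats) + 1e-12)
```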
5. Multi-modality Features Fusion by Collaborative Multi-task Learning
Since both the RGB and depth image sequences of one action can be captured synchronously by the Kinect, it is reasonable to assume that there exists an intrinsic correlation among the multiple modalities. Consequently, we can formulate the action recognition task with multi-modality feature fusion as a collaborative multi-task learning ( MMFCML ) problem, to discover the underlying common knowledge among the different modalities and thereby boost the performance. We propose to formulate the MMFCML problem over the multimodal signals by transfer learning.
- 5.1. Problem Formulation
In multi-modality feature fusion and collaborative multi-task learning, we are given a training set $\{X_i\}_{i=1}^{N}$ for K action classes with N training samples. Each member $X_i$ contains the multi-modal features $\{X_i^j\}_{j=1}^{T}$ of one sample, where $X_i^j \in \mathbb{R}^{d_j}$ denotes the feature representation of the i-th sample in the j-th modality and T denotes the total number of modalities. In real conditions, we often obtain only a limited number of samples per action class; thus, we collectively construct the dictionary over all classes of action samples for each modality feature. For simplicity, we define $\Phi = \{\Phi_j\}_{j=1}^{T}$ as the dictionary with multi-modal bases for the MMFCML task, where the dictionary $\Phi_j = [X_1^j, X_2^j, \ldots, X_N^j]$ for the j-th modality consists of all training samples of all action classes in that modality. For a test sample $Y = \{Y_j\}_{j=1}^{T}$, where $Y_j$ denotes the feature representation in the j-th modality, we can formulate the objective function:
$$W^{*}=\arg\min_{W}\;\sum_{j=1}^{T}\lambda_{j}\,\big\|\,Y_{j}-\Phi_{j}W\,\big\|_{2}^{2}\;+\;\lambda\,\|W\|_{2}^{2} \qquad (4)$$
where the first term is the empirical loss function and λj is the weight of the reconstruction error for the j-th modality feature; this empirical loss function is formulated based on the sparse coding principle. Given Yj and the corresponding dictionary Φj, sparse coding decomposes Yj over Φj such that Yj ≈ Φj × W + rj, where W is the coefficient vector and rj is the residual. W explicitly represents the similarity between Yj and each base of Φj. This term evaluates the reconstruction errors over the T modality signals, and its minimization allows the latent correlation to transfer among the multiple modality features. The second term is the ridge penalty, and λ controls the weight of the regularization. If two bases are highly similar to each other, they should be assigned almost the same weights. It is well known that the ridge penalty, with its strict convexity, preserves consistency of the decomposed coefficients; therefore, the ridge penalty imposes consistency on MMFCML. Furthermore, it stabilizes the least squares solution of the empirical loss function and induces a certain amount of sparsity.
The advantage of the proposed MMFCML formulation is that the consistency constraint can be incorporated with the loss functions that would otherwise be decoupled into single-modality learning problems. This combination facilitates transferring the common knowledge among the multiple modalities to boost the performance.
- 5.2 Solution and Inference
Although both the empirical loss function and the ridge penalty are convex in W, the ℓ1-norm term in the objective function of Eq. (4) is not differentiable, so the plain gradient descent method is not applicable. However, Nesterov's method [48] utilizes a linear combination of the previous two points as the search point to achieve a high convergence speed. Nesterov's method is based on two sequences {xi} and {si}, in which {xi} is the sequence of approximate solutions and {si} is the sequence of search points. The search point si is an affine combination of xi−1 and xi:
$$s_{i}=x_{i}+\alpha_{i}\,(x_{i}-x_{i-1}) \qquad (5)$$
where αi is the combination coefficient, and the approximate solution xi+1 is computed as a "gradient" step from si:
$$x_{i+1}=\pi_{G}\!\Big(s_{i}-\tfrac{1}{\gamma_{i}}\,f'(s_{i})\Big) \qquad (6)$$
where $\pi_G(v)$ is the Euclidean projection of v onto the convex set G:
$$\pi_{G}(v)=\arg\min_{x\in G}\;\tfrac{1}{2}\,\|x-v\|_{2}^{2} \qquad (7)$$
1/γi is the step size, and γi is determined by line search according to the Armijo-Goldstein rule so that γi is appropriate for si; the details can be found in [48]. Thus, in this paper, we adopt Nesterov's method to solve the optimization problem in Eq. (4), and we can derive the analytical solution as:
$$W^{*}=\Big(\sum_{j=1}^{T}\lambda_{j}\,\Phi_{j}^{\top}\Phi_{j}+\lambda I\Big)^{-1}\sum_{j=1}^{T}\lambda_{j}\,\Phi_{j}^{\top}Y_{j} \qquad (8)$$
With the optimal W*, we can compute the reconstruction error for class q as follows:
$$error(q)=\sum_{j=1}^{T}\lambda_{j}\,\big\|\,Y_{j}-\Phi_{j}^{\,q}\,W^{*}_{q}\,\big\|_{2}^{2} \qquad (9)$$
where $\Phi_j^{\,q}$ denotes the sub-dictionary corresponding to the j-th modality feature and the q-th action class, and $W^{*}_{q}$ denotes the coefficients in $W^{*}$ associated with that class; the class of the test sample is inferred by choosing the action class q with the minimum error(q). Regarding the complexity of the proposed algorithm, since the projection matrix $\big(\sum_{j}\lambda_{j}\Phi_{j}^{\top}\Phi_{j}+\lambda I\big)^{-1}$ is independent of the test sample $Y_j$ and can be pre-computed from the constructed dictionary Φ, the optimal solution of Eq. (4) can be obtained rapidly by simply projecting the test features onto it; thus, the inference can be computed quickly.
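To make the inference procedure concrete, the following is a minimal sketch of a ridge-based multi-modal collaborative representation classifier of the kind described above; it assumes the closed-form solution reconstructed in Eq. (8) and uses illustrative variable names, so it should be read as an approximation of the MMFCML inference rather than the authors' exact implementation.

```python
import numpy as np

def mmfcml_classify(dictionaries, labels, test_feats, modality_weights, lam=0.01):
    """MMFCML-style inference sketch.

    dictionaries: list of (d_j, N) matrices Phi_j, one per modality;
    labels: length-N class ids shared across modalities;
    test_feats: list of (d_j,) test vectors Y_j;
    modality_weights: list of lambda_j values.
    """
    N = dictionaries[0].shape[1]
    # Shared coefficients from the closed form (Eq. (8) as reconstructed above).
    A = sum(w * Phi.T @ Phi for w, Phi in zip(modality_weights, dictionaries))
    b = sum(w * Phi.T @ y for w, Phi, y in zip(modality_weights, dictionaries, test_feats))
    W = np.linalg.solve(A + lam * np.eye(N), b)
    # Class-wise reconstruction error summed over modalities (Eq. (9)).
    errors = {}
    for q in np.unique(labels):
        mask = (labels == q)
        errors[q] = sum(w * np.linalg.norm(y - Phi[:, mask] @ W[mask]) ** 2
                        for w, Phi, y in zip(modality_weights, dictionaries, test_feats))
    return min(errors, key=errors.get)
```

In practice the matrix inverse of (A + λI) would be cached and reused across test samples, which is exactly the pre-computation argument made above.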
6. Experimental Evaluation and Discussion
In order to evaluate the proposed descriptors adequately, we assess them on different video channels and with different classification models. Although the MSR-Action3D dataset [10] is a public action dataset, it only includes depth sequences captured by a depth camera and ignores the RGB channel, whereas the challenging public DHA action dataset [9] includes both depth and RGB information. Thus, in our experiments, the DHA dataset is employed, and the popular classification models KNN and SVM are borrowed to recognize human actions; in addition, the model-free SRC and CRC models are also adopted to identify human motions. In all experiments, the SVM models are learned using cross-validation, and the parameters in the SRC, CRC and MMFCML models are selected by cross-validation within the range [1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001].
- 6.1 Experimental Setting Up
DHA Dataset: This dataset [9] contains 17 action categories: (1) bend, (2) jack, (3) jump, (4) one-hand-wave, (5) pjump, (6) run, (7) side, (8) skip, (9) two-hand-wave, (10) walk, (11) clap-front, (12) arm-swing, (13) kick-leg, (14) pitch, (15) swing, (16) boxing and (17) tai-chi. Each action was performed by 21 people (12 males and 9 females), so there are 357 videos in total in the DHA dataset, each with both the color and depth data recorded. Although the background in the DHA dataset is relatively clean, there are some similar actions which are very difficult to recognize.
Features: In our experiments, we extract not only the depth features DMHI_BAHB, DMHI_GIST and DMHI_PHOG, but also the RGB features RDMHI_BAHB, RDMHI_GIST and RDMHI_PHOG. In order to compare with other descriptors, the translation-, scale- and orientation-invariant 7 Hu moments, Gabor features and LBP [46] features are also extracted for both the depth and RGB channels, and these features are named DMHI_7_Hu_Moment, DMHI_Gabor, DMHI_LBP, RDMHI_7_Hu_Moment, RDMHI_Gabor and RDMHI_LBP respectively. In addition, we also directly fuse these descriptors to evaluate them further.
Classifiers: To assess the performance of our proposed scheme, KNN, SVM, SRC and CRC models are constructed. The SVM with RBF kernel [47] is trained on the training dataset, and then its performance is assessed on the testing dataset. For the SRC and CRC models, all training samples are adopted directly as the basis vectors, and a testing sample is reconstructed and represented by these basis vectors; after that, an impulse function is designed for each class and representation, and the minimum representation error is adopted to classify the testing sample.
Evaluation Criteria: In our previous work [48], we discussed the evaluation protocol for action recognition and argued that the leave-one-person-out method is the most reasonable. Thus, we utilize the leave-one-person-out protocol, and the popular average accuracy is employed.
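For clarity, the leave-one-person-out protocol can be sketched as follows (an illustrative loop; the feature matrix and the classifier callback are placeholders for the descriptors and models described above).

```python
import numpy as np

def leave_one_person_out(features, labels, subjects, train_and_predict):
    """Average accuracy under the leave-one-person-out protocol.

    features: (n_samples, dim) array; labels, subjects: length-n arrays;
    train_and_predict(train_X, train_y, test_X) -> predicted labels.
    """
    accuracies = []
    for person in np.unique(subjects):
        test_mask = (subjects == person)          # hold out one performer
        pred = train_and_predict(features[~test_mask], labels[~test_mask],
                                 features[test_mask])
        accuracies.append(np.mean(pred == labels[test_mask]))
    return float(np.mean(accuracies))
```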
- 6.2 Performance Evaluation on Depth Channel
First, we evaluate our descriptors on the depth channel of the DHA dataset with different classification models. At the same time, we also compare their performance with the translation-, scale- and orientation-invariant 7 Hu moments, Gabor and LBP descriptors. For a fair comparison, all experimental settings are kept the same, and the performances are shown in Table 1.
Table 1. Performance comparison on the depth channel of the DHA dataset when different descriptors and models are adopted
The experimental results demonstrate that no matter what kind of model is used, the performance of the DMHI_BAHB, DMHI_GIST and DMHI_PHOG descriptors is much better than that of the DMHI_7_Hu_Moment, DMHI_Gabor and DMHI_LBP descriptors. Meanwhile, we can also observe that although the DMHI_BAHB descriptor carries spatial information, orientation information is ignored; when the DMHI_GIST descriptor considers multi-scale, multi-orientation and spatial information together, the performance of the different models improves. In addition, the DMHI_BAHB and DMHI_GIST descriptors use the average value of each block to represent human actions, ignoring the neighborhood information of the center pixel, so their performance is not so good. However, the DMHI_PHOG descriptor uses the pixel gradient orientations weighted by the pixel gradient magnitudes in each block to describe human motions, whose representation ability is much more robust and efficient than that of the block average value. No matter what kind of model is employed, the DMHI_PHOG descriptor achieves the best performance.
In addition, we also evaluate and compare the DMHI_BAHB and DMHI_GAHB descriptors in our experiments. From the comparison results, we can see that when we split the image into blocks without the rectangular bounding box, the accuracy only reaches about 50%, but when the rectangular bounding box is employed, the accuracy improves to 80%. In other words, the rectangular bounding box is very helpful for our task; thus, in the later experiments, we always adopt the rectangular bounding box scheme for feature extraction.
- 6.3 The Assistance of Depth Information for RGB Channel
When constructing RGB motion history maps, the result is often affected by background pixels, as the second column of Fig. 3 shows. If we can acquire the foreground target, its motion history map will be much clearer and more discriminative than a motion history map affected by background pixels. However, in real conditions, it is difficult to segment the foreground target. Luckily, the distance between the target pixels and the background pixels differs greatly; what is more, the depth information provides exactly this distance information, which can be used to segment and locate the target, and the third column in Fig. 4 displays the corresponding results. In order to prove the assistance of depth information for the RGB channel, we choose the top two descriptors on the depth channel and perform comparative experiments on the DHA dataset by training different models; the results are provided in Table 2. Table 2 shows that when we construct the motion history image on the RGB channel by the traditional method, whichever models are employed, the performances of RMHI_GIST and RMHI_PHOG are only about 60%. However, when the depth information is borrowed to segment the target, the performances of RDMHI_GIST and RDMHI_PHOG improve greatly. Especially for the RDMHI_PHOG descriptor, when the SVM-RBF, SRC and CRC models are adopted, the performance reaches about 90%. That is to say, the depth information is very helpful for the RGB channel, and in our later experiments the depth information is used to detect and locate the target.
Table 2. Assistance of depth information for the RGB channel
- 6.4 Performance Evaluation on RGB Channel
In order to assess our proposed descriptors further, we also evaluate them on the RGB channel with different classification models, and at the same time compare them with state-of-the-art schemes; the performances are provided in Table 3. Table 3 shows that the RDMHI_BAHB, RDMHI_GIST and RDMHI_PHOG descriptors are still much better than the RDMHI_7_Hu_Moment, RDMHI_Gabor and RDMHI_LBP descriptors regardless of which models are used. In addition, the performance of the RDMHI_GIST descriptor is much better than that of the RDMHI_BAHB descriptor, and the performance of the RDMHI_PHOG descriptor is in turn much better than that of the RDMHI_BAHB and RDMHI_GIST descriptors. That is to say, the experimental results further prove that the pixel gradient orientations weighted by the pixel gradient magnitudes in each block are very helpful for describing human motions, and their representation ability is much more robust and efficient than that of the block average value. What is more, many descriptors were proposed in [9], but their best performance is 87%; for our proposed RDMHI_PHOG descriptor, when SVM-RBF, SRC and CRC are employed, all performances reach above 91%.
Table 3. Performance comparison of different descriptors on the RGB channel
From the above evaluation and analysis, we can conclude that our proposed and adopted descriptors achieve much better performance than some state-of-the-art schemes on both the depth and RGB channels, even when different models are employed. What is more, the performance on the RGB channel improves greatly with the assistance of depth information, which is an important complement to the RGB channel.
- 6.5 Performance Evaluation of Direct Fusion Different Modality Features
The experimental results have proved that our proposed descriptors achieve good performance on both the depth and RGB channels no matter what kind of model is adopted; however, since different descriptors and different channels are complementary to some extent, fusing them directly may be helpful for action recognition. Thus, in this section, we fuse our descriptors and then compare them with the individual descriptors to further prove their superiority. In our experiments, the top three descriptors are employed, and pairs of them are directly concatenated to form fused descriptors; for example, when the RDMHI_GIST and DMHI_BAHB descriptors are fused, the combined descriptor is called DMHI_BAHB_RDMHI_GIST, and when the RDMHI_PHOG and DMHI_PHOG descriptors are combined, the new fused descriptor is labeled DMHI_RDMHI_PHOG. At the same time, KNN, SVM with RBF and linear kernels, SRC and CRC models are adopted, and the results are shown in Fig. 5, Fig. 6, Fig. 7 and Fig. 8 respectively. From them we can see that when the descriptors in the depth and RGB channels are combined, the performance is much better than that of a single descriptor; for example, when the KNN model is used, the performances of the DMHI_PHOG and RDMHI_GIST descriptors are 79.3% and 75.6% respectively, but the performance of the DMHI_PHOG_RDMHI_GIST descriptor reaches 83.4%, an improvement of 7.8% over RDMHI_GIST. Similarly, when the SVM model is trained, the fused descriptors also obtain a certain improvement; for example, the performances of the DMHI_PHOG, RDMHI_PHOG and DMHI_RDMHI_PHOG descriptors are 92.4%, 92.2% and 96.1% respectively, an improvement of about 4%. What is more, when the SVM model with RBF kernel is trained, the performance remains stable and efficient, and Fig. 6 shows the results. Compared with Lin et al. [9] (whose accuracy is 87%), the improvement reaches 9.1%.
Fig. 5. Performance comparison between fused descriptors and single descriptors with the KNN classifier
Fig. 6. Performance comparison between fused descriptors and single descriptors with the SVM classifier
Fig. 7. Performance comparison between fused descriptors and single descriptors with the SRC classifier
Fig. 8. Performance comparison between fused descriptors and single descriptors with the CRC classifier
Although the KNN model does not need training, its best performance only reaches 86.3%; as for SVM, although its best accuracy achieves 96.1%, its training is time-consuming and the model depends on the training dataset. Thus, we also employ the SRC and CRC classification models, which do not depend on complicated model selection and learning, and whose generalization ability can be easily extended by simply adding bases (newly labeled action videos), to evaluate our descriptors further; Fig. 7 and Fig. 8 display their performances. From them, we can observe that although the SRC and CRC classification models do not need complex training, their best accuracies reach 95% and 95.2% respectively, which is comparable to that of the SVM model.
At the same time, we can also see that the combined descriptors are much better than single descriptors no matter whether the SRC or CRC model is adopted. For example, the accuracies of the DMHI_BAHB and RDMHI_GIST descriptors are 81% and 89.6% respectively with the SRC model, but the accuracy of the fused descriptor DMHI_BAHB_RDMHI_GIST is 92.6%. In addition, among all the combined descriptors, the performance of the DMHI_RDMHI_PHOG descriptor is the best.
In conclusion, no matter what kind of model is adopted, our combined descriptors are much better than single descriptors, and their performance obtains a clear improvement compared with single descriptors. In addition, compared with the state of the art, the accuracy of our descriptors increases from 87% to 96.2%. In a word, our descriptors are robust, stable and efficient.
- 6.6 Performance Evaluation with the Change of the Number of Layers of the Pyramid Histogram of Oriented Gradients
From the above analysis, we can conclude that because neighboring gradients and the pyramid scheme are applied in DMHI_PHOG and RDMHI_PHOG, their accuracies are the best among all the descriptors. In order to assess this further, we evaluate how the performance varies with the number of pyramid layers in the DMHI_PHOG and RDMHI_PHOG descriptors under different kinds of models. Fig. 9 and Fig. 10 show the results.
Fig. 9. Performance evaluation on the depth channel when PHOG descriptors with different numbers of layers and different models are employed
Fig. 10. Performance evaluation on the RGB channel when PHOG descriptors with different numbers of layers and different models are employed
Fig. 9 demonstrates that if we only adopt the HOG scheme without the pyramid, the accuracies range from 42.9% to 54.1%, but if the number of pyramid layers is increased to three, the accuracies range from 79.3% to 92.4%. That is to say, as the number of pyramid layers increases, the accuracies step up, but when the number of pyramid layers reaches four, the performance drops gradually no matter which models are used. Similarly, in the evaluation on the RGB channel, the performance steps up with the increase of the number of pyramid layers, and the accuracies are the best when a three-layer pyramid is employed, but the performance gradually declines when the number of pyramid layers is increased further.
In a word, we should adopt multiple layers in the DMHI_PHOG and RDMHI_PHOG descriptors, but the number of layers cannot be too large, or the accuracy will degrade. Generally speaking, three layers are enough, whose representation ability is robust, efficient and stable.
- 6.7 Performance Evaluation of Collaborative Multi-task Learning
Since RGB and depth images represent one scene in different modalities, they are complementary to each other, and fusing both for discriminative feature representation and model construction benefits human action recognition. In Section 6.5, we proved that concatenating different modality features is helpful: the performance is much better than that of a single descriptor. However, the direct feature fusion of RGB and depth information is limited in how much it can improve performance; thus, we propose a collaborative multi-task learning method for model learning and inference based on transfer learning theory. We utilize the pairs DMHI_PHOG and RDMHI_PHOG, DMHI_PHOG and RDMHI_GIST, DMHI_GIST and RDMHI_GIST, and DMHI_BAHB and RDMHI_GIST respectively in the proposed collaborative multi-task learning framework to discover the latent correlation. To demonstrate the superiority of integrating the RGB and depth information, we also concatenate them and train different classifiers respectively; the performances are given in Table 4. From it, we can see that when the MMFCML model is employed, we achieve accuracies of 97.3%, 96.1%, 95.2% and 94.9% respectively, and all of them are better than those of the direct fusion scheme. Compared with Lin et al. [9], the improvements are about 10.3%, 9.1%, 8.2% and 7.9% respectively. Our proposed algorithm is also comparable to Gao et al. [50]. The class-wise accuracy is given in Table 5. From it, we can see that the accuracy of most actions is above 90% and even reaches 100%; that is to say, we can recognize almost all of the actions.
Table 4. Performance evaluation and comparison of the MMFCML model and others
Table 5. Class-wise accuracy when MMFCML, DMHI_PHOG and RDMHI_PHOG are employed
Note: 1.Bend; 2.Jack; 3. Jump; 4.One-hand-wave; 5.Pjump; 6. Run; 7.Side; 8.Skip; 9. Two-hand-wave; 10.Walk; 11.Clap-front; 12.Arm-swing; 13.Kick-leg; 14.Pitch; 15. Swing; 16.Boxing; 17.Tai-chi
7. Conclusions
In this work, human action recognition using motion maps based on pyramid histograms of oriented gradients and collaborative multi-task learning is proposed. We first construct the motion history image for both the RGB and depth channels, while depth information is borrowed to filter the RGB information. Then, different action descriptors are proposed and extracted from the DMHI and RDMHI to represent the actions. After that, descriptors from different modalities are combined by a direct fusion scheme and by collaborative multi-task learning to further represent human actions. Large-scale comparison experiments with different kinds of classification models on the DHA dataset demonstrate that the representation ability of neighboring gradients is much more robust and efficient than that of the block average value, and that the pyramid is also very helpful for our task. No matter what kind of model is employed, the performances of the RDMHI_PHOG and DMHI_PHOG descriptors are the best among all the descriptors, whose best accuracy reaches 92.7% and is much better than most state-of-the-art schemes. When these descriptors are combined to represent motions, the performance of the combined descriptors is much better than that of a single descriptor, no matter whether the KNN, SVM, SRC, CRC or MMFCML model is employed. What is more, our proposed collaborative multi-task learning scheme obtains the best performance, which is much more efficient than the direct fusion scheme. In total, our proposed approach is robust, efficient and stable.
BIO
Z. Gao is an associate professor in the School of Computer and Communication Engineering, Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology. From Sep. 2009 to Sep. 2010, he was a visiting scholar in the School of Computer Science, Carnegie Mellon University, USA. He received his Ph.D. degree from Beijing University of Posts and Telecommunications in 2011. His research interests include computer vision, multimedia analysis and retrieval.
H. Zhang is a professor in the School of Computer and Communication Engineering, Tianjin University of Technology, Tianjin, China. She received her doctoral degree from Tianjin University in 2008. Her research interests include multimedia analysis and virtual reality.
A. A. Liu is an associate professor in the School of Electronic Information Engineering, Tianjin University, P.R. China. From Sep. 2008 to Nov. 2009, he was a visiting scholar in the Robotics Institute, Carnegie Mellon University, USA. His research interests include learning-based computer vision, multimedia analysis and retrieval, and biomedical image processing. He is an IEEE member.
Y.B. Xue is an associate researcher in the School of Computer and Communication Engineering, Tianjin University of Technology, Tianjin, China. He received his master's degree from Tianjin University of Technology in 2005. His research interests include multimedia analysis and computer vision.
G.P. Xu is an associate professor in the School of Computer and Communication Engineering, Tianjin University of Technology, Tianjin, China. He received his Ph.D. and M.S. degrees from Nankai University in 2009 and 2005, respectively. His research interests include optimal design and performance evaluation of multimedia systems and distributed storage networks.
References
Bobick A. , Davis J. 2001 “The representation and recognition of action using temporal templates” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) 257 - 267    DOI : 10.1109/34.910878
Gorelick L. , Blank M. , Shechtman E. , Irani M. , Basri R. 2007 “Actions as space-time shapes” IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12) 2247 - 2253    DOI : 10.1109/TPAMI.2007.70711
Dollar P. , Rabaud V. , Cottrell G. , Belongie S. 2005 “Behavior recognition via sparse spatio-temporal features” in Proc. of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance 65 - 72
Schuldt C. , Laptev L. , Caputo B. 2004 “Recognizing human actions: a local SVM approach” ICPR in Proc. of the International Conference on Pattern Recognition 32 - 36
Laptev I. , Lindeberg T. 2003 “Space-time interest points” ICCV in Proc. of the International Conference Computer Vision 432 - 439
Chen M.-Y. , Hauptmann A.-G. 2009 “MoSIFT: Recognizing Human Actions in Surveillance Videos” CMU-CS-09-161 Carnegie Mellon University
Hu M. 1962 “Visual pattern recognition by moment invariants” IRE Transactions on Information Theory 8 (2) 179 - 187    DOI : 10.1109/TIT.1962.1057692
Mehrotra R. 1992 “Gabor filter-based edge detection” Pattern Recognition 25 (12) 1479 - 1494    DOI : 10.1016/0031-3203(92)90121-X
Lin Y.-C. , Hua M.-C. , Cheng W-.H. , Hsieh Y.-H. , Chen H.-M. 2012 “Human Action Recognition and Retrieval Using Sole Depth Information” in Proc. of the 20th ACM international conference on Multimedia 1053 - 1056
Li W. , Zhang Z. , Liu Z.-C. 2010 “Action recognition based on a bag of 3D points” in Proc. of International Conference on Human Communicative Behavior Analysis Workshop, CVPR 2 - 6
Davis J. W. , Tyagi A. 2006 “Minimal-latency human action recognition using reliable-inference” Image and Vision Computing 24 (5) 455 - 472
Efros A. A. , Berg A. C. , Mori G. , Malik J. 2003 “Recognizing action at a distance” in Proc. of IEEE International Conference on Computer Vision 1 - 2
Barron J. L. , Fleet D. J. , Beauchemin S. S. 1994 “Performance of optical flow techniques” International Journal of Computer Vision 12 (1) 43 - 77    DOI : 10.1007/BF01420984
Black M. J. , Yacoob Y. , Jepson A. D. , Fleet D. J. 1997 “Learning parameterized models of image motion” in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition 561 - 567
Klaser A. , Marszalek M. , Schmid C. 2008 “A spatio-temporal descriptor based on 3d gradients” in Proc. of The British Machine Vision Conference
Wang J. , Liu Z.-C. , Wu Y. , Yuan J.-S 2012 “Mining actionlet ensemble for action recognition with depth cameras” in Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1290 - 1297
Megavannan V. , Agarwal B , Venkatesh Babu R. 2012 “Human Action Recognition using Depth Maps” in Proc. of International Conference on Signal Processing and Communications, SPCOM 1 - 5
Wang Meng , Li Hao , Tao Dacheng , Lu Ke , Wu Xindong 2012 “Multimodal Graph-Based Reranking for Web Image Search” IEEE Transactions on Image Processing 21 (11) 4649 - 4661    DOI : 10.1109/TIP.2012.2207397
Wang Meng , Hua Xian-Sheng 2011 “Active Learning in Multimedia Annotation and Retrieval: A Survey” ACM Transactions on Intelligent Systems and Technology 2 (2) 10 - 31    DOI : 10.1145/1899412.1899414
Gao Yue , Wang Meng , Zha Zhengjun , Shen Jialie , Li Xuelong , Wu Xindong 2013 “Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search” IEEE Transactions on Image Processing 22 (1) 363 - 376    DOI : 10.1109/TIP.2012.2202676
Wang Meng , Hua Xian-Sheng , Tang Jinhui , Hong Richang 2009 “Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation” IEEE Transactions on Multimedia 11 (3) 465 - 476    DOI : 10.1109/TMM.2009.2012919
Wang Meng , Hua Xian-Sheng , Hong Richang , Tang Jinhui , Qi Guo-Jun , Song Yan 2009 “Unified Video Annotation Via Multi-Graph Learning” IEEE Transactions on Circuits and Systems for Video Technology 19 (5) 733 - 746    DOI : 10.1109/TCSVT.2009.2017400
Wang Meng , Ni Bingbing , Hua Xian-Sheng , Chua Tat-Seng 2012 “Assistive Tagging: A Survey of Multimedia Tagging with Human-Computer Joint Exploration” ACM Computing Surveys 44 (4) Article 25
Wang Meng , Hong Richang , Li Guangda , Zha Zheng-Jun , Yan Shuicheng , Chua Tat-Seng 2012 “Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification” IEEE Transactions on Multimedia 14 (4) 975 - 985    DOI : 10.1109/TMM.2012.2185041
Gao Yue , Wang Meng , Ji Rongrong , Wu Xindong , Dai Qionghai 2014 “3D Object Retrieval with Hausdorff Distance Learning” IEEE Transactions on Industrial Electronics 61 (4) 2088 - 2098
Gao Yue , Wang Meng , Tao Dacheng , Ji Rongrong , Dai Qionghai 2012 “3D Object Retrieval and Recognition with Hypergraph Analysis” IEEE Transactions on Image Processing 21 (9) 4290 - 4303    DOI : 10.1109/TIP.2012.2199502
Gao Yue , Tang Jinhui , Hong Richang , Yan Shuicheng , Dai Qionghai , Zhang Naiyao , Chua Tat-Seng 2012 “Camera Constraint-Free View-Based 3D Object Retrieval” IEEE Transactions on Image Processing 21 (4) 2269 - 2281    DOI : 10.1109/TIP.2011.2170081
Gao Yue , Wang Meng , Zha Zhengjun , Tian Qi , Dai Qionghai , Zhang Naiyao 2011 “Less is More: Efficient 3D Object Retrieval with Query View Selection” IEEE Transactions on Multimedia 11 (5) 1007 - 1018    DOI : 10.1109/TMM.2011.2160619
Gao Yue , Ji Rongrong , Zhang Longfei , Hauptmann Alexander 2014 “Symbiotic Tracker Ensemble Towards A Unified Tracking Framework” IEEE Transactions on Circuits and Systems for Video Technology
Yu Jun , Wang Meng , Tao Dacheng 2012 “Semi-supervised Multi-view Distance Metric Learning for Cartoon Synthesis” IEEE Transactions on Image Processing 21 (11)
Yu Jun , Tao Dacheng , Rui Yong , Cheng Jun 2013 “Pairwise constraints based multi-view features fusion for scene classification” Pattern Recognition 46 483 - 496    DOI : 10.1016/j.patcog.2012.08.006
Yu Jun , Rui Yong , Chen Bo 2014 “Exploiting Click Constraints and Multi-view Features for Image Reranking” IEEE Transactions on Multimedia 16 (1)    DOI : 10.1109/TMM.2013.2284755
Yu Jun , Liu Dongquan , Dacheng Tao , Seah Hock Soon 2012 On Combining Multi-view Features for Cartoon Character Retrieval and Clip Synthesis IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 42 (5)
Wang Hua , Nie Feiping , Huang Heng 2013 “Multi-View Clustering and Feature Learning via Structured Sparsity” ICML
Liu A. , Han D. 2010 “Spatiotemporal Sparsity Induced Similarity Measure for Human Action Recognition” International Journal of Digital Content Technology and its Applications 4 (5) 23 - 37    DOI : 10.4156/jdcta.vol4.issue5.3
Gao Zan , Liu An-An , Zhang Hua , Xu Guang-ping , Xue Yan-bing 2012 “Human action recognition based on sparse representation induced by L1/L2 regulations” ICPR 1868 - 1871
Guo K. , Ishwar P. , Konrad J. 2010 “Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow” in Proc. of 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance 188 - 195
Liu C.-H. , Yang Y. , Chen Y. 2009 “Human action recognition using sparse representation” in Proc. of Processing of IEEE International Conference on Intelligent Computing and Intelligent Systems 184 - 188
Gao Z. , Zhang H. , Xu G.P. , Xue Y.B. 2012 “Human Behavior Recognition Using Structured and Discriminative Sparse Representation” International Journal of Digital Content Technology and its Applications 6 (23) 416 - 422    DOI : 10.4156/jdcta.vol6.issue23.47
Wright J. , Yang A. , Ganesh A. , Sastry S. , Ma Y. 2009 “Robust face recognition via sparse representation” IEEE Trans. on Pattern Analysis and Machine Intelligence 31 (2) 210 - 227    DOI : 10.1109/TPAMI.2008.79
Zhang L. , Yang M. , Feng X. 2011 “Sparse Representation or Collaborative Representation: Which Helps Face Recognition?” in Proc. of International Conference on Computer Vision, ICCV
Oliva A. , Torralba A. 2001 “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope” International Journal of Computer Vision 42 (3) 145 - 175    DOI : 10.1023/A:1011139631724
Dalal N. , Triggs B. 2005 “Histograms of oriented gradients for human detection” CVPR in Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 886 - 893
Bosch A. , Zisserman A. , Munoz X. 2007 “Representing Shape with a Spatial Pyramid Kernel” in Proc. of the 6th ACM International Conference on Image and Video Retrieval 401 - 408
Ni B.-B , Wang G. , Moulin P. 2012 “RGBD-HuDaAct: A Color-Depth Video Database for Human Daily Activity Recognition” in Proc. of International Conference on Computer Vision workshop, ICCV 1147 - 1153
Marcel S. , Rodrigue Y. , Heusch G. 2007 “On the Recent Use of Binary Patterns for Face Authentication” International Journal on Image and Video Processing Special Issue on Facial image Processing 1 - 8
Chang C.-C. , Lin C.J. 2001 LIBSVM: a library for support vector machines.
Nesterov Y. 2004 “Introductory lectures on convex optimization: A basic course” Springer
Gao Z. , Chen M.-Y. , Hauptmann A.-G. , Cai A.-N. 2010 “Comparing Evaluation Protocols on the KTH Dataset” HBU in Proc. of the First international conference on Human behavior understanding 88 - 100
Gao Zan , Song Jian-ming , Zhang Hua , Liu An-An , Xue Yan-bing , Xu Guang-ping 2014 “Action Recognition Via Multi-modality Information” Journal of electrical engineering & Technology 9 (2) 742 - 751