Advanced
Multiscale Spatial Position Coding under Locality Constraint for Action Recognition
Multiscale Spatial Position Coding under Locality Constraint for Action Recognition
Journal of Electrical Engineering and Technology. 2015. Jul, 10(4): 1851-1863
Copyright © 2015, The Korean Institute of Electrical Engineers
This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : June 26, 2014
  • Accepted : March 17, 2015
  • Published : July 01, 2015
Download
PDF
e-PUB
PubReader
PPT
Export by style
Article
Author
Metrics
Cited by
TagCloud
About the Authors
Jiang-feng Yang
Corresponding Author: School of Communication and Engineering Information, University of Electronic Science and Technology of China (UESTC), P.R. China. (369322023@qq.com)
Zheng Ma
School of Communication and Engineering Information, UESTC, P.R. China. (wallsonyang@163.com)
Mei Xie
School of Electronic Engineering, UESTC, P.R. China. (369322023@qq.com)

Abstract
– In the paper, to handle the problem of traditional bag-of-features model ignoring the spatial relationship of local features in human action recognition, we proposed a Multiscale Spatial Position Coding under Locality Constraint method. Specifically, to describe this spatial relationship, we proposed a mixed feature combining motion feature and multi-spatial-scale configuration. To utilize temporal information between features, sub spatial-temporal-volumes are built. Next, the pooled features of sub-STVs are obtained via max-pooling method. In classification stage, the Locality-Constrained Group Sparse Representation is adopted to utilize the intrinsic group information of the sub-STV features. The experimental results on the KTH, Weizmann, and UCF sports datasets show that our action recognition system outperforms the classical local ST feature-based recognition systems published recently.
Keywords
1. Introduction
Human action recognition has attracted significant interest in the computer vision community in the recent decade and has spurred the development of a wide variety of applications, including video surveillance human-computer interaction and the analysis of sporting events. However, automatic human action recognition is highly challenging due to the non-stationary background of most video content, the ambiguity of the human body shape among different actions, and the existence of intra-class variations in the appearance, physical characteristics, and motion style of different human subjects.
Generally speaking, action recognition is composed of two components: action representation and action classifycation. Action representation is the process of modeling human behaves, encoding the extracted features from action video. The quality of the learnt model decides discriminative power usually results in good classification result. There are two representation methods: holistic action representations [1 - 7] and local action representations [8 - 18] .
Commonly, holistic action representation is derived from silhouette or body sketch. It requires fine background subtraction or body part tracking. Thus, it is sensitive to noise, variation in viewpoint, and partial occlusion. On the country, local action representation bases on the local spatio-temporal features, and it usually is utilized together with bag-of-features (BoF) to model human action. Local action representation is robust to viewpoint changes, environment noise and partial occlusion [19] .
In many literatures, BoF-based human action recognition consists of extracting local features from videos, obtaining action representation vectors via these local features, and classifying action videos with a classifier upon the vectors. To obtain action representation vector, several feature coding and pooling methods are provided. k -means and vector quantization (VQ) are used to encode features, next, action representation histogram is computed.
Recently, data locality was observed to be a key role in clustering, dimension reduction [20 , 21] , density estimation [22] , anomaly detection [23] , and image classification [24 - 27] . In pattern recognition, the k-NN (Nearest Neighbor) classifier can be considered as a recognition algorithm using data locality, since it considers the locality information of training data for performing classification. To be more precise, k-NN assigns the class label for a test input according to the majority of the nearest training data of the same class. Motivated by the importance of data locality, we proposed a Multiscale Spatial Position Coding under Locality Constraint (MSPC-LC), which considers the spatial position relationship of local features, to encode local features. Compared to the standard sparse representation based classification (SRC) proposed by Wright et al . [28] or recently proposed locality-constrained linear coding (LLC), our algorithm improves classification performance.
To reduce the quantization error caused by k -means and VQ, rather than assigning one codeword for a feature only, soft vector quantization (SVQ) and sparse coding (SC) [29] are adopted to encode features in action recognition tasks. However, the local features usually reside on nonlinear manifolds [30 , 31] . Neither SVQ nor SC can preserve the nonlinear manifold structure. The manifold is nonlinear and not Euclidean in its whole space, but linear and Euclidean in a local region [32] . Because SVQ uses all bases to encode each feature and generates dense codes, it can not precisely represent the nonlinear manifold structure with a global way. Due to the overcomplete dictionary, SC tends to choose the code words which are distant to the input features [31] . Thus, it cannot correctly represent manifold structure. Hereafter, we consider these limitations on both quantization error and loss manifold structure in feature coding as representation error. For this issue, Yu et al . [30] provided a Local Coordinate Coding (LCC) to encode feature with locality-constrained; Wang et al . [31] introduced an improved version of LCC named Locality-constrained Linear Coding (LLC) to reduce computational cost, and Wei et al . [33] proposed a local sensitive dictionary learning method for image classification.
In classification stage, support vector machine (SVM) has been widely used as action classifier. Recently, inspired by the impressive success of Sparse Representation based Classification (SRC) in face recognition [34] , some authors [18] explored SRC for human action recognition and achieved better performance than SVM. Nevertheless, these representation methods suffer one major limitation: the spatio-temporal (ST) relationship among local features is ignored, such as temporal order and spatial arrangement [35] . For instance, in Fig. 1 , same local features from two different actions are projected on the XY plane, but their spatial configurations are different. Due to the same histograms generated by BoF model, they are incorrectly considered as one action. Recently, some researchers exploited ST context information [35] , local feature distribution [36] , and spatial pyramid matching (SPM) [23 , 27] to handle this problem.
PPT Slide
Lager Image
The illustration of traditional BoF model ignoring the spatial relationship between local features. Same features from two different actions are projected on XY plane, and have different spatial configuration. However, they can not be correctly classified with BoF model, because it contains two same codeword histograms due to ignoring their spatial relationship.
In this paper, we developed MSPC-LC algorithm to address this limitation and reduce the representation error. Specifically, to reduce the representation error (quantization error and manifold structure loss), we adopted locality constraint on codebook learning and feature coding with respect to manifold learning [32] . To model the ST relationship of features, spatial position information between them is taken into account in learning codebook and coding features. In such way, the spatial relationship is embedded into the coded features. In addition, to handle the different action styles, multi-temporal-scale pooled features are built. In experiment, local features are firstly projected on the XY plane for obtaining their spatial configuration. Then codebook learning and feature coding are implemented under locality position constraint. To classify one action video (see Fig. 2 ), which can be treated as a spatiotemporal volumn (STV), this STV is then represented as a group of multi-temporal-scale sub-STVs obtained by dense sampling. Next, multi-temporal-scale pooled features are built with max-pooling method. Finally, Locality-constraint Group Sparse Representation (LGSR) [26] is adopted for action classification upon these sub-STV pooled features.
PPT Slide
Lager Image
The illustration of our proposed approach. The shape of codewords indicates their appearance information, while the color indicates their spatial position. For example, the codewords , have same appearance but different spatial position. They are considered as two different codewords during feature coding. Then, two different codeword histograms are obtained for two actions upon these codewords. Hence, Actions 1 and 2 which can not be distinguished in BoF (Fig. 1) can be correctly classified with our method.
Compared to these methods which use ST context information [35] or feature distribution [36] to handle the limitation of BoF, MSPC-LC is a more fine and complete method, because it records most fundamental information ( where , how ) of local features for human action recognition. The experimental results on the KTH, Weizmann, and UCF sports datasets show that our method achieves better performance than these methods [35 - 38] and other local ST feature-based methods.
The paper has two contributions as follows:
  • To describe spatial relationship between local features and reduce representation error, a mixed feature combining spatial configuration and motion information was proposed, and was encoded by MSPC-LC algorithm.
  • In order to utilize the intrinsic group information from these sub-STV pooled features, LGSR based classifier is employed for action classification.
The rest of this paper is organized as follows: Algorithm MSPC-LC is proposed in Section 2. The action recognition framework in MSPC-LC and LGSR is provided in Section 3. Then, experimental results and analysis are shown in Section 4. Finally, the conclusions are drawn in Section 5.
2. Multiscale Spatial Position Coding under Locality Constraint
- 2.1 Modeling position relationship between local features
Two facts motivate us to propose MSPC-LC for action representation. The first fact is that the relative position relationship between features in BoF model is ignored, and the spatial cue often plays an important role for visual appearance modeling. Moreover, inspired by the work of Liu et al . [40] , spatial position relationship between features was considered during dictionary learning and feature coding in this paper to handle this problem. They involved spatial locations into DCT based feature descriptors to model the spatial relationship of local features for face recognition. Their experimental results showed that feature position can improve local feature-based face recognition accuracy. The second fact motivate us is that local features from the same part of body are often repeatedly and similar, and usually aggregate in a local region. For instance, local features generated by hand motion usually locate in the upper part of human body at high probability; those from leg motion are most inclined to fall into the lower part of human body. Therefore, the quantization error can be alleviated by incorporating feature position and appearance information.
In this paper, local spatio-temporal oriented energy (STOE) features
PPT Slide
Lager Image
and their spatial position { rel _x, rel _y } are connected together to build a mixed feature descriptor f α ,β
PPT Slide
Lager Image
where row vectors v g denote l 2 STOE features along direction g ; and α , β are the spatial scale parameter and temporal scale parameter, respectively; ( rel _x, rel _y ) is the normalized coordinate of feature f α ,β , and obtained by (see Fig. 3 )
PPT Slide
Lager Image
The local feature position distribution map of actions in the KTH dataset.
PPT Slide
Lager Image
PPT Slide
Lager Image
PPT Slide
Lager Image
PPT Slide
Lager Image
where ( abs _x, abs _y ) is the coordinate of f α, β on the XY plane; body_length is body height of action performer. ( body_cen_x, body_cen_y ) is the position of body center; ( head _x, head _y ) is the position of performer’s head in XY plane.
Going through the above process, we can see that all local features are relocated around the origin. The purpose of introducing head position and body length is make sure that the features from different parts of body are correctly relocated with respect to head position, for example, the features from hand motion should be located in the upper of body, and those from leg motion tend to appear in the lower of body. In the paper, we manually locate head position and body length. Fortunately, there are just few videos (few of the videos in UCF Sports dataset) need to do this.
It can be seen from (1) that the distance between two features
PPT Slide
Lager Image
,
PPT Slide
Lager Image
consists of two parts: the difference of two STOE feature descriptors
PPT Slide
Lager Image
and the difference of their spatial positions( α ( rel_xi − rel_xj ), β ( rel _yi − rel _yj )). The former is fixed value, but the latter changes by setting various α, β . With a decrease in α, β , the influence of feature position on clustering result drops. As a result, k -means on features with Euclidean distance as metric gradually becomes traditional BoF clustering that ignores ST position relationship between features. Fig. 4 illustrates the different k -means results by various α, β values.
PPT Slide
Lager Image
The local feature position distribution map of actions in the KTH dataset. Spatial scale α, β influences on the k-means clustering results; (a) Input action data; (b) Local STOE features extracted from the video. (c-f) Local features containing both STOE and spatial configuration are clustered into four types (denoted by four different colors) using k-mean with Euclidean distance as metric. It can be seen that with a decrease in α, β values, the impact from feature position over the clustering results gradually drops. In (c), the large values of α, β make feature position dominates the clustering result, while the information of STOE hardly influences on the clustering result. In (f), the information of STOE significantly controls the clustering result, since the small values of α, β depress the influence of feature position.
To depict the spatial position relationship between features, codebook building and feature coding upon the feature f are implemented. The computation of STOE features and the construction of corresponding STOE descriptor are given in Section 2.4.
To explain the role of feature position during codebook building and feature coding, we realized clustering algorithm k -means for dictionary learning and encoded local features by VQ. The representation error caused by them will be solved in Section 2.2. Assuming that D α,β R N×M ,
PPT Slide
Lager Image
is a codebook built via k -means upon the features
PPT Slide
Lager Image
PPT Slide
Lager Image
, ( j = 1, ⋯, n ) with Euclidean distance as metric. Each codeword
PPT Slide
Lager Image
contains two types information: ST oriented motion information
PPT Slide
Lager Image
and feature spatial position ( α(rel _x), β(rel _y) ). The class label vector
PPT Slide
Lager Image
for feature
PPT Slide
Lager Image
is obtained with VQ:
PPT Slide
Lager Image
Where
PPT Slide
Lager Image
is the input feature and described in (1);
PPT Slide
Lager Image
is the i th codeword in codebook D α,β .
According to (1) and (6), the codeword
PPT Slide
Lager Image
chosen for encoding
PPT Slide
Lager Image
should be the most closest to
PPT Slide
Lager Image
with respect to two aspects: feature similarity and spatial distance. Therefore, the spatial position of
PPT Slide
Lager Image
can be obtained from its code
PPT Slide
Lager Image
Given a group of local features, their spatial relationship can be represented with their codeword histogram:
PPT Slide
Lager Image
where , H α,β RM is the codeword histogram with scale factors α, β ; n is the number of features; and
PPT Slide
Lager Image
is the code of
PPT Slide
Lager Image
.
For example, as illustrated in Fig. 2 , these two actions in Fig. 1 can be distinguished with their new histograms. Benefiting from involving feature position into codewords, two different codeword histograms are provided for Actions 1 and 2. Actions that have similar features but different spatial relationship can be correctly classified by this method. Therefore, fusing spatial position into codebook building and feature coding is a feasible way to model the spatial relationship of features for human action recognition.
- 2.2 Reducing representation error with locality constraint
In Section 2.1, k -means and VQ are applied in dictionary learning and feature coding. However, Yu et al . [28] discovered that VQ can not handle nonlinear manifold structure well, because it is a 0 th-order (constant) approximation of object functions from the view of function approximation, and that VQ causes nontrivial quantization error. They suggested that 1 st-order (linear) approximation can solve these problems and introduced locality constraint into object function:
PPT Slide
Lager Image
where P j = [ p j,1 , ⋯, Pj,M ] is the locality adaptor that gives different freedom for each base. The k th element in p j is obtained by pj,k = ∥ f j d k 2 and pj,k is the distance between local feature f j and k th atom d k ; sparse coefficient c j = [ c j,1 , ⋯, cj,M ] In (8), the first term represents the reconstruction error of an input feature f j with respect to codebook D ; the second term is locality constraint regularization on c j ; and λ is a regularization factor to balance these terms; ⊙ is the element-wise multiplication; 1 T . c j = 1 is the shift-invariant constraint according to [28] .
Eq. (8) tends to choose the atoms near f j for generating coefficient c j . Because distance vector p j is fixed, to minimize
PPT Slide
Lager Image
one needs to make the coefficient cj,k corresponding to large pj,k equals 0. In addition, 1 T . c j = 1 in (8) is sparse regularization term and intends to obtain sparse solution. Sparsity indicates that many elements in c j are zero, while only a few are nonzero. Thus only a few codewords in D are selected to encode feature f j . Obviously, the selected codewords belong to the local neighborhood of f j .
However, an iterative optimization is applied to solve the l 1 optimization problem in (8). To reduce the computational cost in (8), we replace the term || p j c j || 1 with || p j c j || 2 . Hence, (8) can be rewritten as follows:
PPT Slide
Lager Image
In (9), feature f j is fixed. To minimize
PPT Slide
Lager Image
the codewords far from f j are assigned zero in c j , meanwhile, the codewords close to f j are assigned nonzero in c j . Therefore, similar to (8), the codewords belonging to the neighborhood of f j are selected to generate coefficient c j . From the respect of manifold learning [16 , 18] , although the data on a manifold are nonlinear, in a local region, they can be considered as linear [16] . Therefore, using the locality constraint, the problems of VQ can be solved.
To achieve good classification performance, the coding scheme should generate similar codes for similar descriptors. Following this requirement, the locality constraint term
PPT Slide
Lager Image
in (9) presents several attractive properties:
  • Better reconstruction. In VQ, each descriptor is represented by a single basis in the codebook, as illustrated inFig. 5. Due to the large quantization errors, the VQ code for similar descriptors might be very different. Besides, the VQ process ignores the relationships between different bases. Hence non-linear kernel projection is required to make up such information loss. On the other side, as shown inFig. 5, in LLC, each descriptor is more accurately represented by multiple bases, and LLC code captures the correlations between similar descriptors by sharing bases
  • Similar to LLC, SC also achieves less reconstruction error by using multiple bases. Nevertheless, the regularization term ofl1norm in SC is not smooth. As shown inFig. 5, due to the over-completeness of the codebook, the SC process might select quite different bases for similar patches to favor sparsity, thus losing correlations between codes. On the other side, the explicit locality adaptor in LLC ensures that similar patches will have similar codes.
  • Analytical solution. Solving SC usually requires computationally demanding optimization procedures. For instance, the Feature Sign algorithm utilized by Yang et al.[43]has a computation complexity ofO(M×K) in the optimal case[44], whereKdenotes the number of non-zero elements;Mdenotes the number of bases in dictionary. Unlike SC, the object function in (5) can be solved with an analytical solution according to[33]:
PPT Slide
Lager Image
Comparison between VQ, SC (sparse coding) and LLC (locality-constrained linear coding). The selected bases for representation are highlighted in black.
PPT Slide
Lager Image
Similarly, the problems of k -means codebook building can also be solved with locality constraint. According to [38] , the object function of our codebook building method is formulated as follows:
PPT Slide
Lager Image
where F = { f 1 , ⋯, f n }, n is the number of input local features, c j RM is the j th column of C , and p j = RM is the locality adaptor whose k th element is given by pj,k = || f j d k || 2 . Eq. (11) can be solved by the Locality-Sensitive Dictionary Learning (LSDL) in [33] .
- 2.3 Constructing multiscale spatial relationship of features
Due to the different styles of human action, it is difficult to model the ST relationship of local features in a single ST scale. The actions with different styles appear in different motion range (different in spatial scale) and speed (different in temporal scale). To handle this problem produced by different action styles, multiscale is taken into account during feature coding for the purpose of capturing ST relationship between local features.
In our system, spatial and temporal information between features are taken into account, respectively. Specifically, the spatial information is fused into feature coding by MSPC-LC, and the temporal information is considered by multi-temporal-scale sub-STV.
In reality, instead of building spatial pyramid structure, position weight factors α, β are used to control the spatial scales. According to (1), a large (small) α or β intends to select the codewords within a small (large) spatial neighborhood. Thus we can use multiple α or β to obtain the multiscale feature descriptor f ms
PPT Slide
Lager Image
where i ∈ [l, ns ], j ∈ [l, nt ], and ns , nt are the number of spatial scales. For example, we set α = β = [3,2,1]. Then the final coefficient c ms is obtained by concatenating all coefficients in multiscale
PPT Slide
Lager Image
where D α(i),β(j) is the codebook learnt by LSDL over the feature descriptors at i th, j th temporal scales. Locality adaptor
PPT Slide
Lager Image
whose k th element
PPT Slide
Lager Image
is the distance between local feature f α(i),β(j) and codeword
PPT Slide
Lager Image
, and defined as
PPT Slide
Lager Image
- 2.4 Computing STOE features and their descriptors
In the paper, local STOE features are obtained by filtering using a set of Gaussian derivative filters and their corresponding Hilbert transform filters, point-wise squaring and summation over each 3D cuboid that is associated with a detected STIP. We use the approach in [45] to obtain local STOE feature. Specifically, in the first step, a small ST cuboid C ( x i ) centering on a STIP p ( x i ) is filtered by the directionally selective filters G2,θj and H 2,θj at direction cosines θj = ( αj, βj, γj ), j = {1, ⋯ , J } , J is the number of orientations. Next, the filters are taken in quadrature to eliminate phase sensitivity in the output of each filter. Local
STOE energy Eθj ( x i ) in cuboid C ( x i ) at orientation θj can be computed according to
PPT Slide
Lager Image
where x i = ( xi, yi, ti ) denotes the i th STIP coordinate; ∗ denotes convolution; G 2 is 3-D steerable, separable filter based on Gaussian second derivative; H 2 is their corresponding Hilbert transform. The method of constructing ST filters G 2 and H 2 in [45] was adopted in the paper. Then, STOE features at each STIP is normalized as follows [46] :
PPT Slide
Lager Image
where ε is a constant introduced as a noise floor and to avoid instabilities at points where the overall energy is small. In our system, motion information within each cuboid is decomposed into local STOE features along axes X, Y and T, respectively, corresponding direction cosines θ 1 = (1,0,0), θ 2 = (0,1,0), θ 3 = (0,0,1). Finally, normalized STOE features are converted into feature descriptors using the proposed algorithm as follows:
The cuboid containing normalized response is divided into non-overlapping cells. Histogram features with variable bins are then constructed by summing up normalized response within each cell.
The entropy of each histogram feature is calculated, and normalized by corresponding maximum entropies (for example, for k-bin histogram, its maximum entropy equals to (−1/ k )In(1/ k ). Feature of a cell is formed by assembling the normalized entropies of it.
Final feature descriptor is generated by concatenating all cell features.
3. Action Recognition with LGSR and MSPC_LC
- 3.1 System framework
Extraction and representation of motion information plays a crucial role in human action recognition in video sequence. To obtain sufficient motion information, the steerable filters are employed to decompose the raw dynamic information into STOE features along several directions.
In the paper, to depict ST motion information between STIPs in action video, a novel system framework is proposed, and its basic idea is illustrated in Fig. 6 . Firstly, spatio-temporal interest points (STIPs) are extracted from action videos, and local STOE features are calculated ( Fig. 6(a) , (b) ). Secondly, both of STOE features and multi-spatial-scale (MSS) positions of STIPs are connected by the proposed mixed features, and the related MSS codebooks are constructed by k -means over the mixed features ( Fig. 6(c) , (d) ). Then, the mixed features are encoded by MSPC-LC ( Fig. 6(e) , (f) ). Next, multi-temporal-scale (MTS) sub-STVs are built for making use of the temporal information between STIPs, and their MTS coefficient histograms are obtained with max-pooling ( Fig. 6(g) , (h) ). Finally, LGSR is used as classifier over the MTS coefficient histograms ( Fig. 6(i) ).
PPT Slide
Lager Image
The flowchart of the proposed action recognition with MSPC-LC and LGSR.
- 3.2 Building Sub-STVs by Multi-temporal-scale Dense Sampling (MTDS)
Using multi-temporal-scale dense sampling (MTDS) method, the temporal relationship between features can be captured. Specifically, a set of time scales is defined for MTDS depending on the possible action cycle length. Several sub-STVs are then constructed by a sliding window operation. Finally, a group multi-spatial-scale feature descriptors as (12) is built by setting [ α 1 , α 2 , α 3 , α 4 ], [ β 1 , β 2 , β 3 , β 4 ], hence the multi-temporal-scale information is included in MTDS.
The advantage of MTDS is that no necessary to consider the problem of aligning the time coordinate system with human action cycles. Because if only the training samples are sufficient, any test sub-STV can always match a training sub-STV. In many cases, this assumption is true in real applications.
- 3.3 Depicting Sub-STVs by MSPC-LC and Max-pooling Method
Given a sub-STV, there are several local features within it. Each local feature is converted into a multi-spatial-scale feature as (12), and generates a group of multi-spatial-scale feature descriptors:
PPT Slide
Lager Image
Then, MSPC-LC is used to encode each feature descriptor and obtain multiscale class label coefficients:
PPT Slide
Lager Image
sum pooling [31] :
PPT Slide
Lager Image
max pooling [31] :
PPT Slide
Lager Image
These pooled features can then normalized by sum normalization:
PPT Slide
Lager Image
l 2 normalization
PPT Slide
Lager Image
where s out ( j ) is j th element in sub-STV un-normalized descriptor s out , and
PPT Slide
Lager Image
is the MSPC-LC coefficient vector for feature
PPT Slide
Lager Image
.
- 3.4 Action classification based on LGSR
To utilize effectively the intrinsic group information from sub-STV descriptors for action classification, we treat each video as one group of sub-STV descriptors and employ Locality-constrained Group Sparse Representation (LGSR) [39] as human action classifier. LGSR was proposed in [39] for human gait recognition. It is an extended version of sparse representation-based classifier (SRC). The pioneering work of SRC was proposed by Wright et al . [34] and used to classify face images by minimizing the l 1 norm-regularized reconstruction error, in which it seeks a sparse representation for only a single test image.
Compared with traditional SRC, there are three advantages for LGSR:
  • 1) SRC is designed for single image classification and fails to classify a group of samples, while LGSR is developed for sample group classification.
  • 2) Locality constraint on LGSR is more reasonable than sparsity constraint on SRC, especially for representing manifold data[31,33].
  • 3) LGSR is a block sparse constraint classifier, and it is suitable than SRC in classification task when the features are discriminative.
The experimental result in Section 4.5 demonstrates the third advantages. The LGSR enforces both of the group sparsity and local smooth sparsity constraint by minimizing the weighted l 1,2 mixed-norm-regularized reconstructed error, and LGSR is defined as
PPT Slide
Lager Image
where the first term represents the reconstruction error of the test action video S with respect to all the training videos D . The second term is the weighted l 1,2 mixed-norm based regularization on the reconstruction coefficient A , and λ ≥ 0 is the regularization parameter to balance these terms. D is the classification codebook constructed by concatenating K class-specific codebooks [ D 1 , ⋯ D K ]. Each class-specific codebook D k is learnt with LSDL algorithm [33] over the sub-STV descriptors corresponding to the k th action class. S is the group of sub-STV descriptors for one test action. Coefficient vector A k is one part of A and corresponds to . D k . P k is the distance matrix between S and D k , and the element
PPT Slide
Lager Image
in P k is defined as
PPT Slide
Lager Image
Since A k values are independent to each other, we can separately update each A k using its subgradient [36] . To solve (24), the active set-based subgradient descent algorithm [39 , 40] was adopted.
Once we obtain the optimal reconstruction coefficient A , two classification methods [39] based on different criteria can be used to classify the test video.
  • 1) Minimum Reconstruction Error (minRE) Criterion: We compute the reconstruction error for each class as follows:
PPT Slide
Lager Image
  • where the reconstruction coefficient (Ak)∗is fromAkthat corresponds to thekth training video. Then, we classify the test video tok∗= arg minkRk((Ak)∗).
  • 2) Maximum Weighted Inverse Reconstruction Error (maxWIRE) Criterion: In the above criterion, the reconstruction coefficient is not used directly for classification. Intuitively, if the reconstruction errors of the test video with respect to two training videos are the same, we should choose the class label of the training video that is associated with the larger Frobenius norm of the reconstruction coefficient. Specifically, we define the following weighted inverse reconstruction error as follows:
PPT Slide
Lager Image
We classify the test video to k = arg min k Q k (( A k ) ). In the paper, we use maxWIRE criterion as human video action classifier, because it is more reasonable than minRE criterion [33] .
4. Experiment and Analysis
In this section, the effectiveness of our MSPC-LC is evaluated on three public datasets: the KTH, Weizmann, and UCF sports datasets. Leave-one-out cross-validation (LOOCV) strategy is used to evaluate the performance of our algorithm.
- 4.1 Experimental setup
In all experiments, Dollar detector based on multiple ST scales is used to extract STIPs form action videos, and spatial scale τ = [1.2,1.3,1.4,1.5] and temporal scale ϖ = [0.4,0.45,0.50,0.55]. For actions performed by subjects whose bodies do not move (i.e., boxing, handclapping and handwaving in the KTH dataset), STIPs extracted from each action video are coordinate normalized and moved to the region around the origin. For actions performed by moving human bodies and with static background (i.e., actions in the Weizmann dataset), we directly use background subtraction to obtain the center position of body, and centralize the extracted STIPs coordinate. In addition, for actions performed by moving human bodies and with dynamic background (i.e., actions in the UCF Sports dataset), the annotation bounding boxes are used to locate STIPs (for the UCF Sports datatset, the annotation bounding boxes are provided).
To capture multi-temporal-scale relationship of local features, the length of sub-STV is set as 5, 10, 25 and 50 frames. And to describe multi-spatial-scale relationship between local features, four spatial scale parameters are set as α = [0.1,0.05,0.01,0.005], β = [0.1,0.05,0.01,0.005] respectively. In MSPC-LC, the codebook size is set to 500. Since there are 4 spatial scales, the dimension of a sub-STV descriptor is 500 × 4 = 2000. In order to guarantee that the class-specific dictionaries in LGSR are over-complete, random projection in dimension reduction [47] is employed to reduce the dimension of the sub-STV descriptor to 400. In LGSR, the size of each class-specific codebook is set to 600. The other parameters in our methods (i.e., λ ) and the parameters of other methods are evaluated by 5-fold cross-validation.
- 4.2 Action datasets
The KTH dataset contains six classes of human action (i.e., boxing, hand clapping, hand waving, jogging, running, and walking). The actions are performed by 25 different subjects. Each subject performs four action videos in each class. Therefore, the KTH dataset includes 25 × 4 × 6 = 600 low-resolution video clips (160 × 120 pixels). Each action is performed in four scenarios: indoors, outdoors, outdoors with scale variation, and outdoors with different clothes. Examples of this datasets can be seen in Fig. 8 .
PPT Slide
Lager Image
Example images from the KTH dataset.
The UCF sports dataset includes 150 action videos, which are collected from various broadcast sports channel such as BBC and ESPN. It contains 10 different actions: diving, golf swing, horse riding, kicking, lifting, running, skating, swing bar, swing floor, and walking. This dataset is challenging with a wide range of scenarios and viewpoints. Examples of this dataset can be seen in Fig. 9 .
PPT Slide
Lager Image
Examples from the UCF Sports dataset
The Weizmann dataset contains 93 low-resolution video clips (180 × 144 pixels) from nine different subjects, each of whom performs 10 different actions including walking (walk), running (run), jumping (jump), galloping sideways (side), bending (bend), one-hand-waving (wave one), two-hand-waving (wave two), jumping in place (pjump), jumping jack (jack), and skipping (skip). One of the subjects performs walking, running, and skipping twice. The camera setting is fixed and there is no occlusion or viewpoint change. Besides, each subject performs under similar plain background. Some examples are demonstrated in Fig. 10 .
PPT Slide
Lager Image
Examples from the Weizmann dataset
- 4.3 Action datasets performance comparison with BoF model
In this section, the performance of our MSPC-LC algorithm is compared with existing local feature coding methods VQ, SC and LLC [31] . k-nearest neighbor (k-NN) is used as action classifier, and parameter k is set to 5. Keeping the same codebook size with MTDS, K-means clustering is utilized to build codebook with VQ and LLC, and the software in [41] is utilized for SC. In LLC, the locality constraint parameter k is set to 5. In our method, a group of sub-STV descriptors are extracted from a test video and classified with k-NN. Next, the vote score of these sub-STVs decides the label of the test video. In feature pooling phase, sum-pooling is used for VQ, while max-pooling is applied in SC, LLC, and MSPC-LC. And all select sum normalization as normalization method.
In addition, to evaluate the above locality coding algorithms which are used for improving BoF of feature coding in Section 2, another experiment is carried out. Firstly, considering the feature position constraint in Section 2.1, the coding method in (2) is treated as the basic spatio-temporal coding (StC). Secondly, considering the locality constraint in Section 2.2, the coding method in (5) is regarded as the locality-constrained spatio-temporal coding (LSC). In this experiment, the codebook size is still set to 500. K -means clustering is adopted to build codebook for StC. LSDL is used to build codebook for LSC. Sum-pooling is used for StC, and max-pooling is adopted for LSC. The parameter for k-NN is set to 5. The spatial control factors are set as α = [0.1,0.05,0.01,0.005], β = [0.1,0.05,0.01,0.005]. The length of sub-STV is set as 5, 10, 25 and 50 frames, for capturing multi-temporal-scale relationship of local features.
Table 1 shows the result of performance comparison. The recognition rates are the average values on three datasets. It can be seen that StC method achieves better performance than VQ, SC, and LLC. This demonstrates that the ST relationship is important for human video action recognition, and the locality-constrained ST coding is better than StC. In addition, the locality-constraint is useful to handle the manifold of local features. Benefiting from modeling the multiscale ST relationship of local features, MSPC-LC achieves the highest average recognition accuracy on the three datasets.
- 4.4 MSPC-LC versus SPM
The spatial pyramid matching (SPM) model is employed to capture the spatial relationship of local ST features [16] . Here, a 4-level SPM ( Fig. 11 ) is used for evaluation. MSPCLC, SPM, LLC and max-pooling are used to describe sub-STV, respectively. Then, classifying sub-STVs is based on k-NN criterion. Finally, the vote score-based classifier (similar with Section 4.3) decides the class label of test video. Table 2 shows the average recognition accuracies on the three datasets. It is clear that k-NN+MSPC-LC achieves better performance than only SPM on all datasets. Different from SPM [42] that only considers the spatial relationship of local features, MSPCLC simultaneously makes use of the spatial and temporal relationships. In addition, compared with the fixed grids in SPM, MSPC-LC is a more flexible representation.
PPT Slide
Lager Image
SPM with 4-level structure is employed in performance evaluation.
Performance comparison between MSPC-LC and SPM
PPT Slide
Lager Image
Performance comparison between MSPC-LC and SPM
- 4.5 LGSR versus SRC
To evaluate the classification performance based on LGSR, the standard SRC [31] is also employed. The object function of SRC is defined as
PPT Slide
Lager Image
where s i is the ith sub-STV descriptor in S , a i is its corresponding coefficient. Similar to LGSR, the maxWIRE criterion is also used in SRC. As mentioned in Section 3.4, compared with SRC, there are three advantages of LGSR. In particular, if the features are not shared with other classes, the block sparse constraint is more suitable for the classification than sparse constraint. Hence LGSR is relatively better than SRC for classification task when using less shared features. The comparison of average accuracy ( Table 3 ) shows that LGSR outperforms than SRC on the KTH and UCF sports datasets. It is worth to note that Guha and Ward [19] suggested that sparse constraint is more important than block sparse constraint in human action recognition based on local ST features. Compared with local ST features, the obtained sub-STV descriptors with MSPCLC are less shared with other actions. Hence, it is plausible to utilize block sparse constraint than sparse constraint for action classification together with MSPC-LC.
Performance comparison between LGSR and SRC
PPT Slide
Lager Image
Performance comparison between LGSR and SRC
- 4.6 Performance comparison with other systems
Table 4 shows the performance comparison between the proposed algorithm and the classical systems. In experiment setting, action performers are randomly selected as training data and the rest as test data. The competing methods include local representation-based methods [8 , 14] , and global representation-based methods [14] . Specifically, SC was used for feature coding together with BoF in [17] ; and a novel local feature detector was proposed for human action recognition in [14] ; local feature distribution information was taken into consideration in [36] ; ST context feature was employed in [34] ; a ST context constraint coding method was utilized in [38] ; sparse representation-based classification methods was applied in [18] ; and the holistic action representation method was adopted in [5] . It demonstrates from Table 5 that our algorithm outperforms these classical algorithms. The confusion matrices on the KTH and UCF sports datasets are shown in Table 4 and 5 , respectively.
Confusion matrix on the KTH dataset with our method LGSR+MSPC-LC. s1(boxing), s2(hand-clapping), s3(hand-waving), s4(walking), s5(jogging), s6(running).
PPT Slide
Lager Image
Confusion matrix on the KTH dataset with our method LGSR+MSPC-LC. s1(boxing), s2(hand-clapping), s3(hand-waving), s4(walking), s5(jogging), s6(running).
Performance comparison with other systems
PPT Slide
Lager Image
Performance comparison with other systems
Some reasons are responsible for our good classification performance. Firstly, by fusing spatial position information between local features into codebook construction and feature coding, our method performs better than these methods [8 , 17 , 18] that only use the feature appearance information to represent human action. Secondly, compared with the feature distribution information [36] and ST context methods [35 , 38] , our method is a fine and complete method. For example, as illustrated in Fig. 2 , each local feature has two types fundamental information ( where, how ) in STV. More specifically, the coordinate (x, y) indicates where the body part locates. And the motion information (described as STOE feature) shows how the body part moves. It is noted that our feature do not contain appearance information (for example, histogram of oriented gradient (HOG) feature), because appearance information is prone to be disturbed by the various clothes of subjects. In MSPC-LC, all these information ( where, how ) is represented via (x, y) and STOE feature, then, fused into feature coding. Moreover, the spatial information ( when ) between features is utilized by multi-temporal-scale sub-STV. However, these methods [35 , 38 , 36] ignore some one of these information ( where, when and how ) in action representation processing.
Confusion matrix on the UCF Sports dataset with 4-level spatial pyramid match (SPM). s1(diving), s2(golf), s3(kicking), s4(lifting), s5(horse-riding), s6(running), s7(skating), s8(swing-bench), s9(swing-high bar), s10 (walking)
PPT Slide
Lager Image
Confusion matrix on the UCF Sports dataset with 4-level spatial pyramid match (SPM). s1(diving), s2(golf), s3(kicking), s4(lifting), s5(horse-riding), s6(running), s7(skating), s8(swing-bench), s9(swing-high bar), s10 (walking)
5. Conclusion
In this paper, to capture the ST relationship of local features for human action recognition, we built the mixed features that combines STOE feature and spatial position information with spatial scale parameters. Then, the mixed features are encoded by the proposed MSPC-LC algorithm. The experimental results on the public datasets show that (1) feature spatio-temporal position information effectively improves the performance of action recognition (2) by changing spatial parameters, the mixed features can provide more useful information to the action classifier (3) combining feature spatial position into feature coding is a beneficial alternative way for this task. In particular, combining feature spatial position into feature coding is a better approach than feature distribution [36] , spatio-temporal context [35] , and SPM-based methods [23] , when using a multiscale version.
The major limitations of our system is that (1) In the preprocessing stage, it is difficult in locating the human head and measuring the human body length, when the environment is complicated, such as crowed street (2) The values of spatial scale parameters can greatly influence on the performance of action recognition. However, in many cases, their values are set empirically. It is valuable to explore new methods to capture the spatiotemporal location of local features in our future work.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No.61301270, No.61271288), the Foundation Research Funds for the Central Universities of China (No.ZYGX2013J025), and the Research Fund for the Doctoral Program of Higher Education of China (No.20130185120014), the National High Technology Research and Development Program (No.2012AA011503).
BIO
Jiang-feng Yang is a Ph.D student in the School of communication and information engineering, University of Electronic Science and Technology of China. He received his Master degree from Kunming University of Science and Technology in 2009. His current research interests include computer vision, human action recognition, motion detection.
Zheng Ma is a professor in School of Communication and Information Engineering, University of Electronic Science and Technology of China. His current research interests include image processing, computer vision.
Mei Xie is a professor in School of Electronic Engineering, University of Electronic Science and Technology of China. She received her B.S. in 1981 from Chengdu Institute of Telecommunication, and her M.S. in 1990 and Ph.D. in 1996 from University of Electronic Science and Technology of China, China. From 1997-1998 in School of Electronic Engineering, University of Hong Kong, Hong Kong, and 1998-1999 in School of Electronic Engineering, University of Texas at Austin, USA, she studied as a postdoctor. Her research interests include signal processing, machine vision and Internet security.
References
Wu X. , Lai J. 2010 “Tensor-based projection using ridge regression and its application to action classification,” IET Image Processing 4 (6) 486 - 493    DOI : 10.1049/iet-ipr.2009.0278
Chaaraoui A. A. , Perez P. C. 2013 “Silhouette-based human action recognition using sequences of key poses,” Pattern Recognition Letters 34 (15) 1799 - 1807    DOI : 10.1016/j.patrec.2013.01.021
Tran K. N. , Kakadiaris I. A. , Shah S. K. 2011 “Modeling motion of body parts for action recognition,” Proceedings of the British Machine Vision Conference 1 - 12
Huang B. , Tian G. , Zhou F. 2012 “Human typical action recognition using gray scale image of silhouette sequence,” Computers and Electrical Engineering 38 (5) 1177 - 1185    DOI : 10.1016/j.compeleceng.2012.05.005
Rahman S. A. , Leung M. K. H. , Cho S. Y. 2013 “Human action recognition employing negative space features,” Journal of Visual Communication and Image Representation 24 (3) 217 - 231    DOI : 10.1016/j.jvcir.2012.12.001
Saghafi B. , Rajan D. 2012 “Human action recognition using pose-based discriminant embedding,” Signal Processing 27 (1) 96 - 111
Yoon S. M. , Kuijper A. 2013 “Human action recognition based on skeleton splitting,” Expert Systems with Applications 40 (17) 6848 - 6855    DOI : 10.1016/j.eswa.2013.06.024
Shao L. , Ji L. , Liu Y. , Zhang J. 2012 “Human action segmentation and recognition via motion and shape analysis,” Pattern Recognition Letters 33 (4) 438 - 445    DOI : 10.1016/j.patrec.2011.05.015
Deng X. , Liu X. , Song M. 2013 “LF-EME: local features with elastic manifold embedding for human action recognition,” Neurocomputing 99 (1) 144 - 153    DOI : 10.1016/j.neucom.2012.06.011
Dollar P. , Rabaud V. , Cottrell G. , Belongie S. 2005 “Behavior recognition via sparse spatio-temporal features,” Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance 65 - 72
Klaser A. , Marszalek M. , Schmid C. 2008 “A spatio-temporal descriptor based on 3D-gradients,” Proceedings of the British Machine Vision Conference
Scovanner P. , Ali S. , Shah M. 2007 “A 3-dimensional sift descriptor and its application to action recognition,” Proceedings of the 15th ACM International Conference on Multimedia 357 - 360
Willems G. , Tuytelaar T. , Van Gool L. 2008 “An efficient dense and scaleinvariant spatio-temporal interest point detector,” Proceedings of the Europen Conference on Computer Vision 650 - 663
Laptev I. , Marszaek M. , Schmid C. , Rozenfeld B. 2008 “Learning realistic human actions from movies,” Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition 1 - 8
Escobar M. J. , Kornprobst P. 2012 “Action recognition via bioinspired features: the richness of center-surround interaction,” Computer Vision and Image Understanding 116 (5) 593 - 605    DOI : 10.1016/j.cviu.2012.01.002
Zhu X. , Yang Z. , Tsien J. 2012 “Statistics of natural action structures and human action recognition,” Journal of Vision 12 (9) 834 - 834    DOI : 10.1167/12.9.834
Chakraborty B. , Holte M. B. , Moeslund T. B. , Gonzalez J. 2012 “Selective spatio-temporal interest points,” Computer Vision and Image Understanding 116 (3) 396 - 410    DOI : 10.1016/j.cviu.2011.09.010
Zhu Y. , Zhao X. , Fu Y. 2010 “Sparse coding on local spatial temporal volumes for human action recognition,” Springer Proceedings of the Computer Vision Berlin, Germany 660 - 671
Guha T. , Ward R K. 2012 “Learning sparse representations for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (8) 1576 - 1588    DOI : 10.1109/TPAMI.2011.253
Roweis S. T. , Saul L. K. 2000 “Nonlinear dimensionality reduction by locally linear embedding,” Science 290 2323 - 2326    DOI : 10.1126/science.290.5500.2323
Tenenbaum J. , DeSilva V. , Langford J. 2000 “Aglobal geometric framework for nonlinear dimensionality reduction” Science 290 2319 - 2323    DOI : 10.1126/science.290.5500.2319
Elgammal A. , Duraiswami R. , Davis L. 2003 “Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence 25 1499 - 1504    DOI : 10.1109/TPAMI.2003.1240123
Breunig M. , Kriegel H. P. , Ng R. T. , Sander J. 2000 “LOF: identifying density-based local outliers,” Proceeding software 2000 ACM SIGMOD International Conference on Management of Data
Yu K. , Zhang T. , Gong Y. 2009 “Nonlinear learning using local coordinate coding,” Advances in Neural Information Processing Systems 22 2223 - 2231
Wangp J. , Yang J. , Yu K. , Lv F. 2010 “Locality-constrained linear coding for image classification,” Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 3360 - 3367
Chao Y. W. , Yeh Y. R. , Chen Y. W. 2011 “Locality-constrained group sparse representation for robust face recognition,” Proceedings of the IEEE International Conference on Image Processing (ICIP) 761 - 764
Zheng M. , Bu J. , Chen C. 2011 ”Graph regularized sparse coding for image representation,” IEEE Transactions on Image Processing 20 1327 - 1336    DOI : 10.1109/TIP.2010.2090535
Tropp J. A. 2004 “Greed is good: algorithmic results for sparse approximation,” IEEE Transactions on Information Theory 50 2231 - 2242    DOI : 10.1109/TIT.2004.834793
Olshausen B. A. , Field D. J. 1996 “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature 381 (6583) 607 - 609    DOI : 10.1038/381607a0
Yu K. , Zhang T. , Gong Y. 2009 “Nonlinear learning using local coordinate coding,” Proceedings of the 23rd Annual Conference on Neural Information Processing Systems 2223 - 2231
Wang J. , Yang J. , Yu K. , Lv F. 2010 “Locality-constrained linear coding for image classification,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 3360 - 3367
Roweis S. T. , Saul L. K. 2000 “Nonlinear dimensionality reduction by locally linear embedding,” Science 290 (5500) 2323 - 2326    DOI : 10.1126/science.290.5500.2323
Wei C.P. , Chao Y.W. , Yeh Y. R. 2013 “Locality-sensitive dictionary learning for sparse representation based classification,” Pattern Recognition 46 (5) 1277 - 1287    DOI : 10.1016/j.patcog.2012.11.014
Wright J. , Yang A.Y. , Ganesh A. , Sastry S.S. , Ma Y. 2009 “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) 210 - 227    DOI : 10.1109/TPAMI.2008.79
Wu X. , Xu D. , Duan L. , Luo J. 2011 “Action recognition using context and appearance distribution features,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 489 - 496
Bregonzio M. , Xiang T. , Gong S. 2012 “Fusing appearance and distribution information of interest points for action recognition,” Pattern Recognition 45 (3) 1220 - 1234    DOI : 10.1016/j.patcog.2011.08.014
Chakraborty B. , Holte M. B. , Moeslund T. B. , Gonzalez J. 2012 “Selective spatio-temporal interest points,” Computer Vision and Image Understanding 116 (3) 396 - 410    DOI : 10.1016/j.cviu.2011.09.010
Zhang Z. , Wang C. , Xiao B. 2012 “Action recognition using context constrained linear coding,” Signal Processing Letters 19 (7) 439 - 442    DOI : 10.1109/LSP.2012.2191615
Xu D. , Huang Y. , Zeng Z. , Xu X. 2012 “Human gait recognition using patch distribution feature and locality-constrained group sparse representation,” IEEE Transactions on Image Processing 21 (1) 316 - 326    DOI : 10.1109/TIP.2011.2160956
Liu M. , Yan S. , Fu Y. , Huang T. S. 2008 “Flexible X-Y patches for face recognition,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 2113 - 2116
2013 http://spams-devel.gforge.inria.fr/
Lazebnik S. , Schmid C. , Ponce J. 2006 “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2169 - 2178
Yang J. , Yu K. , Gong Y. , Huang T. 2009 “Linear spatial pyramid matching using sparse coding for image classification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Lee H. , Battle A. , Raina R. , Ng A. 2007 Advances in Neural Information Processing Systems MIT Press “Efficient sparse coding algorithms,” 801 - 808
Derpanis K. G. , Gryn J. M. 2005 Three-dimensional nth derivative of Gaussian separable steerable filters IEEE Int. Conf. on Image Processing 3
Derpanis K.G. , Sizintsev M. , Cannons K. , Wildes R. P. 2010 “Efficient Action Spotting based on a Space time Oriented Structure Representation,” Proceedings of the IEEE Conference Computer Vision and Pattern Recognition
baraniuk R. , Wakin M. 2009 “Random projection of smooth manifold,” foundation of computational mathmaematics 9 51 - 77    DOI : 10.1007/s10208-007-9011-z