A Skip-mode Coding for Distributed Compressive Video Sensing
Journal of Broadcast Engineering. 2014. Mar, 19(2): 257-267
Copyright © 2014, The Korean Society of Broadcast Engineers
  • Received : February 27, 2014
  • Accepted : March 12, 2014
  • Published : March 30, 2014
About the Authors
Quang Hong, Nguyen
Khanh Quoc, Dinh
Viet Anh, Nguyen
Chien Van, Trinh
영현, 박
병우, 전

Distributed compressive video sensing (DCVS) is a low-cost sampling paradigm for video coding that combines compressive sensing with distributed video coding. In this paper, we propose a skip-mode coding for DCVS, under the assumption that when temporal correlation is high, temporal interpolation yields a nonkey frame of sufficiently good quality, so the measurement data of that nonkey frame need not be transmitted. Furthermore, we adopt a hierarchical structure for better temporal interpolation. Simulation results show that the proposed skip-mode coding reduces the average subrate of the whole video sequence while the PSNR decreases only slightly. In addition, the proposed scheme reduces the computational complexity at the decoder by 43.75% on average for video sequences that have strong temporal correlation.
Ⅰ. Introduction
A low-complexity video encoding paradigm is in high demand for real-time applications such as video conferencing on mobile devices or wireless visual sensor networks, but its practical realization is still very challenging. A possible solution for such scenarios is distributed video coding (DVC), in which highly correlated video frames are encoded independently but decoded jointly [1 - 5]. In DVC, the most computationally intensive task, motion estimation/motion compensation, is shifted from the encoder to the decoder, resulting in a very simple encoder at the expense of a complex decoder.
Meanwhile, traditional Nyquist sampling can be burdensome for the encoder in terms of storage and data transmission, because the Nyquist/Shannon rate, which must be at least twice the highest frequency of the input signal, is already too high for high-resolution video. As an alternative, compressive sensing (CS) is an emerging sampling technique that allows sparse signals (or those having a sparse representation in some transform domain) to be sampled at a rate much lower than the Nyquist/Shannon rate via a linear projection into a measurement domain. One extreme demonstration was the single-pixel CS camera reported in 2008 [6]. CS has great potential to reduce computation and resource requirements dramatically, since it acquires the signal directly in compressed form.
Inspired by the low sampling cost of CS and the low encoding complexity of DVC, researchers [7 - 11] combined them into a framework called distributed compressive video sensing (DCVS). In a DCVS system, an input video sequence is divided into groups of pictures (GOPs), each containing a key frame (called a K-frame) and several nonkey frames (called CS-frames). In [7], conventional intra coding (such as MPEG or H.26x intra) is used for key frames, and block-based CS is used for nonkey frames. However, this framework is less attractive because it needs two parallel samplers: one for Nyquist/Shannon sampling and the other for CS. In another approach [11], Y. Baig et al. proposed a DCVS system that employs CS for both key and nonkey frames, leading to a significantly lower sampling cost at the encoder. Side information (SI) is generated by creating a dictionary from the CS measurements of key frames and is incorporated into the recovery process of nonkey frames.
Most DCVS systems [8 - 11] use the same subrate for all nonkey frames. However, the temporal correlation, i.e., the similarity of a frame to its temporally neighboring frames, may vary considerably from frame to frame within a sequence and from sequence to sequence. Accordingly, the importance of the measurements received at the decoder may also differ from frame to frame. When temporal correlation is low, the received measurements are very important, since a nonkey frame cannot be estimated accurately from its neighbours by interpolation. On the other hand, when temporal correlation is high, interpolation may produce a nonkey frame of good quality, for which the received measurements are less essential. In this paper, we therefore propose a skip-mode coding for the measurement data of frames with high temporal correlation, at the expense of only a small quality loss in the recovered video. In the implementation, the encoder computes the temporal correlation between the measurements of neighboring frames to decide whether skip mode is the right choice, and sends a skip indicator to notify the decoder that a given nonkey frame is skipped. With frame interpolation at the decoder instead of CS recovery, the decoding time may decrease considerably, especially for very stable video scenes.
The rest of this paper is organized as follows: related works are described in Section Ⅱ. Section Ⅲ presents the proposed method for deciding the skip-mode coding, while Section Ⅳ presents simulation results. Finally, Section Ⅴ gives concluding remarks.
Ⅱ. Related works
- 1. Compressive Sensing (CS)
A signal $x \in \mathbb{R}^N$ is called $k$-sparse if it has at most $k$ non-zero elements (i.e., $\|x\|_0 \le k$, where $\|\cdot\|_0$ is the $\ell_0$-norm). In the CS framework of [12], a sparse or approximately compressible signal $x$ can be recovered from a small number of measurements $y \in \mathbb{R}^M$, where $y$ is a linear projection of $x$ by a measurement matrix $\Phi \in \mathbb{R}^{M \times N}$:
$$ y = \Phi x \qquad (1) $$
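As a minimal sketch of the sampling step in eq.(1), the following NumPy snippet projects a $k$-sparse test signal with a Gaussian measurement matrix. The sizes, the $1/\sqrt{M}$ scaling, and the function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_measurement_matrix(m, n, seed=0):
    """Return an m x n matrix with i.i.d. N(0, 1/m) entries (an assumed, common CS choice)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, n)) / np.sqrt(m)

def measure(phi, x):
    """Compute the measurement vector y = Phi x of eq.(1)."""
    return phi @ x

# Hypothetical sizes: N = 256 samples, M = 64 measurements, k = 8 non-zeros.
N, M, k = 256, 64, 8
rng = np.random.default_rng(1)
x = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x[support] = rng.standard_normal(k)      # a k-sparse test signal

phi = gaussian_measurement_matrix(M, N)
y = measure(phi, x)
print(y.shape, np.count_nonzero(x))
```

The encoder only needs the matrix-vector product, which is why CS sampling is attractive for a low-complexity encoder.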
Note that, for recovery of a signal $x$, the measurement matrix $\Phi$ needs to satisfy the restricted isometry property (RIP) of order $k$ [13], i.e., there exists a $\delta_k \in (0,1)$ such that, for all $k$-sparse vectors $x$:
$$ (1-\delta_k)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1+\delta_k)\|x\|_2^2 \qquad (2) $$
From a practical viewpoint, very few real-world signals are truly sparse. However, most natural signals, image and video signals in particular, can be sparsely represented in a suitable transform domain such as the DCT or DWT domain.
A length-$N$ signal $x$ associated with a sparse transform basis $\Psi$ can be represented as $x = \Psi^T\theta$, where $\theta$ is a vector consisting of the transform coefficients of $x$. CS theory shows that the sparse representation $\theta$ of $x$ can be recovered from $M$ measurements, $y = \Phi x = \Phi\Psi^T\theta$, where $M \ll N$ (the ratio $s = M/N$ is called the subrate or measurement rate), by the $\ell_0$-minimization:
$$ \hat{\theta} = \arg\min_{\theta} \|\theta\|_0 \quad \text{s.t.} \quad y = \Phi\Psi^T\theta \qquad (3) $$
Because the $\ell_0$-minimization of eq.(3) is an NP-hard problem [14], CS recovery instead solves the $\ell_1$-minimization, which is a convex optimization problem:
$$ \hat{\theta} = \arg\min_{\theta} \|\theta\|_1 \quad \text{s.t.} \quad y = \Phi\Psi^T\theta \qquad (4) $$
Many algorithms are available to recover $\theta$ in practice; for example, greedy algorithms such as Matching Pursuit (MP) [15] or Orthogonal Matching Pursuit (OMP) [16].
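To make the recovery step concrete, here is a minimal sketch of OMP in NumPy. It is a textbook-style greedy variant that assumes the sparsity level $k \ge 1$ is known; the matrix `a` stands for the product $\Phi\Psi^T$, and the sizes and test values are hypothetical.

```python
import numpy as np

def omp(a, y, k):
    """Greedy OMP sketch: find a k-sparse theta (k >= 1) with y ~= a @ theta."""
    n = a.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(a.T @ residual)
        if support:
            correlations[support] = 0.0      # never reselect a column
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(a[:, support], y, rcond=None)
        residual = y - a[:, support] @ coef
    theta = np.zeros(n)
    theta[support] = coef
    return theta

# Hypothetical test problem: N = 128, M = 80, k = 4.
rng = np.random.default_rng(0)
a = rng.standard_normal((80, 128)) / np.sqrt(80)
theta_true = np.zeros(128)
theta_true[[5, 40, 77, 120]] = [1.5, -2.0, 1.0, -1.2]
theta_hat = omp(a, a @ theta_true, k=4)
print(np.max(np.abs(theta_hat - theta_true)))
```

Each iteration costs one correlation pass and one small least-squares solve, which is one reason greedy methods are popular when decoding time matters.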
- 2. Motion Compensation-Block-based Compressive Sensing-Smoothed Projected Landweber (MC-BCS-SPL)
In [8], S. Mun et al. presented an effective DCVS scheme in which all frames are acquired by CS, pursuing the simplest possible sampling. Furthermore, as in DVC, they perform motion estimation (ME) / motion compensation (MC) at the decoder side to keep the encoder as simple as possible. The block diagram of MC-BCS-SPL is shown in Fig. 1. At the encoder, a video sequence is divided into GOPs, each consisting of a key frame and several nonkey frames. The key frame and all nonkey frames are partitioned into non-overlapping blocks of size $B \times B$, and then projected into a measurement domain using eq.(1). For the best quality of recovered frames, the key frame is always sampled at a higher subrate than the nonkey frames. The MC-BCS-SPL decoder consists of two main processes: forward/backward ME/MC and residual reconstruction. Forward/backward ME/MC, i.e., the forward temporal direction to reconstruct the first half of a GOP and the backward temporal direction to reconstruct the last half, exploits the temporal correlation between successive frames in a video sequence. This forward/backward structure is shown in Fig. 2. ME/MC generates SI that enhances the performance of the CS recovery process. Residual reconstruction recovers the difference between a given frame and its SI, under the assumption that the residual signal is sparser than the signal itself and can therefore be recovered more accurately. In a nutshell, MC-BCS-SPL has the following merits:
Fig. 1. Block diagram of the MC-BCS-SPL [8]
Fig. 2. An illustration of forward/backward structure with GOP size of 8
- As mentioned above, MC-BCS-SPL uses CS for both key frames and nonkey frames. This differs from some DCVS models [7, 9] which use conventional intra coding such as MPEG or H.26x intra coding for key frames. Although conventional intra coding can considerably improve the quality of reconstructed key frames, it suffers from high computational complexity in both the encoder and the decoder. Note that one of the main goals of the CS framework is a simple encoder. In line with this goal, MC-BCS-SPL removes the burden of Nyquist/Shannon sampling. In other words, MC-BCS-SPL integrates sampling and compression into a single process for both key frames and nonkey frames, which enables a low-complexity encoder.
- A large sensing matrix, which satisfies the RIP condition with overwhelming probability [20], usually gives high recovery performance. For this reason, the authors in [9] use frame-based sensing for their DCVS model. Unfortunately, this is not attractive for real-time applications because natural video frames are large. By dividing a video frame into smaller non-overlapping blocks, MC-BCS-SPL not only reduces the memory required for the sensing matrix but also becomes more amenable to real-time applications.
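The block-based sampling described above can be sketched as follows. This is an illustrative NumPy version, not the MC-BCS-SPL reference code: the function name is hypothetical, and the frame size (144×176) and block size (16×16) are borrowed from the experimental section.

```python
import numpy as np

def block_cs_sample(frame, phi_b, block=16):
    """Sample each non-overlapping block x block tile of `frame` with phi_b.

    Returns an array of shape (num_blocks, M_B): one measurement vector per block,
    with blocks ordered row by row.
    """
    h, w = frame.shape
    assert h % block == 0 and w % block == 0, "frame must tile evenly"
    # Reshape into (tile_row, pixel_row, tile_col, pixel_col), then group tiles.
    tiles = frame.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    vectors = tiles.reshape(-1, block * block)   # one flattened tile per row
    return vectors @ phi_b.T                     # y_j = Phi_B x_j for every block j

rng = np.random.default_rng(0)
frame = rng.random((144, 176))                   # stand-in for a QCIF luminance frame
subrate = 0.5
m_b = int(round(subrate * 16 * 16))              # measurements per block
phi_b = rng.standard_normal((m_b, 256)) / np.sqrt(m_b)
y = block_cs_sample(frame, phi_b)
print(y.shape)
```

With these sizes the block sensing matrix has 128 × 256 = 32,768 entries, whereas a frame-based matrix at the same subrate would need 12,672 × 25,344 (about 3.2 × 10^8) entries, which illustrates the memory argument in the second bullet.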
However, MC-BCS-SPL still has considerable room for improvement; for example, it does not account for the variation in temporal correlation from frame to frame in a video sequence. This motivates the new skip-mode coding presented in detail in Section Ⅲ.
- 3. Skip-mode coding in DCVS
Exploiting the variation in temporal correlation from frame to frame in a video sequence, the framework in [7] proposed a method that skips the transmission of a block which changes very little from its temporally co-located previous block; however, determining the skip mode for each block makes the encoder more complex. Additionally, it requires the encoder to decode key frames, which is impractical for DCVS systems that use CS for both key frames and nonkey frames. Since one of the distinctive features of the CS framework is its reduced encoder complexity, requiring the encoder to decode key frames is unattractive.
Ⅲ . Proposed skip-mode coding
A low-motion video sequence has high temporal correlation among successive frames. The aim of the proposed skip-mode coding is to let the CS system exploit this property by not transmitting the measurements of nonkey frames that have high temporal correlation. As a result, the number of measurements transmitted or stored for low-motion frames is reduced considerably. When the motion in a video sequence increases (i.e., the temporal correlation among successive frames decreases), the number of skip-mode frames is reduced to avoid poor quality in the recovered video frames.
Even though the raw video data (sampled at the Nyquist/Shannon rate) is not available at the encoder, the temporal correlation between successive frames can still be computed in the measurement domain, where its value differs little from that in the spatial domain. Accordingly, the authors of [11, 19] postulate that high spatial correlation also results in high CS measurement correlation. The experimental results in [11] also show that frames in various sequences have high temporal correlation among their CS measurements, with values even above 0.9.
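For two length-$L$ measurement vectors $y_1$ and $y_2$, let $\bar{y}_1$ and $\bar{y}_2$ denote the averages of $y_1$ and $y_2$, respectively. The temporal correlation coefficient between $y_1$ and $y_2$ is computed as:
$$ \rho(y_1, y_2) = \frac{\sum_{i=1}^{L} \left(y_1(i) - \bar{y}_1\right)\left(y_2(i) - \bar{y}_2\right)}{\sqrt{\sum_{i=1}^{L} \left(y_1(i) - \bar{y}_1\right)^2 \sum_{i=1}^{L} \left(y_2(i) - \bar{y}_2\right)^2}} \qquad (5) $$
where $i$ is an element index of $y_1$ and $y_2$.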
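A minimal sketch of this computation, assuming eq.(5) is the usual Pearson correlation coefficient between the two measurement vectors (the function name and the handling of constant vectors are assumptions):

```python
import numpy as np

def measurement_correlation(y1, y2):
    """Correlation coefficient between two equal-length measurement vectors."""
    y1 = np.asarray(y1, dtype=float).ravel()
    y2 = np.asarray(y2, dtype=float).ravel()
    d1 = y1 - y1.mean()                  # remove the average of y1
    d2 = y2 - y2.mean()                  # remove the average of y2
    denom = np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))
    if denom == 0.0:
        return 0.0                       # assumption: a constant vector counts as uncorrelated
    return float(np.sum(d1 * d2) / denom)

rng = np.random.default_rng(0)
y_a = rng.standard_normal(500)
y_b = y_a + 0.05 * rng.standard_normal(500)   # a slightly perturbed copy
rho_ab = measurement_correlation(y_a, y_b)
print(rho_ab)
```

The cost is linear in the number of measurements, so the encoder can evaluate it without reconstructing any frame.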
To show the measurement correlation in video sequences more clearly, we analyse the correlation for the first 80 frames of Hall Monitor, Salesman, Soccer and Football. CS measurements are obtained for each frame at a subrate of 0.5 with a Gaussian measurement matrix. The measurement correlation coefficients between a frame and its forward adjacent frame ($\rho_f$) and its backward adjacent frame ($\rho_b$) are computed by eq.(5); then the average correlation, $\bar{\rho} = (\rho_f + \rho_b)/2$, is taken and plotted in Fig. 3. All video sequences show high correlation between CS measurements, with correlation coefficients above 0.9. For low-motion sequences such as Hall Monitor and Salesman, the correlation between CS measurements is high: $\bar{\rho}$ is very close to 1. On the other hand, for high-motion sequences such as Soccer or Football, the correlation is noticeably further from 1. Thus, the measurement correlation decreases as the motion in a video sequence increases.
Fig. 3. Measurement correlation coefficient
Utilizing the measurement correlation in eq.(5), the proposed scheme decides whether or not a nonkey frame should be skip-mode coded. At the encoder, the measurement correlation coefficient $\rho_n$ between a nonkey frame and each of its reference frames (where $n$ is the index of the reference frame) is computed using eq.(5). The average of these coefficients, $\rho = \mathrm{average}(\rho_n)$, is then compared with a predefined threshold $\tau$. The nonkey frame mode is determined as:
$$ \text{mode} = \begin{cases} \text{skip}, & \rho \ge \tau \\ \text{inter}, & \text{otherwise} \end{cases} \qquad (6) $$
The process described above is repeated until modes of all the nonkey frames within a current GOP are identified. A block diagram of the proposed method is shown in Fig. 4 . Note that the encoder needs to signal its decoder whether a given nonkey frame is encoded with the skip-mode or not. This signaling requires an additional bit for each nonkey frame.
Fig. 4. Skip-mode coding at compressive sensing
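The encoder-side decision can be sketched as below. This is only an illustration under stated assumptions: the comparison is taken as "average correlation at least τ", the flag convention (1 = skip) is hypothetical, and the synthetic vectors merely stand in for the CS measurements of real frames.

```python
import numpy as np

def correlation(y1, y2):
    """Pearson correlation coefficient between two measurement vectors."""
    return float(np.corrcoef(np.ravel(y1), np.ravel(y2))[0, 1])

def decide_mode(y_frame, y_refs, tau=0.999):
    """Return (mode, flag_bit, rho) for one nonkey frame.

    mode is "skip" when the mean correlation with the reference frames
    reaches tau (assumed comparison: rho >= tau), otherwise "inter".
    """
    rho = float(np.mean([correlation(y_frame, y_ref) for y_ref in y_refs]))
    mode = "skip" if rho >= tau else "inter"
    flag_bit = 1 if mode == "skip" else 0        # hypothetical one-bit signalling
    return mode, flag_bit, rho

rng = np.random.default_rng(0)
y_prev = rng.standard_normal(2000)
y_next = y_prev + 1e-3 * rng.standard_normal(2000)
y_static = 0.5 * (y_prev + y_next)               # nearly identical to both references
y_moving = rng.standard_normal(2000)             # unrelated to the references

print(decide_mode(y_static, [y_prev, y_next])[0])
print(decide_mode(y_moving, [y_prev, y_next])[0])
```

Only the flag bit is sent for a skipped frame; its measurement vector is neither transmitted nor stored.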
The decoder processes a nonkey frame in skip mode or inter mode depending on the received signaling bit. If a nonkey frame is marked as skip mode, it is reconstructed by interpolation from its reference frames. Here, we use a simple but effective temporal interpolation that takes the average of the recovered reference frames. Otherwise (i.e., the nonkey frame is in inter mode), it is recovered as usual from the received measurement vector by residual CS reconstruction.
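A minimal sketch of this decoder-side branching follows. The averaging step mirrors the interpolation described above; `cs_recover` is a hypothetical callable standing in for the residual CS reconstruction of MC-BCS-SPL, which is not implemented here.

```python
import numpy as np

def interpolate_skipped_frame(ref_frames):
    """Skip-mode reconstruction: pixel-wise average of the recovered reference frames."""
    stack = np.stack([np.asarray(f, dtype=float) for f in ref_frames])
    return stack.mean(axis=0)

def decode_nonkey_frame(skip_flag, ref_frames, measurements=None, cs_recover=None):
    """Reconstruct one nonkey frame according to the received flag bit."""
    if skip_flag:
        return interpolate_skipped_frame(ref_frames)
    # Inter mode: defer to the caller-supplied residual CS reconstruction.
    return cs_recover(measurements, ref_frames)

# Toy example with two constant 4 x 4 "reference frames".
ref_a = np.zeros((4, 4))
ref_b = np.full((4, 4), 2.0)
skipped = decode_nonkey_frame(1, [ref_a, ref_b])
print(skipped[0, 0])
```

Because the skip branch is a single averaging pass, it is far cheaper than an iterative CS recovery, which is the source of the decoding-time savings reported later.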
Furthermore, in this paper, the proposed skip-mode coding is implemented on MC-BCS-SPL with a modified GOP structure. That is, instead of the simple forward/backward structure in Fig. 2, which is known to be suboptimal for temporal interpolation, we use a hierarchical structure as shown in Fig. 5, since it is well known to exploit the temporal correlation among neighboring frames better [17, 18]. As a result, the performance of MC-BCS-SPL is improved.
Fig. 5. An illustration of hierarchical structure with GOP size of 8
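Since the figure itself is not reproduced here, the following sketch assumes the common dyadic hierarchy: the middle frame of the GOP is processed first from the two surrounding key frames, then the quarter positions, and so on. The function name and the assumption that frame 8 is the key frame of the next GOP are illustrative; the exact arrangement in Fig. 5 may differ.

```python
def hierarchical_order(gop_size=8):
    """Return (frame, left_ref, right_ref) triples in processing order.

    Assumes gop_size is a power of two, frame 0 is the key frame of this GOP,
    and frame gop_size is the key frame of the next GOP.
    """
    order = []
    step = gop_size
    while step > 1:
        half = step // 2
        for left in range(0, gop_size, step):
            # The frame midway between two already-available frames comes next.
            order.append((left + half, left, left + step))
        step = half
    return order

for frame, left, right in hierarchical_order(8):
    print(f"frame {frame}: references {left} and {right}")
```

Under this assumption every nonkey frame has one reference on each side, which suits the averaging-based interpolation used for skipped frames.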
Ⅳ. Experimental results
In this section, we evaluate the performance of the proposed method with the first 144 frames of four QCIF (144×176) video sequences, namely, Hall Monitor, Salesman, News, and Soccer. Each input video sequence is divided into GOPs of size 8. At the encoder, each frame is split into non-overlapping 16×16 blocks, and all blocks are sampled using a Gaussian measurement matrix with a subrate of 0.7 for key frames and a subrate in the range of 0.1 to 0.5 for nonkey frames. For the proposed method, we set $\tau = 0.999$ in eq.(6), which gives good performance in the skip-mode coding. Furthermore, the average subrate $\bar{s}$ for the whole sequence is computed as:
$$ \bar{s} = \frac{n_K s_K + n_{NK} s_{NK}}{n_K + n_{NK}} \qquad (7) $$
where $n_K$ and $n_{NK}$ are the numbers of key and nonkey frames in a given video sequence, and $s_K$ and $s_{NK}$ are the corresponding subrates of key and nonkey frames. The PSNR of the proposed method is reported against this average subrate. The rate-distortion (RD) performance of the proposed method and its decoding time comparison are given in Fig. 6 and Fig. 7.
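As a worked example of the average subrate, the sketch below assumes that each GOP of 8 contains one key frame and seven nonkey frames, so that 144 frames give 18 key frames and 126 nonkey frames. The second function is an assumption about how skipped frames enter the average (they contribute no measurements, and the one-bit flags are ignored); it is not a formula stated in the paper.

```python
def average_subrate(n_key, s_key, n_nonkey, s_nonkey):
    """Frame-weighted average subrate over key and nonkey frames."""
    return (n_key * s_key + n_nonkey * s_nonkey) / (n_key + n_nonkey)

def average_subrate_with_skips(n_key, s_key, n_nonkey, s_nonkey, n_skipped):
    """Assumed variant: skipped nonkey frames contribute zero measurements."""
    coded = n_nonkey - n_skipped
    return (n_key * s_key + coded * s_nonkey) / (n_key + n_nonkey)

# 144 frames, GOP size 8 (assumed 1 key + 7 nonkey frames per GOP).
s_bar = average_subrate(18, 0.7, 126, 0.5)
print(round(s_bar, 4))

# Hypothetical case: 60 of the 126 nonkey frames are skipped.
s_bar_skip = average_subrate_with_skips(18, 0.7, 126, 0.5, 60)
print(round(s_bar_skip, 4))
```

For the first case the arithmetic is (18 × 0.7 + 126 × 0.5) / 144 = 75.6 / 144 = 0.525.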
Fig. 6. RD-performance of the proposed method
Fig. 7. Comparison of decoding time with hierarchical structure (Hall Monitor sequence)
- 1. Rate-distortion performance of proposed method
Fig. 6 shows the RD comparison of the proposed skip-mode coding. First of all, at the same average subrate, the performance of MC-BCS-SPL with the hierarchical structure is always better than that with the forward/backward structure; the PSNR gain is up to 1.5 dB for Hall Monitor. The reason is that, with the hierarchical structure, the decoder can generate a more reliable estimate of the motion field, which helps to create high-quality SI in the decoding process. Consequently, the proposed skip-mode coding always gives better results under the hierarchical structure than under the forward/backward structure.
Naturally, with skip-mode coding of nonkey frames, the quality of the recovered frames (i.e., PSNR) of the proposed method is expected to be lower than that of the conventional MC-BCS-SPL, because interpolating the skipped nonkey frames may introduce errors in the recovery process. However, the average subrate of the whole sequence is reduced substantially. The average saving rate (ASR) is calculated as:
$$ \mathrm{ASR} = \frac{\bar{s}_c - \bar{s}_p}{\bar{s}_c} \times 100\% \qquad (8) $$
where $\bar{s}_c$ and $\bar{s}_p$ are the average subrates of the conventional MC-BCS-SPL and of the proposed skip-mode coding on top of MC-BCS-SPL, respectively.
Because the Hall Monitor and Salesman sequences contain only a little motion and a very stable background, many of their nonkey frames are skipped. Consequently, the proposed method performs well on them. For example, at a nonkey-frame subrate of 0.5 (the two points marked with red circles in Fig. 6 (a)), the ASR of Hall Monitor is about 54% with a PSNR reduction of only 0.65 dB. In contrast, because Soccer has rapid motion (both background and objects change quickly), its temporal correlation between frames is not as high as that of Hall Monitor. Accordingly, most of its nonkey frames are encoded in inter mode; thus, the proposed method makes no difference for such a sequence with rapid motion. For the News sequence, which contains moderate motion, the number of skipped nonkey frames is smaller than that of Hall Monitor or Salesman. At the two points marked with red circles in Fig. 6 (c), the ASR is approximately 12.6% with a trade-off in PSNR of about 0.1 dB. Thus, the experimental results show that the skip-mode coding can avoid a large amount of unnecessary transmission (or storage) of measurement data with only a small PSNR loss.
- 2. Time complexity of the proposed method
The proposed method helps MC-BCS-SPL reduce the computational complexity at the decoder. By using the simple but effective interpolation described in Section Ⅲ, the decoding time can be reduced considerably. The average saving time (AST) is calculated as follows:
$$ \mathrm{AST} = \frac{t_c - t_p}{t_c} \times 100\% \qquad (9) $$
where $t_c$ and $t_p$ are the decoding times of the conventional MC-BCS-SPL and of the proposed skip-mode coding on top of MC-BCS-SPL, respectively.
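Both ASR and AST are relative savings with respect to the conventional scheme, so a single helper covers them. The sketch below assumes the usual percentage form (conventional minus proposed, divided by conventional); the numeric inputs are hypothetical and are not measurements from the paper.

```python
def saving_percent(conventional, proposed):
    """Relative saving of `proposed` with respect to `conventional`, in percent."""
    if conventional <= 0:
        raise ValueError("conventional value must be positive")
    return 100.0 * (conventional - proposed) / conventional

# Hypothetical decoding times in seconds (illustrative only).
ast = saving_percent(200.0, 150.0)
print(ast)

# Hypothetical average subrates (illustrative only).
asr = saving_percent(0.5, 0.4)
print(round(asr, 2))
```

A value of 0 means no frame was skipped, while larger values indicate that more measurements or more decoding time were saved.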
Fig. 7 compares the accumulated decoding time for Hall Monitor. It shows that the decoding time is reduced remarkably when using the skip-mode coding; the average AST is approximately 43.75% relative to the conventional hierarchical MC-BCS-SPL. Thus, the more nonkey frames are skipped, the more decoding time is saved.
Ⅴ . Conclusion
In this paper, a skip-mode coding method for DCVS systems is proposed. In the proposed method, the encoder exploits the temporal correlation between the measurements of frames in a GOP to decide on the skip-mode coding of nonkey frames. For frames with little motion, the proposed skip-mode coding, implemented on MC-BCS-SPL, skips the transmission of their measurements. Experimental results show that the proposed method works well for low-motion video sequences such as Hall Monitor or Salesman, for which up to 54% of the average subrate is saved and the decoding time is also reduced remarkably in comparison with the conventional MC-BCS-SPL.
Quang Hong Nguyen
- 2013 : Hanoi University of Science and Technology (Vietnam)
- 2013 ~ present : Master student at Digital Media Lab, Sungkyunkwan University (Korea)
- Research interests : Video coding and Compressive Sensing
Khanh Quoc Dinh
- 2010 : Hanoi University of Science and Technology (Vietnam)
- 2012 : Master at Digital Media Lab, Sungkyunkwan University (Korea)
- 2012 ~ present : PhD student at Digital Media Lab, Sungkyunkwan University (Korea)
- Research interests : Video Compression and Compressive Sensing
Viet Anh Nguyen
- 2011 : Hanoi University of Science and Technology (Vietnam)
- 2013 : Master at Digital Media Lab, Sungkyunkwan University (Korea)
- 2013 ~ present : PhD student at Digital Media Lab, Sungkyunkwan University (Korea)
- Research interests : Compressive Sensing, Video coding
Chien Van Trinh
- 2012 : Hanoi University of Science and Technology (Vietnam)
- 2012 ~ present : Master candidate at Digital Media Lab, Sungkyunkwan University (Korea)
- Research interests : Video Compression and Compressive Sensing
박 영 현
- 2011 : B.S., School of Electronic and Electrical Engineering, Sungkyunkwan University
- 2011 ~ present : Integrated M.S./Ph.D. program, Department of Electronic, Electrical and Computer Engineering, Sungkyunkwan University
- Research interests : Video compression, Compressive Sensing
전 병 우
- 1985 : B.S., Department of Electronics Engineering, Seoul National University
- 1987 : M.S., Department of Electronics Engineering, Seoul National University
- 1992 : Ph.D. in Engineering, School of Elec., Purdue Univ.
- 1993 ~ 1997 : Principal Researcher, Signal Processing Research Center, Samsung Electronics
- 1997 ~ present : Professor, School of Information and Communication Engineering, Sungkyunkwan University
- Research interests : Multimedia video compression, image recognition, signal processing
References
[1] Girod B., Aaron A., Rane S., Rebollo-Monedero D., "Distributed Video Coding," in Proc. IEEE Special Issue on Advances in Video Coding and Delivery, vol. 93, pp. 71-83, June 2005.
[2] Park J., Jeon B., Wang D., Vincent A., "Wyner-Ziv Video Coding with Region Adaptive Quantization and Progressive Channel Noise Modeling," in Proc. IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, pp. 1-6, May 2009.
[3] Cho H., Eun H., Shim H. J., Jeon B., "Transform-domain Wyner-Ziv Residual Coding using Temporal Correlation," Journal of Broadcasting Engineering, vol. 17, no. 1, pp. 140-151, 2012. DOI: 10.5909/JEB.2012.17.1.140
[4] Kim D.-Y., Park G.-H., Kim K.-H., Suh D.-Y., "A study on performance evaluation of DVCs with different coding method and feasibility of spatial scalable DVC," Journal of Broadcasting Engineering, vol. 12, no. 6, pp. 585-595, 2007. DOI: 10.5909/JBE.2007.12.6.585
[5] Park J., Jeon B., "Fast Distributed Video Coding using Parallel LDPCA Encoding," Journal of Broadcasting Engineering, vol. 16, no. 1, pp. 144-154, 2011. DOI: 10.5909/JEB.2011.16.1.144
[6] Duarte M., Davenport M., Takhar D., Laska J., Sun T., Kelly K., Baraniuk R., "Single-pixel Imaging via Compressive Sampling," IEEE Signal Processing Magazine, vol. 25, pp. 83-91, 2008. DOI: 10.1109/MSP.2007.914730
[7] Prades-Nebot J., Ma Y., Huang T., "Distributed Video Coding using Compressive Sampling," in Proc. of the Picture Coding Symposium, pp. 1-4, May 2009.
[8] Mun S., Fowler J. E., "Residual Reconstruction for Block-based Compressed Sensing of Video," in Proc. Data Compression Conference (DCC), pp. 183-192, March 2011.
[9] Do T. T., Chen Y., Nguyen D. T., Nguyen N., Gan L., Tran T. D., "Distributed Compressive Video Sensing," IEEE International Conference on Image Processing, pp. 1393-1396, Nov. 2009.
[10] Kang L. W., Lu C. S., "Distributed Compressive Video Sensing," in Proc. IEEE International Conference on Acoustic, Speech and Signal Processing, pp. 325-330, July 2012.
[11] Baig Y., Lai E. M. K., Punchihewa A., "Distributed Video Coding Based on Compressive Sensing," in Proc. IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 325-330, July 2012.
[12] Baraniuk R., "Compressed sensing," IEEE Signal Processing Magazine, vol. 24, pp. 118-121, 2007. DOI: 10.1109/MSP.2007.4286571
[13] Candes E., Romberg J., Tao T., "Decoding by linear programming," IEEE Trans. Inform. Theory, vol. 51, pp. 4203-4215, 2005. DOI: 10.1109/TIT.2005.858979
[14] Natarajan B. K., "Sparse approximate solution to linear systems," SIAM J. Comput., vol. 24, pp. 227-234, 1995. DOI: 10.1137/S0097539792240406
[15] Mallat S., Zhang Z., "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, pp. 3397-3415, 1993. DOI: 10.1109/78.258082
[16] Tropp J., Gilbert A. C., "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inform. Theory, vol. 53, pp. 4655-4666, 2007. DOI: 10.1109/TIT.2007.909108
[17] Ascenso J., Pereira F., "Hierarchical Motion Estimation for Side Information Creation in Wyner-Ziv Video Coding," in Proc. of 2nd International Conference on Ubiquitous Information Management and Communication, pp. 347-352, 2008.
[18] Quoc K. D., Van X. H., Jeon B., "An Iterative Algorithm for Efficient Adaptive GOP Size in Transform Domain Wyner-Ziv Video Coding," in Advances in Image and Video Technology, Lecture Notes in Computer Science, vol. 7088, pp. 347-358, 2012.
[19] Quoc K. D., Shim H. J., Jeon B., "Measurement Coding for Compressive Imaging using a structural measurement matrix," in Proc. 20th IEEE International Conference on Image Processing (ICIP), pp. 10-13, September 2013.
[20] Baraniuk R., Davenport M., DeVore R., Wakin M., "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, pp. 253-263, 2008. DOI: 10.1007/s00365-007-9003-x