Advanced
Low-Complexity Sub-Pixel Motion Estimation Utilizing Shifting Matrix in Transform Domain
Low-Complexity Sub-Pixel Motion Estimation Utilizing Shifting Matrix in Transform Domain
Journal of Electrical Engineering and Technology. 2016. Jul, 11(4): 1020-1026
Copyright © 2016, The Korean Institute of Electrical Engineers
This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : September 24, 2015
  • Accepted : February 03, 2016
  • Published : July 01, 2016
Download
PDF
e-PUB
PubReader
PPT
Export by style
Article
Author
Metrics
Cited by
About the Authors
Chul, Ryu
Dept. of Information and Communication Engineering, Dougguk University, Seoul, Korea.
Jae-Young, Shin
Dept. of Information and Communication Engineering, Dougguk University, Seoul, Korea.
Eun-Chan, Park
Corresponding Author: Dept. of Information and Communication Engineering, Dougguk University, Seoul, Korea.

Abstract
Motion estimation (ME) algorithms supporting quarter-pixel accuracy have been recently introduced to retain detailed motion information for high quality of video in the state-of-the-art video compression standard of H.264/AVC. Conventional sub-pixel ME algorithms in the spatial domain are faced with a common problem of computational complexity because of embedded interpolation schemes. This paper proposes a low-complexity sub-pixel motion estimation algorithm in the transform domain utilizing shifting matrix. Simulations are performed to compare the performances of spatial-domain ME algorithms and transform-domain ME algorithms in terms of peak signal-to-noise ratio (PSNR) and the number of bits per frame. Simulation results confirm that the transform-domain approach not only improves the video quality and the compression efficiency, but also remarkably alleviates the computational complexity, compared to the spatial-domain approach.
Keywords
1. Introduction
The recent advances in mobile communication technologies facilitate diverse multimedia services in a wireless environment, and thus, the demands of high-quality video services in mobile devices are continuously increasing. In order to satisfy these demands, video compression is an imperative technique because high-quality video requires a great amount of information to be transmitted and received over a bandwidth-limited wireless channel. ITU-T and ISO jointly developed and proposed H.264/AVC aiming to provide high-quality video at substantially lower bit rates than the previous standards (e.g., MPEG-2 and H.263) without increasing the design complexity substantially. Therefore, H.264/ AVC became one of the most prevailing video codecs for recording, compression, and distribution of video contents, e.g., Internet video streaming and HDTV broadcasting, and it is widely embedded in various consumer electronics such as digital cameras/camcorders, digital TV set-top boxes, mobile phones, and CCTVs, as well desktops and laptops.
The essence of video compression is to reduce spatial and temporal redundancies, and a motion estimation (ME) algorithm is the key part of video compression to remove temporal redundancy among consecutive frames in the video sequence. In literature, ME algorithms have been continuously and widely studied by many researchers to improve the compression efficiency and/or to lower the computational complexity, because the computational complexity of ME takes up about 60%-80% of total computational complexity of video compression [1] . For the purpose of motion estimation, a block matching algorithm is usually used to estimate a motion vector by finding a location that gives the minimum matching error within the search area in the previous frame. A video encoder transfers the motion vectors found by the ME algorithm and the errors between the current and the best-matched blocks using entropy coding. More precise motion vectors can reduce the temporal redundancy effectively, because they lead to less amount of error between the current and the target block found by motion estimation.
Motion vector does not always occur in integer-pixels, it may also occur in sub-pixels due to changes of brightness among frames, nonlinear motion, or objects being appeared or disappeared in the boundary of pixel. Therefore, it cannot be expected to obtain precise motion information with a ME algorithm supporting integer-pixel accuracy, and it is necessary to adopt a sub-pixel ME algorithm for achieving more accurate motion vectors. For this reason, MPEG-2 and H.263 utilize half-pixel accuracy ME and H.264 further expands it to the level of quarter-pixel accuracy to achieve better video quality than the existing video coding standards.
Conventional ME algorithms are usually performed in the spatial domain. However, the sub-pixel ME algorithms in the spatial domain result in a serious increment of computational complexity of video encoders due to embedded interpolation algorithms. To overcome this drawback, this paper proposes a low-complexity ME algorithm based on a transform-domain approach. By using shifting matrix in the transform domain and predefined coefficient of interpolation, the proposed ME algorithm significantly reduces the computational complexity without impairing the video quality. The simulation results confirm that the proposed transform-domain approach outperforms the conventional spatial-domain approach in terms of the peak signal-to-noise ratio (PSNR) and the number of bits per frame, which are effective measures for the video quality, compression efficiency, and computational complexity.
The paper is organized as follows: Section 2 describes the ME algorithm in the transform domain. In Section 3, a shifting matrix algorithm is proposed for quarter-pixel accuracy ME. Section 4 presents simulation results comparing the performance of spatial-domain ME algorithms with that of transform-domain ME algorithms. Section 5 concludes and addresses the utilization of the proposed algorithm.
2. Motion Estimation in Transform Domain
Since video encoding and decoding are performed in a unit of block, block-based ME algorithms are preferred. Block-based ME algorithms can be classified as spatial-domain algorithms and transform-domain algorithms. The common spatial-domain ME algorithms are block matching algorithm and gradient-based algorithm, while the typical transform-domain ME algorithms are discrete cosine transform (DCT) and wavelet-based algorithm [2 - 6] .
Among the transform-domain algorithms, the wavelet-based algorithm spends less time in searching and matching however, it incurs a high complexity in estimating motion vectors and increases the bit rate to transmit motion vectors in all the sub-bands. On the other hand, DCT-based algorithm is widely used in various video compression standards, and thus, it maintains a high compatibility to many existing standards.
DCT transforms the spatial-domain data into the transform-domain coefficients. One of its desirable features is that the energy of transformed coefficients is concentrated on the low frequency bands, which can be a basis for high efficiency of video compression. In general, the spatial-domain motion estimation requires the process of inverse DCT (IDCT) in order to utilize the spatial-domain information of the restored frame which is used as a reference frame in the estimation process, as shown in Fig. 1 . Here, Q and IQ denote quantization and inverse quantization, respectively. On the other hand, IDCT can be omitted in the transform-domain ME since the motion estimation can be performed using coefficients transformed by DCT, as illustrated in Fig. 2 [7] . One of the key advantages of transform-domain ME is the low computational complexity, which is originated from the simple encoder structure without IDCT.
PPT Slide
Lager Image
Motion estimation and compensation of video encoder in the spatial domain
PPT Slide
Lager Image
Motion estimation and compensation of video encoder in the transform domain
The shifting matrix algorithm proposed by Plompen showed that a prediction block fpred can be represented with the horizontal and vertical conversion of neighboring blocks [8] . Fig. 3 shows an example that four neighboring blocks in the previous frame, f 1 , f 2 , f 3 , and f 4 , are used to estimate the fpred which is the best-matched block to the current block fcu rr .
PPT Slide
Lager Image
Estimation of current block from the best-matched block composed of neighboring blocks in the previous frame
The prediction block fpred can be expressed with a vertical shifting matrix hi and a horizontal shifting matrix vi , i.e.,
PPT Slide
Lager Image
where, hi and vi in Eq. (1) are defined as follows.
PPT Slide
Lager Image
PPT Slide
Lager Image
Here, the displacement matrix Dn is defined as
PPT Slide
Lager Image
where, In is a n × n unit matrix. For example, Fig. 4 demonstrates a process of predicting fpred with ∆ x = 2, ∆ y = 0, and Eq. (5) and (6) represent hi used in this example.
PPT Slide
Lager Image
Example of estimation process of fpred with ∆x = 2, ∆y = 0
PPT Slide
Lager Image
PPT Slide
Lager Image
Applying orthogonality and separability of DCT, fpred , DCT coefficient of the prediction block, can be expressed from Eq. (1) as follows:
PPT Slide
Lager Image
Note that fpred can be obtained by carrying out eight operations of matrix multiplication and three operations of matrix subtraction. However, the amount of computation can be greatly reduced by utilizing the characteristic of energy concentration of DCT. Accordingly, Eq. (7) can be derived as follows, in order to relieve the computational complexity by taking the advantage of DCT transformed block Fi , i.e.,
PPT Slide
Lager Image
All Fi ( m , n ) with zero in Eq. (8) can be excluded from computations so that the efficiency of prediction can be further improved using the scarcity of prediction block Fi . Considering that the value of coefficients in high frequency band are almost zero, it is necessary to redistribute the coefficients of Fi in the sequence of zigzag scanned order as
PPT Slide
Lager Image
The recursive equation derived from Eq. (9) provides an efficient way to calculate Fi by adjusting non-zero factor of Fi . Furthermore, the computational complexity can be efficiently reduced according to the block size and the number of non-zero factors of Fi . From Eq. (9), the operation of outer product of Vi and Hi needs to be performed only with non-zero factors of Fi , and Eq. (9) can be expressed in a recursive form as
PPT Slide
Lager Image
Here,
PPT Slide
Lager Image
is a 4 × 4 zero matrix and k increases from one to the total number of non-zero DCT factors in the block.
3. Shifting Matrix for Sub-Pixel Accuracy ME
The spatial-domain ME supporting sub-pixel accuracy utilizes interpolation algorithms which often up-sample more than four times of its samples. Therefore, the performance of motion estimation depends on the interpolation algorithm applied to it. For example, the performance of bilinear interpolation incorporated in the ME with half-pixel accuracy is degraded, compared to the higher-order interpolation algorithms [9] . Meanwhile, the higher-order interpolation algorithms, such as b-spline interpolation or cubic interpolation, each of which is given in Eq. (11) and Eq. (12), respectively, provide better prediction performance than the bilinear interpolation at the cost of large computational complexity. Accordingly, H.264 can enhance the image quality by utilizing a ME with quarter-pixel accuracy; however it exacerbates the computational complexity at the same time.
PPT Slide
Lager Image
PPT Slide
Lager Image
This section presents a shifting matrix algorithm supporting quarter-pixel accuracy of ME and proposes a transform-domain ME with b-spline or cubic interpolation schemes, with which the performance can be enhanced compared to the bilinear interpolation. In order to apply b-spline or cubic interpolation, the shifting matrix should be represented as follows.
PPT Slide
Lager Image
PPT Slide
Lager Image
PPT Slide
Lager Image
Considering the quarter-pixel accuracy of ME,
PPT Slide
Lager Image
. If n < 0, Dn and
PPT Slide
Lager Image
use 4 × 4 zero matrices, otherwise if n > 4,
PPT Slide
Lager Image
and
PPT Slide
Lager Image
. In Eqs. (13) and (14), ⌊ d ⌋ and ⌈ d ⌉ denote the rounding-up and rounding-down values of d , respectively.
In the case of bilinear interpolation, the frequency information in both sides of interpolated location is used to estimate the motion at the sub-pixel accuracy. On the other hand, in the case of b-spline or cubic interpolation, the frequency information in four locations is required. Therefore, when | d | < 1 or | d | > 3 , the frequency information beyond the boundary of reference block can be utilized. However, in order to obtain frequency information in a specific location from the block converted to the transform domain, all of the block information including the specific location is required, which causes an unnecessary access of memory. To resolve this problem, the vertical or horizontal intra prediction mode is adopted in predicting a specific location [10] .
For instance, if the current frame fcurr moves in y-axis at the amount of −4 < d < −3, the information beyond boundary of the search area is required and the problem of memory allocation happens. However, if the intra prediction mode is applied, the information beyond boundary can be estimated by duplicating information in the search area upwards as indicated in Fig. 5 . In this way, the problem of unnecessary memory allocation can be effectively avoided. The following Eqs. (16) and (17) show the vertical shifting matrix in the ME with quarter-pixel accuracy for the example illustrated in Fig. 5 .
PPT Slide
Lager Image
Estimation of information beyond boundary using intra prediction mode
PPT Slide
Lager Image
PPT Slide
Lager Image
It is worthwhile to note that the proposed mechanism does not require IDCT and alleviates the computational complexity by utilizing the recursive equation. Table 1 compares the number of addition and multiplication operations used in the spatial-domain ME and the transform-domain ME when the block size is N × N and the search area is p . For the typical values of N = 4 and p = 8, the transform-domain ME decreases the number of addition and multiplication operations by about 27 and 3.5 times, respectively, compared to the spatial-domain ME, which confirms the outstanding decrease of computational complexity due to the transform-domain approach.
Comparison of computational complexity in spatial-domain ME and transform-domain ME
PPT Slide
Lager Image
Comparison of computational complexity in spatial-domain ME and transform-domain ME
In order to compare the computational complexities of spatial-domain ME and transform-domain ME, two indices of γ + and γ * are introduced as
PPT Slide
Lager Image
where,
PPT Slide
Lager Image
and
PPT Slide
Lager Image
are the numbers of addition operations in the spatial-domain ME and transform-domain ME, respectively, and
PPT Slide
Lager Image
and
PPT Slide
Lager Image
are the corresponding numbers of multiplication operations. Fig. 6(a) and 6(b) show γ + and γ * when N ranges from 2 to 8 and p ranges from 2 to 10, respectively. In these ranges of N and p , γ + lies between 0.024 and 0.063, while γ * lies between 0.14 and 0.57; that is, the transform-domain ME remarkably decreases the numbers of addition and multiplication operations by about 16~42 times and 2~7 times, respectively, compared to the spatial-domain ME. Moreover, as shown in Fig. 6 , both γ + and γ * almost linearly increase with respect to the increase of block size N, but they are hardly affected by search area p .
PPT Slide
Lager Image
Relative computational complexity of transform-domain ME to spatial-domain ME
4. Simulation and Discussion
In this section, the performances of several ME algorithms are evaluated and compared via simulations. In the simulations, H.264/AVC video coder is used, the quarter-pixel accuracy ME is considered, and the sum of absolute difference is used as a criterion to determine the motion vector. The performance is evaluated with two indices, PSNR and the number of bits per frame (BPF), which are widely used to measure video quality and compression efficiency. The performance is compared by considering the following several ME algorithms.
• H.264: This is a baseline algorithm as standardized by ITU-T and ISO. It is the most typical spatial-domain ME algorithm and uses a 4-tap FIR filter as the interpolation scheme.
• CSD/BSD: These two are the spatial-domain ME algorithms with the cubic and b-spline interpolation schemes, respectively.
• CTD/BTD: These two are the proposed transform-domain ME algorithms with the cubic and b-spline interpolation schemes, respectively.
• HCL/HBL: These two correspond to the enhanced versions of CTD and BTD, respectively. In order to further mitigate the computational complexity, a hybrid approach is adopted, i.e., the bilinear interpolation is applied to half-pixel and the cubic (HCL) or b-spline (HBL) interpolation is applied to quarter-pixel.
Note that H.264, CSD, and BSD are the spatial-domain ME algorithms while the others are the transform-domain ME algorithm as proposed in this paper.
Table 2 and 3 demonstrate the performance of several ME algorithms using akiyo and football sequences in CIF format with different value of quantization parameter (QP), respectively. In these tables, the values in bold type indicate the highest PSNR and lowest BPF for each QP. In akiyo sequence (see Table 2 ), the transform-domain ME algorithms.
Performance comparison of several motion estimation algorithms for akiyo sequence
PPT Slide
Lager Image
Performance comparison of several motion estimation algorithms for akiyo sequence
Performance comparison of several motion estimation algorithms for football sequence
PPT Slide
Lager Image
Performance comparison of several motion estimation algorithms for football sequence
However, when QP≥21, the transform-domain algorithms require slightly higher bit-rate than the spatial-domain algorithms, which results from the increase of quantization error with respect to the increase of QP.
In the case of football sequence, as shown in Table 3 , the transform-domain algorithms have higher PSNR than the spatial-domain algorithms, regardless of QP values. In terms of BPF, CTD requires more bits than the spatial- domain algorithms for some specific values of QP (e.g., QP=3 and QP≥21). This is because the quantization errors are accumulated when estimating locations at half-pixel and quarter-pixel. When estimating the location of certain sub-pixels such as 1.25, 1.5, 1.75, 3.25, 3.5 and 3.75 by using the shifting matrix, some reference locations are beyond the boundary and are forced to be duplicated as shown in Fig. 5 . Interpolating these locations causes the estimation errors, leading to the decrease of PSNR. This problem can be effectively dealt with the hybrid algorithms, HCL and HBL. By applying the differentiated interpolation scheme in half-pixel location (bilinear interpolation) and quarter-pixel location (cubic interpolation in HCL and b-spline interpolation in HBL), the quantization error is reduced, which contributes to the increase of PSNR accordingly.
Fig. 7(a) and 7(b) show the values of PSNR for 50 frames with akiyo and football sequences, respectively. Here, the value of QP is fixed to 18. As shown in Fig. 7 , the transform-domain algorithms achieve higher PSNR than the spatial-domain algorithms, on the whole. Also, it is shown that HCL outperforms the others for the most frames in the case of akiyo sequence (see Fig. 7(a) ), while HBL and BTD attain the highest value of PSNR in the case of football sequence (see Fig. 7(b) ). The value of PSNR is high in the order of CSD, BSD, and H.264 among the spatial-domain algorithms, while it is high in the order of the hybrid algorithms (HCL and HBL), BTD and CTD among the transform-domain algorithms. When the interpolation is performed in the spatial domain, the surrounding pixels of target location are used to produce an interpolated pixel; however, the proposed transform-domain algorithms use the DCT coefficient instead. In the case of block-based 2D DCT, most of the information in the block is used to obtain the specific frequency component. This is the reason why the performance of specific spatial-domain ME does not retain a consistently in the transform domain.
PPT Slide
Lager Image
Performance comparison of motion estimation in spatial domain and transform domain (QP =18)
5. Conclusion
This paper proposed the low-complexity transform-domain ME algorithm supporting quarter-pixel accuracy. By utilizing the shifting matrix in the transform domain, the proposed algorithm can attain high quality of video and high efficiency of compression, while reducing the computational complexity substantially. The simulation results verified that the proposed algorithm not only provides higher PSNR but also requires smaller number of bits per frame compared to the conventional ME algorithms in the spatial domain. Moreover, due to the utilization of recursive equation and lack of IDCT block, the proposed transform-domain ME algorithm significantly relieved the computational complexity, which is the critical problem of the spatial-domain ME algorithms. As a result, the proposed algorithm is quite suitable for the real-time or streaming video service in mobile devices or embedded systems that have the limited capability of computing power, memory space, battery lifetime, and communication bandwidth. The proposed algorithm is also applicable to the next-generation video compression standard of H.265 by revising the shifting matrix to cope with the variable block size from 4 × 4 to 32 × 32. The gain of compression efficiency that can be achieved by the proposed algorithm is further intensified with a large block size such as 32 × 32, by utilizing the low-complexity recursive equation [11] .
Acknowledgements
This work was supported by the research program of Dongguk University.
BIO
Chul Ryu He received his M.S and Ph.D degrees in electrical engineering in 1991 and 1997, respectively, both from the Polytechnic Institute of New York University. From 1998 to 1999, he served as a research engineer of advanced wireless technologies lab at LG Electronics. He has been a professor of the Department of Information and Communication Engineering, Dongguk University, Seoul since 1999. His research area includes visual communication, super-resolution reconstruction, and real-time hardware design for video codec.
Jae-Young Shin She received Master’s degree in Information and Communication Engineering from Dongguk University. Her research interests are digital image processing and compression.
Eun-Chan Park He received B.S, M.S, and Ph.D degrees from the School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea, in 1999, 2001, 2006, respectively. From 2006 to 2008, he was a senior engineer with Samsung Electronics, Korea. He is currently an associate professor in the Department of Information and Communication Engineering, Dongguk University, Seoul since 2006. His research interests include network performance analysis, resource allocation and MAC protocol design of wireless networks.
References
Agha S. , Dwyer V. M 2003 “Algorithm and VLSIarchitectures for MPEG-4 motion estimation,”
Oh J. 2010 “Improved sub-block matching algorithm,” Journal of Korean Institute of Communications and Information Sciences 33 (7) 628 - 633
Keller Y. , Averbuch A. 2003 “Fast gradient methods based on global motion estimation for video compression,” IEEE Transactions on Circuits and Systems for Video Technology 13 (4) 300 - 309    DOI : 10.1109/TCSVT.2003.811360
Nguyen Q. H. , Nguyen V. A. , Trinh C. V. , Dinh K. Q. , Park Y. , Jeon B. 2014 “Block-based transform-domain measurement coding for compressive sensing of images,” Journal of Korean Institute of Communications and Information Sciences 39A (12) 746 - 755    DOI : 10.7840/kics.2014.39A.12.746
Yang C. M. , hung K. C. 2014 “Embedded image compression scheme using rate-distortion optimized block coding of wavelet coefficients,” Journal of Korean Institute of Communications and Information Sciences 39A (11) 625 - 636    DOI : 10.7840/kics.2014.39A.11.625
Song J. , Yeo B. 2000 “A fast algorithm for DCT-domain inverse motion compensation based on shared information a macroblock,” IEEE Transactions on Circuits and Systems for Video Technology 10 (5) 767 - 775    DOI : 10.1109/76.856453
Kleihorst R. , Caberera F. 1998 “Implement of DCT- domain motion estimation and compensation,” Proceedings of IEEE Workshop on Signal Processing Systems (SIPS) 53 - 62
Plompen R.H.J.M. , Schuurink B. F. , Biemond J. 1985 “A new motion-compensated transform coding scheme,” IEEE Transactions on Acoutstics, Speech and Signal Process 1 371 - 374
Kang M. , Heo J. , Ryu C. 2011 “Half-pixel accuracy motion estimation algorithm in the transform domain for H.264,” Journal of Korean Institute of Communi- cations and Information Sciences 33 (11) 917 - 924
Wiegand T. , Sullivan G. J. , Bjontegaard G. , Luthra A. 2003 “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology 13 (7) 560 - 576    DOI : 10.1109/TCSVT.2003.815165
Yoon D. , Ho Y. 2012 “Fast mode decision method for HEVC in depth video,” Journal of Korean Institute of Communications and Information Sciences 37 (1A) 51 - 56    DOI : 10.7840/KICS.2012.37A.1.51