Motion estimation (ME) algorithms supporting quarterpixel accuracy have been recently introduced to retain detailed motion information for high quality of video in the stateoftheart video compression standard of H.264/AVC. Conventional subpixel ME algorithms in the spatial domain are faced with a common problem of computational complexity because of embedded interpolation schemes. This paper proposes a lowcomplexity subpixel motion estimation algorithm in the transform domain utilizing shifting matrix. Simulations are performed to compare the performances of spatialdomain ME algorithms and transformdomain ME algorithms in terms of peak signaltonoise ratio (PSNR) and the number of bits per frame. Simulation results confirm that the transformdomain approach not only improves the video quality and the compression efficiency, but also remarkably alleviates the computational complexity, compared to the spatialdomain approach.
1. Introduction
The recent advances in mobile communication technologies facilitate diverse multimedia services in a wireless environment, and thus, the demands of highquality video services in mobile devices are continuously increasing. In order to satisfy these demands, video compression is an imperative technique because highquality video requires a great amount of information to be transmitted and received over a bandwidthlimited wireless channel. ITUT and ISO jointly developed and proposed H.264/AVC aiming to provide highquality video at substantially lower bit rates than the previous standards (e.g., MPEG2 and H.263) without increasing the design complexity substantially. Therefore, H.264/ AVC became one of the most prevailing video codecs for recording, compression, and distribution of video contents, e.g., Internet video streaming and HDTV broadcasting, and it is widely embedded in various consumer electronics such as digital cameras/camcorders, digital TV settop boxes, mobile phones, and CCTVs, as well desktops and laptops.
The essence of video compression is to reduce spatial and temporal redundancies, and a motion estimation (ME) algorithm is the key part of video compression to remove temporal redundancy among consecutive frames in the video sequence. In literature, ME algorithms have been continuously and widely studied by many researchers to improve the compression efficiency and/or to lower the computational complexity, because the computational complexity of ME takes up about 60%80% of total computational complexity of video compression
[1]
. For the purpose of motion estimation, a block matching algorithm is usually used to estimate a motion vector by finding a location that gives the minimum matching error within the search area in the previous frame. A video encoder transfers the motion vectors found by the ME algorithm and the errors between the current and the bestmatched blocks using entropy coding. More precise motion vectors can reduce the temporal redundancy effectively, because they lead to less amount of error between the current and the target block found by motion estimation.
Motion vector does not always occur in integerpixels, it may also occur in subpixels due to changes of brightness among frames, nonlinear motion, or objects being appeared or disappeared in the boundary of pixel. Therefore, it cannot be expected to obtain precise motion information with a ME algorithm supporting integerpixel accuracy, and it is necessary to adopt a subpixel ME algorithm for achieving more accurate motion vectors. For this reason, MPEG2 and H.263 utilize halfpixel accuracy ME and H.264 further expands it to the level of quarterpixel accuracy to achieve better video quality than the existing video coding standards.
Conventional ME algorithms are usually performed in the spatial domain. However, the subpixel ME algorithms in the spatial domain result in a serious increment of computational complexity of video encoders due to embedded interpolation algorithms. To overcome this drawback, this paper proposes a lowcomplexity ME algorithm based on a transformdomain approach. By using shifting matrix in the transform domain and predefined coefficient of interpolation, the proposed ME algorithm significantly reduces the computational complexity without impairing the video quality. The simulation results confirm that the proposed transformdomain approach outperforms the conventional spatialdomain approach in terms of the peak signaltonoise ratio (PSNR) and the number of bits per frame, which are effective measures for the video quality, compression efficiency, and computational complexity.
The paper is organized as follows: Section 2 describes the ME algorithm in the transform domain. In Section 3, a shifting matrix algorithm is proposed for quarterpixel accuracy ME. Section 4 presents simulation results comparing the performance of spatialdomain ME algorithms with that of transformdomain ME algorithms. Section 5 concludes and addresses the utilization of the proposed algorithm.
2. Motion Estimation in Transform Domain
Since video encoding and decoding are performed in a unit of block, blockbased ME algorithms are preferred. Blockbased ME algorithms can be classified as spatialdomain algorithms and transformdomain algorithms. The common spatialdomain ME algorithms are block matching algorithm and gradientbased algorithm, while the typical transformdomain ME algorithms are discrete cosine transform (DCT) and waveletbased algorithm
[2

6]
.
Among the transformdomain algorithms, the waveletbased algorithm spends less time in searching and matching however, it incurs a high complexity in estimating motion vectors and increases the bit rate to transmit motion vectors in all the subbands. On the other hand, DCTbased algorithm is widely used in various video compression standards, and thus, it maintains a high compatibility to many existing standards.
DCT transforms the spatialdomain data into the transformdomain coefficients. One of its desirable features is that the energy of transformed coefficients is concentrated on the low frequency bands, which can be a basis for high efficiency of video compression. In general, the spatialdomain motion estimation requires the process of inverse DCT (IDCT) in order to utilize the spatialdomain information of the restored frame which is used as a reference frame in the estimation process, as shown in
Fig. 1
. Here, Q and IQ denote quantization and inverse quantization, respectively. On the other hand, IDCT can be omitted in the transformdomain ME since the motion estimation can be performed using coefficients transformed by DCT, as illustrated in
Fig. 2
[7]
. One of the key advantages of transformdomain ME is the low computational complexity, which is originated from the simple encoder structure without IDCT.
Motion estimation and compensation of video encoder in the spatial domain
Motion estimation and compensation of video encoder in the transform domain
The shifting matrix algorithm proposed by Plompen showed that a prediction block
f_{pred}
can be represented with the horizontal and vertical conversion of neighboring blocks
[8]
.
Fig. 3
shows an example that four neighboring blocks in the previous frame,
f
_{1}
,
f
_{2}
,
f
_{3}
, and
f
_{4}
, are used to estimate the
f_{pred}
which is the bestmatched block to the current block
f_{cu rr}
.
Estimation of current block from the bestmatched block composed of neighboring blocks in the previous frame
The prediction block
f_{pred}
can be expressed with a vertical shifting matrix
h_{i}
and a horizontal shifting matrix
v_{i}
, i.e.,
where,
h_{i}
and
v_{i}
in Eq. (1) are defined as follows.
Here, the displacement matrix
D_{n}
is defined as
where,
I_{n}
is a
n
×
n
unit matrix. For example,
Fig. 4
demonstrates a process of predicting
f_{pred}
with ∆
x
= 2, ∆
y
= 0, and Eq. (5) and (6) represent
h_{i}
used in this example.
Example of estimation process of f_{pred} with ∆x = 2, ∆y = 0
Applying orthogonality and separability of DCT,
f_{pred}
, DCT coefficient of the prediction block, can be expressed from Eq. (1) as follows:
Note that
f_{pred}
can be obtained by carrying out eight operations of matrix multiplication and three operations of matrix subtraction. However, the amount of computation can be greatly reduced by utilizing the characteristic of energy concentration of DCT. Accordingly, Eq. (7) can be derived as follows, in order to relieve the computational complexity by taking the advantage of DCT transformed block
F_{i}
, i.e.,
All
F_{i}
(
m
,
n
) with zero in Eq. (8) can be excluded from computations so that the efficiency of prediction can be further improved using the scarcity of prediction block
F_{i}
. Considering that the value of coefficients in high frequency band are almost zero, it is necessary to redistribute the coefficients of
F_{i}
in the sequence of zigzag scanned order as
The recursive equation derived from Eq. (9) provides an efficient way to calculate
F_{i}
by adjusting nonzero factor of
F_{i}
. Furthermore, the computational complexity can be efficiently reduced according to the block size and the number of nonzero factors of
F_{i}
. From Eq. (9), the operation of outer product of
V_{i}
and
H_{i}
needs to be performed only with nonzero factors of
F_{i}
, and Eq. (9) can be expressed in a recursive form as
Here,
is a 4 × 4 zero matrix and
k
increases from one to the total number of nonzero DCT factors in the block.
3. Shifting Matrix for SubPixel Accuracy ME
The spatialdomain ME supporting subpixel accuracy utilizes interpolation algorithms which often upsample more than four times of its samples. Therefore, the performance of motion estimation depends on the interpolation algorithm applied to it. For example, the performance of bilinear interpolation incorporated in the ME with halfpixel accuracy is degraded, compared to the higherorder interpolation algorithms
[9]
. Meanwhile, the higherorder interpolation algorithms, such as bspline interpolation or cubic interpolation, each of which is given in Eq. (11) and Eq. (12), respectively, provide better prediction performance than the bilinear interpolation at the cost of large computational complexity. Accordingly, H.264 can enhance the image quality by utilizing a ME with quarterpixel accuracy; however it exacerbates the computational complexity at the same time.
This section presents a shifting matrix algorithm supporting quarterpixel accuracy of ME and proposes a transformdomain ME with bspline or cubic interpolation schemes, with which the performance can be enhanced compared to the bilinear interpolation. In order to apply bspline or cubic interpolation, the shifting matrix should be represented as follows.
Considering the quarterpixel accuracy of ME,
. If
n
< 0,
D_{n}
and
use 4 × 4 zero matrices, otherwise if
n
> 4,
and
. In Eqs. (13) and (14), ⌊
d
⌋ and ⌈
d
⌉ denote the roundingup and roundingdown values of
d
, respectively.
In the case of bilinear interpolation, the frequency information in both sides of interpolated location is used to estimate the motion at the subpixel accuracy. On the other hand, in the case of bspline or cubic interpolation, the frequency information in four locations is required. Therefore, when 
d
 < 1 or 
d
 > 3 , the frequency information beyond the boundary of reference block can be utilized. However, in order to obtain frequency information in a specific location from the block converted to the transform domain, all of the block information including the specific location is required, which causes an unnecessary access of memory. To resolve this problem, the vertical or horizontal intra prediction mode is adopted in predicting a specific location
[10]
.
For instance, if the current frame
f_{curr}
moves in yaxis at the amount of −4 <
d
< −3, the information beyond boundary of the search area is required and the problem of memory allocation happens. However, if the intra prediction mode is applied, the information beyond boundary can be estimated by duplicating information in the search area upwards as indicated in
Fig. 5
. In this way, the problem of unnecessary memory allocation can be effectively avoided. The following Eqs. (16) and (17) show the vertical shifting matrix in the ME with quarterpixel accuracy for the example illustrated in
Fig. 5
.
Estimation of information beyond boundary using intra prediction mode
It is worthwhile to note that the proposed mechanism does not require IDCT and alleviates the computational complexity by utilizing the recursive equation.
Table 1
compares the number of addition and multiplication operations used in the spatialdomain ME and the transformdomain ME when the block size is
N
×
N
and the search area is
p
. For the typical values of
N
= 4 and
p
= 8, the transformdomain ME decreases the number of addition and multiplication operations by about 27 and 3.5 times, respectively, compared to the spatialdomain ME, which confirms the outstanding decrease of computational complexity due to the transformdomain approach.
Comparison of computational complexity in spatialdomain ME and transformdomain ME
Comparison of computational complexity in spatialdomain ME and transformdomain ME
In order to compare the computational complexities of spatialdomain ME and transformdomain ME, two indices of γ
^{+}
and γ
^{*}
are introduced as
where,
and
are the numbers of addition operations in the spatialdomain ME and transformdomain ME, respectively, and
and
are the corresponding numbers of multiplication operations.
Fig. 6(a)
and
6(b)
show γ
^{+}
and γ
^{*}
when
N
ranges from 2 to 8 and
p
ranges from 2 to 10, respectively. In these ranges of
N
and
p
, γ
^{+}
lies between 0.024 and 0.063, while γ
^{*}
lies between 0.14 and 0.57; that is, the transformdomain ME remarkably decreases the numbers of addition and multiplication operations by about 16~42 times and 2~7 times, respectively, compared to the spatialdomain ME. Moreover, as shown in
Fig. 6
, both γ
^{+}
and γ
^{*}
almost linearly increase with respect to the increase of block size N, but they are hardly affected by search area
p
.
Relative computational complexity of transformdomain ME to spatialdomain ME
4. Simulation and Discussion
In this section, the performances of several ME algorithms are evaluated and compared via simulations. In the simulations, H.264/AVC video coder is used, the quarterpixel accuracy ME is considered, and the sum of absolute difference is used as a criterion to determine the motion vector. The performance is evaluated with two indices, PSNR and the number of bits per frame (BPF), which are widely used to measure video quality and compression efficiency. The performance is compared by considering the following several ME algorithms.
• H.264: This is a baseline algorithm as standardized by ITUT and ISO. It is the most typical spatialdomain ME algorithm and uses a 4tap FIR filter as the interpolation scheme.
• CSD/BSD: These two are the spatialdomain ME algorithms with the cubic and bspline interpolation schemes, respectively.
• CTD/BTD: These two are the proposed transformdomain ME algorithms with the cubic and bspline interpolation schemes, respectively.
• HCL/HBL: These two correspond to the enhanced versions of CTD and BTD, respectively. In order to further mitigate the computational complexity, a hybrid approach is adopted, i.e., the bilinear interpolation is applied to halfpixel and the cubic (HCL) or bspline (HBL) interpolation is applied to quarterpixel.
Note that H.264, CSD, and BSD are the spatialdomain ME algorithms while the others are the transformdomain ME algorithm as proposed in this paper.
Table 2
and
3
demonstrate the performance of several ME algorithms using akiyo and football sequences in CIF format with different value of quantization parameter (QP), respectively. In these tables, the values in bold type indicate the highest PSNR and lowest BPF for each QP. In akiyo sequence (see
Table 2
), the transformdomain ME algorithms.
Performance comparison of several motion estimation algorithms for akiyo sequence
Performance comparison of several motion estimation algorithms for akiyo sequence
Performance comparison of several motion estimation algorithms for football sequence
Performance comparison of several motion estimation algorithms for football sequence
However, when QP≥21, the transformdomain algorithms require slightly higher bitrate than the spatialdomain algorithms, which results from the increase of quantization error with respect to the increase of QP.
In the case of football sequence, as shown in
Table 3
, the transformdomain algorithms have higher PSNR than the spatialdomain algorithms, regardless of QP values. In terms of BPF, CTD requires more bits than the spatial domain algorithms for some specific values of QP (e.g., QP=3 and QP≥21). This is because the quantization errors are accumulated when estimating locations at halfpixel and quarterpixel. When estimating the location of certain subpixels such as 1.25, 1.5, 1.75, 3.25, 3.5 and 3.75 by using the shifting matrix, some reference locations are beyond the boundary and are forced to be duplicated as shown in
Fig. 5
. Interpolating these locations causes the estimation errors, leading to the decrease of PSNR. This problem can be effectively dealt with the hybrid algorithms, HCL and HBL. By applying the differentiated interpolation scheme in halfpixel location (bilinear interpolation) and quarterpixel location (cubic interpolation in HCL and bspline interpolation in HBL), the quantization error is reduced, which contributes to the increase of PSNR accordingly.
Fig. 7(a)
and
7(b)
show the values of PSNR for 50 frames with akiyo and football sequences, respectively. Here, the value of QP is fixed to 18. As shown in
Fig. 7
, the transformdomain algorithms achieve higher PSNR than the spatialdomain algorithms, on the whole. Also, it is shown that HCL outperforms the others for the most frames in the case of akiyo sequence (see
Fig. 7(a)
), while HBL and BTD attain the highest value of PSNR in the case of football sequence (see
Fig. 7(b)
). The value of PSNR is high in the order of CSD, BSD, and H.264 among the spatialdomain algorithms, while it is high in the order of the hybrid algorithms (HCL and HBL), BTD and CTD among the transformdomain algorithms. When the interpolation is performed in the spatial domain, the surrounding pixels of target location are used to produce an interpolated pixel; however, the proposed transformdomain algorithms use the DCT coefficient instead. In the case of blockbased 2D DCT, most of the information in the block is used to obtain the specific frequency component. This is the reason why the performance of specific spatialdomain ME does not retain a consistently in the transform domain.
Performance comparison of motion estimation in spatial domain and transform domain (QP =18)
5. Conclusion
This paper proposed the lowcomplexity transformdomain ME algorithm supporting quarterpixel accuracy. By utilizing the shifting matrix in the transform domain, the proposed algorithm can attain high quality of video and high efficiency of compression, while reducing the computational complexity substantially. The simulation results verified that the proposed algorithm not only provides higher PSNR but also requires smaller number of bits per frame compared to the conventional ME algorithms in the spatial domain. Moreover, due to the utilization of recursive equation and lack of IDCT block, the proposed transformdomain ME algorithm significantly relieved the computational complexity, which is the critical problem of the spatialdomain ME algorithms. As a result, the proposed algorithm is quite suitable for the realtime or streaming video service in mobile devices or embedded systems that have the limited capability of computing power, memory space, battery lifetime, and communication bandwidth. The proposed algorithm is also applicable to the nextgeneration video compression standard of H.265 by revising the shifting matrix to cope with the variable block size from 4 × 4 to 32 × 32. The gain of compression efficiency that can be achieved by the proposed algorithm is further intensified with a large block size such as 32 × 32, by utilizing the lowcomplexity recursive equation
[11]
.
Acknowledgements
This work was supported by the research program of Dongguk University.
BIO
Chul Ryu He received his M.S and Ph.D degrees in electrical engineering in 1991 and 1997, respectively, both from the Polytechnic Institute of New York University. From 1998 to 1999, he served as a research engineer of advanced wireless technologies lab at LG Electronics. He has been a professor of the Department of Information and Communication Engineering, Dongguk University, Seoul since 1999. His research area includes visual communication, superresolution reconstruction, and realtime hardware design for video codec.
JaeYoung Shin She received Master’s degree in Information and Communication Engineering from Dongguk University. Her research interests are digital image processing and compression.
EunChan Park He received B.S, M.S, and Ph.D degrees from the School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea, in 1999, 2001, 2006, respectively. From 2006 to 2008, he was a senior engineer with Samsung Electronics, Korea. He is currently an associate professor in the Department of Information and Communication Engineering, Dongguk University, Seoul since 2006. His research interests include network performance analysis, resource allocation and MAC protocol design of wireless networks.
Agha S.
,
Dwyer V. M
2003
“Algorithm and VLSIarchitectures for MPEG4 motion estimation,”
Oh J.
2010
“Improved subblock matching algorithm,”
Journal of Korean Institute of Communications and Information Sciences
33
(7)
628 
633
Keller Y.
,
Averbuch A.
2003
“Fast gradient methods based on global motion estimation for video compression,”
IEEE Transactions on Circuits and Systems for Video Technology
13
(4)
300 
309
DOI : 10.1109/TCSVT.2003.811360
Nguyen Q. H.
,
Nguyen V. A.
,
Trinh C. V.
,
Dinh K. Q.
,
Park Y.
,
Jeon B.
2014
“Blockbased transformdomain measurement coding for compressive sensing of images,”
Journal of Korean Institute of Communications and Information Sciences
39A
(12)
746 
755
DOI : 10.7840/kics.2014.39A.12.746
Yang C. M.
,
hung K. C.
2014
“Embedded image compression scheme using ratedistortion optimized block coding of wavelet coefficients,”
Journal of Korean Institute of Communications and Information Sciences
39A
(11)
625 
636
DOI : 10.7840/kics.2014.39A.11.625
Song J.
,
Yeo B.
2000
“A fast algorithm for DCTdomain inverse motion compensation based on shared information a macroblock,”
IEEE Transactions on Circuits and Systems for Video Technology
10
(5)
767 
775
DOI : 10.1109/76.856453
Kleihorst R.
,
Caberera F.
1998
“Implement of DCT domain motion estimation and compensation,”
Proceedings of IEEE Workshop on Signal Processing Systems (SIPS)
53 
62
Plompen R.H.J.M.
,
Schuurink B. F.
,
Biemond J.
1985
“A new motioncompensated transform coding scheme,”
IEEE Transactions on Acoutstics, Speech and Signal Process
1
371 
374
Kang M.
,
Heo J.
,
Ryu C.
2011
“Halfpixel accuracy motion estimation algorithm in the transform domain for H.264,”
Journal of Korean Institute of Communi cations and Information Sciences
33
(11)
917 
924
Wiegand T.
,
Sullivan G. J.
,
Bjontegaard G.
,
Luthra A.
2003
“Overview of the H.264/AVC video coding standard,”
IEEE Transactions on Circuits and Systems for Video Technology
13
(7)
560 
576
DOI : 10.1109/TCSVT.2003.815165
Yoon D.
,
Ho Y.
2012
“Fast mode decision method for HEVC in depth video,”
Journal of Korean Institute of Communications and Information Sciences
37
(1A)
51 
56
DOI : 10.7840/KICS.2012.37A.1.51