Auto-Covariance Analysis for Depth Map Coding
KSII Transactions on Internet and Information Systems (TIIS). 2014. Sep, 8(9): 3146-3158
Copyright © 2014, Korean Society For Internet Information
  • Received : June 12, 2014
  • Accepted : July 23, 2014
  • Published : September 30, 2014
Lei Liu, Yao Zhao, Chunyu Lin, Huihui Bai

Efficient depth map coding is crucial to the multi-view plus depth (MVD) format of 3-D video representation, as the quality of the synthesized virtual views depends strongly on the accuracy of the depth map. A depth map contains smooth areas within objects but distinct boundaries, and these boundary areas significantly affect the visual quality of the synthesized views. In this paper, we characterize the depth map by an auto-covariance analysis that reveals its locally anisotropic features. Based on this analysis, we propose an efficient depth map coding scheme in which the directional discrete cosine transform (DDCT) replaces the conventional 2-D DCT to preserve the boundary information and thereby increase the quality of the synthesized view. Experimental results show that the proposed scheme outperforms the conventional DCT in terms of both bitrate savings and rendering quality.
1. Introduction
Recent years have witnessed significant attention to three-dimensional (3-D) video. 3-D video enables viewers to select an interactive viewpoint and gain an immersive experience of real scenes [1]. However, applications of 3-D video suffer from the huge amount of multi-view video data to be compressed and transmitted. To solve this problem, multi-view video coding (MVC) [2] has been proposed and standardized as an extension of H.264/MPEG-4 AVC by the joint video team of the ITU-T video coding experts group (VCEG) and the ISO/IEC moving picture experts group (MPEG). MVC exploits the inter-view correlation between different video sequences and the spatial/temporal correlation within each single sequence. Although MVC achieves tolerable compression, the remaining data volume is still too large to process. Another feasible solution is the multi-view video plus depth (MVD) representation [3], in which the intermediate virtual (novel) views are generated from the transmitted video views and their corresponding depth maps. In the MVD coding structure, video views along with their depth maps are encoded and transmitted; at the decoder side, a number of desired intermediate views can be synthesized from the neighboring viewpoints via depth-image-based rendering (DIBR) techniques [4].
A depth map represents the distance between the capturing camera and the objects in the scene. It can be regarded as a gray-scale image. Fig. 1 presents a texture (color) image, Cones, and its corresponding depth image. As can be seen, the depth image contains almost no texture but has sharp object boundaries: the gray levels are nearly constant in most regions within an object but change abruptly across the boundaries [5]. The depth map plays an important role in virtual view synthesis. During view synthesis, distortion of the depth data, especially around object boundaries, leads to geometry changes and occlusion variations in the texture image, which seriously degrades the quality of the synthesized views [6]. Therefore, efficient depth map coding that preserves the depth information (especially the boundary fidelity) is an essential part of 3DV systems.
Fig. 1. An example of texture image and depth image (Cones). (a) texture image; (b) depth image.
In image and video compression, the two-dimensional discrete cosine transform (2-D DCT) is the most widely used transform because of its efficient energy compaction and computational simplicity. However, the 2-D DCT has been shown to be inefficient for image blocks with complex textures or edges [7]. Since the quality of synthesized views is very sensitive to the boundary accuracy of the depth, the DCT is unlikely to be the best choice for depth map coding either.
In this paper, we analyze the locally anisotropic features of the depth map by an adaptive auto-covariance characterization, which reveals some special statistical characteristics of the depth map. Based on this analysis, we propose an efficient depth map coding scheme using directional discrete cosine transforms (DDCT) [7] adapted to these locally anisotropic features, which preserves the boundary accuracy of depth maps well and consequently increases the quality of the synthesized views.
The rest of the paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 first presents the auto-covariance analysis of the depth map and then the proposed coding scheme. Experimental results and performance comparisons are shown in Section 4, and Section 5 concludes the paper.
2. Related Works
- 2.1 Depth Map Coding
A direct approach to compressing the depth map is to treat it as an ordinary image or video and process it with existing coding standards such as JPEG, H.264/AVC [8], or the more recent high efficiency video coding (HEVC) standard [9]. However, depth maps have unique characteristics that differ from those of texture/color images, which make these compression tools unsuitable for depth map coding. First, a depth map solely represents the distance between the camera and the objects, so depth levels within an object are nearly the same. It contains no texture, only smooth regions within objects and the background, separated by discontinuous boundaries. Second, the temporal consistency of depth maps is much lower than that of color video, because of low-resolution depth capture devices or inaccurate depth estimation methods. Moreover, the depth map is never displayed; it only assists the decoder in synthesizing the virtual views. Thus, to achieve optimal synthesis results, the effect of depth distortion on the synthesized views needs careful investigation during the depth map coding process [10].
Considering these characteristics, several approaches have been proposed for efficient depth map coding. Kang and Ho [5] proposed an adaptive geometry-based intra prediction method, in which partitioned intra prediction modes are generated along object boundaries to reduce the coding loss of boundary information. In [11], an edge-aware intra prediction method was proposed to reduce the prediction error in blocks with arbitrary edge shapes, using a graph-based representation of the pixels built from edge information. Besides these geometry-based methods, [12 - 14] proposed shape-adaptive transforms for efficient depth map coding. These transforms require the edge information to be known a priori and operate along the detected edges rather than across them. They produce smaller coefficients and achieve a remarkable improvement in coding efficiency. However, none of these works analyzes the characteristics of the depth map quantitatively, so further analysis is needed to quantify them.
- 2.2 Directional Transforms in Image Coding
The conventional 2-D DCT used in image and video compression is usually implemented as two separable 1-D DCTs along the horizontal and vertical directions, respectively. However, many image blocks contain directional information other than horizontal or vertical, such as anisotropic edges, boundaries and textures. When the two 1-D DCTs are applied across these edges, unnecessary non-zero coefficients are produced, so the conventional 2-D DCT is not the best choice for such blocks [7].
Oriented information is very important to the human visual system. To achieve high coding performance, it must be exploited and preserved as much as possible. Many studies take directional information into account and show significant coding gains by exploiting it within images [7 , 15 - 17]. The video coding standard H.264/AVC includes several directional intra predictions (including the vertical and horizontal directions); furthermore, the recently finalized HEVC provides up to 33 directional prediction modes for its prediction unit. Zeng and Fu proposed a directional discrete cosine transform (DDCT) framework [7], in which the first 1-D transform of the conventional 2-D DCT is reorganized to follow the dominant edge direction of an image block, and the resulting coefficients are rearranged to align with each other so that the second transform is a horizontal one. Theoretical analysis showed that the DDCT reaches a remarkable coding gain compared with the conventional DCT.
3. Depth characteristic analysis and coding
In this section, we first analyze the characteristics of the depth maps. Then we briefly introduce the DDCT framework. Finally, the depth map is coded using DDCT with all the available directional modes, in which a synthesized view distortion optimization is performed to select the best DDCT mode.
- 3.1 Auto-Covariance Analysis for Depth Map
A stationary first-order Markov signal has an auto-covariance given by:
$$R(I) = \rho^{|I|} \qquad (1)$$
where I represents the distance between the two elements from which the auto-covariance is computed and ρ is the correlation coefficient between adjacent elements.
Images can be modeled as Markov signals, as the value of a pixel in practice depends only on a finite number of neighboring pixels. For the 2-D auto-covariance function, a separable model and a generalized model can be constructed from (1), as given in (2) and (3), respectively [18]:
$$R_s(I, J) = \rho_1^{|I|} \rho_2^{|J|} \qquad (2)$$
$$R_g(\theta, I, J) = \rho_1^{|I\cos\theta + J\sin\theta|} \rho_2^{|-I\sin\theta + J\cos\theta|} \qquad (3)$$
The generalized model is a rotated case of the separable model, which can capture the local anisotropies within images. The parameter θ represents the rotation angle.
Considering the ground truth disparity (depth) image of Cones shown in Fig. 1 (b), we estimate the parameters of the two models, i.e., ρ 1 and ρ 2 for the separable model, and ρ 1 , ρ 2 and θ for the generalized model. The depth map is divided into 8x8 blocks, and the auto-covariance coefficients of each block are first estimated using the unbiased estimator. The parameters ρ 1 , ρ 2 and θ are then found by minimizing the mean square error between the estimated auto-covariance and the models in (2) and (3). To make the plots easier to read, ρ 1 is always chosen as the larger of the two covariance coefficients and θ varies between 0 and π . The estimation results are shown in Fig. 2 , where each point is estimated from one 8x8 block.
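The estimation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the normalized model forms $\rho_1^{|I|}\rho_2^{|J|}$ (separable) and $\rho_1^{|I\cos\theta + J\sin\theta|}\rho_2^{|-I\sin\theta + J\cos\theta|}$ (generalized) from [18], replaces the unspecified minimizer with a coarse grid search, and uses hypothetical names (`autocov_coeff`, `fit`), a hypothetical lag range of 3, and a synthetic 8x8 block with a diagonal step edge.

```python
import math

def autocov_coeff(block, max_lag=3):
    """Unbiased auto-covariance estimate of a square block, normalised by its variance."""
    n = len(block)
    mean = sum(map(sum, block)) / (n * n)
    z = [[v - mean for v in row] for row in block]

    def cov(di, dj):
        total, count = 0.0, 0
        for m in range(n):
            for k in range(n):
                m2, k2 = m + di, k + dj
                if 0 <= m2 < n and 0 <= k2 < n:
                    total += z[m][k] * z[m2][k2]
                    count += 1
        return total / count  # divide by the number of overlapping pairs

    var = cov(0, 0)
    if var == 0.0:
        return None  # flat block: correlation coefficients are undefined
    return {(di, dj): cov(di, dj) / var
            for di in range(0, max_lag + 1)
            for dj in range(-max_lag, max_lag + 1)}

def model(rho1, rho2, theta, di, dj):
    """Generalized model; theta = 0 reduces it to the separable model."""
    a = abs(di * math.cos(theta) + dj * math.sin(theta))
    b = abs(-di * math.sin(theta) + dj * math.cos(theta))
    return (rho1 ** a) * (rho2 ** b)

def fit(coeffs, thetas):
    """Grid search for (mse, rho1, rho2, theta) minimising the model error."""
    rhos = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    best = None
    for theta in thetas:
        for r1 in rhos:
            for r2 in rhos:
                err = sum((model(r1, r2, theta, di, dj) - c) ** 2
                          for (di, dj), c in coeffs.items()) / len(coeffs)
                if best is None or err < best[0]:
                    best = (err, r1, r2, theta)
    return best

# Synthetic 8x8 block with a step edge along an anti-diagonal (assumed test data).
block = [[50.0 if m + k < 8 else 140.0 for k in range(8)] for m in range(8)]
coeffs = autocov_coeff(block)
sep = fit(coeffs, [0.0])                                    # separable model
gen = fit(coeffs, [k * math.pi / 16 for k in range(16)])    # generalized model
print("separable  :", sep)
print("generalized:", gen)
```

Because the angle grid contains θ = 0, the generalized fit can never be worse than the separable one on the same block; how much better it is depends on how strongly oriented the block content is.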
Fig. 2. Estimated auto-covariance model parameters for the depth of Cones. (a) separable model; (b) generalized model.
As can be seen in Fig. 2 (a) and (b), the points in the plot for the generalized model concentrate towards the southeast region, while the points in the plot for the separable model are distributed more evenly. Quantitatively, most values of ρ 1 in Fig. 2 (b) are larger (they tend to 1) than those in Fig. 2 (a), while the values of ρ 2 remain nearly the same. This demonstrates that the correlation of pixels oriented along the angle θ is enhanced. It implies that the generalized model, which considers the directional information within an image, provides a more faithful characterization of the image, so better compression can be expected accordingly.
- 3.2 Directional DCT
Eight directional modes are defined in the DDCT framework, following the intra prediction modes used in H.264/AVC. They are denoted as Modes 0-1 and Modes 3-8. Modes 0 and 1 are the vertical and horizontal modes, respectively, and both reduce to the conventional DCT (we denote them jointly as Mode 0/1). Modes 3-8 are diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left and horizontal-up, respectively. We use these notations directly in this paper. Mode 2, i.e., the DC mode, is not considered here, so there are seven directional modes in total. Once the directional mode is chosen, the DDCT performs the first 1-D transform along the chosen direction, followed by the second 1-D transform arranged as a horizontal one. Finally, a modified zigzag scan converts the resulting coefficients into a 1-D sequence to facilitate run-length-based VLC. Three extra bits are needed to signal the selected mode for each image block.
To show the weakness of the 2-D DCT and the effectiveness of the DDCT, we present an example in Fig. 3 . An artificial 8x8 block with a distinct directional (diagonal) boundary is first created, as shown in Fig. 3 (a). The pixels along the boundary have the same grey level x ( i , j )=140, while the remaining pixels have grey level 50. The artificial block is then transformed by the 2-D DCT and the DDCT (with the corresponding directional mode), and the coefficients are shown in (4) and (5), respectively:
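As an illustration of the two-step transform, the sketch below implements only the diagonal down-left mode (Mode 3) for a square block. It is a simplified reading of the DDCT description in [7], not a reference implementation: the function names are hypothetical, an orthonormal DCT-II is assumed for both steps, and the modified zigzag scan and the other directional modes are omitted.

```python
import math

def dct1d(x):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def ddct_diag_down_left(block):
    """Two-step directional DCT along anti-diagonals (i + j = const)."""
    n = len(block)
    # Step 1: 1-D DCT along each of the 2n-1 anti-diagonals (lengths 1..n..1).
    cols = []
    for d in range(2 * n - 1):
        diag = [block[i][d - i] for i in range(max(0, d - n + 1), min(n, d + 1))]
        cols.append(dct1d(diag))
    # Step 2: for each coefficient index k, gather the k-th coefficient of every
    # diagonal that is long enough and apply a second 1-D DCT across them.
    rows = []
    for k in range(n):
        row = [c[k] for c in cols if len(c) > k]
        rows.append(dct1d(row))
    return rows  # row k holds 2n-1-2k coefficients, n*n in total

# Assumed test block: one off-centre anti-diagonal at level 140, the rest at 50.
block = [[140.0 if i + j == 5 else 50.0 for j in range(8)] for i in range(8)]
coef = ddct_diag_down_left(block)
nonzero = sum(1 for row in coef for v in row if abs(v) > 1e-6)
print("coefficients:", sum(len(r) for r in coef), "non-zero:", nonzero)
```

For a block that is constant along every anti-diagonal, the first step leaves only one DC value per diagonal, so at most 2N-1 coefficients can be non-zero after the second step. Both steps are orthonormal, so the total energy of the block is preserved.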
[Eqs. (4) and (5): 8x8 coefficient matrices of the 2-D DCT and the DDCT, respectively]
Fig. 3. An example of the defect of 2-D DCT. (a) original block with a directional boundary; (b) reconstructed block by 2-D DCT with a quantization step 30 (MSE=95.69); (c) reconstructed block by DDCT with the same quantization step (MSE=27.43).
The results show that the DDCT coefficients are sparser than the 2-D DCT coefficients and have a different energy distribution. The coefficients are then quantized using a uniform quantizer with a quantization step (Q-Step) of 30. Finally, the de-quantized values are passed through the inverse transforms to obtain the reconstructed blocks, which are shown in Fig. 3 (b) and (c). As can be seen in Fig. 3 (b), the values along the boundary vary considerably, and there are many visible distortions around the boundary. In contrast, Fig. 3 (c) shows that the values along the boundary are much more consistent with the original block, with less distortion around the boundary.
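The quantize-and-reconstruct step of this experiment can be sketched for the conventional 2-D DCT as follows. The exact block of Fig. 3 (a) is not recoverable from the text, so the block below (an off-centre anti-diagonal line at level 140 on a background of 50) is an assumption, and the resulting MSE is not expected to match the values reported in Fig. 3. Function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Rows are the orthonormal DCT-II basis vectors of length n."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def dct2(block):
    """Separable 2-D DCT: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def idct2(coef):
    """Inverse of dct2: C^T * Y * C."""
    c = dct_matrix(len(coef))
    return matmul(matmul(transpose(c), coef), c)

def quantize(coef, step):
    """Uniform quantizer followed by de-quantization (round to nearest multiple of step)."""
    return [[step * math.floor(v / step + 0.5) for v in row] for row in coef]

def mse(a, b):
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Assumed test block, not the block of Fig. 3 (a).
block = [[140.0 if i + j == 5 else 50.0 for j in range(8)] for i in range(8)]
recon = idct2(quantize(dct2(block), 30.0))
error = mse(block, recon)
print("MSE after 2-D DCT, Q-Step 30:", error)
```

Since the transform is orthonormal, each coefficient is off by at most half the step after quantization, so the pixel-domain MSE is bounded by (Q-Step/2)^2 = 225 here. Replacing `dct2`/`idct2` with a directional transform pair would allow the same comparison for the DDCT.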
- 3.3 Directional Mode Selection
According to the analysis in Sec. 3.1, taking the directional information into account may lead to better coding performance. We therefore use the DDCT (including the 2-D DCT as a special case) for depth map coding. Each block of the depth map is encoded with every available transform mode. A rate-distortion optimization (RDO) function is then formed as in (6) to select the transform mode:
$$J_i^{mode} = D_i^{mode} + \lambda R_i^{mode} \qquad (6)$$
where $D_i^{mode}$ is the distortion (measured by MSE) of the i -th block for the current DDCT mode, $R_i^{mode}$ is the number of bits needed to encode that block, and λ is the Lagrangian multiplier. All the available modes are evaluated for the block, and the one with the minimum R-D cost is selected as the best mode for that block.
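The mode decision can be illustrated with a short sketch. It assumes the usual Lagrangian form of the R-D cost, distortion plus λ times rate; the multiplier `lam`, the function name and the numbers in the example are hypothetical and are not taken from the paper.

```python
def select_mode(candidates, lam):
    """Return the mode with the minimum cost D + lam * R.

    candidates maps a mode label to (distortion_mse, rate_bits); the rate is
    assumed to already include the bits that signal the mode.
    """
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Hypothetical per-mode measurements for one block.
candidates = {
    "0/1": (95.0, 20),  # conventional 2-D DCT
    "3": (30.0, 40),    # diagonal down-left
    "4": (60.0, 38),    # diagonal down-right
}
print(select_mode(candidates, lam=1.0))   # favors low distortion
print(select_mode(candidates, lam=10.0))  # favors low rate
```

A larger multiplier penalizes rate more heavily, so the decision shifts toward the cheapest mode as `lam` grows.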
However, the cost function in (6) does not consider the synthesis distortion introduced during the DIBR process and therefore does not give optimal performance. A synthesized view distortion optimization (SVDO) function that considers both the depth map error and the quality of the virtually rendered view is used instead, as follows [19]:
[Eq. (7): SVDO cost function, see [19]]
where $\Delta D_k$ is the depth distortion at position k , the three color terms are the color pixel values at positions k , k - 1 and k + 1, respectively, and α is a coefficient determined by the camera parameters through the following equation:
$$\alpha = \frac{f \cdot L}{255} \left( \frac{1}{Z_{near}} - \frac{1}{Z_{far}} \right) \qquad (8)$$
where f is the focal length, L is the baseline between the current and rendered views, and $Z_{near}$ and $Z_{far}$ are the nearest and farthest depth values of the scene, respectively.
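The coefficient α can be computed as in the sketch below. It assumes the form commonly used with 8-bit depth maps, for example in [19], namely α = (f · L / 255) · (1/Znear − 1/Zfar); the function names and the camera values in the example are hypothetical.

```python
def alpha(f, baseline, z_near, z_far):
    """Scale factor from an 8-bit depth-level error to a horizontal position shift.

    Assumed form: alpha = f * L / 255 * (1 / z_near - 1 / z_far).
    """
    if z_near <= 0 or z_far <= z_near:
        raise ValueError("expected 0 < z_near < z_far")
    return f * baseline / 255.0 * (1.0 / z_near - 1.0 / z_far)

def position_shift(a, depth_error):
    """Approximate pixel shift in the rendered view caused by a depth-level error."""
    return a * abs(depth_error)

# Hypothetical camera setup: focal length 1000 px, baseline 5, depth range [10, 20].
a = alpha(1000.0, 5.0, 10.0, 20.0)
print(a, position_shift(a, 4))
```

The shift grows linearly with the depth error, which is why errors at object boundaries, where neighboring color pixels differ strongly, are the most visible after rendering.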
This cost function combines the position shift caused by depth errors with the difference between adjacent texture image pixels during the warping/rendering process. It can therefore give a more reasonable mode selection in terms of rendered view quality.
4. Experimental Results
To validate the performance of the proposed scheme, we compare it with the coding scheme based on the conventional 2-D DCT. Five ground truth disparity images from the Middlebury stereo datasets [20], Barn1 , Cones , Poster , Sawtooth and Venus , are tested in the experiments; they contain piecewise planar scenes with distinct visible directional boundaries. For each test set, the disparity image (depth) of the second view is coded, and the virtual view is warped from the second view and the coded depth. To evaluate the quality of the warped synthesized view, we use the virtual view warped from the original second view and its original depth as the reference.
Fig. 4 shows the R-D performance (PSNR vs. Bitrate) for the synthesized virtual views of the test images. As can be seen, compared with the 2-D DCT, the DDCT (with both mode selection methods) achieves better compression performance at all the encoding bitrates for all the test sets. Moreover, mode selection using the SVDO cost function outperforms that using the RDO cost function, as the former takes the rendered view quality into account.
Fig. 4. R-D performance for the synthesized views (Bitrate-PSNR).
For a more direct comparison, the Bjontegaard Delta PSNR (BD-PSNR) and Bjontegaard Delta Bitrate (BD-Bitrate) [21] of the synthesized views between the proposed scheme and the 2-D DCT based scheme are shown in Fig. 5 . The BD-PSNR measures the average vertical distance between two R-D curves [ Fig. 4 (a)-(e)] and gives the average coding gain in dB [ Fig. 5 (a)]. The BD-Bitrate measures the average horizontal distance between two R-D curves [ Fig. 4 (a)-(e)] and gives the average bitrate saving as a percentage [ Fig. 5 (b)]. These results show that the proposed scheme achieves a maximum coding gain of 2.81 dB and an average gain of 1.80 dB, or equivalently a maximum bitrate saving of 18.40% and an average saving of 13.20%.
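A BD-PSNR calculation in the spirit of [21] can be sketched as follows: each R-D curve is represented by a cubic polynomial in the log-bitrate domain, and the difference of the two polynomials is averaged over the overlapping bitrate range. The sketch assumes exactly four R-D points per curve, so the cubic fit reduces to interpolation; the function names and the sample R-D points are hypothetical.

```python
import math

def _cubic(xs, ys):
    """Cubic Lagrange interpolant through four (x, y) points."""
    def p(x):
        total = 0.0
        for i in range(4):
            term = ys[i]
            for j in range(4):
                if j != i:
                    term *= (x - xs[j]) / (xs[i] - xs[j])
            total += term
        return total
    return p

def _integrate(p, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree three."""
    return (b - a) / 6.0 * (p(a) + 4.0 * p((a + b) / 2.0) + p(b))

def bd_psnr(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average PSNR difference (test minus reference) over the common log-rate range."""
    lr1 = [math.log10(r) for r in rates_ref]
    lr2 = [math.log10(r) for r in rates_test]
    lo = max(min(lr1), min(lr2))
    hi = min(max(lr1), max(lr2))
    p1, p2 = _cubic(lr1, psnr_ref), _cubic(lr2, psnr_test)
    return (_integrate(p2, lo, hi) - _integrate(p1, lo, hi)) / (hi - lo)

# Hypothetical R-D points (bitrate, PSNR in dB) for a reference and a test codec.
rates = [100.0, 200.0, 400.0, 800.0]
psnr_a = [30.0, 33.0, 35.5, 37.5]
psnr_b = [p + 1.5 for p in psnr_a]  # test curve shifted up by a constant
print(bd_psnr(rates, psnr_a, rates, psnr_b))
```

The BD-Bitrate follows the same pattern with the roles of the axes swapped: log-bitrate is interpolated as a function of PSNR, and the averaged difference is converted back to a percentage.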
Fig. 5. The average coding gains and average bitrate savings for the synthesized views. (a) average coding gains; (b) average bitrate savings.
Fig. 6 shows the difference between the synthesized view and the reference view of Barn1 (with an enlarged portion). It can be seen that the depth map coded by the proposed scheme produces less distortion in the synthesized view, especially in the area highlighted by the red rectangle.
Fig. 6. The difference between the synthesized views using various compressed depths and the reference view of Barn1. (a) reference view synthesized using the original depth; (b) view synthesized using depth compressed by 2-D DCT; (c) view synthesized using depth compressed by DDCT+RDO; (d) view synthesized using depth compressed by DDCT+SVDO. The figures in the bottom row are the enlarged portions of the red rectangle areas in the top row.
Fig. 7 shows the mode selection results using the two cost functions for Barn1 . As can be seen, the most frequent choice is Mode 0/1, i.e., the conventional 2-D DCT is selected in nearly all the smooth areas; around the boundaries, however, various directional modes are selected. This agrees well with the characteristics of the depth map. The directional modes distributed around the boundaries explain, to some extent, the coding gains (or bitrate savings) obtained with the DDCT. Although the number of blocks using directional modes is quite small compared with Mode 0/1, the directional transforms consistently achieve remarkable coding gains. This indicates that the fidelity of the boundary areas is decisive for the quality of the synthesized views. Moreover, as shown in Fig. 8 , the two cost functions lead to somewhat different mode distributions. This is why DDCT plus SVDO always outperforms DDCT plus RDO, as shown in Fig. 5 (DDCT+SVDO vs. DDCT+RDO).
Fig. 7. DDCT mode distributions for Barn1 using two cost functions. (a) RDO; (b) SVDO.
Fig. 8. The difference between the mode selection results using RDO and SVDO.
5. Conclusion
In this paper, the statistical characteristics of the depth map are first analyzed by an auto-covariance model. To better preserve the boundary information of the depth, an efficient depth map coding scheme is proposed using directional discrete cosine transforms (DDCT), in which the conventional DCT is modified to operate along the image boundaries. A rate-distortion optimization cost function that considers both depth errors and view synthesis errors is adopted to select the best directional mode for each image block. Experimental results show that, by exploiting the directional information of the boundaries, the proposed scheme achieves a significant performance improvement for depth map coding and thereby improves the quality of the synthesized views. Future work will focus on analyzing the influence of depth map distortion on the synthesized view. A more sophisticated rate-distortion optimization metric based on the view synthesis distortion also needs careful investigation.
Lei Liu received the B.E. degree from Shandong University of Science and Technology, Qingdao, China, in 2009, and the M.E. degree from Taiyuan University of Science and Technology, Taiyuan, China, in 2012. He is currently working toward the PhD degree at Beijing Jiaotong University, Beijing, China. His current research interests include image/video compression and processing.
Yao Zhao received the BS degree from Fuzhou University, China, in 1989, and the ME degree from Southeast University, Nanjing, China, in 1992, both from the Radio Engineering Department, and the PhD degree from the Institute of Information Science, Beijing Jiaotong University (BJTU), China, in 1996. He became an associate professor at BJTU in 1998 and became a professor in 2001. From 2001 to 2002, he was a senior research fellow with the Information and Communication Theory Group, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands. He is currently the director of the Institute of Information Science, BJTU. His current research interests include image/video coding, digital watermarking and forensics, and video analysis and understanding. He serves on the editorial boards of several international journals, including as associate editors of IEEE Transactions on Cybernetics, IEEE Signal Processing Letters, and an area editor of Signal Processing: Image Communication (Elsevier), etc. He was named a distinguished young scholar by the National Science Foundation of China in 2010, and was elected as a Chang Jiang Scholar of Ministry of Education of China in 2013. He is a senior member of the IEEE and a fellow of IET.
Chunyu Lin was born in Liaoning Province, China. He works as a lecturer at Beijing Jiaotong University, where he obtained his doctoral degree in 2011. From 2009 to 2010, he was a visiting researcher at the ICT group of Delft University of Technology, The Netherlands. From 2011 to 2012, he was a postdoc at Ghent University, Belgium. His research interests are in the areas of image/video compression and robust transmission, stereo matching and 3D video coding.
Huihui Bai received the Ph.D. degree in signal and information processing from Beijing Jiaotong University (BJTU), Beijing, China, in 2008. She is currently an associate professor in the Institute of Information Science at BJTU. She has been engaged in R&D work on video coding technologies and standards such as HEVC, 3D video compression, multiple description video coding (MDC) and distributed video coding (DVC). She is leading or participating in several research projects funded by the 973 Program, the 863 Program, the National Natural Science Foundation of China, the Beijing Natural Science Foundation and the Jiangsu Provincial Natural Science Foundation.
References
[1] Smolic A., Mueller K., Merkle P., Fehn C., Kauf P., Eisert P., Wiegand T., "3D video and free view-point video-technologies, applications and MPEG standard," in Proc. of IEEE Int. Conf. Multimedia and Expo (ICME 2006), pp. 2161-2164, 2006.
[2] Vetro A., Yea S., Zwicker M., Matusik W., Pfister H., "Overview of multiview video coding and anti-aliasing for 3D displays," in Proc. of IEEE Int. Conf. Image Process. (ICIP 2007), pp. I-17 - I-20, 2007.
[3] Merkle P., Smolic A., Muller K., Wiegand T., "Multi-view video plus depth representation and coding," in Proc. of IEEE Int. Conf. Image Process. (ICIP 2007), pp. I-201 - I-204, 2007.
[4] Fehn C., "A 3D-TV approach using depth-image-based rendering (DIBR)," in Proc. of Visual., Imag., Image Process., pp. 482-487, 2003.
[5] Kang M.-K., Ho Y.-S., "Depth video coding using adaptive geometry based intra prediction for 3-D video systems," IEEE Trans. Multimedia, vol. 14, no. 1, pp. 121-128, 2012. DOI: 10.1109/TMM.2011.2169238
[6] Zhao Y., Zhu C., Chen Z., Yu L., "Depth no-synthesis-error model for view synthesis in 3-D video," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2221-2228, 2011. DOI: 10.1109/TIP.2011.2118218
[7] Zeng B., Fu J.-J., "Directional discrete cosine transforms—A new framework for image coding," IEEE Trans. Circ. Syst. for Video Technology, vol. 18, no. 3, pp. 305-313, 2008. DOI: 10.1109/TCSVT.2008.918455
[8] ITU-T Rec. H.264 | ISO/IEC 14496-10 (MPEG-4 AVC), "Advanced video coding for generic audiovisual services," 2005.
[9] Sullivan G., Ohm J.-R., Han W.-J., Wiegand T., "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circ. Syst. for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012. DOI: 10.1109/TCSVT.2012.2221191
[10] Kim W.-S., Ortega A., Lai P., Tian D., Gomila C., "Depth map distortion analysis for view rendering and depth coding," in Proc. of IEEE Int. Conf. Image Process. (ICIP 2009), pp. 721-724, 2009.
[11] Shen G., Kim W.-S., Ortega A., Lee J., Wey H., "Edge-aware intra prediction for depth-map coding," in Proc. of IEEE Int. Conf. Image Process. (ICIP 2010), pp. 3393-3396, 2010.
[12] Maitre M., Do M. N., "Joint encoding of depth image based representation using shape-adaptive wavelets," in Proc. of IEEE Int. Conf. Image Process. (ICIP 2008), pp. 1768-1771, 2008.
[13] Shen G., Kim W.-S., Narang S. K., Ortega A., Lee J., Wey H., "Edge-adaptive transforms for efficient depth map coding," in Proc. Picture Coding Symp. (PCS 2010), pp. 566-569, 2010.
[14] Kim W.-S., Narang S. K., Ortega A., "Graph based transforms for depth video coding," in Proc. of Int. Conf. Acoustics, Speech and Signal Process. (ICASSP 2012), pp. 813-816, 2012.
[15] Xu J., Zeng B., Wu F., "An overview of directional transforms in image coding," in Proc. of IEEE Int. Symp. Circuits and Systems (ISCAS 2010), pp. 3036-3039, 2010.
[16] Gu Z., Lin W., Lee B.-S., Lau C. T., "Rotated orthogonal transform (ROT) for motion-compensation residual coding," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4770-4781, 2012. DOI: 10.1109/TIP.2012.2206045
[17] Liu L., Wang A., Zhu K., Lin C., Zhao Y., "Directional block compressed sensing for image coding," in Proc. of IEEE Int. Conf. Circ. Syst. (ISCAS 2013), pp. 1644-1647, 2013.
[18] Kamisli F., Lim J. S., "1-D transforms for the motion compensation residual," IEEE Trans. Image Process., vol. 20, no. 4, pp. 1036-1046, 2011. DOI: 10.1109/TIP.2010.2083675
[19] Oh B. T., Lee J., Park D., "Depth map coding based on synthesized view distortion function," IEEE J. of Sel. Topics in Signal Process., vol. 5, no. 7, pp. 1344-1352, 2011. DOI: 10.1109/JSTSP.2011.2164893
[20] Middlebury Stereo Datasets.
[21] Bjontegaard G., "Calculation of average PSNR differences between RD-curves," Tech. Rep. VCEG-M33, 2001.