GPU-Accelerated Single Image Depth Estimation with Color-Filtered Aperture
GPU-Accelerated Single Image Depth Estimation with Color-Filtered Aperture
KSII Transactions on Internet and Information Systems (TIIS). 2014. Mar, 8(3): 1058-1070
Copyright © 2014, Korean Society For Internet Information
  • Received : November 19, 2013
  • Accepted : January 23, 2014
  • Published : March 31, 2014
Export by style
Cited by
About the Authors
Yueh-Teng, Hsu
A-MTK Corp. Taipei 236, Taiwan
Chun-Chieh, Chen
Department of Electronic Engineering, National Taipei University of Technology Taipei 106, Taiwan
Shu-Ming, Tseng
Department of Electronic Engineering, National Taipei University of Technology Taipei 106, Taiwan

There are two major ways to implement depth estimation, multiple image depth estimation and single image depth estimation, respectively. The former has a high hardware cost because it uses multiple cameras but it has a simple software algorithm. Conversely, the latter has a low hardware cost but the software algorithm is complex. One of the recent trends in this field is to make a system compact, or even portable, and to simplify the optical elements to be attached to the conventional camera. In this paper, we present an implementation of depth estimation with a single image using a graphics processing unit (GPU) in a desktop PC, and achieve real-time application via our evolutional algorithm and parallel processing technique, employing a compute shader. The methods greatly accelerate the compute-intensive implementation of depth estimation with a single view image from 0.003 frames per second (fps) (implemented in MATLAB) to 53 fps, which is almost twice the real-time standard of 30 fps. In the previous literature, to the best of our knowledge, no paper discusses the optimization of depth estimation using a single image, and the frame rate of our final result is better than that of previous studies using multiple images, whose frame rate is about 20fps.
1. Introduction
A GPU is installed in most computing devices such as the personal computer (PC) and smart phone, and it can provide almost all the data processing operations required in the field of digital signal processes and digital image processes. Although the modern multi-core central processing unit (CPU) has good performance, the speed of the other tasks originally assigned to it may be decreased while it implements the depth estimation task. As regards three-dimensional (3D) game graphics, however, GPUs are seldom used most of the time. The best advantage of using the GPU is its parallel processing capability, which means that it is able to process many data at the same time in parallel by using its large supply of powerful arithmetic logic units (ALUs). In [1] , the author uses GPUs to accelerate the implementation of parameterized baseband models for two different orthogonal frequency division multiplexing (OFDM) protocols of 802.11a and 802.16 and achieve real-time throughput. In [2] , the author shows a design example of a mobile WiMAX terminal implemented on the GPU platform which is nearly 90 times faster than the conventional DSP-driven modem. In [3] , a novel scheme that accelerates password recovery of PDF files on GPUs using a compute unified device architecture (CUDA) is proposed, and the experimental results show that this scheme has higher speed performance at low cost.
There are several ways to realize depth estimation, generally divided into two categories: multiple image depth estimation [4] [5] [6] and single image depth estimation [7] [8] [9] , the advantage of the former is fast implementation thanks to its simple algorithm but it has the drawback of higher hardware cost of multiple cameras. Accordingly, it is easier to achieve real time by utilizing multiple images to estimate depth density [5] [4] . On the other hand, the latter method has a lower hardware cost with only one camera, but it is quite difficult to achieve real time because of its numerous mathematical calculations. One of the recent trends in this field is to make a system compact, or even portable, and to simplify the optical elements to be attached to the conventional camera [10] [11] and the smart mobile. Consequently, our research focuses on accelerating the depth estimation process with a single image until finally we achieve real time by employing our evolutional algorithm and complete the implementation on a GPU which has powerful computing capability.
Dense depth estimation with either a single view image or multiple images has been explored for decades, and a wealth of literature discusses the speed performance of the latter much more than that of the former. The reason is that it is still a challenge to obtain a good performance by using only a single image to estimate depth because of its complex calculation. The implementation optimized by two-dimension (2-D) generalization of dynamic programming (DP) [12] is precise; however, the optimized scheme is too complex to achieve real-time application. In [13] , the authors use single instruction multiple data (SIMD) like MMX and SSE on a CPU to reconstruct depth maps via multiple images, utilizing parallel processing to achieve real-time performance but with lower precision. In [5] [4] , real-time performance utilizing multiple images based on GPU is achieved. The author in [5] presents an algorithm relying on the conventional sum of square difference (SSD) dissimilarity measure between correlation windows. The author in [4] extends the basic idea of simple plane sweep to generalized search spaces of any geometry and maintains performance.
All the literature discussed in the previous paragraph implements depth estimation using multiple images. To the best of our knowledge, there is still no literature on real-time depth map application using single image. In our approach, we follow the basic idea of depth estimation with a single image presented by [7] . The prefix sum [14] algorithm included in the box filter method [15] is modified to be suited to data-parallel processing in the GPU. With implementation on a compute shader, which is a programmable shader that expands Microsoft Direct3D 11 beyond graphics programming and provides high-speed general purpose computing to take advantage of the large numbers of parallel processors on a GPU, we achieve real-time performance of 53 fps with a single image scenario which is even more than the frame rate (20 fps) achieved by [4] with multiple images.
The remainder of this paper is organized as follows. We present a camera with color-filtered aperture in Section 2. In Section 3, we introduce the depth estimation algorithm. In Section 4, we explain the computational complex of the algorithm and present some methods to reduce it. In Section 5, we take simple examples to illustrate box-filter and prefix sum methods. Depth estimation results are presented in Section 6. Section 7 concludes.
2. Color-filtered Aperture
Our camera lens equipped with color filters is shown in Fig. 1 . The RGB filters displace the captured image of the sensor with respect to the center. According to our placement of RGB filters, if an object is further than the focused depth, the captured image will have a rightward shift in the R plane, a leftward shift in the B plane, and an upward shift in the G plane. If an object is nearer than the focused depth, the captured image will be shifted in the opposite directions. Fig. 2 shows the displacement of the R (top right), G (bottom right) and B (bottom left) planes, and the red lines highlight the shift between each of two RGB components.
PPT Slide
Lager Image
(a) Camera lens equipped with color filters placed in front of the aperture (b) color filter placement
PPT Slide
Lager Image
Example of captured image (top left) and its R component (top right),G component (bottom right) and B component (bottom left). The red lines are superimposed to highlight the displacement of each component.
We use a Canon EOS40D DSLR for implementation, and we cut out a disc with a triple square-shaped hole from a piece of black cardboard (the same fabrication process as in [7] ), and make color filters (Fujifilter SC-58, BPB-53, and BPB-45) stick to it, then attach it in front of a Canon EF 50mm f/1.8 II lens. The fabrication process is simple and takes little time.
3. Depth Estimation Method
The image we have captured is a form of RGB plane consisting of Ir , Ig , and Ib , and because of the RGB color-filter placed in front of the lens, as shown in Fig. 1 , we can observe that a scene pixel further than the focused depth has a rightward shift in the R plane, a leftward shift in the B plane, and an upward shift in the G plane. As a result, if we assume that pixel ( x , y ) has a disparity d , we can find the aligned pixel at Ir ( x + d , y ), Ig ( x , y - d ), and Ib ( x - d , y ).
We determine the window size according to the captured image size and we slide the window through the whole captured image from left to right, from top to bottom. We denote w ( x , y ) as a window around ( x , y ) , and consider a set S I ( x , y ; d ) with hypothesized disparity d as S I ( x , y ; d )={( Ir ( s + d , t ), Ig ( s , t - d ), Ib ( s - d , t ))|( s , t )∈ w ( x , y )}. To rectify the misalignment, we should search for d to minimize the following color alignment metric [7]
PPT Slide
Lager Image
where λ 0 , λ 1 , and λ 2 are the descending positive eigenvalues of the covariance matrix Σ of S I ( x , y ; d ) and
PPT Slide
Lager Image
PPT Slide
Lager Image
are the variance of Ir ( x + d , y ), Ig ( x , y - d ), and Ib ( x - d , y ), respectively, and also the diagonal elements of Σ. According to principal component analysis (PCA) [20] , the product of eigenvalues is equal to the determinant of Σ, and the color alignment metric L in (1) is smaller when λ 0 is much larger than the λ 1 and λ 2 ; then the distribution of the pixel colors within the window in the RGB space will be elongated, which means that the pixels of RGB planes are more correlated after shifting d pixels toward the virtual center.
4. Computational Complexity: Explanation and Reduction
For each window (described in Section 3) within the captured image, we have to calculate all the elements of the covariance matrix of each RGB plane with disparity d denoted by S I ( x , y ; d ) including variance of S I ( x , y ; d ) and covariance of each pair of S I ( x , y ; d ), to obtain the color alignment metric Ld with disparity d within a local window. We search for all the disparity d (-5 to 10 in our implementation) to find the smallest color alignment metric
PPT Slide
Lager Image
PPT Slide
Lager Image
Then we slide the window by one pixel and repeat the above process until we have gone through the whole captured image. It is easy to see that there is a large amount of computation, especially as we have to calculate the variance and covariance a number of times. Assuming R and G are random variables of R plane and G plane within a local window, the variance of R and covariance of R and G can be calculated as follows:
PPT Slide
Lager Image
where E[.] represents the expectation operator. The four items of Eq. (2) need to obtain the sum of the pixels in either R or G plane within a window; therefore, we employ the box filter [15] to improve the speed of calculating the variance and covariance, so the complexity will be reduced. The box filter executes four operations for each output picture pixel and is independent of the box size. Using this method we can calculate the expectations of all pixel values belonging to each window in a captured image at one time. It greatly reduces the cycles of calculation and also significantly accelerates our implementation.
To use the box filter method, all the pixel’s values of each RGB component must first be accumulated, which is a sequential operation. Unfortunately, such an operation is hard to realize in a GPU, which usually processes data in parallel. Therefore, we utilize the prefix sum algorithm, which is a parallel operation for cumulating pixels. The details are discussed in [14] [16] .
5. Explanation of Box-filter and Prefix Sum
In this section, we explain the effect of the box-filter [15] and prefix sum [14] [16] by means of simple examples.
- 5.1 Box-filter
We take the simple case of a 5×5 image as shown in Fig. 3 (a), and the numbers inside the frame are assumed to be the intensity of the pixels. The sliding window size is assumed to be 2×2 . First, we obtain the cumulation of Fig. 3 (a) by a two-step operation which cumulates the value in the horizontal ( Fig. 3 (b)) and vertical directions ( Fig. 3 (c)). If we want to obtain the sum of the values inside the red rectangular window in Fig. 3 (a), it could be calculated by the following operation:
PPT Slide
Lager Image
(a) An example of a5×5 image (b) The cumulation of (a) in the horizontal direction (c) The cumulation of (b) in the vertical direction
  • 3+5+1+8 =17 = 39 -16 -10 + 4
The numbers on the right-hand side are highlighted by a red rectangle in Fig. 3 (c). It is easy to extend this to calculate the sum of each window, and could be done in parellel by a GPU.
- 5.2 Prefix Sum
As described in Section 6.1, we should calculate the cumulation in the horizontal and vertical directions. A prefix sum algorithm is modified such that it can be data-parallel processing in the GPU. In the following, we use a simple example to explain this method. Fig. 4 illustrates an example of the prefix sum algorithm when the sequence length is eigh. The parameter d is one initially, and it grows of a power of two in each stage until it achieves half the sequence length (=4 in Fig. 4 ). The i th value ( i > d ) of the sequence in one stage is the sum of the i th and j th ( j = i - d ) values of the sequence in the previous stage. Fig. 4 shows a simple case of eight values. Following the steps described above, we can see the bottom row is a cumulation of the top row. It is also easy to extend the process from 1-D to 2-D, and we can use the GPU to calculate the prefix sum in parallel, which greatly speeds up this operation
PPT Slide
Lager Image
Illustration of prefix sum
6. Depth Estimation Result
In this section, we implement the depth estimation in the hardware environment listed in Table 1 . For software, we consider three languages: Matlab, C++ and compute shader. We use C++ along with Integrated Performance Primitives (IPP) library which is an extensive library of multicore-ready and highly optimized software functions with single instruction multiple data (SIMD) instructions. The compute shader we use is a shader language of DirectX 11. The languages’ platform and version are listed in Table 2 . The image in our implementation has a pixel resolution of 640×480.
Hardware list
PPT Slide
Lager Image
Hardware list
List of languages, platform and version
PPT Slide
Lager Image
List of languages, platform and version
As shown in Table 3 , the CPU loading of the compute shader is almost zero, because the process is running on a GPU, which means that it does not affect the processes executed in a CPU. The frame rates of Matlab employing a box filter algorithm are almost 40 times greater than if it did not use a box filter algorithm. The performance of C++ using the IPP library is 5.3 times faster than it would be in Matlab. Finally, the performance of the compute shader is about 80 times faster than C++ and 420 times faster than Matlab. The implemented compute shader has a frame rate of 52.63fps, which exceeds the recognized real-time standard of 30fps. The standard is based on a well-known real-time television system, NTSC (National Television System Committee), whose frame rate is 29.97 fps [17] [18] .
Performance comparison of depth estimation among different languages for a 640×480
PPT Slide
Lager Image
Performance comparison of depth estimation among different languages for a 640×480
Fig. 5 (a) shows the captured image taken with a lens attached by color filters and the camera focused on the background which is the farthest object from the image. We can see that the nearest object has the largest displacement. For example, the nearest beverage can is more misaligned than the next one, and that one is more misaligned than the farthest one. The depth map is presented in Fig. 5 (b), where the whiter color shows that the region has a smaller displacement. We can see that the nearer object in Fig. 5 (a) corresponds to a darker region in Fig. 5 (b). The depth estimation which is a kind of local optimization has two major limitations, however. The first is that the texture of the local region should be reasonably obvious, as highlighted by yellow circles in Fig. 5 (b). The second is that objects should not have too high a proportion of single, pure R, G, or B colors, as does the bottle of the nearest beverage highlighted by the red circle in Fig. 5 (b). Local optimization such as normalized cross-correlation (NCC) [19] and PCA [20] cannot solve the problem of misestimation. Global optimization such as graphic cut [21] [7] and dynamic programming [22] may partially solve this problem, but there is still the difficulty that objects must not be of one pure color. We would like to further investigate the problem of misestimation in future research.
PPT Slide
Lager Image
(a) The captured image. The foreground color is misaligned. (b) Estimated depth (the darker, the nearer)
7. Conclusions
We present an implementation of depth estimation from a single image. It is recognized that computational complexity is the main challenge to implementing depth estimation in real time, especially with a single image. In this paper, we propose an evolutional algorithm and modify it to fit the data-parallel processing of a GPU. The speed of the final result with a compute shader was 52fps. Previous research using multiple images obtained 20fps, so our result is about 2.5 times better. It is noteworthy that we use only a single image with one camera and lens but obtain higher frame rates.
We also compare the performance of several program languages with and without additional algorithms, and the result shows that our evolutional algorithm can improve the performance of frame rates about 40-fold. Moreover, the final implementation in a compute shader based on a GPU, which is 80 times faster than one implemented in C++ employing an IPP library and 420 times faster than one implemented in Matlab (all three use our evolutional algorithm), has frame rates up to 52 fps, exceeding the recognized real-time standard of 30 fps. Consequently the implementation of depth estimation with a single image on a GPU with our evolutional algorithm has greatly improved performance and successfully achieved real-time application.
Yueh-Teng Hsu received an M.Sc. degree in electrical engineering from the National Taiwan University in 1997 and a Ph.D. in electrical engineering from Chang Gung University in 2007. From 2004 to 2013, he served as R&D manager at Liteon Co. and Mitac Co. at Taiwan. Presently he serves as a Director of R&D at A-MTK Co., developing mobile applications with GPU-based algorithms. His research interests include parallel computation algorithms, software basebands of DAB/DVB receivers and computational photography.
Chun-Chieh Chen received a B.Sc. degree in electronic engineering from the National Taipei University of Technology in 2012, and studied at the Graduate Institute at the National Taipei University of Technology.
Shu-Ming Tseng received a B.Sc. degree from the National Tsing Hua University, Taiwan, and an M.Sc. and Ph.D. from Purdue University, IN, USA, all in electrical engineering, in 1994, 1995, and 1999, respectively. He worked for the Department of Electrical Engineering, Chang Gung University, Taiwan, from 1999 to 2001. Since 2001, he has worked for the Department of Electronic Engineering, the National Taipei University of Technology, Taiwan, where he is currently a Professor. His research interests are MIMO, OFDM,queueing analysis, software defined radio, cooperative communications, and power-line communications. He has served as an editor for KSII Transactions on Internet and Information Systems, indexed in SCIE, since 2013. Prof. Tseng served as a Technical Program Committee member for symposia of the IEEE VTC Fall 2003, WirelessCom 2005, IWCMC 2006, M-CCN 2007, WiCON 2010, ISITA2010/ISSSTA2010, APWCS 2011, etc. He has been listed in Marquis Who’s Who in the World since 2006. His team implemented a real-time PC-based software DAB receiver in 2006 and the real-time PC-based software DVB-T receiver in 2012.
Li Rongchun , Dou Yong , Zhou Jie , Li Baofeng , Xu Jinbo 2013 “From WiFi to WiMAX: Efficient GPU-based Parameterized Transceiver across Different OFDM Protocols” KSII Transactions on Internet and Information systems Article (CrossRef Link) 7 (8) 1911 - 1932
Kim J. , Hyeon S. 2010 “Implementation of an SDR System Using Graphics Processing Unit” IEEE Communications Magazine Article (CrossRef Link) 48 (3) 156 - 162    DOI : 10.1109/MCOM.2010.5434388
Kim K. , Lee S. , Hong D. , Ryou J. C. 2011 “GPU-Accelerated Password Cracking of PDF Files” KSII Transactions on Internet and Information Systems Article (CrossRef Link) 5 (11) 2235 - 2253
Woetzel J. , Koch R. 2004 “Real-time Multi-stereo Depth Estimation on GPU with Approximative Discontinuity Handling” in Proc. of 1st European Conference on Visual Media Production (CVMP) March Article (CrossRef Link) 245 - 254
Pollefeys R. Yang , M. 2003 “Multiresolution Real-time Stereo on Commodity Graphics Hardware” in Proc. of Conference Society on Computer Vision and Pattern Recognition June vol. 1, Article (CrossRef Link) 211 - 217
Zach C. , Klaus A. , Reitinger B. , Karner K. 2003 “Optimized Stereo Reconstruction Using 3D Graphics Hardware” in Proc. of Workshop of Vision, Modeling and Visualization (VMV) August Article (CrossRef Link) 119 - 126
Bando Y. , Chen B. , Nishita T. 2008 “Extracting Depth and Matte Using a Color-filtered Aperture” ACM Transactions on Graphics (TOG) Article (CrossRef Link) 27 (5) 1 - 9    DOI : 10.1145/1409060.1409087
Liu B. , Gould S. , Koller D. 2010 “Single Image Depth Estimation from Predicted Semantic Labels” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) June Article (CrossRef Link) 1253 - 1260
Lai S. H. , Fu C. W. , Chang S. 1992 "A Generalized Depth Estimation Algorithm with a Single Image" IEEE Transactions on Pattern Analysis ans Machine Intelligence (PAMI) Article (CrossRef Link) 14 (4) 405 - 411    DOI : 10.1109/34.126803
Ng R. , Levoy M. , Br´e Dif M. , Duval G. , Horowitz M. , Hanrahan P. 2005 “Light Field Photography with Hand-held Plenoptic Camera,” Stanford Computer Science Tech. Rep. CSTR 2005-02 Article (CrossRef Link)
Veeraraghavan A. , Raskar R. , Agrawal A. , Mohan A. , Tumblin J. 2007 “Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing” ACM Transactions on Graphics (TOG) Article (CrossRef Link) 26 (3) 1 - 12    DOI : 10.1145/1276377.1276463
Kolmogorov V. , Zabih R. 2008 “Multi-camera Scene Reconstruction via Graph Cuts” in Proc. of Seventh European Conf. Computer Vision May vol 3, Article (CrossRef Link) 82 - 96
Hirschmueller H. 2001 “Improvements in Realtime Correlation-based Stereo Vision” in Proc. of IEEE Workshop on Stereo and Multi- Baseline Vision December Article (CrossRef Link) 141 - 148
Sengupta S. , Lefohn A. E. , Owens J. D. 2006 “A Work-efficient Step-efficient Prefix Sum Algorithm” in Proc. of the Workshop on Edge Computing Using New Commodity Architecture May Article (CrossRef Link) 26 - 27
McDonnell M. J. 1981 “Box-filtering Techniques” Computer Graphics and Image Processing Article (CrossRef Link) 17 65 - 70    DOI : 10.1016/S0146-664X(81)80009-3
Hillis D. W. , Steele G.L. 1986 “Data Parallel Algorithms” Communications of the ACM Article (CrossRef Link) 29 (12) 1170 - 1183    DOI : 10.1145/7902.7903
Williams Richard 1999 “All in Good Timecode” Adobe Magazine Article (CrossRef Link) 57 - 59
Li Ze-Nian , Drew Mark S. 2004 “Fundamentals of Multimedia” China Machine Press Article (CrossRef Link) 104 -
Lewis J. P. 1995 “Fast Template Matching” in Proc of Vision Interface May Article (CrossRef Link) 120 - 123
Jolliffe I.T. 1986 Principal Component Analysis Springer-Verlag New York Article (CrossRef Link)
Kim J. , Kolmogorov V. , Zabih R. 2003 “Visual Correspondence Using Energy Minimization and Mutual Information” in Proc. of International Conference on Computer Vision (ICCV) vol.2, Article (CrossRef Link) 1033 - 1040
Wang L. , Miao L. , Minglun G. , Yang R. , Nister D 2006 “High-quality Real-time Stereo Using Adaptive Cost Aggregation and Dynamic Programming” in Proc. of Third International Symposium on 3D Data Processing, Visualization and Transmission Article (CrossRef Link) 798 - 805