There are two major ways to implement depth estimation, multiple image depth estimation and single image depth estimation, respectively. The former has a high hardware cost because it uses multiple cameras but it has a simple software algorithm. Conversely, the latter has a low hardware cost but the software algorithm is complex. One of the recent trends in this field is to make a system compact, or even portable, and to simplify the optical elements to be attached to the conventional camera. In this paper, we present an implementation of depth estimation with a single image using a graphics processing unit (GPU) in a desktop PC, and achieve realtime application via our evolutional algorithm and parallel processing technique, employing a compute shader. The methods greatly accelerate the computeintensive implementation of depth estimation with a single view image from 0.003 frames per second (fps) (implemented in MATLAB) to 53 fps, which is almost twice the realtime standard of 30 fps. In the previous literature, to the best of our knowledge, no paper discusses the optimization of depth estimation using a single image, and the frame rate of our final result is better than that of previous studies using multiple images, whose frame rate is about 20fps.
1. Introduction
A
GPU is installed in most computing devices such as the personal computer (PC) and smart phone, and it can provide almost all the data processing operations required in the field of digital signal processes and digital image processes. Although the modern multicore central processing unit (CPU) has good performance, the speed of the other tasks originally assigned to it may be decreased while it implements the depth estimation task. As regards threedimensional (3D) game graphics, however, GPUs are seldom used most of the time. The best advantage of using the GPU is its parallel processing capability, which means that it is able to process many data at the same time in parallel by using its large supply of powerful arithmetic logic units (ALUs). In
[1]
, the author uses GPUs to accelerate the implementation of parameterized baseband models for two different orthogonal frequency division multiplexing (OFDM) protocols of 802.11a and 802.16 and achieve realtime throughput. In
[2]
, the author shows a design example of a mobile WiMAX terminal implemented on the GPU platform which is nearly 90 times faster than the conventional DSPdriven modem. In
[3]
, a novel scheme that accelerates password recovery of PDF files on GPUs using a compute unified device architecture (CUDA) is proposed, and the experimental results show that this scheme has higher speed performance at low cost.
There are several ways to realize depth estimation, generally divided into two categories: multiple image depth estimation
[4]
[5]
[6]
and single image depth estimation
[7]
[8]
[9]
, the advantage of the former is fast implementation thanks to its simple algorithm but it has the drawback of higher hardware cost of multiple cameras. Accordingly, it is easier to achieve real time by utilizing multiple images to estimate depth density
[5]
[4]
. On the other hand, the latter method has a lower hardware cost with only one camera, but it is quite difficult to achieve real time because of its numerous mathematical calculations. One of the recent trends in this field is to make a system compact, or even portable, and to simplify the optical elements to be attached to the conventional camera
[10]
[11]
and the smart mobile. Consequently, our research focuses on accelerating the depth estimation process with a single image until finally we achieve real time by employing our evolutional algorithm and complete the implementation on a GPU which has powerful computing capability.
Dense depth estimation with either a single view image or multiple images has been explored for decades, and a wealth of literature discusses the speed performance of the latter much more than that of the former. The reason is that it is still a challenge to obtain a good performance by using only a single image to estimate depth because of its complex calculation. The implementation optimized by twodimension (2D) generalization of dynamic programming (DP)
[12]
is precise; however, the optimized scheme is too complex to achieve realtime application. In
[13]
, the authors use single instruction multiple data (SIMD) like MMX and SSE on a CPU to reconstruct depth maps via multiple images, utilizing parallel processing to achieve realtime performance but with lower precision. In
[5]
[4]
, realtime performance utilizing multiple images based on GPU is achieved. The author in
[5]
presents an algorithm relying on the conventional sum of square difference (SSD) dissimilarity measure between correlation windows. The author in
[4]
extends the basic idea of simple plane sweep to generalized search spaces of any geometry and maintains performance.
All the literature discussed in the previous paragraph implements depth estimation using multiple images. To the best of our knowledge, there is still no literature on realtime depth map application using single image. In our approach, we follow the basic idea of depth estimation with a single image presented by
[7]
. The prefix sum
[14]
algorithm included in the box filter method
[15]
is modified to be suited to dataparallel processing in the GPU. With implementation on a compute shader, which is a programmable shader that expands Microsoft Direct3D 11 beyond graphics programming and provides highspeed general purpose computing to take advantage of the large numbers of parallel processors on a GPU, we achieve realtime performance of 53 fps with a single image scenario which is even more than the frame rate (20 fps) achieved by
[4]
with multiple images.
The remainder of this paper is organized as follows. We present a camera with colorfiltered aperture in Section 2. In Section 3, we introduce the depth estimation algorithm. In Section 4, we explain the computational complex of the algorithm and present some methods to reduce it. In Section 5, we take simple examples to illustrate boxfilter and prefix sum methods. Depth estimation results are presented in Section 6. Section 7 concludes.
2. Colorfiltered Aperture
Our camera lens equipped with color filters is shown in
Fig. 1
. The RGB filters displace the captured image of the sensor with respect to the center. According to our placement of RGB filters, if an object is further than the focused depth, the captured image will have a rightward shift in the R plane, a leftward shift in the B plane, and an upward shift in the G plane. If an object is nearer than the focused depth, the captured image will be shifted in the opposite directions.
Fig. 2
shows the displacement of the R (top right), G (bottom right) and B (bottom left) planes, and the red lines highlight the shift between each of two RGB components.
(a) Camera lens equipped with color filters placed in front of the aperture (b) color filter placement
Example of captured image (top left) and its R component (top right),G component (bottom right) and B component (bottom left). The red lines are superimposed to highlight the displacement of each component.
We use a Canon EOS40D DSLR for implementation, and we cut out a disc with a triple squareshaped hole from a piece of black cardboard (the same fabrication process as in
[7]
), and make color filters (Fujifilter SC58, BPB53, and BPB45) stick to it, then attach it in front of a Canon EF 50mm f/1.8 II lens. The fabrication process is simple and takes little time.
3. Depth Estimation Method
The image we have captured is a form of RGB plane consisting of
I_{r}
,
I_{g}
, and
I_{b}
, and because of the RGB colorfilter placed in front of the lens, as shown in
Fig. 1
, we can observe that a scene pixel further than the focused depth has a rightward shift in the R plane, a leftward shift in the B plane, and an upward shift in the G plane. As a result, if we assume that pixel (
x
,
y
) has a disparity
d
, we can find the aligned pixel at
I_{r}
(
x
+
d
,
y
),
I_{g}
(
x
,
y

d
), and
I_{b}
(
x

d
,
y
).
We determine the window size according to the captured image size and we slide the window through the whole captured image from left to right, from top to bottom. We denote
w
(
x
,
y
) as a window around (
x
,
y
) , and consider a set
S
_{I}
(
x
,
y
;
d
) with hypothesized disparity
d
as
S
_{I}
(
x
,
y
;
d
)={(
I_{r}
(
s
+
d
,
t
),
I_{g}
(
s
,
t

d
),
I_{b}
(
s

d
,
t
))(
s
,
t
)∈
w
(
x
,
y
)}. To rectify the misalignment, we should search for
d
to minimize the following color alignment metric
[7]
where
λ
_{0}
,
λ
_{1}
, and
λ
_{2}
are the descending positive eigenvalues of the covariance matrix Σ of
S
_{I}
(
x
,
y
;
d
) and
and
are the variance of
I_{r}
(
x
+
d
,
y
),
I_{g}
(
x
,
y

d
), and
I_{b}
(
x

d
,
y
), respectively, and also the diagonal elements of Σ. According to principal component analysis (PCA)
[20]
, the product of eigenvalues is equal to the determinant of Σ, and the color alignment metric
L
in (1) is smaller when
λ
_{0}
is much larger than the
λ
_{1}
and
λ
_{2}
; then the distribution of the pixel colors within the window in the RGB space will be elongated, which means that the pixels of RGB planes are more correlated after shifting
d
pixels toward the virtual center.
4. Computational Complexity: Explanation and Reduction
For each window (described in Section 3) within the captured image, we have to calculate all the elements of the covariance matrix of each RGB plane with disparity
d
denoted by
S
_{I}
(
x
,
y
;
d
) including variance of
S
_{I}
(
x
,
y
;
d
) and covariance of each pair of
S
_{I}
(
x
,
y
;
d
), to obtain the color alignment metric
L_{d}
with disparity
d
within a local window. We search for all the disparity
d
(5 to 10 in our implementation) to find the smallest color alignment metric
Then we slide the window by one pixel and repeat the above process until we have gone through the whole captured image. It is easy to see that there is a large amount of computation, especially as we have to calculate the variance and covariance a number of times. Assuming R and G are random variables of R plane and G plane within a local window, the variance of R and covariance of R and G can be calculated as follows:
where
E[.]
represents the expectation operator. The four items of Eq. (2) need to obtain the sum of the pixels in either R or G plane within a window; therefore, we employ the box filter
[15]
to improve the speed of calculating the variance and covariance, so the complexity will be reduced. The box filter executes four operations for each output picture pixel and is independent of the box size. Using this method we can calculate the expectations of all pixel values belonging to each window in a captured image at one time. It greatly reduces the cycles of calculation and also significantly accelerates our implementation.
To use the box filter method, all the pixel’s values of each RGB component must first be accumulated, which is a sequential operation. Unfortunately, such an operation is hard to realize in a GPU, which usually processes data in parallel. Therefore, we utilize the prefix sum algorithm, which is a parallel operation for cumulating pixels. The details are discussed in
[14]
[16]
.
5. Explanation of Boxfilter and Prefix Sum
In this section, we explain the effect of the boxfilter
[15]
and prefix sum
[14]
[16]
by means of simple examples.
 5.1 Boxfilter
We take the simple case of a 5×5 image as shown in
Fig. 3
(a), and the numbers inside the frame are assumed to be the intensity of the pixels. The sliding window size is assumed to be 2×2 . First, we obtain the cumulation of
Fig. 3
(a) by a twostep operation which cumulates the value in the horizontal (
Fig. 3
(b)) and vertical directions (
Fig. 3
(c)). If we want to obtain the sum of the values inside the red rectangular window in
Fig. 3
(a), it could be calculated by the following operation:
(a) An example of a5×5 image (b) The cumulation of (a) in the horizontal direction (c) The cumulation of (b) in the vertical direction

3+5+1+8 =17 = 39 16 10 + 4
The numbers on the righthand side are highlighted by a red rectangle in
Fig. 3
(c). It is easy to extend this to calculate the sum of each window, and could be done in parellel by a GPU.
 5.2 Prefix Sum
As described in Section 6.1, we should calculate the cumulation in the horizontal and vertical directions. A prefix sum algorithm is modified such that it can be dataparallel processing in the GPU. In the following, we use a simple example to explain this method.
Fig. 4
illustrates an example of the prefix sum algorithm when the sequence length is eigh. The parameter
d
is one initially, and it grows of a power of two in each stage until it achieves half the sequence length (=4 in
Fig. 4
). The
i
th value (
i
>
d
) of the sequence in one stage is the sum of the
i
th and
j
th (
j
=
i

d
) values of the sequence in the previous stage.
Fig. 4
shows a simple case of eight values. Following the steps described above, we can see the bottom row is a cumulation of the top row. It is also easy to extend the process from 1D to 2D, and we can use the GPU to calculate the prefix sum in parallel, which greatly speeds up this operation
Illustration of prefix sum
6. Depth Estimation Result
In this section, we implement the depth estimation in the hardware environment listed in
Table 1
. For software, we consider three languages: Matlab, C++ and compute shader. We use C++ along with Integrated Performance Primitives (IPP) library which is an extensive library of multicoreready and highly optimized software functions with single instruction multiple data (SIMD) instructions. The compute shader we use is a shader language of DirectX 11. The languages’ platform and version are listed in
Table 2
. The image in our implementation has a pixel resolution of 640×480.
Hardware list
List of languages, platform and version
List of languages, platform and version
As shown in
Table 3
, the CPU loading of the compute shader is almost zero, because the process is running on a GPU, which means that it does not affect the processes executed in a CPU. The frame rates of Matlab employing a box filter algorithm are almost 40 times greater than if it did not use a box filter algorithm. The performance of C++ using the IPP library is 5.3 times faster than it would be in Matlab. Finally, the performance of the compute shader is about 80 times faster than C++ and 420 times faster than Matlab. The implemented compute shader has a frame rate of 52.63fps, which exceeds the recognized realtime standard of 30fps. The standard is based on a wellknown realtime television system, NTSC (National Television System Committee), whose frame rate is 29.97 fps
[17]
[18]
.
Performance comparison of depth estimation among different languages for a 640×480
Performance comparison of depth estimation among different languages for a 640×480
Fig. 5
(a) shows the captured image taken with a lens attached by color filters and the camera focused on the background which is the farthest object from the image. We can see that the nearest object has the largest displacement. For example, the nearest beverage can is more misaligned than the next one, and that one is more misaligned than the farthest one. The depth map is presented in
Fig. 5
(b), where the whiter color shows that the region has a smaller displacement. We can see that the nearer object in
Fig. 5
(a) corresponds to a darker region in
Fig. 5
(b). The depth estimation which is a kind of local optimization has two major limitations, however. The first is that the texture of the local region should be reasonably obvious, as highlighted by yellow circles in
Fig. 5
(b). The second is that objects should not have too high a proportion of single, pure R, G, or B colors, as does the bottle of the nearest beverage highlighted by the red circle in
Fig. 5
(b). Local optimization such as normalized crosscorrelation (NCC)
[19]
and PCA
[20]
cannot solve the problem of misestimation. Global optimization such as graphic cut
[21]
[7]
and dynamic programming
[22]
may partially solve this problem, but there is still the difficulty that objects must not be of one pure color. We would like to further investigate the problem of misestimation in future research.
(a) The captured image. The foreground color is misaligned. (b) Estimated depth (the darker, the nearer)
7. Conclusions
We present an implementation of depth estimation from a single image. It is recognized that computational complexity is the main challenge to implementing depth estimation in real time, especially with a single image. In this paper, we propose an evolutional algorithm and modify it to fit the dataparallel processing of a GPU. The speed of the final result with a compute shader was 52fps. Previous research using multiple images obtained 20fps, so our result is about 2.5 times better. It is noteworthy that we use only a single image with one camera and lens but obtain higher frame rates.
We also compare the performance of several program languages with and without additional algorithms, and the result shows that our evolutional algorithm can improve the performance of frame rates about 40fold. Moreover, the final implementation in a compute shader based on a GPU, which is 80 times faster than one implemented in C++ employing an IPP library and 420 times faster than one implemented in Matlab (all three use our evolutional algorithm), has frame rates up to 52 fps, exceeding the recognized realtime standard of 30 fps. Consequently the implementation of depth estimation with a single image on a GPU with our evolutional algorithm has greatly improved performance and successfully achieved realtime application.
BIO
YuehTeng Hsu received an M.Sc. degree in electrical engineering from the National Taiwan University in 1997 and a Ph.D. in electrical engineering from Chang Gung University in 2007. From 2004 to 2013, he served as R&D manager at Liteon Co. and Mitac Co. at Taiwan. Presently he serves as a Director of R&D at AMTK Co., developing mobile applications with GPUbased algorithms. His research interests include parallel computation algorithms, software basebands of DAB/DVB receivers and computational photography.
ChunChieh Chen received a B.Sc. degree in electronic engineering from the National Taipei University of Technology in 2012, and studied at the Graduate Institute at the National Taipei University of Technology.
ShuMing Tseng received a B.Sc. degree from the National Tsing Hua University, Taiwan, and an M.Sc. and Ph.D. from Purdue University, IN, USA, all in electrical engineering, in 1994, 1995, and 1999, respectively. He worked for the Department of Electrical Engineering, Chang Gung University, Taiwan, from 1999 to 2001. Since 2001, he has worked for the Department of Electronic Engineering, the National Taipei University of Technology, Taiwan, where he is currently a Professor. His research interests are MIMO, OFDM,queueing analysis, software defined radio, cooperative communications, and powerline communications. He has served as an editor for KSII Transactions on Internet and Information Systems, indexed in SCIE, since 2013. Prof. Tseng served as a Technical Program Committee member for symposia of the IEEE VTC Fall 2003, WirelessCom 2005, IWCMC 2006, MCCN 2007, WiCON 2010, ISITA2010/ISSSTA2010, APWCS 2011, etc. He has been listed in Marquis Who’s Who in the World since 2006. His team implemented a realtime PCbased software DAB receiver in 2006 and the realtime PCbased software DVBT receiver in 2012.
Li Rongchun
,
Dou Yong
,
Zhou Jie
,
Li Baofeng
,
Xu Jinbo
2013
“From WiFi to WiMAX: Efficient GPUbased Parameterized Transceiver across Different OFDM Protocols”
KSII Transactions on Internet and Information systems
Article (CrossRef Link)
7
(8)
1911 
1932
Kim J.
,
Hyeon S.
2010
“Implementation of an SDR System Using Graphics Processing Unit”
IEEE Communications Magazine
Article (CrossRef Link)
48
(3)
156 
162
DOI : 10.1109/MCOM.2010.5434388
Kim K.
,
Lee S.
,
Hong D.
,
Ryou J. C.
2011
“GPUAccelerated Password Cracking of PDF Files”
KSII Transactions on Internet and Information Systems
Article (CrossRef Link)
5
(11)
2235 
2253
Woetzel J.
,
Koch R.
2004
“Realtime Multistereo Depth Estimation on GPU with Approximative Discontinuity Handling”
in Proc. of 1st European Conference on Visual Media Production (CVMP)
March
Article (CrossRef Link)
245 
254
Pollefeys R. Yang
,
M.
2003
“Multiresolution Realtime Stereo on Commodity Graphics Hardware”
in Proc. of Conference Society on Computer Vision and Pattern Recognition
June
vol. 1, Article (CrossRef Link)
211 
217
Zach C.
,
Klaus A.
,
Reitinger B.
,
Karner K.
2003
“Optimized Stereo Reconstruction Using 3D Graphics Hardware”
in Proc. of Workshop of Vision, Modeling and Visualization (VMV)
August
Article (CrossRef Link)
119 
126
Bando Y.
,
Chen B.
,
Nishita T.
2008
“Extracting Depth and Matte Using a Colorfiltered Aperture”
ACM Transactions on Graphics (TOG)
Article (CrossRef Link)
27
(5)
1 
9
DOI : 10.1145/1409060.1409087
Liu B.
,
Gould S.
,
Koller D.
2010
“Single Image Depth Estimation from Predicted Semantic Labels”
in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
June
Article (CrossRef Link)
1253 
1260
Lai S. H.
,
Fu C. W.
,
Chang S.
1992
"A Generalized Depth Estimation Algorithm with a Single Image"
IEEE Transactions on Pattern Analysis ans Machine Intelligence (PAMI)
Article (CrossRef Link)
14
(4)
405 
411
DOI : 10.1109/34.126803
Ng R.
,
Levoy M.
,
Br´e Dif M.
,
Duval G.
,
Horowitz M.
,
Hanrahan P.
2005
“Light Field Photography with Handheld Plenoptic Camera,”
Stanford Computer Science
Tech. Rep. CSTR 200502
Article (CrossRef Link)
Veeraraghavan A.
,
Raskar R.
,
Agrawal A.
,
Mohan A.
,
Tumblin J.
2007
“Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing”
ACM Transactions on Graphics (TOG)
Article (CrossRef Link)
26
(3)
1 
12
DOI : 10.1145/1276377.1276463
Kolmogorov V.
,
Zabih R.
2008
“Multicamera Scene Reconstruction via Graph Cuts”
in Proc. of Seventh European Conf. Computer Vision
May
vol 3, Article (CrossRef Link)
82 
96
Hirschmueller H.
2001
“Improvements in Realtime Correlationbased Stereo Vision”
in Proc. of IEEE Workshop on Stereo and Multi Baseline Vision
December
Article (CrossRef Link)
141 
148
Sengupta S.
,
Lefohn A. E.
,
Owens J. D.
2006
“A Workefficient Stepefficient Prefix Sum Algorithm”
in Proc. of the Workshop on Edge Computing Using New Commodity Architecture
May
Article (CrossRef Link)
26 
27
Hillis D. W.
,
Steele G.L.
1986
“Data Parallel Algorithms”
Communications of the ACM
Article (CrossRef Link)
29
(12)
1170 
1183
DOI : 10.1145/7902.7903
Williams Richard
1999
“All in Good Timecode”
Adobe Magazine
Article (CrossRef Link)
57 
59
Li ZeNian
,
Drew Mark S.
2004
“Fundamentals of Multimedia”
China Machine Press
Article (CrossRef Link)
104 
Lewis J. P.
1995
“Fast Template Matching”
in Proc of Vision Interface
May
Article (CrossRef Link)
120 
123
Jolliffe I.T.
1986
Principal Component Analysis
SpringerVerlag
New York
Article (CrossRef Link)
Kim J.
,
Kolmogorov V.
,
Zabih R.
2003
“Visual Correspondence Using Energy Minimization and Mutual Information”
in Proc. of International Conference on Computer Vision (ICCV)
vol.2, Article (CrossRef Link)
1033 
1040
Wang L.
,
Miao L.
,
Minglun G.
,
Yang R.
,
Nister D
2006
“Highquality Realtime Stereo Using Adaptive Cost Aggregation and Dynamic Programming”
in Proc. of Third International Symposium on 3D Data Processing, Visualization and Transmission
Article (CrossRef Link)
798 
805