Real-Time 2D-to-3D Conversion for 3DTV using Time-Coherent Depth-Map Generation Method
International Journal of Contents. 2014. Sep, 10(3): 9-16
Copyright © 2014, The Korea Contents Association
  • Received : April 14, 2014
  • Accepted : July 14, 2014
  • Published : September 28, 2014
Seung-Woo, Nam
Hye-Sun, Kim
Yun-Ji, Ban
Sung-Il, Chien

Depth-image-based rendering is generally used in real-time 2D-to-3D conversion for 3DTV. However, inaccurate depth maps cause flickering issues between image frames in a video sequence, resulting in eye fatigue while viewing 3DTV. To resolve this flickering issue, we propose a new 2D-to-3D conversion scheme based on fast and robust depth-map generation from a 2D video sequence. The proposed depth-map generation algorithm divides an input video sequence into several cuts using a color histogram. The initial depth of each cut is assigned based on a hypothesized depth-gradient model. The initial depth map of the current frame is refined using color and motion information. Thereafter, the depth map of the next frame is updated using the difference image to reduce depth flickering. The experimental results confirm that the proposed scheme performs real-time 2D-to-3D conversions effectively and reduces human eye fatigue.
The reliable, real-time conversion of 2D videos to 3D videos is one of the most important 3D image generation methods in the 3D industry because it allows the many existing 2D videos to be converted to 3D. Several methods can be used to produce 3D video content. One is direct 3D shooting with a stereo video camera [1]. Because the human brain fuses the images from the left and right eyes to perceive the depth of objects, videos obtained directly from a stereo camera can induce 3D perception. However, 3D content generation from stereo cameras suffers from mismatches between the two cameras in time, color, distortion, vertical and horizontal location, and convergence, and correcting these mismatches requires time-consuming manual work. Another method combines one or more cameras with a depth sensor and applies depth-image-based rendering (DIBR) to the depth maps obtained from the sensor [2], [3]. This method also requires synchronization and calibration, and an RGB camera paired with a depth sensor is rarely used commercially for 2D-to-3D conversion. Therefore, a fast and cost-effective 2D-to-3D conversion technique is needed to revitalize the 3D display industry. A typical real-time 2D-to-3D conversion scheme is shown in Fig. 1. A depth-map sequence is generated from a 2D input video sequence; thereafter, a 3D stereoscopic video is generated by using DIBR [3]. Many depth-map generation methods have been developed by using various depth cues such as color, motion, relative object size, defocus of textured objects, and geometric perspective in the scene.
Fig. 1. Typical real-time 2D-to-3D conversion system
Machine-learning approaches have also been proposed for depth-map generation from 2D images. Harman et al. [4] proposed a fast machine-learning algorithm that uses the RGB color values of training input data. However, generating the training depth values for specific key frames is difficult, and the algorithm has to be retrained for each new 2D video. Saxena et al. [5] generated a depth image by learning the 3D scene structure from a single image frame. Image edges in a perspective view, or vanishing lines, can also provide depth cues. Cheng et al. [6] used edge information to segment foreground and background objects and performed region growing and depth assignment based on a hypothesized depth-gradient model (HDGM). Yu et al. [7] detected main lines and vanishing points by using edge information to assign the depth map of static backgrounds. However, edge information is difficult to use with a fast-moving camera or object because object boundaries are blurred.
Motion information is another important depth cue in the scene. Kim et al. [8] proposed a motion-estimation process by using color segmentation and the Kanade–Lucas–Tomasi feature tracker. Cheng et al. [9] combined the depth from motion parallax and from geometrical perspective by using block-based motion estimation and edge information, respectively; this method is useful for scenes with moving objects in a static background. Motion parallax is a useful depth cue, but it cannot be used to generate depth when the scene contains neither moving objects nor camera motion. Lai et al. [10] and Guo et al. [11] proposed depth-estimation algorithms based on the amount of defocus at image edges. This approach cannot be used when an image is captured with a wide-angle lens. Therefore, various approaches have been proposed to combine two or more depth cues to estimate a high-quality depth map. Multi-cue fusion methods have been proposed to generate depth for 2D-to-3D conversion [12]-[15]. However, these methods may produce faulty or reversed depth when parts of the same object have different colors. Real-time on-the-fly conversion is required for electronic 3D consumer devices. Real-time conversion algorithms have been proposed by using multi-threaded CPUs or GPUs for parallel processing [12], [16]. The motion vectors in the MPEG-4 standard have also been used to generate real-time depth maps [17]. Moreover, depth flickering should be reduced to mitigate eye fatigue during extended viewing sessions. Many researchers have developed automatic or real-time 2D-to-3D conversion methods. However, the lack of a fatigue-free 3D content generation approach is still a problem for the 3D display industry [16].
We present a real-time 2D-to-3D conversion scheme that reduces depth flickering across a temporal image sequence. The overall block diagram of the proposed scheme is shown in Fig. 2. The proposed method involves three steps: (1) object segmentation and region growing; (2) depth generation and updating; (3) DIBR. Object segmentation and region growing are conducted first by using color information. Thereafter, the regions are refined by using an accumulated difference map. An initial depth map is selected from among the HDGMs, and the depth of each segmented region is assigned by referring to the initial map. The previous depth map is then updated to generate the depth map of the next frame. By applying DIBR to these depth maps, we can produce eye-comforting 3D videos from 2D videos, as evidenced by the experimental results.
Fig. 2. Overall block diagram of the proposed 2D-to-3D conversion system
The remainder of this paper is organized as follows. Section II describes the proposed 2D-to-3D conversion system, which improves depth consistency by using color information and a difference map. Section III presents the experimental results and discussion. Section IV concludes the paper.
This section covers the details of the proposed system. The depth-generation method for 2D-to-3D conversion is based on color information and the {n-to-(n − 1)} difference map [16]. The input 2D video is divided into cuts by using a color histogram [19]. The first frame of each cut is segmented by using color information, and the depth of each segment is calculated from an HDGM. Both the segments and the depth of the next frame are updated by using a difference map. After depth maps are generated from the input 2D image sequence, each depth map is blurred by a bilateral filter [9], [18] to produce stereoscopic images by DIBR [3].
- 2.1 Cutting of Scene Boundary by using Color Histogram and Assigning of the Initial Depth Gradient Model
The possibility of having the same initial depth for different consecutive shots (cuts) is very low. Each cut of the input video is usually taken in a different place or view. If the shot of the video is changed, a new global depth map is assigned as the initial depth among HDGMs. Therefore, cutting an input video into separate shots is necessary to assign the initial depth map of the divided shot. The boundary detection algorithm is based on the difference of the color histogram between frames as a measure of discontinuity. The histogram difference is computed as follows:
[Eq. (1): definition of the histogram difference hdiff between consecutive frames; equation image not recovered]
where h_i is the color histogram, with N bins, of frame i of the input video sequence. We eliminate the 4 least significant bits of every RGB component. With this quantization, all possible colors are grouped into 2^12 (4,096) different color levels in the RGB space [19], which reduces memory requirements and processing time. If h_diff is higher than a threshold value, a new cut begins at that frame. The threshold value is defined heuristically by experimental analysis as 60 percent of the color histogram changes. Fig. 3 shows seven detected cuts.
Fig. 3. Detected cuts based on the difference of the color histogram. The “Wild Life” short video is divided into seven cuts according to the threshold value.
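The cut detection described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the exact form of Eq. (1) is not available here, so the sketch assumes a sum of absolute bin differences normalized to the range 0 to 1, and it interprets the 60 percent threshold against that normalized value. The function names are hypothetical.

```python
import numpy as np

def quantized_histogram(frame):
    """Histogram over 2**12 colors after dropping the 4 LSBs of each RGB channel.

    `frame` is an (H, W, 3) uint8 RGB image.
    """
    q = frame.astype(np.uint16) >> 4                     # 8 bits -> 4 bits per channel
    codes = (q[..., 0] << 8) | (q[..., 1] << 4) | q[..., 2]
    return np.bincount(codes.ravel(), minlength=4096)

def histogram_difference(frame_a, frame_b):
    """Assumed measure: normalized sum of absolute bin differences (0.0 to 1.0)."""
    h_a = quantized_histogram(frame_a)
    h_b = quantized_histogram(frame_b)
    n_pixels = frame_a.shape[0] * frame_a.shape[1]
    return np.abs(h_a - h_b).sum() / (2.0 * n_pixels)

def detect_cuts(frames, threshold=0.6):
    """Return the indices of frames that start a new cut."""
    return [i for i in range(1, len(frames))
            if histogram_difference(frames[i - 1], frames[i]) > threshold]
```

Dividing by twice the pixel count makes the measure independent of resolution, so the same threshold can be applied to the 1280 × 720 and 1920 × 1080 test videos.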
An automatic algorithm is proposed to select the initial depth map of a cut by using motion parallax information. In our algorithm, the first frame of the cut is divided into 3 × 3 blocks. The average motion vector of each block is estimated from a sampling of 16 × 9 pixels in the block, and the magnitude of the average vector is saved in the buffer entry corresponding to that block. The magnitudes in the 3 × 3 buffer are used as the features for training. Ground-truth values for hundreds of first-frame images are used as training data. Thereafter, the initial depth of each cut is selected from among six HDGMs by using the trained classifier. If neither the camera nor the objects in the cut move, the default map, the bottom-front model, is assigned (Fig. 2). In the bottom-front model, the nearest part of the image is at the bottom, and the depth changes smoothly from white at the bottom to black at the top. The analysis results indicate that the bottom-front model is assigned most frequently in real situations [6].
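The feature extraction and the default model described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the motion field `flow` is assumed to be given as an (H, W, 2) array, every pixel of a block is averaged instead of a 16 × 9 sampling, the stillness tolerance `eps` is a made-up value, and the trained classifier that chooses among the six HDGMs is not shown. All function names are hypothetical.

```python
import numpy as np

def block_motion_magnitudes(flow, grid=3):
    """Magnitude of the average motion vector in each cell of a grid x grid layout."""
    h, w = flow.shape[:2]
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    mags = np.zeros((grid, grid))
    for r in range(grid):
        for c in range(grid):
            block = flow[ys[r]:ys[r + 1], xs[c]:xs[c + 1]].reshape(-1, 2)
            mags[r, c] = np.linalg.norm(block.mean(axis=0))
    return mags

def is_static(flow, eps=0.5):
    """True when no block shows motion above the (assumed) tolerance eps."""
    return block_motion_magnitudes(flow).max() < eps

def bottom_front_depth(height, width):
    """Bottom-front model: 255 (near, white) at the bottom to 0 (far, black) at the top."""
    column = np.linspace(0, 255, height).round().astype(np.uint8)
    return np.repeat(column[:, None], width, axis=1)
```

For a cut in which `is_static` holds, `bottom_front_depth` would supply the initial depth map; otherwise the 3 × 3 magnitudes would be passed to the classifier.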
- 2.2 Object Segmenting and Depth Generation from the Initial Depth Map
Separating objects from the background is important for assigning depth values, and segmentation is needed to group the pixels that belong to an object or to the background. If the input image is segmented by using only color information, the likelihood of incorrect separation increases: a single object may be split into several segments, and some of those segments may be assigned the background depth value even though they belong to an object located in front of the background, which produces a depth reversal. Therefore, we propose a new method that uses motion history together with color information to separate the object and background accurately. A seeded region growing (SRG) algorithm is used first for segmentation in our system [20]. Seed points on the first frame are selected randomly, and the eight neighboring pixels of each seed point are checked to grow the region. The images segmented by SRG are shown in Fig. 4(c) and 4(d) for the original images shown in Fig. 4(a) and 4(b). A depth map of the cut at the first frame is generated by assigning to each segmented region the depth value found on the initial depth map at the position of the region's center. The depth maps generated from an initial map (DMFIM) are shown in Fig. 4(e) and 4(f), with the bottom-front model and the right-front model as initial maps, respectively. The depth value of each segment is assigned as follows:
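$$D(x, y) = D_{init}(\bar{x}_n, \bar{y}_n), \quad (x, y) \in S_n \qquad (2)$$
$$\bar{x}_n = \frac{1}{N} \sum_{(x, y) \in S_n} x \qquad (3)$$
$$\bar{y}_n = \frac{1}{N} \sum_{(x, y) \in S_n} y \qquad (4)$$
where $D(x, y)$ is the depth value of segment $S_n$ at position $(x, y)$, the center position of $S_n$ is $(\bar{x}_n, \bar{y}_n)$, $N$ is the number of pixels in $S_n$, and $D_{init}$ is the initial depth map.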
Fig. 4. Depth image generation by using an initial map from a 2D input image. (a) “See Sun,” (b) “Life Force,” (c) segmented image from “See Sun,” (d) segmented image from “Life Force,” (e) depth map generated by using an HDGM of the bottom-front model, and (f) depth map generated by using an HDGM of the right-front model.
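The DMFIM step, in which every pixel of a segment receives the initial-map depth found at the segment's center, can be sketched as follows. This is a minimal sketch that assumes the segmentation is already available as an integer label map (one label per SRG region) and that the center is the mean pixel position rounded to the nearest pixel; the function name is hypothetical.

```python
import numpy as np

def assign_segment_depth(labels, d_init):
    """Give every pixel of a segment the initial-map depth at the segment's center.

    labels : (H, W) integer array, one label per segmented region.
    d_init : (H, W) initial depth map (for example, a bottom-front HDGM).
    """
    depth = np.zeros_like(d_init)
    for label in np.unique(labels):
        ys, xs = np.nonzero(labels == label)
        cy = int(round(ys.mean()))      # center row of the segment
        cx = int(round(xs.mean()))      # center column of the segment
        depth[ys, xs] = d_init[cy, cx]
    return depth
```

Note that for a strongly non-convex segment the center may fall outside the segment itself; the sketch still reads the initial map at that position.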
With color segmentation, one object can be divided into several small regions, so the segmented regions of the same object can be assigned different depth values. To resolve this problem, we use motion information in the form of the accumulated difference map (ADM). The difference map is the color difference between the previous and current frame images. The difference image d_i is calculated in the RGB space as follows:
[Eq. (5): definition of the difference image d_i; equation image not recovered]
where i is the frame number. The difference images are accumulated to record the history of the moving objects. The ADM can be represented as follows:
[Eq. (6): definition of the ADM as the accumulation of the difference images d_i; equation image not recovered]
An ADM is converted to a binary map consisting of motion and non-motion regions by using a suitable threshold. The white and black regions in the binary ADM are the motion and non-motion regions, respectively (Fig. 5(c)). Even when the same object is divided into several regions because of color differences, the separated regions that lie within the motion region are merged into a single region and therefore receive the same depth value. For example, the face of an actor is segmented into several parts S1, S2, S3, and so on (Fig. 5(a)) by using color information, although it should be segmented as a single region. As a result, the depth of the blue-lined zone differs from the depth of the red-lined zone, and this type of error causes dizziness in viewers. The final depth is the refined depth map (RDM) (Fig. 5(d)), which reduces the depth-reverse problem. This refinement is applied only when motion is observed in the image sequence.
Fig. 5. The generated depth map is refined by using the ADM. (a) Depth-map generation by using an HDGM; (b) ADM; (c) binary map of the ADM, accumulated over the difference maps of 40 frames; (d) refined depth map (RDM).
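The ADM-based refinement can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the difference image is assumed to be the absolute RGB difference summed over the three channels, the ADM is assumed to be a plain sum of difference images, the threshold is left to the caller, and all motion pixels are merged into one region. The function names are hypothetical.

```python
import numpy as np

def difference_image(prev, curr):
    """Assumed form: absolute RGB difference summed over the three channels."""
    return np.abs(curr.astype(np.int32) - prev.astype(np.int32)).sum(axis=2)

def accumulated_difference_map(frames):
    """Accumulate the difference images of consecutive frames (motion history)."""
    adm = np.zeros(frames[0].shape[:2], dtype=np.int64)
    for prev, curr in zip(frames, frames[1:]):
        adm += difference_image(prev, curr)
    return adm

def binary_motion_map(adm, threshold):
    """True marks the motion region, False the non-motion region."""
    return adm > threshold

def merge_segments_in_motion(labels, motion):
    """Relabel all segments inside the motion region as one new segment."""
    merged = labels.copy()
    if motion.any():
        merged[motion] = labels.max() + 1
    return merged
```

After merging, the depth of the new region can be assigned from the initial map in the same way as for any other segment, so that an object split by color receives a single depth value.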
- 2.3 Depth Updating by using Difference Maps
A depth-flickering problem occurs when the SRG algorithm is used for segmentation and DMFIM is used to calculate the depth map independently for each consecutive image. Maintaining depth consistency across the input video sequence is therefore important for 3DTV viewing. We update the depth values only of the parts that exhibit color changes between consecutive input images. We detect the color-changed region (CCR) by using the color difference between the previous and current image frames (Eq. (5)). Thereafter, we convert the CCR, shown as the red region, to a binary difference map (Fig. 6(b)). The CCR is usually detected when a video sequence includes large color changes caused by fast motion, shadow changes, or lighting changes. In the next frame, we recompute the depth by using DMFIM for the CCR only and keep the depth of the previous frame everywhere else. We call this approach the time-coherent depth-updating method (TCDUM); it improves the temporal coherence of the depth while reducing depth flickering. Fig. 6 shows the update process of the segmented region and the generated depth by using the difference map. The red region is newly segmented, and the depth map is updated accordingly (Fig. 6(c) and 6(d)). The proposed method is effective in generating depth maps rapidly and in reducing depth flickering from frame to frame (Fig. 7). The depth maps in the zone marked by the red dashed line display flickering (Fig. 7(a) to 7(c)). The depth maps generated by using TCDUM (Fig. 7(d) to 7(f)) exhibit better coherence than the depth maps generated by using DMFIM alone. We then blur the generated depth map by using the bilateral filter to preserve the boundary of the area [6]. The final 3D video is generated by using DIBR [3]. The bilateral-filtered depth map minimizes the size of the holes caused by the interocular distance of the virtual stereoscopic camera.
Fig. 6. (a) “Wild Life” video at frame numbers 20 and 21; (b) difference maps; (c) updated segmented images; (d) updated depth maps.
Fig. 7. Generated depth maps: (a)-(c) show flickering without difference maps, and (d)-(f) show reduced flickering with difference maps.
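The time-coherent update can be sketched as follows: depth is taken from the freshly computed DMFIM result only inside the CCR, and the previous frame's depth is kept elsewhere. This is a minimal sketch under stated assumptions: the color difference is computed on channels reduced to 4 bits (as described for the hardware-oriented implementation in Section 3.1) and summed over RGB, and the threshold value is left to the caller. The function names are hypothetical.

```python
import numpy as np

def color_changed_region(prev, curr, threshold):
    """Binary CCR map from the color difference of two consecutive RGB frames."""
    prev4 = prev.astype(np.int16) >> 4      # keep the 4 most significant bits
    curr4 = curr.astype(np.int16) >> 4
    diff = np.abs(curr4 - prev4).sum(axis=2)
    return diff > threshold

def update_depth(prev_depth, fresh_depth, ccr):
    """Use newly generated depth inside the CCR and the previous depth elsewhere."""
    return np.where(ccr, fresh_depth, prev_depth)
```

Because pixels outside the CCR carry over their previous depth unchanged, small segmentation differences between frames cannot alter the depth of static areas, which is the source of the reduced flickering.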
In this section, we describe the experimental results used to evaluate the algorithm in terms of real-time processing, depth consistency over time, and visual comfort for several test videos. We use one 1280 × 720 video (“Wild Life”) and six 1920 × 1080 videos (including “Life Force,” “Death Valley,” “See Sun,” and “Life Master”).
- 3.1 Real-Time Processing
The proposed algorithm is optimized to minimize computation. For hardware implementation, the difference image is calculated with the number of bits per RGB channel reduced from 8 to 4. The computational load is reduced by almost a factor of 10 when the difference image is used: for the 900-frame “Wild Life” video, the number of processed pixels is approximately 25 × 10^6 with TCDUM and 23 × 10^7 with DMFIM. When the proposed system is implemented with 8-core parallel processing on a computer with a 3.46 GHz dual-core GPU, 8 GB RAM, and a 64-bit OS, the frame rate is approximately 25 fps for 1280 × 720 videos and 15 fps for 1920 × 1080 videos. The delay time is about 1.5 seconds because of the accumulation of the difference image. Therefore, a consumer can check the output in real time, with a small delay, while watching half-sized HD videos on 3DTV. In the future, the proposed method should be able to convert a full-HD movie in real time when it is implemented on a chip in a 3DTV.
- 3.2 Depth-Map Coherence
We evaluate the seven test videos mentioned above and show the results for five of them in Fig. 8. The first four columns show the depth maps of four consecutive frames, and the last column shows the stereoscopic side-by-side images for the five test videos (Fig. 8). Depth flickering is observed in the blue dashed region with DMFIM (Fig. 8(a)), whereas depth coherence is observed with TCDUM (Fig. 8(b)). Similar results are obtained for the other four test videos (Fig. 8(c)-(j)). However, the method has a limitation: an error in the depth map of the first frame propagates to the following frames, as indicated by the red dashed line (Fig. 8(h)). Thus, our system is sensitive to the depth map of the first frame of each sequence. The 3D videos rendered from the time-coherent depth sequences are more comfortable to the human eye than the 3D videos rendered from flickering depth sequences.
Fig. 8. Generated depth sequences comparing DMFIM (a), (c), (e), (g), and (i) with TCDUM (b), (d), (f), (h), and (j).
Fig. 9. Two-symptom questionnaire about eye fatigue and headache.
- 3.3 Visual Quality Assessment
The side-by-side stereoscopic videos generated by using DIBR were displayed on a 55-inch polarized LCD 3D display. Seven stereoscopic videos were evaluated by 20 people. We used the symptom questionnaires shown in Fig. 9(a) and 9(b) [21]. The participants watched the stereoscopic videos in random order and were then asked two questions, each graded on five levels from 20 points to 100 points. The experimental results show that the proposed algorithm provides comfortable 3D videos (Fig. 10 and Fig. 11). Our method obtains better 3D visual quality than the other methods in Fig. 10 and better visual comfort than the previous method that does not use difference images. Because the output 3D images of TriDef DDD [6] have only a small 3D effect, participants rated the visual comfort of TriDef DDD as similar to that of our results (Fig. 11).
Fig. 10. Evaluation result of eye fatigue for 20 participants.
Fig. 11. Evaluation result of headache for 20 participants.
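A video that contains still objects with large color differences within a single object to be segmented is not well suited to 2D-to-3D conversion by the proposed system, because the motion-based refinement cannot merge the segments of an object that does not move.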
We have proposed a time-coherent depth-map generation method that converts 1280 × 720 2D video to 3D video in real time. We have presented a depth-updating algorithm that uses the difference image between frames to reduce the flickering of the generated depth sequences. With our method, consumers can comfortably watch converted 3D videos on 3DTV without depth flickering. We have also proposed a depth-refinement algorithm to correct the depth-reverse errors that arise when regions are segmented by using color information alone. For moving objects in the scene, the refinement algorithm, which uses motion history, yields better 3D video quality than color information alone. In future work, we will generate multi-view images from a single generated depth map.
This work was supported by the IT R&D program of MSIP/IITP. [R2012030006, 3D Scene Analysis and Model Reconstitution Techniques in Stereoscopic 3D Creation and Synthetic Techniques]
Seung-Woo Nam
He received the BS and MS degrees in electronics engineering from Kyungpook National University, Daegu, South Korea, in 1996 and 1998, respectively. He is currently a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, South Korea. His research interests include 3D video processing, human–computer interface, and computer graphics.
Hye-Sun Kim
She received the BS and MS degrees in computer science from Pusan National University, Pusan, South Korea, in 1999 and 2001, respectively. She is currently a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, South Korea. Her research interests include 3D video processing and computer graphics.
Yoon-Ji Ban
She received the BS and MS degrees in electronics engineering from Kyungpook National University, Daegu, South Korea, in 1996 and 1998, respectively. She is currently a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, South Korea. Her research interests include 3D video processing, human–computer interface, and computer graphics.
Sung-Il Chien
He received his BS degree from Seoul National University, Seoul, South Korea, in 1977, his MS degree from KAIST, Daejeon, South Korea, in 1981, and his PhD degree in electrical and computer engineering from Carnegie Mellon University in 1988. Since 1981, he has been with the School of Electronics Engineering, Kyungpook National University, Daegu, South Korea, where he is currently a professor. His research interests include computer vision, pattern recognition, and color image processing.
[1] Park J-I., Lee G. M., Ahn C-H., Ahn C. 2004 “Virtual Control of Optical Axis of the 3DTV Camera for Reducing Visual Fatigue in Stereoscopic 3DTV,” ETRI Journal 26 (6) 597-604. DOI: 10.4218/etrij.04.0603.0024
[2] Kim S-Y., Lee S-B., Ho Y-S. 2006 “Three Dimensional Natural Video System Based on Layered Representation of Depth Maps,” IEEE Trans. Consumer Electronics 52 (3) 1035-1042. DOI: 10.1109/TCE.2006.1706504
[3] Zhang L., Tam W. J. 2005 “Stereoscopic Image Generation Based on Depth Images for 3D TV,” IEEE Trans. Broadcasting 51 (2) 191-199. DOI: 10.1109/TBC.2005.846190
[4] Harman P., Flack J., Fox S., Dowley M. 2002 “Rapid 2D to 3D conversion,” Proc. SPIE 4660 78-86
[5] Saxena A., Sun M., Ng A. Y. 2009 “Make3D: Learning 3D Scene Structure from a Single Still Image,” IEEE Trans. Pattern Analysis and Machine Intelligence 31 (5) 824-840. DOI: 10.1109/TPAMI.2008.132
[6] Cheng C.-C., Li C.-T., Huang P.-S., Li T.-K., Tsai Y.-M., Chen L.-G. 2010 “A Novel 2D-to-3D Conversion System Using Edge Information,” IEEE Trans. Consumer Electronics 56 (3) 1739-1745. DOI: 10.1109/TCE.2010.5606320
[7] Yu F., Liu J., Ren Y., Sun J., Gao Y., Liu W. 2011 “Depth generation method for 2D to 3D conversion,” Proc. IEEE 3DTV Conference 1-4
[8] Kim D., Min D., Sohn K. 2008 “Stereoscopic Video Generation Method Using Stereoscopic Display Characterization and Motion Analysis,” IEEE Trans. Broadcasting 54 (2) 188-197. DOI: 10.1109/TBC.2007.914714
[9] Cheng C.-C., Li C.-T., Huang P.-S., Lin T.-K., Tsai Y.-M., Chen L.-G. 2009 “A Block-based 2D to 3D conversion system with bilateral filter,” Proc. IEEE International Conference on Consumer Electronics 1-2
[10] Lai S. H., Fu C. W., Chang S. 1992 “A Generalized Depth Estimation Algorithm with A Single Image,” IEEE Trans. Pattern Analysis and Machine Intelligence 14 (4) 405-411. DOI: 10.1109/34.126803
[11] Guo G., Zhang N., Huo L., Gao W. 2008 “2D to 3D conversion based on edge defocus and segmentation,” Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing 2181-2184
[12] Tsai S.-F., Cheng C.-C., Li C.-T., Chen L.-G. 2011 “A Real-time 1080p 2D-to-3D Video Conversion System,” IEEE Trans. Consumer Electronics 57 (2) 915-922. DOI: 10.1109/TCE.2011.5955240
[13] Zhang Z., Wang Y., Jiang T., Gao W. 2011 “Visual pertinent 2D-to-3D video conversion,” Proc. IEEE International Conference on Image Processing 909-912
[14] Po L.-M., Xu X., Zhu Y., Zhang S., Cheung K.-W., Ting C.-W. 2010 “Automatic 2D-to-3D video conversion technique based on depth-from-motion and color segmentation,” Proc. IEEE International Conference on Signal Processing 1000-1003
[15] Chang Y.-L., Chen W.-Y., Chang J.-Y., Tsai Y.-M., Lee C.-L., Chen L.-G. 2008 “Priority depth fusion for the 2D-to-3D conversion system,” Proc. SPIE 6805
[16] Nam S.-W., Kim H.-S., Ban Y.-J., Chien S.-I. 2013 “Realtime 2D to 3D conversion for 3DTV using time coherent depth map generation method,” Proc. IEEE International Conference on Consumer Electronics 187-188
[17] Ideses I., Yaroslavsky L. P., Fishbain B. 2007 “Real-time 2D to 3D Video Conversion,” Journal of Real-time Image Processing 2 (1) 3-9. DOI: 10.1007/s11554-007-0038-9
[18] Tomasi C., Manduchi R. 1998 “Bilateral filtering for gray and color images,” Proc. IEEE International Conference on Computer Vision 839-846
[19] Mas J., Fernandez G. 2003 “Video shot boundary detection based on color histogram,” Proc. TREC Video Retrieval Evaluation Conference
[20] Adams R., Bischof L. 1994 “Seeded Region Growing,” IEEE Trans. Pattern Analysis and Machine Intelligence 16 (6) 641-647. DOI: 10.1109/34.295913
[21] Shibata T., Kim J., Hoffman D. M., Banks M. S. 2011 “The Zone of Comfort: Predicting Visual Discomfort With Stereo Displays,” Journal of Vision 11 (8) 1-29. DOI: 10.1167/11.8.11