3D conversion technology has been studied over the past decades and integrated into commercial 3D displays and 3DTVs. 3D conversion plays an important role in the augmented functionality of three-dimensional television (3DTV) because it can easily provide 3D content. Generally, depth cues extracted from a static image are used to generate a depth map, followed by DIBR (Depth Image Based Rendering) to produce a stereoscopic image. However, except in some particular images, depth cues are rare, so consistent depth-map quality cannot be guaranteed. It is therefore imperative to design a 3D conversion method that produces satisfactory and consistent 3D for diverse video contents. From this viewpoint, this paper proposes a novel method applicable to general types of images, utilizing saliency as well as edge. To generate a depth map, geometric perspective, an affinity model, and a binomic filter are used. In the experiments, the proposed method was performed on 24 video clips with a variety of contents. A subjective test of 3D perception and visual fatigue validated satisfactory and comfortable viewing of the converted 3D contents.
The generation of a stereoscopic image from a 2D image has been investigated over the past decades owing to the success of 3D TVs and displays. Most conversion methods derive a depth map of each frame and then use DIBR (Depth Image Based Rendering) to synthesize the stereoscopic view.
To generate a depth map from a given 2D image, diverse methods have been proposed according to the principles of the human visual system: depth from motion, depth from defocus, depth from geometrical linear perspective and gradient plane assignment, depth from shadow, and so forth. Most depth estimation algorithms combine multiple monocular depth cues of this kind. Therefore, they are expected to work well for particular images that contain suitable cues. In other words, if the depth cues do not deliver sufficient information, the algorithms may fail, producing uncomfortable 3D images. Furthermore, accurate detection of the depth cue type is also needed and is itself a difficult task; if the type is misclassified, a wrong depth map is obtained. This observation motivates the design of a global conversion method that produces satisfactory 3D perception regardless of image contents and depth cues.
The proposed method is composed of four main components: (1) visual saliency estimation, (2) affinity model and binomic filter, (3) edge modeling, and (4) depth generation. The overall block diagram is shown in Fig. 1. Given an RGB image, an edge map is obtained from the grayscale image, and a saliency map is extracted from the RGB image. Because the saliency map lacks distance information, we incorporate a geometric perspective cue into it. Then a binomic filter and an affinity model are applied to the saliency map to reduce the saliency discontinuities between neighboring pixels. This result is combined with an edge map, and the transformed edge map is binomic-filtered together with the saliency map. Finally, a depth map is generated, and left and right images are constructed by a DIBR method.
Fig. 1. Block diagram of the proposed method
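As a rough illustration of the final DIBR step, the following toy sketch shifts pixels horizontally in proportion to depth to synthesize left and right views. This is an assumed, simplified rendering, not the actual view synthesis software used in the paper; the function name and the linear parallax model are illustrative, and hole filling is omitted.

```python
import numpy as np

def dibr_left_right(img, depth, max_parallax=8):
    """Toy DIBR sketch: shift each pixel horizontally by a disparity
    proportional to its depth to synthesize left/right views.
    Holes are left at 0; real DIBR adds hole filling."""
    H, W = depth.shape
    disp = np.rint(depth * max_parallax).astype(int)   # per-pixel disparity
    left = np.zeros_like(img)
    right = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            xl, xr = x + disp[y, x], x - disp[y, x]
            if 0 <= xl < W:
                left[y, xl] = img[y, x]                # warp toward the left view
            if 0 <= xr < W:
                right[y, xr] = img[y, x]               # warp toward the right view
    return left, right

img = np.arange(16.0).reshape(4, 4)
depth = np.full((4, 4), 0.25)                          # uniform depth -> disparity 1
L, R = dibr_left_right(img, depth, max_parallax=4)
print(L[0])  # left view: hole at column 0, content shifted right by one pixel
```

Viewers perceive depth from the horizontal offset between the two synthesized views; the `max_parallax` parameter plays the role of the depth-strength control discussed in the experiments.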
Ⅱ. Saliency Map Generation
Saliency generation has gained much interest over the past decades in diverse fields. Contrast is an important factor that affects visual attention in static images: whether an object is perceived as salient depends greatly on its distinctiveness from the background. Color is one of the main features for saliency detection; red/green and blue/yellow are two strong contrast color pairs. Recently, 2D-to-3D conversion researchers have applied saliency to depth map generation.
In this paper, the global contrast-based method proposed by Zhai and Shah is adopted because, unlike more complex methods, it is suitable for real-time processing. The method is simple to implement yet efficient as a baseline saliency producer, and its performance is comparable to that of other methods.
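As a rough illustration of a global contrast-based saliency measure (a minimal sketch in that spirit, not the authors' exact implementation), the following scores each pixel by its color distance from the global mean:

```python
import numpy as np

def global_contrast_saliency(img):
    """Per-pixel saliency as the Euclidean distance from the
    global mean color; img is an H x W x 3 float array in [0, 1]."""
    mean_color = img.reshape(-1, 3).mean(axis=0)     # global mean of each channel
    sal = np.linalg.norm(img - mean_color, axis=2)   # distance per pixel
    return sal / (sal.max() + 1e-12)                 # normalize to [0, 1]

# toy usage: a bright patch on a dark background is salient
img = np.zeros((8, 8, 3))
img[2:5, 2:5] = 1.0
sal = global_contrast_saliency(img)
print(sal[3, 3] > sal[0, 0])  # patch pixels score higher than background
```

Because the score depends only on global statistics, it can be computed in a single pass, which is what makes this family of methods attractive for real-time conversion.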
The important fact is that only saliency containing distance (or depth) information is useful. As observed in Fig. 2, natural scenes contain salient objects at arbitrary locations and unknown relative distances. For instance, the salient objects (manually marked with yellow boxes) can be located at the top or bottom as well as at the left, middle, or right. Note that the choice of salient objects depends on human judgment, especially in complex scenes. Therefore, an additional procedure is needed to compensate for the lack of geometric information in the saliency map.
Fig. 2. Examples of images showing the salient objects (manually marked with yellow boxes). Different individuals can select different salient objects, and the positions of the salient objects vary across images.
Most natural images share the general property that, due to inherent geometric perspective, the top area is far from the camera and the bottom region is close to it. Examples are shown in Fig. 2: the upper regions of the four example images are farther away than the lower regions. This characteristic is utilized in the saliency construction. Furthermore, the location of salient objects imposes uncertainty on the background regions; the yellow boxes in the four images lie at the left, middle, middle, and right, respectively. Therefore, to handle all possible cases, the three regions Ⅰ, Ⅱ, and Ⅲ illustrated in Fig. 3 are combined to compensate for the locational uncertainty of salient objects. Given a W x H image, the three regions Ⅰ, Ⅱ, and Ⅲ are configured in the upper half.
Fig. 3. Regions Ⅰ, Ⅱ, and Ⅲ used to cover the possible locations of salient objects. Region Ⅰ = [0,0]×[τ,H/2], Region Ⅱ = [τ,0]×[W-τ,H/2], and Region Ⅲ = [W-τ,0]×[W-1,H/2]
The weights of the three regions are illustrated in Fig. 4, where a weight w[i] is assigned to the ith region.
Fig. 4. Weight functions w[1], w[2], and w[3] are associated with regions Ⅰ, Ⅱ, and Ⅲ, respectively
One family of saliency methods directly uses the Red, Green, and Blue channels. In the baseline method, a mean value is computed from the entire image. In our method, by contrast, the mean of each channel x, denoted mean(x), is computed only from the three upper regions. The purpose of computing the mean only from an upper region is to obtain higher saliency in the lower region. Weighted saliency maps of the three channels are then computed from these means and the region weights w[i].
The first saliency map S1 can be made either as the maximum of the three channel saliency maps or as their average.
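The upper-region mean idea can be sketched as follows. For simplicity, the three regions are pooled into the upper half with equal weight and the per-region weights w[i] are omitted, so this is an assumed simplification of the paper's Eqs. (1)-(4), not the exact formulation; the function name is illustrative.

```python
import numpy as np

def first_saliency(img):
    """Sketch of the first saliency map S1: channel means are taken
    only over the upper half (regions I-III pooled), so the lower
    half tends to deviate more and scores higher."""
    H, W, _ = img.shape
    upper = img[: H // 2]                        # upper-half pixels only
    means = upper.reshape(-1, 3).mean(axis=0)    # per-channel upper means
    per_channel = np.abs(img - means)            # |channel - upper mean|
    s1 = per_channel.max(axis=2)                 # max over R, G, B (average also possible)
    return s1 / (s1.max() + 1e-12)

img = np.zeros((8, 8, 3))
img[6:, :] = 0.9                                 # bright content in the lower region
s1 = first_saliency(img)
print(s1[7, 0] > s1[0, 0])  # lower region deviates from the upper-half mean
```

Because the reference statistics come only from the top of the frame, nearer (lower) content naturally receives higher saliency, which is exactly the geometric-perspective bias the paper wants.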
In this first method, R, G, and B are directly utilized. However, a single saliency map is not sufficient for a satisfactory outcome, so we adopt a second approach that utilizes a transformation of RGB, from which a different saliency map can be produced. Using Eq. (5), the transformed colors a, b, and c are derived from the RGB channels. Similar to Eq. (2), the averages of a, b, and c are derived for the three regions, and for each transformed channel the saliency map is computed as a weighted average.
The second saliency map S2 is obtained either as the maximum of A, B, and C or as their average.
To obtain the best saliency map, we tested a variety of combinations of S1 and S2 and found that the following relation outperforms the others: the final saliency map S is obtained from S1 multiplied by the normalized S2,

S = S1 × S2 / max(S2),

where max(S2) is the maximum value of S2.
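A minimal sketch of this fusion, assuming S1 and S2 are already computed as above (the function name is illustrative):

```python
import numpy as np

def fuse_saliency(s1, s2):
    """Final saliency: S1 modulated by S2 normalized to [0, 1]
    by its maximum (sketch of the fusion described above)."""
    return s1 * (s2 / (s2.max() + 1e-12))

s1 = np.array([[0.2, 0.8], [0.4, 1.0]])
s2 = np.array([[0.5, 0.5], [2.0, 1.0]])
fused = fuse_saliency(s1, s2)
print(fused)  # each S1 value scaled by the normalized S2 value
```

The multiplication keeps a region salient only if both maps agree, which suppresses spurious responses that appear in just one of the two maps.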
We performed the saliency generation methods on diverse images whose scene complexity varies from low to high; the results are shown in Fig. 5.
Fig. 5. Diverse saliency maps obtained by the proposed method. (a) RGB input image. (b) and (c) are obtained from Eqs. (4) and (8), respectively, and (d) is the final saliency map made by Eq. (9)
Ⅲ. Binomic Filter and Affinity Model
As observed in Fig. 5, consistent saliency values are lacking in most images. For instance, different values appear inside the woman in the first row of Fig. 5, so additional processing is needed. We summarize the problems as follows: (1) an identical object has different saliency values, with the inconsistency especially apparent at the boundary; (2) the saliency of the inner region of a foreground object is not consistent with its boundary; and (3) the background is relatively homogeneous except for some particular regions, but still needs constant values. Such problems prevent the saliency maps from being used directly as depth. To solve this, we employ a binomic filter and an affinity model. The binomic filter is applied to the saliency map to resolve the inconsistency between the inner region and the boundary; its aim is to fill in the inner region using the boundary values. The affinity model, in turn, smooths the discontinuities between nearby pixels.
- A) Binomic Filter
Since the saliency is computed from color, different saliency values may spread over an identical object. One effective way to alleviate this is a binomic filter, whose elements are binomial numbers created as the sum of the two corresponding numbers in Pascal's triangle. The effect of the binomic filter is illustrated in Fig. 6. Suppose the distribution of S along the x axis is as in Fig. 6(a); after the filter is applied, it changes to Fig. 6(b). The large variation between pixel values is much reduced.
Fig. 6. The binomic filter lessens the difference between neighboring pixels
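The 1-D smoothing behavior can be sketched with a binomial kernel built from Pascal's triangle. This is a standard binomial smoothing kernel used as an illustration; the kernel order used in the paper is not specified here.

```python
import numpy as np

def binomic_filter_1d(s, order=4):
    """Smooth a 1-D signal with a binomial kernel whose weights
    are a row of Pascal's triangle."""
    kernel = np.array([1.0])
    for _ in range(order):                 # repeated [1, 1] convolution builds the row
        kernel = np.convolve(kernel, [1.0, 1.0])
    kernel /= kernel.sum()                 # order=4 -> [1, 4, 6, 4, 1] / 16
    return np.convolve(s, kernel, mode="same")

s = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # sharp saliency spike
out = binomic_filter_1d(s)
print(out)  # spike spread over neighbors: [0.0625 0.25 0.375 0.25 0.0625]
```

Each repeated convolution with [1, 1] raises the kernel one row in Pascal's triangle, so higher orders approximate a Gaussian and smooth more aggressively.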
We extend this filter to an image as follows. Based on an N x N pixel block, we convolve S(i,j) with its scaled version S_τ. For the sake of clarity, the scaling used here is value scaling: in 1-D, if S = [30, 140, 80], then S_τ becomes [30/3, 140/3, 80/3] at the scale τ = 1/3. τ lies in [0,1]; the larger it is, the more the output is saturated. The result is shown in Fig. 7: the saliency discontinuities of Fig. 7(a) are much alleviated in Fig. 7(b).
Fig. 7. Results of the binomic filter. (a) input image, (b) image after binomic filtering (τ = 0.5), and (c) image after affinity modeling (N = 4)
- B) Affinity Model
Pixels with the same RGB have identical saliency values. Therefore, two neighboring pixels with different colors but belonging to the same object may have different saliency values, producing inconsistent depth. To solve this problem, we employ an affinity model to alleviate the boundary discontinuities within an object region.
When using saliency data, it is important to define an affinity model that integrates local grouping cues such as saliency and boundary. As mentioned, a single object can be represented by multiple saliency values, resulting in different depth values; the affinity model used in segmentation can solve this problem, since nearby pixels with similar saliency values likely belong to the same segment. The color-based affinity model is defined by an exponential function,

w(i,j) = exp(-||x_i - x_j||^2 / 2σ_x^2 - (s_i - s_j)^2 / 2σ_s^2),

where x_i and s_i denote the position and saliency value of pixel i, respectively, and σ_x and σ_s control the weights of the two factors.
A better affinity model is designed by combining the two models: they can be simply combined with a parameter α to produce a combined model, where α is a weight in [0, 1].
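Assuming the Gaussian-exponential form above, a pairwise affinity between two pixels can be sketched as follows; the σ values are illustrative choices, not the paper's settings.

```python
import numpy as np

def affinity(xi, xj, si, sj, sigma_x=3.0, sigma_s=0.1):
    """Exponential affinity between pixels i and j, combining
    spatial distance and saliency difference."""
    d_x = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)   # squared spatial distance
    d_s = (si - sj) ** 2                                   # squared saliency difference
    return np.exp(-d_x / (2 * sigma_x**2) - d_s / (2 * sigma_s**2))

# neighboring pixels with similar saliency -> affinity near 1;
# a large saliency gap drives the affinity toward 0
print(affinity((5, 5), (5, 6), 0.40, 0.42) > affinity((5, 5), (5, 6), 0.40, 0.90))
```

The two σ parameters trade off how far the smoothing reaches spatially against how tolerant it is of saliency differences, mirroring the role of the two factors described above.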
We apply the affinity model to the binomic-filtered image using convolution. The resulting image is shown in Fig. 7(c); the filtered output is smoother. Two close-up images are shown in Fig. 8, where the surfaces of the resulting images become smoother after the affinity model.
Fig. 8. Close-up of the second image in Fig. 7. (a) input image and (b) the affinity model reduces the pixel value variation
Ⅳ. Edge Map Transformation
Edge plays an important role in the proposed method. The presence of a connected edge boundary in an object provides useful information; on the other hand, note that the edge itself does not carry any depth information. Based on this fact, the edge processing that helps depth generation focuses on smooth edge preservation as well as the adaptation of the edge map to the saliency map.
The edge processing procedure is shown in Fig. 9. Given an edge map, we decompose it into multiple subimages in the vertical direction. Bezier surface modeling is then applied to the entire image: considering the edge ratio of each subimage, we adapt the edge map to the surface model, deriving a transformed edge map.
Fig. 9. Edge map transformation
An image is decomposed into K subimages in the vertical direction as in Fig. 10. Then the maximum saliency value SB_k is computed for each kth subimage.
Fig. 10. An image is decomposed into K subimages in the vertical direction
An edge ratio is then derived from each subimage. Since edges contain no depth information, the 3D depth is diminished if edge values are used directly; therefore, if the edge ratio is large, we decrease the saliency value, and vice versa. For each subimage, the edge ratio r_k is computed by

r_k = E_k / N_k,

where N_k is the number of pixels and E_k is the number of edge pixels in the kth subimage; r_k lies in [0, 1].
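The per-subimage edge ratio can be sketched as follows, assuming a binary edge map (the helper name is illustrative):

```python
import numpy as np

def edge_ratios(edge_map, K=4):
    """Edge ratio per vertical subimage: (edge pixels) / (total pixels).
    edge_map is a binary H x W array split into K horizontal bands."""
    bands = np.array_split(edge_map, K, axis=0)
    return [b.mean() for b in bands]        # mean of a 0/1 band = ratio in [0, 1]

edge = np.zeros((8, 8))
edge[0, :] = 1                              # all edges fall in the top band
print(edge_ratios(edge, K=4))  # [0.5, 0.0, 0.0, 0.0]
```

Since each band's mean of a 0/1 map is exactly the fraction of edge pixels, the ratio is guaranteed to lie in [0, 1] as stated above.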
For each subimage, we compute a value Q_k, the maximum saliency value multiplied by a factor of the edge ratio r_k; Q_k acts as a control point for surface modeling. In this way, the saliency is more dominant in subimages with dense edges and less dominant in sparse regions. As verified in the experiments, this adds more 3D perception. The surface is modeled by a Bezier curve or surface using the K control points, generating a continuous surface SB. Finally, a binomic filter is applied to the saliency and edge maps, and the final depth map is constructed.
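The control-point-to-surface step can be sketched with a 1-D Bezier curve for clarity; the paper's Bezier surface generalizes this to 2-D, and the sample control points below are illustrative.

```python
import numpy as np
from math import comb

def bezier_curve(control_points, n_samples=100):
    """Evaluate a Bezier curve from K+1 control points using the
    Bernstein basis B_i(t) = C(K, i) t^i (1 - t)^(K - i)."""
    P = np.asarray(control_points, dtype=float)
    K = len(P) - 1                                        # curve degree
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    i = np.arange(K + 1)
    B = np.array([comb(K, k) for k in i]) * (t ** i) * ((1 - t) ** (K - i))
    return B @ P                                          # weighted sum of control points

# control points Q_k from four subimages -> smooth depth profile
q = [0.2, 0.9, 0.5, 0.1]
curve = bezier_curve(q)
print(curve[0], curve[-1])  # endpoints interpolate the first and last Q_k
```

A Bezier curve always passes through its first and last control points and stays inside their convex hull, which is why it yields a smooth, bounded depth profile from the per-subimage control values.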
Ⅴ. Experimental Results
The proposed method has been tested on twenty-four video clips, as listed in Table 1. Stereoscopic images of some test frames are shown in Fig. 11. The 3D formats are interlaced and anaglyph. The resolution of all sequences is FHD (Full High Definition), 1920 x 1080, and the duration of the video clips ranges from 300 to 10,000 frames.
Table 1. 3D subjective evaluation results
Fig. 11. Output stereoscopic images. (left) interlaced images and (right) anaglyph images
To measure the 3D subjective results, we examined the 3D perception grade as well as visual fatigue. Twenty subjects participated in the experiment; they are 3D experts from industry and academia with extensive experience in viewing and evaluating 3D contents. The test videos were taken from commercial movies, TV dramas, sports, and animation to demonstrate that the evaluation is independent of content type. The viewing time of each sequence is proportional to its number of frames, and the viewing distance is 3 meters from the display.
An SSCQS (Single Stimulus Continuous Quality Scale) subjective test was performed: human subjects observed the stereoscopic videos on an LG FHD 40” 3DTV and evaluated 3D perception and visual fatigue on a five-point scale. For 3D perception, a grade of 5 is very good and 1 is bad; for visual fatigue, 5 is no fatigue and 1 is severe fatigue. As shown in Table 1, the average 3D perception grade is 3.68 and the average visual fatigue grade is 3.49. A 3D perception grade of 3.68 lies between mild and good 3D; considering the performance limitations of automatic 3D conversion, this grade is satisfactory. Likewise, a visual fatigue grade of 3.49 lies in the range of mild to little fatigue. One functionality of 3D conversion is the ability to control the depth range: if viewers feel any visual discomfort, they can adjust the strength of depth through the maximum parallax.
In this paper, a novel 3D conversion method was proposed. The method stems from the fact that, except in some particular images, depth cues are insufficient, motivating the need for a general conversion method. Our method meets this requirement using saliency and edge modeling: two saliency maps are fused into a single map that serves as the baseline for depth map generation, and geometric perspective is integrated into the saliency map to follow general natural scenes. Edge modeling is based on the saliency map as well as a surface representation. By combining the saliency and the edge surface, a satisfactory depth map is obtained, leading to a strong 3D effect and low visual fatigue. An important requirement for any conversion method is stable 3D perception with reduced visual discomfort, which was verified in our extensive video testing.
Manbae Kim
- 1983: B.S. in Electronic Engineering, Hanyang University
- 1986: M.S. in Electrical Engineering, University of Washington, Seattle
- 1992: Ph.D. in Electrical Engineering, University of Washington, Seattle
- 1992-1998: Principal Researcher, Samsung Advanced Institute of Technology
- 1998-present: Professor, Department of Computer and Communications Engineering, Kangwon National University
- ORCID: http://orcid.org/0000-0002-4702-8276
- Research interests: 3D image processing, depth map processing, stereoscopic conversion
[1] la Cascia M., "3D Stereoscopic Image Pairs by Depth-Map Generation," Proceedings of 3DPVT.
[2] "3D-TV Content Generation: 2D-To-3D Conversion," Proc. of IEEE ICME.
[3] "Stereoscopic image generation based on depth images for 3DTV," IEEE Trans. on Broadcasting. DOI: 10.1109/TBC.2005.846190
[4] "3D conversion of 2D video using depth layer partition," Journal of Broadcast Engineering.
[5] "Real-time 2D to 3D video conversion," Journal of Real-Time Image Processing. DOI: 10.1007/s11554-007-0038-9
[6] "2D-to-3D Conversion Based on Motion and Color Mergence," 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video.
[7] "Improved depth perception of single view images," ECTI Transactions on Electrical Engineering, Electronics and Communications.
[8] "Depth from defocus by changing camera aperture: a spatial domain approach."
[9] "Visual attention detection in video sequences using spatiotemporal cues," Proceedings of the 14th Annual ACM International Conference on Multimedia.
[10] Achanta, S. Hemami, "Frequency-tuned Salient Region Detection," IEEE Conf. on Computer Vision and Pattern Recognition.
[11] "Stereoscopic visual attention model for 3D video," Advances in Multimedia Modeling.
[12] "2D-to-3D image/video conversion by using visual attention analysis."
[13] Le Meur O., "Adaptive 3D rendering based on region-of-interest," Proceedings of SPIE, vol. 7524.
[14] "Computing visual attention from scene depth," IEEE International Conference on Pattern Recognition.
[15] "Learning what matters: combining probabilistic models of 2D and 3D saliency cues," Computer Vision Systems.
[16] Le Callet P., "Computational Model of Stereoscopic 3D Visual Saliency," IEEE Transactions on Image Processing. DOI: 10.1109/TIP.2013.2246176
[17] Image Processing, Analysis and Machine Vision.
[18] "Learning full pairwise affinities for spectral segmentation," IEEE Trans. PAMI.
[19] "View Synthesis Algorithm in View Synthesis Reference Software 2.0 (VSRS2.0)," ISO/IEC JTC1/SC29/WG11 M16090.