A Fast and Accurate Face Tracking Scheme by using Depth Information in Addition to Texture Information
Journal of Electrical Engineering and Technology. 2014. Mar, 9(2): 707-720
Copyright © 2014, The Korean Institute of Electrical Engineers
  • Received : April 12, 2013
  • Accepted : July 22, 2013
  • Published : March 01, 2014
About the Authors
Dong-Wook Kim
Corresponding Author: Dept. of Electronic Materials Engineering, Kwangwoon University, Korea. (dwkim@kw.ac.kr)
Woo-Youl Kim
Dept. of Electronic Materials Engineering, Kwangwoon University, Korea. (wykim@kw.ac.kr)
Jisang Yoo
Dept. of Electronic Engineering, Kwangwoon University, Korea. (jsyoo@kw.ac.kr)
Young-Ho Seo
College of Liberal Arts, Kwangwoon University, Korea. (yhseo@kw.ac.kr)

Abstract
This paper proposes a face tracking scheme that combines a face detection algorithm and a face tracking algorithm. The proposed face detection algorithm is based on the Adaboost algorithm, but the search area is dramatically reduced by using skin color and motion information in the depth map. We also propose a face tracking algorithm that uses a template matching method with depth information only. It includes an early termination scheme based on a spiral search for template matching, which reduces the operation time with a small loss in accuracy, and it incorporates a simple refinement process to make that loss even smaller. When the face tracking scheme fails to track the face, it automatically returns to the face detection scheme to find a new face to track. The two schemes are tested with several home-made test sequences and some publicly available ones. The experimental results show that they outperform existing methods in accuracy and speed. We also show some trade-offs between tracking accuracy and execution time for broader applicability.
1. Introduction
Detection and/or tracking of one or more objects (especially parts of the human body) has been researched for a long time. The application areas have expanded widely, from computer vision to security or surveillance systems, visual systems for robots, video conferencing, etc. One of the biggest applications is the HCI (human-computer interface) that detects and tracks human hand(s), body, face, or eyes, and it is used in various areas such as smart home systems [1]. In this paper, the target object is restricted to the human face(s).
Many previous works included both face detection and tracking, with face detection serving as a preprocessing step for face tracking, as in this paper, but they are reviewed separately here. For face detection, the most frequently used or cited method is the so-called Adaboost algorithm [2 - 4]. This method includes a training process, Haar-like feature extraction and classification, and cascaded application of the classifiers, although it only applies to gray images. Many subsequent studies have built on it [5 - 10]. [5, 6], and [9] proposed new classifiers to apply to color images, and [7] designed a classifier that included skin color and eye-mouth features, as well as Haar-like features. [8] focused on asymmetric features, to design a classifier for them. In [10], the result of local normalization and a Gabor wavelet transform, used to solve the color variation problem, was applied to Adaboost. Some also used Haar-like features, but designed different classifiers and refined them with a support vector machine [11]. Many other methods have used facial features such as the nose and eyes [12], skin-color histograms [13], and edges of the components of the face [14]. Also, many works used skin color to detect the face, most of them using chrominance components. [15] and [16] proposed chrominance distribution models of the face, and [17] proposed a statistical model of skin color. In addition, [18] focused on detecting various poses of the face, and [19] proposed a method to detect the face at any angle, with multi-view images.
Most face tracking methods so far have also used the factors used in face detection, such as the component features of a face [20 - 23], the appearance of the face [24 - 29], and skin color [30 - 34]. Among the feature-based tracking methods, [20] used the eyes, mouth, and chin as landmarks, [21] tracked features individually characterized by Gabor wavelets, and [22] used an inter-frame motion inference algorithm to track the features. [23] used silhouette features, as well as semantic features, to track the features online. Appearance-based tracking basically used face shape or appearance [24, 25]. But [26] additionally used the contour of the face to cover large motions of the face, and [27] proposed constraints to temporally match the face. [28] proposed a method to learn the appearance models online, and [29] used a condensation method for efficiency. Many face tracking schemes also used skin color [30 - 34]. [30] proposed a modeling method based on skin color, and [31] constructed a condensation algorithm based on skin color. [32] used color distribution models to overcome the problem caused by varying illumination. Some methods used other facial factors in addition to skin color, such as facial shape [33]. Others used a statistical model of skin color, adopting a neural network to calculate the probability of skin color, with an adaptive mean shift method for condensation [34]. Besides these, [35] tracked various poses of the face by using a statistical model. [36] used a template matching method to track a face, with depth as the template; it first found the hands, and then found the face to be tracked by using the hands as supporting information.
This paper proposes a combination of a face detection scheme and a face tracking scheme, to find and track the face(s) seamlessly, even when the human goes out of the image or the scene changes. In our scheme, face detection is performed at the beginning of face tracking, or when the tracked face disappears. It is a hybrid scheme that uses features, skin color, and motion. It basically uses the method in [2 - 4], the Adaboost or Viola & Jones method, but we reduce the search area for the Adaboost method by using motion and skin color. Our face tracking scheme uses a template matching method with only depth information, as in [36], but it tracks the face directly, without any auxiliary information. It also includes a template resizing scheme, to adapt to changes in the distance of the face. In addition, it includes an early termination scheme to reduce the execution time so that it runs faster than real time, while minimizing the loss of tracking accuracy. Thus, we will show that it can be used adaptively, by considering the trade-off between the execution time and the tracking error.
This paper consists of six chapters. The next chapter explains the overall operation of our scheme. The proposed face detection scheme and face tracking scheme are explained in more detail in Chapter 3 and Chapter 4, respectively. Chapter 5 is devoted to determining the necessary parameters and evaluating the proposed schemes experimentally. Finally, Chapter 6 concludes this paper, based on the experimental results.
2. Overall Operation
The global operation of the face tracking algorithm proposed in this paper is shown as a flow graph in Fig. 1 . As mentioned before, the main proposal is for a face tracking algorithm, but we also propose a scheme to reduce the calculation time of face detection. That is, the proposed face tracking method uses a template of the face(s), which consists of the position and depth information of the face, and the first template is extracted by the face detection process. But afterward the template is updated by the tracking scheme itself, except for the case when the face being tracked disappears from the image.
Fig. 1. Process flow of global operation
At the very start, the face detection process is performed. It uses both the RGB image and the depth information. Basically, it uses an existing method, the Adaboost algorithm [2 - 4], but the search area is reduced by finding the movement of the human and the skin color. Once a face is detected, its position (x-y coordinates) and the corresponding depth information (a segment of the depth map) are taken as the template to be used in the tracking process.
The tracking process uses a template matching method that finds a block matching the template. Here, we use only depth information for this process. It also includes a scheme to resize the template, as well as actual calculation for template matching. Also an early termination process is incorporated, to reduce the execution time.
The result of the tracking process, if it is successful, is the template of the current frame, which is used as the template for the next frame. But if it fails, the process goes back to the detection process to find a new face. This happens when the scene changes or when the face being tracked disappears from the image. In most cases the scene does not change and the tracked person remains in the image, so the detection process is performed only once, at the very start.
The image data we need for face detection and tracking are RGB images and depth images. Here, we assume that the two kinds of images are given at the same resolution, such as those from the Kinect by Microsoft.
3. Face Detection Algorithm
The processing flow of the proposed face detection scheme is shown in Fig. 2. Basically it uses the Adaboost algorithm [2 - 4], but the area in which to search for the human face(s) is restricted by using two consecutive depth images (the (i-1)-th and i-th) and the current (i-th) RGB image.
Fig. 2. The proposed face detection procedure
To do this, the i-th RGB image is first examined with Eq. (1) to check whether it contains a skin color region; Eq. (1) was taken from the skin color reference in [37]. Here, only the Cb and Cr components of the YCbCr color format are used.

[Eq. (1)]

The result of Eq. (1) is a binary image (we call it a skin image), in which the pixels with value '1' indicate the human skin region. But a skin region may not be found even if one exists (a false negative error), which is mostly due to the illumination condition. In this case, we try once more after adjusting the color distribution with a histogram equalization method [38]. If the first attempt, or the re-trial of skin color detection, finds any skin color pixel, we define the skin region as in Fig. 3. From the skin image S_i(x, y), the vertical skin region image S_i^V(x, y) (horizontal skin region image S_i^H(x, y)) is obtained such that if any pixel in a column j (row k) of S_i(x, y) has the value '1', all the pixels in that column j (row k) are set to '1'. Then, the final skin region image SR_i(x, y) is obtained by taking the common parts of S_i^V(x, y) and S_i^H(x, y). Fig. 4(a) shows the scheme to find the skin region image, where the horizontal and vertical gray regions correspond to S_i^H(x, y) and S_i^V(x, y), respectively, and the red-boxed regions are the defined skin region.

Fig. 3. Procedure to define a skin region
Fig. 4. Examples of results from face detection processes: (a) skin region; (b) movement region; (c) detected face
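To make the skin-region step concrete, the following is a minimal sketch in Python/OpenCV of how the skin image and the final skin region SR_i could be computed. The Cb/Cr thresholds are the commonly cited values associated with [37] and are an assumption here, since Eq. (1) itself is not reproduced above; all function and variable names are ours, not the authors'.

```python
import cv2
import numpy as np

def _threshold_skin(ycrcb, cb_range, cr_range):
    _, cr, cb = cv2.split(ycrcb)
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1])).astype(np.uint8)

def skin_region_image(bgr_frame, cb_range=(77, 127), cr_range=(133, 173)):
    """Sketch of the skin-region step: threshold Cb/Cr (Eq. (1)), then project
    the binary skin image onto columns and rows and keep the intersection
    (SR_i), as in Fig. 3 / Fig. 4(a). The threshold values are assumed
    (Chai-and-Ngan style) because Eq. (1) itself is not reproduced in the text."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    skin = _threshold_skin(ycrcb, cb_range, cr_range)

    if skin.sum() == 0:
        # Possible false negative: retry after histogram equalization of Y [38].
        y, cr, cb = cv2.split(ycrcb)
        ycrcb = cv2.merge((cv2.equalizeHist(y), cr, cb))
        skin = _threshold_skin(ycrcb, cb_range, cr_range)

    # Vertical skin region image S_i^V: a column is all '1' if it contains skin.
    s_v = np.tile(skin.any(axis=0), (skin.shape[0], 1))
    # Horizontal skin region image S_i^H: a row is all '1' if it contains skin.
    s_h = np.tile(skin.any(axis=1)[:, None], (1, skin.shape[1]))
    # Final skin region image SR_i: the common part of the two projections.
    return (s_v & s_h).astype(np.uint8)
```

The projection-and-intersection step is what turns scattered skin pixels into the rectangular candidate regions shown in Fig. 4(a).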
Meanwhile, for the depth information, a depth difference image DD_i(x, y) between the (i-1)-th and i-th depth images is found as Eq. (2), where D_i(x, y) is the depth value at (x, y) in depth image i. The resulting image DD_i(x, y) is also binary, where the region with '1' defines the region with movement.
[Eq. (2)]
From the extracted depth difference image, the movement region map MR_i(x, y) can also be found in the same way as in Fig. 3, with DD_i(x, y), DD_i^V(x, y), and DD_i^H(x, y) in place of S_i(x, y), S_i^V(x, y), and S_i^H(x, y), respectively. An example of finding the movement region corresponding to Fig. 4(a) is shown in Fig. 4(b), where the horizontal and vertical gray regions correspond to DD_i^H(x, y) and DD_i^V(x, y), respectively, and the white-boxed region is the defined movement region.
Depending on the existence of the skin color region and the movement region, there are four cases for defining the area in which the Adaboost algorithm searches for face(s). When both the skin color region and the movement region exist, the search area is defined by their common regions. When only the skin region (movement region) exists, the skin region (movement region) itself is defined as the search area. When neither exists, the process takes the next image frame and performs the whole process again. Finally, the Adaboost algorithm is applied to the defined search area. An example RGB image segment of the finally detected face is shown in Fig. 4(c), which corresponds to Figs. 4(a) and (b).
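A rough sketch of how the search-area combination and the restricted Adaboost search could be wired together is shown below. Eq. (2) is assumed here to be a thresholded absolute depth difference, the motion threshold is an illustrative choice, and skin_region_image refers to the previous sketch; none of this is the authors' actual implementation.

```python
import cv2
import numpy as np

def detect_face(bgr_frame, depth_prev, depth_curr, cascade, motion_threshold=10):
    """Sketch of the detection flow in Fig. 2: build the skin region SR_i and the
    movement region MR_i, intersect them when both exist, and run the Haar
    cascade (Adaboost / Viola-Jones) only inside the bounding box of the result.
    The depth-difference threshold is an assumption, not the paper's Eq. (2)."""
    sr = skin_region_image(bgr_frame)                     # see the previous sketch
    dd = (np.abs(depth_curr.astype(np.int32) -
                 depth_prev.astype(np.int32)) > motion_threshold).astype(np.uint8)
    mr_v = np.tile(dd.any(axis=0), (dd.shape[0], 1))            # column projection
    mr_h = np.tile(dd.any(axis=1)[:, None], (1, dd.shape[1]))   # row projection
    mr = (mr_v & mr_h).astype(np.uint8)                   # movement region MR_i

    if sr.any() and mr.any():
        search = sr & mr          # both exist: use the common regions
    elif sr.any() or mr.any():
        search = sr if sr.any() else mr                   # only one exists
    else:
        return None               # neither exists: take the next frame

    if not search.any():
        return None
    ys, xs = np.nonzero(search)
    x0, x1, y0, y1 = xs.min(), xs.max() + 1, ys.min(), ys.max() + 1
    gray = cv2.cvtColor(bgr_frame[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray)                # Adaboost (Viola & Jones)
    # Shift detections back to full-image coordinates.
    return [(x + x0, y + y0, w, h) for (x, y, w, h) in faces]
```

The cascade could be loaded with, for example, cv2.CascadeClassifier('haarcascade_frontalface_default.xml'); the point of the sketch is only that detectMultiScale runs on the cropped search area rather than on the full frame.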
The data from the face detection process are the coordinates and size of the detected face, and its corresponding depth image segment. Because our purpose is face tracking rather than face detection itself, when more than one face is detected, the information of the face nearest to the camera is sent to the face tracking process.
4. Face Tracking Algorithm
Taking the information of the detected face from the face detection process, or from the previous face tracking process, as the template, the face tracking process is performed as in Fig. 5. Because the proposed tracking process uses only depth information, it takes the next frame of the depth image as its other input. Each step is explained in the following.
Fig. 5. The proposed face tracking procedure
- 4.1 Template and search area re-sizing
The first step in the proposed face tracking scheme is to resize the template and the search area, which is the area in which to find the face being tracked. Of the two, the template size is processed first, because the size of the search area depends on the resized template.
- 4.1.1 Template Re-sizing
If a human moves horizontally or vertically, the size of the face stays nearly the same, but for back-and-forth movement it changes. So, the template that was detected or updated in the detection process or the previous tracking process needs to be resized to fit the face in the current frame.
- (1) Relationship between depth and size
Because the size of an object in an image depends entirely on its depth, the change of an object's size according to its depth is explained first. To do this, it is necessary to define the way a depth is expressed. In general, a depth camera provides a real depth value in floating-point form. It also usually has a distance range within which the estimated value is reliable. Let us define this range as (z_R,min, z_R,max), and let the depth value be expressed digitally by an n-bit word. Then, a real depth value z corresponds to a digital word Z', as in Eq. (3).
[Eq. (3)]
But in a typical depth map a closer point has a larger value, obtained by converting Z' as in Eq. (4) or (5), and this paper also uses this value, Z.
[Eq. (4)]

[Eq. (5)]
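Eqs. (3)-(5) are only reproduced as images above. Under the stated assumptions of a reliable range (z_R,min, z_R,max), an n-bit word, and nearer points getting larger values, one plausible reading (ours, not a reconstruction of the exact equations) is:

```latex
Z' \approx \frac{z - z_{R,\min}}{z_{R,\max} - z_{R,\min}}\,(2^{n}-1),
\qquad
Z = (2^{n}-1) - Z' \approx \frac{z_{R,\max} - z}{z_{R,\max} - z_{R,\min}}\,(2^{n}-1).
```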
Now, when an object whose real size is s and depth is z has the size S_sensor on the image sensor of a camera whose focal length is f, the relationship between S_sensor and z (or Z) is as in Eq. (6).
[Eq. (6)]
If we assume that the pixel pitch of the image sensor is P_sensor, and the number of pixels corresponding to S_sensor is N, then the number of pixels in the real image is also N, and is found as Eq. (7).
[Eq. (7)]
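Eqs. (6) and (7) are likewise shown only as images; under the usual pinhole-camera reading of the surrounding text they amount to:

```latex
S_{sensor} = \frac{f \cdot s}{z},
\qquad
N = \frac{S_{sensor}}{P_{sensor}} = \frac{f \cdot s}{z \cdot P_{sensor}}.
```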
Fig. 6 shows example plots of the relationship in Eq. (7), together with measured results. Here, the dashed line is the plot from Eq. (7), the dots are the measured values, and the solid line is the trend line. The error between the dashed line and the measured values or the trend line comes from the digitizing error (in the above equations, a function discarding the fractional part should be applied) and the measuring errors.
Fig. 6. Plot of the relationship between size and depth
- (2) Face depth estimation and template re-sizing
To resize the template according to Fig. 6 or Eq. (7), the depth of the face in the current frame must be re-determined. For this, we define the depth template area as all the pixels in the i-th depth frame D_i(x, y) corresponding to the ones in the previous template T_{i-1}, as in Eq. (8).
[Eq. (8)]
The scheme is shown in Fig. 7. The template and the depth template area are divided into p × q blocks (each block of a × b resolution), and the average depth values of each block (j, k) in the template and in the depth template area are calculated, to find the maximum block averages TA_{i-1}^max and DTA_i^max, respectively, as in Eqs. (9) and (10). In this paper, p and q are determined empirically.

Fig. 7. Defining the search area

[Eq. (9)]

[Eq. (10)]

Then, the size of the updated template is calculated from the size of the previous template as in Eq. (11), and the template is resized accordingly (X is hor or ver, representing horizontal and vertical, respectively).

[Eq. (11)]
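The resizing step can be sketched as follows. The block averaging and the maxima follow the description of Eqs. (9)-(10), with p = q = 3 as chosen in Section 5.1.1; the scale factor in the last lines is an assumed stand-in for Eq. (11) (sizes taken inversely proportional to the real depth recovered from the assumed Eqs. (3)-(5)), since the equation itself is only an image above, and z_range and n_bits are illustrative camera parameters.

```python
import cv2
import numpy as np

def resize_template(prev_template, depth_frame, prev_bbox, p=3, q=3,
                    z_range=(0.5, 4.0), n_bits=8):
    """Sketch of Section 4.1.1. The previous depth template and the depth
    template area (same pixel positions in the current frame, Eq. (8)) are split
    into p x q blocks; the maximum block-average depths TA_max and DTA_max
    (Eqs. (9)-(10)) drive the rescaling. The final ratio is an ASSUMED form of
    Eq. (11), not the paper's exact formula."""
    x, y, w, h = prev_bbox
    dta = depth_frame[y:y + h, x:x + w].astype(np.float64)   # depth template area
    t = prev_template.astype(np.float64)

    def max_block_average(a):
        bh, bw = a.shape[0] // p, a.shape[1] // q
        return max(a[j * bh:(j + 1) * bh, k * bw:(k + 1) * bw].mean()
                   for j in range(p) for k in range(q))

    def to_metres(Z):                       # inverse of the assumed Eqs. (3)-(5)
        z_min, z_max = z_range
        return z_max - Z / (2 ** n_bits - 1) * (z_max - z_min)

    ta_max, dta_max = max_block_average(t), max_block_average(dta)
    scale = to_metres(ta_max) / to_metres(dta_max)            # assumed Eq. (11)
    new_w, new_h = max(1, int(round(w * scale))), max(1, int(round(h * scale)))
    return cv2.resize(prev_template, (new_w, new_h),
                      interpolation=cv2.INTER_NEAREST), (new_w, new_h)
```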
- 4.1.2 Search area re-sizing
Although the template size has been updated, it is still necessary to find the exact location of the face in the current frame by searching an appropriate area, which we call the search area. The search area must be determined by considering the depth value of the resized template, and the maximum amount of face movement. The first has just been considered above. For the second, we have measured the maximum movement empirically with proper test sequences, which will be explained in the experiment chapter. By considering both factors, we determine an extension of the resized template size, by an extension ratio α and rounded up, as the size of the search area, as shown in Fig. 7, where ⌈x⌉ denotes the smallest integer not less than x, and X is hor (horizontal) or ver (vertical).
- 4.2 Template matching
Once the resized template (we simply call it the 'template' from now on) and the corresponding search area are determined, the exact face location is found by template matching. For this, we use the SAD (sum of absolute differences) value per pixel (PSAD) as the cost value, as in Eq. (12), where K is the number of pixels in the template, (cx, cy) is the current position being examined in the search area, and D_T(i, j) (D_SA(i, j)) is the pixel value at (i, j) in the template (search area).
[Eq. (12)]
The final location of the matched face template, SP_opt, is determined as the pixel location satisfying Eq. (13).
[Eq. (13)]
- 4.2.1 Early termination
The process of Eq. (13) requires repeating the calculation of Eq. (12) as many times as there are pixels in the search area, which might take too much time. Our means of reducing this search time is an early termination scheme, whereby the search is terminated when a certain criterion is satisfied. Because the amount of face movement in the assumed circumstances is usually small or zero, searching from smaller movements to larger ones is more appropriate. So we adopt a spiral search scheme, as in Fig. 8, which shows an example with a 5×5-pixel search area. The dashed arrows show the direction of the search, and the numbers in the blocks (pixels) are the search order. If the early termination scheme is not applied, every pixel of the full search sequence SS_FS is examined to find the pixel SP_opt, as in Eq. (14), where the size of the search area is assumed to be m × n pixels.
Fig. 8. Spiral search and early termination scheme
[Eq. (14)]
To terminate the examination before reaching the last pixel, the search stops at the first pixel whose PSAD value satisfies Eq. (15),

[Eq. (15)]

where T_ET is the PSAD threshold below which we assume the pixel SP_ET satisfying Eq. (15) is close enough to SP_opt, and first{x} means the first position satisfying x. T_ET is determined empirically.
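One way to realize the spiral order of Eq. (14) and the early-termination criterion of Eq. (15) is sketched below, reusing psad from the previous sketch; the centre-outward spiral matches the idea that small movements should be examined first, but the exact ordering of Fig. 8 is not reproduced here.

```python
def spiral_offsets(m, n):
    """Centre-outward spiral visiting order over an m x n grid of candidate
    positions -- our reading of the full search sequence SS_FS in Eq. (14)."""
    cx, cy = m // 2, n // 2
    yield (cx, cy)
    step, x, y = 1, cx, cy
    while True:
        for dx, dy, length in ((1, 0, step), (0, 1, step),
                               (-1, 0, step + 1), (0, -1, step + 1)):
            for _ in range(length):
                x, y = x + dx, y + dy
                if 0 <= x < m and 0 <= y < n:
                    yield (x, y)
        step += 2
        if step > 2 * max(m, n):          # whole grid has been covered
            return

def spiral_search(template, search_area, t_et=2.0):
    """Early termination (Eq. (15)): stop at the first position whose PSAD does
    not exceed T_ET; otherwise fall back to the best position seen so far."""
    th, tw = template.shape
    m = search_area.shape[1] - tw + 1
    n = search_area.shape[0] - th + 1
    best, best_pos = float('inf'), (0, 0)
    for (cx, cy) in spiral_offsets(m, n):
        cost = psad(template, search_area, cx, cy)
        if cost < best:
            best, best_pos = cost, (cx, cy)
        if cost <= t_et:                   # early termination criterion
            return (cx, cy), cost
    return best_pos, best
```

With t_et set to the empirically chosen T_ET = 2 (Section 5.1.3), the loop would typically stop after only a few cost evaluations when the face has barely moved.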
- 4.2.2 Sparse search and refinement
Even though the early termination scheme reduces the search time, it can be reduced further if we sacrifice a little more accuracy. This can be done by hopping several pixels from the currently examined pixel to the next one to be examined. In this case, the search sequence is as in Eq. (16), where the number of intervals between the current pixel and the next is p, called the 'hopping distance'.
[Eq. (16)]
If the early termination scheme is not applied and p = 1, it is the same as SS_FS. If the early termination scheme is applied and p = 1, it is the same as when this sparse search scheme is not applied. Fig. 8 shows a case of p = 3, where the dark pixels are the ones in SS_3.
One more step is included in this sparse search. When p > 1 and a pixel SP_qp in the sequence satisfies Eq. (15) (early termination), the immediate neighbor pixels are additionally examined. We call this the refinement process, and two cases are considered. The first, called 2-pixel refinement, additionally examines the two pixels (SP_qp-1, SP_qp+1) just before and just after SP_qp (horizontally striped in Fig. 8); the second, called 4-pixel refinement, also examines the two pixels just outside those (vertically striped). In either case, the pixel with the lowest PSAD value becomes the final pixel SP_opt. If the sparse search does not find a pixel satisfying Eq. (15), the one with the minimum PSAD value is selected as the final result.
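Layering the hopping distance p of Eq. (16) and the 2-/4-pixel refinement on top of the same spiral order could look like this; the neighbour indices follow our reading of Fig. 8, and spiral_offsets / psad are the helpers from the previous sketches.

```python
def sparse_spiral_search(template, search_area, t_et=2.0, p=3, refine=2):
    """Sparse search SS_p (Eq. (16)): examine every p-th position of the spiral
    order; on early termination (Eq. (15)) additionally examine the immediate
    neighbours in the spiral sequence (2- or 4-pixel refinement)."""
    th, tw = template.shape
    m = search_area.shape[1] - tw + 1
    n = search_area.shape[0] - th + 1
    order = list(spiral_offsets(m, n))            # from the previous sketch
    best, best_pos = float('inf'), order[0]
    for q in range(0, len(order), p):
        cost = psad(template, search_area, *order[q])
        if cost < best:
            best, best_pos = cost, order[q]
        if cost <= t_et:
            # Refinement: check neighbours of the terminating sequence index
            # (refine=2 -> indices q-1..q+1, refine=4 -> q-2..q+2).
            lo, hi = max(0, q - refine // 2), min(len(order), q + refine // 2 + 1)
            for r in range(lo, hi):
                c = psad(template, search_area, *order[r])
                if c < best:
                    best, best_pos = c, order[r]
            return best_pos, best
    return best_pos, best                         # no early termination: best seen
```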
- 4.2.3 Feedback to face detection
If the face can no longer be tracked, for example when the human goes out of the screen or is hidden by another object, our algorithm goes back to the face detection algorithm to find another face. This is decided by a PSAD threshold value T_FB, as in Eq. (17).
[Eq. (17)]
Eq. (17) is applied to any search sequence, regardless of whether the early termination scheme or the sparse search is applied.
5. Experiments and Results
We have implemented the proposed face detection algorithm and the face tracking algorithm, and conducted experiments with various test sequences. In these experiments, we used Microsoft Visual Studio 2010, and OpenCV Library 2.4.3 in the Windows operating system. The computer used in the experiments has an Intel Core i7 3.4GHz CPU, with 16GB RAM.
- 5.1 Determining parameters for face tracking
First, we empirically determined the parameters defined in the previous chapter. For this, we used three home-made sequences, whose names indicate the directions of movement. Their information is given in Table 1, and two representative images from each sequence are shown in Fig. 9. The movement in each sequence was made as fast as possible, to cover more than enough movement speed compared with the assumed circumstances. All three sequences were captured by Kinect® from Microsoft, so the resolution of both the RGB image and the depth image is 640×480.
Table 1. Contents used for parameter determination
Fig. 9. Representative images from each sequence: LR: (a) 18th frame, (b) 119th frame; UD: (c) 36th frame, (d) 166th frame; BF: (e) 96th frame, (f) 130th frame
- 5.1.1 Template segmentation for template resizing
In Fig. 7, we segmented the template and the corresponding region of the current frame into p × q blocks. Because there can be many combinations of p and q, we only consider the cases of p = q with odd values. The purpose of this segmentation is to resize the template to match the current face size, so we estimated the relative error of the resized template. In this experiment, we used the faces extracted by the face detection algorithm as the reference templates. The relative error, called the template resizing error, was calculated as in Eq. (18).
[Eq. (18)]
Fig. 10 shows the experimental results, where (a) shows the average template resizing errors for various segmentations and for the three test sequences. Considering all three test sequences, 3×3 segmentation is the best. Fig. 10(b) shows the change of the template resizing error throughout the sequences for 3×3 segmentation; the error does not exceed 3% in any frame of any sequence. Other experiments showed that all segmentations have very similar execution times. So, we chose 3×3 segmentation as our segmentation scheme.
Fig. 10. Template resizing error for template segmentation: (a) average values for various segmentations; (b) values for 3×3 segmentation
- 5.1.2 Size of the search area
The next parameter is the extension ratio α in Fig. 7, which determines the size of the search area. To find it, we performed two experiments. The first was to measure the actual maximum amount of movement between two consecutive frames. In this experiment, we obtained 25.8%, 23.9%, and 18.2% of the template size as the maximum movement for the UD, LR, and BF sequences, respectively. From this experiment, the value of α should be 0.258.
The second experiment was to measure the amount of displacement between the template T_found, found within the given search area, and the best template T_best, found in the whole image. Here, both templates were found by a full search, without early termination or sparse search. We converted this displacement to a displacement error, calculated by Eq. (19), where size(T) and position(T) mean the size and position of T, respectively.
[Eq. (19)]
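Eq. (19) is only an image in the source; given the sentence above, it presumably has a form close to:

```latex
\text{displacement error} =
\frac{\lVert \mathit{position}(T_{found}) - \mathit{position}(T_{best}) \rVert}
     {\mathit{size}(T_{best})} \times 100\,[\%]
```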
The result is shown in Fig. 11. For the UD sequence, the values for α < 0.21 were more than 20%, because the face could not be tracked properly. From this experiment, it is clear that the value obtained by estimating the maximum movement between two consecutive frames (α = 0.258) is not enough. To make the displacement error almost 0 (less than 0.1%), α ≥ 0.41 should be maintained. Therefore, we set α to 0.41.
Fig. 11. Displacement errors of the three test sequences versus the value of α
- 5.1.3 Threshold value for early termination
The PSAD threshold value T_ET for early termination was also determined empirically. Fig. 12 shows the result, where both the displacement errors (D-Error) by Eq. (19) and the execution times (Time) are shown as average values. The BF sequence shows the lowest displacement errors for almost all threshold values, but for execution time the UD sequence is the lowest. Because our aim was to track the face in real time with an appropriately low displacement error, we chose T_ET = 2, which makes the displacement error lower than 4%, with an execution time lower than 40 ms.
Fig. 12. Experimental results to determine T_ET
- 5.1.4 Hopping distance and refinement search for sparse search
Another scheme to reduce the execution time is to hop from the current search pixel to the next in the spiral sequence. The hopping distance was also determined empirically. We measured the displacement error and execution time while increasing the hopping distance, with T_ET fixed at 2. The result is shown in Fig. 13(a), where all displacement errors and execution times are average values. As the hopping distance increases, the displacement error increases and the execution time decreases, as expected. Given our aim, we took 3 as the hopping distance, which makes the displacement error less than 5% and the execution time less than 30 ms.
Fig. 13. Experimental results to determine the hopping distance and the refinement process: (a) hopping distance; (b) refinement process
One more thing we decided was the refinement process that examines the pixels neighboring the early-terminated one (refer to Fig. 8). This has two options, 2-pixel refinement and 4-pixel refinement. Because we had chosen T_ET = 2 and a hopping distance of 3, we used them in this experiment. The result is shown in Fig. 13(b). The displacement error of UD decreased dramatically with both 2-pixel and 4-pixel refinement, although the other sequences did not change much with either refinement. Also, the increase in execution time is negligible for all sequences and refinement processes. So, we apply 2-pixel or 4-pixel refinement.
- 5.1.5 Threshold value to return to the face detection process
The final parameter to be decided is the PSAD threshold at which the tracking process returns to the face detection process, on the assumption that the tracking process can no longer track the face correctly. For this, we prepared a few special sequences in which a human walks out of the screen, or behind an object, etc. Fig. 14(a) shows an example of the PSAD values when the human being tracked walks out of the screen. As can be seen in the graph, the PSAD value changes dramatically from the 75th frame to the 76th frame; images of the two frames are shown in Figs. 14(b) and (c), respectively. Other sequences showed similar results, so it is quite reasonable to choose T_FB = 15 [dB].
Fig. 14. Experimental result to determine T_FB: (a) PSAD value change; (b) 75th frame image; (c) 76th frame image
- 5.2 Experimental results and comparison
We tested the proposed face detection scheme and face tracking scheme with several test sequences, listed in Table 2. The first three sequences were tested only for the detection scheme, because they were used to extract the parameters of the tracking scheme. Lovebird1 is an MPEG multi-view test sequence with a resolution of 1,920×1,080. The last two were home-made with Kinect®. In Lovebird1, two persons walk side by side, from far away to very near the camera. The WL sequence has only one person, sitting on screen and moving in various ways. In the S&J sequence, two persons appear at first, but the one nearer the camera walks out of the screen; the other then moves around in various ways while sitting on a chair. Three representative images from each sequence are shown in Fig. 15.
Table 2. Applying test sequences
Fig. 15. Representative images for Lovebird1: (a) 18th frame, (b) 90th frame, (c) 119th frame; WL: (d) 116th frame, (e) 303rd frame, (f) 416th frame; and S&J: (g) 144th frame, (h) 182nd frame, (i) 328th frame
- 5.2.1 Face detection
The experimental results of the proposed face detection scheme are summarized in Table 3, which includes the true positive rate (TP), false positive rate (FP), false negative rate (FN), and execution time per frame for each test sequence. In this table, our results are compared with the Adaboost method in [3] (V&J). TP, FP, and FN were calculated as in Eqs. (20-1), (20-2), and (20-3), respectively, where a true positive face is a detected face that truly resides in the image, a false positive face is one that was detected but is not truly a face, and a false negative face is a face that resides in the image but was not detected.
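Eqs. (20-1)-(20-3) are not reproduced in the extracted text; from the definitions just given they presumably correspond to ratios of the following form (the choice of denominators is our assumption):

```latex
TP = \frac{\#\,\text{true positive faces}}{\#\,\text{faces in the images}} \times 100\,[\%],\quad
FP = \frac{\#\,\text{false positive faces}}{\#\,\text{detected faces}} \times 100\,[\%],\quad
FN = \frac{\#\,\text{false negative faces}}{\#\,\text{faces in the images}} \times 100\,[\%]
```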
Table 3. Experimental results and comparison of the face detection methods
As can be seen in the table, V&J is a little better in FN, but ours is much better in TP. In particular, ours showed 0% FP. Also, our execution time was about 1/10 that of V&J. This means that our scheme reduces the search area to about 1/10 of the whole image, with a small sacrifice in FN.
Table 4 compares some previous methods with ours. Note that the test sequences for each method are different: because ours uses depth information, and the others' test sequences do not contain depth information, those sequences could not be used. We were also unable to obtain implementations of the existing methods. Thus the comparison is not entirely fair, but we still think it is worthwhile, because it allows some indirect comparison. Also note that a '-' in a cell indicates that there is no information in the corresponding paper. As can be seen in the table, ours outperforms the existing methods in TP, FP, FN, and even in execution time.
Table 4. Comparison with existing methods for face detection
- 5.2.2 Face tracking
The experimental results of the proposed face tracking scheme for the last three sequences in Table 2 are shown in Fig. 16, where the displacement error rate and execution time per frame are plotted against the frame number for each sequence. The average values are given in Table 5. Fig. 16 and Table 5 include the two refinement schemes, 2-pixel refinement and 4-pixel refinement. In this experiment, we let the scheme track the person nearest to the camera when more than one person was on screen (the Lovebird1 sequence, and the front part of the S&J sequence).
Fig. 16. Face tracking experimental results for displacement and execution time for the sequences: (a) Lovebird1; (b) WL; (c) S&J
Table 5. Experimental results for the proposed face tracking scheme
Because the man in Lovebird1 walks toward the camera, his face gets larger while he moves left and right repeatedly, and whenever the direction changes he remains still for a couple of frames. This is reflected exactly in the execution time, as shown in Fig. 16(a). Also, the execution time becomes larger as the sequence progresses, because the size of the face becomes larger; that is, as the face size increases, the execution time increases. This can also be seen by comparing the execution times before and after about the 190th frame of S&J, since the woman is nearer to the camera than the man. For reference, the distances from the camera to the man in WL, and to the woman and the man in S&J, are about 120 cm, 110 cm, and 160 cm, respectively.
As shown in Table 5, the average displacement error was less than 3%, with less than 5 ms of execution time per frame. For reference, Fig. 17 shows three example texture images corresponding to depth template images with 0%, 4%, and 7.8% displacement error, respectively. Considering them, it is clear that an average displacement error of 3% is quite acceptable. Also, the proposed tracking scheme takes about 5 ms on average to track the face per frame, and even for the Lovebird1 sequence it is less than 8 ms, which is more than fast enough for real-time tracking.
Fig. 17. Texture images corresponding to the depth templates, with displacement error ratios of: (a) 0%; (b) 4%; (c) 7.8%
Table 6 compares our method with some existing methods. Because they do not provide enough information for a fair and clear comparison, we used the data from the papers without modification or recalculation, and instead fitted our data to theirs as closely as possible. Most tracking methods used sequences made by their respective authors as test sequences, and so did we. Among the three schemes in Table 5, we chose the 2-pixel refinement scheme, because both its error rate and execution time are in the middle. In the table, we entered depth+RGB as the property of our sequences, but actually only depth was used for tracking. For the tracking rate, ours showed 100%, because our scheme switches to face detection mode when it fails to track the face; this happened only once, in the S&J sequence, as explained before. For tracking error, [29] and [36] provided the displacement amount and the root mean square error (RMSE), respectively, as their measures of accuracy, while we provided the displacement error rate in the previous table. So, we converted our data accordingly and show it in parentheses. From the data in the table, it is clear that our method outperforms the others in both tracking accuracy and execution time.
Table 6. Comparison with existing methods for face tracking
6. Conclusion
In this paper, we have proposed a combination of a face detection scheme and a face tracking scheme to track a human face. The face detection scheme basically uses the Viola & Jones method [2 - 4], but we reduce the area to be searched by using skin color and depth information. The proposed tracking scheme basically uses a template matching method, but with depth information only. It includes a template resizing scheme, to adapt to changes in the size or depth of the face being tracked. It also incorporates an early termination scheme with a threshold, using a spiral search for template matching, and a refinement scheme with the neighboring pixels. If it decides, via another threshold, that it has failed to track the face, it automatically returns to the face detection scheme to find a new face to track.
Experimental results for the face detection scheme showed a 97.6% true positive detection rate on average, with a 0% false positive rate and a 2.1% false negative rate. The execution time was about 44 ms per frame at 640×480 resolution. The comparison with existing methods made clear that ours is better in both detection accuracy and execution time.
The experimental results for the proposed face tracking scheme showed a displacement error rate of about 2.5%, with an almost 100% tracking rate. Also, the tracking time per frame at 640×480 resolution was as low as about 2.5 ms. These results considerably outperform previous face tracking schemes. We have also shown some trade-offs between tracking accuracy and execution time with respect to the size of the search area, the PSAD threshold value for early termination, the hopping distance, and the refinement scheme.
Therefore, we conclude that the proposed face detection scheme and face tracking scheme, or their combination, can be used in applications that need fast and accurate face detection and/or tracking. Because our tracking scheme can also provide higher speed at the cost of a little tracking accuracy, by increasing the early termination threshold and hopping distance, or by decreasing the search area and taking a simpler refinement scheme, its applicable area is much broader.
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST). (2010-0026245).
BIO
Dong-Wook Kim He received his M.S. degree in 1985 from the Dept. of Electronic Engineering of Hanyang University in Seoul, Korea, and his Ph.D. degree in 1991 from the Dept. of Electrical Engineering of the Georgia Institute of Technology, GA, USA. He is a Professor in the Dept. of Electronic Materials Engineering at Kwangwoon University in Seoul, Korea.
Woo-Youl Kim He received his B.S. degree in 2012 from the Dept. of Information & Communication Engineering of Anyang University in Anyang, Korea. He is currently pursuing an M.S. degree in the Dept. of Electronic Materials Engineering of Kwangwoon University in Seoul, Korea. His research interests include digital image processing, digital holography, and image compression.
Jisang Yoo He received the B.S. and M.S. degrees from Seoul National University, Seoul, Korea, in 1985 and 1987, both in electronics engineering, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in electrical engineering in 1993. From September 1993 to August 1994, he worked as a senior research engineer in the industrial electronics R&D center at Hyundai Electronics Industries Co., Ltd., Inchon, Korea, in the area of image compression and HDTV. He is currently a professor in the department of electronics engineering, Kwangwoon University, Seoul, Korea. His research interests are in signal and image processing, nonlinear digital filtering, and computer vision. He is now leading the 3DTV broadcast trial project in Korea.
Young-Ho Seo He received his M.S. and Ph.D. degrees in 2000 and 2004 from the Dept. of Electronic Materials Engineering of Kwangwoon University in Seoul, Korea. He was a researcher at the Korea Electrotechnology Research Institute (KERI) from 2003 to 2004. He was a research professor in the Dept. of Electronic and Information Engineering at Yuhan College in Buchon, Korea, and an assistant professor in the Dept. of Information and Communication Engineering at Hansung University in Seoul, Korea. He is now an associate professor in the College of Liberal Arts at Kwangwoon University in Seoul, Korea, and a director of the research institute at TYJ Inc. His research interests include realistic media, digital holography, SoC design, and content security.
References
Yang M.-H. , Kriegman D. J. , Ahuja N. 2002 “Detecting Faces in Images; A Survey,” IEEE Trans. On Pattern Analysis and Machine Intelligence 24 (1) 34 - 58    DOI : 10.1109/34.982883
Viola P. , Jones M. 2001 “Rapid Object Detection using a Boosted Cascade of Simple Features,” Conf. on Computer Vision and Pattern Recognition 1 - 9
Viola P. , Jones M. 2001 “Robust Real-time Object Detection,” Intl. Workshop on Statistical and computational Theories of Vision-Modeling, Learning, Computing, and Sampling 1 - 25
Viola P. , Jones M. 2004 “Robust Real-Time Face Detection,” J. of Computer Vision 57 (2) 137 - 154    DOI : 10.1023/B:VISI.0000013087.49260.fb
Erdem C. E. , Ulukaya S. , Karaali A. , Erdem A. T. 2011 “Combining Haar Feature and Skin Color Based Classifiers for Face Detection,” Conf. Acoustics, Speech and Signal Processing 1497 - 1500
Inalou S. A. , Kasaei S. 2010 “AdaBoost-based Face Detection in Color Images with Low False Alarm,” Conf. Computer Modeling and Simulation 107 - 111
Tu Y. , Yi F. , Chen G. , Jiang S. , Huang Z. 2010 “Fast Rotation Invariant Face Detection in Color Image Using Multi-Classifier Combination Method,” Conf. EDT 211 - 218
Wu J. , Brubaker S. C. , Mullin M. D. , Rehg J. M. 2008 “Fast Asymmetric Learning for Cascade Face Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence 30 (3) 369 - 382    DOI : 10.1109/TPAMI.2007.1181
Wu Y.-W. , Ai X.-Y. 2008 “Face detection in color images using AdaBoost algorithm based on skin color information,” Workshop on Knowledge Discovery and Data Mining 339 - 342
Tie Y. , Guan L. 2009 “Automatic face detection in video sequence using local normalization and optimal adaptive correlation techniques,” J. Pattern Recognition 42 1859 - 1868    DOI : 10.1016/j.patcog.2008.11.026
Shih P. , Liu C. 2006 “Face detection using discriminating feature analysis and support vector machine,” J. Pattern Recognition 39 260 - 276    DOI : 10.1016/j.patcog.2005.07.003
Colombo A. , Cusano C. , Schettini R. 2006 “3D face detection using curvature analysis” J. Pattern Recognition 39 444 - 455    DOI : 10.1016/j.patcog.2005.09.009
Waring C. A. , Liu X. 2005 “Face Detection Using Spectral Histograms and SVMs,” IEEE trans. Syst., Man, And Cybernetics 35 (3) 467 - 476    DOI : 10.1109/TSMCB.2005.846655
Tsao W.-K. , Lee A. J. T. , Liu Y.-H. , Chang T.-W. , Lin H.-H. 2010 “A Data mining Approach to Face Detection,” J. Pattern Recognition 43 1039 - 1049    DOI : 10.1016/j.patcog.2009.09.005
Sagheer A. , Aly S. 2012 “An Effective Face Detection Algorithm based on Skin Color Information,” Conf. Signal Image Technology and Internet Based Systems 90 - 96
Fan H. , Zhou D. , Nie R. , Zhao D. 2012 “Target Face Detection using Pulse Coupled Neural Network and Skin Color Model,” Conf. Computer Science & Service System 2185 - 2188
Kherchaoui S. , Houacine A. 2010 “Face Detection Based on a Model of the Skin Color with Constraints and Template Matching,” Conf. Machine and Web Intelligence 469 - 472
Chen H.-Y. , Huang C.-L , Fu C.-M. 2008 “Hybridboost Learning for Multi-pose Face Detection and Facial Expression Recognition,” J. Pattern Recognition 41 1173 - 1185    DOI : 10.1016/j.patcog.2007.08.010
Huang C. , Ai H. , Li Y. , Lao S. 2007 “High-Performance Rotation Invariant Multiview Face Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence 29 (4) 671 - 686    DOI : 10.1109/TPAMI.2007.1011
Maurer T. , Malsburg C. 1996 “Tracking and Learning Graphs and Pose on Image Sequences of Faces,” International Conf. Automatic Face and Gesture Recognition 176 - 181
McKenna S. , Gong S. , Wurtz R. , Tanner J. , Banin D. 1997 “Tracking Facial Feature Points with Gabor Wavelets and Shape Models,” International Conf. on Audio and Video-based Biometric Person Authentication 35 - 42
Wang Q. , Zhang W. , Tang X. , Shum H.-Y. 2006 “Realtime Bayesian 3-D Pose Tracking,” IEEE Trans. Circuits Syst. Video Techn. 16 (12) 1533 - 1541    DOI : 10.1109/TCSVT.2006.885727
Zhang W. , Wang Q. , Tang X. 2008 “Real Time Feature Based 3-D Deformable Face Tracking,” ECCV 720 - 732
Cootes T. , Edwards G. 2001 “Active Appearance Models,” IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) 681 - 685    DOI : 10.1109/34.927467
Cootes T. , Wheeler G. , Walker K. , Taylor C. 2002 “View-based Active Appearance Models,” Image Vision Comput 20 (9) 657 - 664    DOI : 10.1016/S0262-8856(02)00055-0
Sung J. -W. , Kim D. 2006 “Large Motion Object Tracking using Active Contour Combined Active Appearance Model,” International Conference on Computer Vision Systems 31 -
Zhou M. , Liang L. , Sun J. , Wang Y. 2010 “AAM based Face Tracking with Temporal Matching and Face Segmentation,” IEEE Conf. on Computer Vision and Pattern Recognition 701 - 708
Wang P. , Ji Q. 2008 “Robust Face Tracking via Collaboration of Generic and Specific Models,” IEEE Trans. Image Processing 17 (7) 1189 - 1199    DOI : 10.1109/TIP.2008.924287
Lui Y. , Beveridge J. , Whitley L. 2010 “Adaptive Appearance Model and Condensation Algorithm for Robust Face Tracking,” IEEE Tran. Syst., Man, and Cybernetics 40 (3) 437 - 448    DOI : 10.1109/TSMCA.2010.2041655
Raja Y. , McKenna S. , Gong S. 1998 “Colour Model Selection and Adaptation in Dynamic Scenes,” European Conference on Computer Vision 460 - 474
Jang G. , Kweon I. 2000 “Robust Real-time Face Tracking using Adaptive Color Model,” International Symposium on Mechatronics and Intelligent Mechanical System for 21 Century
Stern H. , Efros B. 2005 “Adaptive Color Space Switching for Tracking under Varying Illumination,” Image Vision Computation 23 (3) 353 - 364    DOI : 10.1016/j.imavis.2004.09.005
Lee H. , Kim D. 2007 “Robust Face Tracking by Integration of Two Separate Trackers: Skin Color and Facial Shape,” J. pattern recognition 40 3225 - 3235    DOI : 10.1016/j.patcog.2007.03.003
Vadakkepat P. , Lim P. , Silva L. , Jing L. , Ling L. 2008 “Multimodal Approach to Human-Face Detection and Tracking” IEEE Trans. Industrial Electronics 55 (3) 1385 - 1393    DOI : 10.1109/TIE.2007.903993
Qian R. , Sezan M. , Matthews K. 1998 “A Robust Realtime Face Tracking Algorithm,” International Conference on Image Processing 131 - 135
Suau X. , Ruiz-Hidalgo J. , Casas J. 2012 “Real-Time Head and Hand Tracking Based on 2.5D Data” IEEE Trans. Multimedia 14 (3) 575 - 585    DOI : 10.1109/TMM.2012.2189853
Chai D. , et al. 1998 “Locating Facial Region of a Head-and-Shoulders Color Image,” Int’l Conf. Automatic Face and Gesture Recognition 124 - 129
Gonzalez R. C. , Woods R. E. 2008 Digital Image Processing 3rd Ed. Pearson Ed. Inc. Upper Saddle River, NJ