This paper proposes a face tracking scheme that combines a face detection algorithm and a face tracking algorithm. The proposed face detection algorithm is based on the Adaboost algorithm, but the search area is dramatically reduced by using skin color and motion information from the depth map. We also propose a face tracking algorithm that uses template matching with depth information only. It includes an early termination scheme based on a spiral search for template matching, which reduces the operation time with a small loss in accuracy, and a simple additional refinement process that reduces this loss further. When the face tracking scheme fails to track the face, it automatically returns to the face detection scheme to find a new face to track. The two schemes were tested with several home-made test sequences and some public ones. The experimental results show that they outperform existing methods in accuracy and speed. We also show trade-offs between tracking accuracy and execution time for broader applicability.
Detecting and/or tracking one or more objects (especially parts of the human body) has been researched for a long time. The application areas have expanded widely, from computer vision to security and surveillance systems, visual systems for robots, video conferencing, etc. One of the biggest applications is the human-computer interface (HCI), which detects and tracks human hand(s), body, face, or eyes, and is used in various areas such as smart home systems. In this paper, the target object is restricted to the human face(s).
Many previous works included both face detection and tracking, with face detection serving as a preprocessing step for face tracking, as in this paper, but they are reviewed separately here. For face detection, the most frequently used or cited method is the so-called Adaboost algorithm. This method includes a training process, Haar-like feature extraction and classification, and cascaded application of the classifiers, although it applies only to gray images. Many subsequent studies have built on it: some proposed new classifiers that apply to color images, one designed a classifier that included skin color and eye-mouth features as well as Haar-like features, and another focused on asymmetric features and built a classifier for them. In other work, the result of local normalization and a Gabor wavelet transform, used to solve the color variation problem, was fed to Adaboost. Some also used Haar-like features but designed different classifiers, refining them with a support vector machine. Many other methods have used facial features such as the nose and eyes, skin-color histograms, and the edges of facial components. Also, many works used skin color to detect the face, most of them relying on the chrominance components: chrominance distribution models of the face and statistical models of skin color have both been proposed. In addition, some studies focused on detecting various poses of the face, including a method that detects the face at any angle using multi-view images.
Most face tracking methods so far have also used the factors employed in face detection, such as facial component features, the appearance of the face, and skin color. Among the feature-based tracking methods, some used the eyes, mouth, and chin as landmarks, some tracked individual features characterized by Gabor wavelets, and others used an inter-frame motion inference algorithm to track the features. Silhouette features, as well as semantic features, have also been used for online tracking. Appearance-based tracking basically uses face shape or appearance: one method additionally used the contour of the face to cover large motions, another proposed constraints to match the face temporally, one learned the appearance models online, and another used a condensation method for efficiency. Many face tracking schemes also used skin color: modeling methods based on skin color and condensation algorithms built on skin color have been proposed, and color distribution models have been used to overcome the problem of varying illumination. Some methods used other facial factors in addition to skin color, such as facial shape. Others used a statistical model of skin color, adopting a neural network to calculate the probability of skin color with an adaptive mean-shift method for condensation, and tracked various poses of the face with a statistical model. Finally, one method used template matching to track a face, with depth as the template; it first found the hands, and then found the face to be tracked by using the hands as support information.
This paper proposes a combination of a face detection scheme and a face tracking scheme, to find and track the face(s) seamlessly, even when the human leaves the image or the scene changes. In our scheme, face detection is performed at the beginning of face tracking, or when the tracked face disappears. It is a hybrid scheme that uses features, skin color, and motion. It is basically built on the Adaboost (Viola & Jones) method, but we reduce the search area for Adaboost by using motion and skin color. Our face tracking scheme uses template matching with only depth information, as in previous work, but it tracks the face directly, without any auxiliary information. It also includes a template resizing scheme, to adapt to changes in the distance of the face. In addition, it includes an early termination scheme that reduces the execution time to faster than real time while minimizing the loss in tracking accuracy. We will show that the scheme can be used adaptively, by considering the trade-off between execution time and tracking error.
This paper consists of six chapters. The next chapter explains the overall operation of our scheme. The proposed face detection and face tracking schemes are explained in more detail in Chapter 3 and Chapter 4, respectively. Chapter 5 is devoted to finding the necessary parameters and evaluating the proposed schemes. Finally, Chapter 6 concludes this paper, based on the experimental results.
2. Overall Operation
The global operation of the face tracking algorithm proposed in this paper is shown as a flow graph in Fig. 1. As mentioned before, the main proposal is a face tracking algorithm, but we also propose a scheme to reduce the calculation time of face detection. That is, the proposed face tracking method uses a template of the face(s), which consists of the position and depth information of the face, and the first template is extracted by the face detection process. Afterward, the template is updated by the tracking scheme itself, except when the face being tracked disappears from the image.
Process flow of global operation
At the very first, the face detection process is performed. It uses both the RGB image and depth information. Basically, it uses an existing method, the Adaboost algorithm, but the search area is reduced by finding human movement and skin color. Once a face is detected, its position (x-y coordinates) and the corresponding depth information (a segment of the depth map) are taken as the template to be used in the tracking process.
The tracking process uses a template matching method that finds the block matching the template, using only depth information. It includes a scheme to resize the template, the actual template matching calculation, and an early termination process to reduce the execution time.
The result of the tracking process, if it is successful, is the template of the current frame, which is used as the template for the next frame. If it fails, the scheme goes back to the detection process to find a new face. This occurs when the scene changes, or when the face being tracked disappears from the scope of the image. In most cases the scene does not change and the tracked person remains in the image, so the detection process is performed only once, at the very start.
The image data we need for face detection and tracking is both RGB images and depth images. Here, we assume that the two kinds of images are given with the same resolution, such as the ones by Kinect from Microsoft.
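The overall detect-then-track loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `detect_face` and `track_face` are hypothetical placeholders for the schemes of Chapters 3 and 4, and frames are assumed to arrive as aligned RGB/depth pairs of equal resolution (e.g., from a Kinect).

```python
# Sketch of the global operation: detect once, then track with depth
# only, falling back to detection whenever tracking fails.

def run(frames, detect_face, track_face):
    """frames: iterable of (rgb, depth) pairs with equal resolution."""
    template = None  # (position, depth segment) of the tracked face
    for rgb, depth in frames:
        if template is None:
            # Detection uses both RGB and depth (Chapter 3);
            # returns None when no face is found.
            template = detect_face(rgb, depth)
        else:
            # Tracking uses depth only (Chapter 4);
            # returns None on tracking failure.
            template = track_face(depth, template)
    return template
```

Note that detection runs again only when tracking reports failure, matching the flow of Fig. 1.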
3. Face Detection Algorithm
The processing flow of the proposed face detection scheme is shown in Fig. 2. Basically, it uses the Adaboost algorithm, but the area in which to search for the human face(s) is restricted by using two consecutive depth images (the (i-1)th and ith) and the current RGB image (the ith).
The proposed face detection procedure
To do this, the ith RGB image is examined with Eq. (1) to determine whether it contains a skin color region; the equation was taken from the skin color reference in the literature. Here, only the Cb and Cr components of the YCbCr color space are used.
The result of Eq. (1) is a binary image (we call it a skin image), in which the pixels with value '1' indicate human skin regions. However, a skin region may not be found even if it exists (a false negative error), mostly due to the illumination conditions. In this case, we try once more after adjusting the color distribution with a histogram equalization method. If the first attempt, or the re-trial of skin color detection, finds any skin color pixel, we define the skin region as in Fig. 3, starting from the skin image.
Procedure to define a skin region
The vertical skin region image is obtained such that if any pixel in a column of the skin image has value '1', all the pixels in that column are set to '1'; the horizontal skin region image is obtained in the same way for the rows. Then, the final skin region image is obtained by taking the common parts of the two. Fig. 4 (a) shows the scheme to find the skin region image, where the horizontal and vertical gray regions correspond to the horizontal and vertical skin region images, respectively, and the red-boxed regions are the defined skin regions.
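The column/row projection and intersection described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code; the function and variable names are ours.

```python
import numpy as np

# From a binary skin image, mark every column (and row) containing a
# '1' pixel, then intersect the two masks to obtain the rectangular
# candidate skin regions described in Fig. 3.

def skin_region(skin):
    skin = np.asarray(skin, dtype=bool)
    v = np.zeros_like(skin)
    v[:, skin.any(axis=0)] = True   # vertical skin region image
    h = np.zeros_like(skin)
    h[skin.any(axis=1), :] = True   # horizontal skin region image
    return v & h                    # common parts = final skin region
```

The same projection-and-intersection construction is reused later for the movement region from the depth difference image.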
Examples of results from face detection processes: (a) skin region; (b) movement region; (c) detected face
Meanwhile, for the depth information, a depth difference image between the (i-1)th and ith depth images is computed as Eq. (2), by comparing the depth value at each pixel position of the two depth images. The resulting image is also a binary one, where the regions with value '1' define the regions with movement. From the extracted depth difference image, the movement region map can be found in the same way as the skin region map, by using the depth difference image and its vertical and horizontal region images instead of the skin versions. An example of finding the movement region corresponding to Fig. 4 (a) is shown in Fig. 4 (b), where the horizontal and vertical gray regions correspond to the horizontal and vertical movement region images, respectively, and the white-boxed region is the defined movement region.
Depending on the existence of the skin color region and the movement region, there are four cases for defining the area in which the Adaboost algorithm searches for face(s). When both a skin color region and a movement region exist, the search area is defined as the common region of the two. When only the skin region (or only the movement region) exists, that region itself is defined as the search area. When neither exists, the process takes the next image frame and performs the whole process again. Finally, the Adaboost algorithm is applied to the defined search area. An example of an RGB image segment of the finally detected face is shown in Fig. 4 (c), which corresponds to Fig. 4 (a).
The data from the face detection process are the coordinates and size of the detected face, and its corresponding depth image segment. Because our purpose is face tracking rather than face detection itself, when more than one face is detected, the information of the face nearest to the camera is sent to the face tracking process.
4. Face Tracking Algorithm
Taking the information of the detected face from the face detection process, or from the previous face tracking step, as the template, the face tracking process is performed as in Fig. 5. Because the proposed tracking process uses only depth information, it takes the next frame of the depth image as its other input. Each step is explained in the following.
The proposed face tracking procedure
- 4.1 Template and search area re-sizing
The first step in the proposed face tracking scheme is to resize the template and the search area, i.e., the area in which to find the face being tracked. The template size is processed first, because the size of the search area depends on the resized template.
- 4.1.1 Template Re-sizing
If a human moves horizontally or vertically, the size of the face stays nearly the same, but for back-and-forth movement it changes. So, the template that was detected or updated in the detection process or the previous tracking step needs to be resized to fit the face in the current frame.
- (1) Relationship between depth and size
Because the size of an object in an image depends entirely on its depth, the change of an object's size with its depth is explained first. To do this, it is necessary to define the way a depth is expressed. In general, a depth camera provides a real depth value in floating-point form, and it usually has a distance range within which the estimated value is reliable. Let us define this range as (z_min, z_max), with the depth value expressed digitally by an n-bit word. Then, a real depth value z corresponds to a digital word Z', as Eq. (3).
But in a typical depth map, a closer point has a larger value, obtained by converting Z' as in Eq. (4) or (5); this paper also uses this converted value, Z.
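The depth-word conventions of Eqs. (3)-(5) can be sketched as follows. This is a hedged reconstruction: the exact rounding used in the paper's Eq. (3) is not given, so truncation is assumed here.

```python
# A real depth z within the reliable range [z_min, z_max] is quantized
# to an n-bit word Z' (Eq. (3)), then inverted so that closer points
# carry larger values Z (Eqs. (4)/(5)).

def depth_word(z, z_min, z_max, n):
    zq = int((z - z_min) / (z_max - z_min) * (2**n - 1))  # Z' (rounding assumed)
    return (2**n - 1) - zq                                # Z: closer -> larger
```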
Now, when an object whose real size is s and depth is z has size S_I on the image sensor of a camera with focal length f, the relationship between S_I and z (or Z) is as in Eq. (6). If we assume that the pixel pitch of the image sensor is P, and the number of pixels corresponding to S_I is N, then the number of pixels in the real image is also N, and is found as in Eq. (7). Fig. 6 shows example plots of the relationship in Eq. (7) and its measured result. Here, the dashed line is the plot from Eq. (7), the dots are the measured values, and the solid line is the trend line. The error between the dashed line and the measured values or the trend line comes from the digitizing error (a function to discard the fractional parts should be applied to the above equations) and from measurement errors.
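The pinhole-camera relationship of Eqs. (6)-(7) can be sketched numerically. This is an illustration with hypothetical parameter values; all lengths must be in the same unit.

```python
# An object of real size s at depth z projects to S_I = f*s/z on the
# sensor (Eq. (6)), which spans N = S_I / P pixels for pixel pitch P
# (Eq. (7)). Halving the depth doubles the pixel size.

def pixels(s, z, f, P):
    S_I = f * s / z   # Eq. (6): size on the image sensor
    return S_I / P    # Eq. (7): number of pixels
```

In practice the result would be truncated to an integer pixel count, which is the digitizing error mentioned above.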
Plotting for the relationship between size and depth
- (2) Face depth estimation and template re-sizing
To resize the template according to Eq. (7), the depth of the face in the current frame must be re-determined. For this, we define the depth template area as all the pixels in the current depth image corresponding to those in the previous template T_{i-1}, as Eq. (8). The scheme is shown in Fig. 7. The template and the depth template area are divided into p×q blocks, and the average depth values of each block (j, k) in the template and in the depth template area are calculated, to find the maximum values TA and DA, respectively, as Eqs. (9) and (10). In this paper, p and q are determined empirically.
Defining the search area
Then, the size of the updated template T_i is calculated from the size of the previous template T_{i-1} as Eq. (11), and the template is resized accordingly (with the horizontal and vertical dimensions handled respectively).
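The block-averaging step of Eqs. (9)-(10) can be sketched as below. This is a hedged illustration: the exact scaling law of Eq. (11) is not reproduced in the text, so the sketch assumes the template size scales with the ratio of the representative (maximum block-average) depth values DA/TA; the function names are ours.

```python
import numpy as np

# Split a depth patch into p x q blocks and take the maximum
# block-average as the representative face depth (Eqs. (9)-(10)).

def block_max_avg(depth, p, q):
    d = np.asarray(depth, dtype=float)
    H, W = d.shape
    avgs = [d[j * H // p:(j + 1) * H // p, k * W // q:(k + 1) * W // q].mean()
            for j in range(p) for k in range(q)]
    return max(avgs)

def resize_ratio(prev_template, depth_area, p=3, q=3):
    TA = block_max_avg(prev_template, p, q)  # previous template, Eq. (9)
    DA = block_max_avg(depth_area, p, q)     # depth template area, Eq. (10)
    return DA / TA  # > 1: face got closer (larger); < 1: farther (smaller)
```

Using the maximum block average biases the estimate toward the closest part of the patch, i.e., the face rather than the background.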
- 4.1.2 Search area re-sizing
Although the template size has been updated, it is still necessary to find the exact location of the face in the current frame by searching an appropriate area, which we call the search area. The search area must be determined by considering both the depth value of the face and the maximum amount of face movement. The first has just been handled above. For the second, we have measured the maximum movement empirically with proper test sequences, as explained in the experiment chapter. Considering both factors, we determine an extension of the template size by a ratio α as the size of the search area, as shown in Fig. 7, where ⌈X⌉ means the smallest integer not less than X.
- 4.2 Template matching
Once the resized template (called just the 'template' from now on) and the corresponding search area are determined, the exact face location is found by template matching. For this, we use the SAD (sum of absolute differences) value per pixel (PSAD) as the cost value, as in Eq. (12), where K is the number of pixels in the template, (x, y) is the current position being examined in the search area, and the pixel values of the template and of the search area at each position are compared. The final location of the matched face template is determined as the pixel location satisfying Eq. (13).
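The PSAD cost of Eq. (12) and the exhaustive matching of Eq. (13) can be sketched as follows; this is an illustrative full search over depth patches, before the spiral-order and early-termination refinements described next.

```python
import numpy as np

# Slide the depth template over the search area and return the
# position minimizing the per-pixel sum of absolute differences.

def psad(template, patch):
    t = np.asarray(template, float)
    p = np.asarray(patch, float)
    return np.abs(t - p).sum() / t.size  # SAD per pixel (K = t.size), Eq. (12)

def match(template, search):
    t = np.asarray(template, float)
    s = np.asarray(search, float)
    th, tw = t.shape
    sh, sw = s.shape
    best, best_pos = float("inf"), None
    for y in range(sh - th + 1):          # full search, Eq. (13)
        for x in range(sw - tw + 1):
            c = psad(t, s[y:y + th, x:x + tw])
            if c < best:
                best, best_pos = c, (y, x)
    return best_pos, best
```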
- 4.2.1 Early termination
The process of Eq. (13) repeats the calculation of Eq. (12) as many times as the number of pixels in the search area, which might take too much time. Our means of reducing this search time is an early termination scheme, whereby the search is stopped when a certain criterion is satisfied. Because the amount of face movement in the assumed circumstances is usually small or zero, searching from smaller movements to larger ones is more appropriate. So we adopt a spiral search scheme, as in Fig. 8, which shows an example with a 5×5 [pixel] search area. The dashed arrows show the direction of search, and the numbers in the blocks (pixels) give the search order. Thus, if the early termination scheme is not applied, every pixel of the full search sequence is examined to find the best-matching position, as in Eq. (14).
Spiral search and early termination scheme
To terminate the examination earlier than the last pixel, the search stops at the first pixel whose PSAD value satisfies Eq. (15), where T_ET is the PSAD threshold below which we assume a pixel is close enough to the true best match. T_ET is determined empirically.
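The spiral search with early termination can be sketched as below. This is a hedged illustration: the visiting order within each ring is an implementation choice here, and the paper's exact order (Fig. 8) may differ; `cost` stands in for the PSAD evaluation at a given offset.

```python
# Visit candidate offsets from the center outward (small motion
# first) and stop at the first candidate whose PSAD falls below
# the threshold T_ET (Eqs. (14)-(15)).

def spiral(radius):
    """Yield (dy, dx) offsets in an outward square spiral, center first."""
    yield (0, 0)
    for r in range(1, radius + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dy), abs(dx)) == r:  # ring at distance r
                    yield (dy, dx)

def search_early(cost, radius, t_et):
    best, best_off = float("inf"), (0, 0)
    for off in spiral(radius):
        c = cost(off)
        if c < best:
            best, best_off = c, off
        if c <= t_et:   # early termination, Eq. (15)
            break
    return best_off, best
```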
- 4.2.2 Sparse search and refinement
Even though the early termination scheme reduces the search time, it can be reduced further if we sacrifice a little more accuracy. This is done by hopping several pixels from the currently examined pixel to the next one. In this case, the search sequence is as in Eq. (16), where the number of intervals between the current and the next pixel, p, is called the 'hopping distance'. If the early termination scheme is not applied and p = 1, this is the same as the full search; if the early termination scheme is applied and p = 1, it is the same as when the sparse search is not applied. Fig. 8 also illustrates a case of p = 3, where the dark pixels are the ones actually examined in the sequence. One more step is included in this sparse search. When p > 1, and a pixel in the sequence satisfies Eq. (15) (is early terminated), its immediate neighbor pixels are additionally examined. We call this the refinement process, and two cases are considered. The first, called 2-pixel refinement, additionally examines the two pixels just before and just after the early-terminated pixel in the spiral sequence (horizontally striped in the figure); the second, called 4-pixel refinement, examines two more pixels just outside of those (vertically striped). In either case, the pixel with the lowest PSAD value becomes the final position. If the sparse search finds no pixel satisfying Eq. (15), the one with the minimum PSAD value is selected as the final result.
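The sparse search with 2-pixel refinement can be sketched as below. This is an illustrative reconstruction over an externally supplied spiral sequence `seq`; `cost` stands in for the PSAD evaluation, and the function name is ours.

```python
# Hop p positions along the spiral sequence (Eq. (16)); when a hopped
# candidate early-terminates (Eq. (15)), also examine its immediate
# neighbors in the full sequence (2-pixel refinement) and keep the
# lowest-PSAD position. If nothing early-terminates, keep the minimum.

def sparse_search(seq, cost, p, t_et):
    best, best_i = float("inf"), 0
    for i in range(0, len(seq), p):        # hop p positions
        c = cost(seq[i])
        if c < best:
            best, best_i = c, i
        if c <= t_et:                      # early terminated: refine
            for j in (i - 1, i + 1):       # pixels just before and after
                if 0 <= j < len(seq):
                    cj = cost(seq[j])
                    if cj < best:
                        best, best_i = cj, j
            return seq[best_i], best
    return seq[best_i], best               # no hit: overall minimum seen
```

The 4-pixel variant would additionally examine the positions at `i - 2` and `i + 2`.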
- 4.2.3 Feedback to face detection
If the face can no longer be tracked, e.g., when the person goes out of the screen or is hidden by another object, our algorithm goes back to the face detection algorithm to find another face. This is decided by a PSAD threshold value T_FB, as in Eq. (17). Eq. (17) is applied to every search sequence, whether or not the early termination scheme or the sparse search is applied.
5. Experiments and Results
We have implemented the proposed face detection algorithm and the face tracking algorithm, and conducted experiments with various test sequences. In these experiments, we used Microsoft Visual Studio 2010, and OpenCV Library 2.4.3 in the Windows operating system. The computer used in the experiments has an Intel Core i7 3.4GHz CPU, with 16GB RAM.
- 5.1 Determining parameters for face tracking
First, we empirically determined the parameters defined in the previous chapter. For this, we used three home-made contents, whose names indicate the directions of movement. Their information is given in Table 1, and two representative images from each sequence are shown in Fig. 9. The movements in each content were performed at maximum speed, to cover more than enough movement speed compared with the assumed circumstances. All three contents were captured by a Kinect® from Microsoft, so the resolution of both the RGB and depth images is 640×480.
Contents used for parameter determination
Representative images in each sequence: in LR: (a) 18th frame; (b) 119th frame; in UD: (c) 36th frame; (d) 166th frame; in BF: (e) 96th frame; (f) 130th frame
- 5.1.1 Template segmentation for template resizing
As explained in Section 4.1.1, we segmented the template and the corresponding region of the current frame into p×q blocks. Because there can be many combinations of p and q, we only consider cases where p and q are odd numbers. The purpose of this segmentation is to resize the template to match the current face size, so we estimated the relative error of the resized template. In this experiment, we used the faces extracted by the face detection algorithm as the reference templates. The relative error, named the template resizing error, was calculated as Eq. (18).
Fig. 10 shows the experimental results, where (a) shows the average template resizing errors for various segmentations over the three test sequences. Considering all three test sequences, 3×3 segmentation is the best. Fig. 10 (b) shows the change of the template resizing error throughout the sequences for 3×3 segmentation; no frame of any sequence exceeds 3% error. Other experimental results showed that all segmentations have very similar execution times. So, we decided on 3×3 as our segmentation scheme.
Template resizing error for template segmentation: (a) average values for various segmentations, and (b) values for 3×3 segmentation
- 5.1.2 Size of the search area
The next parameter is the extension ratio α, which determines the size of the search area. For this, we performed two experiments. The first measured the actual maximum amount of movement between two consecutive frames. In this experiment, we obtained 25.8%, 23.9%, and 18.2% of the template size as the maximum movement for the UD, LR, and BF sequences, respectively. From this experiment alone, the value of α should be 0.258.
The second experiment measured the amount of displacement between the template found within the given search area and the best-matching template found in the whole image. Here, both templates were found by a full search, without early termination or sparse search. We converted this to a displacement error, calculated by Eq. (19) from the size and the position of the templates.
The result is shown in Fig. 11. The values for α < 21% on sequence UD were more than 20%, because the face could not be tracked properly. From this experiment, it is clear that the value estimated from the maximum movement between two consecutive frames (α = 0.258) is not enough. To make the displacement error almost 0 (less than 0.1%), α ≥ 0.41 should be maintained. Therefore, we decided on α = 0.41.
Displacement errors for the three test sequences to the value of α
- 5.1.3 Threshold value for early termination
The PSAD threshold value T_ET for early termination was also determined empirically. Fig. 12 shows the result, where both the displacement errors (D-Error) by Eq. (19) and the execution times (Time) are shown as average values. The BF sequence shows the lowest displacement errors at almost all threshold values, while the UD sequence shows the lowest execution times. Because our aim is to track the face in real time with an appropriately low displacement error, we chose T_ET = 2, which makes the displacement error lower than 4% with an execution time lower than 40 ms.
Experimental results to determine T_ET
- 5.1.4 Hopping distance and refinement search for sparse search
Another scheme to reduce the execution time is hopping from the current search pixel to the next in the spiral sequence. The hopping distance was also determined empirically: we measured the displacement error and execution time for increasing hopping distances, with T_ET fixed at 2. The result is shown in Fig. 13 (a), where all displacement errors and execution times are average values. As the hopping distance increases, the displacement error increases and the execution time decreases, as expected. Given our aim, we took 3 as the hopping distance, which makes the displacement error less than 5% and the execution time less than 30 ms.
Experimental results to determine the hopping distance and refinement process: (a) hopping distance; (b) refinement process.
One more thing to decide was the refinement process that examines the pixels neighboring the early-terminated one (see Section 4.2.2). This has two options, 2-pixel refinement and 4-pixel refinement. Because we had chosen T_ET = 2 and a hopping distance of 3, we used these values in this experiment. The result is shown in Fig. 13 (b). The displacement error of UD decreased dramatically with both 2-pixel and 4-pixel refinement, while the other sequences did not change much under either refinement. Also, the increase in execution time is negligible for all sequences and all refinement processes. So, we apply 2-pixel or 4-pixel refinement.
- 5.1.5 Threshold value to return to the face detection process
The final parameter to be decided is the PSAD threshold T_FB for the tracking process to return to the face detection process, under the assumption that the tracking process can no longer track the face correctly. For this, we prepared a few special sequences, in which a human walks out of the screen, hides behind an object, etc.
Fig. 14 (a) shows an example of the PSAD values measured when the human being tracked walks out of the screen. As can be seen in the graph, the PSAD value changes dramatically from the 75th frame to the 76th frame. The images for these two frames are shown in Figs. 14 (b) and (c), respectively. Other sequences showed similar results, so it is quite reasonable to choose T_FB based on this jump.
Experimental result to determine T_FB: (a) PSAD value change; (b) 75th frame image; (c) 76th frame image.
- 5.2 Experimental results and comparison
We have tested the proposed face detection and face tracking schemes with several test sequences, listed in Table 2. The first three sequences were tested only with the detection scheme, because they were used to extract the parameters for the tracking scheme. Lovebird1 is an MPEG multi-view test sequence with a resolution of 1,920×1,080; the last two were home-made with a Kinect®. In Lovebird1, two persons walk side by side from far away to very near the camera. The WL sequence has only one person, sitting on screen and moving in various ways. In the S&J sequence, two persons appear at first, but the one nearer the camera walks out of the screen; the other then moves around in various ways while sitting on a chair. Three representative images from each sequence are shown in Fig. 15.
Applying test sequences
The representative images for Lovebird1: (a) 18th frame; (b) 90th frame; (c) 119th frame; WL: (d) 116th frame; (e) 303rd frame; (f) 416th frame; and S&J: (g) 144th frame; (h) 182nd frame; (i) 328th frame.
- 5.2.1 Face detection
The experimental results for the proposed face detection scheme are summarized in Table 3, which includes the true positive rate (TP), false positive rate (FP), false negative rate (FN), and execution time per frame for each test sequence. In this table, our results are compared with the Adaboost method (V&J). TP, FP, and FN were calculated as Eqs. (20-1), (20-2), and (20-3), respectively, where a true positive face is one that was detected and truly resides in the image, a false positive face is one that was detected but is not truly a face, and a false negative face is one that resides in the image but was not detected.
The experimental results and comparison of the face detection methods
As can be seen in the table, V&J is a little better in FN, but ours is much better in TP. In particular, ours showed 0% FP. Also, our execution time was about 1/10 that of V&J, which means our scheme reduces the search area to about 1/10 of the whole image, with a small sacrifice in FN.
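The detection metrics above can be illustrated with a small sketch. Note the hedge: the paper's Eqs. (20-1)-(20-3) are not reproduced in the text, so these rates use one common convention, with TP and FN as fractions of the faces actually present and FP as the fraction of detections that are not real faces.

```python
# Compute detection rates from raw counts (convention assumed, see
# lead-in; the paper's exact Eqs. (20-1)-(20-3) may differ).

def rates(true_pos, false_pos, false_neg):
    actual = true_pos + false_neg        # faces present in the images
    detections = true_pos + false_pos    # faces reported by the detector
    tp = true_pos / actual
    fn = false_neg / actual
    fp = false_pos / detections if detections else 0.0
    return tp, fp, fn
```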
Table 4 compares some previous methods with ours. Note that the test sequences for each method are different: because our method uses depth information, and the other sequences do not include depth, those test sequences could not be used; we also could not obtain implementations of the existing methods. A direct comparison is therefore not possible, but we still consider it worthwhile, because it enables an indirect comparison. Note also that a '-' in a cell indicates that the corresponding paper provides no information. As can be seen in the table, ours outperforms every existing method in TP, FP, FN, and even execution time.
Comparison with existing methods for face detection
- 5.2.2 Face tracking
The experimental results of the proposed face tracking scheme for the last three sequences in Table 2 are shown in Fig. 16, where the displacement error rate and execution time per frame are graphed over the frames of each sequence. The average values are given in Table 5, which includes the two refinement schemes, 2-pixel and 4-pixel refinement. In this experiment, we let the scheme track the person nearest to the camera when more than one person was on screen (the Lovebird1 sequence and the front part of the S&J sequence).
Face tracking experimental results for displacement and execution time for the sequence of: (a) Lovebird 1; (b) WL; (c) S&J.
Experimental results for the proposed face tracking scheme
Because the man in Lovebird1 walks toward the camera, his face gets larger while he repeatedly moves left and right, and whenever the direction changes he remains still for a couple of frames. This is reflected directly in the execution time, as shown in Fig. 16 (a). Also, the execution time grows as the sequence progresses, because the face becomes larger; that is, as the size of the face increases, the execution time increases. This can also be seen by comparing the times before and after about the 190th frame of S&J, since the woman is nearer to the camera than the man. For reference, the distances from the camera to the man in WL, and to the woman and the man in S&J, are about 120 cm, 110 cm, and 160 cm, respectively.
As shown in Table 5, the average displacement error was less than 3%, with less than 5 ms of execution time per frame. For reference, Fig. 17 shows three example texture images corresponding to depth template images with 0%, 4%, and 7.8% displacement error, respectively. Considering these, a 3% average displacement error is clearly quite acceptable. The proposed tracking scheme takes about 5 ms on average to track the face per frame, and even for the Lovebird1 sequence it is less than 8 ms, which is more than enough speed for real-time tracking.
Texture images corresponding to the depth templates, with displacement error ratios of: (a) 0%; (b) 4%; (c) 7.8%
Table 6 compares our method with some existing methods. Because they do not provide enough information for a fair and clear comparison, we used the data from their papers without modification or recalculation, and instead fitted our data to theirs as closely as possible. Most tracking methods used sequences made by their respective authors as test sequences, as did we. Among the three schemes in Table 5, we chose the 2-pixel refinement scheme, because both its error rate and execution time are in the middle. In the table, we entered depth+RGB as the property of our sequences, although only depth was actually used for tracking. Our tracking rate is 100%, because our scheme switches to face detection when it fails to track the face; this happened only once, in the S&J sequence, as explained before. For tracking error, the other methods report displacement amounts or root mean square error (RMSE) as their accuracy measures, while we have reported the displacement error rate in the previous table; so we converted our data correspondingly, shown in parentheses. From the data in the table, it is clear that our method outperforms the others in both tracking accuracy and execution time.
Comparison with existing methods for face tracking
In this paper, we have proposed a combination of a face detection scheme and a face tracking scheme to track a human face. The face detection scheme basically uses the Viola & Jones method
, but we have reduced the area to be searched by using skin color and depth information. The proposed tracking scheme basically uses template matching, but with depth information only. It includes a template resizing scheme to adapt to changes in the size or depth of the face being tracked. It also incorporates an early termination scheme, realized as a spiral search for template matching with a threshold, and a refinement scheme that examines neighboring pixels. If tracking fails, as decided by a threshold, the method automatically returns to the face detection scheme to find a new face to track.
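As a rough illustration of the tracking loop just summarized, the following Python sketch combines spiral-ordered template matching over a depth map with threshold-based early termination and a tracking-failure signal. All names, the hopping logic, and the threshold values are illustrative; in particular, the paper's PSAD-based termination, template resizing, and refinement steps are simplified away here.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equal-size depth patches."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def spiral_offsets(radius, hop):
    """Yield (dy, dx) offsets ring by ring, from the center outward,
    sampled every `hop` pixels (the hopping distance)."""
    yield (0, 0)
    for r in range(hop, radius + 1, hop):
        for dy in range(-r, r + 1, hop):
            for dx in range(-r, r + 1, hop):
                if max(abs(dy), abs(dx)) == r:  # points on the ring at distance r
                    yield (dy, dx)

def track_face(depth, template, prev_pos, radius=16, hop=2, threshold=1000):
    """Search around prev_pos for the best template match in the depth map.

    Stops as soon as a candidate's SAD falls below `threshold` (early
    termination). Returns (pos, ok); ok=False means tracking failed and
    the caller should fall back to the face detection stage.
    """
    th, tw = template.shape
    y0, x0 = prev_pos
    best_cost, best_pos = None, prev_pos
    for dy, dx in spiral_offsets(radius, hop):
        y, x = y0 + dy, x0 + dx
        if y < 0 or x < 0 or y + th > depth.shape[0] or x + tw > depth.shape[1]:
            continue  # candidate window falls outside the frame
        cost = sad(depth[y:y + th, x:x + tw], template)
        if best_cost is None or cost < best_cost:
            best_cost, best_pos = cost, (y, x)
        if cost < threshold:  # good enough: terminate the spiral early
            return (y, x), True
    # Even the best candidate was poor: decide success/failure by a threshold.
    return best_pos, best_cost is not None and best_cost < 4 * threshold
```

Because the spiral starts at the previous position, a nearly stationary face is matched after only a few candidates, which is consistent with the short execution times observed for still frames.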
Experimental results for the face detection scheme showed a true positive detection rate of 97.6% on average, with a 0% false positive rate and a 2.1% false negative rate. The execution time was about 44 ms per frame at 640×480 resolution. The comparison with existing methods made it clear that ours is better in both detection accuracy and execution time.
The experimental results for the proposed face tracking scheme showed a displacement error rate of about 2.5% with an almost 100% tracking rate. Also, the tracking time per frame at 640×480 resolution was as low as about 2.5 ms. These results considerably outperform the previous face tracking schemes. We have also shown some trade-offs between tracking accuracy and execution time with respect to the size of the search area, the PSAD threshold value for early termination, the hopping distance, and the refinement scheme.
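The four trade-off parameters above could, for instance, be exposed as a single configuration object; this is a sketch with names of our own choosing, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class TrackerConfig:
    """Speed/accuracy trade-off knobs for the tracking scheme.

    Larger thresholds and hopping distances, or a smaller search area and
    a simpler refinement, trade accuracy for speed. (Illustrative names.)
    """
    search_radius: int = 16     # half-size of the search area around the last position
    psad_threshold: int = 1000  # early-termination threshold for the spiral search
    hop_distance: int = 2       # hopping distance between tested candidates
    refine_pixels: int = 2      # refinement neighborhood (0 disables refinement)

# A faster, slightly less accurate profile:
fast = TrackerConfig(search_radius=12, psad_threshold=2000,
                     hop_distance=3, refine_pixels=0)
print(fast.psad_threshold)  # 2000
```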
Therefore, we can conclude that the proposed face detection scheme and face tracking scheme, or their combination, can be used in applications that need fast and accurate face detection and/or tracking. Because our tracking scheme can also trade a little tracking accuracy for higher speed, by increasing the early termination threshold value and the hopping distance, or by decreasing the search area and taking a simpler refinement scheme, its range of applications is much broader.
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (MEST) (2010-0026245).
Dong-Wook Kim He received his M.S. degree in 1985 from the Dept. of Electronic Engineering of Hanyang University in Seoul, Korea, and his Ph.D. degree in 1991 from the Dept. of Electrical Engineering of the Georgia Institute of Technology, GA, U.S.A. He is a Professor in the Dept. of Electronic Materials Engineering at Kwangwoon University in Seoul, Korea.
Woo-Youl Kim He received his B.S. degree in 2012 from the Dept. of Information & Communication Engineering of Anyang University in Anyang, Korea. He is currently pursuing an M.S. degree in the Dept. of Electronic Materials Engineering of Kwangwoon University in Seoul, Korea. His research interests include digital image processing, digital holography, and image compression.
Jisang Yoo He received his B.S. and M.S. degrees from Seoul National University, Seoul, Korea, in 1985 and 1987, respectively, both in electronics engineering, and his Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1993. From September 1993 to August 1994, he worked as a senior research engineer at the industrial electronics R&D center of Hyundai Electronics Industries Co., Ltd., Inchon, Korea, in the area of image compression and HDTV. He is currently a professor in the Department of Electronics Engineering, Kwangwoon University, Seoul, Korea. His research interests are in signal and image processing, nonlinear digital filtering, and computer vision. He is now leading a 3DTV broadcast trial project in Korea.
Young-Ho Seo He received his M.S. and Ph.D. degrees in 2000 and 2004 from the Dept. of Electronic Materials Engineering of Kwangwoon University in Seoul, Korea. He was a researcher at the Korea Electrotechnology Research Institute (KERI) from 2003 to 2004. He was a research professor in the Dept. of Electronic and Information Engineering at Yuhan College in Buchon, Korea, and an assistant professor in the Dept. of Information and Communication Engineering at Hansung University in Seoul, Korea. He is now an associate professor in the College of Liberal Arts at Kwangwoon University in Seoul, Korea, and a director of the research institute at TYJ Inc. His research interests include realistic media, digital holography, SoC design, and content security.
Kriegman D. J., "Detecting Faces in Images: A Survey," IEEE Trans. on Pattern Analysis and Machine Intelligence. DOI: 10.1109/34.982883
"Rapid Object Detection using a Boosted Cascade of Simple Features," Conf. on Computer Vision and Pattern Recognition.
"Robust Real-time Object Detection," Intl. Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing, and Sampling.
Erdem C. E., Erdem A. T., "Combining Haar Feature and Skin Color Based Classifiers for Face Detection," Conf. Acoustics, Speech and Signal Processing.
Inalou S. A., "AdaBoost-based Face Detection in Color Images with Low False Alarm," Conf. Computer Modeling and Simulation.
"Fast Rotation Invariant Face Detection in Color Image Using Multi-Classifier Combination Method."
Brubaker S. C., Mullin M. D., Rehg J. M., "Fast Asymmetric Learning for Cascade Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2007.1181
"Face detection in color images using AdaBoost algorithm based on skin color information," Workshop on Knowledge Discovery and Data Mining.
"Automatic face detection in video sequence using local normalization and optimal adaptive correlation techniques," J. Pattern Recognition. DOI: 10.1016/j.patcog.2008.11.026
"Face detection using discriminating feature analysis and support vector machine," J. Pattern Recognition. DOI: 10.1016/j.patcog.2005.07.003
Waring C. A., "Face Detection Using Spectral Histograms and SVMs," IEEE Trans. Syst., Man, and Cybernetics. DOI: 10.1109/TSMCB.2005.846655
Lee A. J. T., "A Data Mining Approach to Face Detection," J. Pattern Recognition. DOI: 10.1016/j.patcog.2009.09.005
"An Effective Face Detection Algorithm based on Skin Color Information," Conf. Signal Image Technology and Internet Based Systems.
"Target Face Detection using Pulse Coupled Neural Network and Skin Color Model," Conf. Computer Science & Service System.
"Face Detection Based on a Model of the Skin Color with Constraints and Template Matching," Conf. Machine and Web Intelligence.
"Hybridboost Learning for Multi-pose Face Detection and Facial Expression Recognition," J. Pattern Recognition. DOI: 10.1016/j.patcog.2007.08.010
"High-Performance Rotation Invariant Multiview Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2007.1011
"Tracking and Learning Graphs and Pose on Image Sequences of Faces," International Conf. Automatic Face and Gesture Recognition.
"Tracking Facial Feature Points with Gabor Wavelets and Shape Models," International Conf. on Audio- and Video-based Biometric Person Authentication.
"Realtime Bayesian 3-D Pose Tracking," IEEE Trans. Circuits Syst. Video Techn. DOI: 10.1109/TCSVT.2006.885727
"Real Time Feature Based 3-D Deformable Face Tracking."
"Active Appearance Models," IEEE Trans. Pattern Anal. Mach. Intell. DOI: 10.1109/34.927467
Sung J.-W., "Large Motion Object Tracking using Active Contour Combined Active Appearance Model," International Conference on Computer Vision Systems.
"AAM based Face Tracking with Temporal Matching and Face Segmentation," IEEE Conf. on Computer Vision and Pattern Recognition.
"Robust Face Tracking via Collaboration of Generic and Specific Models," IEEE Trans. Image Processing. DOI: 10.1109/TIP.2008.924287
"Adaptive Appearance Model and Condensation Algorithm for Robust Face Tracking," IEEE Trans. Syst., Man, and Cybernetics. DOI: 10.1109/TSMCA.2010.2041655
"Colour Model Selection and Adaptation in Dynamic Scenes," European Conference on Computer Vision.
"Robust Real-time Face Tracking using Adaptive Color Model," International Symposium on Mechatronics and Intelligent Mechanical System for 21 Century.
"Adaptive Color Space Switching for Tracking under Varying Illumination," Image Vision Computation. DOI: 10.1016/j.imavis.2004.09.005
"Robust Face Tracking by Integration of Two Separate Trackers: Skin Color and Facial Shape," J. Pattern Recognition. DOI: 10.1016/j.patcog.2007.03.003
"Multimodal Approach to Human-Face Detection and Tracking," IEEE Trans. Industrial Electronics. DOI: 10.1109/TIE.2007.903993
"A Robust Realtime Face Tracking Algorithm," International Conference on Image Processing.
"Real-Time Head and Hand Tracking Based on 2.5D Data," IEEE Trans. Multimedia. DOI: 10.1109/TMM.2012.2189853
"Locating Facial Region of a Head-and-Shoulders Color Image," Int'l Conf. Automatic Face and Gesture Recognition.
Gonzalez R. C., Woods R. E., Digital Image Processing, Pearson Ed. Inc., Upper Saddle River, NJ.