Tracking of 2D or 3D Irregular Movement by a Family of Unscented Kalman Filters
Journal of information and communication convergence engineering. 2012. Sep, 10(3): 307-314
Copyright ©2012, The Korean Institute of Information and Communication Engineering
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( 3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : June 20, 2012
  • Accepted : July 18, 2012
  • Published : September 30, 2012
Junli Tao
Reinhard Klette

This paper reports on the design of an object tracker that utilizes a family of unscented Kalman filters, one for each tracked object. This is a more efficient design than having one unscented Kalman filter for the family of all moving objects. The performance of the designed and implemented filter is demonstrated by using simulated movements, and also for object movements in 2D and 3D space.
Object tracking is an important task for many applications, such as for robot navigation, surveillance, automotive safety, and video content indexing. Based on trajectories obtained through tracking, some advanced behaviour analysis can be applied. For instance, a pedestrian’s trajectory can be analysed to warn a driver if the trajectories of the vehicle and of the pedestrian are potentially intersecting.
For multiple object tracking, tracking-by-detection methods are the most popular algorithms. A detector is used in each image frame to obtain candidate objects. Then, with a data-association procedure, all the candidates are matched to the existing trajectories as known up to the previous frame. Any unmatched candidate starts a new trajectory. Since there is no perfect detector that detects all objects without any false positives and false negatives, sometimes objects are missed (i.e., they appear in the image but are not detected), or background windows are incorrectly detected as being objects. Such false-positive or false-negative detections increase the difficulty of tracking.
Occlusion by other objects or the background is one of the main reasons for detection to fail, and it also increases the difficulty of tracking (e.g., identity switch). Some algorithms [1, 2] propose tracking objects in the 2D image plane. The occlusion problem is handled either by using part detectors and tracking detected body parts, or by adopting instance-specific classifiers to improve the performance of data assignment. However, tracking in the 2D image plane increases the ambiguity of data association. A tall person nearby and a small person far away, for example, may appear very close to each other in the image, and in some frames the tall person may occlude the small person, although they are actually several meters away from each other. Thus, often, and also in this paper, stereo information is adopted to improve the tracking performance [3-5], and multiple pedestrians are tracked in 3D coordinates.
Tracking objects with irregular movements in 3D space is a challenging task because speed and direction are totally unknown. In this paper, the application of an unscented Kalman filter (UKF), which can also handle nonlinear (in fact, fully irregular) trajectories in 3D space, is demonstrated. For the original paper on the UKF see [6]. Similar work is proposed in [5]. However, instead of modelling the motion of the vehicle and the pedestrians separately, we directly model the relative motion between them, and no ground plane is assumed, so that objects moving with 6 degrees of freedom can be tracked properly. Different types of models are tested and compared in both simulated and real sequences.
Multiple object tracking has attracted a great deal of attention recently in computer vision research. Today, an update of the review [7] from 2006 should also include work such as in [2-4, 8-13].
Kalman filters (KF) have been extensively adopted to deal with tracking tasks. A KF is a recursive Bayesian filter that first uses motion information to predict the possible position, and then fuses the observation (detection) with the predicted position. A linear KF is used for tracking (e.g., [7]) when the movement is such that linear models may be used for approximation. Obviously, a linear model is not suitable for most cases. The extended Kalman filter (EKF) was designed [14] for handling a nonlinear model by linearizing functions using the Taylor expansion. For example, an EKF has been used for simultaneous localization and mapping [15], and for pedestrian tracking [16]. A particle filter was used to handle the task in [17]. Performance similar to an EKF is reported in [3].
The UKF can handle a nonlinear model by using the unscented transform to estimate the first and second order moments of sigma points, which represent the distribution of a predicted state and predicted observations, and it appears that the UKF does this better than the EKF [18]. Thus, in this paper, a UKF is used for tracking multiple, irregularly moving objects in 3D space, which is a highly nonlinear problem.
The unscented transform (UT) is the core component that enables the UKF to handle nonlinear models. Let L be the dimensionality of the system state $x_{t-1|t-1}$ at time t-1. If the system noise (process noise Q and measurement noise R) is not additive, the state is augmented before the UT. In our case, random acceleration is introduced as process noise; thus, the state, augmented with a process noise vector $n_a$, is denoted by
$$x^{(a)}_{t-1|t-1} = \begin{bmatrix} x_{t-1|t-1}^T & n_a^T \end{bmatrix}^T$$
and called the augmented vector for short. The dimension of the augmented vector depends on the process model, which is illustrated in Section IV. Let $x_{t|t-1}$ denote the predicted state at time t when passing $x_{t-1|t-1}$ through the process function f. Let $y_{t|t-1}$ be the predicted observation at time t when passing $x_{t|t-1}$ through the observation function h.
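The UT works by sampling 2L+1 sigma vectors $\mathcal{X}^{(a)}_i$ in the augmented state space (following [6]), forming a matrix $\chi^{(a)}$. The covariance matrix in augmented state space is denoted by $P^{(a)}$. Let $P^{(xx)}$ be the state covariance matrix (i.e., describing dependencies between components of a state x). Formally,
$$\mathcal{X}^{(a)}_0 = x^{(a)}, \qquad \mathcal{X}^{(a)}_i = x^{(a)} + \left(\sqrt{(L+\lambda)\,P^{(a)}}\right)_i \ \ (i = 1, \dots, L), \qquad \mathcal{X}^{(a)}_i = x^{(a)} - \left(\sqrt{(L+\lambda)\,P^{(a)}}\right)_{i-L} \ \ (i = L+1, \dots, 2L)$$
where λ is a positive real, used as a scaling parameter, and $(\cdot)_i$ denotes the i-th column of the matrix square root. These sigma vectors can be passed through a nonlinear function (e.g., f or h) one by one, thus defining transformed (i.e., new) sigma vectors such as
$$\mathcal{Y}_i = h(\mathcal{X}_i), \qquad i = 0, \dots, 2L.$$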
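Means $x_{t|t-1}$ or $y_{t|t-1}$ and covariance matrices $P^{(xx)}_{t|t-1}$ or $P^{(yy)}_{t|t-1}$ are obtained as follows; take h for example:
$$y_{t|t-1} = \sum_{i=0}^{2L} W^{(m)}_i \mathcal{Y}_i, \qquad P^{(yy)}_{t|t-1} = \sum_{i=0}^{2L} W^{(c)}_i \left(\mathcal{Y}_i - y_{t|t-1}\right)\left(\mathcal{Y}_i - y_{t|t-1}\right)^T$$
with constant weights $W^{(\cdot)}_i$. Details are given in [6].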
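The transform described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the simplest weighting in which the mean and covariance weights coincide ($W_0 = \lambda/(L+\lambda)$, $W_i = 1/(2(L+\lambda))$), uses a Cholesky factor as the matrix square root, and the function names `cholesky`, `sigma_points`, and `unscented_transform` are hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular factor S with S * S^T = A (A symmetric positive definite)."""
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(S[i][k] * S[j][k] for k in range(j))
            if i == j:
                S[i][j] = math.sqrt(A[i][i] - s)
            else:
                S[i][j] = (A[i][j] - s) / S[j][j]
    return S

def sigma_points(mean, cov, lam):
    """Return the 2L+1 sigma vectors and their weights (assumed equal for mean and covariance)."""
    L = len(mean)
    S = cholesky([[(L + lam) * c for c in row] for row in cov])
    columns = [[S[r][i] for r in range(L)] for i in range(L)]
    points = [list(mean)]
    points += [[m + c for m, c in zip(mean, col)] for col in columns]
    points += [[m - c for m, c in zip(mean, col)] for col in columns]
    weights = [lam / (L + lam)] + [0.5 / (L + lam)] * (2 * L)
    return points, weights

def unscented_transform(mean, cov, func, lam=1.0):
    """Propagate (mean, cov) through func and return the transformed mean and covariance."""
    points, w = sigma_points(mean, cov, lam)
    ys = [func(p) for p in points]
    m = len(ys[0])
    y_mean = [sum(wi * y[k] for wi, y in zip(w, ys)) for k in range(m)]
    y_cov = [[sum(wi * (y[a] - y_mean[a]) * (y[b] - y_mean[b]) for wi, y in zip(w, ys))
              for b in range(m)] for a in range(m)]
    return y_mean, y_cov
```

For a linear function the transform reproduces the exact mean and covariance, which is a convenient sanity check before applying it to a nonlinear f or h.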
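The UKF is illustrated as follows. At first we initialize the state $x = x_0$ and the state covariance $P^{(xx)} = P^{(xx)}_0$. For the augmented vectors, let
$$x^{(a)}_0 = \begin{bmatrix} x_0^T & 0^T \end{bmatrix}^T, \qquad P^{(a)}_0 = \begin{bmatrix} P^{(xx)}_0 & 0 \\ 0 & Q \end{bmatrix}$$
where Q denotes the process-noise covariance matrix. Details about Q are given in Section IV. For t = 1, 2, …, we calculate sigma vectors as follows:
$$x^{(a)}_{t-1} = \begin{bmatrix} x_{t-1|t-1}^T & 0^T \end{bmatrix}^T, \qquad P^{(a)}_{t-1} = \begin{bmatrix} P^{(xx)}_{t-1|t-1} & 0 \\ 0 & Q \end{bmatrix}$$
$$\chi^{(a)}_{t-1} = \begin{bmatrix} x^{(a)}_{t-1} & \ x^{(a)}_{t-1} + \sqrt{(L+\lambda)\,P^{(a)}_{t-1}} & \ x^{(a)}_{t-1} - \sqrt{(L+\lambda)\,P^{(a)}_{t-1}} \end{bmatrix}$$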
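The process update is as follows:
$$\mathcal{X}^{(x)}_{i,t|t-1} = f\left(\mathcal{X}^{(x)}_{i,t-1}, \mathcal{X}^{(n)}_{i,t-1}\right), \qquad x_{t|t-1} = \sum_{i=0}^{2L} W^{(m)}_i \mathcal{X}^{(x)}_{i,t|t-1}, \qquad P^{(xx)}_{t|t-1} = \sum_{i=0}^{2L} W^{(c)}_i \left(\mathcal{X}^{(x)}_{i,t|t-1} - x_{t|t-1}\right)\left(\mathcal{X}^{(x)}_{i,t|t-1} - x_{t|t-1}\right)^T$$
where $\mathcal{X}^{(x)}_i$ and $\mathcal{X}^{(n)}_i$ denote the state part and the noise part of the i-th augmented sigma vector.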
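We update the sigma vectors using
$$\mathcal{Y}_{i,t|t-1} = h\left(\mathcal{X}^{(x)}_{i,t|t-1}\right), \qquad y_{t|t-1} = \sum_{i=0}^{2L} W^{(m)}_i \mathcal{Y}_{i,t|t-1}$$
and update the measurement covariance matrix as follows:
$$P^{(yy)}_{t|t-1} = \sum_{i=0}^{2L} W^{(c)}_i \left(\mathcal{Y}_{i,t|t-1} - y_{t|t-1}\right)\left(\mathcal{Y}_{i,t|t-1} - y_{t|t-1}\right)^T + R, \qquad P^{(xy)}_{t|t-1} = \sum_{i=0}^{2L} W^{(c)}_i \left(\mathcal{X}^{(x)}_{i,t|t-1} - x_{t|t-1}\right)\left(\mathcal{Y}_{i,t|t-1} - y_{t|t-1}\right)^T$$
where R is the assumed measurement noise covariance, depending on the observation model selected, and $\mathcal{X}^{(x)}_{i,t|t-1}$ is the state part of the i-th predicted sigma vector. Details are given in Section IV.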
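Altogether, the UKF is defined by
$$K_t = P^{(xy)}_{t|t-1} \left(P^{(yy)}_{t|t-1}\right)^{-1}, \qquad x_{t|t} = x_{t|t-1} + K_t \left(y_t - y_{t|t-1}\right), \qquad P^{(xx)}_{t|t} = P^{(xx)}_{t|t-1} - K_t P^{(yy)}_{t|t-1} K_t^T$$
where $K_t$ is the Kalman gain and $y_t$ is the observation at time t.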
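As a worked illustration of the final update step, the sketch below applies the standard UKF correction from [6] (gain, state update, covariance update) for the special case of a scalar measurement, so that the inverse of the measurement covariance reduces to a division. The function name `ukf_measurement_update` and the scalar restriction are assumptions made for brevity; the observations used later in this paper are vector-valued.

```python
def ukf_measurement_update(x_pred, P_pred, y_pred, P_yy, P_xy, y_obs):
    """UKF correction for a scalar measurement.

    x_pred : predicted state (list of n floats)
    P_pred : predicted state covariance (n x n nested list)
    y_pred : predicted measurement (float)
    P_yy   : predicted measurement variance, noise R already added (float)
    P_xy   : state/measurement cross-covariance (list of n floats)
    y_obs  : actual measurement (float)
    """
    n = len(x_pred)
    gain = [p / P_yy for p in P_xy]            # K = P_xy * P_yy^-1
    innovation = y_obs - y_pred
    x_new = [x + k * innovation for x, k in zip(x_pred, gain)]
    # P = P_pred - K * P_yy * K^T
    P_new = [[P_pred[a][b] - gain[a] * P_yy * gain[b] for b in range(n)]
             for a in range(n)]
    return x_new, P_new, gain
```

For example, with predicted state (0, 1), predicted covariance [[4, 1], [1, 2]], a measurement of the first component with noise variance 1 (so P_yy = 5 and P_xy = (4, 1)), and an observed value of 5, the gain is (0.8, 0.2) and the corrected state is (4, 2).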
Following tracking-by-detection methods, which are popular for solving multiple-object tracking tasks, a detector is applied in each frame to generate object candidates which are outputs of the detector. One UKF is adopted for tracking one object separately; thus a group of detected pedestrians defines a family of UKFs to be processed simultaneously. Each UKF tracks one detected object. The predicted state of a UKF is used for data association; when an observation (of the tracked object) is available in the current frame then we update the predicted state by using the corresponding UKF.
- A. Detection
Tracking-by-detection methods rely on evaluating rectangular regions of interest, and we call them object boxes if positively identified as containing an object of interest. For pedestrian tracking, we adopt the popular histogram of oriented gradients (HOG) feature method and a support vector machine (SVM) classifier, originally introduced in [19]. HOG features describe the human profile by an oriented gradient histogram. An SVM classifier is able to handle high-dimensional and nonlinear features (such as HOG features). It projects sample features into a high-dimensional space, and then finds a hyperplane to separate two classes. Instead of using a sliding window, regions of interest (i.e., inputs to the classifier) are selected by analysing calculated stereo information (depth and disparity maps), as proposed in [20].
Fig. 1. The depth map on top uses a colour code for calculated distances; depth values are only shown at pixels where the mode filter accepts the given value. The lower images show detected (coloured) object boxes.
Fig. 1 shows several detection results in a pedestrian sequence. Dots (cyan) denote the centres of boxes that are recognized as pedestrians, and the red rectangles denote the final detection results. As can be seen in the results, the object boxes may contain background, be shifted relative to the object, or miss pedestrians altogether.
For the detection of Drosophila larvae (an example of 2D movement), thresholding and connected-component analysis are adopted to obtain one object box for each larva. Several larva detection results are shown in Fig. 2. As the scene is controlled, the detection results are more reliable than for the pedestrian sequence. However, no depth information is available here.
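A threshold-plus-connected-components detector of this kind can be sketched as follows. The sketch assumes that larvae are brighter than the background, that the image is a nested list of intensities, and that 4-connectivity is sufficient; none of these details are stated in the text, and `detect_boxes` is a hypothetical name.

```python
from collections import deque

def detect_boxes(image, threshold):
    """Return one bounding box (row_min, col_min, row_max, col_max) per
    4-connected blob of pixels whose intensity exceeds the threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill of the blob that contains (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes
```

The centre of each returned box can then serve as the observed position for the 2D model.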
Fig. 2. Larvae detection results shown by (cyan) object boxes.
- B. UKF-based Object Tracking
As there is an unknown number of objects in a scene, the state dimensionality would expand significantly if we decided to track all pedestrians in one UKF; in this case, the speed of tracking would drop dramatically when the scene is crowded with many detected objects. Thus, we decided on one UKF for each detected object.
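The growth in state dimensionality can be made concrete by counting the scalars in the sigma-point matrix, which has 2L+1 vectors of length L. The sketch below assumes an augmented dimension of 9 per object (a 6-dimensional state plus a 3-dimensional acceleration noise vector); this number and the function names are illustrative assumptions, not figures reported in the paper.

```python
def sigma_matrix_entries(aug_dim):
    """Number of scalars in the sigma-point matrix: (2L + 1) vectors of length L."""
    return (2 * aug_dim + 1) * aug_dim

def joint_vs_family(num_objects, aug_dim_per_object=9):
    """Compare one joint filter over all objects with one filter per object."""
    joint = sigma_matrix_entries(num_objects * aug_dim_per_object)
    family = num_objects * sigma_matrix_entries(aug_dim_per_object)
    return joint, family

# Ten objects: the joint filter handles 16290 scalars per step,
# the family of ten independent filters only 1710 in total.
print(joint_vs_family(10))
```

The joint count grows quadratically with the number of objects, whereas the family of filters grows linearly; the matrix square root needed for the sigma points widens the gap further.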
Choosing a proper model is important. In this subsection we offer three models for possible selection: 3DVT means that 3D position (world coordinates) with velocity is observed, 3DT means that 3D position without velocity is observed, and 2DT means that 2D position (image coordinates) without velocity is observed. These models are compared in Section V.
- 1) The Two 3D Models
In the 3DVT model, the object is tracked in 3D world coordinates. Its 3D position (x, y, z) is the first part of the state. We also include the velocity ($v_x, v_y, v_z$). Thus, a state $x = (x, y, z, v_x, v_y, v_z)^T$ is 6-dimensional.
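  • a) Process model: We assume constant velocity between adjacent frames, with Gaussian distributed noise acceleration $n_a \sim N(0, \Sigma_{n_a})$. The diagonal elements in $\Sigma_{n_a}$ are set to be equal and denoted by $\sigma^2_{n_a}$. With position $p = (x, y, z)^T$ and velocity $v = (v_x, v_y, v_z)^T$, the process model is
$$p_t = p_{t-1} + v_{t-1}\,\Delta t + \tfrac{1}{2}\, n_a\, \Delta t^2, \qquad v_t = v_{t-1} + n_a\, \Delta t$$
where Δt is the time interval between subsequent frames.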
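A constant-velocity process function with random acceleration can be sketched as below. The sketch assumes the common discretisation in which the acceleration sample is held constant over one frame interval, so it contributes ½·a·Δt² to the position and a·Δt to the velocity; `process_3d` is a hypothetical name.

```python
def process_3d(state, accel, dt):
    """Propagate a state (x, y, z, vx, vy, vz) over one frame interval dt.

    accel is one sample (ax, ay, az) of the acceleration noise, assumed
    constant during the interval.
    """
    pos, vel = state[:3], state[3:]
    new_pos = tuple(p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(pos, vel, accel))
    new_vel = tuple(v + a * dt for v, a in zip(vel, accel))
    return new_pos + new_vel
```

In a UKF with an augmented state, this function would be applied to every sigma vector, taking the position and velocity from its state part and `accel` from its noise part.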
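  • b) Observation model: An observation consists of the position ($i_0$, $j_0$) (say, the centroid of the detected object box in the left camera), the disparity d of the detected object, and the velocity ($v_{ox}, v_{oy}, v_{oz}$) in 3D coordinates. The usual pinhole camera projection model is used to map 3D points onto the image plane,
$$i_0 = \frac{f\,x}{z}, \qquad j_0 = \frac{f\,y}{z}, \qquad d = \frac{f\,b}{z}$$
where f denotes focal length, and b denotes the length of the baseline between two rectified stereo cameras. In this case,
$$R = \mathrm{diag}\left(\sigma^2_{n_{mp}}, \sigma^2_{n_{mp}}, \sigma^2_{n_{mp}}, \sigma^2_{n_{mv}}, \sigma^2_{n_{mv}}, \sigma^2_{n_{mv}}\right)$$
where $\sigma^2_{n_{mp}}$ and $\sigma^2_{n_{mv}}$ denote the variances of the measurement noise for position and velocity, respectively.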
For the disparity d we select the mode in the disparity map in a fixed (e.g., 20 × 20) neighbourhood around the centroid ($i_0$, $j_0$) of the detected object box. 3D scene flow ($v_{ox}, v_{oy}, v_{oz}$) can be obtained by combining optic flow and stereo information [21].
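Selecting the mode of the disparities in a window can be sketched as follows. The sketch assumes integer disparities stored in a nested list indexed as `disparity[row][col]`, a marker value for invalid pixels, and a window clipped at the image border; these conventions and the name `disparity_mode` are assumptions.

```python
from collections import Counter

def disparity_mode(disparity, row, col, size=20, invalid=0):
    """Most frequent valid disparity in a size x size window around (row, col).

    Returns None if the window contains no valid disparity.
    """
    half = size // 2
    r_lo, r_hi = max(0, row - half), min(len(disparity), row + half)
    c_lo, c_hi = max(0, col - half), min(len(disparity[0]), col + half)
    values = [disparity[r][c]
              for r in range(r_lo, r_hi)
              for c in range(c_lo, c_hi)
              if disparity[r][c] != invalid]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]
```

Using the mode rather than the mean keeps the estimate on the dominant surface inside the box, which matters when the box also contains background pixels at a very different depth.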
As it is difficult to obtain high-quality scene flow as required for 3DVT, 3DT simplifies the 3DVT model by excluding the scene flow from the observation, and has the same process model as 3DVT. In this case,
$$R = \mathrm{diag}\left(\sigma^2_{n_{mp}}, \sigma^2_{n_{mp}}, \sigma^2_{n_{mp}}\right)$$
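The two 3D observation functions can be sketched as below. The sketch assumes image coordinates measured relative to the principal point, a point in front of the camera (z > 0), and rectified cameras; the function names are hypothetical, and `back_project` is included only to show that position and disparity together determine the 3D point.

```python
def observe_3dvt(state, f, b):
    """Map a state (x, y, z, vx, vy, vz) to an observation
    (i, j, d, vx, vy, vz) with focal length f and baseline b."""
    x, y, z, vx, vy, vz = state
    return (f * x / z, f * y / z, f * b / z, vx, vy, vz)

def observe_3dt(state, f, b):
    """Same projection without velocity: (i, j, d)."""
    return observe_3dvt(state, f, b)[:3]

def back_project(i, j, d, f, b):
    """Recover (x, y, z) from image position and disparity (d > 0)."""
    z = f * b / d
    return (i * z / f, j * z / f, z)
```

Because the depth z appears in the denominator, both observation functions are nonlinear in the state, which is why an unscented (rather than a linear) filter is applied.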
- 2) The 2D Model
If only monocular recording is available, the object is tracked in the 2D image plane only. The state $x = (i, j, v_i, v_j)^T$ consists of position (i, j) and velocity ($v_i, v_j$).
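a) Process model: The process model is the same as for the 3D models. We assume a constant velocity between subsequent frames with a Gaussian noise distribution for acceleration $n_a$:
$$\begin{bmatrix} i_t \\ j_t \end{bmatrix} = \begin{bmatrix} i_{t-1} \\ j_{t-1} \end{bmatrix} + \begin{bmatrix} v_{i,t-1} \\ v_{j,t-1} \end{bmatrix} \Delta t + \tfrac{1}{2}\, n_a\, \Delta t^2, \qquad \begin{bmatrix} v_{i,t} \\ v_{j,t} \end{bmatrix} = \begin{bmatrix} v_{i,t-1} \\ v_{j,t-1} \end{bmatrix} + n_a\, \Delta t$$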
b) Observation model: An observation consists of the central position ($i_0$, $j_0$) of an object box only, with $i_0 = i$ and $j_0 = j$, resulting in $R = \mathrm{diag}(\sigma^2_{n_{mp}}, \sigma^2_{n_{mp}})$ for this case.
- C. Data Association
As each object is tracked independently, data association by matching candidates to existing trajectories becomes important. If no match is found then we decide to initialize a new tracker.
Since object movements are continuous, the estimated velocity in the UKF can be used as a cue to localize the search area in order to find the matching object. For each trajectory, the possible location (i.e., ($x_p, y_p, z_p$) for 3D, and ($i_p, j_p$) for 2D) of the object in the current frame is predicted by the process model used in the UKF. This location is used as a reference for searching potentially matching candidates in the current frame. Currently we simply match candidates based on the shortest Euclidean distance and a given threshold T.
One candidate might be matched with several trajectories if the Euclidean distance is below T . Trajectories compete for the candidates, and in the end, the closest one wins. If a candidate is not matched to any trajectory, a new tracker is initialized. If a trajectory does not win any of the candidates, the tracker is propagated with the given prediction, and the new state is the predicted state, without being updated by an observation (because it is not available).
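The competition between trajectories can be sketched as a greedy assignment: all prediction/candidate pairs closer than T are sorted by distance, and the closest pair is committed first. This is one plausible reading of the rule described above rather than the authors' exact procedure, and `associate` is a hypothetical name.

```python
import math

def associate(predictions, candidates, threshold):
    """Greedy nearest-neighbour data association.

    predictions : predicted positions, one per existing trajectory
    candidates  : detected positions in the current frame
    Returns (matches, new_tracks, coasting):
      matches    - dict trajectory index -> candidate index
      new_tracks - candidate indices that start a new tracker
      coasting   - trajectory indices that keep their prediction
    """
    pairs = sorted(
        (math.dist(p, c), ti, ci)
        for ti, p in enumerate(predictions)
        for ci, c in enumerate(candidates)
        if math.dist(p, c) < threshold
    )
    matches, used = {}, set()
    for _, ti, ci in pairs:
        if ti not in matches and ci not in used:
            matches[ti] = ci          # the closest trajectory wins the candidate
            used.add(ci)
    new_tracks = [ci for ci in range(len(candidates)) if ci not in used]
    coasting = [ti for ti in range(len(predictions)) if ti not in matches]
    return matches, new_tracks, coasting
```

The same function works for 2D and 3D positions, since `math.dist` accepts points of any equal dimension.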
No object appearance description is used here for assigning an object to a trajectory. In general, the inclusion of appearance representation (e.g., a colour histogram, or an instance-specific shape model) improves the performance. However, this is out of the scope of this paper, where we are focusing on the combination of different data association methods.
In this section, first, our three models (3DVT, 3DT, and 2DT) are tested in a simulated environment with different parameter sets. Second, our multiple-object tracking method is tested on real video sequences where (3D example) pedestrians are walking in inner-city scenes, or (2D example) larvae are moving on a flat culture dish.
- A. Simulated Tracking
The three models defined in Section IV are tested in a simulation environment in OpenGL (SGI, Fremont, CA, USA). A cube is moving on a circular path around a 3D point with constant speed, as shown in Figs. 3 and 4. Acceleration noise $n_a$ with different covariance (e.g., $\sigma^2_{n_a}$ = 0.0001, 0.01, 1), and measurement noise $n_m$ with different covariance (e.g., $\sigma^2_{n_{mp}}$ = 10, 50, 100, $\sigma^2_{n_{mv}}$ = 50, 100, 150), are used to test and compare the performance of the three models. The simulation environment is different for 2D and 3D models: for 2D, positions are integral pixel coordinates in the image plane, but for 3D, position coordinates are reals. The radius of the circle in the 3D models is 10, while in the 2D model it is 50. In both environments, measurements are degraded by noise before being sent to the UKF.
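The 2D part of such a simulation can be sketched without OpenGL. The sketch below generates integer pixel positions on a circle of radius 50, traversed at constant angular speed, and degrades them with Gaussian noise of variance `meas_var`; the centre, the angular step, the seed handling, and the name `simulate_circle_2d` are assumptions made for illustration.

```python
import math
import random

def simulate_circle_2d(num_frames, radius=50.0, centre=(320, 240),
                       angle_step=0.05, meas_var=50.0, seed=0):
    """Return (truth, measured): integer ground-truth pixel positions on a
    circle and the same positions degraded by Gaussian measurement noise."""
    rng = random.Random(seed)          # local generator for reproducibility
    sigma = math.sqrt(meas_var)
    truth, measured = [], []
    for k in range(num_frames):
        angle = k * angle_step         # constant angular step = constant speed
        i = round(centre[0] + radius * math.cos(angle))
        j = round(centre[1] + radius * math.sin(angle))
        truth.append((i, j))
        measured.append((i + rng.gauss(0.0, sigma), j + rng.gauss(0.0, sigma)))
    return truth, measured
```

Feeding `measured` to a tracker and comparing its estimates against `truth` gives the kind of comparison shown in the simulation figures.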
Fig. 3 demonstrates the effect of $\sigma^2_{n_a}$, having fixed $\sigma^2_{n_{mp}}$ and $\sigma^2_{n_{mv}}$. Experiments show that larger $\sigma^2_{n_a}$ values result in more unstable trajectories. A large $\sigma^2_{n_a}$ means that the process model produces a predicted state that is fluctuating with large magnitudes. Results show that $\sigma^2_{n_a}$ = 0.0001 is a reasonable choice for 3D models. For the 2D model, a smaller $\sigma^2_{n_a}$ yields smooth estimation, but the shifts are significant.
Fig. 3. Simulation results for variations in the variance of acceleration noise. From left to right, $\sigma^2_{n_a}$ = 0.0001, 0.01, or 1, with fixed values $\sigma^2_{n_m} = \sigma^2_{n_p}$ = 50, and $\sigma^2_{n_v}$ = 100. From top to bottom, the tracking model is 3DVT, 3DT and 2DT, respectively.
For the 2D case, a larger value (e.g., $\sigma^2_{n_a}$ = 1) produces estimates that are closer to the true positions, but with significant fluctuations. In general, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show a similar performance. 3D models use stereo information rather than just a single image as for the 2D model, which indicates that stereo information can help to improve the tracking performance. As the measured 3D position is noisy, the measured velocity is even noisier; this appears to be the main reason why the inclusion of velocity does not improve the performance.
Fig. 4 shows results for our models for different covariance values $\sigma^2_{n_{mp}}$ and $\sigma^2_{n_{mv}}$ of measurement noise. Significantly increasing measurement noise (i.e., higher uncertainty of observations) reduces the performance only slightly. This demonstrates that, to some degree, the UKF is a robust tracker, which is not vulnerable to detection uncertainties. As before, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show a similar performance.
Fig. 4. Simulation results for variable variance of measurement noise. From left to right, $\sigma^2_{n_{mp}}$ = 10, 50, or 100, and $\sigma^2_{n_{mv}}$ = 50, 100, or 150, respectively, with fixed $\sigma^2_{n_a}$ = 0.0001 for 3D models and $\sigma^2_{n_a}$ = 1 for 2D models. From top to bottom, the tracking models are 3DVT, 3DT and 2DT, respectively.
- B. Multiple Object Tracking in Real Data
In this section we report on the performance of UKF-supported tracking for multiple larvae using the 2DT model, and for multiple pedestrians in traffic scenes using the 3DT model. The larvae and pedestrian sequences are recorded at 30 and 15 frames per second, respectively.
Results for larvae tracking are shown in Fig. 5. As the velocity in the model is initialized by (0,0), the UKF estimation is “slower” than the real speed of the larvae in the first 30 frames. The speed of convergence can be improved by increasing $\sigma^2_{n_a}$, but it should be noted that the larger the $\sigma^2_{n_a}$ value is, the larger the magnitude of fluctuation. The estimated trajectories follow the moving larvae effectively, mainly because all of the larvae are properly detected in all of the frames. However, such a complete detection cannot be expected for pedestrian sequences. Next, we test the UKF on the “noisy” detection results obtained for pedestrian sequences.
The results for pedestrian tracking are shown in Fig. 6. Objects are missing or shifting from time to time due to the cluttered background (e.g., the car in the traffic scene detected as a pedestrian), illumination variations leaving some pedestrians undetected, or internal variations between objects (i.e., unstable detections). Our experiments verified that UKF predictions are able to follow irregularly moving pedestrians when detection fails for a few frames, and can even correct unstable detections.
Fig. 5. 2D tracking results of larva sequences. From top to bottom: tracking results in Frames 26, 46, and 166 of one sequence. The red lines show the detected track, and the white lines show the unscented Kalman filter-predicted track. The blue lines represent estimated trajectories. The left column is the original intensity image overlaid with the estimated trajectories.
The second frame in Fig. 6 shows that the undetected pedestrian is predicted correctly in the white object box and is successfully matched to a detected position in the third frame. The last frame in Fig. 6 demonstrates that displaced detections are corrected by the UKF. Using only the defined distance rule for data assignment appears to be insufficient, especially for the given detection results. A small threshold may lead to a mismatch (i.e., the detection fails to satisfy the rule), and a large threshold may lead to an identity switch (i.e., a pedestrian is matched to another pedestrian).
Assigning one UKF to each detected (moving) object simplifies the design and implementation of UKF prediction of 2D or 3D motion. Experiments demonstrate the robustness of the chosen approach. This tracker only generates short-term tracks when detection is not reliable; long-term tracking should be possible by also introducing dynamic programming.
Fig. 6. (a) Object boxes (red), unscented Kalman filter (UKF) predictions (white), and UKF estimations (blue) are overlaid on the original intensity image. The black box is an invalid detection excluded by a disparity check. (b) Coloured blocks denote the object boxes (red), UKF predictions (white), and UKF estimations (blue) in the XZ-plane in real-world coordinates.
For evaluating the performance in real-world (either 2D or 3D) applications, more extensive tests need to be undertaken, especially for the design and evaluation of quantitative performance measures. For example, the measures discussed in [22] for evaluating visual odometry techniques might also be of relevance for the tracking case.
The authors thank Simon Hermann for providing his implementation of semi-global matching for stereo analysis, Gabriel Hartmann (both Auckland) for support regarding the unscented Kalman filter, and Benjamin Risse (Münster) for video data with moving Drosophila larvae.
[1] Wu B, Nevatia R. 2007. “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors.” International Journal of Computer Vision 75(2): 247-266
[2] Breitenstein M. D, Reichlin F, Leibe B, Koller-Meier E, van Gool L. 2011. “Online multiperson tracking-by-detection from a single, uncalibrated camera.” IEEE Transactions on Pattern Analysis and Machine Intelligence 33(9): 1820-1833
[3] Ess A, Schindler K, Leibe B, van Gool L. 2010. “Object detection and tracking for autonomous navigation in dynamic environments.” International Journal of Robotics Research 29(14): 1707-1725
[4] Leibe B, Schindler K, Cornelis N, van Gool L. 2008. “Coupled object detection and tracking from static cameras and moving vehicles.” IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10): 1683-1698
[5] Meuter M, Iurgel U, Park S. B, Kummert A. 2008. “The unscented Kalman filter for pedestrian tracking from a moving host.” Proceedings of IEEE Intelligent Vehicles Symposium, Eindhoven, Netherlands: 37-42
[6] Wan E. A, van der Merwe R. 2000. “The unscented Kalman filter for nonlinear estimation.” Proceedings of IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium, Lake Louise, AB: 153-158
[7] Yilmaz A, Javed O, Shah M. 2006. “Object tracking: a survey.” ACM Computing Surveys 38(4): article no. 13
[8] Ess A, Leibe B, Schindler K, van Gool L. 2009. “Robust multiperson tracking from a mobile platform.” IEEE Transactions on Pattern Analysis and Machine Intelligence 31(10): 1831-1846
[9] Leven W. F, Lanterman A. D. 2009. “Unscented Kalman filters for multiple target tracking with symmetric measurement equations.” IEEE Transactions on Automatic Control 54(2): 370-375
[10] Mitzel D, Horbert E, Ess A, Leibe B. 2010. “Multi-person tracking with sparse detection and continuous segmentation.” Computer Vision ECCV 2010, Lecture Notes in Computer Science 6311: 397-410
[11] Mitzel D, Sudowe P, Leibe B. 2011. “Real-time multi-person tracking with time-constrained detection.” Proceedings of the British Machine Vision Conference, Dundee, UK: 104.1-104.11
[12] Mitzel D, Leibe B. 2011. “Real-time multi-person tracking with detector assisted structure propagation.” Proceedings of Computational Methods for the Innovative Design of Electrical Devices, Barcelona, Spain: 974-981
[13] Shaikh M. M, Wook B, Lee C, Kim T, Lee T, Kim K, Cho D. 2011. “Mobile robot vision tracking system using unscented Kalman filter.” Proceedings of IEEE/SICE International Symposium on System Integration, Kyoto, Japan: 1214-1219
[14] Welch G, Bishop G. 1995. “An introduction to the Kalman filter.” University of North Carolina at Chapel Hill, Chapel Hill, NC, TR 95-041
[15] Huang S, Dissanayake G. 2006. “Convergence analysis for extended Kalman filter based SLAM.” Proceedings of IEEE International Conference on Robotics and Automation, Orlando, FL: 412-417
[16] Yazdi H. S, Hosseini S. E. 2008. “Pedestrian tracking using single camera with new extended Kalman filter.” International Journal of Intelligent Computing and Cybernetics 1(3): 379-397
[17] Cai Y, de Freitas N, Little J. J. 2006. “Robust visual tracking for multiple targets.” Proceedings of the 9th European Conference on Computer Vision, Graz, Austria: 107-118
[18] Hartmann G. 2012. “Unscented Kalman filter sensor fusion for monocular camera localization.” MSc thesis, University of Auckland, New Zealand
[19] Dalal N, Triggs B. 2005. “Histograms of oriented gradients for human detection.” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA: 886-893
[20] Gavrila D. M, Munder S. 2007. “Multi-cue pedestrian detection and tracking from a moving vehicle.” International Journal of Computer Vision 73(1): 41-59
[21] Wedel A, Brox T, Vaudrey T, Rabe C, Franke U, Cremers D. 2011. “Stereoscopic scene flow computation for 3D motion understanding.” International Journal of Computer Vision 95(1): 29-51
[22] Jiang R, Klette R, Wang S. 2011. “Statistical modeling of long-range drift in visual odometry.” Computer Vision ACCV 2010 Workshops, Lecture Notes in Computer Science vol. 6469: 214-224