This paper reports on the design of an object tracker that utilizes a family of unscented Kalman filters, one for each tracked object. This is a more efficient design than having one unscented Kalman filter for the family of all moving objects. The performance of the designed and implemented tracker is demonstrated using simulated movements, as well as real object movements in 2D and 3D space.
I. INTRODUCTION
Object tracking is an important task for many applications, such as robot navigation, surveillance, automotive safety, and video content indexing. Based on trajectories obtained through tracking, advanced behaviour analysis can be applied. For instance, a pedestrian’s trajectory can be analysed to warn a driver if the trajectories of the vehicle and the pedestrian potentially intersect.
For multiple object tracking, tracking-by-detection methods are the most popular algorithms. A detector is used in each image frame to obtain candidate objects. Then, with a data-association procedure, all the candidates are matched to the existing trajectories as known up to the previous frame. Any unmatched candidate starts a new trajectory. Since there is no perfect detector that detects all objects without any false positives and false negatives, sometimes objects are missed (i.e., they appear in the image but are not detected), or background windows are incorrectly detected as being objects. Such false-positive or false-negative detections increase the difficulty of tracking.
Occlusion by other objects or the background is one of the main reasons for detection to fail, and it also increases the difficulty of tracking (e.g., identity switches). Some algorithms [1, 2] propose tracking objects in the 2D image plane. The occlusion problem is handled either by using part detectors and tracking detected body parts, or by adopting instance-specific classifiers to improve the performance of data assignment. However, tracking in the 2D image plane increases the ambiguity of data association. A tall person nearby and a small person far away, for example, may appear very close to each other in the image, and in some frames the tall person may even occlude the small person, although they are actually several metres apart. Thus, often, and also in this paper, stereo information is adopted to improve tracking performance [3-5], and multiple pedestrians are tracked in 3D coordinates.
Tracking objects with irregular movements in 3D space is a challenging task because speed and direction are completely unknown. In this paper, the application of an unscented Kalman filter (UKF), which can handle nonlinear (in fact, fully irregular) trajectories in 3D space, is demonstrated. For the original paper on the UKF see [6]. Similar work is proposed in [5]. However, instead of modelling the motion of the vehicle and the pedestrians separately, we directly model the relative motion between them, and no ground plane is assumed, so that objects moving with six degrees of freedom can be tracked properly. Different types of models are tested and compared in both simulated and real sequences.
II. RELATED WORK
Multiple object tracking has attracted a great deal of attention in recent computer vision research. Today, an update of the review [7] from 2006 should also include work such as [2-4, 8-13].
Kalman filters (KFs) have been extensively adopted for tracking tasks. A KF is a recursive Bayesian filter that first uses motion information to predict the possible position, and then fuses the observation (detection) with the predicted position. A linear KF is used for tracking (e.g., [7]) when the movement is such that linear models can be used for approximation. Obviously, a linear model is not suitable for most cases. The extended Kalman filter (EKF) was designed [14] to handle a nonlinear model by linearizing functions using a Taylor expansion. For example, an EKF has been used for simultaneous localization and mapping [15], and for pedestrian tracking [16]. A particle filter was used to handle the task in [17]; performance similar to that of an EKF is reported in [3].
The UKF can handle a nonlinear model by using the unscented transform to estimate the first- and second-order moments of sigma points, which represent the distribution of a predicted state and predicted observations, and it appears that the UKF does this better than the EKF [18]. Thus, in this paper, a UKF is used for tracking multiple, irregularly moving objects in 3D space, which is a highly nonlinear problem.
III. UNSCENTED KALMAN FILTER
The unscented transform (UT) is the core component that enables the UKF to handle nonlinear models. Let L be the dimensionality of the system state x_{t-1|t-1} at time t-1. If the system noise (process noise Q and measurement noise R) is not additive, the state is augmented before applying the UT. In our case, random acceleration is introduced as process noise; thus the state, augmented with a process-noise vector, is denoted by x^{(a)}_{t-1|t-1} and called the augmented vector for short. The dimension of the augmented vector depends on the process model, which is illustrated in Section IV. Let x_{t|t-1} denote the predicted state at time t when passing x_{t-1|t-1} through the process function f. Let y_{t|t-1} be the predicted observation at time t when passing x_{t|t-1} through the observation function h.
The UT works by sampling 2L+1 sigma vectors X_{i}^{(a)} in the augmented state space (following [6]), forming a matrix χ^{(a)}. The covariance matrix in the augmented state space is denoted by P^{(a)}. Let P^{(xx)} be the state covariance matrix (i.e., describing dependencies between components of a state x). Formally,

  χ^{(a)}_{0} = x^{(a)},
  χ^{(a)}_{i} = x^{(a)} + (√((L+λ) P^{(a)}))_{i},      i = 1, …, L,
  χ^{(a)}_{i} = x^{(a)} - (√((L+λ) P^{(a)}))_{i-L},    i = L+1, …, 2L,

where λ is a positive real used as a scaling parameter, and (·)_{i} denotes the i-th column of the matrix square root. These sigma vectors can be passed through a nonlinear function (e.g., f or h) one by one, thus defining transformed (i.e., new) sigma vectors such as X_{i,t|t-1} = f(χ^{(a)}_{i,t-1|t-1}).
Means x_{t|t-1} or y_{t|t-1} and covariance matrices are obtained as weighted sums over the transformed sigma vectors; taking h as an example,

  y_{t|t-1} = Σ_{i=0}^{2L} W_{i}^{(m)} h(χ_{i}),
  P^{(yy)} = Σ_{i=0}^{2L} W_{i}^{(c)} [h(χ_{i}) - y_{t|t-1}] [h(χ_{i}) - y_{t|t-1}]^{T},

with constant weights W_{i}^{(m)} and W_{i}^{(c)}. Details are given in [6].
The UKF proceeds as follows. At first we initialize the state x = x_{0} and the state covariance P^{(xx)} = P^{(xx)}_{0}. For the augmented vectors, let

  x^{(a)} = [x^{T}, 0^{T}]^{T},   P^{(a)} = diag(P^{(xx)}, Q),

where Q denotes the process-noise covariance matrix. Details about Q are given in Section IV. For t ∈ {1, 2, …}, we calculate the sigma vectors χ^{(a)}_{t-1|t-1} from x^{(a)}_{t-1|t-1} and P^{(a)}_{t-1|t-1} by the UT. The process update is

  x_{t|t-1} = Σ_{i=0}^{2L} W_{i}^{(m)} f(χ^{(a)}_{i}),
  P^{(xx)}_{t|t-1} = Σ_{i=0}^{2L} W_{i}^{(c)} [f(χ^{(a)}_{i}) - x_{t|t-1}] [f(χ^{(a)}_{i}) - x_{t|t-1}]^{T}.

We update the sigma vectors using x_{t|t-1} and P^{(xx)}_{t|t-1}, and update the measurement covariance matrix as

  P^{(yy)} = Σ_{i=0}^{2L} W_{i}^{(c)} [h(χ_{i}) - y_{t|t-1}] [h(χ_{i}) - y_{t|t-1}]^{T} + R,

where R is the assumed measurement-noise covariance, depending on the selected observation model; details are given in Section IV. Altogether, the UKF update is defined by

  K = P^{(xy)} (P^{(yy)})^{-1},
  x_{t|t} = x_{t|t-1} + K (y_{t} - y_{t|t-1}),
  P^{(xx)}_{t|t} = P^{(xx)}_{t|t-1} - K P^{(yy)} K^{T},

where P^{(xy)} = Σ_{i=0}^{2L} W_{i}^{(c)} [χ_{i} - x_{t|t-1}] [h(χ_{i}) - y_{t|t-1}]^{T} is the cross-covariance and K is the Kalman gain.
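To make the recursion concrete, the following is a minimal sketch of a single UKF step reduced to a scalar state (L = 1), so the matrix square root becomes a plain square root and state augmentation is replaced by additive process noise (+q). The functions f and h and the variances q and r are illustrative stand-ins, not the models of Section IV.

```python
import math

def sigma_points(x, p, lam=1.0):
    """Return the 2L+1 = 3 sigma points for a scalar state and their weights."""
    s = math.sqrt((1.0 + lam) * p)           # sqrt((L + lambda) * P), L = 1
    pts = [x, x + s, x - s]
    w0 = lam / (1.0 + lam)
    wi = 1.0 / (2.0 * (1.0 + lam))
    return pts, [w0, wi, wi]                 # weights sum to 1

def ukf_step(x, p, z, f, h, q, r):
    """One predict/update cycle; z is the new measurement."""
    # --- process update: propagate sigma points through f ---
    pts, w = sigma_points(x, p)
    fx = [f(s) for s in pts]
    x_pred = sum(wi * s for wi, s in zip(w, fx))
    p_pred = sum(wi * (s - x_pred) ** 2 for wi, s in zip(w, fx)) + q
    # --- measurement update: redraw sigma points, pass through h ---
    pts, w = sigma_points(x_pred, p_pred)
    hy = [h(s) for s in pts]
    y_pred = sum(wi * s for wi, s in zip(w, hy))
    p_yy = sum(wi * (s - y_pred) ** 2 for wi, s in zip(w, hy)) + r
    p_xy = sum(wi * (sx - x_pred) * (sy - y_pred)
               for wi, sx, sy in zip(w, pts, hy))
    k = p_xy / p_yy                          # Kalman gain
    x_new = x_pred + k * (z - y_pred)
    p_new = p_pred - k * p_yy * k
    return x_new, p_new
```

With linear f and h this reduces to an ordinary Kalman filter step, which makes the sketch easy to sanity-check by hand.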
IV. MULTIPLE OBJECT TRACKING
Following tracking-by-detection methods, which are popular for solving multiple-object tracking tasks, a detector is applied in each frame to generate object candidates. One UKF is adopted for tracking one object separately; thus a group of detected pedestrians defines a family of UKFs to be processed simultaneously. Each UKF tracks one detected object. The predicted state of a UKF is used for data association; when an observation (of the tracked object) is available in the current frame, we update the predicted state using the corresponding UKF.
 A. Detection
Tracking-by-detection methods rely on evaluating rectangular regions of interest, which we call object boxes if positively identified as containing an object of interest. For pedestrian tracking, we adopt the popular histogram of oriented gradients (HOG) feature method and a support vector machine (SVM) classifier, originally introduced in [19]. HOG features describe the human profile by an oriented-gradient histogram. An SVM classifier is able to handle high-dimensional and nonlinear features (such as HOG features): it projects sample features into a high-dimensional space, and then finds a hyperplane separating the two classes. Instead of using a sliding window, regions of interest (i.e., inputs to the classifier) are selected by analysing calculated stereo information (depth and disparity maps), as proposed in [20].
Fig. 1. The depth map on top uses a colour code for calculated distances; depth values are only shown at pixels where the mode filter accepts the given value. The lower images show detected (coloured) object boxes.
Fig. 1 shows several detection results in a pedestrian sequence; dots (cyan) denote the centres of boxes recognized as pedestrians, and red rectangles denote the final detection results. As can be seen, the object boxes may contain background, be shifted from the object, or miss pedestrians.
For the detection of Drosophila larvae (an example of 2D movement), thresholding and connected components are adopted to obtain one object box for each larva. Several larva detection results are shown in Fig. 2. As the scene is controlled, the detection results are more reliable than for the pedestrian sequence. However, no depth information is available here.
Fig. 2. Larvae detection results shown by (cyan) object boxes.
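The threshold-plus-connected-components step for larva detection can be sketched as follows. The flood fill, the 4-neighbourhood, the threshold value, and the assumption of dark larvae on a bright background are illustrative choices, not the paper's exact parameters.

```python
def detect_boxes(img, thresh=100):
    """img: 2D list of grey values (dark blobs = larvae).
    Returns one bounding box (r0, c0, r1, c1) per connected component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] < thresh and not seen[r][c]:
                # flood-fill one component, tracking its bounding box
                stack, r0, c0, r1, c1 = [(r, c)], r, c, r, c
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    r0, c0 = min(r0, y), min(c0, x)
                    r1, c1 = max(r1, y), max(c1, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and \
                           img[ny][nx] < thresh and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((r0, c0, r1, c1))
    return boxes
```

Each returned box then plays the role of one detection candidate for data association.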
 B. UKF-based Object Tracking
As there is an unknown number of objects in a scene, the state dimensionality would expand significantly if we decided to track all pedestrians in one UKF; in that case, the speed of tracking drops dramatically when the scene is crowded with many detected objects. Thus, we decided on one UKF per detected object.
Choosing a proper model is important. In this subsection we offer three models for possible selection: 3DVT means that 3D position (world coordinates) with velocity is observed, 3DT means that 3D position without velocity is observed, and 2DT means that 2D position (image coordinates) without velocity is observed. These models are compared in Section V.
 1) The Two 3D Models
In the 3DVT model, the object is tracked in 3D world coordinates. Its 3D position (x, y, z) is the first part of the state. We also include the velocity (v_{x}, v_{y}, v_{z}). Thus, a state x = (x, y, z, v_{x}, v_{y}, v_{z})^{T} is 6-dimensional.

a) Process model: We assume constant velocity between adjacent frames, with Gaussian-distributed acceleration noise n_{a} ~ N(0, Σ_{na}). The diagonal elements in Σ_{na} are set to be equal and are denoted by σ^{2}_{na}. Thus,

  p_{t} = p_{t-1} + v_{t-1} Δt + (1/2) n_{a} Δt^{2},
  v_{t} = v_{t-1} + n_{a} Δt,

where p = (x, y, z)^{T}, v = (v_{x}, v_{y}, v_{z})^{T}, and Δt is the time interval between subsequent frames.
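The constant-velocity prediction can be sketched as a small function; dt stands for the frame interval Δt, and the accel argument stands for one sample of the acceleration noise n_a (defaulting to zero, i.e., the noise-free prediction). Names are illustrative.

```python
def process_f(state, dt, accel=(0.0, 0.0, 0.0)):
    """Propagate the 6D state (x, y, z, vx, vy, vz) by one frame:
    position gains v*dt + 0.5*a*dt^2, velocity gains a*dt."""
    x, y, z, vx, vy, vz = state
    ax, ay, az = accel          # a sample of the process noise n_a
    return (x + vx * dt + 0.5 * ax * dt * dt,
            y + vy * dt + 0.5 * ay * dt * dt,
            z + vz * dt + 0.5 * az * dt * dt,
            vx + ax * dt,
            vy + ay * dt,
            vz + az * dt)
```

In the UKF this function plays the role of f, applied to each sigma vector; the acceleration components come from the augmented part of the state.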

b) Observation model: An observation consists of the position (i_{0}, j_{0}) (say, the centroid of the detected object box in the left image), the disparity d of the detected object, and the velocity (v_{ox}, v_{oy}, v_{oz}) in 3D coordinates. The usual pinhole camera projection model is used to map 3D points onto the image plane,

  i_{0} = f·x/z,   j_{0} = f·y/z,   d = f·b/z,

where f denotes the focal length, and b denotes the length of the baseline between the two rectified stereo cameras. In this case, R is a diagonal matrix with entries σ^{2}_{nmp} for the position components and σ^{2}_{nmv} for the velocity components. For the disparity d we select the mode of the disparity map in a fixed (e.g., 20 × 20) neighbourhood around the centroid (i_{0}, j_{0}) of the detected object box. The 3D scene flow (v_{ox}, v_{oy}, v_{oz}) can be obtained by combining optic flow and stereo information [21].
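The pinhole projection used in the observation model can be sketched as a small function. The focal length f (in pixels) and the baseline b are parameters in the sense above; principal-point offsets are omitted for brevity, which is an assumption of this sketch.

```python
def observe(x, y, z, f, b):
    """Project a 3D point (x, y, z), z > 0, in the left camera frame to
    image position (i0, j0) and rectified stereo disparity d."""
    i0 = f * x / z          # horizontal image coordinate
    j0 = f * y / z          # vertical image coordinate
    d = f * b / z           # disparity between rectified left/right views
    return i0, j0, d
```

In the UKF, this function is (part of) h, applied to the position components of each sigma vector.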
As it is difficult to obtain the high-quality scene flow required for 3DVT, 3DT simplifies the 3DVT model by excluding the scene flow from the observation; it has the same process model as 3DVT. In this case, R is a diagonal matrix with entries σ^{2}_{nmp} only.
 2) The 2D Model
If only monocular recording is available, the object is tracked in the 2D image plane only. The state x = (i, j, v_{i}, v_{j})^{T} consists of position (i, j) and velocity (v_{i}, v_{j}).
a) Process model: The process model is the same as for the 3D models: we assume constant velocity between subsequent frames, with Gaussian-distributed acceleration noise n_{a}.
b) Observation model: An observation consists of the central position (i_{0}, j_{0}) of an object box only, with i_{0} = i and j_{0} = j, resulting in R = diag(σ^{2}_{nmp}, σ^{2}_{nmp}) for this case.
 C. Data Association
As each object is tracked independently, data association, i.e., matching candidates to existing trajectories, becomes important. If no match is found, we initialize a new tracker.
Since object movements are continuous, the estimated velocity in the UKF can be used as a cue to localize the search area for finding a matching object. For each trajectory, the possible location (i.e., (x_{p}, y_{p}, z_{p}) in 3D, and (i_{p}, j_{p}) in 2D) of the object in the current frame is predicted by the process model used in the UKF. This location is used as a reference for searching potentially matching candidates in the current frame. Currently we simply match candidates based on the shortest Euclidean distance and a given threshold T.
One candidate might be matched with several trajectories if the Euclidean distance is below T. Trajectories compete for the candidates, and in the end the closest one wins. If a candidate is not matched to any trajectory, a new tracker is initialized. If a trajectory does not win any candidate, the tracker is propagated with the given prediction, and the new state is the predicted state, without being updated by an observation (because none is available).
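The competition of trajectories for candidates described above can be sketched as a greedy nearest-neighbour assignment: all pairs within threshold T are sorted by distance, and the shortest distances win first. The function name and the tuple-based representation of predictions and candidates are illustrative.

```python
import math

def associate(predictions, candidates, T):
    """predictions, candidates: lists of coordinate tuples (2D or 3D).
    Returns {trajectory index: candidate index}; closest pairs win first."""
    pairs = []
    for ti, p in enumerate(predictions):
        for ci, c in enumerate(candidates):
            d = math.dist(p, c)
            if d < T:                     # only pairs below the threshold
                pairs.append((d, ti, ci))
    pairs.sort()                          # shortest distances first
    matches, used_t, used_c = {}, set(), set()
    for d, ti, ci in pairs:
        if ti not in used_t and ci not in used_c:
            matches[ti] = ci
            used_t.add(ti)
            used_c.add(ci)
    return matches
```

Unmatched candidates then start new trackers, and unmatched trajectories are propagated by their predictions alone.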
No object-appearance description is used here for assigning an object to a trajectory. In general, the inclusion of an appearance representation (e.g., a colour histogram, or an instance-specific shape model) improves performance. However, this is beyond the scope of this paper.
V. EXPERIMENTS
In this section, first, our three models (3DVT, 3DT, and 2DT) are tested in a simulated environment with different parameter sets. Second, our multiple-object tracking method is tested on real video sequences in which (3D example) pedestrians are walking in inner-city scenes, or (2D example) larvae are moving on a flat culture dish.
 A. Simulated Tracking
The three models defined in Section IV are tested in a simulation environment implemented in OpenGL (SGI, Fremont, CA, USA). A cube is moving on a circular path around a 3D point with constant speed, as shown in Figs. 3 and 4. Acceleration noise n_{a} with different variances (e.g., σ^{2}_{na} = 0.0001, 0.01, 1), and measurement noise n_{m} with different variances (e.g., σ^{2}_{nmp} = 10, 50, 100; σ^{2}_{nmv} = 50, 100, 150), are used to test and compare the three models’ performance. The simulation environment differs for the 2D and 3D models: for 2D, positions are integral pixel coordinates in the image plane, while for 3D, position coordinates are reals. The radius of the circle is 10 for the 3D models, and 50 for the 2D model. In both environments, measurements are degraded by noise before being sent to the UKF.
Fig. 3 demonstrates the effect of σ^{2}_{na}, with fixed σ^{2}_{nmp} and σ^{2}_{nmv}. Experiments show that larger σ^{2}_{na} values result in more unstable trajectories. A large σ^{2}_{na} means that the process model produces a predicted state that fluctuates with large magnitudes. Results show that σ^{2}_{na} = 0.0001 is a reasonable choice for the 3D models. For the 2D model, a smaller σ^{2}_{na} yields smooth estimates, but the shifts are significant.
Fig. 3. Simulation results for variations in the variance of the acceleration noise. From left to right, σ^{2}_{na} = 0.0001, 0.01, or 1, with fixed values σ^{2}_{nmp} = 50 and σ^{2}_{nmv} = 100. From top to bottom, the tracking models are 3DVT, 3DT, and 2DT, respectively.
A larger σ^{2}_{na} value produces estimates closer to the true positions, but with significant fluctuations, as shown by an experiment with σ^{2}_{na} = 1 for the 2D case. In general, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show similar performance. The 3D models use stereo information rather than just a single image as in the 2D model, which also shows that stereo information helps to improve tracking performance. As the measured 3D position is noisy, the measured velocity is even noisier; this appears to be the main reason why the inclusion of velocity does not improve performance.
Fig. 4 shows results for our models for different variances σ^{2}_{nmp} and σ^{2}_{nmv} of the measurement noise. Significantly increasing the measurement noise (i.e., higher uncertainty of observations) reduces the performance only slightly. This demonstrates that, to some degree, the UKF is a robust tracker that is not vulnerable to detection uncertainties. As before, 3DT and 3DVT converge better than 2DT, while 3DT and 3DVT show similar performance.
Fig. 4. Simulation results for variable variance of the measurement noise. From left to right, σ^{2}_{nmp} = 10, 50, or 100 and σ^{2}_{nmv} = 50, 100, or 150, respectively, with fixed σ^{2}_{na} = 0.0001 for the 3D models and σ^{2}_{na} = 1 for the 2D model. From top to bottom, the tracking models are 3DVT, 3DT, and 2DT, respectively.
 B. Multiple Object Tracking in Real Data
In this section we report on the performance of UKF-supported tracking for multiple larvae using the 2DT model, and for multiple pedestrians in traffic scenes using the 3DT model. The larvae and pedestrian sequences are recorded at 30 and 15 frames per second, respectively.
Results for larvae tracking are shown in Fig. 5. As the velocity in the model is initialized to (0, 0), the UKF estimate is “slower” than the real speed of the larvae in the first 30 frames. The speed of convergence can be improved by increasing σ^{2}_{na}, but it should be noted that the larger the σ^{2}_{na} value, the larger the magnitude of fluctuation. The estimated trajectories follow the moving larvae effectively, mainly because all of the larvae are properly detected in all frames. However, such complete detection cannot be expected for pedestrian sequences. Next, we test the UKF on the “noisy” detection results of the pedestrian sequences.
The results for pedestrian tracking are shown in Fig. 6. Objects are missed or shifted from time to time due to the cluttered background (e.g., the car in the traffic scene detected as a pedestrian), illumination variations leaving some pedestrians undetected, or internal variations between objects (i.e., unstable detections). Our experiments verified that UKF predictions are able to follow irregularly moving pedestrians when detection fails for a few frames, and can even correct unstable detections.
Fig. 5. 2D tracking results for a larva sequence. From top to bottom: tracking results in Frames 26, 46, and 166 of one sequence. The red lines show the detected track, the white lines show the track predicted by the unscented Kalman filter, and the blue lines represent estimated trajectories. The left column is the original intensity image overlaid with the estimated trajectories.
The second frame in Fig. 6 shows that the undetected pedestrian is predicted correctly (white object box) and successfully matched to a detected position in the third frame. The last frame in Fig. 6 demonstrates that displaced detections are corrected by the UKF. Using only the defined distance rule for data assignment appears to be insufficient, especially for the given detection results: a small threshold may lead to a missed match (i.e., the detection fails to satisfy the rule), and a large threshold may lead to an identity switch (i.e., a pedestrian is matched to another pedestrian).
VI. CONCLUSIONS
Assigning one UKF to each detected (moving) object simplifies the design and implementation of UKF-based prediction of 2D or 3D motion. Experiments demonstrate the robustness of the chosen approach. The tracker only generates short-term tracks when detection is not reliable; long-term tracking should be possible by also introducing dynamic programming.
Fig. 6. (a) Object boxes (red), unscented Kalman filter (UKF) predictions (white), and UKF estimates (blue) overlaid on the original intensity image. The black box is an invalid detection excluded by a disparity check. (b) Coloured blocks denote the object boxes (red), UKF predictions (white), and UKF estimates (blue) in the XZ-plane in real-world coordinates.
For evaluating performance in real-world (either 2D or 3D) applications, more extensive tests need to be undertaken, especially regarding the design and evaluation of quantitative performance measures. For example, the measures discussed in [22] for evaluating visual odometry techniques might also be relevant for the tracking case.
Acknowledgements
The authors thank Simon Hermann for providing his implementation of semi-global matching for stereo analysis, Gabriel Hartmann (both Auckland) for support regarding the unscented Kalman filter, and Benjamin Risse (Münster) for video data with moving Drosophila larvae.
References
[1] Wu B., Nevatia R. (2007) “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors,” International Journal of Computer Vision 75(2), 247-266.
[2] Breitenstein M. D., Reichlin F., Leibe B., Koller-Meier E., van Gool L. (2011) “Online multiperson tracking-by-detection from a single, uncalibrated camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence 33(9), 1820-1833.
[3] Ess A., Schindler K., Leibe B., van Gool L. (2010) “Object detection and tracking for autonomous navigation in dynamic environments,” International Journal of Robotics Research 29(14), 1707-1725.
[4] Leibe B., Schindler K., Cornelis N., van Gool L. (2008) “Coupled object detection and tracking from static cameras and moving vehicles,” IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10), 1683-1698.
[5] Meuter M., Iurgel U., Park S. B., Kummert A. (2008) “The unscented Kalman filter for pedestrian tracking from a moving host,” Proceedings of IEEE Intelligent Vehicles Symposium, Eindhoven, Netherlands, 37-42.
[6] Wan E. A., van der Merwe R. (2000) “The unscented Kalman filter for nonlinear estimation,” Proceedings of IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium, Lake Louise, AB, 153-158.
[7] Yilmaz A., Javed O., Shah M. (2006) “Object tracking: a survey,” ACM Computing Surveys 38(4), article no. 13.
[8] Ess A., Leibe B., Schindler K., van Gool L. (2009) “Robust multiperson tracking from a mobile platform,” IEEE Transactions on Pattern Analysis and Machine Intelligence 31(10), 1831-1846.
[9] Leven W. F., Lanterman A. D. (2009) “Unscented Kalman filters for multiple target tracking with symmetric measurement equations,” IEEE Transactions on Automatic Control 54(2), 370-375.
[10] Mitzel D., Horbert E., Ess A., Leibe B. (2010) “Multi-person tracking with sparse detection and continuous segmentation,” Computer Vision - ECCV 2010, Lecture Notes in Computer Science 6311, 397-410.
[11] Mitzel D., Sudowe P., Leibe B. (2011) “Real-time multi-person tracking with time-constrained detection,” Proceedings of the British Machine Vision Conference, Dundee, UK, 104.1-104.11.
[12] Mitzel D., Leibe B. (2011) “Real-time multi-person tracking with detector assisted structure propagation,” Proceedings of Computational Methods for the Innovative Design of Electrical Devices, Barcelona, Spain, 974-981.
[13] Shaikh M. M., Wook B., Lee C., Kim T., Lee T., Kim K., Cho D. (2011) “Mobile robot vision tracking system using unscented Kalman filter,” Proceedings of IEEE/SICE International Symposium on System Integration, Kyoto, Japan, 1214-1219.
[14] Welch G., Bishop G. (1995) “An introduction to the Kalman filter,” TR 95-041, University of North Carolina at Chapel Hill, Chapel Hill, NC.
[15] Huang S., Dissanayake G. (2006) “Convergence analysis for extended Kalman filter based SLAM,” Proceedings of IEEE International Conference on Robotics and Automation, Orlando, FL, 412-417.
[16] Yazdi H. S., Hosseini S. E. (2008) “Pedestrian tracking using single camera with new extended Kalman filter,” International Journal of Intelligent Computing and Cybernetics 1(3), 379-397.
[17] Cai Y., de Freitas N., Little J. J. (2006) “Robust visual tracking for multiple targets,” Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 107-118.
[18] Hartmann G. (2012) “Unscented Kalman filter sensor fusion for monocular camera localization,” MSc thesis, University of Auckland, New Zealand.
[19] Dalal N., Triggs B. (2005) “Histograms of oriented gradients for human detection,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, 886-893.
[20] Gavrila D. M., Munder S. (2007) “Multi-cue pedestrian detection and tracking from a moving vehicle,” International Journal of Computer Vision 73(1), 41-59.
[21] Wedel A., Brox T., Vaudrey T., Rabe C., Franke U., Cremers D. (2011) “Stereoscopic scene flow computation for 3D motion understanding,” International Journal of Computer Vision 95(1), 29-51.
[22] Jiang R., Klette R., Wang S. (2011) “Statistical modeling of long-range drift in visual odometry,” Computer Vision - ACCV 2010 Workshops, Lecture Notes in Computer Science 6469, 214-224.