In this study, we present a denoising algorithm for high-frame-rate videos in an ultra-low illumination environment based on a Kalman filtering model and a new motion segmentation scheme. The Kalman filter removes temporal noise from signals by propagating error covariance statistics. Motion, regarded as the process noise of imaging, is important in Kalman filtering. We propose a new motion estimation scheme that is robust to serious noise. This scheme exploits the small motion vector characteristic of high-frame-rate videos. Small changing patches are intentionally neglected because distinguishing details from large-scale noise is difficult and unimportant. Finally, a spatial bilateral filter is used to improve denoising capability in the motion area. Experiments are performed on videos with both synthetic and real noise. Results show that the proposed algorithm outperforms other state-of-the-art methods in both peak signal-to-noise ratio (PSNR) objective evaluation and visual quality.
1. Introduction
A camera unavoidably suffers from noise in low illumination. A corrupted signal degrades perceptual quality and the efficiency of subsequent processing tasks such as video compression and pattern recognition. Thus, denoising is an important issue in image and video processing.
A common technique to enhance image quality in low-illumination photography is to prolong the exposure time. However, motion blurring occurs in images with a long exposure time, particularly when moving objects are present in the scene, such as during the recording of sports competitions or vehicular collision experiments. Blurring and noise are the key challenges in ensuring good video quality in low illumination. Denoising is required when the exposure time is short, whereas deblurring is needed when the exposure time is long. Deblurring is more complex and time-consuming for computer vision than denoising, so the latter can be regarded as the preferable option.
A short exposure time produces high-frame-rate videos with crisp and fluid imagery. Judder is also absent, unlike in a normal-frame-rate video with a short exposure time. Fig. 1 shows the comparison between normal-frame-rate and high-frame-rate videos. In this figure, two high-frame-rate cameras were used to simultaneously record a waving hand. The normal video frame rate is approximately 25 fps. However, such a rate is still unable to provide clear images of fast-moving objects. Depending on the velocity of motion, a high frame rate, such as 100 fps, 250 fps, or even higher, is necessary.
Comparison between the 25 fps video and 250 fps video. (a) The 25 fps video with a blurred waving hand. (b) The 250 fps video with a clear waving hand.
Numerous studies on video denoising have been conducted in recent decades and are currently available as references. To a certain extent, denoising methods, such as those presented in [1–7], effectively work in low illumination. We take VBM3D [2] as an example. Fig. 2 shows the denoising result of VBM3D for additive white Gaussian noise (AWGN) with standard deviation σ_{n} = 30, which represents light noise in low illumination. The result looks excellent compared with the ground truth frame (Fig. 1(b)), except for some details, such as the text in the top left corner.
Denoising performance of VBM3D for AWGN with standard deviation σ_{n} = 30. (a) Noisy frame whose ground truth is Fig. 1(b) (PSNR = 18.59 dB). (b) Denoising result of VBM3D (PSNR = 33.57 dB).
However, the performance dramatically degrades in ultra-low illumination. Fig. 3 shows the denoising result of VBM3D for a video captured in a real noisy environment in ultra-low illumination. The video was captured by a high-frame-rate sensor, i.e., a Viimagic 9222B. An image in ultra-low illumination has minimal chrominance components. Hence, only the luminance part is taken. Fig. 3 depicts that VBM3D removes noise to a certain degree, but can visual quality still be improved? This task is difficult in ultra-low illumination because the useful signal is nearly submerged in noise.
Denoising performance of VBM3D in ultra-low illumination. (a) Noisy frame. (b) Denoising result of VBM3D.
In this study, we address the video denoising problem in ultra-low illumination. The target video has a stationary background. This type of video has extensive applications, such as in ubiquitous surveillance cameras.
Our work is based on the Kalman filtering framework. The Kalman filter was proposed by Rudolph E. Kalman [8] in 1960. This set of mathematical equations provides an efficient recursive means to estimate the state of a process such that the mean squared error is minimized. The Kalman filter is widely used in denoising and other video processing tasks [9–12]. For video denoising, this algorithm removes noise from a signal by propagating error covariance statistics. In this study, motion is modeled as imaging process noise. Estimating motion determines the denoising capability of the Kalman filter. We employ two characteristics of noisy high-frame-rate videos to obtain reliable motion estimation. The first characteristic is the small motion vector: the motion area between two frames with a small motion vector is contained within the area associated with a large motion vector, whereas flickering noise does not exhibit this feature. The second characteristic is the loss of small objects and details in large-scale noise in ultra-low illumination. The movement of such minute objects and details is difficult to detect. Therefore, we sacrifice them to improve the estimation of the major parts, which is valuable for overall performance.
The remainder of this paper is organized as follows. Section 2 discusses related works on video denoising. Section 3 presents our Kalman filtering framework. Section 4 proposes a motion estimation scheme for ultralow illumination. Section 5 discusses the experiments. Finally, Section 6 summarizes the study.
2. Related Work
Video denoising can exploit redundant information from nearby frames. Thus, a better denoising capability than single-image processing can be expected. Determining how to deal with the temporal relationship of frames is the key to video denoising.
In recent years, many algorithms have used 2D similar-patch clusters to implicitly estimate motion information [2, 4, 13–15]. Similar patches are matched across several frames in spatiotemporal space. In [13–15], weighted averaging on selected patches was conducted after patch matching to denoise the reference patch. In [4], patch matching was followed by an adaptive threshold approach, i.e., SURE-LET. In VBM3D [2], a two-step Wiener filtering framework was used to handle similar-patch clusters.
Meanwhile, some researchers [3, 6, 16, 17] found that 3D patches are more appropriate for video denoising than 2D patches, because the former can better characterize motion-related temporal dependency. In [6] and [16], 3D patches were used as atoms of a sparse dictionary. In [17], a Bayesian framework was proposed to process 3D patch clusters. The basic blocks of VBM4D [3], which is an extension of VBM3D [2], are spatiotemporal 3D volumes that form a 4D group.
Some researchers directly conducted 3D transformations on video data without using 3D patches [5, 18–21]. In [18, 19], two 3D complex wavelet transforms were proposed. In [5, 20], the 2D discrete shearlet transform [21] was extended to a 3D version.
Motion information is implicitly represented regardless of whether 2D patches, 3D patches, or 3D domain transforms are used for video denoising. However, many researchers have also sought to explicitly estimate motion [7, 22–26]. In [22], a block-based multiple-hypothesis motion estimation method was proposed. In [23] and [24], an optical flow method was used to estimate motion. In [25], a hierarchical motion estimation method was discussed; its basic idea is to track matching blocks and filter along the motion trajectory. Although 2D patches were used in [22] and [25], motion information was explicitly obtained. In [26], motion estimation was performed in the wavelet transform domain. In [7], a cross-correlation algorithm in the Fourier domain that is robust to noise, called spatiotemporal Gaussian scale mixture (STGSM), was proposed for motion estimation. Similarly, we employ the explicit representation approach to estimate motion in the current work.
All of the aforementioned algorithms can be directly applied to high-frame-rate videos. In this study, however, we exploit several new characteristics to improve denoising performance. The details are discussed in Section 4.
3. The Kalman Filtering Framework for Video Denoising
The discrete instant of time is denoted as k. The system state of the previous time step is x_{k−1}. The optional control input is u_{k}. The system process noise is w_{k}. A is the state transition model that operates on the state x_{k−1}, and B is the control input model that operates on u_{k}. The predicted state of the current time step is

x_{k} = A x_{k−1} + B u_{k} + w_{k}.

The actual measurement is

z_{k} = H x_{k} + v_{k},

where H is the observation model that maps the true state space into the observed space, x_{k} is the state of the current time step, and v_{k} is the observation noise.
In video imaging systems, a camera directly records the input light rays. Thus, the state transition model is A = I, and the mapping observation model is H = I, where I is the identity matrix. No control input is available during video capture; hence, u_{k} = 0. Consequently, the predicted state of the video imaging system is x_{k} = x_{k−1} + w_{k}, and the actual measurement is z_{k} = x_{k} + v_{k}.
The working process noise w and the observation noise v are independent Gaussian random processes. We model w as caused by motion and v as caused by camera noise. Both noises are assumed to have zero-mean Gaussian distributions with covariances Q and R, i.e., w ~ N(0, Q) and v ~ N(0, R).
The Kalman filter works in two steps: the priori state estimation and the updated posteriori state estimation. For the video denoising system, the former is given by

x̂_{k}^{−} = x̂_{k−1}. (1)

This equation indicates that if process noise is absent (i.e., Q = 0), then the predicted state is equal to the system state of the previous time step. The latter updated state is expressed as follows:

x̂_{k} = x̂_{k}^{−} + K_{k}(z_{k} − x̂_{k}^{−}), (2)

where K_{k} is the optimal Kalman gain that minimizes the posteriori error covariance. This gain is computed as follows:

K_{k} = P_{k}^{−}(P_{k}^{−} + R)^{−1}, (3)

where P_{k}^{−} is the predicted priori estimate covariance. This covariance is computed as follows:

P_{k}^{−} = P_{k−1} + Q, (4)

where P_{k−1} is the posteriori estimate covariance of the previous time step. The updated posteriori estimate covariance of the current time step is expressed as follows:

P_{k} = (I − K_{k})P_{k}^{−}. (5)
If the noise covariance of a single image pixel is σ_{n}², then R can be written as R = σ_{n}² I for the 2D image matrix. The covariance Q of the imaging process noise w is estimated as Q_{k} = d_{k}², where d_{k} is the motion-caused deviation between the current frame k and the last frame k − 1. The computation of this deviation is further discussed in Section 4.
Regarding initialization, the measurement of the first noisy frame z_{1} functions as the posteriori state estimate, i.e., x̂_{1} = z_{1}. Kalman filtering starts at the second frame. The posteriori estimate covariance of the first frame is set as P_{1} = R.
Moreover, a large motion estimate Q results in a large Kalman gain K_{k} in the motion area, according to Eqs. (3) and (4). In extreme cases, such as when K_{k} = I, the updated posteriori state becomes x̂_{k} = z_{k} based on Eq. (2). Thus, spatial denoising is required in the motion area to improve denoising performance. In our work, the classical edge-preserving filter, i.e., the bilateral filter [27], is employed. It can be calculated as

x̃_{k}(i) = Σ_{s∈N(i)} G_{s}(‖s − i‖) G_{I}(‖z_{s} − z_{i}‖) z_{s} / Σ_{s∈N(i)} G_{s}(‖s − i‖) G_{I}(‖z_{s} − z_{i}‖), (6)
where x̃_{k}(i) is the spatial bilateral denoising result of frame k at spatial coordinate i, and z_{s}, z_{i} are the pixel values of the noisy frame z_{k} at positions s and i. N(i) is a (2r + 1) × (2r + 1) block centered at i, and s is a coordinate within the block N(i), i.e., s_{u} = [i_{u} − r, i_{u} + r] in the horizontal direction u and s_{v} = [i_{v} − r, i_{v} + r] in the vertical direction v. Two Gaussian kernels are used for the bilateral filter. The first, the spatial distance kernel G_{s}(·) with standard deviation σ_{s}, is the same as in the Gaussian filter. The second, the pixel intensity difference kernel G_{I}(·) with standard deviation σ_{I}, is used to preserve edges. A large intensity difference ‖z_{s} − z_{i}‖ results in a small weight G_{I}. Thus, pixels on different sides of an edge can be distinguished.
The spatial denoising result is denoted as x̃_{k}. The Kalman temporal denoising result x̂_{k} and the bilateral spatial denoising result x̃_{k} are mixed by weighted averaging. As the Kalman gain K_{k} approaches zero, the reliability of the actual measurement z_{k} decreases, whereas trust in the predicted estimate x̂_{k}^{−} increases. Accordingly, we use the Kalman gain K_{k}, which reflects the degree of motion, as the weight. Thus, the final denoising result is

x̄_{k} = K_{k} x̃_{k} + (I − K_{k}) x̂_{k}. (7)

For the predicted priori state estimate, i.e., Eq. (1), the equation is accordingly revised as x̂_{k}^{−} = x̄_{k−1}.
Fig. 4 shows the Kalman filtering performance without motion estimation (Q = 0). The video is the same one as in Fig. 3. A total of 350 frames are used. The noise standard deviation σ_{n} is set to 100. In the figure, the still background is clearer than that of VBM3D (Fig. 3(b)). Our Kalman filtering framework is suitable for videos with a fixed background because it employs all frames by propagating the estimate error covariance P_{k}; the estimate of the original signal improves as the number of frames increases. By contrast, only a few adjacent frames are employed in VBM3D. Thus, a good estimate of the background is obtained.
Denoising performance of the Kalman filtering framework without motion estimation
However, blurring occurs without motion estimation. Nearly no moving objects can be observed in Fig. 4, but in fact, a moving man at the left of the visual field can be seen in Fig. 3(b). This blurring phenomenon indicates the importance of motion estimation.
4. Motion Estimation for High-frame-rate Videos in Ultra-low Illumination
Our motion estimation method is based on the frame difference. In ultra-low illumination, the noise is extremely large, such that the direct frame difference cannot provide a good estimate. This problem significantly influences motion estimation [Figs. 5(e) and 6(b)]. A Gaussian filter is employed to preprocess a noisy frame. In Figs. 5(f) and 6(c), a large Gaussian kernel with a 20×20 window and a standard deviation σ_{G} = 5 is used to suppress noise.
Difference between two successive frames (k – 1 and k). (a) Noise-free frames. (b) Noisy frames of (a) polluted by AWGN with standard deviation σ_{n} = 50. (c) Gaussian-prefiltered frames of (b) with a 20×20 Gaussian kernel (σ_{G} = 5). (d) Frame difference of (a), which functions as the ground truth. (e) Frame difference of (b). (f) Frame difference of (c).
Difference between two spaced frames (frames k – 5 and k). (a) Frame difference of noise-free frames. (b) Frame difference of noisy frames. (c) Frame difference of Gaussian-prefiltered frames.
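The Gaussian-prefiltered difference can be sketched with a separable kernel (a simple NumPy sketch; the helper names are ours, and the 21-tap kernel approximates the 20×20 window with σ_G = 5 used above):

```python
import numpy as np

def gaussian_blur(z, sigma=5.0, radius=10):
    """Separable Gaussian prefilter with a (2*radius+1)-tap kernel."""
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()                          # normalize to unit gain
    pad = np.pad(z.astype(float), radius, mode='edge')
    # Convolve rows, then columns, with the 1-D kernel.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, g, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode='valid'), 0, tmp)

def prefiltered_difference(z_a, z_b, sigma=5.0):
    """|G(z_a) - G(z_b)|: per-pixel noise is averaged away by the large
    kernel, so motion-caused change dominates the difference map."""
    return np.abs(gaussian_blur(z_a, sigma) - gaussian_blur(z_b, sigma))
```

An isolated noise spike of amplitude 100 survives only as a sub-unit bump in the difference map, whereas a moving structure covering many pixels remains visible.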
Although the influence of noise is decreased to a considerable extent by the large Gaussian kernel, several undesired patches still affect reliable motion estimation. We further improve our method by applying the small motion vector characteristic and intentionally sacrificing small changing patches.
Before proceeding, the definition of the small motion vector is given. Suppose the length of an object along the motion direction is L, and its movement distance between two successive frames is M. Both variables are measured in pixels. Let M = αL. We define a video to have the small motion vector characteristic when α ≤ 0.5, i.e., the inter-frame motion distance is less than half of the object size; this is a rough definition. In reality, several moving objects with different lengths and velocities may simultaneously exist in a scene. In such a case, a user often concentrates on only one moving object within a period of time. We call the movement of this object of most concern the main movement. To simplify the problem, we consider only the main movement for evaluation. Under this condition, the selection of the main movement is subjective. For example, suppose a slowly walking man and a rushing car appear in the same scene. The movement of the man satisfies α ≤ 0.5, whereas the movement of the car does not. If the man is highlighted, the video can be considered to have the small motion vector characteristic. If the car is selected, the video is considered not to have this characteristic. When α ≤ 0.1, we can assume that the small motion vector characteristic is obvious.
4.1 Small motion vector of high-frame-rate videos
Motion estimation is more difficult for a small motion vector than for a large motion vector under large-scale noise, as shown in Fig. 5. The video in the figure recorded a moving man at 250 fps. Nearly no motion information can be obtained from the difference of the noisy frames [Fig. 5(e)]. Even the Gaussian prefilter loses its effect [Fig. 5(f)]. However, a large motion vector is easier to detect than a small one (Fig. 6). Frame k in Fig. 6 is the same as frame k in Fig. 5; the interval in Fig. 6 is five frames. Figs. 5 and 6 also show another feature of high-frame-rate videos: based on the same reference frame, the motion area between two close frames is nearly contained within the motion area between two relatively far frames. This feature results from the small motion vector. Close frames imply a small motion area.
Fig. 7 further illustrates the aforementioned feature with several typical motion modes. The major part of the motion area conforms to this feature. Only some minor parts at the edge of the motion area (in red) violate this rule. When the frame rate is high, this containment rule is reliable. However, flickering noise does not satisfy this rule.
Motion area containment of several typical motion modes. Only the small red regions violate the rule. (a) Translation. (b) Rotation. (c) Scale variation.
4.2 Neglecting small changing patches in ultra-low illumination
Most details and small objects are submerged in noise in ultra-low illumination. From the perspective of frequency-domain analysis, these details and small objects constitute the high-frequency part of the image. Noise also exists in the high-frequency part, and thus these elements overlap with one another. The only object that can easily be recognized in ultra-low illumination is the main structure of the image, which is located in the low-frequency part. From the perspective of principal component analysis, the main structure is the principal component of the image, representing its main feature. When the main structure changes, the change can be easily perceived. When details change, however, the change is difficult to detect because the change caused by noise seriously interferes with the change caused by motion. This also explains why motion estimation is difficult for the small motion vector in Fig. 5.
In this study, we do not detect small motions, and we neglect small changing patches. To provide an extreme example, detecting a fly in ultra-low illumination is senseless. Our objective is to sacrifice the minor part to improve the major part. Thus, the red regions in Fig. 7 that violate the containment rule can also be ignored.
Our motion estimation scheme is based on the two aforementioned assumptions. Suppose that the last denoised frame is x̂_{k−1} and the current frame is z_{k}. The future adjacent frames z_{k+i} (i = 1, 2, ⋯) are used to help in motion segmentation. A large motion area spanning N_{1} frames is initially segmented with the help of N_{2} extra frames. Segmentation relies on the containment rule of small motion vectors. Then, motion estimation for each pair of successive images within the N_{1} frames is performed in the segmented area. Subsequently, large motion area segmentation is conducted between frames k – 1 + N_{1} and k – 1 + 2N_{1}, as shown in Fig. 8.
Extra N_{2} frames help the large motion area segmentation of N_{1} frames.
The motion area is initially segmented as a black-and-white map by hard thresholding as follows:

B_{i}(p) = 1 if |G(z_{k−1+i})(p) − G(x̂_{k−1})(p)| ≥ t, and B_{i}(p) = 0 otherwise, (8)

where i = N_{1}, N_{1} + 1, ⋯, N_{1} + N_{2}; p is the pixel position; G(·) is the Gaussian prefilter operation; and t is the threshold. An intensity difference below t can be ignored. For the intensity range 0 to 255, our threshold t is fixed at 5 because the human eye can hardly discriminate an intensity difference below 5.
The refined motion area between frames k – 1 and k – 1 + N_{1} is obtained with an AND operation among the N_{2} additional motion segmentation maps, according to the containment rule of small motion vectors, as follows:

M = B_{N_1} ∧ B_{N_1+1} ∧ ⋯ ∧ B_{N_1+N_2}, (9)

where ∧ denotes the pixel-wise AND operation.
Several small changing patches still exist after the AND operation. We ignore these remaining patches by setting

M̂(p) = 0 if Area(ConnRegion(M, p)) < S, and M̂(p) = M(p) otherwise,

where the connected regions of M are obtained with the ConnRegion(·) operation. The area of each connected region, measured as the number of pixels it contains, is then calculated. If the area is smaller than S, the region is set to 0 and thereby neglected.
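The thresholding, AND refinement, and small-patch removal steps above can be sketched as follows (our own function names; a breadth-first flood fill stands in for the ConnRegion(·) operation, and all inputs are assumed to be already Gaussian-prefiltered):

```python
import numpy as np
from collections import deque

def motion_segment(ref, frames, t=5, S=100):
    """Eqs. (8)-(9) sketch: threshold each prefiltered difference against
    the reference frame, AND the maps (containment rule), then drop
    connected regions smaller than S pixels.

    ref    : prefiltered reference frame G(x_{k-1})
    frames : prefiltered frames G(z_{k-1+i}), i = N1 ... N1+N2
    """
    maps = [np.abs(f - ref) >= t for f in frames]     # Eq. (8)
    mask = np.logical_and.reduce(maps)                # Eq. (9): pixel-wise AND
    return remove_small_regions(mask, S)

def remove_small_regions(mask, S):
    """Zero out 8-connected regions containing fewer than S pixels."""
    mask = mask.copy()
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                region, q = [(i, j)], deque([(i, j)])
                seen[i, j] = True
                while q:                              # BFS flood fill
                    y, x = q.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
                                region.append((ny, nx))
                if len(region) < S:                   # neglect small patches
                    for y, x in region:
                        mask[y, x] = False
    return mask
```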
For the ConnRegion(·) operation, we choose the classical connected region detection algorithm from [28]. The main process is as follows. The algorithm is based on 8-connectivity. First, the black-and-white binary image, whose pixel values are 0 or 1, is scanned pixel by pixel from left to right and top to bottom. Let x_{(m,n)} ∈ {0, 1} denote the currently processed pixel at position (m, n). The left, top-left, top, and top-right neighbors of the current white pixel (x = 1) are inspected. If all neighbor values are 0, a new label is assigned to the pixel at (m, n) to form a label map L. If one or more neighbor values are nonzero, the least label value among these nonzero neighbors is assigned to L_{(m,n)}, and the other label values of the nonzero neighbors are recorded as equivalent. After the scan, the equivalence relations among labels are resolved according to reflexivity, symmetry, and transitivity, and the equivalent labels are replaced by the least one. For example, if label 1 is equivalent to label 2 and label 2 is equivalent to label 6, then label 1 is equivalent to label 6, and all three labels are set to 1. After this modification, the labels are renumbered from small to large with a natural number index, and the original label map is replaced by the new labels. Pixels that share the same label belong to the same connected region.
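The two-pass labeling just described can be sketched as follows (our own implementation, with a union-find table serving as the equivalence records; not the exact code of [28]):

```python
import numpy as np

def conn_region_label(binary):
    """Two-pass 8-connectivity labeling.

    Pass 1: raster scan; copy the smallest label among the left,
    top-left, top, and top-right neighbors, recording equivalences;
    assign a new label when all four are background.
    Pass 2: resolve equivalences and renumber labels 1..n.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]                           # union-find table; index 0 unused
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    next_label = 1
    for m in range(h):
        for n in range(w):
            if not binary[m, n]:
                continue
            neigh = []
            for dm, dn in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                mm, nn = m + dm, n + dn
                if 0 <= mm < h and 0 <= nn < w and labels[mm, nn]:
                    neigh.append(labels[mm, nn])
            if not neigh:                  # all neighbors background
                parent.append(next_label)
                labels[m, n] = next_label
                next_label += 1
            else:                          # take least label, record equivalences
                least = min(find(a) for a in neigh)
                labels[m, n] = least
                for a in neigh:
                    parent[find(a)] = least
    # Second pass: replace each label by its representative, renumber 1..n.
    reps = {}
    for m in range(h):
        for n in range(w):
            if labels[m, n]:
                r = find(labels[m, n])
                labels[m, n] = reps.setdefault(r, len(reps) + 1)
    return labels
```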
Fig. 9 shows the process of our motion segmentation method. The video in Fig. 9 is the same as in Figs. 5 and 6. The large motion area spans five frames (N_{1} = 5), and another four frames are used for segmentation (N_{2} = 4). N_{1} and N_{2} are selected according to the motion vector. In our work, N_{1} + N_{2} ≤ 1/α needs to be satisfied. In the definition of the small motion vector, α ≤ 0.5; thus, if only α ≤ 0.5 is known, N_{1} + N_{2} ≤ 2 is guaranteed to be safe, and N_{1} and N_{2} should each be at least 1. If N_{1} is too small, detecting the motion area under the small motion vector characteristic of high-frame-rate videos is difficult, as in Fig. 5. If N_{2} is too small, it cannot effectively suppress the influence of noise. We therefore choose N_{1} and N_{2} according to the actual motion vector.
Motion segmentation process
If the velocity is low or the frame rate is high, a small motion vector is obtained, and large values of N_{1} and N_{2} can be used to improve the accuracy of the motion segmentation. The small patch threshold S is set to 100, which corresponds approximately to a 10×10 patch and is in line with the noise scale.
Finally, motion estimation for each pair of successive images within the N_{1} frames is calculated in the segmented area as follows:

Q_{k+j} = M̂ · (G(z_{k+j}) − G(z_{k+j−1}))², (10)

where j = 0, 1, ⋯, N_{1} − 1, the product with the segmentation map M̂ is taken pixel-wise, and for j = 0 the previous frame is the denoised frame x̂_{k−1}.
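Assuming the per-pair estimate takes the form of a masked squared difference of prefiltered frames (our reading of the description; the function and variable names are ours), it can be sketched as:

```python
import numpy as np

def motion_q_maps(prefiltered, mask):
    """Per-pixel process-noise variance Q for each successive pair,
    restricted to the segmented motion area.

    prefiltered : list of N_1 + 1 Gaussian-prefiltered frames
                  G(x_{k-1}), G(z_k), ..., G(z_{k-1+N_1})
    mask        : binary motion segmentation map from Eqs. (8)-(9)
    """
    return [mask * (prefiltered[j + 1] - prefiltered[j]) ** 2
            for j in range(len(prefiltered) - 1)]
```

Outside the segmented area the mask forces Q to zero, so the Kalman filter keeps accumulating the static background there.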
5. Experiments and Analysis
The performance of the proposed algorithm is evaluated in this section. The experiments are divided into two parts. The first part is on synthetic noisy videos, whereas the second part is on real noisy videos captured in ultra-low illumination. The experiments are conducted on the luminance channel of the video because minimal color can be captured in ultra-low illumination. Three state-of-the-art video denoising methods are chosen for comparison: the 2D block matching method VBM3D [2], the 3D domain transformation method 3D shearlets [5], and the explicit motion estimation method STGSM [7]. The toolboxes of these methods can be downloaded from the websites of the authors [29–31]. The objective criterion peak signal-to-noise ratio (PSNR) is employed to provide quantitative evaluation, which is defined as

PSNR = 10 log_{10}(L² / MSE),

where L is the dynamic range of the image. In our experiments, L equals 255 for 8-bit images. MSE is the mean squared error between the original image and the corrupted or denoised image. A PSNR score is computed for each frame in the noisy or restored video, and the final PSNR for a video is the average over all frames. The denoised pixel intensity range is 0 to 255.
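The PSNR computation described above can be sketched as follows (our own helper names):

```python
import numpy as np

def psnr(ref, test, L=255.0):
    """Per-frame PSNR = 10*log10(L^2 / MSE) for dynamic range L."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(L ** 2 / mse)

def video_psnr(ref_frames, test_frames):
    """Video score: the average of the per-frame PSNR scores."""
    return float(np.mean([psnr(r, t)
                          for r, t in zip(ref_frames, test_frames)]))
```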
5.1 Synthetic video
The noise-free videos, which function as the ground truth, are captured by our high-frame-rate camera (JVC GC-P100). The frame rate is 250 fps, the frame resolution is 640×360, and the duration is 250 frames. Three videos are captured, namely, (1) a man moving from right to left (MovR2L), (2) a man moving from far to near (MovF2N), and (3) a waving hand (Waving). The noise is AWGN with standard deviation σ_{n}. Three noise levels are chosen to simulate large-scale noise in ultra-low illumination.
A common parameter among all methods, including ours, is the noise level estimator. In our experiments, this parameter is set equal to σ_{n}. The parameters of the other algorithms are set to their default values. For our algorithm, the Gaussian prefilter is a 20×20 kernel with σ_{G} = 5. The intensity threshold is t = 5, and the small patch threshold is S = 100. N_{1} = 6 and N_{2} = 5 frames are used for motion segmentation. The spatial distance Gaussian kernel of the spatial bilateral filter is 5×5 with a standard deviation of 3. The standard deviation of the pixel intensity difference kernel is 5.
The PSNR comparison is shown in Table 1, which indicates that the proposed algorithm is superior to the other methods in the overall evaluation. The processing time is shown in Table 2. The configuration of the computer is an Intel Core i5-2430M CPU at 2.40 GHz, 8 GB RAM, and a 64-bit Windows 7 OS. Table 2 illustrates that the runtime of our algorithm is shorter than that of the other methods.
PSNR (dB) comparison for three large noise levels.
Processing time (seconds) comparison for three large noise levels.
In detail, the per-frame PSNR is shown in Fig. 10. Our algorithm outperforms the other methods after an initialization of approximately 50 frames. The advantage of our method can be attributed to its excellent denoising performance on the background. Even the outline of a bunch of balloons, shown in the enlarged patch in Fig. 11, is restored. Our robust motion segmentation method ensures good temporal denoising by the Kalman filter. Several segmentation examples are provided in Fig. 12.
PSNR comparison for the synthetic videos. (a1) to (a3) MovR2L, MovF2N, and Waving with σ_{n} = 80 . (b1) to (b3) MovR2L, MovF2N, and Waving with σ_{n} = 100 . (c1) to (c3) MovR2L, MovF2N, and Waving with σ_{n} = 120 .
Denoising result for frame 250 of the “MovR2L” video. (a) Noisy input. (b) Ground truth. (c) VBM3D. (d) STGSM. (e) 3D shearlets. (f) Proposed method.
Several examples of our motion segmentation method. The first row is from MovR2L video processing, the second row is from MovF2N video processing, and the third row is from Waving video processing.
One shortcoming of our algorithm lies in the motion area, because only a few frames can be employed there for Kalman filtering, and the spatial bilateral filter alone is insufficient to obtain a good result. When the object moves fast or the frame rate is low, the motion vector and the motion area are large. Consequently, the denoising performance of our algorithm degrades. The moving leg in Fig. 13 has a small motion vector; thus, the influence of noise is not obvious. In Fig. 14(a), the hand stops for an instant while turning back, so the motion vector becomes zero and a good denoising result is achieved. However, the influence of noise appears in a relatively large motion area when the hand is moving, as shown in Fig. 14(b). The peaks in Figs. 10(a3), (b3), and (c3) result from this instantaneous stop-and-move switch. In the extreme case, when the entire visual field is moving, the algorithm degrades to spatial bilateral filtering.
Denoising result for frame 215 of the “MovF2N” video. (a) Noisy input. (b) Ground truth. (c) VBM3D. (d) STGSM. (e) 3D shearlets. (f) Proposed method.
Denoising result of our algorithm for the “Waving” video. (a) Frame 150. (b) Frame 170.
5.2 Real video
We also test our algorithm on real videos captured in ultra-low illumination by using a high-frame-rate sensor (Viimagic 9222B). The frame rate is 240 fps, and the frame resolution is 720×480. The illumination during video capture is approximately 0.01 lux. We increase the global gain of the sensor to obtain a bright image. Three videos are used for the experiments, namely, (1) a man moving from left to right (MovL2R), (2) a man moving from far to near (MovF2N), and (3) a waving hand (Waving). The MovL2R video is the same one as in Fig. 3.
The noise estimator for all methods is set to 100. The settings of the other parameters are the same as those in Subsection 5.1. Sample denoising results are presented in Fig. 15. Our algorithm shows the clearest background without artifacts, and the moving objects are also maintained.
Denoising results for videos with real noise captured under an illumination of 0.01 lux. (a1) to (a5) MovL2R video. (b1) to (b5) MovF2N video. (c1) to (c5) Waving video.
The experiments prove that our algorithm is suitable for videos with a still background and a small motion vector. They also demonstrate that neglecting small changing patches in videos with serious noise is valuable: through this procedure, good motion estimation is obtained, and the main bodies of moving objects are maintained.
6. Conclusion
In this study, we propose a denoising algorithm for high-frame-rate videos in ultra-low illumination. Kalman filtering is used for temporal denoising. The imaging process noise is modeled as a result of motion, whereas the observation noise is attributed to the camera. A bilateral filter is used to help denoising in the motion area. Kalman temporal denoising and bilateral spatial denoising are combined through the Kalman gain, which indicates the degree of motion. Kalman filtering can provide a good estimate of the background; however, motion would be blurred if no motion estimation were implemented. Reliable motion estimation is thus the key to an effective Kalman filtering framework.
We exploit two features of high-frame-rate videos in ultra-low illumination for motion segmentation. The first feature is the small motion vector of moving objects in high-frame-rate videos, from which it follows that the motion area between two close frames is contained within the motion area between two relatively far frames (Fig. 7). This containment rule is valid when the motion vector is small. The second feature is that small changing patches in ultra-low illumination can reasonably be neglected, because detail changes are lost in noise and only the main structure of the image can be detected. A robust motion estimation method for high-frame-rate videos in ultra-low illumination is derived by applying these features. A large motion area between two spaced frames is initially segmented with the use of several extra frames under the containment rule and through the neglect of small patches [Eqs. (8) and (9)]. Then, the motion of each pair of successive images is estimated in the segmented areas [Eq. (10)].
The experiments on videos with synthetic and real noises demonstrate that our algorithm performs better than other state-of-the-art methods, particularly in the background. For moving objects, the main body is maintained because of our effective motion segmentation scheme, although the denoising performance there is not as good as in the background. When the motion vector is small, the motion area is also small and the denoising performance improves. A small motion vector can be achieved when objects moving at normal velocity are captured with a high-frame-rate camera. Our algorithm is thus suitable for denoising high-frame-rate videos in ultra-low illumination.
BIO
Xin Tan received his B.S. degree in Electronics Information from Sichuan University in 2008 and his M.S. degree in Systems Engineering from the National University of Defense Technology (NUDT) in 2010.
He is currently working toward the Ph.D. degree at the College of Information System and Management, NUDT. His research interests focus on image/video processing and multimedia information systems.
Yu Liu received his B.S. degree from Northwestern Polytechnical University, Xi'an, China, in 2005. He received his M.Sc. in image processing and his Ph.D. in computer graphics from the University of East Anglia, Norwich, United Kingdom, in 2007 and 2011, respectively.
He is currently a lecturer at the College of Information System and Management, National University of Defense Technology. His research interests include image/video processing, computer graphics, and visual-haptic technology.
Zheng Zhang was born in Hunan, China, in 1983. He received his M.E. degree from the National University of Defense Technology, China, in 2007, and his Ph.D. in Computer Engineering from Nanyang Technological University, Singapore, in 2012. He is currently working at the National University of Defense Technology. His research interests include wide-area video surveillance and analysis, computational photography, and human motion analysis.
Maojun Zhang received his B.S. and Ph.D. degrees in Systems Engineering from the National University of Defense Technology (NUDT) in 1992 and 1997, respectively.
From 1997 to 2003, he was an Assistant Professor with the Department of Systems Engineering, NUDT. Since 2003, he has been a Professor at the College of Information System and Management, NUDT, Changsha, China. His major interests include computer vision, image/video processing, multimedia information systems, and virtual reality technology.