Video fingerprints generated from global features are usually vulnerable to general geometric transformations. In this paper, a novel video fingerprinting algorithm is proposed, in which a new spatiotemporal gradient is designed to represent the spatial and temporal information of each frame, and a new partition scheme, based on a concentric circle and rings, is developed to resist such attacks efficiently. The centroids of spatiotemporal gradient orientations (CSTGO) within the circle and rings are then calculated to generate a robust fingerprint. Our experiments with different attacks have demonstrated that the proposed approach outperforms state-of-the-art methods in terms of robustness and discrimination.
1. Introduction
With the rapid development of the Internet and multimedia technology, videos can be easily accessed and distributed. However, this convenience creates copyright control issues; for example, pirated copies of videos can easily be made and propagated over the Internet. Video fingerprinting, or content-based copy detection [1], provides an efficient way to identify video content by its features, and is widely used in the multimedia industry [2] for tasks such as video database management, content searching, and content tracking.
Fig. 1 shows the common structure of a video fingerprinting system, which consists of three parts: fingerprint extraction, database storage, and fingerprint matching.
Overall structure of a complete fingerprinting system
A good video fingerprinting algorithm must satisfy the following properties:
Discrimination: The algorithm should generate distinct fingerprints for different video contents, so that the fingerprints can distinguish two different media contents.
Robustness: The fingerprints of a video should remain invariant even if the video suffers perceptual-quality-preserving distortions such as frame rotation or brightness change.
Efficiency: Video fingerprints should have low computational and storage cost to ensure fast retrieval. Efficiency is particularly valuable in industrial applications.
In general, video fingerprinting algorithms can be categorized into two classes. In the first class, video fingerprints are extracted from local features. For example, the video fingerprints are constructed from Harris interest points [3] in [4], or formed from the trajectories of interest points in [5]. In [6], the video fingerprints are generated from SIFT features [7]. One of the main drawbacks of these algorithms is that the number and the dimension of the extracted features are usually very large, which incurs high computational and storage cost. In the second class, video fingerprints are obtained from global video features. Ordinal measures are used for copy detection in [8]-[10], and video fingerprint weights based on visual saliency are computed for the ordinal measure in [11]. In [12], the gradient of each pixel is obtained by calculating luminance differences along both the horizontal and vertical directions, and the fingerprints are generated from the centroid of these luminance gradient orientations (CGO). Although the computational cost of these algorithms is low compared with that of the first class, they are vulnerable to logo insertion and to some geometric transforms such as frame rotation or mirroring.
In this paper, a novel video fingerprinting algorithm is proposed to enhance the robustness of video fingerprints by using the centroid of spatiotemporal gradient orientations (CSTGO). The key contributions of this paper are: (1) in our spatiotemporal gradient, the spatial derivative is computed over eight directions to avoid the geometric limitation of the traditional horizontal and vertical derivatives, while the temporal derivative is used to enhance the resistance of our fingerprints to logo insertion; (2) a novel partitioning scheme based on a concentric circle and rings is developed to further improve the robustness of the fingerprints. Our analysis and experimental results show that combining the spatial information within each frame and the temporal information between two consecutive frames enhances robustness against attacks such as global changes, and makes the fingerprints immune to rotation, frame mirroring, and logo insertion.
The rest of the paper is organized as follows. In Section 2, the proposed fingerprinting algorithm and its design considerations are described in detail; the experimental results and analysis are presented in Section 3; and the conclusions are given in Section 4.
2. The Proposed Fingerprinting Algorithm
In the traditional CGO-based algorithm [12], only the four neighbors in the horizontal and vertical directions are considered when the luminance difference is calculated for each pixel. In fact, when a video frame suffers attacks such as frame rotation or mirroring, the horizontal and vertical neighbors of each pixel change; however, as long as the attack preserves perceptual quality, the pixel's eight neighbors still surround it and only their relative positions change, as shown in Fig. 2.
Spatial relationship of a pixel's eight neighbors: (a) original spatial relationship (each sub-block represents a pixel); (b) after a 90-degree rotation of (a) around the central pixel; (c) after vertical mirroring of (a).
In order to keep the spatial information intact and to achieve good rotation invariance, in our algorithm the luminance differences at each central pixel f(x, y, k) in frame k are calculated using its 8-neighborhood values, and the maximal absolute value of these differences is used as the spatial luminance difference, denoted as the spatial derivative G_{s}:
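The description above pins down equation (1) up to notation; a plausible reconstruction is:

```latex
G_s(x,y,k) = \max_{(i,j)\in\mathcal{N}_8} \bigl| f(x+i,\,y+j,\,k) - f(x,y,k) \bigr|,
\qquad \mathcal{N}_8 = \{-1,0,1\}^2 \setminus \{(0,0)\}
```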
Block diagram of proposed CSTGO algorithm
On the other hand, temporal information can provide additional discriminative power for video fingerprints [10]. However, the CGO-based approach [12] does not consider the temporal relationship between frames. It is noted that the differences between adjacent frames are usually small, and most of them are close to zero when the scene changes little. In our algorithm, this characteristic is used to improve the robustness of the fingerprints against logo insertion: in this case, the differences in the logo region between two consecutive frames are equal to zero. Here the derivative between two adjacent frames is denoted as the temporal derivative G_{t}:
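Consistent with this description, equation (2) would be the luminance difference between co-located pixels of consecutive frames (a reconstruction):

```latex
G_t(x,y,k) = f(x,y,k) - f(x,y,k-1)
```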
The block diagram of the proposed CSTGO-based algorithm is shown in Fig. 3. The preprocessing stage performs basic operations including frame-rate conversion, grayscale conversion, and frame resizing [12]. Each frame is then partitioned into a circle and N − 1 concentric rings, where the outer radius of the n-th region is n × r (n = 1, 2, …, N, and r is the radius of the innermost circle), as shown in Fig. 4; here the frame size is X × Y, and N × r is equal to Y/2 or X/2. If a frame is rotated around its center, the relationship between each pixel and its associated circle or ring does not change, so the fingerprint is more robust to rotation attacks than with the traditional block-based partition schemes [8]-[12]. In addition, we assume that in most common cases the visual content is distributed throughout the concentric rings and that its importance decreases with increasing distance from the frame center. Thus, if logos are placed or inserted in the regions outside the largest ring, as shown in Fig. 4, these regions are discarded during the feature calculation, and the generated fingerprints can resist logo insertion without losing important visual content. Moreover, unlike the concentric-circle-based partition scheme in [13], in our partition scheme the regions closer to the center of the frame are covered by more concentric rings, so different regions are effectively weighted according to the importance of their visual content.
Frame partition into concentric rings (N = 8)
The procedure of the CSTGO calculation is as follows:
(1) For each pixel value f(x, y, k) at location (x, y) in frame k, the Euclidean distance Dist(x, y, k) between the pixel and the frame center is defined as:
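A plausible form (a reconstruction, taking the frame center at (X/2, Y/2)) is:

```latex
Dist(x,y,k) = \sqrt{\left(x - \frac{X}{2}\right)^{2} + \left(y - \frac{Y}{2}\right)^{2}}
```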
where X and Y are the width and height of frame k, respectively.
(2) For each f(x, y, k), if (n − 1) × r ≤ Dist(x, y, k) ≤ n × r (1 ≤ n ≤ N) is satisfied, then the gradient magnitude r(x, y, k) and orientation θ(x, y, k) are given respectively by:
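A standard construction consistent with the text (a reconstruction) is:

```latex
r(x,y,k) = \sqrt{G_s^{2}(x,y,k) + G_t^{2}(x,y,k)}, \qquad
\theta(x,y,k) = \arctan\!\left(\frac{G_t(x,y,k)}{G_s(x,y,k)}\right)
```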
where G_{s} and G_{t} are calculated by (1) and (2), respectively.
(3) The centroid of gradient orientations c_{n,k} is calculated as follows:
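One plausible definition, following the magnitude-weighted centroid of orientations used in the CGO scheme [12] (the exact form here is an assumption), is:

```latex
c_{n,k} = \frac{\displaystyle\sum_{(x,y)\in R_n} r(x,y,k)\,\theta(x,y,k)}
               {\displaystyle\sum_{(x,y)\in R_n} r(x,y,k)}
```

where R_n denotes the n-th circle or ring.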
(4) Since each frame contains a circle and N − 1 concentric rings, the fingerprint vector C_{k} is finally obtained as follows:
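A natural form for this vector, collecting the N centroids of frame k (a reconstruction), is:

```latex
C_k = \left( c_{1,k},\, c_{2,k},\, \ldots,\, c_{N,k} \right)
```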
and N is the dimension of the fingerprints.
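Putting steps (1)-(4) together, the per-frame fingerprint computation can be sketched in NumPy as follows. The individual equations are reconstructions from the descriptions above (in particular the magnitude-weighted centroid), so this is an illustrative sketch rather than the authors' exact implementation:

```python
import numpy as np

def spatial_derivative(frame):
    """G_s: maximal absolute luminance difference over the 8-neighborhood."""
    f = frame.astype(np.float64)
    padded = np.pad(f, 1, mode='edge')
    g = np.zeros_like(f)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            neighbor = padded[1 + di:1 + di + f.shape[0], 1 + dj:1 + dj + f.shape[1]]
            g = np.maximum(g, np.abs(neighbor - f))
    return g

def cstgo_fingerprint(frame, prev_frame, n_rings=8):
    """Per-frame CSTGO fingerprint: one centroid per concentric circle/ring."""
    gs = spatial_derivative(frame)
    gt = frame.astype(np.float64) - prev_frame.astype(np.float64)  # G_t
    mag = np.sqrt(gs ** 2 + gt ** 2)          # gradient magnitude r(x, y, k)
    theta = np.arctan2(gt, gs)                # gradient orientation theta(x, y, k)
    height, width = frame.shape
    yy, xx = np.mgrid[0:height, 0:width]
    dist = np.sqrt((xx - width / 2) ** 2 + (yy - height / 2) ** 2)
    r = min(width, height) / (2 * n_rings)    # radius of the innermost circle
    fp = np.zeros(n_rings)
    for n in range(1, n_rings + 1):
        # pixels outside the largest ring are discarded automatically
        mask = ((n - 1) * r <= dist) & (dist <= n * r)
        weights = mag[mask]
        # magnitude-weighted centroid of orientations (assumed form of c_{n,k})
        fp[n - 1] = (weights * theta[mask]).sum() / max(weights.sum(), 1e-12)
    return fp
```

For a 320 × 320 frame with N = 8, this yields one 8-dimensional vector per frame, matching the fingerprint dimension used in the experiments.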
In the fingerprint matching stage, each fingerprint vector in a fingerprint sequence of K consecutive frames is first normalized by the mean and standard deviation of the sequence, as in [12]. Then the similarity between two video clips is measured by the Euclidean distance d as follows:
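One plausible form of this distance, normalized by the sequence length (an assumption, but consistent with the inner-distance mean of about 2 reported in Section 3.1 for zero-mean, unit-variance normalized fingerprints), is:

```latex
d(P,Q) = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(p_{n,k}-q_{n,k}\right)^{2}
```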
where P = {p_{n,k} | n = 1,…,N, k = 1,…,K} and Q = {q_{n,k} | n = 1,…,N, k = 1,…,K} are the normalized fingerprint sequences extracted from the two video clips. Given a threshold T, if d(P, Q) ≤ T, the two video clips are considered similar.
The performance of the algorithm can be measured by the false positive rate P_{fp} and the false negative rate P_{fn}. P_{fp} is the probability of treating two different videos as the same one, while P_{fn} is the probability of treating two copies of the same video as dissimilar. Since P_{fn} is usually hard to obtain directly, the evaluation method of [12] is used here to estimate P_{fp} and P_{fn}. In fact, since the fingerprints are extracted independently from the video clips, the distance d approximately follows a normal distribution by the central limit theorem if NK is sufficiently large. Let d_{intra} denote the intra distance between an original video fingerprint sequence and its distorted version, and d_{inner} denote the inner distance between the fingerprint sequences of two different clips; the distributions of these two distances can then be written as follows:
where μ_{intra}, σ_{intra}, μ_{inner} and σ_{inner} are the corresponding means and standard deviations. By maximum likelihood estimation, given the distance values d_{i} (1 ≤ i ≤ M) of M fingerprint sequences, the means and variances can be estimated as follows:
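For a normal model, the maximum-likelihood estimates are the sample mean and the (biased) sample variance:

```latex
\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} d_i, \qquad
\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M}\left(d_i - \hat{\mu}\right)^{2}
```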
The false positive rate P_{fp} can be written as follows:
and the threshold T can be obtained by:
where erfcinv(x) is the inverse function of the complementary error function erfc(x).
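Treating a false positive as an inner distance falling below the threshold, equations (13) and (14) can be reconstructed as follows (this form reproduces the P_{fp} values reported in Section 3.1):

```latex
P_{fp} = P\!\left(d_{inner} \le T\right)
       = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\mu_{inner}-T}{\sqrt{2}\,\sigma_{inner}}\right),
\qquad
T = \mu_{inner} - \sqrt{2}\,\sigma_{inner}\,\mathrm{erfcinv}\!\left(2P_{fp}\right)
```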
Given a set of different P_{fp} values, the corresponding thresholds T can be determined from equations (13) and (14), and P_{fn} can then be measured through the experiments using equation (15):
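From the definitions of N_{fn} and N_{s}, equation (15) is the simple ratio:

```latex
P_{fn} = \frac{N_{fn}}{N_{s}}
```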
where N_{fn} is the number of similar clip pairs whose d(P, Q) ≥ T, and N_{s} is the true number of similar clip pairs.
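As a quick numerical check, the relationship between T and P_{fp} described above can be evaluated with the standard library. The erfc-based tail formula below is a reconstruction, but it reproduces the P_{fp} values reported in Section 3.1 within rounding:

```python
from math import erfc, sqrt
from statistics import NormalDist

def false_positive_rate(T, mu_inner, var_inner):
    """P_fp = P(d_inner <= T): lower tail of the fitted normal inner-distance model.
    erfc is used directly so that extreme tails (e.g. 1e-40) do not underflow."""
    sigma = sqrt(var_inner)
    return 0.5 * erfc((mu_inner - T) / (sqrt(2.0) * sigma))

def threshold_for(p_fp, mu_inner, var_inner):
    """Invert the tail probability to recover the threshold T for a target P_fp."""
    return NormalDist(mu_inner, sqrt(var_inner)).inv_cdf(p_fp)

# Inner-distance mean/variance reported in Section 3.1
print(false_positive_rate(0.5, 1.95693, 0.120112))   # CGO: ~1.31e-05
print(false_positive_rate(0.5, 1.9974, 0.0127287))   # CSTGO: ~1.67e-40
print(threshold_for(1.3122e-5, 1.95693, 0.120112))   # ~0.5
```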
3. Experiment Results
Our experiments were conducted on a database containing more than 200 movies collected from the Internet, with a total length of about 300 hours. In our experiments, each video clip is downsampled to 10 fps (frames per second) and resized to 320 × 320. The fingerprint dimension N is 8, and the clip length is 10 seconds, i.e., K = 100.
3.1 Robustness Test
Comparison between the original video frame and frames with various video attacks. (a) Original frame, (b) Resizing to CIF, (c) Aspect ratio change (from 4:3 to 16:9), (d) Gaussian blurring with radius 1 pixel, (e) Global change in brightness (+25%), (f) Gamma correction (γ = 1.3), (g) Gaussian noise addition, (h) Logo overlay in the upper-left by (9% and 16%), (i) Logo overlay in the middle by (9% and 16%), (j) Crop (14%), (k) Horizontal shift (10 pixels), (l) Frame mirroring vertically, (m) Rotation (1, 13, 20, and 40 degrees).
Empirical inner and intra distance distributions for CSTGO and CGO algorithms. (a) The inner distance distributions of the two algorithms. (b) The intra and inner distance distributions of CSTGO.
In this test, the videos suffer twelve types of attacks. All attacks, together with the original frame, are illustrated in Fig. 5.
Global attack (Type 1 to Type 6)
(1) Resizing to CIF,
(2) Aspect ratio change (from 4:3 to 16:9),
(3) Gaussian blurring with radius 1 pixel,
(4) Global change in brightness (+25%),
(5) Gamma correction (γ = 1.3),
(6) Gaussian noise addition,
Logo overlay (Type 7 and Type 8)
(7) Logo overlay in the upper-left by (9% and 16%),
(8) Logo overlay in the middle by (9% and 16%),
Geometric attack (Type 9 to Type 12)
(9) Crop (14%),
(10) Horizontal shift (10 pixels),
(11) Frames mirroring vertically,
(12) Rotation (1, 13, 20, and 40 degrees).
First, the empirical inner distance distributions of the CSTGO-based algorithm and the CGO-based algorithm [12] are compared in Fig. 6(a). 45,082,260 pairs of different fingerprint sequences are used to estimate the distributions. The mean and variance of the inner distances are 1.9974 and 0.0127287 for the CSTGO-based algorithm, and 1.95693 and 0.120112 for the CGO-based algorithm, respectively. Since the variance of the inner distances is smaller for CSTGO than for CGO, under a given threshold T the CSTGO-based algorithm can achieve a lower P_{fp} than the CGO-based algorithm. For example, if T is fixed to 0.5, the expected P_{fp} of the CGO-based algorithm is 1.3122×10^{−5}, while that of the CSTGO-based algorithm is 1.6759×10^{−40}, which is much smaller. Hence, the proposed algorithm has better discrimination than the CGO-based algorithm.
Fig. 6(b) shows the inner distance distribution and three intra distance distributions of CSTGO. The intra distributions are calculated under three attacks: (1) Type 8, logo overlay in the middle of the frame by 16%; (2) Type 9, frame cropping; (3) Type 12, rotation by 40 degrees. As shown in Fig. 6(b), the inner distance distribution is well separated from all three intra distance distributions, so the CSTGO-based algorithm achieves better robustness than the CGO-based algorithm.
In Fig. 7 - Fig. 11, about 130 hours of videos, together with their twelve attacked versions, are used. Because P_{fp} is usually hard to obtain in practice, here P_{fp} is set from 10^{−60} to 1, and the corresponding threshold T is determined by equation (14). P_{fn} is then obtained from equation (15), where N_{fn} and N_{s} are measured through the experiments. In these experiments, the robustness of the proposed CSTGO is compared with the weighted hashing (WH) method in [11] and CGO [12] under different attacks.
The robustness performances of CSTGO compared with WH [11] and CGO [12] under global attacks. (a) Resizing to CIF, (b) Aspect ratio change (from 4:3 to 16:9), (c) Gaussian blurring with radius 1 pixel, (d) Global change in brightness (+25%), (e) Gamma correction (γ = 1.3), (f) Gaussian noise addition.
The robustness performances of CSTGO compared with WH [11] and CGO [12] under logo insertions. (a) Logo overlay in the upper-left by 9%, (b) Logo overlay in the middle by 9%.
The robustness performances of CSTGO under different logo insertions.
The robustness performances of CSTGO compared with WH [11] and CGO [12] under geometric attacks. (a) Frame crop (14%), (b) Horizontal shift (10 pixels), (c) Rotation by 1 degree, (d) Frame mirroring.
The robustness performances of CSTGO under different rotational degrees
Fig. 7 gives the robustness comparisons of the above three algorithms under different global attacks. Since the proposed CSTGO uses the maximum of the luminance differences as the spatial derivative, brightness change and noise addition may influence its performance. Nevertheless, under a given P_{fp}, the proposed algorithm achieves a lower P_{fn} than the other two algorithms, which means our algorithm is more robust against global attacks than WH and CGO.
Fig. 8 shows the comparison of the three algorithms under different logo insertions; our CSTGO achieves better performance than CGO and WH. Usually a logo is inserted in the marginal region. For example, when the logo is inserted in the upper-left corner of the original frame, thanks to the temporal derivative and the partition scheme, a low P_{fn} is achieved even when P_{fp} is small, because most of the affected area is discarded and the impact on the fingerprints is weak, as shown in Fig. 8(a). Furthermore, when the logo is inserted in the middle region of the frame, the main visual content is changed, and so are the video fingerprints under this attack. However, Fig. 8(b) demonstrates that our method is still more robust to this attack than WH and CGO. In addition, Fig. 9 gives the robustness of the proposed CSTGO under logo insertion attacks with different positions and sizes; our method can resist this attack even when the inserted logo overlays 16% of the middle area of the original frame.
The comparisons of the three algorithms under different geometric attacks are shown in Fig. 10. The figure demonstrates that our proposed CSTGO is the most robust of the three methods when videos suffer geometric transforms. Traditional block-based partition schemes such as WH [11] and CGO [12] are fragile to some geometric attacks, since these attacks may change the order of the frame blocks; for example, Fig. 10(d) illustrates that WH and CGO cannot resist the frame mirroring attack. Fig. 11 shows that, with the rotation-invariant spatial derivative and the partition scheme in our CSTGO-based approach, the generated fingerprints are immune to rotation even when frames are rotated by up to 40 degrees.
3.2 Scalability Test
In this test, the scalability of the proposed algorithm is examined. Since the fingerprint vector of each frame has dimension 8, a kd-tree [14] is used as the indexing structure in our database. In the matching stage, each fingerprint vector in the query fingerprint sequence is used to retrieve its most similar fingerprint vector in the database via the kd-tree, which yields the potential matching video clips. The distances between the query clip and the clips found in the database are then calculated, and the clip with the smallest distance is finally selected as the retrieval result.
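The retrieval flow described above can be sketched as follows. A linear scan stands in here for the kd-tree index of [14] (at scale, a kd-tree such as scipy.spatial.KDTree would serve), and the sequence-level distance is an assumed normalized form:

```python
import numpy as np

def retrieve(query_seq, db_seqs):
    """Sketch of the matching stage.

    query_seq: (K, N) array of normalized fingerprint vectors for the query clip.
    db_seqs:   dict mapping clip id -> (K, N) fingerprint array.
    """
    # Flatten the database into (vector, owning clip) pairs.
    ids, vecs = [], []
    for cid, seq in db_seqs.items():
        for v in seq:
            ids.append(cid)
            vecs.append(v)
    vecs = np.asarray(vecs)
    # Step 1: each query vector votes for the clip owning its nearest neighbor.
    candidates = set()
    for q in query_seq:
        j = int(np.argmin(((vecs - q) ** 2).sum(axis=1)))  # kd-tree query in practice
        candidates.add(ids[j])
    # Step 2: rank candidate clips by sequence-level distance; smallest wins.
    def seq_dist(a, b):
        return float(((a - b) ** 2).mean())
    return min(candidates, key=lambda cid: seq_dist(query_seq, db_seqs[cid]))
```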
In the test, 18,047 query clips were constructed, with a total length of about 50 hours. In order to test the scalability, the size of the database was varied from 50 hours to 300 hours. Fig. 12 shows that the retrieval precision remains at 100% even as the size of the database increases, which demonstrates that the proposed algorithm has good scalability.
The scalability performances of CSTGO
4. Conclusion
In this paper, a new algorithm based on CSTGO is proposed for video fingerprinting. The algorithm includes an improved spatial derivative, a newly defined spatiotemporal gradient that represents the spatial and temporal information of each frame, and a new partition scheme based on concentric rings. Our investigation indicates that the temporal derivative can be used to resist logo insertion, and that combining the spatial and temporal information of each frame enhances both the discrimination and the robustness of the video fingerprints. Our experimental results have demonstrated that the proposed CSTGO-based algorithm performs better than state-of-the-art algorithms in terms of discrimination and robustness under typical attacks such as global changes, geometric attacks, and logo insertion. Our fingerprints are also robust against geometric attacks such as rotation and frame mirroring, which traditional global-feature-based fingerprints cannot resist. The proposed CSTGO also has good scalability. Future work is to evaluate our CSTGO algorithm on open video databases.
Acknowledgements
The authors would like to thank Mr. Dennis D. Zhu and Mr. Guibo Luo for their kind support and valuable comments on this paper. This work is supported by the 973 Program of China under Grant 2012CB315904.
BIO
Ziqiang Sun was born in 1987. He received his B.S. degree in Electronics Engineering from Lanzhou University in 2009. He is currently a Ph.D. candidate in the School of Electronics Engineering and Computer Science, Peking University. His research interests include video fingerprinting and multimedia technology.
Yuesheng Zhu received his B.Eng. degree in Radio Engineering, M.Eng. degree in Circuits and Systems and Ph.D. degree in Electronics Engineering in 1982, 1989 and 1996, respectively. He is currently working as a professor at the Lab of Communication and Information Security, Shenzhen Graduate School, Peking University. He is a senior member of IEEE, fellow of China Institute of Electronics, and senior member of China Institute of Communications. His interests include digital signal processing, multimedia technology, communication and information security.
Xiyao Liu was born in Hunan Province in 1987. He received the B.S. degree from the School of Electrical Engineering and Computer Sciences, Department of Microelectronics, Peking University in 2008. He is currently pursuing the Ph.D. degree in the School of Electrical and Computer Engineering, Department of Microelectronics and Solid-State Electronics, Peking University. His research interests include video fingerprinting, video watermarking, and other digital video processing.
Liming Zhang received the B.S. degree from Nankai University, China, in 1987, the M.S. degree from the East China Institute of Technology, China, in 1990, and the Ph.D. degree in computer science (image processing) from the University of New England, Australia, in 2002. She has been working as an Assistant Professor in the Faculty of Science and Technology, University of Macau, since 2010. Her recent research interests include signal processing and image processing.
References
[1] Lian S., Nikolaos N. and Sencar H. T., "Content-based video copy detection — a survey," Intelligent Multimedia Analysis for Security Applications, vol. 282, pp. 253-273, 2010.
[2] Jian Lu, "Video fingerprinting for copy identification: from research to industry applications," in Proc. of SPIE - Media Forensics and Security XI, vol. 7254, 2009.
[3] Harris C. and Stevens M., "A combined corner and edge detector," in Proc. of 4th Alvey Vision Conference, pp. 153-158, 1988.
[4] Joly A., Buisson O. and Frelicot C., "Content-based copy retrieval using distortion-based probabilistic similarity search," IEEE Transactions on Multimedia, vol. 9, no. 2, pp. 293-306, 2007. DOI: 10.1109/TMM.2006.886278.
[5] Law-To J., Buisson O., Gouet-Brunet V. and Boujemaa N., "Robust voting algorithm based on labels of behavior for video copy detection," in Proc. of ACM Multimedia, pp. 835-844, 2006.
[6] Liu Z., Liu T. and Shahraray B., "AT&T research at TRECVID 2009 content-based copy detection," in Proc. of TRECVID, 2009.
[8] Mohan R., "Video sequence matching," in Proc. of Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp. 3697-3700, 1998.
[9] Hampapur A. and Bolle R., "Comparison of sequence matching techniques for video copy detection," in Proc. of Conf. on Storage and Retrieval for Media Databases, pp. 194-201, 2002.
[10] Kim C. and Vasudev B., "Spatiotemporal sequence matching for efficient video copy detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 127-132, 2005. DOI: 10.1109/TCSVT.2004.836751.
[11] Sun J., Wang J., Zhang J., Nie X. and Liu J., "Video hashing algorithm with weighted matching based on visual saliency," IEEE Signal Processing Letters, vol. 19, no. 6, pp. 328-331, 2012. DOI: 10.1109/LSP.2012.2192271.
[12] Lee S. and Yoo C., "Robust video fingerprinting for content-based video identification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, pp. 983-988, 2008. DOI: 10.1109/TCSVT.2008.920739.
[13] Radhakrishnan R. and Bauer C., "Video fingerprinting based on moment invariants capturing appearance and motion," in Proc. of Int. Conf. on Multimedia and Expo, pp. 1532-1535, 2009.
[14] Robinson J. T., "The K-D-B-tree: a search structure for large multidimensional dynamic indexes," in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 10-18, 1981.