Advanced
A Robust Video Fingerprinting Algorithm Based on Centroid of Spatio-temporal Gradient Orientations
A Robust Video Fingerprinting Algorithm Based on Centroid of Spatio-temporal Gradient Orientations
KSII Transactions on Internet and Information Systems (TIIS). 2013. Nov, 7(11): 2754-2768
Copyright © 2013, Korean Society For Internet Information
  • Received : July 09, 2013
  • Accepted : November 10, 2013
  • Published : November 30, 2013
Download
PDF
e-PUB
PubReader
PPT
Export by style
Share
Article
Author
Metrics
Cited by
TagCloud
About the Authors
Ziqiang Sun
Communication & Information Security Lab, Shenzhen Graduate School, Peking University Shenzhen, Guangdong 518055 - China
Yuesheng Zhu
Communication & Information Security Lab, Shenzhen Graduate School, Peking University Shenzhen, Guangdong 518055 - China
Xiyao Liu
Communication & Information Security Lab, Shenzhen Graduate School, Peking University Shenzhen, Guangdong 518055 - China
Liming Zhang
Faculty of Science and Technology, University of Macau Macao, 999078 - China

Abstract
Video fingerprints generated from global features are usually vulnerable against general geometric transformations. In this paper, a novel video fingerprinting algorithm is proposed, in which a new spatio-temporal gradient is designed to represent the spatial and temporal information for each frame, and a new partition scheme, based on concentric circle and rings, is developed to resist the attacks efficiently. The centroids of spatio-temporal gradient orientations (CSTGO) within the circle and rings are then calculated to generate a robust fingerprint. Our experiments with different attacks have demonstrated that the proposed approach outperforms the state-of-the-art methods in terms of robustness and discrimination.
Keywords
1. Introduction
W ith the prompt development of the internet and multimedia technology, videos can be easily accessed and distributed. However, this convenience creates copyright control issues, for example, pirated copies of videos can be easily made and propagated over the internet. Video fingerprinting or content-based copy detection [1] provides an efficient way to identify the video features, and is widely used in multimedia industry [2] such as video database management, content searching, and content tracking. Fig. 1 shows the common structure of a video fingerprinting system, which consists three parts: fingerprint extraction, database storage and fingerprint matching.
PPT Slide
Lager Image
Overall structure of a complete fingerprinting system
A good video fingerprinting algorithm must satisfy following properties:
Discrimination : Video fingerprinting algorithm should generate distinct fingerprints for different video contents, so that these fingerprints can distinguish two different media contents.
Robustness : Fingerprints of a video should remain invariant even if the video suffers several perceptual- preserving distortions like frame rotation or brightness change.
Efficiency : Video fingerprints should have low computational and storage cost to ensure fast retrieval. Efficiency is valuable in industry application.
In general, video fingerprinting algorithms can be categorized into two classes. In the first class, video fingerprints are extracted based on local features. For example, the video fingerprints are constructed by using Harris interest points [3] in literature [4] , or formed by using the trajectories of the interest points in literature [5] . In literature [6] , the video fingerprints are generated based on SIFT features [7] . One of main drawbacks of these algorithms is that the number and the dimension of the extracted features are usually very large, which incurs large computational and storage cost. In the second class, video fingerprints are obtained based on global video features. Ordinal measure can be used for copy detection in [8] - [10] , and the video fingerprint weights based on visual saliency are computed for ordinal measure in literature [11] . In literature [12] , the gradient of each pixel is obtained by calculating luminance differences along both horizontal and vertical directions and the fingerprints are generated from the centroid of these luminance gradient orientations (CGO). Although the computation cost of these algorithms is low compared to the algorithms of the first class, they are vulnerable to logo insertion and some geometric transforms like frame rotation or mirroring.
In this paper, a novel video fingerprinting algorithm is proposed to enhance the robustness of video fingerprints, by using the centroid of spatio-temporal gradient orientations (CSTGO). The key points of this paper is: (1) In our spatio-temporal gradient, the spatial derivative in eight directions is designed to avoid geometric limitation of traditional horizontal and vertical derivatives, while the temporal derivative is used to enhance the resistance to logo insertion of our fingerprints. (2) A novel partitioning scheme based on concentric circle and rings is also developed to further improve the robustness of the fingerprints. Our analysis and experimental results show that the combination of the spatial information in each frame and temporal information between two consecutive frames can enhance the robustness against attacks like global change as well as immunize against rotation, frame mirroring, and logo insertion.
The rest of the paper is organized as follows: in Section II the proposed fingerprinting algorithm and the design consideration are described in details, the experiment results and analysis are shown in Section III and the conclusions are given in Section IV.
2. The Proposed Fingerprinting Algorithm
In the traditional CGO based algorithm [12] only four neighbors in the horizontal and vertical direction are considered when the luminance difference is calculated for each pixel. In fact, when a video frame suffers the attacks like frame rotation or mirroring, the horizontal and vertical neighbors of each pixel would be changed, while as long as the attacks preserve the perceptual quality, its eight neighbors still revolve around it, only their relative position is changed as shown in Fig. 2 .
PPT Slide
Lager Image
Spatial relationship of a pixel’s eight neighbors (a) Original spatial relationship (Each sub-block represent a pixel) (b) After a 90 degrees rotation of (a) around the central pixel (c) After vertical mirroring of (a)
In order to keep the spatial information intact and get a good rotation invariance, in our algorithm the luminance differences at each central pixel f ( x , y , k ) in frame k are calculated by using its 8-neighborhood values and the maximal absolute value of these differences is used as the spatial luminance difference, and denoted as spatial derivative Gs :
PPT Slide
Lager Image
Block diagram of proposed CSTGO algorithm
PPT Slide
Lager Image
On the other hand, the temporal information can provide additional discriminative power for video fingerprints [10] . However the CGO based approach [12] does not consider temporal relationship between frames. It is noted that the differences between adjacent frames are usually small, and sometimes most of them are closed to zero when the change of scenes is little. This characteristic can be used to improve the robustness of fingerprints when logo insertion occurs in our algorithm. In this case, the differences of the logo region between two consecutive frames would be equal to zero. Here the derivative between two adjacent frames is denoted as temporal derivative Gt :
PPT Slide
Lager Image
The block diagram of the proposed CSTGO based algorithm is shown in Fig. 3 . The pre-processing performs basic operations including frame rate conversion, grayscale conversion and frame resizing [12] . Each frame is then partitioned into a circle and N -1 concentric rings, and the radius of the circle and rings is n × r (where n = 1,2, …, N , r is the radius of the innermost circle) as shown in Fig. 4 , where the frame size is X × Y , and N × r is equal to Y /2 or X /2. If frames are rotated around the centers, the relationship between each pixel and its associated circle or ring would not change. So the robustness of video fingerprint can be enhanced when frames suffer rotation attack, compared with traditional block-based partition schemes [8] - [12] . In addition, assume that in the most common cases, the visual contents are distributed throughout the concentric rings and the importance of these contents decrease with the increasing of their distance to the frame center. Thus, if the logos placed or inserted in the regions outside the largest ring as shown in Fig. 4 , these regions will be discarded during the feature calculation and the generated fingerprints can resist to the logo insertion without losing important visual content. Moreover, rather than the concentric circle-based partition scheme in [13] , in our partition scheme the closer to the central of the frame, the more concentric rings contains, thus different weights are obtained in different regions depended on the importance of visual contents.
PPT Slide
Lager Image
Frame partition into concentric rings (N = 8)
The procedure of the CSTGO calculation is as follows:
(1) For each pixel’s value f ( x , y , k ) at location ( x , y ) in frame k , the Euclidean distance Dist ( x , y , k ) between the pixel and the frame center is defined as:
PPT Slide
Lager Image
where X and Y is the width and height of frame k , respectively.
(2) For each f ( x , y , k ), if ( n -1) × r Dist ( x , y , k ) ≤ n × r (1≤ n N ) is satisfied, then the gradient magnitude r ( x , y , k ) and orientation θ (x, y, k) is given respectively by:
PPT Slide
Lager Image
PPT Slide
Lager Image
where Gs and Gt is calculated by (1) and (2), respectively.
(3) The centroid of gradient cn,k is calculated as follows:
PPT Slide
Lager Image
(4) Since each frame contains a circle and N -1 concentric rings, the fingerprint vector Ck is obtained finally as follows:
PPT Slide
Lager Image
and N is the dimension of the fingerprints.
In the fingerprint matching stage, each fingerprint vector in a fingerprint sequence of K consecutive frames is first normalized by the means and standard deviation of the sequence as in [12] . Then the similarity between two video clips is measured by the Euclidean distance d as follows:
PPT Slide
Lager Image
where P = { pn,k | n = 1,…, N , k = 1,…, K } and Q = { qn,k | n = 1,…, N , k = 1,…, K } are normalized fingerprint sequences extracted from two video clips, respectively. Given a threshold T , if d ( P , Q ) ≤ T , the two video clips are considered as similar.
The performance of the algorithm can be measured by the false positive rate Pfp and the false negative rate Pfn , Pfp is a probability which treats different videos as same one, while Pfn is a probability which treats the same videos as dissimilar. Usually, Pfn is hard to obtain directly, thus, an evaluation method in [12] is used to estimate the Pfp and Pfn here. In fact, since the fingerprints are extracted independently from video clips, the distance d is subject to a normal distribution by the central limit theorem if NK is sufficiently large. Let dintra denote the intra distance between original video fingerprint sequence and its distortion version, and dinner denote the inner distance of fingerprint sequences between two different clips, then the distributions of these two distances can be obtained as follows:
PPT Slide
Lager Image
PPT Slide
Lager Image
where μ intra , σ intra , μ inner and σ inner are their means and standard deviations respectively. By maximum likelihood estimation, given the distance values di (1 ≤ i M ) of M fingerprint sequences, their means and variances can be estimated as follows:
PPT Slide
Lager Image
PPT Slide
Lager Image
The false positive rate Pfp can be written as follows:
PPT Slide
Lager Image
and the threshold T can be obtained by:
PPT Slide
Lager Image
where erfcinv (x) is the inverse function of complementary error function erfc (x).
Given a set of different Pfp , various thresholds T can be determined by equ.13 and equ.14, and then Pfn can be measured through the experiments using equ.15:
PPT Slide
Lager Image
where Nfn is the number of similar clip pairs whose d ( P , Q ) ≥ T , and the Ns is the true number of similar video clips.
3. Experiment Results
Our experiments were conducted on a database which contains more than 200 movies, which are collected from Internet. The total length of these videos is about 300 hours. In our experiment, each video clip is down-sampled to 10 fps (frames per second), and resized to 320 × 320. The fingerprint dimension N is 8, and the clip length is 10 seconds, i.e., K = 100.
- 3.1 Robustness Test
PPT Slide
Lager Image
Comparison between original video frame and the frames with various video attacks. (a) Original Frame, (b) Resizing to CIF, (c) Aspect ratio change (from 4:3 to 16:9), (d) Gaussian blurring with radius 1 pixel, (e) Global change in brightness (+25%), (f) Gamma correction (γ = 1.3), (g) Gaussian noise addtion, (h) Logo overlay in the upper-left by (9% and 16%), (i) Logo overlay in the middle by (9% and 16%), (j) Crop (14%), (k) Horizontal shift (10 pixel), (l) Frames mirroring vertically, (m) Rotation (1, 13, 20, and 40 degrees).
PPT Slide
Lager Image
Empirical inner and intra distance distributions for CSTGO and CGO algorithms. (a) The inner distance distributions of the two algorithms. (b) The intra and inner distance distributions of CSTGO.
In this test, the videos suffer twelve types of attacks. All attacks, together with the original frame, are illustrated in Fig. 5 .
Global attack (Type 1 to Type 6)
(1) Resizing to CIF,
(2) Aspect ratio change (from 4:3 to 16:9),
(3) Gaussian blurring with radius 1 pixel,
(4) Global change in brightness (+25%),
(5) Gamma correction (γ = 1.3),
(6) Gaussian noise addition,
Logo overlay (Type 7 and Type 8)
(7) Logo overlay in the upper-left by (9% and 16%),
(8) Logo overlay in the middle by (9% and 16%),
Geometric attack (Type 9 to Type 12)
(9) Crop (14%),
(10) Horizontal shift (10 pixel),
(11) Frames mirroring vertically,
(12) Rotation (1, 13, 20, and 40 degrees).
First, the empirical inner distance distributions of the CSTGO based algorithm and CGO based algorithm [12] are compared in Fig. 6 (a). 45082260 pairs of different fingerprint sequences are used to analyze the distributions. The mean and the variance of these inner distances in the CSTGO based algorithm are 1.9974 and 0.0127287, while these in the CGO based algorithm are 1.95693 and 0.120112, respectively. The variance of inner distances in the CSTGO based algorithm is smaller than that in the CGO based algorithm. It means that under a certain threshold T , the CSTGO based algorithm can achieve a lower Pfp than the CGO based algorithm. For example, if T is fixed to 0.5, then the expected Pfp in the CGO based algorithm is 1.3122×10 -5 , while that in the CSTGO based algorithm is 1.6759×10 -40 , which is much smaller than that in CGO. Hence, the proposed algorithm has a better discriminative feature than the CGO based algorithm does. Fig. 6 (b) shows the inner distance distribution and three different intra distance distributions in CSTGO. These intra distributions are calculated under three attacks: (1) Type 8, logo overlay in the middle of the frame by 16%, (2) Type 9, crop the frame, (3) Type 12, rotate the frame 40 degrees. As shown in Fig. 6 (b), the distances between the inner distance distribution and those three intra distance distributions are very large, so that the CSTGO based algorithm has a better robustness than the CGO based algorithm.
In Fig. 7 - Fig. 11 , about 130 hours of videos, together with their twelve attacked versions are used. Because the Pfp is usually hard to obtain in practice, here the Pfp is set from 10 -60 to 1, and the corresponding threshold T is determined by equ.14. Then the Pfn is obtained in equ.15, where Nfn , Ns is measured though the experiments, respectively. In these experiments, the robustness performances of the proposed CSTGO are compared with the weighted hashing (WH) method in [11] and CGO [12] under different attacks.
PPT Slide
Lager Image
The robustness performances of CSTGO compared with WH [11] and CGO [12] under global attacks. (a) Resizing to CIF, (b) Aspect ratio change (from 4:3 to 16:9), (c) Gaussian blurring with radius 1 pixel, (d) Global change in brightness (+25%), (e) Gamma correction (γ = 1.3), (f) Gaussian noise addition.
PPT Slide
Lager Image
The robustness performances of CSTGO compared with WH [11] and CGO [12] under logo insertions. (a) Logo overlay in the upper-left by 9%, (b) Logo overlay in the middle by 9%.
PPT Slide
Lager Image
The robustness performances of CSTGO under different logo insertions.
PPT Slide
Lager Image
The robustness performances of CSTGO compared with WH [11] and CGO [12] under geometric attacks. (a) Frame crop (14%), (b)Horizontal shift (10 pixel), (c) Rotation 1 degrees, (d)Frames mirroring.
PPT Slide
Lager Image
The robustness performances of CSTGO under different rotational degrees
Fig. 7 gives the robustness comparisons of above three algorithms under different global attacks. Since the proposed CSTGO uses the maximum of luminance differences as spatial derivative, brightness change and noise addition may influence the performance of the proposed CSTGO. However, a lower Pfn is achieved in the proposed algorithm than other two algorithms under a given Pfp , which means our algorithm is more robust against the global attacks than WH and CGO.
Fig. 8 shows the comparison of these three algorithms under different logo insertions. Our CSTGO achieves better performances than CGO and WH. Usually the logo is inserted in the marginal region. For example, when the logo is inserted in the upper-left corner of original frame, with the feature of temporal derivative and partition scheme, a low Pfn is achieved when Pfp is small because most of the influenced area where logo is inserted would be discarded, so the impact on the fingerprints is weak, as shown in Fig. 8 (a). Furthermore, when the logo is inserted in the middle region of the frame, the main visual content would be changed, so do video fingerprints under this attack. However, Fig. 8 (b) demonstrates that our method is still more robust to this attack than WH and CGO. In addition, Fig. 9 gives the robustness of proposed CSTGO under logo insertion attacks with different logo insertion positions and different logo size. We can see that our method can resist to this attack even when the inserted logo overlays 16% middle area of the original frame.
The comparisons of these three algorithms under different geometric attacks are shown in Fig. 10 . The figure demonstrates that our proposed CSTGO is the most robust algorithm among these three methods when videos suffer geometric transforms. Traditional block-based partition scheme like WH [11] and CGO [12] are fragile to some geometric attacks since these attacks may change the order of frame blocks. For example, Fig. 10 (d) illustrates that WH and CGO cannot resist to frame mirroring attack. Fig. 11 shows that with the feature of rotation-invariant spatial derivative and partition scheme in our CSTGO based approach, the generated fingerprints are immune to rotation even when frames are rotated up to 40 degrees.
- 3.2 Scalability Test
In this test, the scalability of the proposed algorithm is examined. Since the dimension of fingerprinting vector of each frame is 8, the k-d-tree [14] is used as the indexing structure in our database. In the matching stage, each fingerprint vector in the fingerprint sequence is used to retrieval its most similar fingerprint vector in the database via k-d-tree, and the potential video clips would be found in the database. The distances between the query clip and the clips found in the database are then calculated, and the clip with the smallest distance is selected as the retrieval result finally.
In the test, 18047 query clips were constructed, and the total length of these video clips is about 50 hours. In order to test the scalability, the size of the database varies from 50 hours to 300 hours. Fig. 12 shows that the precision of the retrieval result is able to maintain at 100% even if the size of database increases. It demonstrates that the proposed algorithm has good scalability.
PPT Slide
Lager Image
The scalability performances of CSTGO
4. Conclusion
In this paper, a new algorithm based on CSTGO is proposed for video fingerprint. The algorithm includes an improved spatial derivative, a new defined spatio-temporal gradient to represent the spatial and temporal information for each frame, and a new partition scheme based on concentric rings. Our investigation indicates that the feature of temporal derivative can be used to resist logo insertion and the combination of the spatial and temporal information in each frame can enhance the discrimination and robustness of the video fingerprints. Our experimental results have demonstrated that the proposed CSTGO based algorithm performs better that the state-of-the-art algorithms in terms of discrimination and robustness when the videos suffer the typical attacks, such as, global change, geometric attacks, and logo insertion. Our fingerprints are also robust against geometric attacks such as rotation and frame mirroring which traditional global feature based fingerprints cannot resist. The proposed CSTGO also has good scalability. The future work is to evaluate our CSTGO algorithm on some open video databases.
Acknowledgements
The authors would like to thanks Mr. Dennis D. Zhu and Mr. Guibo Luo for their kindly support and valuable comments on this paper. This work is supported by the 973 Program of China under Grant 2012CB315904.
BIO
Ziqiang Sun was born in 1987. He received his B.S. degree in Electronics Engineering from Lanzhou University in 2009. He is currently a Ph.D. candidate in School of Electronics Engineering and Computer Science, Peking University. His research interests include video fingerprint and multimedia technology.
Yuesheng Zhu received his B.Eng. degree in Radio Engineering, M.Eng. degree in Circuits and Systems and Ph.D. degree in Electronics Engineering in 1982, 1989 and 1996, respectively. He is currently working as a professor at the Lab of Communication and Information Security, Shenzhen Graduate School, Peking University. He is a senior member of IEEE, fellow of China Institute of Electronics, and senior member of China Institute of Communications. His interests include digital signal processing, multimedia technology, communication and information security.
Xiyao Liu was born in Hunan Province, in 1987. He received the B.S. degree from School of Electrical Engineering and Computer Sciences, Department of Microelectronics, Peking University in 2008. He is currently pursuing the Ph.D. degree in School of Electrical and Computer Engineering, Department of Micro-electronics and Solid-State Electronics, Peking University. His research interests include video fingerprint, video watermarking and other digital video process.
Liming Zhang received the B.S. degree from Nankai University, China, in 1987, the M.S. degree from the East China Institute of Technology, China, in 1990, and the Ph.D. in computer science (image processing) from the University of New England, Australia, 2002. She is currently working as an Assistance Professor in the Faculty of Science and Technology, University of Macau, in 2010. Her recent research interests including signal processing and image processing.
References
Lian S. , Nikolaos N. , Sencar H. T. 2010 “Content-based video copy detection — a survey,” Intelligent Multimedia Analysis for Security Applications vol. 282, Article (CrossRef Link) 253 - 273
Jian Lu 2009 “Video fingerprinting for copy identification: from research to industry applications,” Proc. of SPIE - Media Forensics and Security XI vol. 7254, Article (CrossRef Link)
Harris C. , Stevens M. 1988 “A combined corner and edge detector,” in Proc. of 4th Alvey Vision Conference Article (CrossRef Link) 153 - 158
Joly A. , Buisson O. , Frelicot C. 2007 “Content-based copy retrieval using distortion-based probabilistic similarity search,” IEEE Transactions on Multimedia Article (CrossRef Link) 9 (2) 293 - 306    DOI : 10.1109/TMM.2006.886278
Law-To J. , Buisson O. , Gouet-Brunet V. , Boujemaa N. 2006 “Robust voting algorithm based on labels of behavior for video copy detection,” in Proc. of ACM Multimedia Article (CrossRef Link) 835 - 844
Liu Z. , Liu T. , Shahraray B. 2009 “At&t research at TRECVID 2009 content-based copy detection,” in Proc. of TRECVID Article (CrossRef Link)
Lowe D. G. 2004 “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision Article (CrossRef Link) 60 (2) 91 - 110    DOI : 10.1023/B:VISI.0000029664.99615.94
Mohan R. 1998 “Video sequence matching,” in Proc. of Int. Conf. on Acoustics, Speech and Signal Processing vol. 6, Article (CrossRef Link). 3697 - 3700
Hampapur A. , Bolle R. 2002 “Comparison of sequence matching techniques for video copy detection,” in Proc. of Conf. on Storage and Retrieval for Media Databases Article (CrossRef Link) 194 - 201
Kim C. , Vasudev B. 2005 “Spatiotemporal sequence matching for efficient video copy detection,” IEEE Transactions on Circuits and Systems for Video Technology Article (CrossRef Link) 15 (1) 127 - 132    DOI : 10.1109/TCSVT.2004.836751
Sun J. , Wang J. , Zhang J. , Nie X. , Liu J 2012 “Video hashing algorithm with weighted matching based on visual saliency,” IEEE Signal Processing letters Article (CrossRef Link) 19 (6) 328 - 331    DOI : 10.1109/LSP.2012.2192271
Lee S. , Yoo C. 2008 “Robust video fingerprinting for content-based video identification,” IEEE Transactions on Circuits and Systems for Video Technology Article (CrossRef Link) 18 (7) 983 - 988    DOI : 10.1109/TCSVT.2008.920739
Radhakrishnan R. , Bauer C. 2009 “Video Fingerprinting Based on Moment Invariants Capturing Appearance and Motion,” in Proc. of Int. Conf. on Multimedia and Expo Article (CrossRef Link) 1532 - 1535
Robinson J. T. 1981 “The k-d-b-tree: A search structure for large multidimensional dynamic indexing,” in Proc. of ACM SIGMOD Int. Conf. Management Data Article (CrossRef Link) 10 - 18