Deep Learning Object Detection to Clearly Differentiate Between Pedestrians and Motorcycles in Tunnel Environment Using YOLOv3 and Kernelized Correlation Filters
Journal of Broadcast Engineering. 2019. Dec, 24(7): 1266-1275
Copyright © 2016, Korean Institute of Broadcast and Media Engineers. All rights reserved.
This is an Open-Access article distributed under the terms of the Creative Commons BY-NC-ND (http://creativecommons.org/licenses/by-nc-nd/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited and not altered.
  • Received : October 31, 2019
  • Accepted : December 17, 2019
  • Published : December 01, 2019
About the Authors
Sungchul Mun
Department of Smart City Research, Seoul Institute of Technology
sungchul.mun@sit.re.kr
Manh Dung Nguyen
Technical Research Institute, IVS Incorporation
Seokkyu Kweon
Technical Research Institute, IVS Incorporation
Young Hoon Bae
Chief Executive Officer, IVS Incorporation

Abstract
With increasing crime rates and growing numbers of CCTVs, much attention has been paid to intelligent surveillance systems. Object detection and tracking algorithms have been developed to reduce false alarms and to help security agents respond immediately to undesirable events in video streams, such as crimes and accidents. Many studies have proposed a variety of algorithms to improve the accuracy of detecting and tracking objects outside tunnels. However, these methods may not work well inside a tunnel, where the low illuminance is highly susceptible to the tail and warning lights of passing vehicles, and detection performance has rarely been tested in such an environment. This study investigated the feasibility of object detection and tracking in an actual tunnel environment by utilizing YOLOv3 and Kernelized Correlation Filters. We tested 40 actual video clips to evaluate how well our algorithm differentiates pedestrians from motorcycles. The experimental results showed a significant difference in detection between pedestrians and motorcycles, with no false positives. Our findings are expected to provide a stepping stone for developing efficient detection algorithms suitable for tunnel environments and to encourage other researchers to collect reliable tracking data for smarter and safer cities.
Keywords
I. Introduction
Smart City has emerged as an alternative solution to recently raised issues (e.g., aging infrastructure, an aging society, and increasing crime rates) caused by the exponential growth of urbanization and population [1] . Because of the all-connected ICT concept in Smart City, much attention has been given to increasing the safety level of surveillance systems for a sustainable Smart City. To enhance safety, it is indispensable to instantly warn the security agents in charge of the surveillance system when accidents or crimes occur. The characteristics of extensive camera systems and real-time algorithms for processing video data that reflect real-world settings should be considered when developing such a selective warning system.
Many studies have proposed essential algorithms for real-time surveillance systems [2 - 9] . Xiong et al. [2] proposed an algorithm for recognizing different appearances of the same person under dynamic circumstances. They applied a multiple deep metric learning method with modified Softmax regression models, which can calculate the probabilities of differences in appearance of the same person. Sharif et al. [3] addressed the problem of selecting robust features for recognizing human actions with an innovative method utilizing multi-class correlation and Euclidean distance techniques. A big data framework and a camera network system suitable for video surveillance were proposed by Subudhi et al. [4] and Lee and Kim [6] for efficiently storing, retrieving, processing, and analyzing the enormous amount of data coming from real-world security environments. Auto-tracking algorithms using multiple cameras have also been proposed to reflect the characteristics of real-world surveillance systems [5 , 10 - 14] .
However, previous studies have focused heavily on recognizing and classifying multiple objects in normal situations on the road or in public spaces. In other words, although many studies have proposed real-time warning systems, it remains to be seen how researchers and developers address the challenges of recognizing, tracking, and re-identifying specific objects under harsh real-world conditions. To develop an intelligent surveillance system covering the whole range of a city, vehicles and pedestrians in tunnels must be detected and tracked while minimizing false positives and lost object trajectories. Few attempts have been made to develop object detection and tracking techniques for real-world tunnel environments. In the future Smart City, considering the specific characteristics of the core sensing technologies in autonomous vehicles, unexpected accident rates might be higher in tunnel environments than in normal driving conditions (except for snowy road conditions). This study is novel in that we propose a method to clearly reduce the false-positive rate between motorcycles and pedestrians in tunnel environments. High false-positive rates in tunnels are largely attributable to contamination effects from sudden changes in illumination intensity caused by headlights reflected on the wall (or on the road) and by the light intensity distributions of tunnel lamps.
In short, this study aimed to develop an improved technique for detecting pedestrians in a tunnel environment with higher accuracy than existing methods, e.g., plain YOLO-based pedestrian detection, which sometimes falsely detects a motorcycle as a pedestrian.
II. Methods
Figure 1 illustrates our algorithm. It consists of four major steps: detecting objects, tracking objects, calculating the smoothness of each object's trajectory, and finally classifying each object as a pedestrian or a motorcycle.
Fig. 1. Algorithm flow chart
In this paper, we propose a pedestrian detection method that combines a deep-learning-based detector with object tracking. We employed YOLOv3 [15] as the detector. YOLO is a widely used deep-learning algorithm that performs detection (locating an object) and classification concurrently. In the YOLOv3 network architecture, the input layer receives an image frame and the outputs are prediction feature maps containing the attributes of bounding boxes.
The entire image is divided into a grid of cells, and the bounding boxes that the cells can predict are generated. The attributes of a bounding box include its rectangle coordinates, an objectness score, and class scores. Once a pedestrian object is detected by YOLOv3, the proposed mechanism starts tracking the object in subsequent frames to determine whether it is a pedestrian or a motorcycle ridden by a human. Let the nth frame be the frame in which an object is detected. A bounding box is defined such that its area contains the object with minimal background.
We trained the YOLOv3 model to detect humans only, where humans include pedestrians, humans riding motorcycles, and humans riding bicycles. The dataset included over 50,000 images, which were extracted from the COCO dataset; some of them were manually labeled to enhance the detection rate in tunnels.
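As a concrete illustration of this detection step, the following Python sketch runs a Darknet-format YOLOv3 model through the OpenCV DNN module and keeps only detections of the "person" class. The file names, input size, and thresholds are illustrative assumptions, not the exact configuration used in this study.

```python
import cv2
import numpy as np

# Hypothetical file names; the weights used in this study were trained on a
# COCO-derived dataset restricted to humans.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_persons(frame, conf_thresh=0.5, nms_thresh=0.4):
    """Return person bounding boxes [x, y, w, h] detected by YOLOv3."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)

    boxes, confidences = [], []
    for out in outputs:                      # one output per YOLO scale
        for det in out:                      # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(det[4] * scores[class_id])
            if class_id == 0 and conf > conf_thresh:   # class 0 = "person" in COCO ordering
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(conf)

    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```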
In the (n+1)th frame, the same object was detected again using YOLOv3. An object detected in the (n+1)th frame was defined as the same as the original object if its bounding box overlapped that of the original object and the color histogram of its bounding box matched that of the original object. Figure 2 illustrates the overlapping bounding boxes of the same object in two consecutive frames. In Figure 2b, the red bounding box in the center is the bounding box of a human in the (n+1)th frame, while the green bounding box in the center of Figure 2a is the bounding box of the same human in the nth frame. These two bounding boxes overlapped by more than 90%. Two bounding boxes $B_n$ and $B_{n+1}$ in two consecutive frames were defined to belong to the same object if they met the following conditions:

Fig. 2. Bounding box overlap tracking: (a) the nth frame and (b) the (n+1)th frame

$$\frac{S(B_n \cap B_{n+1})}{S(B_n)} \geq \alpha \tag{1}$$

$$\frac{S(B_n \cap B_{n+1})}{S(B_{n+1})} \geq \alpha \tag{2}$$

$$d(H_n, H_{n+1}) \geq \beta \tag{3}$$

where S is the area of a bounding box, $H_n$ and $H_{n+1}$ are the color histograms of the sub-images inside $B_n$ and $B_{n+1}$, $d(H_n, H_{n+1})$ is the cross correlation of the two histograms, and α and β are tuning thresholds determined through experiment.

The cross correlation of two histograms was calculated using the following equation:

$$d(H_1, H_2) = \frac{\sum_{i}\bigl(H_1(i)-\bar{H}_1\bigr)\bigl(H_2(i)-\bar{H}_2\bigr)}{\sqrt{\sum_{i}\bigl(H_1(i)-\bar{H}_1\bigr)^2 \sum_{i}\bigl(H_2(i)-\bar{H}_2\bigr)^2}}$$

where

$$\bar{H}_k = \frac{1}{N}\sum_{j} H_k(j)$$

and N is the number of histogram bins.
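A minimal Python sketch of this matching step is shown below, using OpenCV's histogram comparison (cv2.HISTCMP_CORREL implements the correlation measure above). The histogram channels and the values of α and β are illustrative assumptions; the thresholds in this study were tuned experimentally.

```python
import cv2

def overlap_area(a, b):
    """Area of the intersection of two boxes given as [x, y, w, h]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def color_histogram(frame, box, bins=32):
    """HSV hue/saturation histogram of the sub-image inside a bounding box."""
    x, y, w, h = box
    roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([roi], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def same_object(box_n, box_n1, hist_n, hist_n1, alpha=0.5, beta=0.7):
    """Conditions (1)-(3): overlap ratios above alpha, histogram correlation above beta.
    alpha and beta here are illustrative values only."""
    inter = overlap_area(box_n, box_n1)
    s_n = box_n[2] * box_n[3]
    s_n1 = box_n1[2] * box_n1[3]
    # cv2.HISTCMP_CORREL computes the normalized cross correlation d(H1, H2)
    corr = cv2.compareHist(hist_n, hist_n1, cv2.HISTCMP_CORREL)
    return inter / s_n >= alpha and inter / s_n1 >= alpha and corr >= beta
```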
If the same object was not detected in the (n+1)th frame, i.e., the object was lost, it was searched for in the (n+1)th frame using a tracking algorithm. In this work, we used the KCF (Kernelized Correlation Filters) algorithm [16] for tracking. In other words, we used YOLOv3 to detect pedestrians in every frame and an object matching step to match objects across two consecutive frames, applying equations (1)-(3) to decide whether two objects in consecutive frames match. If an object found in the current frame has no match in the next frame, YOLOv3 has failed to detect it there; in that case, we used the KCF algorithm to rediscover the object. If KCF also cannot find the object in the next frame, the object is considered to have disappeared.
Figure 3 illustrates how KCF is used to predict the location of the object in the next frame. The position with the highest correlation score in the confidence map is taken as the centroid of the object in the next frame.
Fig. 3. Kernelized Correlation Filters
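The sketch below ties the detection, matching, and KCF fallback together for a single object. It assumes the detect_persons, color_histogram, and same_object helpers sketched above; the KCF constructor's location differs across OpenCV builds (cv2.TrackerKCF_create in older opencv-contrib releases, cv2.legacy.TrackerKCF_create in newer ones), and the exact bookkeeping used in this study is not described here, so this is only an assumed structure.

```python
import cv2

def create_kcf():
    """KCF tracker factory; the constructor's location differs across OpenCV builds."""
    if hasattr(cv2, "TrackerKCF_create"):
        return cv2.TrackerKCF_create()
    return cv2.legacy.TrackerKCF_create()

def track_object(frames, first_box):
    """Follow one detected person, falling back to KCF whenever YOLOv3 plus
    the matching conditions (1)-(3) lose the object in a frame."""
    trajectory = [(first_box[0] + first_box[2] / 2.0, first_box[1] + first_box[3] / 2.0)]
    prev_frame, prev_box = frames[0], first_box
    prev_hist = color_histogram(frames[0], first_box)

    for frame in frames[1:]:
        box = None
        for cand in detect_persons(frame):
            if same_object(prev_box, cand, prev_hist, color_histogram(frame, cand)):
                box = cand
                break
        if box is None:
            # Detector lost the object: re-initialise KCF on the previous frame
            # and let it predict the object's location in the current frame.
            tracker = create_kcf()
            tracker.init(prev_frame, tuple(prev_box))
            ok, kcf_box = tracker.update(frame)
            if not ok:
                break                       # object considered disappeared
            box = [int(v) for v in kcf_box]
        trajectory.append((box[0] + box[2] / 2.0, box[1] + box[3] / 2.0))
        prev_frame, prev_box, prev_hist = frame, box, color_histogram(frame, box)
    return trajectory
```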
In the next step, we calculated the object's trajectory to determine whether it was a pedestrian or a motorcycle ridden by a human. An object's trajectory was defined as {c(0), c(1), …, c(L)}, where c(i) is the center of the object's bounding box in the ith frame. Given the object's bounding box in the current frame, KCF can predict the bounding box, and thus the centroid c(i), in the next frame. Next, we defined two distances, dTotal and dReal, as follows:

$$d_{Total} = \sum_{i=1}^{L} \left\lVert c(i) - c(i-1) \right\rVert$$

$$d_{Real} = \left\lVert c(L) - c(0) \right\rVert$$

and then we defined the deviation, Dev, as follows:

$$Dev = \frac{d_{Total}}{d_{Real}}$$

Finally, we defined a criterion to determine whether an object was a pedestrian or a motorcycle: an object was classified as a pedestrian if

$$Dev > \delta$$
where δ is a threshold determined by analyzing real trajectories obtained from actual video footage. The rationale for this criterion is that a pedestrian's trajectory is not as smooth as that of a motorcycle, as shown in Figure 4. The algorithm is aimed at detecting pedestrians in tunnels. We used the YOLOv3 algorithm to detect pedestrians, but YOLOv3 relies only on shape information, so objects with shapes very similar to pedestrians, such as humans riding motorcycles or bicycles, are often falsely detected. To reduce these false detections, we developed a classification algorithm based on the smoothness of the object trajectory. For the comparative analyses, we used an Intel Core i7 PC with an NVIDIA RTX 2080 graphics card. The proposed algorithm can be applied to tunnel or highway environments, but it currently works only for straight sections; for curved sections, other equations would have to be devised to measure the smoothness of the object trajectory.
Fig. 4. Trajectories of a pedestrian and a motorcycle
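The deviation measure and classification rule can be expressed compactly. The sketch below assumes the trajectory is a list of bounding-box centers; the threshold value δ = 1.1 is only an illustration drawn from the observation in the Results section that pedestrian Dev stayed above 1.1.

```python
import numpy as np

def trajectory_deviation(centers):
    """Dev = dTotal / dReal for a list of bounding-box centers {c(0), ..., c(L)}."""
    c = np.asarray(centers, dtype=float)
    d_total = np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1))   # path length
    d_real = np.linalg.norm(c[-1] - c[0])                          # straight-line displacement
    return d_total / d_real if d_real > 0 else float("inf")

def classify(centers, delta=1.1):
    """Pedestrian if Dev exceeds delta; delta = 1.1 is an illustrative choice."""
    return "pedestrian" if trajectory_deviation(centers) > delta else "motorcycle"
```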
III. Results
We used 40 videos to evaluate our algorithm: 20 videos containing pedestrians and 20 videos containing motorcycles. The duration of each video is about one minute. We analyzed each object's history over 5 seconds to measure the smoothness of its trajectory. In other words, the proposed algorithm takes 5 seconds to identify whether an object is a pedestrian or a motorcycle, since the decision is based on the object's tracking information over those 5 seconds.
The experimental results showed that pedestrians and motorcycles can be distinguished effectively by comparing the smoothness of their movement trajectories. For a motorcycle, Dev is close to 1, whereas for a pedestrian, Dev is much greater than 1. Figure 5 shows some examples from our experiment, and Figure 6 summarizes the results. The graph in Figure 6 shows that Dev is always greater than 1.1 for pedestrians, while for motorcycles it is almost equal to 1.
Fig. 5. Experiment results: sample videos
Fig. 6. Experimental summary
To assess the statistical reliability of our results, Wilcoxon's Matched-Pairs Signed-Ranks test was used to compare the detection performance between pedestrians and motorcycles because the normality assumption was not tenable (p < .001). The analysis showed a significantly lower Dev for motorcycle detection than for pedestrian detection (Z = −3.920, p < .001, r = 0.876), indicating a large effect size (by Cohen's (1988) criteria of .1 = small, .3 = medium, .5 = large).
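For reference, this test and effect size can be computed in Python with SciPy as sketched below. The Dev values are randomly generated stand-ins, since the per-video values are not reported here; the effect size follows the convention r = |Z| / √N with N = 20 pairs.

```python
import numpy as np
from scipy import stats

# Hypothetical Dev values for the 20 pedestrian and 20 motorcycle videos
# (the actual per-video values are not reported in this paper).
dev_pedestrian = np.random.default_rng(0).uniform(1.15, 1.8, size=20)
dev_motorcycle = np.random.default_rng(1).uniform(0.98, 1.05, size=20)

# Wilcoxon matched-pairs signed-ranks test on the paired differences.
res = stats.wilcoxon(dev_pedestrian, dev_motorcycle)
print(f"W = {res.statistic:.1f}, p = {res.pvalue:.4f}")

# Effect size r = |Z| / sqrt(N), recovering |Z| from the two-sided p-value.
n_pairs = len(dev_pedestrian)
z = stats.norm.isf(res.pvalue / 2)
r = z / np.sqrt(n_pairs)
print(f"|Z| = {z:.3f}, r = {r:.3f}")
```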
The experimental results showed that the proposed algorithm classified pedestrians and motorcycles with a 100% success rate. In fact, motorcycles and bicycles rarely follow zigzag trajectories because they move at higher speeds; the trajectory of a motorcycle or bicycle is usually a line or a gentle curve. In contrast, the trajectory of a pedestrian resembles a zigzag, so the deviation of a pedestrian is higher than that of a motorcycle or bicycle. In the performance test, we ran our experiment on an Intel Core i7 machine with Windows 10 and an NVIDIA RTX 2080. Our system processed 60 frames per second on average. The experiment also showed that the proposed algorithm requires a minimum of 1.8 GB of CPU RAM and 2.5 GB of GPU RAM.
IV. Discussion and Conclusion
In this study, we proposed a combined model to accurately detect and track pedestrians in a tunnel environment using YOLOv3, a Kernelized Correlation Filter, and empirical rules derived from real-world settings. Our study is significant in that it considerably reduces pedestrian detection errors in a tunnel environment, where deep learning algorithms widely used for object detection, such as YOLOv3 and SSD, often falsely detect motorcycles and bicycles as pedestrians.
Recently, many studies have proposed real-time tracking algorithms and advanced optical flow descriptors to track magnitude changes in pixels between frames of a video clip. Padmalatha et al. [9] investigated the possibility of detecting violent behaviors in real time using a revised Violent Flow Descriptor in which the AdaBoost algorithm was used to improve the classification of contributing features and the detection accuracy. Hasan et al. [17] extracted conventional spatio-temporal local features and attempted to detect anomalous events by training a fully connected autoencoder, but the proposed model could predict regular patterns only under limited supervision. These previous studies tested their models only on detecting moments of change in optical flow characterizing violent behaviors that mainly occur in public spaces. In actual surveillance environments, accidents, fires, and crimes frequently occur in tunnels, and an intelligent surveillance system suitable for the tunnel environment should be developed to ensure that the golden time is not missed. Although the methods we proposed do not include all of the technical components of a real-time surveillance system for tunnels, our study is meaningful in that it provides a stepping stone for future studies aimed at obtaining reliable tracking data and developing optical flow descriptors for tunnel environments. As a matter of fact, there are 2,566 tunnels (1,896 km) throughout the Republic of Korea [18] . Thus, further studies are encouraged to examine the feasibility of detecting optical flows of anomalous behaviors and accidents in tunnel environments for a selective intelligent surveillance system that monitors harsh real-world settings (e.g., dead zones or tunnels).
Such a system will help security agents respond quickly to accidents or crimes, thereby reducing accident and crime rates in a Smart City environment. However, it should be noted that over-specified deep learning servers for intelligent CCTVs, which cause heavy computational loads, time-consuming repeated processing, and excessive system costs, should be avoided for the rapid extension of value chains in the intelligent CCTV ecosystem. Although it depends heavily on environmental factors, appropriate technology, rather than cutting-edge technology, may in some cases be more effective in accurately detecting unexpected accidents or abnormal behaviors.
A limitation of our study is that the number of comparative trials was not sufficient to fully validate the experimental results, because we used actual video footage taken in a real-world tunnel environment. It is very difficult to collect video clips from real-world settings due to the Personal Information Protection Act applied to CCTVs.
This work was supported by Seoul Institute of Technology (SIT) (19-4-5, Development of Crime Detection Technology on CCTV).
BIO
Sungchul Mun
- Feb. 2012 : M.S. Emotion Engineering, Sangmyung University
- Feb. 2015 : Ph.D. HCI & Robotics (Data Science for Ergonomics), Korea Institute of Science and Technology (UST)
- Mar. 2015 ~ Nov. 2016 : Post Doc. National Agenda Research Division, Korea Institute of Science and Technology
- Dec. 2016 ~ Jun. 2017 : Chief Research Engineer, Strategy Business Office, Golfzon Newdin Group
- Jul. 2017 ~ Jun. 2019 : General Manager, Future Engine Lab., CJ Hello
- Jun. 2019 ~ Present : Chief Research Scientist, Department of Smart City Research, Seoul Institute of Technology
- Research interest : Deep Learning, Video Analytics, Human Factors, HCI, Smart Healthcare
Manh Dung Nguyen
- Feb. 2009 : M.S. Information and Communication, Kongju University
- Dec. 2019 : Ph.D. Information and Communication, Kongju University
- Feb. 2011 ~ Present : Senior Software Engineer, IVS Inc.
- Research interest : Deep Learning, Sound Analytics, Video Analytics
Seokkyu Kweon
- Feb. 1991 : M.S. Electronics, Seoul National University
- Dec. 1998 : Ph.D. Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
- May 2000 ~ Feb. 2003 : Software Engineer, Cisco Systems
- Mar. 2003 ~ Feb. 2013 : Team Leader, Samsung Electronics
- May. 2013 ~ Sep. 2015 : VP of Software Development, Samsung Techwin
- Sep. 2015 ~ Dec. 2018 : Head of R&D Center, SMC Networks
- Dec. 2018 ~ Present : Head of R&D Center, IVS Inc.
- Research interest : Deep Learning, Sound Analytics, Video Analytics
Young Hoon Bae
- Feb. 1981 : M.S. Mechanical Design Department, Seoul National Graduate school
- Oct. 1981 ~ Aug. 1986 : Team leader, Hyundai Engineering ICAE Team
- Jun. 1986 ~ Jun. 1991 : Team leader, Samsung Engineering ICAE Team
- Jun. 1991 ~ Aug. 1998 : General Manager, Samsung SDS CAD/CAM Division
- Mar. 2008 ~ Mar. 2010 : CEO, Kiryung Electronics Co., Ltd.
- May 2010 ~ Present : CEO, IVS Inc.
- Research interest : Deep Learning, Sound Analytics, Video Analytics
References
[1] Silva B. N., Khan M., Han K., 2018, Towards sustainable smart cities: A review of trends, architectures, components, and open challenges in smart cities, Sustainable Cities and Society, 38, 697-713. DOI: 10.1016/j.scs.2018.01.053
[2] Xiong M. F., Chen D., Chen J., Chen J. Y., Shi B. Y., Liang C., Hu R. M., 2019, Person re-identification with multiple similarity probabilities using deep metric learning for efficient smart security applications, Journal of Parallel and Distributed Computing, 132, 230-241. DOI: 10.1016/j.jpdc.2017.11.009
[3] Sharif A., Khan M. A., Javed K., Umer H. G., Iqbal T., Saba T., Ali H., Nisar W., 2019, Intelligent Human Action Recognition: A Framework of Optimal Features Selection based on Euclidean Distance and Strong Correlation, Control Engineering and Applied Informatics, 21, 3-11.
[4] Subudhi B. N., Rout D. K., Ghosh A., 2019, Big data analytics for video surveillance, Multimedia Tools and Applications, 78, 26129-26162. DOI: 10.1007/s11042-019-07793-w
[5] Iguernaissi R., Merad D., Aziz K., Drap P., 2019, People tracking in multi-camera systems: a review, Multimedia Tools and Applications, 78, 10773-10793. DOI: 10.1007/s11042-018-6638-5
[6] Lee G. Y., Kim H. J., 2019, Optimum Configuration of Surveillance Camera System Based on Real Time Image Recognition Server, The Journal of Korean Institute of Communications and Information Sciences, 44, 1124-1127. DOI: 10.7840/kics.2019.44.6.1124
[7] Chandran A. K., Poh L. A., Vadakkepat P., 2019, Real-time identification of pedestrian meeting and split events from surveillance videos using motion similarity and its applications, Journal of Real-Time Image Processing, 16, 971-987. DOI: 10.1007/s11554-016-0584-0
[8] Lotfi M., Motamedi S. A., Sharifian S., 2019, Time-based feedback-control framework for real-time video surveillance systems with utilization control, Journal of Real-Time Image Processing, 16, 1301-1316. DOI: 10.1007/s11554-016-0637-4
[9] Padmalatha E., Sekhar K. A. S., Mudiam D. R. R., 2019, Real Time Analysis of Crowd Behaviour for Automatic and Accurate Surveillance, International Journal of Advanced Computer Science and Applications, 10, 492-496.
[10] Eshel R., Moses Y., 2010, Tracking in a Dense Crowd Using Multiple Cameras, International Journal of Computer Vision, 88, 129-143. DOI: 10.1007/s11263-009-0307-0
[11] Roth P. M., Leistner C., Berger A., Bischof H., 2010, Multiple instance learning from multiple cameras, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 13-18 June 2010, 17-24.
[12] Chen A. T. Y., Biglari-Abhari M., Wang K. I. K., 2019, Investigating fast re-identification for multi-camera indoor person tracking, Computers & Electrical Engineering, 77, 273-288. DOI: 10.1016/j.compeleceng.2019.06.009
[13] Sun C. C., Sheu M. H., Chi J. Y., Huang Y. K., 2019, A Fast Non-Overlapping Multi-Camera People Re-Identification Algorithm and Tracking Based on Visual Channel Model, IEICE Transactions on Information and Systems, E102D, 1342-1348.
[14] Previtali F., Bloisi D. D., Iocchi L., 2017, A distributed approach for real-time multi-camera multiple object tracking, Machine Vision and Applications, 28, 421-430. DOI: 10.1007/s00138-017-0827-5
[15] Redmon J., Farhadi A., 2018, YOLOv3: An Incremental Improvement, arXiv preprint arXiv:1804.02767.
[16] Li Y., Zhu J., 2015, A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration, Lecture Notes in Computer Science, 8926, 254-265.
[17] Hasan M., Choi J., Neumann J., Roy-Chowdhury A. K., Davis L. S., 2016, Learning Temporal Regularity in Video Sequences, arXiv preprint arXiv:1604.04574.
[18] KOSIS, National Tunnel Statistics, http://kosis.kr/statHtml/statHtml.do?orgId=116&tblId=DT_MLTM_1040