
Real-time monitoring of traffic parameters


This study deals with the problem of obtaining high-quality real-time data on road traffic parameters from static street video surveillance cameras. Existing road traffic monitoring solutions rely on traffic cameras located directly above the carriageways, which yield only fragmentary data on the speed and movement patterns of vehicles. The purpose of the study is to develop a system for the high-quality and complete collection of real-time data, such as traffic flow intensity, driving directions, and average vehicle speed. The data is collected within the entire functional area of intersections and adjacent road sections that fall within the field of view of the street video surveillance camera. Our solution is based on the YOLOv3 neural network architecture and the open-source SORT tracker. To train the neural network, we annotated 6000 images and performed augmentation, which allowed us to form a dataset of 4.3 million vehicles. The basic performance of YOLO was improved by adding a mask branch and optimizing the shape of the anchors. To determine the vehicle speed, we used a perspective transformation of coordinates from the original image to geographical coordinates. Testing of the system at night and in the daytime at six intersections showed a vehicle counting accuracy of no less than 92%. The error in determining the vehicle speed by the projection method, taking into account the camera calibration, did not exceed 1.5 km/h.


Urbanization leads to a significant growth of the population density and road traffic concentration in large cities. This increases the likelihood of traffic accidents and road congestion and leads to increased vehicle emissions. Under urban infrastructural constraints, the task of ensuring adequate population mobility can no longer be solved through non-optimal heuristics based on a small amount of statistical information. Intelligent transport systems (ITS) of cities should ensure the maximum capacity of the road network and instantly respond to any traffic incidents to prevent road congestion. Currently, cities experience a rapid growth of video surveillance systems, which include video cameras with different resolutions, fixed frame rates, and mounting points [1]. Continuous monitoring of quantitative and qualitative road traffic parameters from fixed cameras will allow us to use vehicles as indicators of the transport system performance. The most frequently reported issues when processing real-time data from street cameras are low counting accuracy, classification of only a limited number of vehicle types, and difficulty in tracking an object while determining its speed and driving direction in all sections as it crosses the functional zone of the intersection. Despite the obvious advantages of developing such systems, there are few studies aimed at collecting and analyzing the speed and movement pattern of traffic flows through the use of survey street cameras [2]. Artificial neural networks have proven effective in the tasks of collecting, interpreting, and analyzing big data coming from video cameras [3].

Some studies [4, 5] use low-resolution video surveillance data and deep neural networks to count vehicles on the road and estimate traffic density. Examples of using conventional machine vision methods are the systems developed in [6, 7], which analyzed the problems of freight traffic. To detect vehicles, most recent works adapt and improve modern detection systems, such as Faster R-CNN [8], YOLO [9], and SSD [10]. This includes architectural innovations addressing scale sensitivity [11], vehicle classification [12,13,14], and the speed and accuracy of the detection methods [15, 16]. To improve the detection rate [17], temporal information is also used for joint detection and tracking of objects [3, 18, 19].

The existing solutions in the problems of real-time vehicle detection and classification require large computing capabilities and place strict requirements for the installation location and camera performance.

Related work

Object detection

Neural network architectures can be conditionally divided into single-stage (RetinaNet, YOLO, SSD) and two-stage (R-CNN, Faster R-CNN, etc.) [20]. The main difference between these approaches is that the two-stage models generate candidate regions at the first stage and classify them at the second stage. This approach gives higher accuracy at the cost of the image processing speed. The single-stage approach generates and classifies regions in one stage, which provides a high image processing speed but lower accuracy. One of the main factors in this work is the image processing speed; therefore, we considered single-stage networks.

To solve the problem of real-time object recognition, we considered the following neural networks: SSD, YOLO v3, RetinaNet, etc. After studying the performance tests [21, 22], we concluded that YOLO v3 shows the best result, processing one image in 51 ms at a resolution of 608 × 608, which corresponds to about 19 frames per second. For the real-time object detection task, YOLO v3 processes the largest number of frames per second without losing much accuracy (Fig. 1).

Fig. 1

Performance tests of neural networks on COCO dataset

An important feature of this architecture is that convolution layers are applied to the image once, unlike such architectures as R-CNN [23,24,25] and Faster R-CNN [8], which provides a multiple increase in the image processing speed without significant losses in accuracy: one image is processed 1000 times faster using YOLO than R-CNN, and 100 times faster than Fast R-CNN [24].

Speed detection

The complexity of the task of determining the speed of vehicles on the video stream is caused by a large number of possible movement patterns, as well as the direction of the camera view center, which is not perpendicular to the movement patterns of vehicles. Several existing solutions are based on the use of traffic cameras located directly above the carriageway or on the side of it [26, 27]. In [28], the authors manually marked the measurement zone in the camera image. It is a rectangular area perpendicular to the traffic flow. In each frame, the Liang–Barsky algorithm checks the intersection of a vehicle with the measurement zone and counts the number of frames, over which the vehicle passed the measurement zone. Thus, the speed is defined as a ratio of the distance traveled to the travel time. In [29], the authors define the vehicle contour. Using the developed optical flow method, they determine the movement speed of the contour pixels. By adjusting the focal distance, angle, and height of the camera installation, the authors highlight the area of interest in the image so that it is equal to the width of the image. Thus, the vehicle speed (km/h) is calculated from the ratio between the image pixels and the road width.

The considered methods are focused on measuring the speed in preset zones with the known dimensions and traffic cameras located above the road and at a low height, which does not allow us to use them to collect data over the entire functional area of road junctions.

We propose a method to determine the average speed based on the coordinate mapping from the camera image to the space of geographic coordinates using a perspective transformation.


The purpose of this work is to develop an autonomous approach able to assess the quantitative and qualitative parameters of road traffic, such as the amount, speed and movement pattern of vehicles. To this end, we divide the problem into four sub-tasks: detection and classification, tracking, counting and determining the average vehicle speed. This naturally leads to a modular and easily testable architecture consisting of indicator detection, tracking, and calculation modules. In the following sections, we will describe in detail each module together with the data collected for training and assessment. Figure 2 shows an algorithm of obtaining the data on the driving directions and average speeds of vehicles.

Fig. 2

An algorithm for determining the average speed and direction of vehicles

The first module takes every third frame of the video stream and obtains object predictions using YOLOv3. Once the bounding boxes are received, the objects must be identified by comparison with the data from previous frames so that the average speed and the driving directions can be determined. To train the neural network, we collected a dataset from street surveillance cameras in Chelyabinsk. To track vehicles, we used the SORT tracker because it offers a good compromise between speed and accuracy [30].

Detection of vehicles

Our approach is based on the use of static cameras whose viewing angle provides visibility of the entire physical territory of the intersection and adjacent roads.

We used several freely accessed cameras of Intersvyaz company in the cities of Chelyabinsk [31] and Tyumen. We chose the cameras with a viewing angle providing visibility of the entire functional area of the intersection and adjacent roads. The cameras are located at a height of 14–40 m, with an elevation angle of 30–60° to the horizon. The video streams of these cameras provide a stable transmission of 25 frames per second, supporting a resolution of 1920 × 1080 pixels. At the same time, the video stream is not perfect due to compression artifacts, blurring, bad weather conditions, and hardware errors, which prevents the detection and classification of vehicles, as well as the determination of speed indicators using the existing methods.

We collected and tagged frames of video streams from 7 cameras at various road junctions as the data for training the neural network. As a result, we obtained about 6000 images with over 430,000 annotated vehicle objects (Fig. 3).

Fig. 3

Examples of an input image

The indexation of the classes and their corresponding colors further used to display the detection results are presented in Table 1.

Table 1 Indexation of the classes and their corresponding colors

The input data are presented as follows: a JPG or PNG image and a text file with marking:

In Fig. 4, i is the object number; n is the number of objects in the image; Ci is the index of the class of the i-th object; Xi, Yi are the coordinates of the center of the rectangle containing the object; Wi, Hi are the width and height of the rectangle containing the object.

Fig. 4

The input data

The parameters Xi, Yi, Wi, Hi are recorded relative to the image size (\( X_{i}, Y_{i}, W_{i}, H_{i} \in \left[ {0;1} \right] \)).
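The marking format described above can be illustrated with a minimal sketch that converts the relative values back to pixel coordinates. It assumes that each object occupies one line of the text file in the order Ci Xi Yi Wi Hi, separated by whitespace; this layout and the function name `parse_yolo_labels` are assumptions made for illustration and are not taken from the original system.

```python
def parse_yolo_labels(text, img_w, img_h):
    """Parse marking lines 'C X Y W H' (assumed layout) into pixel boxes.

    Returns a list of (class_index, left, top, right, bottom) tuples.
    """
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        if len(parts) != 5:
            raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
        cls = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
        # All four values are relative to the image size, so they lie in [0; 1].
        if not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):
            raise ValueError(f"relative value outside [0; 1]: {line!r}")
        # (X, Y) is the center of the rectangle; W and H are its full size.
        left = (x - w / 2) * img_w
        right = (x + w / 2) * img_w
        top = (y - h / 2) * img_h
        bottom = (y + h / 2) * img_h
        boxes.append((cls, left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # One object of class 0 centered in a 1920 x 1080 frame.
    print(parse_yolo_labels("0 0.5 0.5 0.25 0.5", 1920, 1080))
```

Storing the values relative to the image size keeps the marking valid when the frame is resized before being fed to the network.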

For better training of the neural network, we expanded the dataset by applying augmentation, which increased its size tenfold. For augmentation, we applied the following transformations in various combinations: horizontal flipping; affine and perspective transformations; noise overlay; color distortion (Fig. 5).

Fig. 5

Augmented images

The final dataset amounted to 4.3 million objects. The distribution of objects of each class in the training sample is presented in Table 2.

Table 2 Distribution of vehicle classes in the training sample

We divided the dataset into training and validation samples in the ratio of 80/20 and trained for 50,000 iterations with a learning rate of 0.001. The batch size per iteration was 64 images, split into 16 subdivisions during training so that several images are processed at once.

Training of YOLOv3

The architecture of the YOLOv3 neural network consists of 106 layers (Fig. 6) and is a modification of the Darknet-53 neural network, which includes 53 layers (Fig. 7). Besides, it includes 53 more layers with two N-dimensional output layers, which allows us to make detections at three different scales. This modification contributes to a more accurate recognition of objects of various sizes. As input data, YOLOv3 accepts an image presented as a three-dimensional tensor of h × w × 3, where h and w are the height and width of the input image. The dimensionality of the output layers is determined by reducing the size of the input image by 32, 16, and 8 times, respectively (Fig. 6).

Fig. 6

The architecture of YOLO v3 [21]

Fig. 7

The architecture of Darknet-53 [21]
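The relation between the input size and the three output scales described above can be checked with a short sketch. The function name `yolo_output_grids` is hypothetical, and the requirement that the input size be a multiple of 32 is an assumption made here so that all three reductions give whole numbers.

```python
def yolo_output_grids(height, width, strides=(32, 16, 8)):
    """Return the (rows, cols) size of each output grid for a given input size.

    The strides 32, 16 and 8 are the reduction factors named in the text.
    """
    largest = max(strides)
    if height % largest or width % largest:
        # Assumption of this sketch: the input is padded or resized beforehand.
        raise ValueError(f"input size must be a multiple of {largest}")
    return [(height // s, width // s) for s in strides]


if __name__ == "__main__":
    # For the 608 x 608 input mentioned in the performance comparison.
    for stride, grid in zip((32, 16, 8), yolo_output_grids(608, 608)):
        print(f"stride {stride}: grid {grid[0]} x {grid[1]}")
```

The coarsest grid is suited to large objects, while the finest grid helps with small vehicles far from the camera.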

In addition to convolutional layers, the YOLOv3 architecture also contains residual blocks [25], upsampling layers, and skip connections. The CNN takes the image as input data and returns a tensor (Fig. 8), which represents:

Fig. 8

Output tensor

  • coordinates and positions of the predicted bounding boxes, which should contain the objects;

  • the probability that each bounding box contains an object;

  • the probability that each object within its bounding box belongs to a certain class.

To train the YOLOv3 neural network, we used backpropagation with gradient descent. This method uses the output error of the neural network to calculate the corrections to the weights of neurons in its hidden layers. The algorithm is iterative and uses the principle of training “by epochs”, when the weights are changed after several instances of the training set have been supplied to the neural network input, and the error is averaged over all these instances.

We improved the basic performance of YOLO with an additional mask branch and optimizing the shape of the anchors. An additional regression of the masks for each instance improves the precision in the corresponding regression problem of the bounding box. Consequently, the first optimization we applied was an additional mask branch. This branch runs in parallel with the existing branches and tends to regress the mask for each area of interest. For simplicity, we approximated the exact pixel masks of the instance using coarse polygonal masks from the collected dataset.

The results of training the neural network

The Average Precision (AP) is a popular indicator for measuring the precision of object detectors, such as Faster R-CNN, SSD, YOLOv3, etc. The AP values are calculated for each detected vehicle class, as shown in Fig. 9.

Fig. 9

Average precision values for classes

To obtain the mean average precision (mAP) over all classes, we average the AP values of the individual classes. The mAP of the system is 0.85.

Vehicle tracking

Matching the objects detected in the current frame with objects from the previous frames is a difficult task. Vehicles detected in the previous frame may not be detected in the next frame for various reasons, for example, because of poor lighting conditions or occlusions, when one object is overlapped by another. We solved the multiple object tracking problem using the freely available SORT tracker. This is a simple and fast tracker operating in real time, which is very important in our task. It is based on two methods: the Kalman filter [32] and the Hungarian algorithm [33]. The linear velocity is calculated for each object, and the position of the object in the next frame is predicted. Based on the data received from YOLO, we calculate the distance from each detected object to all the predicted ones. The Hungarian algorithm is then used for the optimal matching of the detected objects with the predicted ones. Based on this data, the Kalman filter corrects the state of the object. The tracker assigns a unique identifier to each object (Fig. 10).

Fig. 10

The result of the tracker operation

Each vehicle has its own identifier. To save memory and improve the tracking quality, the tracker takes an object into account only if it was detected in at least min_hits frames. If the object is not detected during max_age frames, it is deleted. In [34], the authors compared several real-time trackers (RMOT, TDAM, MDP, etc.) using various metrics. In this comparison, SORT showed the best trade-off between speed and quality: the best or fairly high scores in the MOTA, MOTP, FP, FN, and other metrics at a rate of 260 frames per second on one core of an Intel i7 2.5 GHz processor with 16 GB of memory.
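The role of the min_hits and max_age parameters can be illustrated with a simplified bookkeeping sketch. It follows the wording of the text rather than the SORT source code, whose exact conditions may differ in detail; the `Track` class, the `update_tracks` function, and the default parameter values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    hits: int = 0               # frames in which the object was matched
    frames_since_seen: int = 0  # consecutive frames without a match


def update_tracks(tracks, matched_ids, min_hits=3, max_age=2):
    """Advance the bookkeeping by one processed frame.

    tracks      -- dict mapping track id to Track (modified in place)
    matched_ids -- ids matched to a detection in the current frame
    Returns the sorted ids of tracks that are taken into account.
    """
    for track_id in list(tracks):
        track = tracks[track_id]
        if track_id in matched_ids:
            track.hits += 1
            track.frames_since_seen = 0
        else:
            track.frames_since_seen += 1
            if track.frames_since_seen >= max_age:
                del tracks[track_id]  # not detected during max_age frames
    return sorted(t.track_id for t in tracks.values() if t.hits >= min_hits)


if __name__ == "__main__":
    tracks = {1: Track(1), 2: Track(2)}
    for frame, matched in enumerate([{1, 2}, {1, 2}, {1}, {1}], start=1):
        confirmed = update_tracks(tracks, matched)
        print(f"frame {frame}: confirmed={confirmed}, alive={sorted(tracks)}")
```

A larger max_age keeps identifiers stable through short occlusions, while a larger min_hits suppresses spurious detections at the cost of a delayed first report.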

The frame rate of the camera video stream is 25 frames per second. To increase the operating speed of the system, we skip two frames and process only every third one. However, at some intersections, cars can drive at a high speed or abruptly change their movement pattern, and they may not be detected by the neural network in every frame because of poor lighting conditions, small size, or overlapping with tree branches. In such situations, the tracker may fail to match new objects with the objects from the previous frames and assign new identifiers to them. Therefore, we use a different number of skipped frames for each intersection. With fewer skipped frames, another frame in which the object can be detected appears between the frames where it was missed. This allowed us to reduce errors at complex intersections, but at the same time increased the processing time.

Elimination of the camera distortion

Modern cameras are imperfect: they distort the image, changing the size, shape, and distances of objects. In our case, the image transmitted from the camera is subject to distortion. To determine the coordinates of objects accurately, we should eliminate the distortion by calibrating the camera. The easiest method of calibration is to use a spatial test object, such as a checkerboard [35], as shown in Fig. 11.

Fig. 11

Demonstration of correcting the image distortion through the use of a checkerboard

Figure 12 shows the source images and the images after applying the calibration.

Fig. 12

Source and corrected camera images

Calculation of the distance

To calculate the distance traveled, we must find the change in the latitude and longitude of the vehicle’s location over a certain time interval from the change of its coordinates in the camera image. To solve this problem, we calculated the perspective transformation matrix (Fig. 3) by selecting four reference points on the map and matching them with the corresponding points in the image (Fig. 13).

Fig. 13

Reference points in the image

To calculate the perspective transformation matrix A= (cij)3 × 3 we need to derive the coefficients cij from the following linear equations describing the dependence between the coordinates in the image and the geographic coordinates:

$$ u_{i} = \frac{{c_{00} x_{i} + c_{01} y_{i} + c_{02} }}{{c_{20} x_{i} + c_{21} y_{i} + c_{22} }} $$
$$ v_{i} = \frac{{c_{10} x_{i} + c_{11} y_{i} + c_{12} }}{{c_{20} x_{i} + c_{21} y_{i} + c_{22} }} $$

where ui, vi are geographic coordinates; cij are elements of the matrix A, c22 = 1; xi, yi are the coordinates from the image, i = 1, …, 4.

As a result of the calculation, we solve the following matrix equation

(aij)8 × 8 * (cij)8 × 1 = (xij)8 × 1 described in detail in Fig. 14.

Fig. 14

The matrix form of the solved equations

After finding the matrix coefficients, we can perform transformation by multiplying the perspective transformation matrix by the coordinate vector from the image.

$$ A \times \left( {\begin{array}{*{20}c} {x_{i} } \\ {y_{i} } \\ 1 \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {x_{i}^{'} } \\ {y_{i}^{'} } \\ {t_{i} } \\ \end{array} } \right) $$

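where A is the transformation matrix; xi, yi are the pixel coordinates in the image; x’i, y’i, ti are the homogeneous coordinates of the transformed point. The latitude and longitude of the point are obtained by dividing x’i and y’i by ti, which corresponds to the common denominator in the equations for ui and vi.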
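The calculation of the matrix A from four point pairs and its application to an image point can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the system (the text states that OpenCV was used for the perspective transformation); the function names and the test quadrilateral are assumptions.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("reference points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def perspective_matrix(src, dst):
    """Build the 3 x 3 matrix A (with c22 = 1) from four point pairs.

    src -- four (x, y) image points; dst -- four (u, v) target points.
    Each pair gives two rows of the 8 x 8 system for c00 ... c21.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -x * u, -y * u])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -x * v, -y * v])
        b.append(v)
    c = solve_linear(a, b)
    return [c[0:3], c[3:6], [c[6], c[7], 1.0]]


def apply_perspective(A, x, y):
    """Map an image point to target coordinates, dividing by the scale t."""
    xp = A[0][0] * x + A[0][1] * y + A[0][2]
    yp = A[1][0] * x + A[1][1] * y + A[1][2]
    t = A[2][0] * x + A[2][1] * y + A[2][2]
    return xp / t, yp / t


if __name__ == "__main__":
    src = [(0, 0), (1, 0), (1, 1), (0, 1)]  # hypothetical image points
    dst = [(0, 0), (2, 0), (3, 3), (0, 2)]  # hypothetical target points
    A = perspective_matrix(src, dst)
    print(apply_perspective(A, 0.5, 0.5))
```

In practice the target points would be the latitude and longitude of the reference points chosen on the map, and no three of the four points may lie on one line.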

To calculate the distance between two points on the sphere, we use the inverse haversine, Eq. (4). The haversine in Eq. (4) is \( \operatorname{hav}\left( \theta \right) = \sin^{2} \left( {\theta /2} \right) \). This method of determining the speed is universal for any movement pattern and does not require additional preliminary marking of the intersection or finding any reference distances.

$$ d = r\,\operatorname{hav}^{ - 1} \left( {\operatorname{hav}\left( {\phi_{2} - \phi_{1} } \right) + \cos \left( {\phi_{1} } \right)\cos \left( {\phi_{2} } \right)\operatorname{hav}\left( {\lambda_{2} - \lambda_{1} } \right)} \right) $$

where d is the measured distance; φ1, λ1 and φ2, λ2 are the latitude and longitude of the first and second points, respectively; r is the radius of the Earth (r = 6371 km).
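The distance formula can be written out as a short sketch. It uses the standard identity that the inverse haversine of h equals 2·arcsin(√h); the function names are chosen for illustration, and the coordinates are assumed to be given in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # value of r given in the text


def hav(theta):
    """Haversine of an angle in radians: sin^2(theta / 2)."""
    return math.sin(theta / 2) ** 2


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_lambda = math.radians(lon2 - lon1)
    h = hav(phi2 - phi1) + math.cos(phi1) * math.cos(phi2) * hav(d_lambda)
    # Inverse haversine; min() guards against rounding slightly above 1.
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(min(1.0, h)))


if __name__ == "__main__":
    # One degree of longitude along the equator.
    print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

For the short segments between consecutive frames the result is a few meters, so the distance is best accumulated in floating point and converted to km/h only when the speed is reported.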

Now, to calculate the average speed, we apply the following formula:

$$ v = \frac{d}{{t_{2} - t_{1} }} $$

where t1, t2 are the times at the beginning and end of the movement over the distance d.

To analyze the average speed of vehicles in real time, we record the time when the vehicle appeared, and at each i-th step of receiving a frame from the video stream, we calculate the accumulated distance di used to find the average speed. The described algorithm is schematically shown in Fig. 15.

Fig. 15

The process for determining the distance and speed

In Fig. 15, A is the perspective transformation matrix, ai are the coordinates of a specific vehicle, di is the distance between two points, ti is the time between frames, vi is the vehicle speed at the section di, V is the average speed.

Updating the data on the average vehicle speed when processing each frame of the video stream allows us to use the proposed method in real time.
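The per-frame update of the average speed described above can be sketched as a small accumulator. The class name `AverageSpeed` and its interface are hypothetical; the segment distances are assumed to be computed beforehand from the geographic coordinates of the vehicle in consecutive processed frames.

```python
class AverageSpeed:
    """Running average speed of one tracked vehicle (illustrative sketch)."""

    def __init__(self, first_seen_s):
        self.first_seen_s = first_seen_s  # time when the vehicle appeared
        self.distance_km = 0.0            # accumulated distance

    def add_segment(self, time_s, segment_km):
        """Add the distance covered since the previous processed frame.

        Returns the average speed in km/h, or None if no time has elapsed.
        """
        self.distance_km += segment_km
        elapsed_h = (time_s - self.first_seen_s) / 3600.0
        if elapsed_h <= 0:
            return None
        return self.distance_km / elapsed_h


if __name__ == "__main__":
    # Every third frame of a 25 fps stream arrives 0.12 s after the previous one.
    speed = AverageSpeed(first_seen_s=0.0)
    print(speed.add_segment(0.12, 0.002))
    print(speed.add_segment(0.24, 0.001))
```

Because only the accumulated distance and the first timestamp are stored, the update costs a constant amount of work per vehicle and frame.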

Experimental results and discussions

Counting the vehicles

To assess the counting quality, we took the video content from CCTV cameras lasting from 1 to 2 h. We performed preliminary preparation for each intersection: marking the driving direction, a mask hiding parking spaces and the adjacent territory (Fig. 16).

Fig. 16

Overview of intersections from CCTV cameras

Table 3 shows the values of the programmed and manual vehicle counting at the intersections of the city of Chelyabinsk.

Table 3 Counting of vehicles at four intersections

Table 4 shows the percentage of the counting error for each class. After analyzing the data of manual and programmed vehicle counting, we found out that the mean counting error for all the classes is 5.5% of the total number of vehicles.

Table 4 Counting errors

An additional study of typical errors showed that most of them result from strong and prolonged occlusions between vehicles in queuing traffic. For example, while a trolleybus or a truck is moving, one or two lanes are partially blocked. Many cars overlap when turning, while waiting in the center of the intersection for a gap in the traffic. This problem can be solved by improving the tracking module with instance re-identification methods based on appearance cues. However, as mentioned above, the existing approaches have a high computational load and are not applicable to real systems. The development of efficient algorithms for re-identification of vehicles remains an open question.

Average vehicle speed

We conducted comparative testing to check the accuracy of the proposed system. To this end, we made manual calculations of the average vehicle speed. Namely, the travel time of the vehicle was measured on movement patterns with a priori known distances (Fig. 17).

Fig. 17

Measured movement patterns and their lengths

The same video was processed by the program; a comparison of the manual and program calculation results is presented in Table 5.

Table 5 The experimental results of the speed detection system

The analysis of the obtained data showed a maximum speed determination error of 1.5 km/h and a mean error of 0.57 km/h over all the movement patterns.

Time complexity

For the proposed method for determining the speed and number of vehicles to work in real time, the time spent on processing each frame must not exceed 1/q s, where q is the number of frames processed per second. For the test intersection, we used every third frame of the 25 frames per second video stream; therefore, the upper bound on the processing time of one frame is (1/25) × 3 = 0.12 s.

Figures 18, 19 show the time complexities for the vehicle detection and speed calculation processes.

Fig. 18

The time spent on vehicle detection for one frame

Fig. 19

The time spent for calculating the speed and number of vehicles for all the directions

The tests were run on a PC with the following specifications: CPU: Intel Core i9-9900K, GPU: GeForce RTX 2080 Ti, RAM: 64 GB. The maximum time spent on vehicle detection for one frame was 0.066 s, and the maximum time for calculating the speed and counting was 0.009 s. In addition to the main processes implementing the above methodology, the software solution consists of many auxiliary processes responsible for data transfer, aggregation, and storage. Figure 20 shows a diagram of the time spent to complete all the processes and obtain the final data for the tested intersection.

Fig. 20

The time of complete processing of one frame

After analyzing the data, we can conclude that the upper estimate of the time complexity of processing one frame is 0.08 s, which fits into the above limitations and allows us to use the presented method to determine the speed and monitor traffic in real time.
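As a worked check of this conclusion, the reported figures can be combined in one line. Here k denotes the frame step (every third frame) and T_max the resulting time budget per processed frame; both symbols are our notation and do not appear in the original.

$$ T_{\max } = \frac{k}{q} = \frac{3}{25} = 0.12\;{\text{s}},\qquad 0.066\;{\text{s}} + 0.009\;{\text{s}} = 0.075\;{\text{s}} \le 0.08\;{\text{s}} < T_{\max } $$

The two main stages therefore account for at most 0.075 s, and the measured total of 0.08 s, which also covers the auxiliary processes, stays below the 0.12 s budget.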

Software solution

The system includes the following sequence of processes (Fig. 21):

Fig. 21

System workflow

  • frames reading (Process 1);

  • detection and classification of vehicles from the current frame (Process 2);

  • vehicle tracking and counting in all directions of the road junction (Process 3);

  • calculation of the latitude and longitude of the vehicle location (Process 4);

  • calculation of vehicle speeds (Process 5);

  • calculation of metrics related to the amount of harmful substances emitted by each vehicle (Process 6).

Used technologies

We used the following technologies for the software implementation of the presented architecture:

  1. OpenCV is an open-source library of computer vision, image processing, and general-purpose numerical algorithms. We used this library to perform the following tasks:

    a. Resizing an image and applying a mask to it;

    b. Setting and displaying entry and exit areas, as well as determining the presence of vehicles in these areas;

    c. Camera calibration and elimination of distortion;

    d. Applying the perspective transformation matrix and determining the distance traveled;

    e. Data visualization.

  2. SORT is an open-source library for 2D tracking of multiple objects in video sequences based on elementary data association and state estimation methods. We used it to track vehicles in the video stream.

  3. Redis is an open-source in-memory NoSQL database management system. We used it to store intermediate results of the modules.

  4. RabbitMQ is a software message broker based on the AMQP standard. We used it to organize a data queue for transferring to a web page.

  5. PostgreSQL is a free object-relational database management system. To compile statistics and calculate various metrics, such as KPI and daily flow structure, we aggregate and save the received data in a database every hour.


In this study, we focused on the problem of obtaining data on the speed and driving direction of vehicles from the video stream of street surveillance cameras. The complexity of the task is caused by the following factors: different viewing angles, remoteness from the intersection, and overlapping of objects. We added a mask branch to the YOLO v3 neural network architecture and optimized the shapes of the anchors to improve the accuracy of detection and classification of objects of different sizes and, as a result, the quality of object tracking. To determine the speed in real time, we presented a method based on a perspective transformation of the coordinates of vehicles in the image to geographic coordinates.

The proposed system was tested at night and in the daytime at six intersections in the city of Chelyabinsk, showing a mean vehicle counting error of 5.5%. The error in determining the vehicle speed by the projection method, taking into account the camera calibration at the tested intersection, did not exceed 1.5 km/h. The presented methodology allows us to generate complete and high-quality data for real-time traffic control and significantly reduce the requirements for peripheral equipment. Within the framework of this study, we did not address many problems, such as overlapping of objects, a more detailed classification of vehicles, and the detection of accidents and blocking objects. We consider our solution as a basis for our future research aimed at solving these problems.

Availability of data and materials



ITS: Intelligent transport systems

CNN: Convolutional neural networks

AP: Average precision

CCTV: Closed-circuit television


  1. 1.

    Peppa MV, Bell D, Komar T, Xiao W. Urban traffic flow analysis based on deep learning car detection from cctv image series. Int Arch Photogramm Remote Sens Spat Inf Sci. 2018;42(4):565–72.

    Article  Google Scholar 

  2. 2.

    Fedorov A, Nikolskaia K, Ivanov S, Shepelev V, Minbaleev A. Traffic flow estimation with data from a video surveillance camera. J Big Data. 2019.

    Article  Google Scholar 

  3. 3.

    Li C, Dobler G, Feng X, Wang Y. TrackNet: simultaneous object detection and tracking and its application in traffic video analysis. 2019; pp. 1–10.

  4. 4.

    Zhang F, Li C, Yang F. Vehicle detection in urban traffic surveillance images based on convolutional neural networks with feature concatenation. Sensors. 2019;19(3):594.

    Article  Google Scholar 

  5. 5.

    Zhang S, Wu G, Costeira JP, Moura JM. FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras. In: Proceedings of the IEEE international conference on computer vision. 2017.

  6. 6.

    Rathore MM, Son H, Ahmad A, Paul A. Real-time video processing for traffic control in smart city using Hadoop ecosystem with GPUs. Soft Comput. 2018;22(5):1533–44.

    Article  Google Scholar 

  7. 7.

    Sun X, Ding J, Dalla Chiara G, Cheah L, Cheung NM. A generic framework for monitoring local freight traffic movements using computer vision-based techniques. In: 5th IEEE international conference on models and technologies for intelligent transportation systems (MT-ITS). 2017. p. 63–8.

  8. 8.

    Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49.

    Article  Google Scholar 

  9. 9.

    Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016.

  10. 10.

    Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC. SSD: single shot multibox detector. Lect Notes Comput Sci. 2016;9905:21–37.

    Article  Google Scholar 

  11. 11.

    Hu X, Xu X, Xiao Y, Chen H, He S, Qin J, Heng PA. SINet: a scale-insensitive convolutional neural network for fast vehicle detection. IEEE Trans Intell Transp Syst. 2019;20(3):1010.

    Article  Google Scholar 

  12. 12.

    Jung H, Choi MK, Jung J, Lee JH, Kwon S, Jung WY. ResNet-based vehicle classification and localization in traffic surveillance systems. In: 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW). 2017. 934–40.

  13. 13.

    Li S, Lin J, Li G, Bai T, Wang H, Pang Y. Vehicle type detection based on deep learning in traffic scene. Procedia Comput Sci. 2018;131:564–72.

    Article  Google Scholar 

  14. 14.

    Sommer L, Acatay O, Schumann A, Beyerer J. Ensemble of two-stage regression based detectors for accurate vehicle detection in traffic surveillance data. 2019. p. 1–6.

  15. 15.

    Wang L, Lu Y, Wang H, Zheng Y, Ye H, Xue X. Evolving boxes for fast vehicle detection. In: 2017 IEEE international conference on multimedia and Expo (IC-ME). 2017. p. 1135–40.

  16.

    Zhu F, Lu Y, Ying N, Giakos G. Fast vehicle detection based on evolving convolutional neural network. In: 2017 IEEE international conference on imaging systems and techniques (IST). 2017. p. 1–4.

  17.

    Anisimov D, Khanova T. Towards lightweight convolutional neural networks for object detection. In: 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS). 2017. p. 1–8.

  18.

    Li S. 3D-DETNet: a single stage video-based vehicle detector. 2018.

  19.

    Luo W, Yang B, Urtasun R. Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 3569–77.

  20.

    Wu Y, Jiang S, Xu Z, Zhu S, Cao D. Lens distortion correction based on one chessboard pattern image. Front Optoelectron. 2015;8(3):319–28.

  21.

    Redmon J, Farhadi A. YOLOv3: an incremental improvement. 2018.

  22.

    Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. 2017.

  23.

    He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. In: 2017 IEEE international conference on computer vision (ICCV). 2017. p. 2980–8.

  24.

    Shreyas Dixit KG, Chadaga MG, Savalgimath SS, Ragavendra Rakshith G, Naveen Kumar MR. Evaluation and evolution of object detection techniques YOLO and R-CNN. Int J Recent Technol Eng. 2019;8(3):824–9.

  25.

    He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–8.

  26.

    Javadi S, Dahl M, Pettersson MI. Vehicle speed measurement model for video-based systems. Comput Electr Eng. 2019;76:238–48.

  27.

    Gholami A, Dehghani A, Karim M. Vehicle speed detection in video image sequences using CVS method. Int J Phys Sci. 2010;5(17):2555–63.

  28.

    de Barth O VB, Oliveira R, de Oliveira MA, Nascimento VE. Vehicle speed monitoring using convolutional neural networks. IEEE Latin Am Trans. 2019;17(06):1000–8.

  29.

    Lan J, Li J, Hu G, Ran B, Wang L. Vehicle speed measurement based on gray constraint optical flow algorithm. Optik Int J Light Elect Optics. 2014;125(1):289–95.

  30.

    Bewley A, Ge Z, Ott L, Ramos F, Upcroft B. Simple online and realtime tracking. In: 2016 IEEE international conference on image processing (ICIP). 2016. p. 3464–8.

  31.

    Video observation. Accessed 20 May 2020.

  32.

    Kalman R. A new approach to linear filtering and prediction problems. J Basic Eng. 1960;82:35–45.

  33.

    Kuhn HW. The Hungarian method for the assignment problem. Naval Res Log Quart. 1955;2:83–97.

  34.

    Bewley A, Ge Z, Ott L, Ramos F, Upcroft B. Simple online and realtime tracking. In: 2016 IEEE international conference on image processing (ICIP). 2016. p. 3464–8.

  35.

    Wu W, Wu L, Li J, Wang S, Zheng G, He X. RetinaNet-based visual inspection of flexible materials. In: 2019 IEEE international conference on smart internet of things (SmartIoT). 2019. p. 432–5.


Not applicable.


Funding

The work was supported by Act 211 of the Government of the Russian Federation, contract No. 02.A03.21.0011.

Author information

Contributions

VS and KH designed the research, performed the research, analyzed the data, and wrote the paper. IS and TC designed the research and were major contributors in writing the manuscript. IC, IA and SS gathered data and wrote the paper. All authors suggested related works and discussed the structure of the paper and the results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Vladimir Shepelev.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Khazukov, K., Shepelev, V., Karpeta, T. et al. Real-time monitoring of traffic parameters. J Big Data 7, 84 (2020).


Keywords

  • Neural network
  • YOLO v3
  • Data for training the neural network (Dataset)
  • Traffic flow assessment
  • Vehicle detection
  • Vehicle classification
  • Vehicle speed
  • Traffic monitoring