Research on pedestrian detection algorithm in driverless urban traffic environment

Pedestrian detection in urban traffic environments is an important area of driverless vehicle research. Because traffic flow is highly variable, target detection algorithms often cannot extract complete feature information, which poses a major challenge for driverless pedestrian detection. The target detection algorithm YOLOv4 performs well in general object detection, but it is less reliable at identifying partially occluded pedestrians. In this paper, a Spatial Pyramid Pooling (SPP) module was added in front of the third YOLO detection head of YOLOv4 to improve the extraction of deep network features. Then, on the basis of the optimized network, a pruning strategy was adopted to simplify the target detection algorithm; the result is called TidyYOLOv4. TidyYOLOv4 and YOLOv4 (with the network input image size set to 864×864) were compared on a self-made human head data set: total BFLOPS decreased by 95.04% and inference time decreased by 82.82%. These experimental results show that the optimized TidyYOLOv4 algorithm is better suited to driverless pedestrian detection in urban traffic environments.


Introduction
With the progress of artificial intelligence, driverless vehicles have become one of the main directions of research and development. Driverless vehicles fuse several detection technologies, among which visual detection is one of the most important. Pedestrian detection on urban roads is a basic visual perception task for driverless cars across traffic scenes: if a driverless vehicle fails to detect a pedestrian in the road accurately, it may endanger the pedestrian's life and safety. It is therefore very important to ensure the accuracy of pedestrian detection. Advances in deep learning algorithms have improved road pedestrian detection, but further improvement is still needed in practical applications. There are two main problems: (1) Deep neural network vision algorithms need substantial computing power and memory. At present their detection performance is mainly tested and verified on servers, and they are difficult to store and run on an on-board chip. (2) Complex traffic flow can prevent the target detection algorithm from extracting complete feature information (for example, part of a pedestrian's body is blocked by other vehicles or traffic signs), so the target must be identified from the partial information that is available.
To address the problem that deep learning target detection algorithms cannot be deployed on the on-board chip of a driverless vehicle, a pruning algorithm is used to reduce the model size of the target detection algorithm and its consumption of computing power, so that it can be reasonably deployed on such a chip. To improve the detection of occluded pedestrians, a data set of pedestrians under occlusion is needed to validate the improved algorithm. Because the appearance of the legs, hands and torso is highly uncertain, the head was chosen as the annotation target: it is highly recognizable, has distinctive features, and is relatively unlikely to be occluded on the road. Such annotations do not exist in the common open-source pedestrian data sets, so we built a head-annotation data set in which the human body is partially occluded. The experimental results on this data set show that the optimized algorithm TidyYOLOv4 is better suited than YOLOv4 [1] to pedestrian detection by driverless vehicles in urban traffic environments.

Related work
Machine vision tasks fall mainly into two categories: (1) classifying the elements in an image, and (2) locating the objects in an image. Target detection combines the two, classification and localization. Early target detection algorithms extracted target information from the image with a sliding window and then analyzed it to locate and classify the target, but the results were unsatisfactory. The advent of R-CNN attracted the interest of a large number of researchers and made target detection one of the most active research areas in vision. Many strong target detection algorithms have since been developed, such as R-CNN [2], Fast R-CNN [3], R-FCN [4], SSD [5], YOLO [6], YOLOv2 [7], YOLOv3 [8] and YOLOv4 [1].
These deep target detection algorithms fall into two categories according to their network architecture. The first is the two-stage target detector, represented by R-CNN and Fast R-CNN, which consists of three major modules: a region proposal module, a backbone network and a detection head. The region proposal module first generates proposals for regions of interest, the detection head then classifies these proposals, and finally position regression is carried out to locate the target object accurately. The two-stage target detector achieves excellent detection accuracy through region proposals, but it consumes a great deal of computing power and memory, which makes real-time target detection slow. The second category is the single-stage target detector, represented by the YOLO series and SSD, which places k prior boxes at each position of the feature map so that they densely cover the image, and uses no branch network comparable to the region proposal module. The single-stage detector is therefore faster at inference than the two-stage detector. Among single-stage target detectors, YOLOv4 offers excellent detection speed and advanced detection accuracy. In this study, YOLOv4 was therefore selected as the base model for pruning. Combined with a pruning strategy, a more efficient target detection model, TidyYOLOv4, was obtained to improve real-time pedestrian detection by driverless vehicles in urban traffic.

Network optimization
YOLOv4 is an advanced algorithm that evolved from the YOLO algorithm through successive optimization and improvement. It builds mainly on YOLOv3, combined with existing advanced optimization strategies, and is considerably better in both speed and accuracy. To improve the detection performance of the network, YOLOv4 applies the idea of CSP-Net [9] to Darknet-53 to construct the CSPDarknet-53 backbone, which greatly improves information flow through the network; the Neck combines the advantages of SPP [10] and PAN [11] to strengthen deep feature extraction; and the Head follows the detection method of YOLOv3.
To further strengthen deep feature extraction, in this experiment an SPP module was added between the 5th and 6th convolutional layers in front of the third detection head of YOLOv4 to improve detection. The resulting network is called YOLOv4-SPP1 and is shown in Figure 1.
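To make the role of the SPP module concrete, the following is a minimal pure-Python sketch of the pooling-and-concatenation step. It is an illustration, not the authors' implementation: the kernel sizes 5, 9 and 13 are an assumption taken from commonly published YOLOv4 configurations (the text does not state them), and the function names `max_pool_same` and `spp` are hypothetical.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over one 2-D feature map; output has the input size.

    Border windows are clipped to the map, which is equivalent to padding
    with negative infinity. k is assumed to be odd.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                fmap[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def spp(channels, kernel_sizes=(5, 9, 13)):
    """Concatenate the input channels with one pooled copy per kernel size.

    kernel_sizes is an assumed default, not a value given in the paper.
    """
    out = list(channels)
    for k in kernel_sizes:
        out.extend(max_pool_same(ch, k) for ch in channels)
    return out


# One 3x3 channel, pooled with a single 3x3 kernel for readability.
demo = spp([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]], kernel_sizes=(3,))
```

Because every pooled copy keeps the spatial size of the input, the output can be concatenated along the channel axis; with three kernel sizes the channel count grows by a factor of four, and each position then carries context from several receptive-field sizes.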

Network pruning
A pruning strategy is adopted to reduce the runtime resource consumption of the target detection algorithm and improve detection efficiency. Starting from the optimized network YOLOv4-SPP1, the model was simplified through the iterative network pruning process shown in Figure 2, yielding the optimized network TidyYOLOv4. The iterative pruning process for the YOLOv4-SPP1 network is as follows:
(1) Select an appropriate base network framework.
(2) Carry out basic training.
(3) Carry out sparse training on the network obtained from basic training.
(4) Evaluate the importance of the components of the deep model and develop a pruning strategy.
(5) Prune the network model according to the pruning strategy.
(6) Fine-tune the pruned model to recover as much of its performance as possible.
(7) Obtain the pruned and optimized network.
(8) Prune again if the performance of the optimized model does not meet the requirements.
(9) Stop pruning once evaluation shows that the network meets the requirements of the experiment; this network is the optimal network TidyYOLOv4.
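The control flow of the steps above can be sketched as a small loop. This is only an outline under stated assumptions: every callable (`train`, `sparse_train`, `prune`, `fine_tune`, `evaluate`, `meets_target`) is a hypothetical placeholder supplied by the caller, and the sketch assumes that each new round restarts from sparse training, which the text does not specify.

```python
def iterative_prune(model, train, sparse_train, prune, fine_tune,
                    evaluate, meets_target, max_rounds=10):
    """Repeat sparse training, pruning and fine-tuning until the target is met.

    All callables are hypothetical placeholders. Returns the final model and
    the number of pruning rounds that were performed.
    """
    model = train(model)                     # step (2): basic training
    rounds = 0
    while rounds < max_rounds:
        model = sparse_train(model)          # step (3): sparse training
        model = prune(model)                 # steps (4)-(5): evaluate and prune
        model = fine_tune(model)             # step (6): fine-tuning
        rounds += 1
        if meets_target(evaluate(model)):    # steps (7)-(9): stop when good enough
            break
    return model, rounds


# Toy usage: the "model" is just a channel count that each round halves.
identity = lambda m: m
final, rounds = iterative_prune(
    100, train=identity, sparse_train=identity, prune=lambda m: m // 2,
    fine_tune=identity, evaluate=identity, meets_target=lambda c: c <= 25)
```

The `max_rounds` cap is a safety choice of this sketch, so that the loop terminates even if the target is never reached.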

Optimization strategy: First, L1 regularization is applied to the channel scaling factors [12,13] to increase channel-level sparsity, which facilitates structured pruning. Next, a global threshold is introduced to adjust the channel pruning rate. Finally, the convolutional layers with the smallest average scaling factors are removed to further improve detection efficiency, giving the optimized algorithm TidyYOLOv4. In this experiment, the method proposed by Liu [14] was extended into a coarse-grained deep model search method to explore a more efficient target detector.
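The channel and layer selection described above can be illustrated with a short pure-Python sketch. It assumes the scaling factors of each layer are available as plain lists of floats; the function names, the rule that every layer keeps at least one channel, and the tie handling at the threshold are assumptions of this sketch rather than details given in the text.

```python
def global_threshold(gammas_per_layer, prune_ratio):
    """Smallest scaling-factor magnitude that survives a global prune ratio."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    magnitudes = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    cut = int(len(magnitudes) * prune_ratio)   # number of channels to remove
    return magnitudes[cut]


def channel_masks(gammas_per_layer, threshold):
    """Keep channels whose |gamma| reaches the threshold (True means keep)."""
    masks = []
    for layer in gammas_per_layer:
        mask = [abs(g) >= threshold for g in layer]
        if not any(mask):
            # Assumption of this sketch: never empty a layer completely.
            strongest = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[strongest] = True
        masks.append(mask)
    return masks


def layers_to_drop(gammas_per_layer, n_layers):
    """Indices of the n_layers layers with the smallest mean |gamma|."""
    means = [sum(abs(g) for g in layer) / len(layer) for layer in gammas_per_layer]
    order = sorted(range(len(means)), key=lambda i: means[i])
    return sorted(order[:n_layers])


gammas = [[0.9, 0.01, 0.5], [0.02, 0.03, 0.8], [0.001, 0.002, 0.004]]
thr = global_threshold(gammas, 0.5)
masks = channel_masks(gammas, thr)
drop = layers_to_drop(gammas, 1)
```

Using one threshold across all layers lets the pruning rate differ from layer to layer, so layers whose scaling factors were driven towards zero by sparse training lose the most channels.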

Experiment
After a series of optimizations on the basis of YOLOv4, the target detection algorithm TidyYOLOv4, intended for pedestrian detection on urban roads, was obtained. The validity of the algorithm is verified by the following experiments.

Experimental environment
The experimental platform was configured to meet the requirements of TidyYOLOv4. The operating environment consisted of an Intel Xeon E5-2603 CPU with 16 GB of memory, a Tesla P4 GPU (8 GB) and the Ubuntu 16.04 operating system.

Data set
To verify the effectiveness of the optimized algorithm TidyYOLOv4 on partially occluded pedestrian heads, a data set of partially occluded pedestrian heads [15] was built. The images, with a resolution of 1280×720 pixels, were extracted from video recorded with a DV camera. LabelImg was then used to annotate the heads, giving a total of 10,870 images. The data set was labeled manually, and all images were annotated strictly in the same manner. The data were divided in an 8:1:1 ratio into a training set of 8696 images, a validation set of 1087 images and a test set of 1087 images.
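The split sizes quoted above follow directly from the 8:1:1 ratio applied to 10,870 images. A minimal sketch of such a split is shown below; the shuffling, the fixed seed and the function name `split_8_1_1` are assumptions for illustration, since the text does not describe how images were assigned to each subset.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility (assumed)
    n_val = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Image indices stand in for image file names here.
train, val, test = split_8_1_1(range(10870))
```

Giving the remainder to the training set keeps the three subsets disjoint and makes their sizes add up to the full data set.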

Model training
Training and validation in this experiment were implemented in the deep learning framework PyTorch. During training, images were fed to the network in batches of 4 for 100 epochs. The initial learning rate was 0.01, and it was divided by ten when training reached 70% and 90% of the whole process. The weight decay was set to 0.001 and the momentum to 0.9.
Sparse training: Each YOLO model was first trained for 100 epochs. Sparse training for 300 epochs was then carried out on top of these 100 epochs to prepare the network for pruning. Because the appropriate penalty factor depends on the learning rate, it has to be chosen to match the training settings; in this experiment the penalty factor was set to 0.0001. The other parameters were the same as in basic training.
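The learning-rate schedule described above can be written as a small function. This sketch assumes zero-based epoch numbering and that each drop takes effect exactly at the epoch where 70% or 90% of training has been completed; the function name `learning_rate` and its parameters are illustrative.

```python
def learning_rate(epoch, total_epochs=100, base_lr=0.01,
                  milestones=(0.7, 0.9), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone fraction."""
    lr = base_lr
    for fraction in milestones:
        # Assumption: the drop applies from this (zero-based) epoch onwards.
        if epoch >= round(total_epochs * fraction):
            lr *= factor
    return lr


schedule = [learning_rate(e) for e in range(100)]
```

In PyTorch the same behaviour is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`, using milestones 70 and 90 and `gamma=0.1`; whether the authors used that scheduler is not stated.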
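Sparse training adds an L1 penalty on the channel scaling factors to the detection loss, which in practice means adding the penalty factor times the sign of each scaling factor to its gradient. The sketch below shows that update on plain Python lists; the function names are hypothetical, the default penalty of 0.0001 is the value quoted above, and the plain gradient step without momentum or weight decay is a simplification.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def add_l1_subgradient(gammas, grads, penalty=1e-4):
    """Add the subgradient of penalty * sum(|gamma|) to the loss gradients."""
    return [g + penalty * sign(w) for w, g in zip(gammas, grads)]


def sgd_step(gammas, grads, lr=0.01):
    """Plain gradient step (momentum and weight decay omitted for brevity)."""
    return [w - lr * g for w, g in zip(gammas, grads)]


gammas = [0.5, -0.2, 0.0]
grads = add_l1_subgradient(gammas, [0.0, 0.0, 0.0])
updated = sgd_step(gammas, grads)
```

With a zero detection gradient, each step moves every non-zero scaling factor towards zero by the learning rate times the penalty, which is how the penalty gradually separates unimportant channels from important ones.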

Analysis of experimental results
The results in the table were used to compare the base model with the models learned under different optimization strategies, so as to select the optimal target detection model. Models are named YOLOv4-SPP1-X, where SPP1 means that one Spatial Pyramid Pooling module was added to YOLOv4 and X means that X% was pruned. Table 1 gives the experimental results for several different degrees of pruning. The Input size column of the table shows that the evaluation metrics of YOLOv3, YOLOv4 and YOLOv4-SPP1 improve significantly as the network input size increases from 416×416 to 864×864; among these, the mAP of YOLOv3 increases by 12.
According to the comparison of the evaluation metrics of YOLOv4 and YOLOv4-SPP1 in Figure 4, the metrics increase when the spatial pyramid module is added, indicating that adding the spatial pyramid module before the detection head of YOLOv4 is an effective way to improve feature extraction. Therefore, YOLOv4-SPP1 was selected as the network to be pruned.
In the experiment, YOLOv4-SPP1 was pruned to different degrees. The figure shows that the model performed best when the pruning rate reached 96%; at a pruning rate of 97%, the overall performance of the detection model began to decline. Based on the comparative analysis of the evaluation metrics of YOLOv4-SPP1-95, YOLOv4-SPP1-96 and YOLOv4-SPP1-97 in Figure 5, YOLOv4-SPP1-96 was selected as the final optimized network model, TidyYOLOv4.
Detection effect analysis: The visual detection results of YOLOv4 and TidyYOLOv4 in Fig. 6 and Fig. 7 show no obvious difference in detection quality. However, the inference time per frame is reduced by 75.70 ms, which greatly shortens target detection and leaves more time for subsequent recognition and processing.
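The absolute reduction of 75.70 ms can be related to the 82.82% relative reduction reported in the abstract. The sketch below assumes that both figures describe the same YOLOv4 versus TidyYOLOv4 comparison at an input size of 864×864; under that assumption the per-frame times work out to roughly 91.4 ms before pruning and roughly 15.7 ms after. These are derived values, not measurements reported in the text, and the function name is illustrative.

```python
def implied_times(abs_reduction_ms, rel_reduction):
    """Recover baseline and pruned times from absolute and relative reductions."""
    baseline = abs_reduction_ms / rel_reduction   # time before pruning
    pruned = baseline - abs_reduction_ms          # time after pruning
    return baseline, pruned


# Assumption: 75.70 ms and 82.82% refer to the same pair of models.
baseline_ms, pruned_ms = implied_times(75.70, 0.8282)
speedup = baseline_ms / pruned_ms
print(f"{baseline_ms:.2f} ms -> {pruned_ms:.2f} ms ({speedup:.2f}x faster)")
```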

Conclusion
In this work, TidyYOLOv4, a target detection algorithm suited to driverless vehicles on urban roads, was obtained through optimization. First, as shown in Figure 1, network feature extraction is improved by adding SPP before the third detection head of YOLOv4. Second, the redundancy in the trained YOLOv4-SPP1 model is pruned through a joint layer and channel pruning strategy to obtain a more efficient detection model. To identify the unimportant parts of the trained model automatically, sparse L1 regularization is applied to the channel scaling factors, and an appropriate threshold on the scaling factors is set to trim the unimportant parts of the network model and improve the performance of the target detector. Based on this strategy, the TidyYOLOv4 model is derived from the original YOLOv4 model (with the network input image size set to 864×864). Compared with YOLOv3, TidyYOLOv4 has higher detection speed and better detection accuracy, and its model size is 99.05% smaller than that of YOLOv4. It is therefore concluded that TidyYOLOv4 is better suited than YOLOv4 to pedestrian detection by driverless vehicles in urban traffic environments.