Traffic accident analysis based on C4.5 algorithm in WEKA

At present, China's highway network is developing steadily, while traffic safety problems are becoming increasingly serious. Data mining is an effective method for analysing traffic accidents, and in-depth mining of accident data supports accident prevention and traffic safety management. Based on traffic accident data from the Wenli highway for 2006 to 2013, this study selected time factors, road alignment factors and driver characteristics as research indicators, and built a decision tree with the C4.5 algorithm in WEKA to explore the impact of these factors on accident type. According to the contribution of each variable to the classification performance of the model, several patterns affecting the accident type are obtained, and the overall prediction accuracy is about 80%.


Introduction
As passages connecting cities, highways carry huge traffic flows. China's highway mileage is particularly long, reaching 136,500 kilometers at the end of 2017 [1]. With the rapid development of the highway industry, traffic safety problems have become increasingly prominent, and the personal injury and property damage caused by traffic accidents in China every day is very serious. The share of highway deaths in the country's total road deaths is rising, from less than 1% in 1994 to 10% in 2013 [2]. Reducing the number of accidents and the losses they cause is therefore an urgent problem in the field of transportation. To achieve this goal, in addition to strengthening infrastructure construction and optimizing management measures, research on the traffic accidents themselves is needed to understand the patterns behind them.
By analyzing the causes of traffic accidents, targeted improvement measures can be proposed. A classification model can quantify the degree of influence of each factor on the accident and clarify the influence mechanism for different accident types, which helps the analyst focus on the key points.
To identify the main causes of accidents, this paper takes eight years of traffic accident statistics from K117 to K134 of the Wenli highway as an example, and proposes a method based on the C4.5 model to determine the main causes and accident types. The objective of this study is to examine whether the C4.5 model can effectively identify the risk factors affecting accident type.

Literature review
Numerous studies have focused on the factors affecting traffic accidents. After statistical analysis, data mining has become the current mainstream method of analysis. From a methodological viewpoint, a wide variety of approaches have been employed to investigate traffic accidents [3].
Sohn S Y et al. [4] used various algorithms to improve the accuracy of classifying the severity of two types of road traffic accidents. Their algorithms included classifier fusion based on Bayesian and logistic models, data ensemble fusion based on arcing and bagging, and clustering based on the k-means algorithm. The results show that the clustering-based classification algorithm is the most suitable for classifying road traffic accidents in Korea. Chang L Y et al. [5] built a CART classification tree model from traffic accident data for Taipei in 2001 to investigate the relationship between accident severity and driver characteristics, vehicle characteristics, environmental variables, and accident variables. Vehicle type was found to be the variable most relevant to collision severity. Li Y Z et al. [6] used the Apriori association rule algorithm to study the connections between factors related to traffic accidents. Six years of accident data from Tianjin were taken as the research object, and the results were broadly consistent with the inspection data of the traffic control department. Singh G et al. [7] examined the application of the M5 model tree and the conventional fixed/random effect negative binomial (FENB/RENB) regression models to the prediction of accidents on non-urban highway sections in Haryana (India). The results show that both models perform quite well in terms of correlation coefficient and root mean square error, and that the M5 model tree provides simple linear equations that are easy to interpret. Lombardi D A [8] used fatal traffic accident census data for the 50 US states in 2011-2014, compiled by the National Highway Traffic Safety Administration (NHTSA), to build multivariate regression and Poisson models comparing accident-related factors between senior and young drivers. Liu Z Q et al. [9] used the NETICA software as a development platform to build a Bayesian model to analyze the characteristics of highway traffic accidents in haze weather, and obtained the implicit relationship between the two.
In summary, the development of data mining in the field of traffic accidents presents a diversified trend, but there is a lack of exploration of the patterns of various types of accidents.

Methodology
The popularity of classification tree models stems from their wide acceptance, ease of interpretation, and the availability of suitable estimation routines in most popular statistical packages. In this study, the J48 algorithm in WEKA was used to explore the distribution of accident types. J48 is WEKA's implementation of the C4.5 decision tree algorithm [10]. It is one of the most commonly used data mining techniques and is widely applied in industrial and engineering fields [11] [12]. A classification tree is appropriate when the target variable is discrete. Because this study aims to model the distribution of accident types under various conditions, and accident type is a discrete outcome, a classification tree was developed.
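The development of a C4.5 decision tree model generally consists of the following three steps [13]. The first step is the growth of the branches. The tree grows by splitting the data on the attribute with the highest information gain rate (gain ratio) with respect to the target variable. In its standard form, the information gain rate of an attribute $A$ on the set $T$ is calculated as follows:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}, \qquad \mathrm{Gain}(A) = \mathrm{Info}(T) - \sum_{j=1}^{k} \frac{|T_j|}{|T|}\,\mathrm{Info}(T_j), \qquad \mathrm{SplitInfo}(A) = -\sum_{j=1}^{k} \frac{|T_j|}{|T|} \log_2 \frac{|T_j|}{|T|}$$
where $\mathrm{Info}(T)$ is the entropy of the target variable in $T$ and $T_1, \ldots, T_k$ are the subsets of $T$ produced by splitting on $A$.
The second step is the processing of continuous variables. In the set $T$, $\{v_1, v_2, \ldots, v_n\}$ are the sorted values of the continuous attribute $A$, so there are $n-1$ ways to divide $A$ into two intervals. The information gain rate of each division is calculated, and the division with the maximum information gain rate is taken as the branch on $A$.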
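The gain-rate calculation and the threshold search for a continuous attribute can be sketched in Python as follows. This is a minimal illustration of the standard C4.5 definitions, not the WEKA implementation: the function names are hypothetical, the cut point is placed at the midpoint between neighbouring values for simplicity, and the sample values in the usage lines are invented toy data rather than Wenli records.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(groups):
    """Gain ratio of a split, given the class labels in each subset."""
    groups = [g for g in groups if g]          # ignore empty branches
    all_labels = [y for g in groups for y in g]
    total = len(all_labels)
    gain = entropy(all_labels) - sum(len(g) / total * entropy(g) for g in groups)
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups)
    return gain / split_info if split_info > 0 else 0.0


def best_threshold(values, labels):
    """Evaluate the n-1 cut points between sorted distinct values.

    Returns (cut, gain_ratio) for the best binary division.
    Assumption: cuts are midpoints between neighbouring values.
    """
    distinct = sorted(set(values))
    best_cut, best_ratio = None, -1.0
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        ratio = gain_ratio([left, right])
        if ratio > best_ratio:
            best_cut, best_ratio = cut, ratio
    return best_cut, best_ratio


# Toy example (invented values): curve radius vs. accident type.
radii = [500, 800, 1200, 4000, 5000, 6000]
types = ["rear-end"] * 4 + ["solid"] * 2
cut, ratio = best_threshold(radii, types)
```

In this toy data the two classes are perfectly separated by radius, so the search selects the cut that isolates them; with real records several attributes compete and the one with the largest gain rate becomes the split.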
The third step is pruning. To prevent over-fitting of the model, a pre-pruning method is used. In WEKA, this means setting the minimum number of instances per branch: when a terminal node would contain too few instances, it is cut off.
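The pre-pruning rule described above can be illustrated with a small Python sketch. It is a simplified reading of the text, assuming that a split is rejected whenever any branch would receive fewer than a chosen minimum number of instances; the exact rule inside WEKA's J48 may differ, the function names are hypothetical, and the threshold of 2 is only a placeholder because the value used in this study is not stated.

```python
from collections import Counter


def split_allowed(branches, min_instances=2):
    """True if every branch keeps at least min_instances records."""
    return all(len(branch) >= min_instances for branch in branches)


def grow_or_stop(labels, branches, min_instances=2):
    """Decide whether to keep a candidate split.

    labels:   class labels of all records at the current node
    branches: class labels in each candidate child node
    Returns ("split", branches) or ("leaf", majority_label).
    """
    if split_allowed(branches, min_instances):
        return ("split", branches)
    # Too few instances in some branch: stop and predict the majority class.
    majority = Counter(labels).most_common(1)[0][0]
    return ("leaf", majority)
```

Raising the minimum produces a smaller tree with fewer terminal nodes, at the cost of ignoring patterns supported by only a handful of accidents.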

Data
Following the above method, accident data for 2006-2013 collected from the K117 to K134 sections of the Wenli highway are taken as an example. The Wenli highway is an important national trunk line in the central part of Zhejiang Province, connecting Lishui and Wenzhou. It is known as the "Bridge and Tunnel Club" because of its complex terrain. The 17 km study section includes 5 tunnels and 1 bridge. Figure 1 shows the extent of the study section. Table 1 shows the number of accidents that occurred in each section. Each accident record includes the time, place, vehicle type, recorded cause, etc. The alignment data, including the absolute value of the elevation difference per km and the radius of curvature, are obtained from the design documents. The original data are noisy, incomplete and ambiguous. After data cleansing, 586 accident records are finally obtained.
To build the C4.5 analysis model, the collected data must be divided into two subsets, one for training and one for testing. Normally, the training sample takes 2/3 of the data and the test sample 1/3 [14]. The two subsets are selected by generating random numbers. A Mann-Whitney test shows no significant difference in accident type between the two samples.
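A random 2/3 to 1/3 division of this kind can be sketched in Python as follows. The function name, the fixed seed and the use of record indices in place of real accident records are assumptions made for illustration; the exact subset sizes used in the study are not reported here.

```python
import random


def split_records(records, train_fraction=2 / 3, seed=0):
    """Randomly divide records into a training and a test subset."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Indices stand in for the 586 cleaned accident records.
train, test = split_records(range(586))
```

Fixing the seed makes the division reproducible, which is useful when the two subsets are later compared for differences in accident type.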

Calculation
This paper uses eight predictors to explain the target variable, accident type, and tries to find the important links between them. The eight predictors are season, day of week, time of day, recorded cause of accident, vehicle type, terrain, radius of curvature and absolute difference in elevation. Table 2 gives the definitions of the variables. The data are processed into a standard form and input into the WEKA software. Figure 2 shows the tree diagram produced by the software.
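As an illustration of what such a standard form can look like, the sketch below writes records in ARFF, WEKA's native text format, assuming that this is the form meant here (WEKA also reads other formats such as CSV). The helper name, the attribute names and the category codes are hypothetical and cover only some of the eight predictors; the actual coding is the one defined in Table 2.

```python
def to_arff(relation, attributes, rows):
    """Serialize records as ARFF text.

    attributes: list of (name, spec), where spec is either the string
                "numeric" or a list of nominal values
    rows:       list of tuples, one value per attribute, in the same order
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        kind = "numeric" if spec == "numeric" else "{" + ",".join(spec) + "}"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical attribute coding and one invented record.
attributes = [
    ("season", ["spring", "summer", "autumn", "winter"]),
    ("cause", ["speeding", "small_distance", "rain_slip"]),
    ("radius_of_curve", "numeric"),
    ("accident_type", ["rear_end", "solid_collision", "other"]),
]
arff_text = to_arff("accidents", attributes,
                    [("winter", "speeding", 3000, "rear_end")])
```

The target variable is placed last because WEKA's command-line tools treat the last attribute as the class by default.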
The tree has 13 terminal nodes, and collisions with solid objects and rear-end collisions are the main types of accidents. The first number in the parentheses of a terminal node is the number of correct classifications, and the second is the number of errors. For example, the first branch of the model classifies all accidents caused by slipping on rainy roads as collisions with solid objects, of which 44 are correct and 10 are wrong. The vertical order of the elliptical nodes reflects the importance of each attribute. The recorded cause of the accident, at the top of the tree, is the primary factor; its values include speeding, loss of control, rain or snow, poor condition, fatigue driving, small distance between vehicles, and debris falling from the mountainside. This node therefore contains the most important factors affecting the type of accident. The factors on the second level of the model are the radius of curve and whether only cars are involved. The factors below the second level have a weak influence on the classification.
The model clarifies the contribution of each element to the classification. In addition, the tree diagram shows visually which type of accident is more likely to occur in a given situation. Taking rear-end collisions as an example, they are the most common accident type in the following six cases.
(1) Cause of accident record: Small distance between vehicles;
(2) Cause of accident record: Speeding → Radius of curve <= 4000;
The prediction accuracy on the training data is 81.34%, and on the test data 79.35%. The model has a relatively high accuracy in predicting both collisions with solid objects and rear-end collisions. However, for other types of accidents, the prediction accuracy of this model is relatively unsatisfactory.
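Each root-to-leaf path of the tree can be read as an if-then rule. The sketch below encodes only the branches quoted in this section, under the assumption that the text labels used here match the recorded causes; the function name and string labels are hypothetical, and any branch not described in the text returns None rather than a guessed type.

```python
def predict_accident_type(cause, radius_of_curve=None):
    """Apply the tree branches quoted in the text; None if not covered."""
    if cause == "small distance between vehicles":
        return "rear-end collision"
    if (cause == "speeding" and radius_of_curve is not None
            and radius_of_curve <= 4000):
        return "rear-end collision"
    if cause == "rain slip":
        return "collision with solid objects"
    return None  # branch not described in the text


example = predict_accident_type("speeding", radius_of_curve=3000)
```

Written this way, the rules make explicit that the recorded cause is tested first and the radius of curve only afterwards, which mirrors the ordering of importance in the tree.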

Discussion
The final type of an accident always results from a complex interplay between a wide range of factors that are difficult to quantify. In this paper, the J48 algorithm in WEKA was used to build a C4.5 model to analyze the distribution of accident types. The model gives a good overall prediction on both the training data and the test data in this study, which suggests that C4.5 is a suitable method for analyzing the distribution of accident types. The C4.5 model can efficiently handle large data sets with many explanatory variables and can produce useful results.
One of the advantages of the C4.5 model is its ability to handle continuous variables, which is also its biggest advance over its predecessor, ID3. In the C4.5 model, outliers are isolated into a node, so they do not cause further splitting and may eventually be pruned. From a practical point of view, the results of the classification tree are displayed graphically, which makes them easy for non-professionals to understand.
The model also has some limitations. First, because the statistical data are incomplete, the road gradient at the accident location was approximated by the absolute value of the elevation difference per kilometer. In addition, information about the driver was not collected, which reduced the accuracy of the model.

Conclusion
Using eight years of vehicle accident data from the Wenli highway, the model estimation results showed the effects of the explanatory variables on accident type, including road geometrics (radius of curve, absolute difference in elevation), vehicle type, terrain, and time factors (time of day, season). Various impact patterns have been obtained for different types of accidents. This study shows that the C4.5 model is a suitable method for studying the type of traffic accidents.
Although some insights into the causes of different accident types have been obtained, some issues still need further study. At present, the information collection system in China is still relatively underdeveloped, and more comprehensive and accurate information is needed in the future. In addition, for accident types with small samples, the model has low accuracy. Subsequent research could consider collecting data on a single type of accident and studying it on its own.