Risk prediction of early diabetes mellitus based on combination model

To address the currently low detection rate of pre-diabetes, this paper proposes a PSO-SVM model to assist doctors in identifying patients at risk of pre-diabetes. The paper uses the Support Vector Machine as the base classifier with the radial basis kernel as the kernel function, and uses the adaptive Particle Swarm Optimization algorithm to optimize the penalty factor and kernel parameter of the Support Vector Machine, yielding a PSO-SVM model. The model is then compared with Neural Network, Logistic Regression, and Naive Bayes models, using sensitivity, specificity, and the ROC curve to evaluate performance. Empirical analysis indicates that the proposed combined model can effectively identify patients at risk of pre-diabetes.


Introduction
Pre-diabetes is a transitional state between normal blood sugar and diabetes, and an early warning signal for developing diabetes [1]. With continued economic growth, the prevalence of pre-diabetes continues to increase. The latest data on diabetes in China show that the pre-diabetes detection rate is only 35.7% and that there are 148 million pre-diabetes patients; the pre-diabetes prevalence rate in China is 3% to 4%, and 7.7% to 8.95% of these patients develop diabetes every year [2]. Improving the pre-diabetes detection rate and providing medical guidance and intervention to pre-diabetes patients as early as possible have therefore become urgent problems for pre-diabetes prevention and control.
With the application of big data and the development of machine learning, patient medical data has gradually received attention from the industry. Processing, analyzing, and clustering massive data can provide prediction and decision support for disease prevention and control and for medical management. Fang Hongxia, Wei Jincai, Li Hongjie, et al. established a gestational diabetes mellitus (GDM) risk assessment model that identifies gestational diabetes more effectively from routine clinical indicators [3]; Wei Zhe, Zhang Yugang, Shi Dongdong, et al. used a Support Vector Machine tuned by grid search and cross-validation to diagnose and predict diabetes complications [4]; Naz Huma and Ahuja Sachin used data mining algorithms and deep learning to predict diabetes based on Indian data sets [5]. These studies suggest that data mining methods for pre-diabetes risk identification can assist doctors in risk diagnosis.
To sum up, the main research directions for data mining methods in pre-diabetes prediction are, first, how to mine patient medical data for information relevant to disease prediction and, second, how to reduce the training cost of the constructed model while improving its accuracy. On this basis, this paper proposes using the Support Vector Machine algorithm to build the prediction model and the Particle Swarm Optimization algorithm to optimize its performance.

Related methods
Support Vector Machine (SVM) is a nonlinear black-box model based on statistical learning theory [6]. It is a strong binary classification model for nonlinear, small-sample, and high-dimensional problems in machine learning. Its basic principle is to map the selected features, as an input vector, into a higher-dimensional space through a nonlinear mapping function, and then to find the optimal separating hyperplane for the samples in that space.
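The nonlinear mapping is normally applied implicitly through a kernel function; this paper uses the radial basis kernel. The following is a minimal sketch of that kernel, assuming the common form exp(-||x - z||^2 / (2δ^2)). The paper names the kernel parameter δ but does not state the exact parametrization, so both the formula and the function name rbf_kernel are assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, delta):
    """Radial basis kernel value for two equal-length feature vectors.

    Assumed form: exp(-||x - z||^2 / (2 * delta^2)). The source paper does
    not state which parametrization of the kernel it uses.
    """
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    if delta <= 0:
        raise ValueError("delta must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * delta ** 2))
```

Identical samples give a kernel value of 1, and the value decays toward 0 as the samples move apart; a larger δ makes that decay slower.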
The basic principle of Particle Swarm Optimization (PSO) is inspired by the cooperative behavior of birds while foraging [7]. PSO models each individual as a massless particle with only two attributes, velocity and position, where the velocity represents the speed and direction of movement. Each particle moves continuously through the search space. After a certain number of iterations, the best position found by each particle and the best solution found by the whole swarm are obtained.
In this paper, the Particle Swarm Optimization algorithm is combined with the Support Vector Machine model: PSO optimizes the parameters of the SVM so that the performance of the classification model is optimized. The resulting PSO-SVM model is used to evaluate the individual characteristics in the data set and to determine the risk index of patients with pre-diabetes. The specific implementation process is as follows:
1) According to the experimental needs, the number of particles, the maximum number of iterations, and the learning factors c1 and c2 are set, along with the initial inertia weight ω.
2) During particle optimization, the spatial position of each particle is regarded as a candidate solution. The velocity and position of each particle are updated according to the particle's own best position (pbest_i) and the global best position (gbest), and the fitness of each particle is evaluated.
3) After each round of iteration, the current fitness of each particle is compared with the fitness of its pbest_i position, and the better position becomes the new pbest_i; across the whole swarm, a particle position whose fitness is better than that of the gbest position becomes the new gbest. The velocity and position of each particle are then updated again, until the maximum number of iterations is reached.
4) After optimization, the optimal parameters C and δ are substituted into the SVM model, which is trained on the training set to obtain the PSO-SVM model. The model is then used to evaluate the data set and predict the risk of diabetes.
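The four steps above can be sketched as a short PSO loop. This is an illustrative sketch, not the authors' implementation. The defaults for the swarm size, iteration count, learning factors, and initial inertia weight are the values reported in the paper's experiment; the final inertia weight w_end, the search bounds, and the function names are assumptions. The toy_fitness function is a hypothetical stand-in: in the paper's setting the fitness would be the accuracy of an SVM trained with the candidate (C, δ).

```python
import random


def pso_maximize(fitness, bounds, n_particles=45, n_iters=130,
                 c1=0.5, c2=0.5, w_start=0.8, w_end=0.4, seed=0):
    """Maximize fitness(position) inside a box; returns (gbest, gbest_fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: random start positions inside the bounds, zero velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for t in range(n_iters):
        # Linearly decreasing inertia weight; w_end is an assumed value.
        w = w_start - (w_start - w_end) * t / max(n_iters - 1, 1)
        for i in range(n_particles):
            # Step 2: move the particle toward its own best and the global best.
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            # Step 3a: keep the better of the current and personal-best positions.
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
        # Step 3b: after the round, promote the best personal best to gbest.
        best = max(range(n_particles), key=lambda i: pbest_fit[i])
        if pbest_fit[best] > gbest_fit:
            gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    # Step 4 would train the SVM with the parameters stored in gbest.
    return gbest, gbest_fit


def toy_fitness(params):
    """Hypothetical smooth objective with its maximum at C = 4, delta = 3."""
    c, delta = params
    return -((c - 4.0) ** 2 + (delta - 3.0) ** 2)


best_params, best_fitness = pso_maximize(
    toy_fitness, bounds=[(0.1, 10.0), (0.1, 10.0)])
```

Replacing toy_fitness with a function that trains an SVM and returns its validation accuracy would turn this loop into a parameter search of the kind described; each fitness call then costs one model fit, which is why the swarm size and iteration limit drive the training cost.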

Data sources
The original data come from the early diabetes risk data set in the machine learning repository of the University of California, Irvine. The data set has a total of 520 samples. After preliminary analysis and generalization of the characteristics of the original data, the 14 features in each sample other than age and gender are divided into two categories: main symptoms of diabetes and complications of diabetes.

Risk assessment of pre-diabetes
A total of 256 diabetic patients (80% of all diabetic patients) and 160 non-diabetic patients (80% of all non-diabetic patients) were randomly selected as the training set of the model. The radial basis function kernel was used as the kernel function of the support vector machine, and the particle swarm optimization algorithm was used to optimize its parameters. The accuracy of the support vector machine model was set as the objective function, the number of particles was 45, the maximum number of iterations was 130, the learning factors were c1 = c2 = 0.5, and the initial inertia weight was ω = 0.8. The Linear Decreasing Weight (LDW) [8] strategy was introduced to strengthen global search in the early iterations and local search in the later iterations. The optimized parameters of the SVM are C = 4.663 and δ = 2.859. The optimization process is shown in the figure.
Substituting the optimal parameters C and δ into the support vector machine gives the PSO-SVM model, that is, the support vector machine tuned by the particle swarm algorithm, which is used to assess each patient's risk index for diabetes. Sorted from small to large, the risk index is divided into five categories: very low, low, medium, high, and very high. Different medical or preventive measures can be taken for the different categories. The risk index of diabetes in the data set is shown in Table 3 (Diabetes Risk Index).
Based on the training set, the PSO-SVM model is constructed to quantify the diabetes risk index (a probability from 0 to 1) for each patient. The closer the value is to 1, the higher the risk. It can be seen that the number of patients at medium risk is the largest. The order is very low, low, high, low and very high. For patients with low and very low risk, education and publicity about diabetes prevention can be carried out. For patients with medium risk, certain medical measures should be taken to control the condition. For patients with high and very high risk, the risk should be taken seriously and the patient's physiological indicators should be brought under control promptly.
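The mapping from risk index to the five categories can be illustrated as follows. The paper does not report the cut-off values, so the equal-width thresholds at 0.2, 0.4, 0.6, and 0.8 used here are an assumption, as are the names RISK_LEVELS, CUTOFFS, and risk_category.

```python
from bisect import bisect_right

RISK_LEVELS = ("very low", "low", "medium", "high", "very high")
# Assumed equal-width cut-offs; the source paper does not list its thresholds.
CUTOFFS = (0.2, 0.4, 0.6, 0.8)


def risk_category(risk_index):
    """Map a risk index in [0, 1] to one of the five risk levels."""
    if not 0.0 <= risk_index <= 1.0:
        raise ValueError("risk index must lie in [0, 1]")
    # bisect_right counts how many cut-offs are <= risk_index,
    # which is exactly the index of the matching level.
    return RISK_LEVELS[bisect_right(CUTOFFS, risk_index)]
```

With these assumed cut-offs, a patient with a predicted probability of 0.5 falls in the medium category, and a value exactly on a cut-off is assigned to the higher category.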

Result analysis
The remaining 20% of the patient data is used to test and evaluate the performance of the PSO-SVM model. The ROC curve and the AUC are used to evaluate the model [9], and the PSO-SVM model is compared with Logistic Regression, Neural Network, and Naive Bayes [10]-[12]. The resulting ROC curves are shown in the figure. On the test set, the AUC of the PSO-SVM model is 0.989, while the AUCs of the neural network, naive Bayes, and logistic regression models are 0.958, 0.926, and 0.955, respectively. The figure shows that the PSO-SVM model outperforms the comparison algorithms and has better predictive performance in diabetes risk prediction.
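For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The function name auc_score and the toy inputs are assumptions for illustration and do not reproduce the paper's data.

```python
def auc_score(is_positive, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    is_positive: booleans, True for the positive (diabetic) class.
    scores: predicted risk, higher meaning more likely positive.
    """
    positives = [s for flag, s in zip(is_positive, scores) if flag]
    negatives = [s for flag, s in zip(is_positive, scores) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_score([True, True, False, False], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a test set of about a hundred patients; larger sets are usually handled with a rank-based formula.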
In addition, the cost of misclassifying a diabetic patient as non-diabetic differs from the cost of misclassifying a non-diabetic patient as diabetic. The former cost is higher than the latter, so the specificity and sensitivity of the model must be calculated. In this paper, a confusion matrix is used to evaluate the performance of the PSO-SVM model in detail [13], with 0 defined as a diabetic patient (positive) and 1 as a non-diabetic patient (negative). The calculated specificity is 0.96 and the sensitivity is 1.00, which is within an acceptable range.
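Sensitivity is the share of true diabetic patients that the model flags, TP / (TP + FN), and specificity is the share of non-diabetic patients that it clears, TN / (TN + FP). The following sketch computes both from the confusion-matrix counts, using the paper's label coding (0 = diabetic, positive). The function name and the toy labels are assumptions and are not the paper's test data.

```python
def sensitivity_specificity(y_true, y_pred, positive_label=0):
    """Return (sensitivity, specificity) from true and predicted labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive_label:
            if pred == positive_label:
                tp += 1   # diabetic patient correctly flagged
            else:
                fn += 1   # diabetic patient missed (the costlier error)
        else:
            if pred == positive_label:
                fp += 1   # non-diabetic patient wrongly flagged
            else:
                tn += 1   # non-diabetic patient correctly cleared
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels: all four diabetic cases found, one of four non-diabetic cases flagged.
toy_truth = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 1, 1, 1, 0]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_truth, toy_pred)
```

In the toy example every diabetic case is detected, so sensitivity is 1.0, while one false alarm among four non-diabetic cases gives a specificity of 0.75.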

Conclusion
This paper constructs a PSO-SVM model on the early diabetes risk data set to predict patients' diabetes risk, assist doctors in identifying that risk, and enable medical prevention and control in advance. After particle swarm optimization, the optimal parameters of the support vector machine are C = 4.663 and δ = 2.859. The AUC of the PSO-SVM model is 0.989, and its specificity and sensitivity are 0.96 and 1.00; the AUC is higher than that of the comparison algorithms, which suggests the model can be used in practical applications. However, during model construction, this paper sets model accuracy as the optimization goal of particle swarm optimization. To give the model better practical value, a weighted combination of specificity and sensitivity could be used as the optimization goal instead, which may further reduce the cost of misclassification.
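One possible form of such an objective is a convex combination of the two measures. This is a sketch of the idea only; the paper does not specify the function, and the default weight of 0.7 on sensitivity and the name weighted_objective are hypothetical choices that reflect the higher cost of missing a diabetic patient.

```python
def weighted_objective(sensitivity, specificity, sensitivity_weight=0.7):
    """Convex combination of sensitivity and specificity.

    sensitivity_weight is a hypothetical setting; a value above 0.5
    penalizes missed diabetic patients more than false alarms.
    """
    if not 0.0 <= sensitivity_weight <= 1.0:
        raise ValueError("sensitivity_weight must lie in [0, 1]")
    return (sensitivity_weight * sensitivity
            + (1.0 - sensitivity_weight) * specificity)


# Two hypothetical models with mirrored error profiles.
misses_nobody = weighted_objective(sensitivity=1.0, specificity=0.8)
misses_some = weighted_objective(sensitivity=0.8, specificity=1.0)
```

With the weight above 0.5, the model that misses no diabetic patients scores higher than the mirrored model, whereas plain accuracy on a balanced test set would rate the two equally.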