Classification of fragile states based on machine learning

The study of fragile states has become a significant issue in global security, development and poverty research. The existing classification method for fragile states, which simply sums the national indicator scores and segments the total by thresholds, is not reasonable enough. We introduce a new method based on machine learning that makes classifying a country easier and more reasonable. We use two kinds of classifier: the support vector machine and gradient boosted regression trees. Both models have flaws, so we use ensemble learning techniques to combine them. First, we subjectively label part of the national data so that the machine can learn from these data why a country becomes vulnerable and how to assign a country to a vulnerability class. Then we train the model with the data and successfully divide states into four categories (Alert, Warning, Stable and Sustainable). Our model achieves 93% accuracy on the test set and 96% accuracy on the training set, which is better than the 77% obtained with the threshold segmentation method.


Introduction
The study of fragile states, or of state vulnerability, has become one of the core issues that international academics and policy makers discuss in connection with today's global security, development and poverty problems [1~3]. Understanding state vulnerability and its consequences has only recently emerged as a field of study [4]. Over the past five years, how to classify a fragile state has become one of the main priorities of international groups [5~6]. Well-known universities and research institutes in many countries have set up research centers to study "fragile states", and academic papers on the topic are being published continually [7~9].
In the past, fragile states were simply separated by subjective thresholds [10~11]. First, staff score economic, environmental and other indicators (each from 0 to 10). Then they add up the scores and choose thresholds according to some rules, using these thresholds to separate countries into classes.
In fact, this is not fair or reasonable for some countries. A typical example is a country for which 11 indicators are perfect but one, human rights, is very low, with only two points. In this case the human rights indicator affects the country's vulnerability both directly and indirectly (for example through economic inequality and group grievance). Such a country should be classified as endangered, yet the threshold segmentation method rates it as excellent.
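The weakness of summing and thresholding can be illustrated with a minimal sketch. The cut points (30, 60, 90 on a 0 to 120 total), the assumption of 12 indicators where a higher score is better, and the function name `threshold_class` are all hypothetical choices made for illustration; they are not the thresholds used in [10~11].

```python
# Hypothetical threshold segmentation: sum 12 indicator scores (0-10 each,
# higher is better) and cut the total at illustrative thresholds.
CLASSES = ["Alert", "Warning", "Stable", "Sustainable"]
THRESHOLDS = [30, 60, 90]  # assumed cut points on the 0-120 total


def threshold_class(scores, thresholds=THRESHOLDS):
    """Return the class whose band contains the summed score."""
    total = sum(scores)
    band = sum(total >= t for t in thresholds)  # number of cut points passed
    return CLASSES[band]


# Eleven perfect indicators and a human rights score of 2.
scores = [10] * 11 + [2]
total = sum(scores)
label = threshold_class(scores)
print(total, label)
```

Under these assumed cut points the single very low indicator barely moves the total, so the country still lands in the best band.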

Support Vector Machine
SVM is a new-generation machine learning method proposed by Vapnik on the basis of statistical learning theory [12]. It has significant advantages for small-sample, nonlinear and high-dimensional problems. SVM is now widely used for classification and regression problems in many fields, with good predictive performance.
SVC is an abbreviation of support vector classification, an important branch of the support vector machine (SVM) [13].
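The classification scheme is as follows. Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, in two classes, and a vector $y \in \{1, -1\}^n$, SVC solves the following primal problem:
$$\min_{w, b, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, n.$$
Its dual is
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n,$$
where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
The decision function is:
$$\operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b \right)$$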
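As a minimal sketch of how such a decision function is evaluated, the snippet below takes the sign of a kernel-weighted sum over support vectors plus an offset. The RBF kernel, the value of `gamma`, and the support vectors, labels, multipliers and offset are made-up values chosen for illustration only; nothing here is fitted to the data used in this paper.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Sum of y_i * alpha_i * K(x_i, x) over the support vectors, plus b."""
    return sum(
        y_i * a_i * kernel(x_i, x)
        for x_i, y_i, a_i in zip(support_vectors, labels, alphas)
    ) + b


def predict(x, support_vectors, labels, alphas, b):
    """Class label in {1, -1} given by the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, b)
    return 1 if value >= 0 else -1


# Made-up support vectors and multipliers, not fitted to any data.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]  # chosen so that the sum of y_i * alpha_i is zero
b = 0.0

near_negative = predict((0.2, 0.1), support_vectors, labels, alphas, b)
near_positive = predict((1.9, 2.1), support_vectors, labels, alphas, b)
```

A point close to the negative support vector receives the label -1 and a point close to the positive one receives 1, because the kernel weight decays with distance.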

Gradient Boosted Regression Trees
Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions [14~16]. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas, including Web search ranking and ecology [16].
The classification scheme is as follows. GBRT considers additive models of the following form:
$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$
where $h_m(x)$ are the basis functions, which are usually called weak learners in the context of boosting. Gradient Tree Boosting uses decision trees of fixed size as weak learners. Decision trees have a number of abilities that make them valuable for boosting, namely the ability to handle data of mixed type and the ability to model complex functions.
Similar to other boosting algorithms, GBRT builds the additive model in a forward stagewise fashion:
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$
At each stage the decision tree $h_m(x)$ is chosen to minimize the loss function $L$ given the current model $F_{m-1}$ and its fit $F_{m-1}(x_i)$:
$$h_m = \arg\min_{h} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + h(x_i)\right)$$
The initial model $F_0$ is problem specific; for least-squares regression one usually chooses the mean of the target values.
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent. The steepest descent direction is the negative gradient of the loss function evaluated at the current model $F_{m-1}$, which can be calculated for any differentiable loss function:
$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L\left(y_i, F_{m-1}(x_i)\right)$$
where the step length $\gamma_m$ is chosen using line search:
$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L\left(y_i, F_{m-1}(x_i)\right)}{\partial F_{m-1}(x_i)}\right)$$
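The forward stagewise procedure can be sketched for the simplest case, squared loss on a one-dimensional feature with decision stumps as weak learners. This is an illustrative toy, not the model trained in this paper: the helper names (`fit_stump`, `gradient_boost`), the toy data and the number of stages are assumptions. For squared loss the negative gradient is the residual, and the line search has a closed form.

```python
def fit_stump(xs, targets):
    """Least-squares decision stump on a 1-D feature (needs >= 2 distinct xs)."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        split = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_stages=20):
    """Forward stagewise boosting for squared loss L = (y - F)^2 / 2."""
    f0 = sum(ys) / len(ys)          # initial model: mean of the targets
    stages = []                     # list of (step length, weak learner)
    preds = [f0] * len(ys)
    history = [mse(ys, preds)]
    for _ in range(n_stages):
        # The negative gradient of the squared loss is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        h_vals = [h(x) for x in xs]
        # Closed-form line search for the step length under squared loss.
        denom = sum(v * v for v in h_vals)
        gamma = sum(r * v for r, v in zip(residuals, h_vals)) / denom if denom else 0.0
        stages.append((gamma, h))
        preds = [p + gamma * v for p, v in zip(preds, h_vals)]
        history.append(mse(ys, preds))

    def model(x):
        return f0 + sum(g * h(x) for g, h in stages)

    return model, history


# Toy data, roughly linear; purely illustrative.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0]
model, history = gradient_boost(xs, ys)
```

Because each step length is chosen by line search, the training error recorded in `history` cannot increase from one stage to the next.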

SVC-GBRT model
Neither the SVC model nor the GBRT model alone may be expressive enough, so we combine the two with ensemble learning to obtain the final model. The final classification scheme is as follows. We train SVC and GBRT separately and then combine them with the stacking algorithm. First, each sub-classifier computes the probability distribution of the target over the categories. Then the probabilities from the classifiers are summed and averaged, which gives the final probability distribution.
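The averaging step described above (often called soft voting) can be sketched in a few lines. The two probability lists are invented placeholders standing in for the outputs of the trained SVC and GBRT sub-classifiers, and the function names are hypothetical.

```python
CLASSES = ["Alert", "Warning", "Stable", "Sustainable"]


def average_probabilities(distributions):
    """Average several per-class probability lists element by element."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def most_probable_class(probabilities, classes=CLASSES):
    """Return the class with the highest averaged probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best_index]


# Illustrative outputs of the two sub-classifiers for one country-year.
svc_probs = [0.50, 0.30, 0.15, 0.05]
gbrt_probs = [0.70, 0.20, 0.05, 0.05]

combined = average_probabilities([svc_probs, gbrt_probs])
label = most_probable_class(combined)
```

If each input distribution sums to one, the averaged distribution does as well, so it can be read directly as the ensemble's class probabilities.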

Classification of fragile states
All the data come from the World Bank. We downloaded the relevant data from its website and carried out extensive analysis and processing.
We obtained data for every country from 2007 to 2015, and we added variance and range indicators for some indicators with unbalanced distributions. We labelled each country year by year by human annotation. We split the resulting data set into a training set, a validation set and a test set for cross-validation. The code was written in Python; the model was trained with cross-validation, and the best number of training iterations was 600. Finally, we obtained the classification error curve shown in Figure 5.
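One way to add spread features of this kind is sketched below. It is an assumption-laden illustration: the paper does not state whether population or sample variance was used (population variance is assumed here), the function name `add_spread_features` is hypothetical, and the scores are invented.

```python
from statistics import pvariance


def add_spread_features(indicator_scores):
    """Append the population variance and the range of the indicator scores."""
    variance = pvariance(indicator_scores)
    value_range = max(indicator_scores) - min(indicator_scores)
    return list(indicator_scores) + [variance, value_range]


# Two invented score vectors with the same total but different balance.
balanced = add_spread_features([8, 8, 8, 8])
unbalanced = add_spread_features([10, 10, 10, 2])
```

Both vectors sum to 32, so a sum-and-threshold rule cannot tell them apart, whereas the appended variance and range expose the single very low indicator to the classifier.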
For comparison we also evaluated SVC and GBRT on their own. As shown in Figure 5, the test error rate of the SVC-GBRT model reached 0.07 (against 0.12 for GBRT and 0.14 for SVC), and its training error rate reached 0.04 (against 0.5 for GBRT and 0.14 for SVC). After training, we used the model to predict the class of a typical country, the Central African Republic; the results are shown in Figure 6.
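For reference, a classification error rate of this kind is the fraction of samples whose predicted class differs from the annotated class, and accuracy is one minus that fraction (so an error rate of 0.07 corresponds to 93% accuracy). The sketch below uses invented labels, not the paper's data.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of samples whose predicted class differs from the annotation."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Invented annotations and predictions for five country-years.
true_labels = ["Alert", "Warning", "Stable", "Sustainable", "Alert"]
predicted = ["Alert", "Warning", "Warning", "Sustainable", "Alert"]

err = error_rate(true_labels, predicted)
accuracy = 1 - err
```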
As Figure 6 shows, the probability of the Central African Republic being classed as "Alert" is 0.92, as "Warning" 0.3, as "Stable" 0.1 and as "Sustainable" 0.04. The country was therefore classified as "Alert". Assuming that our annotation is reasonable enough, we then compared the machine learning classification results with the previous threshold segmentation results.

Conclusions
The traditional threshold segmentation method does not classify the vulnerability of a country reasonably or fairly. We therefore introduce a new classification method that uses machine learning to assign a country to a vulnerability class. First, we subjectively label part of the national data so that the machine can learn from these data why a country becomes vulnerable. Through model training and testing, we successfully used the model to separate countries into four types (Alert, Warning, Stable and Sustainable). The accuracy on the test data set was 93%, better than the 77% achieved by the threshold segmentation method.

Fund
Qinghai University Youth Research Fund 2016.