An Imbalanced Data Classification Algorithm of De-noising Auto-Encoder Neural Network Based on SMOTE

Abstract. The imbalanced data classification problem has long been one of the hot issues in the field of machine learning. The synthetic minority over-sampling technique (SMOTE) is a classical approach to balancing datasets, but it may introduce problems such as noise. The Stacked De-noising Auto-Encoder neural network (SDAE) can effectively reduce data redundancy and noise through unsupervised layer-wise greedy learning. Aiming at the shortcomings of the SMOTE algorithm when synthesizing new minority class samples, this paper proposes a Stacked De-noising Auto-Encoder neural network algorithm based on SMOTE, called SMOTE-SDAE, to deal with imbalanced data classification. The proposed algorithm not only synthesizes new minority class samples, but also de-noises and classifies the sampled data. Experimental results show that, compared with traditional algorithms, SMOTE-SDAE significantly improves the minority class classification accuracy on imbalanced datasets.


Introduction
The classification problem is one of the important research topics in the field of machine learning. Most existing classification methods perform well when classifying balanced data. However, many practical applications involve imbalanced datasets, such as network intrusion detection, text classification, credit card fraud detection and medical diagnosis, in which the minority class recognition rate is the more important one [1][2]. To remedy the scarcity of distributional information in the minority class samples, the SMOTE algorithm was put forward by Chawla et al. [2]. It not only effectively synthesizes minority class samples but also, to a large extent, avoids the over-fitting problem. The algorithm has achieved favourable results in imbalanced dataset classification, but it introduces new problems such as noise.
The auto-encoder neural network, based on the ideas of deep learning, has already obtained huge success in the field of machine learning [3]. It initializes the network weights through unsupervised layer-wise greedy learning, learns data features while reducing irrelevant and redundant data by constantly adjusting the network parameters, and then fine-tunes the network parameters using the back propagation (BP) algorithm. The Stacked De-noising Auto-Encoder neural network (SDAE) can learn more robust representations of the input data by adding noise to the original data, thereby improving the generalization ability of the auto-encoder neural network to the input data [4][5][6]. The imbalanced data classification algorithm of de-noising auto-encoder neural network based on SMOTE proposed in this paper can reduce the noise problem introduced by SMOTE: it de-noises and classifies the sampled data, which improves the minority class classification quality.

Related works

SMOTE Algorithm
The synthetic minority over-sampling technique (SMOTE) is a typical sampling method proposed by Chawla et al. in 2002 [5]. Compared to traditional over-sampling techniques, it can effectively avoid the over-fitting phenomenon of the classifier. The main idea is to insert synthesized minority class samples between each minority class sample and its nearest neighbours, thus increasing the number of minority class samples to balance the dataset. Specifically, suppose the over-sampling ratio is N. First, randomly choose K samples from the P nearest minority class neighbours of each minority class sample. Then, according to (1),

x_new = x + rand(0, 1) × (x_k − x)    (1)

synthesize each minority class sample with each of the chosen K samples respectively to generate N new minority class samples; finally, add the new samples to the original sample set to form a new training sample set.
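As a concrete illustration, the interpolation step of (1) can be sketched in Python with NumPy; the function name and interface below are our own, not those of the original SMOTE implementation:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating each chosen
    minority sample toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))           # pick a minority sample
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)  # distances within the class
        neighbours = np.argsort(d)[1:k + 1]       # skip the sample itself
        xk = minority[rng.choice(neighbours)]     # one random neighbour
        gap = rng.random()                        # uniform in [0, 1)
        synthetic.append(x + gap * (xk - x))      # interpolation, eq. (1)
    return np.array(synthetic)
```

Because each new point is a convex combination of two minority samples, all synthetic samples lie on segments between existing minority samples.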

Auto-encoder Neural Network (AE)
The auto-encoder neural network is an unsupervised learning neural network which reconstructs its input data as faithfully as possible. It initializes the network weights using the greedy layer-wise training method, and fine-tunes the network parameters using the back propagation (BP) algorithm to optimize the overall performance. Using the hidden layer outputs as new input features, the stacked auto-encoder (SAE) deep structure is formed by stacking multiple AEs.

Stacked De-noising Auto-Encoder neural networks (SDAE)
On the basis of the traditional auto-encoder neural network (AE), the DAE proposed by Vincent et al. [6] adds noise with a certain probability distribution to the input data, and makes the auto-encoder learn to remove the noise and reconstruct the undisturbed input as faithfully as possible. The features learned from input corrupted with noise are therefore more robust, which improves the generalization ability of the auto-encoder neural network model to the input data.
Therefore, the cost function of the de-noising auto-encoder neural network is defined according to (3):

J(W, b) = (1/m) Σ_{i=1}^{m} (1/2) ||h_{W,b}(x̃^(i)) − x^(i)||²    (3)

Here, W denotes the weights between neurons, b the biases, and m the number of samples; the sigmoid activation function, whose range is [0, 1], is used in this study. Using the hidden layer outputs as new input features, the Stacked De-noising Auto-Encoder (SDAE) deep structure is formed by stacking multiple DAEs.
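A minimal single-layer sketch of this training procedure, assuming the squared-error reconstruction cost of (3), sigmoid activations and plain gradient descent (function names, learning rate and initialization are illustrative, not the paper's exact settings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden, q_noise=0.1, lr=0.5, epochs=200, seed=0):
    """Train one de-noising auto-encoder layer: corrupt the input with
    Gaussian noise, then learn to reconstruct the clean input."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.normal(0, 0.1, (n, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n)); b2 = np.zeros(n)
    for _ in range(epochs):
        X_noisy = X + q_noise * rng.normal(size=X.shape)  # corruption step
        H = sigmoid(X_noisy @ W1 + b1)                    # encode
        R = sigmoid(H @ W2 + b2)                          # reconstruct
        # gradients of J = (1/m) * sum(0.5 * ||R - X||^2), cf. eq. (3)
        dR = (R - X) * R * (1 - R) / m
        dH = (dR @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dR;       b2 -= lr * dR.sum(0)
        W1 -= lr * X_noisy.T @ dH; b1 -= lr * dH.sum(0)
    return W1, b1, W2, b2

def encode(X, W1, b1):
    """Hidden-layer features, used as input to the next stacked layer."""
    return sigmoid(X @ W1 + b1)
```

Stacking then amounts to calling `train_dae` again on `encode(X, W1, b1)`.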

Dataset description
All the experimental data in this paper are eight binary classification datasets commonly used in the study of imbalanced data classification, obtained from the UCI machine learning repository; detailed descriptions are shown in Table 1.

Evaluation index based on confusion matrix
In this study, the minority class in classification learning is defined as positive, and the majority class as negative. The confusion matrix used to evaluate the two-class problem is shown in Table 2.

Table 2. Confusion matrix for the two-class problem.

                        Actual Positive Sample    Actual Negative Sample
Predict as Positive     TP                        FP
Predict as Negative     FN                        TN

In Table 2, TP is the number of minority class samples classified as minority class, and TN is the number of majority class samples classified as majority class; correspondingly, FP is the number of majority class samples misclassified as minority class, and FN is the number of minority class samples misclassified as majority class.
When learning from imbalanced data, the effect of the minority class on the overall classification accuracy is far smaller than that of the majority class, so classification learning that takes overall accuracy as its criterion usually leads to a low minority class recognition rate: the classifier tends to predict a sample as a majority class sample. The classic classification accuracy criterion therefore does not apply to classifier performance assessment on imbalanced data, and new evaluation criteria such as AUC, F-value and G-mean [7] have been adopted. Their definitions are as follows. AUC (Area Under roc Curve) provides a way to measure classifier performance when it is hard to judge because of ROC (Receiver Operating Characteristic) curve intersections. The AUC value of a classifier is the area under its ROC curve: the larger the area, the better the performance of the classifier.
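AUC can equivalently be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (counting ties as one half); a minimal sketch of that pairwise view:

```python
def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive is scored higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every minority sample outranks every majority sample; 0.5 corresponds to random scoring.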

Experimental results and analysis
Experiments are conducted on the proposed SMOTE-SDAE algorithm and on the SVM, SMOTE-SVM, SDAE and SAE algorithms, and the results are compared using the three evaluation indices described above. The following three tables show the experimental results of AUC, F-value and G-mean for each algorithm. The LIBSVM toolbox is used for all the SVM algorithms in the experiment, with RBF adopted as the kernel function. The average of ten runs of 10-fold cross-validation is used as the result. The experimental environment is Windows 7 64-bit, MATLAB 2012b, a 3.4 GHz CPU and 4 GB RAM.

Figure 1. De-noising auto-encoder neural network [6].

In Fig. 1, the original data x is corrupted by noise with a certain probability q_D to form the disturbed data x̃, which serves as the auto-encoder input. f is the activation function used to compute the activation values of each neuron of the hidden layer, as defined in (2):

y = f(W x̃ + b),  f(z) = 1 / (1 + e^(−z))    (2)

Combining the strengths of SMOTE and SDAE, this paper puts forward an imbalanced data classification algorithm of de-noising auto-encoder neural network based on SMOTE. First, SMOTE is used to balance the dataset. Then, aiming at the noise problem introduced into the new data by SMOTE, more robust features are obtained through unsupervised layer-wise greedy training of SDAE. SDAE improves the generalization ability of the auto-encoder neural network to the input data, which improves the classification accuracy of the minority class and of the samples overall. The detailed description of the algorithm is as follows:
1) Training phase:
a) Set parameters: θ = {W, b}, where W is the network weight and b the bias; v is the neuron number of the visible layer and h1, h2 the neuron numbers of the hidden layers; q_D is the Gaussian noise level; T is the total number of positive samples in the dataset; N is the sample synthesis rate; k is the number of chosen nearest neighbours, with default 5.
b) Load dataset: let Dataset be the original training set.
c) Oversample the dataset: newDataset = SMOTE(T, N, k).
d) Add noise: generate D-newDataset by adding Gaussian noise according to q_D to newDataset.
e) Unsupervised training: learn the network parameters θ = {W, b} with layer-wise greedy learning.
f) Supervised training: use newDataset to fine-tune the network parameters with the L-BFGS optimization algorithm.
2) Test phase: test the trained network with the test set, and return AUC, F-value and G-mean.
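Steps c)-e) of the training phase can be condensed into a runnable single-hidden-layer sketch; the function name, network sizes, the simple balance-to-parity oversampling loop and the gradient-descent settings below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smote_sdae_sketch(X, y, n_hidden=8, q_noise=0.1, k=5,
                      epochs=100, lr=0.5, seed=0):
    """Sketch of the SMOTE-SDAE training phase: oversample the minority
    class (c), corrupt with Gaussian noise (d), learn a de-noising
    encoding (e), and return encoded features plus balanced labels for
    the supervised fine-tuning / classification stage."""
    rng = np.random.default_rng(seed)
    # (c) SMOTE-style oversampling until the classes are balanced
    minority = X[y == 1]
    n_new = (y == 0).sum() - (y == 1).sum()
    synth = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        d = np.linalg.norm(minority - x, axis=1)
        xk = minority[rng.choice(np.argsort(d)[1:k + 1])]
        synth.append(x + rng.random() * (xk - x))       # eq. (1)
    synth = np.asarray(synth) if synth else np.empty((0, X.shape[1]))
    Xb = np.vstack([X, synth])
    yb = np.concatenate([y, np.ones(len(synth), dtype=int)])
    # (d) + (e) one DAE layer: reconstruct clean input from noisy input
    m, n = Xb.shape
    W1 = rng.normal(0, 0.1, (n, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n)); b2 = np.zeros(n)
    for _ in range(epochs):
        Xn = Xb + q_noise * rng.normal(size=Xb.shape)   # (d) add noise
        H = sigmoid(Xn @ W1 + b1)
        R = sigmoid(H @ W2 + b2)
        dR = (R - Xb) * R * (1 - R) / m                 # cost of eq. (3)
        dH = (dR @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dR;  b2 -= lr * dR.sum(0)
        W1 -= lr * Xn.T @ dH; b1 -= lr * dH.sum(0)
    return sigmoid(Xb @ W1 + b1), yb                    # features, labels
```

The returned features and labels would then feed the supervised fine-tuning step f); a full SDAE repeats the de-noising layer on the encoded features.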

F-value is a classification evaluation index comprehensively incorporating recall and precision, as defined in (4):

F-value = ((1 + β²) × Recall × Precision) / (β² × Precision + Recall)    (4)

where Recall = TP / (TP + FN) and Precision = TP / (TP + FP).

With β = 1, as used in the experiment, the F-value balances the equally important relation between recall and precision. G-mean is the geometric mean of the classification accuracies of the minority class and the majority class, defined according to (5):

G-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) )    (5)

G-mean measures the maximized accuracy of the two classes under the condition of maintaining the balance between the classification accuracies of the minority class and the majority class: it is large only when the classification accuracies of both classes are simultaneously high.

Table 3. AUC contrasts of different algorithms.