A new learning algorithm based on strengthening boundary samples for convolutional neural networks

CNN is an artificial neural network that can automatically extract features with relatively few parameters, which gives it an advantage in image classification tasks. The purpose of this paper is to propose a new algorithm that improves the classification performance of CNN by strengthening boundary samples. Samples whose predicted values lie near the classification boundary are recorded as hard samples. In this algorithm, the errors of the hard samples are added as a penalty term to the original loss function. Multi-class and binary classification experiments were performed using the MNIST dataset and three sub-datasets of CIFAR-10, respectively. The experimental results show that the accuracy of the new algorithm is improved in both binary and multi-class classification problems.


Introduction
Recently, various strategies have been introduced to boost the performance of Convolutional Neural Networks (CNNs). Irsoy et al. [1] introduced dropout to prevent overfitting and improve generalization. Zoumpourlis et al. [2] presented a second-order convolutional network that combines linear and non-linear filters. Juefei-Xu et al. [3] introduced the local binary convolution (LBC) to replace the convolutional layer, which improved the computing time of the network. Uchida et al. [4] presented coupled convolutional layers with mutually constrained weights to produce better performance.
Several existing studies optimize the learning algorithms for CNN. Zhining et al. [5] combined the genetic algorithm and CNN to extract better expression features than CNN alone does. Hu et al. [6] adopted Gaussian convolution in CNN to accelerate the training convergence. Kim et al. [7] used the Extreme Learning Machine to calculate the weights between the hidden and output layers to shorten the training time. Yang et al. [8] combined dropout and the stochastic gradient descent optimizer to form a modified CNN algorithm, which improved the accuracy of CNN.
In this paper, a new learning algorithm based on strengthening boundary samples is proposed to improve the classification accuracy of CNN. The importance of each sample is adjusted in each training step according to its predicted value, so that more attention is paid to the hard-to-classify samples. The rest of this paper is organized as follows. In Sect. 2, the structure of CNN is briefly introduced. In Sect. 3, we present the new learning algorithm. Sect. 4 discusses the experimental results. Finally, Sect. 5 concludes the paper.

Convolutional neural network
Generally, a CNN contains several blocks consisting of convolution and pooling layers, followed by a fully connected layer. The convolutional layer applies a number of convolution filters to extract the local characteristics of the image:

$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big), \quad (2.1)$

where $x_j^l$ denotes the j-th feature map in the l-th layer of the network, $M_j$ is the set of feature maps in the (l-1)-th layer, $k_{ij}^l$ represents the convolution filter, $b_j^l$ is the bias of the j-th feature map in the l-th layer, * is the 2D convolution operation, and $f(\cdot)$ is the activation function. The pooling layer compresses the data and parameters:

$x_j^l = f\big(\beta_j^l \,\mathrm{down}(x_j^{l-1}) + b_j^l\big), \quad (2.2)$

where $\beta_j^l$ is the weight and $\mathrm{down}(\cdot)$ is the pooling function.
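To make the layer definitions concrete, a minimal NumPy/SciPy sketch of Equations (2.1) and (2.2) is given below. The sigmoid activation, mean pooling, and the list-of-feature-maps representation are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(x_prev, kernels, biases):
    """Eq. (2.1): x_j^l = f(sum_{i in M_j} x_i^{l-1} * k_ij^l + b_j^l).

    x_prev  : list of 2-D feature maps of layer l-1
    kernels : kernels[j][i] is the filter linking input map i to output map j
    biases  : biases[j] is the scalar bias of output map j
    """
    out = []
    for j, b_j in enumerate(biases):
        s = sum(convolve2d(x_i, kernels[j][i], mode="valid")
                for i, x_i in enumerate(x_prev))
        out.append(sigmoid(s + b_j))
    return out

def pool_layer(x_prev, betas, biases, size=2):
    """Eq. (2.2): x_j^l = f(beta_j^l * down(x_j^{l-1}) + b_j^l), with mean pooling."""
    out = []
    for x_j, beta_j, b_j in zip(x_prev, betas, biases):
        h, w = x_j.shape
        down = (x_j[:h - h % size, :w - w % size]
                .reshape(h // size, size, w // size, size)
                .mean(axis=(1, 3)))
        out.append(sigmoid(beta_j * down + b_j))
    return out
```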

The new learning algorithm
The back propagation (BP) algorithm [9] is a popular algorithm for training CNNs. Generally, the mean square error and the cross-entropy loss are used as the loss functions for binary and multi-class classification problems, respectively:

$E = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad (3.1)$

$E = -\frac{1}{n}\sum_{i=1}^{n} y_i \log \hat{y}_i, \quad (3.2)$

where n is the batch size, and $y_i$ and $\hat{y}_i$ represent the target value and the predicted value of the i-th sample, respectively.
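As a small illustration, the two batch losses can be written in NumPy as follows (the variable names y and y_hat are ours, not the paper's):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Eq. (3.1): mean square error over a batch of n samples (binary case)."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Eq. (3.2): cross-entropy over a batch; y is one-hot, y_hat is the softmax output."""
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))
```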
It can be seen from Equations (3.1) and (3.2) that all samples are treated as equally important during training. In fact, some samples are classified with high confidence, while others are classified with low confidence. The network does not adjust its treatment of complex and simple samples during training, which limits classification performance. In this paper, we address this problem by continually changing the importance of the training samples during training so as to appropriately strengthen the training of complex samples.
Firstly, we divide the samples into hard samples and easy samples. Easy samples are those with clearly high or clearly low predicted values, because for these samples the classification result and the corresponding loss are in line with our expectations. For example, a sample whose predicted value is far from its target has a large loss; it is classified poorly and has a correspondingly large influence on training, which is what we expect. However, some samples have predicted values near the classification boundary: their classification result is poor, yet their corresponding loss values are relatively small, which leads to lower classification accuracy. We call these samples hard samples.

The specific distinguishing method is to take an interval around the classification boundary value: samples with predicted values inside the interval are recorded as hard samples, and samples with predicted values outside the interval are recorded as easy samples. The interval is chosen using the average of the predicted values of the current batch. Take binary classification as an example: let the classification boundary value be 0.5, let a be the batch average of the predicted values, and let d be the distance from a to 0.5. Taking 0.5 as the center of the interval and 2d as its length gives the interval [0.5 - d, 0.5 + d]. Figure 1 shows the sample division in the binary classification case.

The penalty term extracts information about the degree to which a sample is hard to classify. The loss function of PCNN adds the penalty term P, built from the errors of the hard samples and weighted by λ, to the original loss:

$E_{\mathrm{PCNN}} = E + \lambda P, \quad (3.3)$

where the averages of the predicted values of the k-th and the l-th hard-sample categories in the batch enter the penalty term; Equation (3.4) is the corresponding multi-class form based on the cross-entropy loss (3.2). When λ = 0, the new loss function is equal to the original loss function, so the modified loss function can be regarded as a generalization of the original one. In addition, the effect of the penalty term changes as λ changes; by choosing the value of λ appropriately, the CNN can be trained better while avoiding overfitting. In the experiments, the accuracy is highest when λ = 0.7, so λ = 0.7 is selected for the subsequent experiments. Figure 2 shows the classification accuracy on the MNIST dataset as a function of λ.
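The sketch below illustrates, in NumPy, the hard/easy split of a batch and a penalized loss of the form E + λP for the binary case. Because the exact expression of the penalty term in Equations (3.3) and (3.4) is not reproduced here, the sketch simply re-applies the base loss to the hard samples and weights it by λ; this is an assumption made for illustration, not necessarily the paper's precise formula.

```python
import numpy as np

def hard_sample_mask(y_hat, boundary=0.5):
    """Binary case: predictions inside [0.5 - d, 0.5 + d] are hard samples,
    where d = |mean(y_hat) - boundary| for the current batch."""
    d = abs(np.mean(y_hat) - boundary)
    return np.abs(y_hat - boundary) <= d

def pcnn_binary_loss(y, y_hat, lam=0.7, boundary=0.5):
    """E_PCNN = E + lam * P, with P taken here as the MSE of the hard samples."""
    base = np.mean((y - y_hat) ** 2)          # Eq. (3.1)
    hard = hard_sample_mask(y_hat, boundary)
    if not np.any(hard):
        return base                           # no hard samples: original loss
    penalty = np.mean((y[hard] - y_hat[hard]) ** 2)
    return base + lam * penalty
```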

Experimental results

CIFAR-10
The pictures in CIFAR-10 are equally divided into ten categories, and each category contains 5,000 training samples and 1,000 test samples. We constructed three sub-datasets for the binary classification problem from the CIFAR-10 dataset: the first and second classes constitute the first sub-dataset, the fifth and eighth classes constitute the second, and the ninth and tenth classes constitute the third. For these sub-datasets, the loss function is Equation (3.3).

Figure 3 shows the test accuracies of PCNN and CNN on the first sub-dataset over different numbers of iterations. PCNN performs better than CNN throughout training. At 1,600 iterations, the accuracy of CNN reaches 95.495%, while the accuracy of PCNN reaches 95.885%. The advantage of PCNN is especially pronounced when the number of iterations is less than 800; at 160 iterations, the classification accuracy of PCNN is 3.37% higher than that of CNN.

On the second sub-dataset, Figure 4 shows the classification accuracy curves of PCNN and CNN under different numbers of iterations. The PCNN curve is always above the CNN curve. At 1,600 iterations, the classification accuracy of the traditional CNN is 86.95%, and that of PCNN is 87.306%. At 640 iterations, the classification accuracy of PCNN is 1.188% higher than that of CNN.

Figure 5 shows the classification accuracy curves of PCNN and CNN on the third sub-dataset. For both PCNN and CNN, the classification accuracy improves as the number of iterations increases. After 1,600 iterations, the accuracy of CNN reaches 94.3%, and the accuracy of PCNN reaches 94.54%. On this dataset, the improvement of PCNN is less pronounced, because the difference between the hard and easy samples is small and the effect of the penalty term is therefore small.
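For reference, one way to build the three binary sub-datasets is sketched below using torchvision (our tooling assumption; the paper does not state how the subsets were extracted). The class indices follow the description above: the 1st/2nd classes map to labels 0 and 1, the 5th/8th to 4 and 7, and the 9th/10th to 8 and 9.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def binary_subset(dataset, class_a, class_b):
    """Keep only two CIFAR-10 classes and relabel them 0/1 for the binary loss (3.3)."""
    targets = np.array(dataset.targets)
    mask = (targets == class_a) | (targets == class_b)
    x = dataset.data[mask]                        # (N, 32, 32, 3) uint8 images
    y = (targets[mask] == class_b).astype(np.float32)
    return x, y

train = CIFAR10(root="./data", train=True, download=True)
pairs = [(0, 1), (4, 7), (8, 9)]                  # the three sub-datasets
subsets = [binary_subset(train, a, b) for a, b in pairs]
```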

MNIST
MNIST is a dataset of handwritten digit images divided into 10 categories, with corresponding labels 0 to 9. In this experiment, we use Equation (3.4) as the loss function to explore the impact of PCNN on multi-class classification problems.
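A multi-class counterpart of the penalized loss, in the spirit of Equation (3.4), is sketched below. Selecting hard samples by comparing the confidence of the predicted class with the batch-average confidence is our illustrative choice; the paper's exact multi-class selection rule may differ.

```python
import numpy as np

def pcnn_multiclass_loss(y, y_hat, lam=0.7, eps=1e-12):
    """y: one-hot targets (n, 10); y_hat: softmax outputs (n, 10)."""
    per_sample = -np.sum(y * np.log(y_hat + eps), axis=1)   # Eq. (3.2) per sample
    base = np.mean(per_sample)
    conf = np.max(y_hat, axis=1)              # confidence of the predicted class
    hard = conf <= np.mean(conf)              # below-average confidence -> hard
    if not np.any(hard):
        return base
    return base + lam * np.mean(per_sample[hard])
```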
Experiments were performed with PCNN and CNN over different numbers of iterations, and the results are shown in Figure 6. The classification accuracy of PCNN is consistently higher than that of CNN. Specifically, when the number of iterations is less than 11,000, the accuracy of PCNN is significantly higher than that of CNN; when the number of iterations is greater than 11,000, the improvement of PCNN becomes smaller. The reason for this phenomenon is that, as training proceeds, the prediction accuracy on all samples increases and the additional error contributed by the hard samples decreases. After 33,000 iterations, the accuracy of both CNN and PCNN reaches 99.165%.

Conclusion
In this paper, a new learning algorithm is proposed to enhance the classification performance of CNNs. It divides the samples into easy and hard samples and strengthens the learning of the hard samples by adding a penalty term to the loss function used in BP training. Experimental results on the CIFAR-10 and MNIST datasets show that the new learning algorithm improves the classification accuracy and convergence speed of CNNs in image classification problems. In future work, we will try to set the weight of the penalty term with a dynamic parameter so that it adapts during training, to further improve the performance of CNN.