Feature-selected cost-sensitive twin SVM for imbalanced data

In this paper, we propose a cost-sensitive twin SVM (CS-TWSVM) and apply it to imbalanced data. A weight is added to each instance according to its misclassification cost, which is related to its position. In a preprocessing step, features are selected according to the difference between their mean values in the majority and minority classes: a feature is selected when its difference value is higher than the average over all features. Experiments are conducted on UCI datasets with G-mean, AUC, and accuracy as evaluation metrics. The results show that feature selection combined with CS-TWSVM is useful for high-dimensional datasets.


Twin support vector machine
The Twin Support Vector Machine (TWSVM) was proposed by Jayadeva et al. in 2007 [1,2,3]. It obtains two nonparallel planes, around each of which the data points of the corresponding class cluster. The nonparallel planes are obtained by solving the following pair of quadratic programming problems.
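Following [1], with $A$ collecting the instances of one class, $B$ those of the other class, $e_1, e_2$ vectors of ones of suitable dimension, $\xi, \eta$ slack variables, and $c_1, c_2 > 0$ trade-off parameters, the pair of primal problems is:

$$\min_{w_1, b_1, \xi} \ \frac{1}{2}\|A w_1 + e_1 b_1\|^2 + c_1 e_2^T \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi \ge e_2, \ \xi \ge 0,$$

$$\min_{w_2, b_2, \eta} \ \frac{1}{2}\|B w_2 + e_2 b_2\|^2 + c_2 e_1^T \eta \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \eta \ge e_1, \ \eta \ge 0.$$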
The original TWSVM can be adapted to imbalanced data by assigning different values to the penalty parameters $c_1$ and $c_2$, since the cost of misclassifying a minority-class instance as majority class is higher than the cost of misclassifying a majority-class instance as minority class [7,8].
The error variables are penalized through the hinge loss function, defined as $\ell(z) = \max(0, 1 - z)$. In this paper, we introduce a new algorithm in section 2 by eliminating the second term of the TWSVM objective and adding a per-instance weight to the first term. In the experimental part, we compare the performance of the two classifiers on imbalanced datasets using G-mean and AUC as metrics.

Feature Selection
A feature is selected if its feature deviation (FD) value is higher than the average FD over all features. Feature deviation is the deviation between the average value of the feature in the majority class and its average value in the minority class. Given a training set $T = \{(x_1, y_1), \dots, (x_m, y_m)\}$, the FD [9] of the $j$-th feature is defined as

$$FD_j = \left| \frac{1}{m_+} \sum_{i \in I_+} x_{ij} - \frac{1}{m_-} \sum_{i \in I_-} x_{ij} \right| \tag{1}$$

where $I_+$ and $I_-$ index the majority and minority classes, containing $m_+$ and $m_-$ instances, respectively.
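As a concrete illustration, the following sketch (assuming NumPy arrays; the function and variable names are ours, not the paper's) computes the FD of each feature per equation (1) and keeps those above the average FD:

```python
import numpy as np

def fd_select(X, y, majority_label=1):
    """Select features whose Feature Deviation (FD) exceeds the mean FD.

    X : (m, n) array of instances; y : (m,) array of class labels.
    Returns a boolean mask of selected features and the FD values.
    """
    maj = X[y == majority_label]           # majority-class instances
    mino = X[y != majority_label]          # minority-class instances
    fd = np.abs(maj.mean(axis=0) - mino.mean(axis=0))  # per-feature deviation
    mask = fd > fd.mean()                  # keep features above the average FD
    return mask, fd

# usage: X_selected = X[:, fd_select(X, y)[0]]
```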

Instance weights
The weight of each instance in a class is related to the distance between the instance and the center of the opposite class, and the weights are defined differently for the two classes. The weights of minority-class instances are negatively related to their distance to the center of the majority class: the closer an instance is to the majority-class center, the higher its weight. On the contrary, the weights of majority-class instances are positively related to their distance to the center of the minority class: the further an instance is from the minority-class center, the higher its weight. Thus the weight of a minority-class instance is a decreasing function of $d_i$, its distance to the center of the majority class, while the weight of a majority-class instance is an increasing function of $d_i$, its distance to the center of the minority class.
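A minimal sketch of this weighting is given below, assuming the simple decreasing form $w_i = 1/(1+d_i)$ for the minority class and the increasing form $w_i = d_i / \max_k d_k$ for the majority class; these functional forms are our illustrative assumptions consistent with the description above, not the paper's exact definitions.

```python
import numpy as np

def instance_weights(X_min, X_maj):
    """Distance-based instance weights (illustrative functional forms).

    Minority weights decrease with distance to the majority-class center;
    majority weights increase with distance to the minority-class center.
    """
    c_maj = X_maj.mean(axis=0)                      # center of majority class
    c_min = X_min.mean(axis=0)                      # center of minority class
    d_min = np.linalg.norm(X_min - c_maj, axis=1)   # minority -> majority center
    d_maj = np.linalg.norm(X_maj - c_min, axis=1)   # majority -> minority center
    w_min = 1.0 / (1.0 + d_min)                     # closer to majority center => larger
    w_maj = d_maj / d_maj.max()                     # farther from minority center => larger
    return w_min, w_maj
```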

CS-TWSVM (cost-sensitive twin SVM)
The weight introduced in section 3.2 is added to the objective function. Our goal in using the weight is to minimize the cost of misclassification by pushing the hyperplane of the majority class away from the minority class and placing the hyperplane of the minority class near the instances close to the majority class. A hypothetical situation is depicted in Figure 1: Figure 1(a) shows the two nonparallel hyperplanes of the original TWSVM, and Figure 1(b) shows the two nonparallel hyperplanes modified by adding the weights [10,11,12]. From Figure 1, we can see that with the weights there is less chance of misclassifying minority-class instances.
The objective functions are defined as follows:

$$\min_{w_1, b_1} \ \frac{1}{2}\left\| W_1 (X_1 w_1 + e_1 b_1) \right\|^2 \quad \text{s.t.} \quad -(X_2 w_1 + e_2 b_1) \ge e_2 \tag{2}$$

$$\min_{w_2, b_2} \ \frac{1}{2}\left\| W_2 (X_2 w_2 + e_2 b_2) \right\|^2 \quad \text{s.t.} \quad (X_1 w_2 + e_1 b_2) \ge e_1 \tag{3}$$

where $X_1$ and $X_2$ are the data matrices of the minority and majority classes, $W_1$ and $W_2$ are the corresponding diagonal weight matrices, and $e_1$ and $e_2$ are vectors of suitable dimension with all entries equal to 1. Equations (2) and (3) are further used to develop the CS-TWSVM classifier. The Lagrangian of equation (2) is obtained as

$$L(w_1, b_1, \alpha) = \frac{1}{2}\left\| W_1 (X_1 w_1 + e_1 b_1) \right\|^2 + \alpha^T \big( X_2 w_1 + e_2 b_1 + e_2 \big) \tag{4}$$

where the Lagrange multipliers are collected in the vector $\alpha \ge 0$. The Karush-Kuhn-Tucker (KKT) conditions of equation (4) are

$$X_1^T W_1^T W_1 (X_1 w_1 + e_1 b_1) + X_2^T \alpha = 0, \qquad e_1^T W_1^T W_1 (X_1 w_1 + e_1 b_1) + e_2^T \alpha = 0. \tag{5}$$

Defining $H = W_1 [X_1 \ \ e_1]$, $G = [X_2 \ \ e_2]$, and $u_1 = [w_1^T \ \ b_1]^T$, the two conditions in (5) merge into

$$H^T H u_1 + G^T \alpha = 0, \qquad \text{i.e.,} \qquad u_1 = -(H^T H)^{-1} G^T \alpha. \tag{6}$$

Substituting (6) back into (4) yields the dual problem

$$\max_{\alpha \ge 0} \ e_2^T \alpha - \frac{1}{2} \alpha^T G (H^T H)^{-1} G^T \alpha. \tag{7}$$

At times it is difficult to obtain the inverse of $H^T H$. This condition is handled by adding a regularization term $\varepsilon I$, where $I$ represents an identity matrix of suitable dimension and $\varepsilon > 0$ is small, so that equation (6) is reformulated as

$$u_1 = -(H^T H + \varepsilon I)^{-1} G^T \alpha. \tag{8}$$

In the same manner, the normal vector and bias of the second class are obtained by solving the dual of equation (3), giving

$$u_2 = [w_2^T \ \ b_2]^T = (P^T P + \varepsilon I)^{-1} Q^T \beta, \qquad P = W_2 [X_2 \ \ e_2], \quad Q = [X_1 \ \ e_1], \tag{9}$$

where $\beta \ge 0$ is the multiplier vector of the second problem. The normal vectors and biases are further used to generate the nonparallel planes

$$x^T w_1 + b_1 = 0 \qquad \text{and} \qquad x^T w_2 + b_2 = 0. \tag{10}$$

In this way, CS-TWSVM determines a hyperplane for each class, and a new data sample $x$ is assigned to a class by the following decision function:

$$\text{class}(x) = \arg\min_{i \in \{1, 2\}} \ \frac{|x^T w_i + b_i|}{\|w_i\|}, \tag{11}$$

that is, the perpendicular distance of the test sample from each hyperplane is calculated and the sample is assigned to the class whose hyperplane is nearer.
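A compact sketch of the resulting training and prediction steps under the formulation above is given next. The function names are ours, and SciPy's box-constrained L-BFGS-B stands in for a dedicated QP solver for the dual (7); this is an illustrative implementation, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_plane(X_own, X_other, w_own, eps=1e-4):
    """Solve one CS-TWSVM subproblem: a plane fitted to the weighted X_own,
    kept at unit distance from X_other (hard constraints, no slack term)."""
    H = np.diag(w_own) @ np.hstack([X_own, np.ones((len(X_own), 1))])   # W [X  e]
    G = np.hstack([X_other, np.ones((len(X_other), 1))])                # [X' e]
    inv_reg = np.linalg.inv(H.T @ H + eps * np.eye(H.shape[1]))  # (H^T H + eps I)^-1
    P = G @ inv_reg @ G.T
    e = np.ones(len(X_other))
    # dual (7): max e^T a - 0.5 a^T P a  s.t. a >= 0, solved as a bounded minimization
    res = minimize(lambda a: 0.5 * a @ P @ a - e @ a, np.zeros(len(e)),
                   jac=lambda a: P @ a - e,
                   bounds=[(0, None)] * len(e), method="L-BFGS-B")
    u = -inv_reg @ G.T @ res.x   # [w; b]; the overall sign does not affect distances
    return u[:-1], u[-1]

def predict(x, planes):
    """Assign x to the class whose hyperplane is nearer (perpendicular distance)."""
    d = [abs(x @ w + b) / np.linalg.norm(w) for w, b in planes]
    return int(np.argmin(d))

# usage: planes = [fit_plane(X_min, X_maj, w_min), fit_plane(X_maj, X_min, w_maj)]
```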

Feature Selection
We use the FD value of each feature between the positive and negative classes, as defined in equation (1), as the metric for feature selection [14,15]. The features whose FD value is higher than the average are selected. For high-dimensional datasets, the advantage of feature selection is obvious: especially in practical applications with tens to hundreds of features, feature selection reduces computation time, mitigates the curse of dimensionality, and improves prediction performance. The performance of the CS-TWSVM classifier with and without feature selection is compared in experiment 3.3.

The performance of CS-TWSVM
The performance of CS-TWSVM is compared with SVM, TSVM, Least Squares TSVM, and Weighted LS-TSVM. SVM, TSVM, and Least Squares TSVM are not designed specifically for imbalanced data, while Weighted LS-TSVM is a refinement of Least Squares TSVM for imbalanced data. From Table 2, we can see that CS-TWSVM performs better than the other classifiers on most datasets. G-mean is the more informative metric for estimating classifier performance on imbalanced datasets.
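For reference, G-mean is the geometric mean of sensitivity and specificity, $\sqrt{TPR \cdot TNR}$. A minimal computation with scikit-learn (variable names are ours) is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)   # sensitivity on the positive (minority) class
    tnr = tn / (tn + fp)   # specificity on the negative (majority) class
    return np.sqrt(tpr * tnr)

# auc = roc_auc_score(y_true, y_score)   # AUC from continuous decision scores
```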

Conclusion
From Table 3, the performance of CS-TWSVM with feature selection is clearly better than without it on the Ionosphere and Sonar datasets. On the Hepatitis and Heart-Statlog datasets the performance is the same, and on Pima Indians Diabetes the performance with feature selection is even worse than without it. This is because Ionosphere and Sonar are high-dimensional datasets, while Hepatitis, Heart-Statlog, and Pima Indians Diabetes are not; in particular, Pima Indians Diabetes has only eight features. Feature selection is therefore more useful for high-dimensional datasets.