Modified Floating Search Feature Selection Based on Genetic Algorithm

Classification performance is adversely impacted by noisy data. Selecting features relevant to the problem is thus a critical step in classification, and an accurate solution is difficult to achieve, especially on a large dataset. In this article, we propose a novel filter-based floating search technique for feature selection to select an optimal set of features for classification purposes. A genetic algorithm is utilized to increase the quality of features selected at each iteration. A criterion function is applied to choose relevant and high-quality features which can improve classification accuracy. The method is evaluated using 20 standard machine learning datasets of various sizes and complexities. Experimental results with the datasets show that the proposed method is effective and performs well in comparison with previously reported techniques.


Introduction
Classification, a process for predicting the class of a given input, is one of the most fundamental tasks in data mining. A number of methods are commonly used for data classification, such as decision trees; rule-based, probabilistic and instance-based methods; support vector machines (SVMs); and neural networks. Noisy and irrelevant data are major obstacles to data mining.
Selecting features relevant to the problem is a critical first step in classification, especially when applied to a large dataset. The aim is to select a representative subset of highly relevant dimensions while removing irrelevant and redundant ones [1]. Feature selection can significantly improve the running time of a machine learning algorithm as well as the quality of the model.
Bins and Draper [2] proposed a technique to reduce a large set of features (1 000) to a much smaller subset without removing any highly important features or decreasing classification accuracy. There are three steps in the algorithm: first, irrelevant features are removed using a modified form of the relief algorithm [3]; second, redundant features are eliminated using K-means clustering [4]; and, lastly, a combinatorial feature selection algorithm is applied to the current feature subset using the sequential floating backward selection (SFBS) algorithm. The basic concept is to filter feature subsets at each step until the smallest possible one is obtained. In this article, we propose a technique to improve the effectiveness of the floating search feature selection method, which leads to a higher classification rate. Our method employs a genetic algorithm to enrich and improve the resultant features after each iteration of the sequential forward floating search (SFFS) process.

Feature selection
Two important components of the feature selection process, subset generation and subset evaluation, are shown in Figure 1. The subset generation engine identifies feature subset candidates, and subset evaluation measures the quality of the subsets. Lastly, in order to terminate the process, a stopping criterion is tested at every iteration.
There are three main types of feature selection method: filter, wrapper and hybrid. Wrapper methods rely on a classification algorithm employed as the subset evaluation process for feature subsets [5]. Maroño et al. [6] proposed a wrapper method by applying ANOVA. In general, the wrapper approach gives a higher performance than the filter approach since the feature selection process is optimized for the specific classification algorithm. Nevertheless, when wrapper methods are applied to very high-dimensional datasets, they incur a high computational cost and may become infeasible.
Filter methods use an independent criterion which relies on general characteristics of the data to evaluate and select feature subsets without involving a classification algorithm. Common evaluation functions are measures such as distance, mutual information (MI), dependency or entropy, calculated directly from the training data. Karegowda et al. [7] developed a filter-based technique in a cascade fashion with a genetic algorithm (GA) using a correlation-based criterion.
Hybrid methods, described by Dash and Liu [1], exploit the positive aspects of both wrapper and filter methods. They utilize a filter-based technique to select highly representative features and apply a wrapper-based technique to add candidate features and evaluate the candidate subsets in order to select the best ones.
The sequential forward search (SFS) method operates in a forward search manner, starting with an empty set and adding one feature during each round until a feature subset that maximizes the criterion function value is found, whereas the sequential backward search (SBS) method starts with the full feature set and eliminates one feature on each iteration until a predetermined criterion is satisfied. A drawback of both methods is the nesting effect: features that have been discarded cannot be re-selected, and features that have been selected cannot be removed later. Since these algorithms do not examine all possible feature subsets, they are not guaranteed to produce an optimal result. The generalized forms GSFS and GSBS, which test groups of features at a time, are better solutions, but at the cost of increased computational time. The plus-l-take-away-r (PTA) method was proposed to address the nesting problem [8].
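The forward search and its nesting effect can be illustrated with a minimal sketch. It is not the implementation evaluated in this article; the function name `sfs`, the scores in `SCORES` and the toy criterion `toy_j` are assumptions introduced purely for illustration.

```python
from typing import Callable, FrozenSet, Iterable


def sfs(features: Iterable[str],
        criterion: Callable[[FrozenSet[str]], float],
        target_size: int) -> FrozenSet[str]:
    """Greedy sequential forward search: grow the subset one feature at a time."""
    remaining = set(features)
    selected: FrozenSet[str] = frozenset()
    while remaining and len(selected) < target_size:
        # Pick the unselected feature whose addition maximizes the criterion.
        best = max(sorted(remaining), key=lambda f: criterion(selected | {f}))
        selected = selected | {best}
        remaining.remove(best)  # nesting effect: a chosen feature is never dropped
    return selected


# Hypothetical per-feature scores; f1 and f2 are treated as redundant together.
SCORES = {"f1": 3.0, "f2": 2.9, "f3": 1.0}


def toy_j(subset: FrozenSet[str]) -> float:
    penalty = 2.5 if {"f1", "f2"} <= subset else 0.0
    return sum(SCORES[f] for f in subset) - penalty


chosen = sfs(SCORES, toy_j, target_size=2)
```

With these toy values the search first takes f1 and then prefers f3 over the individually stronger but redundant f2. Once f1 is in the subset, plain SFS has no way of removing it again.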

Floating search method
Pudil et al. [9] proposed "floating" search methods in two main categories: search in a forward direction (SFFS) and search in a backward direction (SBFS). These methods use a criterion function to select a feature and compare candidate subsets. SFFS and SBFS can be classified as a wrapper or a filter approach depending on the criterion function used. They perform well, but the computational time is long, especially with large datasets. The floating search methods can be viewed as plus-l-take-away-r (PTA) algorithms without fixed parameters. They have been shown to give very good performance (close to optimum results) and to overcome the nesting problem. SFFS, SBFS, and bidirectional selection as a combination of both are greedy search algorithms that add or discard features one at a time [9].
The floating search method consists of two phases: forward and backward. The structure of the floating search algorithm is shown in Figure 2. SFFS starts with an empty set and sequentially adds one feature at a time. SBFS, the counterpart of the forward search, is initialized with a full set and sequentially eliminates one feature at a time. Within a floating search, the forward step selects the best unselected feature according to a criterion function to form a new feature subset, and the backward step iteratively determines which members of the selected subset are to be removed, provided that the remaining set improves performance according to the same criterion function used in the forward step. The algorithm loops back to the forward step until the stopping condition is reached. There are disadvantages when using either algorithm. With SFFS, it is not possible to eliminate redundant features generated in the search process, whereas SBFS cannot re-evaluate the usefulness of a feature together with other features at the same time.
Improved versions of SFFS have been proposed in many studies to obtain better performance. Somol et al. [10] presented the adaptive sequential forward floating selection (ASFFS) algorithm, in which a dynamically calculated parameter "r" specifies the number of features to be added in the inclusion phase. A parameter "o" specifies the maximum number of features to be removed in the exclusion phase if doing so improves performance. Nakariyakul and Casasent [11] came up with an improved forward floating search algorithm, which adds a search step that checks whether replacing a weak feature improves the criterion function and repeats until no replacement yields further improvement. They found that this method obtained optimal solutions for many feature subsets and was less computationally intensive than exhaustive optimal feature selection algorithms. Chaiyakarn and Sornil [12] proposed a filter-based method to return a small subset of features for classification by employing two different criterion functions in the forward and backward steps. The functions helped remove redundant features, maximize inter-class distances, and minimize intra-class distances.
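The basic floating search loop described in this section (forward inclusion followed by conditional exclusion) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of any of the cited works: the function name `sffs`, the lookup table `TABLE` and its criterion values are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable

Subset = FrozenSet[str]


def sffs(features: Iterable[str],
         criterion: Callable[[Subset], float],
         target_size: int) -> Dict[int, Subset]:
    """Sequential forward floating search; returns the best subset found per size."""
    all_features = sorted(features)
    target_size = min(target_size, len(all_features))
    best: Dict[int, Subset] = {0: frozenset()}
    current: Subset = frozenset()
    while len(current) < target_size:
        # Inclusion: add the most significant unselected feature.
        candidates = [f for f in all_features if f not in current]
        add = max(candidates, key=lambda f: criterion(current | {f}))
        current = current | {add}
        k = len(current)
        if k not in best or criterion(current) > criterion(best[k]):
            best[k] = current
        else:
            current = best[k]
        # Conditional exclusion: backtrack while a smaller subset beats the record.
        while len(current) > 2:
            drop = max(sorted(current), key=lambda f: criterion(current - {f}))
            reduced = current - {drop}
            if criterion(reduced) > criterion(best[len(reduced)]):
                best[len(reduced)] = reduced
                current = reduced
            else:
                break
    return best


# Hypothetical criterion values for every subset of three features a, b, c.
TABLE = {
    frozenset(): 0.0,
    frozenset("a"): 3.0, frozenset("b"): 2.0, frozenset("c"): 2.0,
    frozenset("ab"): 4.0, frozenset("ac"): 4.0, frozenset("bc"): 6.0,
    frozenset("abc"): 7.0,
}

best_by_size = sffs("abc", TABLE.__getitem__, target_size=3)
```

In this toy table the forward steps reach {a, b} at size two, but the exclusion step taken from {a, b, c} discovers {b, c}, which has a higher criterion value. A purely forward search cannot reach that subset because it never discards a.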

Feature subset evaluation
In order to perform feature selection with the filter approach, a measure is needed to evaluate the relevance of the subset to the classification process.

Mahalanobis distance
The Mahalanobis distance is a useful way of determining the "similarity" of a set of values from an "unknown" sample to a set of values measured from a collection of "known" samples. Yongli [13] used the Mahalanobis distance of candidate feature subsets and selected the best-quality subset to be used as input data. One of the main reasons the Mahalanobis distance is used is that it is very sensitive to inter-variable changes in the training data. The Mahalanobis distance between two points $x = (x_1, \ldots, x_p)^t$ and $y = (y_1, \ldots, y_p)^t$ in the p-dimensional space $\mathbb{R}^p$ is defined as:
$$d_\Sigma(x, y) = \sqrt{(x - y)^t \, \Sigma^{-1} \, (x - y)},$$
where $\Sigma$ is the covariance matrix, and $d_\Sigma(x, 0) = \lVert x \rVert_\Sigma = \sqrt{x^t \Sigma^{-1} x}$ is the norm of x.
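As a small worked illustration of the Mahalanobis distance, the sketch below computes the square root of (x - y)^t C^{-1} (x - y) for a given covariance matrix C using only the standard library. The helper names `solve` and `mahalanobis` are assumptions made for this example, and the covariance matrix is supplied by the caller rather than estimated from data.

```python
import math
from typing import List, Sequence


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def mahalanobis(x: Sequence[float], y: Sequence[float],
                cov: Sequence[Sequence[float]]) -> float:
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, diff)  # z = cov^{-1} (x - y), without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


# With an identity covariance the distance reduces to the Euclidean distance.
d_identity = mahalanobis([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A larger variance along the first axis shrinks distances along that axis.
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

Solving the linear system instead of inverting the covariance matrix is a design choice made here for numerical convenience; it gives the same value as the defining formula.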

Mutual information
MI is a widely used measure to evaluate candidate feature subsets. Battiti [14] used MI on candidate feature subsets to select a quality subset to be used as input data for a neural network classifier. MI measures arbitrary dependencies between random variables and can be calculated as follows:
$$I(X; Y) = H(Y) - H(Y \mid X),$$
where H is an entropy function, Y is a class attribute, and X is the feature to select. Given a discrete random variable X with probability mass function p(x), the entropy is
$$H(X) = -\sum_{x} p(x) \log p(x).$$

Genetic algorithm
A genetic algorithm (GA), introduced by John Holland in 1975 [15], is an adaptive optimization search algorithm, inspired by natural selection in biological systems, for finding an optimal solution. The genes of an organism are gathered into structures called chromosomes, and a set of chromosomes is referred to as a population. In general, there are three operations employed in GAs. First, selection is an operator for selecting potentially useful solutions for recombination, and is achieved by either tournament or roulette wheel selection. Second, crossover refers to the process of producing an offspring chromosome from two matching parent chromosomes. Third, mutation introduces genetic diversity by making random binary changes in a chromosome, thus altering its fitness value. These principles have led to new solutions in the pursuit of better search solutions.
GAs have been successfully applied to feature selection [16] with the objective of saving computational time by avoiding exhaustive search, which is achieved by finding promising regions and selecting quality feature subsets. Furthermore, hybrid GAs [17] include local search operators to improve the fine-tuning quality of a simple GA search.
The fitness function, based on the principle of survival of the fittest, is the means by which a GA evaluates each individual's fitness and obtains the optimal solution after applying the genetic operators. This process is repeated many times and over many generations until the stopping criterion is satisfied. For feature selection, a feature subset is represented as a binary string; a feature is either included or not included in the feature subset.
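The binary representation of feature subsets and the bit-flip mutation operator can be sketched in a few lines. The function names `encode`, `decode` and `mutate`, and the feature names used in the example, are illustrative assumptions rather than part of the method described here.

```python
import random
from typing import Iterable, List, Sequence


def encode(subset: Iterable[str], all_features: Sequence[str]) -> List[int]:
    """Binary chromosome: gene i is 1 if feature i belongs to the subset."""
    members = set(subset)
    return [1 if f in members else 0 for f in all_features]


def decode(chromosome: Sequence[int], all_features: Sequence[str]) -> List[str]:
    """Recover the selected feature names from a binary chromosome."""
    return [f for f, bit in zip(all_features, chromosome) if bit == 1]


def mutate(chromosome: Sequence[int], p_m: float, rng: random.Random) -> List[int]:
    """Flip each gene independently with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in chromosome]


FEATURES = ["f1", "f2", "f3", "f4"]
chromosome = encode({"f1", "f4"}, FEATURES)   # features f1 and f4 are selected
mutated = mutate(chromosome, p_m=0.1, rng=random.Random(0))
```

Each chromosome has one gene per feature in the dataset, so its length stays fixed while the number of ones varies with the subset size.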

The proposed algorithm
We now discuss our algorithm to select the best subset of size d from the total of D features. The inclusion step, using MI as the criterion function (J), is executed to create a set of candidates for inclusion. In the exclusion step, a candidate feature subset from the inclusion step is used to generate smaller subsets by removing one feature at a time and re-evaluating them. A selected subset of size k + 1 is generated and compared to the previously best subset of size k + 1 from the inclusion part. If the new subset evaluates better than the formerly selected set, the exclusion step retains the better one and iterates to smaller subsets; otherwise the algorithm goes back to the inclusion step. Our feature improvement step based on GA is included after the exclusion step at each iteration. The chromosome structure consists of binary genes corresponding to individual features. The value of 1 at the ith gene means that the ith feature is selected; otherwise it is 0.
The initial population is generated from the resulting subsets of size k + 1 from the exclusion step by first removing the weakest feature from the best subset, resulting in a subset of size k. Each remaining feature is then added to that subset, generating the niched initial population for the GA. The fitness function used in this study is MI. Then, a new population is created by selection, crossover and mutation operations. The process is terminated when the current feature set reaches the size of D-2 features. We now provide an illustrative example of how the proposed algorithm works and how it improves SFFS. Assume that the first four feature sets selected by the SFS method at each size are {f1}, {f1, f4}, {f1, f4, f5}, {f1, f4, f5, f7} with the corresponding J values of 4.1, 6.2, 9.1 and 10.2, respectively, and the next iteration is to determine subsets with five features.
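Since MI serves as the criterion function J, a minimal sketch of estimating it from data may be helpful. The code assumes discrete (or already discretized) feature and class values and uses the identity I(X; Y) = H(X) + H(Y) - H(X, Y); the function names are illustrative, and how continuous attributes are handled is not specified here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]     # determines the class completely
uninformative = [0, 1, 0, 1]   # independent of the class in this sample

mi_high = mutual_information(informative, labels)
mi_low = mutual_information(uninformative, labels)
```

A feature that determines the class carries as much information as the class entropy (one bit here), while a feature that is independent of the class in the sample scores zero.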

Step 1: Inclusion
A feature is added to the feature subset. The SFS method adds a feature to the subset up to a total of five: J (f1, f4, f5, f7, f6) = 13. Assume that feature f6 is chosen using the SFS method and that J for the five-feature subset is 14.

Step 3.1: Crossover operation
Once a pair of chromosomes has been selected, crossover can take place to produce child chromosomes. Two individuals (parents) are randomly selected and a crossover point is randomly chosen. This point occurs between two bits and divides each individual into left and right sections. Crossover then swaps the left (or the right) section of the two individuals (Figure 3). Suppose the crossover point randomly occurs after the sixth bit; then each new child receives one section of each parent's bits (Figure 4). The algorithm continues to select parental chromosomes to which the crossover operation is applied. A child chromosome may have one selected bit more than the current feature subset size k. In this case, a random bit is automatically flipped to preserve the number of selected features (i.e., the current feature set size).
The improvement step helps discover subsets not discoverable by the greedy nature of SFFS. From the above example, the SFFS algorithm is not able to produce this best four-feature subset because it cannot backtrack to the set {f5, f7, f6} and thus could not add feature f2 to subset {f5, f7, f6}. The example above demonstrates the advantage of our proposed algorithm. The algorithm replaces the weak feature (feature f1 in our example) in the feature set {f1, f5, f7, f6} with feature f2, which results in a new set of four features {f5, f7, f6, f2} that has a larger J value. Therefore, the search strategy of our proposed algorithm is more thorough than that of the SFFS algorithm, so it is more effective.
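The single-point crossover and the size-preserving bit flip can be sketched as follows. The names `crossover` and `repair` and the parent chromosomes are hypothetical. As an additional assumption, `repair` also handles a child with too few selected bits, a case not discussed in the text.

```python
import random
from typing import List, Sequence, Tuple


def crossover(p1: Sequence[int], p2: Sequence[int],
              point: int) -> Tuple[List[int], List[int]]:
    """Single-point crossover: swap the right-hand sections after `point`."""
    child1 = list(p1[:point]) + list(p2[point:])
    child2 = list(p2[:point]) + list(p1[point:])
    return child1, child2


def repair(chromosome: Sequence[int], k: int, rng: random.Random) -> List[int]:
    """Flip randomly chosen bits until exactly k features are selected."""
    fixed = list(chromosome)
    while sum(fixed) != k:
        target = 1 if sum(fixed) > k else 0   # the bit value that is in excess
        index = rng.choice([i for i, bit in enumerate(fixed) if bit == target])
        fixed[index] = 1 - target
    return fixed


rng = random.Random(42)
parent1 = [1, 1, 0, 0, 1, 1, 0, 0]   # four selected features
parent2 = [0, 1, 1, 1, 0, 0, 1, 0]   # four selected features
raw1, raw2 = crossover(parent1, parent2, point=6)   # cut after the sixth bit
child1, child2 = repair(raw1, 4, rng), repair(raw2, 4, rng)
```

With this cut point the first raw child carries five selected bits and the second only three; after repair both children again encode subsets of four features.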
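The replacement of a weak feature can be illustrated by building the niched initial population and evaluating it. This sketch rests on several assumptions that should be read as such: the weakest feature is taken to be the one whose removal leaves the highest criterion value, every feature outside the reduced subset is used to form one individual, and the additive scores in `SCORES` are hypothetical values, not the J values of the example.

```python
from typing import Callable, FrozenSet, List, Sequence

Subset = FrozenSet[str]


def niched_population(best_subset: Subset, all_features: Sequence[str],
                      criterion: Callable[[Subset], float]) -> List[Subset]:
    """Drop the weakest feature, then add each outside feature in turn."""
    weakest = max(sorted(best_subset), key=lambda f: criterion(best_subset - {f}))
    core = best_subset - {weakest}
    return [core | {f} for f in all_features if f not in core]


# Hypothetical per-feature scores used as a stand-in criterion function.
SCORES = {"f1": 1.0, "f2": 4.0, "f3": 0.5, "f4": 0.8,
          "f5": 3.0, "f6": 3.5, "f7": 2.0}


def toy_j(subset: Subset) -> float:
    return sum(SCORES[f] for f in subset)


start = frozenset({"f1", "f5", "f7", "f6"})
population = niched_population(start, sorted(SCORES), toy_j)
improved = max(population, key=toy_j)
```

Under these toy scores f1 is dropped and the individual containing f2 has the highest criterion value, mirroring the replacement of f1 by f2. In the full algorithm the population would then be evolved by the genetic operators instead of simply taking its best member.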

Step 4: Terminating condition
After each iteration, the selection/crossover/mutation cycle continues until all possible combinations of chromosomes in the population have been evaluated. The higher the fitness value, the higher the probability of that chromosome being selected for reproduction. This generational process is repeated until a pre-determined termination condition has been reached. We terminate the algorithm when the current feature set reaches d < D features, where D is the total number of features in the dataset. The pseudo-code is depicted in Figure 6.
A fitness function is commonly needed in GAs to evaluate a candidate chromosome of an individual to assess whether the latter should survive or not.At each iteration, calculation of the fitness function is processed repeatedly, which, because of its simplicity, is a fast process, although it still impacts performance.In our model, we use the Mahalanobis criterion as a fitness function.
Input: Y_m is a feature set, m is a predefined number of selected features, and J is a criterion function. P_c is the probability of crossover, P_m is the probability of mutation, Population is the set of individuals, max_generation is the maximum number of generations, and Fitness is a function which determines the quality of individuals.
Output: The best solution over all generations.
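A generic GA loop using these inputs can be sketched as below. It is not a transcription of the pseudo-code in Figure 6: the function name `run_ga` is an assumption, selection is implemented as roulette wheel sampling, the fitness is assumed to be non-negative (as MI is), and the size-preserving repair of chromosomes is omitted for brevity. The example uses the number of selected genes as a stand-in fitness.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def run_ga(population: List[Chromosome],
           fitness: Callable[[Chromosome], float],
           p_c: float, p_m: float, max_generation: int,
           rng: random.Random) -> Chromosome:
    """Return the fittest chromosome seen in any generation."""
    best = max(population, key=fitness)
    for _ in range(max_generation):
        # Roulette wheel selection: probability proportional to fitness.
        weights = [fitness(c) + 1e-9 for c in population]  # keep weights positive
        parents = rng.choices(population, weights=weights, k=len(population))
        children: List[Chromosome] = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_c:  # single-point crossover
                point = rng.randrange(1, len(a))
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            children += [a, b]
        if len(parents) % 2:  # an unpaired parent is carried over unchanged
            children.append(parents[-1])
        # Bit-flip mutation builds the next generation.
        population = [[1 - g if rng.random() < p_m else g for g in c]
                      for c in children]
        best = max(population + [best], key=fitness)
    return best


rng = random.Random(0)
initial = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
result = run_ga(initial, fitness=sum, p_c=0.8, p_m=0.05,
                max_generation=20, rng=rng)
```

Because the best individual is tracked across generations, the returned solution is never worse than the best member of the initial population.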

Experimental evaluation
To evaluate the proposed feature selection algorithm, 20 standard datasets of various sizes and complexities from the UCI machine learning repository [18] are used in the experiments.These datasets have been frequently used as a benchmark to compare the performance of classification methods and consist of a mixture of numeric, real and categorical attributes.Details of the datasets are shown in Table 1.
Three classification modeling techniques are used in the experiments: Classification and Regression Tree (CART), Support Vector Machine (SVM), and Naïve Bayes. Training and testing data are used as provided in the datasets. For those not providing separate testing data, 5-fold cross validation is applied. To evaluate a feature subset, MI is applied as the criterion function.
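For datasets without a separate test set, the 5-fold partitioning can be sketched as follows. The function name `k_fold_indices` and the interleaved assignment of samples to folds are assumptions of this example; whether the folds were shuffled or stratified is not stated here.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
```

The classifier is trained on the four training folds and evaluated on the held-out fold, and the accuracies of the five rounds are typically averaged.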

The classifiers
CART is a well-known decision tree algorithm for supervised machine learning that is applied to both classification and regression problems. It was first introduced by Breiman et al. [19]. A decision tree represents a series of decisions. The key components of the tree are a set of rules for splitting each node in the tree and for assigning a class outcome to each terminal node.
The Naïve Bayes algorithm is a statistical classifier for supervised learning [19], and is based on the principle of conditional probability. It can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Its performance has been shown to be excellent in some domains but poor in others, e.g. those with correlated features. The classification system is based on Bayes' rule under the assumption that the effect of an attribute on a given class is independent of the other attributes. This assumption, called class conditional independence, makes computation simple.
SVMs, originally proposed by Cortes and Vapnik [20], have become important in many classification problems for a variety of reasons, such as their flexibility, computational efficiency, and capacity to handle high-dimensional data. They are a relatively recent method for extracting information from a dataset. Classification is achieved by a linear or nonlinear separating surface in the input space of the dataset. SVMs have been applied to a number of applications, such as bioinformatics, face recognition, text categorization, handwritten digit recognition, and so forth. An SVM is a binary classifier that assigns a new data point to a class by minimizing the probability of error.

Performance of the proposed techniques using classifiers
We studied the effectiveness of the proposed feature selection using three different classification methods (CART, SVM and Naïve Bayes) on 20 standard UCI datasets. The results in Table 1 show that, in 97.7% of the cases, the proposed technique improved classification effectiveness and greatly reduced the number of features selected, thus increasing classification efficiency, for all of the classification methods. We achieved 100% selection accuracy on four datasets with the proposed method. In a comparison of the classification methods, SVM yielded the highest classification accuracy in 65% of the datasets, while CART gave the highest accuracy in 35% of the datasets.

Conclusion
Feature selection is critical to the performance of classification. We propose a feature selection algorithm that improves the performance of SFFS by incorporating a feature improvement step based on a genetic algorithm. This step helps discover important subsets that cannot be found using SFFS alone. The algorithm employs mutual information as the feature subset evaluation function. The proposed technique was evaluated using 20 standard datasets from the UCI

Fig. 2. The structure of a floating search algorithm.