Feature selection tree for automated machinery fault diagnosis

Intelligent machinery fault diagnosis commonly utilises statistical features of sensor signals as the inputs for its machine learning algorithm. Although an abundance of statistical features can be extracted from raw signals, inserting all the available features into the machine learning algorithm for machinery fault classification may inadvertently degrade classification accuracy owing to overfitting. It is therefore only by selecting the most representative features that overfitting can be avoided and classification accuracy improved. Currently, the genetic algorithm (GA) is regarded as the most commonly used and reliable feature selection tool for improving the accuracy of any machine learning algorithm. However, the greatest challenges for the GA are that it may fall into a local optimum and that it is computationally demanding. To overcome these limitations, a feature selection tree (FST) is here proposed. Feature selection was executed on numerous experimental datasets using both the FST and the GA, and their performance is compared and discussed. The analysis showed that the proposed FST produced optimal feature subsets identical or superior to those of the renowned GA method, but with a simulation period approximately 20 times shorter. The proposed FST is therefore more efficient than the GA in performing the feature selection task.


Introduction
Currently, machine learning is the most efficient tool for eliminating human intervention or assisting human supervision in machinery fault diagnosis. Pattern classification in machine learning is an analytical process used to categorize a dataset into a finite number of assigned divisions based on its characteristics [1]. However, the accuracy of the pattern classification model relies on the quality of the input features (attributes) and the adopted learning algorithm (e.g., artificial neural network, support vector machine (SVM) and Bayesian network). The number of features included determines the number of search spaces or dimensions of the hyperplanes [2]. The execution time of a learning algorithm and the number of data samples required are directly proportional to the number of features needed to achieve the desired classification accuracy. In other words, a larger number of input features incurs greater computational and implementation costs, but does not guarantee better performance, owing to parameter overfitting [3].
Various feature selection techniques have been documented and proven capable of efficiently verifying the usefulness of a feature [4]. Forman argued that the reason machine learning algorithms should emphasize the quality of features rather than their quantity is two-fold: to yield meaningful breakthroughs and to conserve effort and resources, both at the present time and in the future [5]. Khiabani et al. reported that feature selection significantly improved the performance of a machine learning algorithm, especially when dealing with a vast number of features [6]. Thus, feature selection plays an important role in the quantitative and qualitative measurement of inputs.
Feature selection methods can be classified into the filter method, the wrapper method and the embedded method [6][7][8]. In the filter method, the feasibility of a particular feature subset with respect to the target function is described by statistical characteristics. For instance, statistical measures utilizing correlation [9], symmetrical uncertainty [10] and optimal relevancy-redundancy relationships [11] have been implemented under the filter method division. A typical wrapper method applies a search mechanism to seek the best feature combination based on the performance criterion of the classifier; popular search strategies include heuristic, greedy (hill-climbing), stochastic and best-first searches [12]. The embedded method integrates the feature selection task into the training process of the classifier [13] before progression to the validation phase.
Notably, the wrapper and embedded techniques share a similarity in terms of the involvement of the learning algorithm in the classifier accuracy comparison and training stages, respectively. In contrast, the filter method independently performs feature ranking with a scoring function in a straightforward manner, while the wrapper and embedded methods acquire feedback from the intended classifier as an indication of the representativeness of a feature subset. Consequently, the cost of assessing the performance of candidate feature subsets with the intended classifier is a concern in both the wrapper and embedded methods, especially when dealing with large data. The wrapper method nevertheless remains an attractive feature selection option, as it selects the best combination of features based on its performance with the desired classifier, although the trade-off between computational cost and prediction accuracy is non-trivial. Amongst the available wrapper techniques, the genetic algorithm (GA) is regarded as one of the most common and effective.

Genetic Algorithm (GA) and Its Limitations
Since its introduction in 1975, the GA has undergone enormous development and has been applied in various fields to solve multi-dimensional optimization problems, ranging from grinding operations [14] and nanofluid density estimation [15] to a video steganography model [16], to name but a few. The following case studies explicitly show how GAs cope with large, bulky data matrices. In [17], adequately tuned GA feature selection and feature elimination reduced the complexity of a partial least squares (PLS) model, resulting in better near-infrared (NIR) spectral resolution and interpretation and an enhanced end result. Compared with the grid algorithm, Huang and Wang obtained better SVM classification performance by simultaneously fine-tuning the feature subset and the kernel parameters using a GA [18]. The merit of the GA-SVM model was further supported by Wang et al., who used it to select features from electroencephalogram (EEG) signals as inputs for a typical brain-computer interface (BCI) system [19].
Asghari Oskoei and Hu integrated GA feature selection procedures with an artificial neural network (ANN) classifier [20]. The cascaded model was shown to be more advantageous in classifying myoelectric signals (MES) in both the time and frequency domains. With the aim of achieving input adaptability and model classifier simplicity in a biometric hand system application, Luque et al. utilized a combination of a GA and simple classifiers (kNN, LDA) [21]. Better classification performance was acquired using less than 6.25% of the total features available. Meanwhile, in a handwriting character recognition case study that utilized a statistical index combining the Fisher linear discriminant method and covariance matrices, GA-based feature selection outperformed both classification without feature selection and other feature selection algorithms [22]. Oluleye et al. reported the involvement of GAs in exploring optimal input features for novel kNN classification functions in image recognition; the GA outclassed the WEKA software in terms of flexibility and feature size reduction, achieving better accuracy on both the Flavia and Ionosphere datasets [23]. The review clearly suggests that GAs are amongst the most common, effective and versatile techniques for optimization purposes, regardless of the type or size of the data or classifier [24].
The GA was built on the analogy of Darwin's fundamental theory of genetic evolution: it identifies an optimal solution for a targeted function by reproducing probable answers in every iteration. Hence, in the case of feature selection, it focuses on discovering the best binary vector, resembling a chromosome, within the search space. The fitness is based on optimization criteria, which aim towards preferable values, either the global maximum or minimum, located within the search boundary. The heuristic search starts with the random initialization of a group of candidate solutions, called a population, to eliminate biased exploration and to cover as much of the boundary area as possible [25], unless a priori knowledge of the targeted data is available that allows the search to be narrowed to a highly probable region. The candidate solutions are batches of encoded chromosomes, each consisting of a unique binary string (ones and zeros). For each iteration, the fitness calculation involves only the features associated with a chromosome value of '1' and ignores those assigned the value '0'. Every individual chromosome is subjected to an independent evaluation function and is compared to the other chromosomes through the fitness function in the following stage. Technically, GA feature selection performs an evolution process by applying a parallel computational system (population evaluation) and an intelligence strategy (optimization function) [26]. The computational time and effort are directly proportional to the size of the population group. Thus, a trade-off between accuracy and computational effort (implementation of the algorithm, parameter setting, calculation time and outcome interpretation) is inevitable for the GA wrapper method.
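For illustration only, the following minimal Python sketch shows how such a binary-chromosome GA wrapper might look, assuming scikit-learn's SVC as the classifier and the confusion rate (one minus training accuracy) as the fitness to be minimized; it is not the authors' MATLAB implementation, and all function names, genetic operators and parameter defaults are assumptions.

import numpy as np
from sklearn.svm import SVC

def confusion_rate(X, y, chromosome):
    # Fitness: fraction of misclassified training samples using only the
    # features whose chromosome bit is 1; bits set to 0 are ignored.
    if not chromosome.any():
        return 1.0  # penalise the empty feature subset
    Xs = X[:, chromosome.astype(bool)]
    clf = SVC(kernel="rbf").fit(Xs, y)
    return 1.0 - clf.score(Xs, y)

def ga_feature_selection(X, y, pop_size=25, generations=30, crossover_rate=0.8,
                         mutation_rate=0.05, stall_limit=20,
                         rng=np.random.default_rng(0)):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))    # random initial population
    best, best_fit, stall = None, np.inf, 0
    for _ in range(generations):
        fit = np.array([confusion_rate(X, y, c) for c in pop])
        if fit.min() < best_fit:                    # track the elite chromosome
            best, best_fit, stall = pop[fit.argmin()].copy(), fit.min(), 0
        else:
            stall += 1
        if stall >= stall_limit:                    # stop when the best value is static
            break
        # Tournament selection: the fitter of two random chromosomes becomes a parent.
        parents = np.array([pop[min(rng.choice(pop_size, 2, replace=False),
                                    key=lambda i: fit[i])] for _ in range(pop_size)])
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):         # single-point crossover
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n)
                children[i, cut:], children[i + 1, cut:] = (parents[i + 1, cut:].copy(),
                                                            parents[i, cut:].copy())
        flip = rng.random(children.shape) < mutation_rate   # bit-flip mutation
        children[flip] = 1 - children[flip]
        children[0] = best                          # elitism: keep the best chromosome
        pop = children
    return best, best_fit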

Data Collection
The bearing conditions dataset used in this study was downloaded from the Case Western Reserve University Bearing Data Center website and represents ball bearings in healthy and faulty conditions (rolling element, inner raceway and outer raceway faults). The test rig consisted of a 2-horsepower (HP) motor, a torque transducer and a dynamometer. The arrangement of the test rig was used to simulate different conditions of the bearing (Figure 1). The motor operated at approximately 1750 rpm with a 1-HP load. Vibration data were collected at a sampling rate of 12 kHz by accelerometers attached to the bearing housing. A total of 400 sets of time series vibrations were extracted from the raw continuous vibration signal collected for a 7-mil fault diameter under a 1-HP load. The 400 sets of vibration data were then divided into two subsets: one was used to establish the relationship between the input and output of the machine learning model (training phase), and the other was used to validate the trained machine learning model (testing phase). The distribution of the vibration dataset employed in this study is tabulated in Table 1.
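For illustration, a possible way to perform this segmentation and 50/50 division in Python is sketched below; the segment length and the function names are assumptions, as the paper does not state the number of points per sample.

import numpy as np

def segment_signal(raw_signal, n_segments=100, segment_length=1200):
    # Cut the first n_segments * segment_length points of a raw vibration
    # record into equal-length samples (the segment length is hypothetical).
    usable = np.asarray(raw_signal)[: n_segments * segment_length]
    return usable.reshape(n_segments, segment_length)

def split_half(segments, rng=np.random.default_rng(0)):
    # Randomly assign half of the segments to training and half to testing,
    # mirroring the 50/50 division described in the text.
    idx = rng.permutation(len(segments))
    half = len(segments) // 2
    return segments[idx[:half]], segments[idx[half:]]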

Fault Feature Extraction
In this section, the time series vibration data from Section 3 are subjected to statistical analyses with the purpose of acquiring statistical features. The obtained features, namely the skewness factor, kurtosis factor, crest factor, shape factor, impulse factor and margin factor, were computed from the corresponding equations in Table 2. Subsequently, the statistical features were used as inputs for SVM model training and testing. Each statistical feature has unique characteristics and carries informative data regarding the system status. Since there was a total of 100 samples for each bearing condition, 50% of the samples were selected randomly as training data to synthesize the machine learning model, while the remaining 50% were used to validate the trained machine learning model.
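Because Table 2 is not reproduced here, the sketch below uses the commonly accepted definitions of these six factors (standardized third and fourth moments for the skewness and kurtosis factors, and ratios built from the peak, RMS and mean absolute values for the remaining factors); the exact forms may differ in detail from the authors' equations.

import numpy as np
from scipy.stats import kurtosis, skew

def statistical_features(x):
    # Compute the six statistical factors for one vibration sample.
    x = np.asarray(x, dtype=float)
    abs_mean = np.mean(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    return {
        "skewness_factor": skew(x),                    # standardized third moment
        "kurtosis_factor": kurtosis(x, fisher=False),  # standardized fourth moment
        "crest_factor": peak / rms,
        "shape_factor": rms / abs_mean,
        "impulse_factor": peak / abs_mean,
        "margin_factor": peak / np.mean(np.sqrt(np.abs(x))) ** 2,
    }

# One row of six features per vibration sample:
# feature_matrix = np.array([list(statistical_features(s).values()) for s in segments])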

Feature Selection Methods
This section discusses the two feature selection methods that were implemented. It covers the theory, characteristics, implementation flow and the distinctive way each method determines the optimal feature subset to feed into the model classifier, so that the prediction performance for the targeted system condition can be optimized. Figure 2 displays the feature selection process of the GA, in which the arrows show the direction of chromosome flow through the intermediate stages.

Genetic Algorithm
Briefly, each chromosome commenced as a random candidate solution in the population and was subjected to an evaluation function to determine whether it would be selected for or eliminated from the next round based on its goodness of fit. Genetic operators acted on the chosen chromosomes, performing selection and reproduction to produce the population batch for the next cycle. The cycle continued until one of the stopping criteria was met, at which point the most probable feature subset was returned. The GA was implemented in MATLAB software (version R2014b) using its dedicated GA toolbox. The gaoptimset function was particularly useful because it allows the algorithm parameter settings to be enclosed under a single command. The investigated dataset was a data matrix describing roller bearing operating conditions, composed of six numerical features derived from the captured vibration signals: the skewness factor, kurtosis factor, crest factor, shape factor, impulse factor and margin factor. The interest of the current study was to employ the GA to discover the distinctive feature subset that best explains the roller bearing condition. The GA-based feature selection parameter settings are listed in Table 3, below.
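As a hedged illustration of how such a parameter setting maps onto the hypothetical ga_feature_selection sketch given earlier (Table 3 itself is not reproduced here), the population size of 25, the 30-generation limit and the 20-generation stall criterion reported in the Results could be passed as follows; X_train and y_train denote the training feature matrix and bearing-condition labels.

best_chromosome, best_confusion = ga_feature_selection(
    X_train, y_train,
    pop_size=25,       # population size discussed in the Results
    generations=30,    # maximum number of generations (Figure 4)
    stall_limit=20,    # stop if the best fitness is static for 20 generations
)
selected_features = np.flatnonzero(best_chromosome)  # indices of the selected features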

The Proposed Feature Selection Tree
In this study, a novel feature selection method named the feature selection tree (FST) is proposed to carry out the feature selection task. The FST employs the SVM as a wrapper in feature selection, so that the performance of each feature combination is measured by the SVM training accuracy. The FST shortens the execution time by avoiding repeated computation of the performance of identical feature combinations; such repeated assessment usually occurs in feature selection algorithms that generate feature combinations randomly, and it adds significant unnecessary computational time. Thus, the FST only evaluates unique combinations of features. In addition, the FST generates the feature combinations of the next level based on the performance of the previous level. Figure 3 illustrates the methodology of the FST algorithm. In the first-level selection, the algorithm evaluates each individual feature. Then, the algorithm generates the second-level feature combinations by combining the unselected individual features with the features that performed above average (red-outlined rectangles in Figure 3). This process terminates when the feature combinations have fully utilized all extracted features. Lastly, the algorithm selects the feature combination with the least number of features from the top 5% of the highest training accuracies (yellow-filled rectangles in Figure 3) as the most representative features of the entire dataset; as a result, the skewness factor and shape factor (i.e., features 1 and 4) were selected in this example. In addition to selecting the most representative features for the dataset, feature selection also reduces the feature dimensionality for the machine learning algorithm.
Fig. 3. The proposed feature selection algorithm (features 1, 2, 3, 4, 5 and 6 represent the skewness factor, kurtosis factor, crest factor, shape factor, impulse factor and margin factor, respectively).
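The following minimal Python sketch reflects one reading of the FST procedure described above, again using SVM training accuracy as the wrapper criterion; the function names, the scikit-learn classifier and the tie-breaking rule among the top 5% are assumptions rather than the authors' code.

import numpy as np
from sklearn.svm import SVC

def train_accuracy(X, y, subset):
    # Wrapper criterion: SVM training accuracy on the chosen feature subset.
    Xs = X[:, sorted(subset)]
    clf = SVC(kernel="rbf").fit(Xs, y)
    return clf.score(Xs, y)

def fst_select(X, y, top_fraction=0.05):
    n = X.shape[1]
    scores = {}                                    # each unique combination is scored once
    level = [frozenset([i]) for i in range(n)]     # level 1: individual features
    for s in level:
        scores[s] = train_accuracy(X, y, s)
    while level:
        avg = np.mean([scores[s] for s in level])
        parents = [s for s in level if scores[s] >= avg]   # above-average combinations
        children = set()
        for p in parents:                          # extend each parent with an unused feature
            for f in range(n):
                if f not in p:
                    c = p | {f}
                    if c not in scores:            # skip already-evaluated combinations
                        children.add(c)
        for c in children:
            scores[c] = train_accuracy(X, y, c)
        level = [c for c in children if len(c) < n]  # stop once all features are combined
    # Top 5% of training accuracies over the whole tree, then the fewest features.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[: max(1, int(np.ceil(top_fraction * len(ranked))))]
    best = min(top, key=len)
    return sorted(best), scores[best]

Applied to the six statistical factors above, this sketch would return the indices of the selected factors (features 1 and 4 in the example of Figure 3) together with their training accuracy.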

Results and Discussion
The following section presents, compares and discusses the performance of the aforementioned two types of feature selection method. The performance quality under consideration includes implementation practicability, execution time and accuracy.

Genetic Algorithm
As the GA toolbox in the MATLAB software targets the global minimum point as the optimization goal, the confusion rate of the SVM classifier was adopted as the fitness function. Confusion occurs when, given a set of features, the SVM classifier is unable to predict the roller bearing condition as defined. The GA optimizes the fitness function by searching for the minimum of this unwanted confusion rate: the lower the error, the higher the classification success rate. Figure 4 illustrates the GA-simulated fitness evaluation results over 30 generations using the roller bearing dataset. Two differently coloured pointers were plotted over the generations. The mean fitness value, portrayed by the blue pointer, was reduced from 48% confusion at the start of the simulation to a low point of 19% as the simulation progressed, converging at approximately 20% towards the end of the evaluation. The GA started with an unbiased, large search space and learned to concentrate on the high-probability area and local optimal points; hence, the average population fitness value decreased over time. It was observed that the mean fitness value experienced a minor peak after the eighth generation. This was due to the appearance of a new elite child, which steered the offspring towards a new probable area during the transient period.
The best fitness value, represented by the black pointer, remained steady at a 23.5% confusion rate until it dropped to 17% during the seventh generation, and then remained fixed for the rest of the simulation period. The simulation stopped at the 27th generation instead of the 30th, since the best fitness value had been static for 20 iterations. The GA stopping criteria were supported by the convergence of the two pointer lines, with only a minor gap observed between them from the 20th iteration onwards. Inspection of the figure shows that, as the generations increased, the GA recruited progressively fitter population groups, but this did not further improve the best confusion rate. Hence, with reference to the stopping criteria, the algorithm was satisfied with the feature subset obtained. The minor perturbation in the mean fitness value may have resulted from the recombination process while attempting to improve population diversity. The best chromosome included features 1 (skewness factor) and 4 (shape factor), providing the lowest confusion rate of 17%. The performance of the GA-SVM classifier was encouraging, as the classification accuracy improved while only 33% of the total features were retained: the confusion rate was 17% with the two selected features, compared with 24% when utilizing all six features.
Typically, GA feature selection deals with a large dataset, and the robustness of the GA against exponential increases in hypercube dimensions and against data noise is well documented. In this case study, although the GA managed to identify an optimal feature subset, concerns were raised regarding the implementation of the algorithm. The effectiveness of the algorithm relied largely on parameter tuning, particularly the design of the population size. An over-designed population is capable of obtaining a fast outcome, but at a severe computational cost, with a low rate of unique chromosomes and chromosome redundancy prompting relatively high variation in the average fitness. In contrast, an under-designed population ensures a high rate of unique chromosomes but was an unacceptable option, as it led to longer simulation periods for value convergence and, in the worst case, to falling into a local optimum. From the evaluation findings, the trade-off effect could be minimized with parameter adjustment: a population size of 25 yielded amongst the highest numbers of unique initial chromosomes together with a fast and accurate feature subset.

Feature Selection Tree
Table 4 shows the training accuracy of the key combinations of features at each level. The yellow-shaded feature combinations are those above the average training accuracy at each level, and the blue-shaded training accuracies designate the top 5% of training accuracies in the entire table. As a result, features 1 and 4 (skewness factor and shape factor) were selected to represent the entire bearing conditions dataset. The training accuracies in Table 4 indicate that entering all extracted features into the machine learning algorithm does not guarantee the highest classification accuracy, as the training accuracy for the selected features (i.e., features 1 and 4) was 81%, whereas the training accuracy for all extracted features was 74%. Similarly, the testing accuracy on the bearing faults dataset was 83% for the selected features and 76% for all extracted features. A representative feature combination for the entire dataset was therefore selected using the proposed FST algorithm.
Table 4. Training accuracy for the key combinations of features (features 1, 2, 3, 4, 5 and 6 represent the skewness factor, kurtosis factor, crest factor, shape factor, impulse factor and margin factor, respectively).
Table 5 compares the testing accuracy for the features selected by the GA and the FST algorithm. The analysis showed that the classification accuracy with the features selected by the two feature selection methods is equivalent; therefore, the proposed FST is capable of selecting a feature subset equivalent to that of the GA. Execution time comparison analyses for both feature selection methods are tabulated in Table 6. Overall, the FST execution time was considerably shorter than that of the GA; the comparison showed that the FST executed approximately 13 times faster than the GA. Thus, the proposed FST can be embedded into the machine learning training process to execute effective automated feature selection.

Conclusion
In this study, a novel feature selection method designated the feature selection tree (FST) was proposed. The performance of the proposed FST was compared with that of one of the most commonly used feature selection methods, the genetic algorithm (GA). Preliminary results showed encouraging feature selection performance, with a notable increase in model classification accuracy when using the reduced feature matrix. Evidently, a desirable selected feature size led to the minimization of overfitting complications, resulting in an optimal, unambiguous model interpretation. Further analysis showed that the proposed FST was able to select a feature subset equivalent or superior to that of the GA to represent the entire dataset. Although the proposed FST selected features identical to those of the GA for the bearing data, the execution time of the proposed FST was reduced by up to 92% compared with the GA. In summary, the state-of-the-art FST is able to select an equivalent or better feature subset in a shorter execution time than the GA, which is essential when dealing with a large number of inputs. Therefore, the favourable FST can be embedded into machine learning algorithms in order to improve their performance.