A hybrid k-means-GMM machine learning technique for turbomachinery condition monitoring

Industrial practice typically applies pre-set original equipment manufacturer (OEM) limits to online turbomachinery condition monitoring. However, this technique treats any sensor reading within range as a normal state and therefore often overlooks a developing degradation process. Turbomachinery applications are thus in dire need of responsive monitoring analysis that detects degradation before it leads to machine breakdown or a more disastrous event. A machine learning algorithm consisting of k-means and a Gaussian Mixture Model (GMM) is proposed to detect signal trends or anomalies over the machine's active period. The aim of the unsupervised k-means step is to determine the number of clusters, k, according to the total number of trends detected in the processed dataset. Next, the designated k is input into the supervised GMM algorithm to initialize the number of components. Experimental results showed that the k-means-GMM model is not only capable of statistically defining machine state conditions, but also yields a time-dependent clustering image that reflects degradation severity, as a means to achieve predictive maintenance.


Introduction
The latest developments in the Internet of Things (IoT) ensure remote data access for users. The ease of data availability promotes data analytics in various ways, with the aim of enhancing productivity and economic gain. Machine learning (ML) algorithms are amongst the effective data-driven options due to their compatibility and reliability [1]. Nonetheless, a feasible combination of data input and algorithm is critical in order to deliver optimal accuracy and minimize biased reasoning. The effect is particularly significant in industrial operations [2]. Data misinterpretation that neglects fault development or triggers a false alarm will disrupt an already hectic production schedule and, at worst, cause casualties.
Turbomachinery is an energy transmission system involving a fluid and a rotor. In heavy industry, the fluid dynamic mechanism typically takes the form of a turbine or compressor, in sectors such as power generation, transportation, petrochemical and mining [20]. In view of the non-trivial machine complexity, operating condition tracking is of the utmost priority for both cost and safety reasons. The present study observes the degradation process of turbomachinery, which is often missed by current OEM standard practice. The insensitivity towards degradation trends is likely because static OEM limits recognise any sensor reading within the allowable range as a normal operating condition, regardless of drift over time. As such, in the absence of time-varying parameter monitoring, predictive maintenance is almost impossible until a major machine failure occurs. Evidently, implementing ML for turbomachinery is tedious not only because existing OEM a priori knowledge is limited, but also because of complicated black-box system behaviour. Therefore, a hybrid ML algorithm comprising unsupervised k-means and supervised GMM techniques is proposed in the interest of time-dependent discrepancy analysis.
The introduction of a combined algorithm is motivated by various factors. From the perspective of black-box testing, it is unclear how many machine state conditions (off, active operation, abnormal operation) exist within the raw sensor readings. Meanwhile, standard unsupervised learning alone must trade off training data for cross-validation, whereas supervised classification demands predetermined class labels. Inevitably, either dataset dimensionality reduction or premature labelling would contribute to an ineffective or biased classifier outcome. Enabling automated mapping of unlabelled data therefore requires interdependent, multi-layer pattern recognition [4]. For example, hierarchical clustering (HC) and self-organizing map neural networks (SOMNNs) produced newly clustered operation fingerprints to assist empirical gas turbine fault detection and diagnosis (FDD) [19]. Similarly, an alternative multi-stage ML algorithm was developed for predicting turbomachinery faults: following Bayesian interval hypothesis noise reduction, the filtered wavelet decomposition input is assigned to a dynamic stochastic neural network for classification [21].

k-means and Gaussian Mixture Model: Functions and Objectives
The sequence of the combined algorithm is as follows. Firstly, unsupervised learning performs clustering on a given dataset with an unknown class label framework. General clustering involves developing a segmentation-based function from the statistical distribution, depth, distance, density or spectral properties of the data [3]. Next, supervised learning trains a reasoning function using the class labels (supervisory signal) available from the previous stage. The supervised function is responsible for mapping and validating new input data, via a probabilistic or heuristic mechanism. For turbomachinery applications specifically, the k-means-GMM algorithm identifies the number of state conditions and determines abnormal operation mathematically. The number of means observed in the time-series data determines the machine state conditions available, while an anomaly is measured by its distance from the mean value. Next, the GMM illustrates data location in a 2-D graph by batches. Similar data behaviours are grouped together, whereas a degraded data population separates from the majority normal condition. The raw dataset is subjected to initial statistical preprocessing before being deployed as input to the machine learning algorithm.

k-means unsupervised learning
k-means is a clustering method based on a distance metric. The label assigned to a target instance is decided by the closest mean. The arbitrary k is an a priori positive integer that defines the number of centroids, corresponding to the total number of turbomachinery state conditions. Other than by cross-validation, the fitness of a k-means generated cluster can be assessed by silhouette analysis. A silhouette plot indicates the segregation between nonparametric cluster areas, with coefficients in the range [-1, 1]. The silhouette value is derived from Equation (1),

$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \quad (1) $$

where $a_i$ is the average distance between the ith data point and its peers in the same cluster, while $b_i$ is the minimum average distance between the ith data point and the points of any other cluster. For every data instance, a positive silhouette coefficient reflects a closer range to the assigned cluster than to adjacent ones, and vice versa [5]. In the bigger picture, the overall silhouette coefficient yields the confidence level of the designated cluster setting and group cohesiveness. As such, the silhouette plot is used to identify the unknown k that corresponds to the optimal cluster partition.
In recent research, sentiment analysis achieved substantial improvement by avoiding domain dependency with nonparametric k-means clustering initialization [6]. An improved Manhattan Frequency k-means technique was proposed that uses the modality frequency of features to overcome the deficiencies of partitional clustering [7]. k-means was also selected to provide a probable predictor clustering framework as covariance matrix input in a novel fused clustered least squares (FCLS) method [8]. In addition, k-means was implemented as an alternative to spectroscopy in classifying a high-resolution near-infrared stellar spectra dataset [9]. Meanwhile, k-means was offered as a quick option for advising patients on whether to choose conservative or operative therapy using 3-D curvature analyses [10]. It is worth acknowledging that k-means has proven to be accurate and flexible in various fields of study, ranging from engineering and astronomy to speech recognition and medicine.
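The silhouette value described above can be computed directly from pairwise distances. The following is a minimal sketch in plain Python, assuming Euclidean distance and using four made-up 2-D points; the function name `silhouette_values` and the sample data are illustrative only and are not taken from the case study.

```python
import math

def silhouette_values(points, labels):
    """Return the silhouette value (b - a) / max(a, b) for every point."""
    clusters = sorted(set(labels))
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from point i to the members of each cluster (itself excluded).
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q)
                     for j, (q, lab) in enumerate(zip(points, labels))
                     if lab == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist.get(own)  # cohesion: average distance to own cluster
        others = [d for c, d in mean_dist.items() if c != own]
        if a is None or not others or max(a, min(others)) == 0.0:
            values.append(0.0)  # convention for singleton or degenerate clusters
            continue
        b = min(others)  # separation: nearest other cluster
        values.append((b - a) / max(a, b))
    return values

# Illustrative data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_scores = silhouette_values(points, [0, 0, 1, 1])
bad_scores = silhouette_values(points, [0, 1, 0, 1])
mean_good = sum(good_scores) / len(good_scores)
```

With the sensible labelling every value is positive and close to 1, whereas the deliberately wrong labelling yields negative values, which is the behaviour the silhouette plot exploits when candidate values of k are compared.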

Gaussian mixture model (GMM) supervised learning
The Gaussian mixture model (GMM) is described as a supervised technique for representing multivariate probability distributions [11]. The parametric graphical model consists of a weighted sum of Gaussian component densities estimated by iterative Expectation-Maximization (EM) optimization:

$$ p(x \mid \lambda) = \sum_{i=1}^{k} w_i \, g(x \mid \mu_i, \Sigma_i) \quad (2) $$

where the component density for each of the k labels/components is a function of the corresponding mean vector $\mu_i$ and covariance matrix $\Sigma_i$, with M the dimension of x,

$$ g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{M/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right) \quad (3) $$

Note that $\Sigma_i$ can be either full rank (rank($\Sigma_i$) = min(M, k)) or diagonal, and that the mixture weights sum to 1, $\sum_{i=1}^{k} w_i = 1$.
Lately, the EM algorithm combined hierarchical clustering and GMM to develop a computationally effective initial estimate for bulky Monte Carlo datasets [12]. Forward stage-wise additive modelling was embedded in a boosted conditional GMM to overcome uncertainty in novelty detection due to the statistical dependency of random variables [13]. By applying various phoneme classes as input features, a GMM-embedded posteriorgram design outperformed standard practice by introducing implicit constraints into probability assignment initialization [14]. To allow face detection under dim settings, GMM was employed to determine variance-based Haar-like features for skin tone segmentation [15].
A revised GMM was developed by treating the input parameter as a function of Gaussians [16]. The benefit is threefold for atomic model and electron microscopy 3D density map applications: it identifies atomic radii similar to the input, eliminates the singularity issue, and requires the least computation time. In a 3D data compression proposal, GMM was used to substitute a 3D planar model with a normal distribution function [17]. Last but not least, a stepwise conditional transformation was introduced to GMM when subject to a deterministic geostatistical trend, to ensure a low-noise non-stationary numerical model [18].
Thanks to the optimization of the mean and covariance of each component, the GMM algorithm demonstrates adaptive classification by fitting most, if not all, data points into the corresponding k components. Compared to other supervised methods, GMM allows more than one class label to be assigned to an instance when components overlap. In contrast to being forced to pick one class label only, mixed membership is useful when uncertainty is involved. Under such circumstances, a false alarm is less likely to be triggered until distinctive degradation traits appear, i.e. two isolated GMM components. Provided with an appropriate number of components k, equal to the number of class labels, GMM illustrates both the baseline and the updated data batch as separate components. To determine the occurrence of machine degradation using GMM, the following hypotheses are established:
1. Machine condition has degraded if the baseline and updated batch components separate from each other;
2. Machine condition is acceptable if the baseline and updated batch components overlap.
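To make the EM estimation of a weighted Gaussian mixture concrete, the sketch below fits a univariate mixture in plain Python. It is a simplification under stated assumptions: the paper works with multivariate data, whereas this example is one-dimensional, uses an evenly spaced deterministic initialisation, and runs a fixed number of iterations. The function names and the sample data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, k, iterations=50, min_var=1e-6):
    """Fit a k-component univariate GMM by EM; returns (weights, means, variances)."""
    lo, hi = min(data), max(data)
    mean_all = sum(data) / len(data)
    var_all = sum((x - mean_all) ** 2 for x in data) / len(data)
    # Assumed initialisation: means spread evenly over the data range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    variances = [max(var_all, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            if n_i < 1e-12:
                continue  # leave an empty component unchanged
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, min_var)
        total_w = sum(weights)
        weights = [w / total_w for w in weights]  # keep the weights summing to 1
    return weights, means, variances

# Illustrative data: two groups of readings around 1.0 and 5.0.
data = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 4.9, 5.0, 5.1, 5.0, 4.95, 5.05]
weights, means, variances = fit_gmm_1d(data, k=2)
```

The variance floor `min_var` is a common safeguard against a component collapsing onto a single point; its value here is arbitrary.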
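The paper assesses overlap and separation of components visually. One possible way to quantify the same idea, offered here only as an assumption and not as the authors' method, is the Bhattacharyya coefficient between the baseline and updated Gaussian components. The sketch assumes diagonal covariances, and the decision threshold of 0.05 is a hypothetical value chosen for illustration.

```python
import math

def bhattacharyya_coefficient(mean_a, var_a, mean_b, var_b):
    """Overlap in [0, 1] between two Gaussians with diagonal covariances."""
    distance = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)  # averaged variance along this axis
        distance += 0.125 * (ma - mb) ** 2 / v
        distance += 0.5 * math.log(v / math.sqrt(va * vb))
    return math.exp(-distance)

def is_degraded(baseline, updated, overlap_threshold=0.05):
    """Hypothesis 1: degraded when the two components no longer overlap.

    baseline and updated are (mean_vector, variance_vector) pairs;
    overlap_threshold is an assumed, illustrative cut-off.
    """
    overlap = bhattacharyya_coefficient(baseline[0], baseline[1],
                                        updated[0], updated[1])
    return overlap < overlap_threshold

# Illustrative components in a 2-D feature space.
baseline = ((0.0, 0.0), (1.0, 1.0))
same_month = ((0.2, -0.1), (1.0, 1.2))
drifted_month = ((10.0, 10.0), (1.0, 1.0))
overlapping = is_degraded(baseline, same_month)
separated = is_degraded(baseline, drifted_month)
```

A coefficient of 1 means the two components coincide, and values near 0 mean they are isolated, which corresponds to the degraded case in the hypotheses above.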

k-means-GMM algorithm setup
Several steps are required to implement data-driven turbomachinery condition monitoring using k-means-GMM. Firstly, dataset cross-correlation is applied to determine the relevance of each parameter with respect to the turbomachinery efficiency equation E3D2, Equation (4). Next, the Kalman filtering (KF) estimation technique is adopted to detect any time-varying variables. The time-dependent variable trajectory is estimated by assigning E3D2 and the relatively highly correlated parameter subset as output and input respectively, via iterative simulation, Equations (5)-(10). Then, the estimated parameter subset undergoes two-stage k-means unsupervised learning. The objective is to separate machine off and active operation data (first stage) before focusing solely on clustering the active dataset. The k identified by k-means, representing the number of trends, is supplied as the number of components k during supervised GMM initialization. Last but not least, a 2-D GMM is chosen for visual inspection of the development of degradation. The degradation severity is determined by comparing the resemblance of the baseline and newly updated batch data. The algorithm layout is illustrated in Figure 1.

E3D2 simulation modelling
At time instance k, the E3D2 output is assumed to be a random walk over the variation of the given input signal subset. Applying the sampled data from instance 1 to k-1 to the one-step-ahead prediction $\hat{\theta}_{k|k-1}$ leads to the one-step-ahead prediction error. The estimation error is used to update the covariance matrix $P_k$ and the correction gain $L_k$, where $\hat{\theta}_{k|k}$ is computed to yield the least mean square error in the estimate. The revised covariance serves as prior knowledge for readjusting the estimated parameter in the next iteration.
The LM 2500 axial compressor (shown in cross-sectional view) is governed by the OEM GE performance trending manual GEK 92738. GEK 92738 monitors essential parameter measurements including inlet and discharge temperature (°C), PCD and atmospheric pressure (bar), vibration (mm/s), blade rotational speed (RPM) and fuel flow rate (kg/hr). These measurements are recorded as time series at a uniform half-hour interval. The sequential discrete-time parameter events are then evaluated against the thermodynamic reference range limits and E3D2.
The present case study involves a yearly dataset (October 2016 - November 2017), with a machine breakdown reported at the end of the measurement period. During manual inspection, the compressor blades were found to be severely chipped, even though the active parameter measurements were in accordance with the preliminary range limit assessment. In the interest of identifying the available machine state conditions and the changes in the degradation process, the same dataset is subjected to the k-means-GMM algorithm, as per Section 2. The performance of the proposed algorithm is discussed in detail in the subsequent section.
The statistical analysis suggests that Temperature T3 and PCD Pressure can be applied to the k-means-GMM algorithm, since this input subset is directly associated with the E3D2 calculation. Even though Temperature T2 and Pressure P2 contribute to E3D2, both indicators are excluded due to their relatively low correlation coefficients.
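As an illustration of the recursive estimation described above, the sketch below implements the textbook Kalman filter for a linear regression whose coefficients follow a random walk: the prediction keeps the previous estimate, the prediction error drives the correction through the gain, and the covariance is updated for the next step. This is the standard form and may differ in detail from the paper's Equations (5)-(10). The noise settings `q`, `r` and `p0`, the function name, and the synthetic signals are all assumptions.

```python
import math

def kalman_random_walk(inputs, outputs, q=1e-4, r=1e-2, p0=1.0):
    """Track time-varying regression coefficients theta in y = phi . theta + noise.

    q: assumed random-walk (process) variance added to P at every step
    r: assumed measurement noise variance
    p0: assumed initial variance of each coefficient
    Returns the list of coefficient estimates, one per time step.
    """
    n = len(inputs[0])
    theta = [0.0] * n
    P = [[p0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    history = []
    for phi, y in zip(inputs, outputs):
        # Prediction: a random walk keeps theta and inflates the covariance.
        for i in range(n):
            P[i][i] += q
        # One-step-ahead prediction error.
        error = y - sum(p * t for p, t in zip(phi, theta))
        # Correction gain L = P phi / (phi' P phi + r).
        P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
        s = sum(phi[i] * P_phi[i] for i in range(n)) + r
        gain = [v / s for v in P_phi]
        # Measurement update of the estimate and of the covariance.
        theta = [t + g * error for t, g in zip(theta, gain)]
        P = [[P[i][j] - gain[i] * P_phi[j] for j in range(n)] for i in range(n)]
        history.append(list(theta))
    return history

# Synthetic check: constant true coefficients and noiseless output.
true_theta = [2.0, -1.0]
inputs = [[1.5 + math.sin(0.3 * k), 0.5 + math.cos(0.2 * k)] for k in range(200)]
outputs = [sum(p * t for p, t in zip(phi, true_theta)) for phi in inputs]
trajectory = kalman_random_walk(inputs, outputs)
final_theta = trajectory[-1]
```

In the case study the output would be E3D2 and the inputs the selected parameter subset; the coefficient trajectory returned here plays the role of the KF-estimated parameter vector.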
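The parameter screening step can be sketched as a correlation ranking against the target output. The example assumes a zero-lag Pearson correlation and an illustrative threshold of 0.8, since the paper states neither the exact correlation measure nor the cut-off; the parameter names and values are synthetic placeholders, not measurements from the dataset.

```python
import math

def pearson(x, y):
    """Zero-lag Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # a constant series carries no correlation information
    return cov / (spread_x * spread_y)

def select_parameters(candidates, target, threshold=0.8):
    """Keep the candidates whose absolute correlation with the target exceeds the threshold."""
    return [name for name, series in candidates.items()
            if abs(pearson(series, target)) > threshold]

# Synthetic placeholder series (not case-study data).
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "param_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # moves with the target
    "param_b": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related
}
selected = select_parameters(candidates, target)
```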

Results and Discussion
The selected parameters T3 and PCD are represented by the extracted KF-estimated parameter vector, as described in Section 2. The KF-estimated coefficients can act as the ratios contributing to the E3D2 output over active time, since the KF-simulated output and the actual output in Figure 9 are identical. Moreover, the evidence of a trend is strengthened when the KF coefficients and E3D2 are plotted in a 3-D scatter diagram (Figure 10). When the active dataset is separated into batches, the Euclidean distance of the scatter points with respect to the origin is found to shift from left to right over the period. In other words, the proportion of parameter T3 in equation E3D2 increases gradually at the expense of Pressure PCD.
Subsequently, k-means-GMM is responsible for quantifying the KF coefficient trends by means of a distance metric. As mentioned earlier, the Euclidean norm generated from the selected parameter subset and E3D2 output is used to determine the k cluster mean values. By setting an arbitrary k = 2 during first-stage clustering, the original KF parameter subset is split into active and off groups according to the nearest mean values (Figure 11). The k-means filtering model proves accurate, since the related silhouette plot yields values relatively close to the optimal value of 1 (Figure 12).
Next, a total of 7638 preliminarily filtered active Euclidean norm data points from cluster number two are selected for abnormality detection (Figure 13). k-means unsupervised learning with the same arbitrary k = 2 setup identified mean values for the active and abnormal conditions respectively. Although the mean values are relatively close together, only 0.001 apart, the silhouette plot shows positive values only (Figure 14). In other words, the second-stage k-means clustering model achieves satisfactory classification accuracy under a demanding distance constraint. Additionally, any instance with a Euclidean norm value over 0.056 during the active operation period is labelled as abnormal activity. Considering that unsupervised k-means satisfied the earlier hypothesis of three machine state conditions, it is realistic to implement GMM machine degradation monitoring.
To monitor the machine degradation process, supervised GMM learning clusters the active parameter subset on a monthly basis. The suggested data segregation allows consistent time period comparison and a sufficient aggregate of active data points for decision making. The baseline model is the resultant component (cluster area) of the first active month. The main target of 2-D GMM graphical modelling is to visualize the degradation symptoms due
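The batch-wise shift in distance from the origin can be summarised numerically as the mean Euclidean norm per batch. The sketch below assumes equally sized consecutive batches and uses made-up vectors; the function name and the batch size are illustrative.

```python
import math

def batch_mean_norms(vectors, batch_size):
    """Mean Euclidean norm of consecutive batches of coefficient vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    means = []
    for start in range(0, len(norms), batch_size):
        batch = norms[start:start + batch_size]
        means.append(sum(batch) / len(batch))
    return means

# Illustrative vectors: the second batch lies farther from the origin.
vectors = [(3.0, 4.0), (3.0, 4.0), (6.0, 8.0), (6.0, 8.0)]
drift = batch_mean_norms(vectors, batch_size=2)
```

A steadily changing sequence of batch means is a simple numerical counterpart to the left-to-right shift observed in the scatter diagram.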
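The two-stage clustering can be sketched with a small one-dimensional k-means applied to Euclidean norm values. The sketch makes several assumptions not stated in the paper: centroids are initialised evenly between the minimum and maximum, the active cluster is taken to be the one with the larger mean in stage one, and the abnormal cluster the one with the larger mean in stage two. The sample values are synthetic and only loosely mimic the magnitudes reported above.

```python
def kmeans_1d(values, k=2, iterations=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids). Requires k >= 2."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # assumed initialisation
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids

def two_stage_states(norms):
    """Label each sample as 'off', 'active' or 'abnormal' using two k = 2 passes."""
    # Stage 1: separate off data from active data.
    labels1, cents1 = kmeans_1d(norms, k=2)
    active_cluster = max(range(2), key=lambda c: cents1[c])
    active_idx = [i for i, lab in enumerate(labels1) if lab == active_cluster]
    # Stage 2: cluster the active data only.
    labels2, cents2 = kmeans_1d([norms[i] for i in active_idx], k=2)
    abnormal_cluster = max(range(2), key=lambda c: cents2[c])
    states = ["off"] * len(norms)
    for i, lab in zip(active_idx, labels2):
        states[i] = "abnormal" if lab == abnormal_cluster else "active"
    return states

# Synthetic norm values (not case-study data).
norms = [0.0, 0.001, 0.0, 0.050, 0.051, 0.049, 0.050, 0.058, 0.059]
states = two_stage_states(norms)
```

The number of distinct means found across both stages (off, active, abnormal) is what the paper passes on as the number of components k for the GMM.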

Conclusion
The present case study investigated the feasibility of implementing a hybrid machine learning algorithm to monitor turbomachinery state conditions, as opposed to OEM limits. The proposed algorithm comprises mutually informed unsupervised k-means and supervised GMM. The algorithm is designed to input the number of trends k, validated by two-stage k-means, as the number of initialized GMM components. Based on the given dataset, three machine states (off, active, abnormal) are identified, together with a satisfactory degradation tracking visualization. Note that the number of trends is equivalent to the number of mean values discovered over the measurement period. The machine degradation process, in turn, is tracked by comparing the GMM k components over time.
The benefit of practising the k-means-GMM algorithm in turbomachinery condition monitoring is multifold. By feeding two out of the nine OEM-suggested parameters into the data-driven k-means-GMM, it is possible to develop a unique turbomachinery classification mechanism while avoiding biased machine condition labelling. Ultimately, k-means-GMM degradation tracking offers a visual aid to assist predictive maintenance decision making.