A comparison study between different kernel functions in the least square support vector regression model for penicillin fermentation process

Soft sensors are becoming increasingly important as tools for inferring difficult-to-measure process variables to achieve good operational performance and economic benefits. Recent advances in machine learning provide an opportunity to integrate machine learning models into soft sensing applications, such as Least Square Support Vector Regression (LSSVR), which copes well with nonlinear process data. However, the LSSVR model usually uses the radial basis function (RBF) kernel for prediction, which has demonstrated its usefulness in numerous applications. This study therefore extends the use of non-conventional kernel functions in the LSSVR model, with a comparative study against the widely used partial least square (PLS) and principal component regression (PCR) models, using root mean square error (RMSE), mean absolute error (MAE) and error of approximation (Ea) as performance benchmarks. Based on the empirical results from the case study of the penicillin fermentation process, the Ea of the multiquadric (MQ) kernel is 63.44% lower than that of the RBF kernel for the prediction of penicillin concentration; hence, the MQ kernel LSSVR outperformed the RBF kernel LSSVR. The study serves as empirical evidence of LSSVR performance as a machine learning model in soft sensing applications and as reference material for further development of non-conventional kernels in LSSVR-based models, since many other functions could be used in the hope of increasing prediction accuracy.


Introduction
Industrial processes employ various hardware sensors, such as flow rate, pressure and temperature sensors, to deliver data to control systems. This enables process monitoring and control to guarantee consistency in product quality. However, some crucial process variables are not easily measurable and require expensive sensors or offline laboratory testing with a significant time delay to ascertain product quality [1]. These limitations, coupled with drawbacks of hardware sensors such as sensor faults, low sampling frequency and maintenance requirements [2], have led many researchers to construct predictive models based on easily measured variables to estimate the difficult-to-measure ones; such predictive models are termed soft sensors [3]. The potential of automatic control using soft sensors, and the economic advantages of doing so, has generated great interest in soft sensors from both academia and industry. However, there are still unresolved issues in soft sensor development, with measurement noise, collinear features, missing values, varying sampling rates and data outliers being the most common [4]. Another issue is the dynamic nature of process plants: the mixture of gradual and abrupt changes in the process poses difficulty for soft sensors and often results in the degradation of prediction accuracy [5]. Traditionally, these problems are tackled with linearised models, overdesigned equipment and the avoidance of complex operating regimes, which usually results in a loss of economic advantage [1].
Vapnik's Support Vector Machine (SVM), first designed for classification [6] and subsequently extended to regression as Support Vector Regression (SVR), has demonstrated considerable potential in applications to high-dimensional nonlinear problems. This is because of its excellent generalisation capabilities over PLS-based methods, with great performance under limited training data [7]. SVR, however, requires complex quadratic programming optimisation, which reduces modelling efficiency. Work from [8] incorporated the least-square method into SVR; the result, later termed Least Square Support Vector Regression (LSSVR), greatly improved modelling efficiency, as the complex optimisation is replaced by solving a series of linear equations, making LSSVR an ideal algorithm for large-scale regression problems while maintaining the great performance of SVR when data are limited. The core idea behind LSSVR is to translate nonlinear samples from a low-dimensional space into a high-dimensional feature space by employing a kernel function, allowing the samples to be partitioned linearly in that space to fulfil the fitting prediction requirement [9].
Due to the lower computational complexity of its optimisation process compared to the SVR model, LSSVR has been broadly used in the prediction of quality variables in nonlinear processes. For instance, [9] developed an LSSVR model with a Gaussian kernel function that can accurately assess anomalous observations that may occur in the estimated value and forecast GPS (Global Positioning System) signals with improved precision. Besides that, [10] designed a smart LSSVR model adopting the Gaussian radial basis function (RBF) kernel to address the time-varying, multiphase and nonlinear features of batch operations. Nevertheless, none of these studies compared the performance of different kernel functions in the LSSVR model, even though the kernel function is the premise of constructing an LSSVR model. Additionally, different kernel functions may result in varied accuracy levels for LSSVR models. The kernel function type must therefore be chosen carefully in order to construct a highly accurate LSSVR model [11]. Despite that, the selection of the kernel function has typically been based solely on the researchers' prior knowledge and experience.
To date, extensive searches in many academic databases for comparison of LSSVR performance with different kernels show limited results. This work seeks to fill this research gap by expanding the use of kernel approaches in LSSVR for the prediction of process variables in the fermentation of penicillin. To evaluate the effectiveness of kernels in LSSVR, the predictive performance of the model with various kernels is assessed and compared with that of other well-established techniques, such as principal component regression (PCR) and partial least square regression (PLSR). By comparing the performance of different kernel functions in an LSSVR model, it is possible to identify the kernel function that yields the best performance for a given dataset. This can help machine learning practitioners to choose the most appropriate kernel function for their specific application, leading to better overall performance of the model.

Methodology
In this section, the kernel functions, LSSVR, PLSR, PCR, data splitting and parameter tuning as well as specifications of computer configuration are described.

Kernel functions
In the kernel functions, the input and output variables are denoted by x and y. Additionally, b is the kernel parameter and σ is the standard deviation (sigma). There are 12 different kernel functions used in this study, and their equations are given in Table 1.

Table 1. The general form of kernel functions applied in this study [3].

[Table 1 lists, under the columns "Corresponding kernel" and "Kernel function", the 12 kernels studied: LIN, POLY, RBF, LAPLACE, SIG, MQ, INMQ, RQ, CAUCHY, GTS, ANOVA and FOURIER; the equations themselves are not reproduced here.]
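For reference, several of these kernels can be sketched in Python using their common textbook parameterisations; the exact forms and parameter conventions in Table 1 may differ slightly, so the functions below are illustrative rather than definitive:

```python
import numpy as np

def linear(x, y):
    # Linear (LIN) kernel: inner product of the inputs
    return float(np.dot(x, y))

def rbf(x, y, sigma=1.0):
    # Gaussian radial basis function (RBF) kernel
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def laplacian(x, y, sigma=1.0):
    # Laplacian (LAPLACE) kernel
    return float(np.exp(-np.linalg.norm(x - y) / sigma))

def multiquadric(x, y, b=1.0):
    # Multiquadric (MQ) kernel
    return float(np.sqrt(np.sum((x - y) ** 2) + b ** 2))

def inverse_multiquadric(x, y, b=1.0):
    # Inverse multiquadric (INMQ) kernel
    return 1.0 / multiquadric(x, y, b)

def sigmoid(x, y, b=1.0, c=0.0):
    # Sigmoid (SIG) kernel
    return float(np.tanh(b * np.dot(x, y) + c))

def cauchy(x, y, sigma=1.0):
    # Cauchy (CAUCHY) kernel
    return float(1.0 / (1.0 + np.sum((x - y) ** 2) / sigma ** 2))
```

Each function maps a pair of input vectors to a scalar similarity, which is all the LSSVR dual formulation requires.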

Least square support vector regression
The LSSVR algorithm was developed following [11] and is shown as follows:
Step 1: Consider the SVM model of the form given in Equation (1):
$$y(x) = w^{T}\varphi(x) + B \quad (1)$$
where $\varphi(\cdot)$ represents the mapping into the high-dimensional (possibly infinite-dimensional) feature space, $B$ stands for the bias term, and $w$ stands for the weight vector with $m$ dimensions.
Step 2: Formulate the cost function in Equation (2), which forms the optimisation problem in primal space:
$$\min_{w,B,e} J(w,e) = \frac{1}{2}w^{T}w + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^{2}, \quad \text{s.t. } y_i = w^{T}\varphi(x_i) + B + e_i, \; i = 1, \dots, N \quad (2)$$
where $e_i$ and $\gamma$ stand for the training error of sample $i$ and the regularisation constant, respectively.
Step 3: The optimisation problem in Equation (2) is not directly solvable when $\varphi$ is potentially infinite-dimensional. To solve it, the Lagrangian function expressed in Equation (3) is constructed, and the solutions for $w$ and $e$ are determined using the Lagrange multiplier optimal programming method. The objective function can be attained by converting the constrained problem into an unconstrained problem [12]:
$$L(w, B, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^{T}\varphi(x_i) + B + e_i - y_i \right) \quad (3)$$
where $\alpha_i$ represents the Lagrange multiplier.
Step 4: Setting the partial derivatives of $L$ with respect to $w$, $B$, $e_i$ and $\alpha_i$ to zero gives the optimality conditions in Equation (4), including $w = \sum_{i=1}^{N} \alpha_i \varphi(x_i)$ and $\alpha_i = \gamma e_i$.
Step 5: Eliminating $w$ and $e$ from these conditions yields the linear system in Equation (5):
$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} B \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \quad (5)$$
Step 6: The kernel trick based on Mercer's condition is applied to $\Omega$ as shown in Equation (6):
$$\Omega_{ij} = \varphi(x_i)^{T}\varphi(x_j) = K(x_i, x_j), \quad i, j = 1, \dots, N \quad (6)$$
where $K(\cdot\,,\cdot)$ denotes the kernel function.
Step 7: The LSSVR model is then presented in Equation (7):
$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + B \quad (7)$$
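The derivation above reduces LSSVR training to a single symmetric linear solve plus a kernel expansion for prediction. A minimal Python sketch of this dual formulation (the function names, the RBF kernel choice and the γ value are illustrative, not from the paper) is:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel, used here only as an example kernel
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def lssvr_fit(X, y, kernel, gamma=100.0):
    # Kernel (Gram) matrix Omega_ij = K(x_i, x_j)
    n = len(X)
    Omega = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # KKT linear system: [[0, 1^T], [1, Omega + I/gamma]] [B; alpha] = [0; y]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]          # bias B, Lagrange multipliers alpha

def lssvr_predict(X_train, B, alpha, kernel, X_new):
    # y(x) = sum_i alpha_i * K(x, x_i) + B
    return np.array([B + np.sum([a * kernel(x, xi)
                                 for a, xi in zip(alpha, X_train)])
                     for x in X_new])
```

Note that, unlike SVR, no quadratic program is needed: all the computational cost sits in building the Gram matrix and solving the (n+1)-dimensional linear system.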

Partial least squares regression
PLSR is a multivariate regression technique that is suitable for analysing changes in a large number of highly correlated input variables X and connecting them to a set of output variables Y. PLSR deals with the relationship between X and Y both internally and externally. X is decomposed as follows [13]:
$$X = TP^{T} + E = \sum_{j=1}^{a} t_j p_j^{T} + E$$
where the input matrix is represented by $X \in \mathbb{R}^{n \times m}$, the score matrix by $T \in \mathbb{R}^{n \times a}$, the loading matrix by $P \in \mathbb{R}^{m \times a}$ and the noise matrix by $E \in \mathbb{R}^{n \times m}$; equivalently, $TP^{T}$ is the sum of the products of the score vectors $t_j$ (the $j$-th column of $T$) and the loading vectors $p_j$ (the $j$-th column of $P$). Similarly, Y can be broken down into:
$$Y = UQ^{T} + F = \sum_{j=1}^{a} u_j q_j^{T} + F$$
where the output matrix is denoted by $Y \in \mathbb{R}^{n \times r}$, $U \in \mathbb{R}^{n \times a}$ stands for the score matrix, the loading matrix is $Q \in \mathbb{R}^{r \times a}$, and the noise matrix is $F \in \mathbb{R}^{n \times r}$. The inner relation assumes $u_j \approx b_j t_j$, where $b_j$ stands for the regression coefficient, so that $U = TB$ with the diagonal regression matrix $B \in \mathbb{R}^{a \times a}$. The equation $Y = TBQ^{T} + F$ can then be used to represent the relationship between X and Y.
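For a single output variable, the decomposition above can be computed with the classical NIPALS algorithm. The sketch below is illustrative (it assumes mean-centred data and one output column; the function names are ours, not from the paper):

```python
import numpy as np

def pls1(X, y, a):
    # NIPALS PLS for a single output variable; X and y assumed mean-centred
    X, y = X.copy(), y.copy()
    n, m = X.shape
    W = np.zeros((m, a))   # weight vectors w_j
    P = np.zeros((m, a))   # loading vectors p_j
    b = np.zeros(a)        # inner regression coefficients b_j
    for j in range(a):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weight for component j
        t = X @ w                       # score vector t_j
        p = X.T @ t / (t @ t)           # loading vector p_j
        b[j] = (y @ t) / (t @ t)        # inner relation u_j ~ b_j t_j
        X -= np.outer(t, p)             # deflate X
        y -= b[j] * t                   # deflate y
        W[:, j], P[:, j] = w, p
    return W, P, b

def pls1_predict(X, W, P, b):
    # Reconstruct scores by sequential deflation and accumulate the prediction
    Xr = X.copy()
    y_hat = np.zeros(X.shape[0])
    for j in range(b.size):
        t = Xr @ W[:, j]
        y_hat += b[j] * t
        Xr -= np.outer(t, P[:, j])
    return y_hat
```

With the full number of components, this reduces to ordinary least squares; the benefit of PLSR comes from truncating `a` below the input rank when the inputs are collinear.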

Principal component regression
The goal of PCR is to identify a subset of all components in order to reduce the number of dimensions by decreasing the effective size of the original space. The derived model structure is [14]:
$$X = TP^{T} + E, \qquad Y = TC^{T} + F$$
where $X = [x_1, \dots, x_n]^{T}$ and $Y = [y_1, \dots, y_n]^{T}$ are the measurement matrices built from the input and output variables, respectively. The sizes of the input and output variables are represented by $m$ and $r$, while $n$ represents the number of data samples. $T \in \mathbb{R}^{n \times q}$ represents the principal component matrix, where $q$ stands for the selected number of latent variables, $P \in \mathbb{R}^{m \times q}$ represents the loading matrix, and the regression matrix is represented by $C \in \mathbb{R}^{r \times q}$.
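A minimal Python sketch of this idea (the names are illustrative), computing the q principal components via the SVD and regressing the outputs on the scores:

```python
import numpy as np

def pcr_fit(X, Y, q):
    # Centre the data, extract q principal components, regress Y on the scores
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:q].T                                   # loading matrix, m x q
    T = Xc @ P                                     # score matrix, n x q
    C = np.linalg.lstsq(T, Yc, rcond=None)[0].T    # regression matrix, r x q
    return x_mean, y_mean, P, C

def pcr_predict(X, x_mean, y_mean, P, C):
    # Y_hat = T C^T + mean, with T computed from the fitted loadings
    return (X - x_mean) @ P @ C.T + y_mean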

Data splitting and parameter tuning
From the 1,500 data samples generated, the training and testing data are divided in a ratio of 75%:25% [15]. Hence, Nt is 1,500, whereas the numbers of training and testing data, represented by N1 and N2, are 1,125 and 375, respectively. The LSSVR model regularisation parameter γ and kernel parameter b are tuned using the leave-one-out (LOO) procedure. The upper and lower boundaries of the grid region are set as follows: γ ∈ {2^−5, 2^−3, …, 2^15} and b ∈ {2^−15, 2^−13, …, 2^8} [16]. Searching for the best combination of parameters is crucial at this stage so that the model can make accurate predictions [17].
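The split and search grids described here can be sketched as follows (the random seed is arbitrary; note that the stated upper bound of 2^8 for b does not fall on the step-2 exponent grid, so the sketch stops at the nearest on-grid value, 2^7):

```python
import numpy as np

rng = np.random.default_rng(0)          # seed is illustrative
N_t = 1500                              # total simulated samples
idx = rng.permutation(N_t)
N_1 = int(0.75 * N_t)                   # 1,125 training samples
train_idx, test_idx = idx[:N_1], idx[N_1:]

# Exponential search grids for the LOO tuning stage (step of 2 in the exponent)
gamma_grid = [2.0 ** k for k in range(-5, 16, 2)]   # 2^-5, 2^-3, ..., 2^15
b_grid = [2.0 ** k for k in range(-15, 8, 2)]       # 2^-15, 2^-13, ..., 2^7
```

Each (γ, b) pair on the grid would then be scored by leave-one-out error on the training set, and the pair with the lowest error retained.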

Prediction accuracy measurement
For the measurement of quality prediction, the RMSE and MAE error metrics are used. These two metrics appear in performance evaluations of both machine learning models [18] and non-machine-learning-based models [19]. Given that the model uses two error metrics, and that the metrics often do not rank models identically, the Ea helps determine the best possible model: a lower Ea suggests that the model is more closely related to the dataset's real nature [20]. The formulas of RMSE, MAE and Ea are demonstrated in Equations (11) to (13) ([3], [18], [21]).
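The two base metrics follow their standard definitions; a minimal sketch is given below (the rule combining them into Ea follows [20] and is not reproduced here):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error
    d = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error
    d = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(d)))
```

RMSE penalises large individual deviations more heavily than MAE, which is why the two metrics can rank models differently on the same dataset.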

Specifications of computer configuration
The computer setup required to execute the LSSVR, PLS and PCR models in this study is described in Table 2.

Results and discussions
The industrial-scale penicillin fermentation simulation (IndPenSim), whose data are publicly accessible for download at www.industrialpenicillinsimulation.com and well suited to the development of data analytics, machine learning or artificial intelligence algorithms, is chosen as the case study [22]. A detailed description of the fermentation process for penicillin production can be found in [23]. The input variables and the single output variable employed in this case study are shown in Table 3. The results of the 12 kernels investigated, as well as the PLSR and PCR models, are tabulated in Table 4. The errors are ranked in ascending order of Ea, as it indicates the performance of the model. Of all the kernels investigated, the MQ kernel function gives the lowest RMSE and MAE for both training and testing datasets as well as the lowest Ea, as bolded in Table 4 (RMSE1 = 1.14×10^−5, RMSE2 = 0.0293, MAE1 = 2.24×10^−6, MAE2 = 0.0266 and Ea = 0.0366), when the kernel parameter b is tuned at 2^−3. This demonstrates that MQ has the highest ability to forecast the nonlinear penicillin fermentation process compared to the other models. It is followed by the SIG kernel with RMSE1 = 0.0661, RMSE2 = 0.0684, MAE1 = 0.0559, MAE2 = 0.0579 and Ea = 0.0690, with b tuned at 2^−15. Comparing the results of MQ and SIG with the RBF kernel in terms of Ea reveals that MQ and SIG have improved the results by 63.44% and 31.07%, respectively.
It is interesting to note that the LIN, LAPLACE, RQ, INMQ, GTS, ANOVA and RBF kernels give similar results for all the error metrics, namely RMSE1 = 0.0433, RMSE2 = 0.0887, MAE1 = 0.0366, MAE2 = 0.0822 and Ea = 0.1001. The reason most likely has to do with the size of the feature space after kernel transformation, in which the transformations are virtually identical [24]. Besides, the similar performance of the LIN and RBF kernels shows that the LIN kernel, which is also popular among researchers due to its small computational requirements ([25], [26]), is comparable with the RBF kernel. Based on the aforementioned, it is also noted that many kernel functions have outperformed or matched the commonly used RBF kernel, which reinforces the need for research in LSSVR kernels to allow an appropriate selection of kernels.
Generally, as presented in Table 4, the LSSVR models with appropriate kernels have also outperformed the conventional PLSR and PCR models. However, this is not always true: the LSSVR models equipped with the POLY, FOURIER and CAUCHY kernels perform worse than PLSR and PCR, with Ea values of 0.5610, 1.3887 and 1.5120, respectively. This may be because the parameter ranges used in this study do not minimise the model error for these kernels [27]. Although it can be utilised in the analysis of multidimensional spaces, the CAUCHY kernel is found to be the worst kernel in this application, with the highest Ea of 1.5120, and may not be effective for the prediction of nonlinear processes.
In this case study, PLSR displays a better result than PCR, with a 2.82% improvement in prediction accuracy, probably due to the adverse effect of systematic variation in the input variable matrix that is not related to the output variable matrix. In other words, there exists systematic variation that is not part of the joint correlation structure of the input and output variable matrices, and it negatively affects the model performance [28].
To elucidate this matter, the plots of actual and predicted penicillin concentration (the output variable) from the LSSVR model with selected kernel functions for both training and testing data are presented in Figures 1 and 2. In these two figures, only the selected kernel functions, i.e., RBF, MQ, CAU, POLY, FOU and SIG, together with the PCR and PLSR models, are plotted, since some kernel functions exhibited identical results, as evident in Table 4. From Figure 1, it is worth noting that most LSSVR models with different kernel functions (except for the POLY kernel) can better predict the penicillin concentration for the training dataset compared to the PLSR and PCR models. This is further justified by the small RMSE values in Table 4. However, the POLY kernel shows worse performance in the prediction of the output variable for both training and testing datasets, as can be observed in Figures 1 and 2, although the Ea for the POLY kernel is not the highest among all the kernel functions studied. This may be due to the POLY kernel dealing best with normalised data [24].

For the testing dataset in Figure 2, MQ has the closest output prediction compared to RBF, which strengthens the earlier finding from Table 4 that MQ outperforms the conventional RBF kernel. The POLY kernel once again shows poor prediction, with no particular trend observed, and underpredicts the output variable. As for the FOU kernel, the prediction seems fine within the first 35 samples but worsens thereafter. Hence, these two kernels are not suitable for the prediction of penicillin concentration. Based on this case study, different kernel functions may perform better, as evidenced in this penicillin fermentation process, even though RBF is frequently used in SVM-based models.
This resonates with research suggesting that while RBF kernels are usually applied for data transformation in kernel mapping, there is a certain dataset where RBF is outperformed by other types of kernels [29].

Conclusion
This research has extended the LSSVR model with non-conventional kernels beyond the typical RBF, linear and polynomial kernels. The models are compared on one industrially validated penicillin fermentation simulation. A comparison of the LSSVR model against the PLSR and PCR models is also performed, as a benchmark of machine learning-based soft sensors against conventional soft sensor models.
A comparison of the LSSVR model with different kernels is performed, and it is found that, whilst the MQ and SIG kernels have the potential to outperform conventional and popular kernels such as the LIN, POLY and RBF kernels, the model performance metrics used in this study, RMSE and MAE, generally have high variance and are non-convex in nature. This suggests that while it is possible for the MQ and SIG kernels to surpass the widely used kernels, doing so would also require careful development of a hyperparameter optimisation algorithm. Since the main idea of the kernel function is to optimise the objective function in the LSSVR model, a kernel function with small error is crucial in helping to minimise the cost function. Nevertheless, the widely used RBF kernel still performs properly, although other kernels such as MQ and SIG may perform even better, as in this case study.
In conclusion, the LSSVR model with an appropriate kernel can outperform the conventional PLS and PCR models as well as the LSSVR model with the RBF kernel function, with the MQ and SIG kernels potentially being superior under certain circumstances based on the case study conducted. The finding that LSSVR outperforms the PLS and PCR models may require more case studies in other industrial processes to validate that LSSVR performance generalises well. Besides, the optimal amounts of training and testing samples for ensuring optimum prediction accuracy could also be investigated when choosing a suitable kernel function.