Optimizing the prediction accuracy of load-settlement behavior of single pile using a self-learning data mining approach

Pile foundations are usually used when the upper soil layers are soft clay and, hence, unable to support the structures' loads. Piles are needed to carry these loads deep into the hard soil layer. Therefore, the safety and stability of pile-supported structures depend on the behavior of the piles, and an accurate prediction of that behavior is very important to ensure satisfactory performance of the structures. Although many methods in the literature estimate the settlement of piles both theoretically and experimentally, methods for comprehensively predicting the load-settlement behavior of piles are very limited. This study develops a new data mining approach called the self-learning support vector machine (SL-SVM) to predict the load-settlement behavior of single piles. SL-SVM performance is investigated using 446 training data points and 53 test data points of cone penetration test (CPT) data obtained from the previous literature. The prediction accuracy is then compared to that of other prediction methods using three statistical measurements: mean absolute error (MAE), coefficient of correlation (R), and root mean square error (RMSE). The obtained results show that SL-SVM achieves better accuracy than the least squares support vector machine (LS-SVM) and the back-propagation neural network (BPNN). This confirms the capability of the proposed data mining method to accurately model the load-settlement behavior of single piles from CPT data. The paper offers useful insights for geotechnical engineers involved in estimating pile behavior.


Introduction
Pile foundations are usually used to transmit the axial load from upper structures to the hard soil layer. At times, a pile foundation can be more advantageous than a shallow foundation due to the cost-effectiveness of its construction [1][2][3]. One important aspect in the design of a pile foundation is the evaluation of its load-settlement behavior. Poulos and Davis [4] showed that the elastic settlement of the pile makes a major contribution to the total settlement. For piles in sand in particular, the elastic settlement is almost equal to the total settlement. The elastic settlement is usually analyzed using semi-empirical methods.
Although many methods in geotechnical engineering predict pile settlement, theoretical and experimental methods that thoroughly predict the load-settlement behavior of piles are very limited. In civil engineering, data mining techniques have become an important research area. Several studies have shown the advantages of data mining techniques in producing better prediction models than traditional methods [5,6]. Shahin [7] developed an artificial neural network (ANN) model to predict the load-settlement of a steel pile using a recurrent neural network (RNN). This RNN model was calibrated using 23 in situ, full-scale load tests, as well as cone penetration test (CPT) data. Even though the RNN model from Shahin [7] showed good results, it was derived from limited data, i.e. 23 full-scale load tests. In addition, the Shahin model focuses on steel driven piles and has only one input parameter to capture the variation of the soil strength along the pile shaft, i.e. the mean value of the CPT cone resistance, qc.
Lately, the least squares support vector machine (LS-SVM) has become one of the most prominent data mining techniques for solving complex real-world problems [8,9]. Although LS-SVM can produce accurate prediction results, poorly chosen tuning parameters can reduce its accuracy. The objective of this study is to improve the accuracy of the prediction model through parameter optimization. Identifying the optimal parameters is itself an optimization problem. Therefore, recent studies integrate a machine learning technique with a metaheuristic-based optimization tool instead of using a machine learning technique alone [10][11][12][13]. This study introduces a new hybrid data mining model called the self-learning support vector machine (SL-SVM) to accurately predict individual pile behavior from test records. The tests were conducted directly in the field and cover various types of soil, several types of pile, and various geotechnical problems commonly encountered in the field. The hybrid approach used by SL-SVM combines the symbiotic organisms search (SOS) algorithm with LS-SVM. SOS is used to optimize the γ and σ parameters of LS-SVM; LS-SVM then learns the input-output relationship from the dataset and acts as a supervised-learning-based predictor.
In this study, 499 test records were obtained from the previous literature. The proposed SL-SVM model can fully predict the load-settlement behavior of concrete, steel, and composite piles, as well as bored or driven piles. To accurately model the non-uniformity of the soil along the pile shaft, the length of the embedded pile is divided into 5 segments of equal length. In each segment, the mean value of qc and shaft friction of CPT (fs) are calculated.

Regression model: LS-SVM
LS-SVM was first developed by [8] as an improved version of the support vector machine (SVM). As a data mining technique, LS-SVM has been successfully applied to many civil engineering-related problems [14][15][16][17]. LS-SVM utilizes a cost function based on the least squares principle with equality constraints, as opposed to the quadratic programming formulation used in the original SVM [18]. The objective function and constraints for minimizing the cost function of LS-SVM are as follows:
$$\min_{w,b,e} J(w,e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^{2}$$
$$\text{subject to } y_k = w^{T} \varphi(x_k) + b + e_k, \quad k = 1, \dots, N$$
where w is the weight vector, φ(·) is the nonlinear feature mapping, b is the bias term, γ is a regularization constant, e_k denotes the error variable, and x_k and y_k are the input and output data points of the given training dataset of N data points.
For function estimation, the LS-SVM model is expressed by the following equation:
$$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b$$
where K(·,·) is the kernel function and α_k and b represent the solutions to the linear system.
This study employed the radial basis function (RBF) kernel with the following formula:
$$K(x, x_k) = \exp\left(-\frac{\lVert x - x_k \rVert^{2}}{2\sigma^{2}}\right)$$
where σ denotes the kernel function parameter.
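As a minimal sketch of how an LS-SVM regressor with an RBF kernel can be trained, the following pure-Python example builds the standard LS-SVM linear system, in which the first row enforces that the α_k sum to zero and the remaining rows add 1/γ to the diagonal of the kernel matrix, and solves it for b and α_k. The function names (`rbf_kernel`, `solve_linear`, `lssvm_fit`, `lssvm_predict`) are hypothetical, the kernel is assumed to use the 2σ² scaling, and the dense Gaussian elimination is only suitable for small illustrative datasets; it is not the implementation used in this study.

```python
import math

def rbf_kernel(a, b, sigma):
    """RBF kernel, assuming the form exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def lssvm_fit(X, y, gamma, sigma):
    """Return (b, alpha) from the system [[0, 1^T], [1, K + I/gamma]]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k = rbf_kernel(X[i], X[j], sigma)
            row.append(k + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    z = solve_linear(A, [0.0] + list(y))
    return z[0], z[1:]

def lssvm_predict(X_train, alpha, b, sigma, x):
    """Evaluate y(x) = sum_k alpha_k * K(x, x_k) + b."""
    return sum(a * rbf_kernel(x, xk, sigma)
               for a, xk in zip(alpha, X_train)) + b
```

With a large γ the fitted model nearly interpolates the training points, whereas a small γ smooths the fit; this trade-off is why γ and σ are treated as tuning parameters.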

Optimization algorithm: SOS
Initially developed by Cheng and Prayogo [10], the SOS algorithm takes its inspiration from the symbiotic interactions among a group of organisms. Its initial application was to solve continuous optimization problems [10], and it has since been used to solve various problems in multiple disciplines [19][20][21][22][23][24][25][26][27]. SOS utilizes three nature-inspired operators (the mutualism phase, the commensalism phase, and the parasitism phase) to guide the organisms (solutions) toward the region of the global optimum (best solution).
In the "mutualism phase," each organism is modified as follows:
$$new\_O_i = O_i + U(0,1) \times \left(O_{best} - \frac{O_i + O_j}{2} \times \left(1 + \mathrm{round}(U(0,1))\right)\right)$$
$$new\_O_j = O_j + U(0,1) \times \left(O_{best} - \frac{O_i + O_j}{2} \times \left(1 + \mathrm{round}(U(0,1))\right)\right)$$
where O_i and O_j denote the i-th and j-th organism vectors, respectively, such that i ≠ j; U(0,1) denotes uniform random numbers between 0 and 1; O_best represents the best organism; and new_O_i and new_O_j are the candidate solutions generated after O_i and O_j interact.
In the "commensalism phase," each organism is modified as follows:
$$new\_O_i = O_i + U(-1,1) \times \left(O_{best} - O_j\right)$$
where U(-1,1) denotes uniform random numbers between -1 and 1.
In the "parasitism phase," each organism is modified as follows:
$$O_{par} = F \times O_i + (1 - F) \times \left(U(0,1) \times (ub - lb) + lb\right)$$
where O_par denotes the parasite that attempts to eliminate the host O_j; ub and lb represent the upper and lower bounds of the given problem, respectively; and F and (1 - F) are the binary random matrix and its inverse, respectively.
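The three phases can be sketched as a single loop. The example below is an illustrative pure-Python version under several assumptions that are not stated in the text: random numbers are drawn independently per dimension, candidates are clipped to the bounds, each candidate replaces its organism only when it has a lower fitness (greedy selection), and each dimension of the parasite is copied from O_i with probability 0.5. The function name `sos_minimize` and its parameters are hypothetical.

```python
import random

def sos_minimize(fitness, lb, ub, eco_size=20, iterations=50, seed=0):
    """Minimize `fitness` over the box [lb, ub] with a basic SOS loop."""
    rng = random.Random(seed)

    def clip(vec):
        # Keep candidates inside the search bounds (an added safeguard).
        return [min(max(v, lo), hi) for v, lo, hi in zip(vec, lb, ub)]

    def random_point():
        return [lo + rng.random() * (hi - lo) for lo, hi in zip(lb, ub)]

    def other_than(i):
        # Pick a random organism index j with j != i.
        j = rng.randrange(eco_size - 1)
        return j + 1 if j >= i else j

    eco = [random_point() for _ in range(eco_size)]
    fit = [fitness(o) for o in eco]

    def accept(idx, cand):
        # Greedy selection: keep the candidate only if it is fitter.
        f = fitness(cand)
        if f < fit[idx]:
            eco[idx], fit[idx] = cand, f

    for _ in range(iterations):
        for i in range(eco_size):
            # Mutualism phase: O_i and O_j both move toward O_best.
            j = other_than(i)
            best = eco[fit.index(min(fit))]
            mutual = [(a + b) / 2.0 for a, b in zip(eco[i], eco[j])]
            bf_i = 1 + round(rng.random())  # benefit factor, 1 or 2
            bf_j = 1 + round(rng.random())
            new_i = [a + rng.random() * (g - m * bf_i)
                     for a, g, m in zip(eco[i], best, mutual)]
            new_j = [a + rng.random() * (g - m * bf_j)
                     for a, g, m in zip(eco[j], best, mutual)]
            accept(i, clip(new_i))
            accept(j, clip(new_j))

            # Commensalism phase: O_i benefits from O_j, O_j is unaffected.
            j = other_than(i)
            best = eco[fit.index(min(fit))]
            new_i = [a + rng.uniform(-1.0, 1.0) * (g - b)
                     for a, g, b in zip(eco[i], best, eco[j])]
            accept(i, clip(new_i))

            # Parasitism phase: a mutated copy of O_i tries to replace O_j.
            j = other_than(i)
            fresh = random_point()
            parasite = [a if rng.random() < 0.5 else r
                        for a, r in zip(eco[i], fresh)]
            accept(j, parasite)

    k = fit.index(min(fit))
    return eco[k], fit[k]
```

For instance, `sos_minimize(lambda v: sum(x * x for x in v), [-5.0, -5.0], [5.0, 5.0])` searches for the minimum of a two-dimensional sphere function; in SL-SVM the fitness would instead be a validation error computed from a trained LS-SVM.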

SL-SVM system integration
In this study, two artificial intelligence (AI) techniques, SOS and LS-SVM, are combined to form a new hybrid data mining technique called SL-SVM. LS-SVM plays the key role of predictor, mapping the relationship between the input and output variables of a given dataset. SOS is utilized to find the most suitable LS-SVM parameters, γ and σ. The architecture of SL-SVM is shown in Fig. 1. The six main steps of SL-SVM, carried out across the training and test phases, are delineated below:
(1) Dataset: The dataset is grouped into a training set and a test set. Furthermore, the data are scaled into a (0,1) range [28] to prevent one or more input variables from dominating the others.
(2) Hyperparameters' initialization: In the first iteration, the parameters are randomly initialized within the boundary range using the following formula:
$$O_i = lb + U(0,1) \times (ub - lb)$$
(3) Model selection: This is a critical step in building an accurate learning model. Using the initial hyperparameters and the training set, the LS-SVM model is trained to capture the relationship between the input and output variables. The training process is iterative, and the LS-SVM tuning parameters are gradually optimized by the SOS algorithm. A fitness function that reflects the accuracy of the prediction model is used to evaluate the learning system. k-fold cross-validation, a well-known sampling technique, is incorporated in the fitness function. The dataset is divided into k folds, in which (k - 1)/k of the data is assigned to training and the remaining 1/k is assigned to validating the trained model.
Thus, k sets of training and validation subsets are formed and used for model selection. To measure the model accuracy, the root mean square error (RMSE) is selected as the fitness function, as shown in the following equation:
$$fitness = \frac{1}{S} \sum_{s=1}^{S} fit\_val_s$$
where fit_val_s is the fitness value calculated from the RMSE between the predicted output and actual output of the s-th validation subset and S is the total number of folds.
(4) SOS for parameter search: To identify the best set of hyperparameters, the hybrid AI system utilizes SOS to explore various combinations of γ and σ. The search process commences with the generation of the initial population, which serves as the initial set of candidate hyperparameters. In each iteration, SOS uses the mutualism, commensalism, and parasitism phases to gradually improve the fitness value of every candidate solution in the population.
(5) Optimal hyperparameters: When the stopping criterion is met, the loop stops. This implies that the prediction model has identified the input-output mapping relationship with the optimal γ and σ parameters.
(6) LS-SVM predicting: To predict the test set, the prediction model must first be established. The optimal LS-SVM parameters γ and σ obtained in the training phase are used to establish this prediction model.
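Steps (1) and (3) can be illustrated with a short sketch of the scaling and of the cross-validated fitness that the optimizer would minimize. The helper names (`min_max_scale`, `kfold_indices`, `cv_fitness`, `mean_predictor`) are hypothetical; min-max scaling is assumed as the scaling rule, folds are formed by interleaving indices rather than shuffling, and `mean_predictor` is only a stand-in for a trained LS-SVM so that the example runs on its own.

```python
import math

def min_max_scale(rows):
    """Scale each column to [0, 1] (assumed min-max scaling)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def kfold_indices(n, k):
    """Split indices 0..n-1 into k interleaved validation folds."""
    return [list(range(start, n, k)) for start in range(k)]

def rmse(actual, predicted):
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def cv_fitness(fit_predict, X, y, k=3):
    """Average validation RMSE over k folds (the fitness to minimize)."""
    scores = []
    for val_idx in kfold_indices(len(X), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in val_idx])
        scores.append(rmse([y[i] for i in val_idx], preds))
    return sum(scores) / len(scores)

def mean_predictor(X_train, y_train, X_val):
    """Stand-in model: always predicts the mean of the training outputs."""
    mean_y = sum(y_train) / len(y_train)
    return [mean_y] * len(X_val)
```

In the hybrid system, `fit_predict` would train an LS-SVM with a candidate (γ, σ) pair on the training folds and return its predictions for the validation fold, so that each organism in the SOS population receives one fitness value per evaluation.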

Data preprocessing
Four fundamental parameters are used in many established methods to predict the load-settlement behavior of a single pile. These main parameters are the geometry of the pile, the material properties of the pile, the soil properties, and the load applied to the pile. In addition to the main parameters, there are several extra parameters, such as the pile installation method, the load test type, and whether the pile tip is open or closed. The geometry of the pile, the material properties of the pile, and the load applied to the pile are easy to identify and quantify. However, soil properties are difficult to identify and quantify. In this study, the dataset is derived from load tests comprising 499 data points, obtained from Pooya Nejad and Jaksa [29]. In the literature, CPT is used to identify and quantify soil properties. The 499 data points are divided into 446 training data points and 53 test data points. To accurately model the non-uniformity of the soil along the pile shaft, the length of the embedded pile is divided into 5 segments of equal length. In each segment, the mean values of qc and fs are calculated. Finally, the attributes of the dataset are shown in Table 1 alongside the statistical description of the dataset.
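The segment averaging described above can be sketched as follows. This is an illustrative example only: the function name `segment_means` is hypothetical, depths are assumed to be measured from the top of the embedded length, readings are assumed to be evenly spaced so that a simple arithmetic mean is adequate, and readings below the pile tip are ignored.

```python
def segment_means(depths, values, embedded_length, n_segments=5):
    """Mean of CPT readings (e.g. qc or fs) in equal-length pile segments."""
    seg_len = embedded_length / n_segments
    sums = [0.0] * n_segments
    counts = [0] * n_segments
    for z, v in zip(depths, values):
        if 0.0 <= z <= embedded_length:
            # A reading exactly at the pile tip belongs to the last segment.
            idx = min(int(z / seg_len), n_segments - 1)
            sums[idx] += v
            counts[idx] += 1
    # Segments without any reading are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Applying the function once to the qc profile and once to the fs profile yields ten soil-related input values per pile, five for each quantity.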

Model selection and training results
This study implements the parameter setting of SOS as follows: ecosystem size = 50 and total iterations = 30.
The search range for the tuning parameters, γ and σ, was between 10⁻⁵ and 10⁵. Cross-validation was used to balance the training and validation data points; 3-fold cross-validation was chosen to obtain a 2:1 splitting ratio between training and validation. SOS was then performed for model selection using the 3 sets of training and validation data subsets. The fitness value was determined as the average validation error in the model selection. The model performance in the training process is shown in Fig. 2. The optimal hyperparameters found by SOS were as follows: final γ = 28.9507 and final σ = 0.0547, with a fitness value of 10.5514 mm.

Prediction results
The accuracy of the training and test results between the predicted output (y') and actual output (y) of n data points can be compared using three metrics: correlation coefficient (R), root mean square error (RMSE), and mean absolute error (MAE). Each metric can be expressed as shown in Table 2. The developed SL-SVM was validated and compared to other predictive models, including the original LS-SVM and the back-propagation neural network (BPNN). The comparison between SL-SVM and the other predictive algorithms may indicate the advantages of using the optimization method to tune the parameters. The BPNN settings were: learning rate = 1, number of hidden layers = 1, and number of neurons in the hidden layer = 21 (following the total number of input variables). Finally, the LS-SVM parameters γ and σ were both set to 1, as suggested in [8].
The experimental results of the proposed method and the other prediction methods are shown in Table 3. The SL-SVM model outperformed LS-SVM and BPNN in all performance metrics, producing the best values of R, RMSE, and MAE. Meanwhile, Fig. 3 further illustrates the actual and predicted settlement of the developed model in both the training and test datasets.
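For reference, the three metrics can be computed as in the sketch below. The function names are hypothetical, and R is assumed here to be the Pearson correlation coefficient between the actual and predicted outputs, which is the usual definition but is not spelled out in the text above.

```python
import math

def mae(y, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y))

def correlation_r(y, y_pred):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_y) * (p - mean_p) for a, p in zip(y, y_pred))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_y * var_p)
```

Lower MAE and RMSE and a value of R closer to 1 indicate better agreement between predicted and measured settlement.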

Conclusions
In this study, we propose an automatic-tuning data mining technique called the self-learning support vector machine (SL-SVM) to predict the settlement of a single pile.