Data-driven modeling and optimization of an industrial phosphoric acid production unit

In this work, a supervised machine learning (ML) multi-output regression approach is investigated to build predictive models for an industrial phosphoric acid production unit. More specifically, multi-output data-driven regression is applied to simultaneously estimate nine outputs (reactor temperature, chemical yield (RC), P2O5 concentration in the phosphoric acid, and chemical losses in gypsum) under different operating conditions. Two methods are presented: linear regression and decision tree regression. Decision tree regression is more accurate than linear regression: it reaches a high coefficient of determination (R² = 0.994 on the testing set, which was not used for modeling) and low values of the mean squared error (MSE) and mean absolute error (MAE). The best decision tree parameters provide higher fitness values than other depth levels. The optimal values in the training stage are 0.002, 0.007, and 0.994 for MSE, MAE, and R², respectively. Decision tree regression can therefore model the data of the phosphoric acid manufacturing unit with a satisfactory fitness criterion, yielding conclusions on the process that are coherent with phenomenological models, as well as supplementary and novel insights.

that can rapidly predict the quality of phosphoric acid and gypsum, the temperature of the reaction, and the yield of the process. The objective is to produce the purest possible phosphoric acid, to minimize the P2O5 losses in gypsum, and to increase the yield of the unit [3,5]. The modeling of the industrial data, especially those from the attack-filtration section, is used to achieve these goals. It is worth noting that evaluating the effects of the inputs of the attack-filtration section is complicated, since it involves a wide range of chemical features. At this level, machine learning (ML) approaches can be useful for analyzing data and providing essential decision-support tools. Predictive ML models are based on knowledge extracted from a training data set, which typically consists of continuous values; once fitted, they can be used to predict output values for new input data. On the other hand, industrial units still need chemical surveillance and monitoring through phosphoric acid and gypsum analysis. With ML, phosphoric acid and gypsum quality data collected autonomously, for example by deployed sensors, can be integrated to predict the status of the chemical product, allowing stakeholders to act accordingly and, for example, prevent a low phosphoric acid concentration or address the causes of its decrease in near real time.
The regression uses many parameters such as chemical composition of products, temperature, and the yield of the process. The process is sensitive to several parameters and an ML-assisted decision process would improve the decision on parameters that do not always show coherence. In this study, we adopted a supervised ML methodology and tested multi-output regression to build predictive models for the chemical composition of products (ACP, gypsum), temperature, and yield of the process.
To our knowledge, multi-output estimation using linear regression and decision trees has not previously been addressed in research works, and many of the insights of this model are not reported in the literature. The most important contributions of the paper are as follows:
- It demonstrates the use of multi-output regression to estimate the chemical composition of phosphoric acid and gypsum and the yield of the process.
- The approach used produces an accurate estimate in a short amount of time.
- The model provides insight into the important parameters for controlling the performance of the process.
The remaining part of the paper is structured as follows: the dataset for the case study, as well as the framework and implementation of the approach, are presented in Section 2. The experimental design is described in Section 3. The numerical results with various parameter settings are shown in Section 4. Finally, the conclusion provides some insights and suggestions for future work.

Experimental data
The raw data were collected during the first half of 2021 through daily analyses at the OCP phosphoric acid production site of Jorf Lasfar in Morocco. A large database of 5325 instances with 43 features was obtained. These features characterize the chemical composition of the input phosphate rock, the phosphoric acid product, and the gypsum. In addition to the solid content, the density, and the particle size distribution of the phosphate rock, the hourly measurements from the sensors located in the attack-filtration section of the production line are included. They consist of the flow rates of phosphate pulp, sulfuric acid, water, recycled phosphoric acid, produced phosphoric acid, etc. The dataset is split randomly into two parts: 70% for the training stage and 30% for the testing stage. Tables 1 to 3 show thirty-four features of the industrial data with their value ranges, and Table 4 shows the nine targets to predict.
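The random 70%/30% split described above can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `train_test_split_indices`, the seed, and the choice to round the training size down are assumptions, not details taken from the study.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle row indices and cut them into a training and a testing part.

    The seed and the rounding rule (round down) are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible random order
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


# 5325 is the number of instances reported for the dataset.
train_idx, test_idx = train_test_split_indices(5325)
print(len(train_idx), len(test_idx))
```

Splitting indices rather than rows keeps the inputs and the nine targets aligned when both are later selected with the same index lists.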

Modeling methodology
Predictive modeling aims to find good rules for predicting the values of some variables in a dataset (outputs) from the values of the other variables in the dataset (inputs). Multi-variate linear regression and multi-target regression tree models are the most frequently used predictive modeling methods [4]. The algorithms developed in these modeling techniques arise from methodological research in various disciplines, including statistics and machine learning. These two techniques are applied to analyze the data described in Section 2.1.

Multi-variate linear regression (MLR)
Multi-variate linear regression (MLR) is one of the most widely used regression methods for predictive modeling. It is a statistical technique that predicts the value of an output variable from several explanatory variables. The objective of MLR is to establish a linear relationship between the independent variables x and a dependent variable y. The basic model of MLR is expressed as:

$$Y = XB + E$$

where $Y$ is the matrix of outputs, $X$ the matrix of inputs, $B$ the coefficients matrix, and $E$ the matrix of residuals. The equation to determine the coefficients matrix is given as:

$$\hat{B} = (X^{\top}X)^{-1}X^{\top}Y$$

The popularity of the regression model may be attributed to the interpretability of the model parameters and its ease of use.
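As a minimal sketch of how a multi-output linear model can be fitted, the following standard-library Python code solves the ordinary least squares normal equations, (XᵀX) B = Xᵀ Y, for all outputs at once. The helper names (`fit_mlr`, `predict_mlr`, `solve`) and the toy data are hypothetical. The intercept is handled by prepending a column of ones, which is an assumption about the model form rather than a detail stated in the text.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_mlr(X, Y):
    """Return the coefficient matrix B (first row = intercepts)."""
    Xb = [[1.0] + list(row) for row in X]  # add intercept column
    Xt = transpose(Xb)
    return solve(matmul(Xt, Xb), matmul(Xt, Y))


def predict_mlr(B, X):
    return matmul([[1.0] + list(row) for row in X], B)


# Toy data: two inputs, two outputs that are exactly linear in the inputs.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y = [[1.0, -1.0], [3.0, -0.5], [4.0, -2.0], [6.0, -1.5], [8.0, -1.0]]
B = fit_mlr(X, Y)
```

Each column of `B` holds the coefficients of one output, so the outputs share the same design matrix but are otherwise fitted independently.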

Multi-target regression tree
The multi-target regression tree, an empirical decision tree modeling technique, is a segmentation of the data produced by applying a set of straightforward binary rules. These models generate a set of rules that can be used for prediction through a repetitive splitting process [6]. Multi-target regression trees have two main advantages over building a separate regression tree for each target.
First, a single multi-target regression tree is usually much smaller than the total size of the individual single-target trees for all variables, and second, a multi-target regression tree better identifies the dependencies between the different target variables. Another advantage of the regression tree model is the direct interpretability of its binary rules, which allows straightforward access to modeling insights about the process and its phenomenology. On the other hand, this kind of model may be limited in terms of fitting performance (i.e., R²) with respect to more complex regression models such as random forests and deep neural networks. However, in order to favor interpretability over model accuracy, we choose to adopt a multi-target regression tree model in this work.
In multi-target regression trees, a regression model is fitted to the target variables using each of the independent variables. The dataset is split at several candidate values of each independent variable. At each split value, the algorithm calculates the error between the predicted values and the actual values according to the predefined fitness function. The split point errors across the variables are compared, and the variable and value yielding the lowest fitness function value are chosen as the split point. This process continues recursively.
A major advantage of the decision tree over other modeling techniques is that it produces a model which may represent interpretable rules or logic statements.
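The recursive splitting procedure described above can be illustrated with a short standard-library Python sketch. It assumes that the fitness function is the sum of squared errors around the per-target mean, summed over all targets, and that candidate thresholds are midpoints between consecutive distinct feature values. Both are common choices, but they are assumptions here, as are the names `best_split`, `build_tree`, and `predict`.

```python
def sse(targets):
    """Squared error of each target around its mean, summed over all targets."""
    if not targets:
        return 0.0
    total = 0.0
    for t in range(len(targets[0])):
        values = [row[t] for row in targets]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


def best_split(X, Y):
    """Return (feature, threshold, error) with the lowest error, or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for x, y in zip(X, Y) if x[j] <= thr]
            right = [y for x, y in zip(X, Y) if x[j] > thr]
            err = sse(left) + sse(right)
            if best is None or err < best[2]:
                best = (j, thr, err)
    return best


def build_tree(X, Y, max_depth):
    """Split recursively until max_depth is reached or no split is possible."""
    split = best_split(X, Y) if max_depth > 0 and len(X) > 1 else None
    if split is None:
        n = len(Y)
        return {"leaf": [sum(y[t] for y in Y) / n for t in range(len(Y[0]))]}
    j, thr, _ = split
    left = [(x, y) for x, y in zip(X, Y) if x[j] <= thr]
    right = [(x, y) for x, y in zip(X, Y) if x[j] > thr]
    return {
        "feature": j,
        "threshold": thr,
        "left": build_tree([x for x, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right], max_depth - 1),
    }


def predict(tree, x):
    """Follow the binary rules down to a leaf; the leaf stores all target means."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Toy data: two inputs, two targets that change together when feature 0 exceeds 2.5.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 1.0], [4.0, 1.0]]
Y = [[0.0, 10.0], [0.0, 10.0], [1.0, 20.0], [1.0, 20.0]]
tree = build_tree(X, Y, max_depth=2)
```

Because each leaf stores one mean per target, a single set of binary rules predicts all targets at once, which is what makes the resulting tree compact and directly readable.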

Model selection criteria
Evaluating the error of the predictions from a regression model helps to determine the best model. In this study, the most popular metrics are applied, i.e., MAE, MSE, and R², defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

with $n$ the number of points, $y_i$ the data points, $\hat{y}_i$ the predicted points, and $\bar{y}$ the mean of the data points. In our work, many model architectures were tested in terms of tree depth and other numerical parameters. To choose the optimal parameters, the R² metric is used, with the grid search methodology described in the next subsection. The adopted model, along with the obtained results and error metrics, is presented in Section 4.
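For concreteness, the three metrics can be computed as in the following standard-library Python sketch, written for a single output using the usual textbook definitions. The function names and the toy values are illustrative. For the nine-output case the scores would have to be aggregated across outputs, for example by averaging, which is an assumption and not a detail given in the text.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the study.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

MAE and MSE are lower-is-better, whereas R² is higher-is-better, with 1 corresponding to a perfect fit.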

Grid search and cross-validation technique
Grid search is one of the methods used in machine learning to fine-tune the hyperparameters and identify the optimal ones for the model. This approach uses a brute-force search over all possible combinations of the parameter values of interest and tries to find the best combination. Cross-validation is one of the data splitting methods used to evaluate the generalization performance of the model. In cross-validation, the data are split into multiple folds; each fold is held out in turn for validation while a model is trained on the remaining folds. Thus, it allows the dataset to be fully utilized and used effectively for training and testing. Grid search and cross-validation can be used together to find the best parameters for an ML model with stable and accurate predictions [7].
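The combination of the two techniques can be sketched generically in standard-library Python. In this illustration, `k_fold_indices` and `grid_search_cv` are hypothetical helper names, and `score_fold` is a user-supplied callable that fits a model on the training indices and returns a validation score where higher is better. The toy model at the end (a shrunken mean predictor with a made-up `shrink` hyperparameter) only exists to make the example runnable.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); every sample is validated exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def grid_search_cv(param_grid, score_fold, n_samples, k):
    """Score every parameter combination by its mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fold(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy problem: predict shrink * mean(training targets); score = negative MSE.
ys = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0, 9.0, 11.0]


def score_fold(params, train_idx, val_idx):
    mean_train = sum(ys[i] for i in train_idx) / len(train_idx)
    prediction = params["shrink"] * mean_train
    return -sum((ys[i] - prediction) ** 2 for i in val_idx) / len(val_idx)


best_params, best_score = grid_search_cv(
    {"shrink": [0.0, 0.5, 1.0]}, score_fold, n_samples=len(ys), k=5
)
```

The cost grows as the number of parameter combinations times the number of folds, which is why the grid is usually kept to a handful of values per hyperparameter.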

Experimental design
This section outlines the experimental methodology used for our empirical assessment. The process involves the steps displayed in Fig. 2. First, the acquired data undergo a pre-processing phase. In this study, min-max scaling is used as the feature scaling technique: it scales the continuous input features of the dataset to a range of 0 to 1. Feature scaling is crucial for data that will be used to train the MLR and the multi-target regression trees (MRTs). After the pre-processing phase, the data are used to train the MLR and MRTs with a grid search cross-validation technique. Examples of the structure of the data are presented in Fig. 3 to illustrate its complexity. After fitting to the training data, the models are evaluated with different metrics to choose the best model.
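Min-max scaling maps each feature x to (x - min) / (max - min). The sketch below, in standard-library Python, assumes that the minimum and maximum are computed on the training data only and then reused for new data. This is a common practice, but it is an assumption here, as are the function names and the toy values.

```python
def fit_min_max(rows):
    """Return the per-column minimum and maximum of a list of feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each column with (x - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy feature rows, not data from the plant.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max([[2.5, 35.0]], mins, maxs)
```

With this convention, training features fall in [0, 1], while new data can fall slightly outside that range when they exceed the training extremes.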

Numerical results
This study uses the Anaconda software to build, train, and evaluate the machine learning models for predicting the nine outputs of the attack-filtration process listed in Table 4. The MLR and MRT methods are used. Cross-validation is applied to the training data to avoid overfitting. It divides the training dataset into ten subsets. The validation procedure holds out one of the ten subsets, and the model is trained on the remaining subsets using the fitness function. The procedure is repeated ten times, once for each held-out subset. The hyperparameter grid ("max_depth": [5, 15, 20, 30], "max_features": ["auto", "log2", "sqrt"]) was then explored using grid search to find the best parameters for our decision tree model.
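A setup of this kind could look like the following scikit-learn sketch. It is not the code used in the study: the synthetic data, the pipeline layout, the random seeds, and the R² scoring choice are all assumptions made so that the example runs on its own. Recent scikit-learn releases no longer accept `"auto"` for `max_features`; for `DecisionTreeRegressor` that option meant using all features, so `None` is substituted here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the plant data: 34 inputs, 9 outputs (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 34))
Y = np.column_stack([np.sin(3 * X[:, j]) + X[:, j + 1] ** 2 for j in range(9)])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Scaling inside the pipeline is refitted on each training fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

param_grid = {
    "tree__max_depth": [5, 15, 20, 30],
    "tree__max_features": [None, "log2", "sqrt"],  # None replaces "auto"
}

# 10-fold cross-validation over the 12 parameter combinations.
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="r2")
search.fit(X_train, Y_train)

best_params = search.best_params_
Y_pred = search.predict(X_test)          # one column per output
test_r2 = search.score(X_test, Y_test)   # R² averaged over the outputs
```

After fitting, `GridSearchCV` refits the best configuration on the whole training set, so `search` can be used directly for prediction on the held-out test set.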
The metrics of MLR model (
The importance of the sulfuric acid flow rate is consistent with previous phenomenological modeling works [2,3]. The particle size distribution is also an influential parameter in the industrial plant at the dissolution step [8]. Moreover, the model clearly showed that the role of large particles is much more important than that of fine particles (e.g., features 19 and 26). The large effect of K2O is an interesting and novel modeling insight that needs further analysis at the phenomenological level, with models that consider the chemical impurities [9]. The least important features are the cadmium content, "Cd ppm" (feature 4), and the fine particle rate, "sup_80µm" (feature 27). Decadmiation of the produced phosphoric acid is an important industrial concern because of environmental and regulatory constraints. Our analysis shows that Cd species are rather inert in the manufacture of phosphoric acid by the wet process, which means that decadmiation would not have a significant effect (neither positive nor negative) on performance. This may also explain why decadmiation by chemical means is a challenging task for the wet process. The low effect of fine particles clarifies the observations at the industrial plant on the effects of granulometry, in the sense that only the distribution at high particle sizes plays a significant role.

Conclusion
In this work, we have successfully modeled chemical process data using MRTs for an industrial unit of phosphoric acid manufacturing. The results obtained indicate that the MRT model, with a maximum depth of 20 for the decision tree, can predict the composition and losses of P2O5 in the phosphoric acid and gypsum products, as well as the chemical yield and temperature of the industrial unit. The MRT models perform much better than linear regression. This justifies the use of such highly nonlinear and complex models.
From the analysis of the model, the most important features for controlling the process are the flow rate of sulfuric acid, the distribution of large particles, and the K2O content of the rock. This is coherent with historical observations at the industrial level and with phenomenological models [2,3,5]. The effects of the K2O and Cd contents of the rock are interesting insights to be analyzed further at the thermodynamic level. The exploitation of these results and insights for the optimization of the process will be part of future work.