Study of quantitative structure-property relationship for density of ionic liquids based on Monte Carlo optimization

Ionic liquids (ILs) have attracted increasing interest and found growing application owing to their unique physicochemical properties. Density is a vital physical property of ILs. In this work, density data were collected for 184 different ILs. A quantitative structure-property relationship (QSPR) study was carried out on the selected density data with the CORAL software, using the simplified molecular input line entry specification (SMILES) to represent the molecular structure of the ILs. QSPR models were constructed with the balance of correlations (BC) and the classic scheme. Results from three random splits gave satisfactory models for predicting the external test set, with the correlation coefficient (R2) and the cross-validated correlation coefficient (Q2) in the ranges 0.8234–0.9770 and 0.7599–0.9745, respectively. The best predictions were obtained by the balance of correlations with the global SMILES descriptors included in the modeling process. The average statistical characteristics of the external test set are the following: n = 36, R2 = 0.9770, Q2 = 0.9745, standard error of estimation (s) = 0.023, mean absolute error (MAE) = 0.018 and Fisher F-ratio (F) = 1443.


Introduction
Ionic liquids (ILs) are composed of an organic cation and an organic or inorganic anion and have melting points below 100 ℃. Because of their unique physical and chemical properties, such as negligible vapor pressure [1][2], nonflammability [3][4][5], high ionic conductivity [6][7][8], a large liquid range, and good ability to dissolve organic and inorganic materials [9][10], ILs have been intensively researched and successfully applied in many fields during the past decades [11][12][13][14][15][16][17][18]. Moreover, the wide range of possible anion and cation combinations can lead to a great number of ILs, which are also regarded as task-specific materials. For this reason, the development of these ILs is increasingly being pursued by researchers around the world.
The density of ILs is a fundamental physical property required in many chemical engineering design problems, process calculations, analysis and materials science. For example, an appropriate density value is needed for process piping, transfer rates, equipment sizing and other unit processes in the chemical, oil, clay and food industries. However, because of the very large number of possible ILs, experimental density data are still scarce and are available only for a few classes of commonly studied ILs. More density data and a deeper understanding of this property are needed for designing new processes or developing task-specific ionic liquids.
For this purpose, methods for predicting the density of ILs are desirable. Quantitative structure-property relationship (QSPR) modeling is a robust and useful computational tool for predicting the properties of homologous series [19]. Various QSPR models have been proposed for predicting the density of ILs. For example, Palomar et al. [20][21] predicted the density of imidazolium-based ionic liquids by combining molecular descriptors with an artificial neural network (ANN) algorithm. Trohalaki et al. [22] employed electrostatic, quantum mechanical and thermodynamic descriptors obtained from the CODESSA software to derive a QSPR model for the density of triazolium bromides. Keshavarz et al. [23] introduced two simple correlations to predict the density of a wide range of ionic liquids based on the size, elemental composition and type of the cations and anions. Valderrama et al. [24] used an ANN and the group contribution method (GCM) to estimate the density of 103 ionic liquids. Shen et al. [25] combined the GCM with the Patel-Teja equation of state to estimate the density of ionic liquids at ambient temperature and atmospheric pressure. El-Harbawi et al. [26] developed a QSPR model based on the combination of multiple linear regression (MLR) and a polynomial equation, using the same descriptors reported by Shen et al. The QSPR models mentioned above rely on relatively complex molecular descriptors and algorithms to correlate and estimate the density of ILs. In this work, a new model is proposed based on the simplified molecular input line entry specification (SMILES) descriptor.
The experimental density data at 298.15 K and 100 kPa for 184 pure ILs were collected from the IL ThermoDatabase [27]. The studied ILs cover a wide range of cations, including imidazolium, pyridinium, pyrrolidinium, piperidinium, guanidinium, phosphonium and ammonium, and different anions such as hexafluorophosphate, tetrafluoroborate, halide, bis(trifluoromethylsulfonyl)imide, bis(perfluoroethylsulfonyl)imide, carboxylic acid, alkyl sulfate, dialkylphosphate, trifluoromethyl sulfonate, trifluoroacetate, alkoxy-alkylsulfates, dicyanamide, tricyanomethanide, tris(trifluoromethylsulfonyl)methide, amino acid, etc.

Method
SMILES has been employed as an alternative descriptor in QSPR models [28][29]. The SMILES of the 184 ionic liquids were obtained from the free software ChemSketch [30]. The CORAL software generated the SMILES-based optimal descriptor, DCW(Threshold, Nepoch), which is calculated from the correlation weights (CW) of SMILES attributes. In its definition, Sk is one SMILES character (e.g., C, N, O) or two characters that cannot be separated (e.g., Cl, Fe, Br); SSk is a combination of two SMILES characters; and SSSk is a combination of three SMILES characters. The coefficients α, β, γ, x, y, z and w can be 1 or 0 [31], which switches the corresponding attribute on or off. NOSP stands for the chemical elements nitrogen, oxygen, sulfur and phosphorus; HALO stands for the halogen elements except iodine; BOND is a mathematical function for the presence or absence of double, triple or stereochemical bonds. The numerical values of CW(Sk), CW(SSk) and CW(SSSk) are obtained by the Monte Carlo method.
Nepoch and Threshold are parameters of the Monte Carlo optimization, and different combinations of their values produce different versions of the SMILES-based optimal descriptor. The best model was obtained with the combination of Threshold and Nepoch that gave the best correlation coefficient for the external validation set. The correlation coefficient between density and DCW(Threshold, Nepoch) is a mathematical function of the correlation weights of the SMILES attributes. The predictive model for density was then built from DCW(Threshold, Nepoch) (equation 2).
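To illustrate how such a descriptor is assembled, the following minimal Python sketch splits a SMILES string into Sk, SSk and SSSk attributes, sums their correlation weights, and converts the result to a density with a one-variable linear model. It is a sketch under stated assumptions, not the CORAL implementation: the function names, the weights and the coefficients c0 and c1 are hypothetical; the linear form density = c0 + c1 × DCW is assumed; attributes without a weight are assumed to contribute zero; and the global attributes (NOSP, HALO, BOND) are omitted.

```python
import re

# Two-character element symbols are kept together as a single attribute.
# Only the examples named in the text (Cl, Br, Fe) are handled here.
TOKEN_PATTERN = re.compile(r"Cl|Br|Fe|.")


def smiles_attributes(smiles):
    """Return the lists of Sk, SSk and SSSk attributes of a SMILES string."""
    s = TOKEN_PATTERN.findall(smiles)
    ss = [a + b for a, b in zip(s, s[1:])]
    sss = [a + b + c for a, b, c in zip(s, s[1:], s[2:])]
    return s, ss, sss


def dcw(smiles, cw, alpha=1, beta=1, gamma=0):
    """Sum the correlation weights of the attributes that are switched on.

    cw maps an attribute to its (hypothetical) correlation weight.
    Attributes missing from cw are assumed to contribute 0.0.
    """
    s, ss, sss = smiles_attributes(smiles)
    total = 0.0
    for flag, attrs in ((alpha, s), (beta, ss), (gamma, sss)):
        if flag:
            total += sum(cw.get(a, 0.0) for a in attrs)
    return total


def predict_density(smiles, cw, c0, c1, **flags):
    """Assumed linear model: density = c0 + c1 * DCW."""
    return c0 + c1 * dcw(smiles, cw, **flags)


# Hypothetical weights for a toy fragment, not fitted values.
example_cw = {"C": 0.5, "Cl": 1.25, "CCl": -0.25}
example_dcw = dcw("CCl", example_cw)  # 0.5 + 1.25 - 0.25
```

Setting alpha, beta or gamma to 0 removes the corresponding attribute class from the sum, which mirrors the role of the 0/1 coefficients described above.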
There are two basic modeling methods in the CORAL software. The first is the classic scheme, based on splitting all collected ILs into a training set and a validation set. The second is the balance of correlations, based on splitting into three sets: training, calibration and validation. The training set is the core of the model: its task is to calculate correlation weights that give as large a correlation as possible between the experimental and predicted density for the training set. The calibration set detects the beginning of over-fitting. The validation set is the final estimator of the predictive QSPR model. In this paper, the calculations for the 184 IL density data were performed in two versions for three random splits. Version 1 was based on the balance of correlations and version 2 on the classic scheme.
In the statistical characteristics of the models, Nact is the number of SMILES attributes classified as not rare; N is the number of ILs in a set; R2 is the correlation coefficient; s is the standard error of estimation; F is the Fisher F-ratio; the subscripts t, c and v denote the training, calibration and validation sets, respectively; and the R2m metric should be greater than 0.5 [32].
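The roles of the three sets can be sketched in a few lines of Python. The code below is a simplified stand-in for the Monte Carlo optimization, not CORAL's actual algorithm or target function: weights are perturbed at random, a change is kept only if it does not lower the training-set correlation, and the weights from the epoch with the highest calibration-set correlation are returned. All function names, the step size and the starting weights are assumptions made for illustration.

```python
import random


def split_three(items, n_cal, n_val, seed):
    """Randomly split items into training, calibration and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    validation = shuffled[:n_val]
    calibration = shuffled[n_val:n_val + n_cal]
    training = shuffled[n_val + n_cal:]
    return training, calibration, validation


def pearson_r(xs, ys):
    """Pearson correlation coefficient; 0.0 if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def set_correlation(data, cw):
    """Correlation between density and DCW for (attributes, density) pairs."""
    dcws = [sum(cw.get(a, 0.0) for a in attrs) for attrs, _ in data]
    return pearson_r(dcws, [density for _, density in data])


def optimise_weights(train, calib, attributes, n_epochs, seed=0, step=0.1):
    """Toy Monte Carlo search for correlation weights (illustrative only)."""
    rng = random.Random(seed)
    cw = {a: rng.uniform(-1.0, 1.0) for a in attributes}
    best_cal, best_cw = set_correlation(calib, cw), dict(cw)
    for _ in range(n_epochs):
        for a in attributes:
            old, before = cw[a], set_correlation(train, cw)
            cw[a] = old + rng.uniform(-step, step)
            if set_correlation(train, cw) < before:
                cw[a] = old  # reject a change that lowers training correlation
        cal = set_correlation(calib, cw)
        if cal > best_cal:  # calibration set guards against over-fitting
            best_cal, best_cw = cal, dict(cw)
    return best_cw, best_cal
```

Calling `split_three` with three different seeds gives three random splits; the validation set takes no part in the optimization and is used only to judge the final model.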

Results and discussion
Optimal descriptors were calculated from SMILES. The balance of correlations (version 1) and the classic scheme (version 2) were applied in the Monte Carlo optimization. The statistical characteristics of the models for the two versions are shown in Table 1. The values were obtained in three probes of the Monte Carlo optimization for all splits in both versions. Table 1 shows that the prediction with CORAL is acceptable, with good statistical quality for all splits. The best model was the one with a threshold of 3 for split 2 in version 1. Figure 1 displays the best models for all splits graphically and shows excellent agreement between the experimental density and the density predicted by the best models.
With the best values of the threshold (T) and Nepoch for all splits in both versions, the following QSPR models were obtained for the calculated density of ILs according to equation 2.
QSPR mathematical model of density for Split 1 in version 1:
In equations 4 to 8, R2 is the correlation coefficient; Q2 is the leave-one-out cross-validated correlation coefficient; s is the standard error of estimation; MAE is the mean absolute error; and F is the Fisher F-ratio. From the statistical point of view, Q2 can be used to determine the predictive ability of the model: a higher value of Q2 means better model prediction. Table 2 displays the statistical criteria of predictability of the best models for all splits in the two versions. According to the criteria in [32][33], the predictability of the best models for all splits is good and robust. The experimental and calculated density values of the 184 ILs for the best model, namely split 2 in version 1, are shown in Table S1.
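For readers who wish to reproduce such statistics, the sketch below computes R2, leave-one-out Q2, s, MAE and F for a one-variable least-squares model of density against a descriptor. It uses common textbook definitions, which are assumed here and may differ in detail from the conventions used by CORAL; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = c0 + c1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    c1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - c1 * mx, c1


def f_ratio(r2, n):
    """Fisher F-ratio of a one-variable regression with n data points."""
    return r2 / (1.0 - r2) * (n - 2)


def regression_statistics(x, y):
    """Return R2, Q2 (leave-one-out), s, MAE and F for y regressed on x."""
    x, y = list(x), list(y)
    n = len(x)
    c0, c1 = fit_line(x, y)
    residuals = [yi - (c0 + c1 * xi) for xi, yi in zip(x, y)]
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum(r * r for r in residuals)
    r2 = 1.0 - ss_res / ss_tot

    # Leave-one-out: refit without point i, then predict point i.
    press = 0.0
    for i in range(n):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    q2 = 1.0 - press / ss_tot

    s = (ss_res / (n - 2)) ** 0.5          # standard error of estimation
    mae = sum(abs(r) for r in residuals) / n
    f = float("inf") if ss_res == 0 else f_ratio(r2, n)
    return {"R2": r2, "Q2": q2, "s": s, "MAE": mae, "F": f}
```

As a rough consistency check under these definitions, `f_ratio(0.9770, 36)` gives about 1444, close to the reported F = 1443; the small difference is plausibly due to rounding of R2.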

Conclusion
The density of ionic liquids is an essential physical property in industrial process design involving ionic liquids. In this paper, CORAL software was first applied to build QSPR models for predicting the density of ionic liquids. A density data set of 184 pure ILs at ambient temperature and pressure was used for developing and validating the models with the two versions built into the CORAL software. The results show that the proposed QSPR models are robust and have good predictive ability when the global SMILES descriptors are used. The balance of correlations gives higher accuracy than the classic scheme. The average statistical characteristics of the external test set for the best QSPR model are the following: n = 36, R2 = 0.9770, Q2 = 0.9745, standard error of estimation (s) = 0.023, mean absolute error (MAE) = 0.018 and Fisher F-ratio (F) = 1443.
This work was supported by the National Natural