Free and Open Source Chemistry Software in Research of Quantitative Structure-Toxicity Relationship of Pesticides

Pesticides are toxic chemicals intended to destroy pests on crops. Numerous data provide evidence of the toxicity of pesticides to aquatic organisms. Since pesticides with similar properties tend to have similar biological activities, toxicity may be predicted from structure. Their structural features and properties are encoded by means of molecular descriptors. Molecular descriptors can capture anything from quite simple two-dimensional (2D) chemical structures to highly complex three-dimensional (3D) chemical structures. The quantitative structure-toxicity relationship (QSTR) method uses linear regression analysis to correlate the toxicity of chemicals with their structural features by means of molecular descriptors. Molecular descriptors were calculated using the open source software PaDEL and an in-house built PyMOL plugin (PyDescriptor). PyDescriptor is a new script implemented with the commonly used visualization software PyMOL for calculation of a large and diverse set of easily interpretable molecular descriptors encoding pharmacophoric patterns and atomic fragments. PyDescriptor has several advantages: it is free and open source, and it works on all major platforms (Windows, Linux, MacOS). The QSTR method allows prediction of the toxicity of pesticides without experimental assay. In the present work, QSTR analysis of the toxicity of a dataset comprising a mixture of 5 classes of pesticides has been performed.


Introduction
Pesticides are used extensively to control agricultural pests and to improve crop yields. However, a small fraction of the applied pesticides moves from the surface into streams, rivers and lakes as a result of application drift, rainfall runoff, or residue leaching through the soil into groundwater, and this is a cause of considerable environmental concern [1]. The contamination of water by pesticides is increasing around the world, so knowledge of eco-toxicological effects on aquatic organisms is essential for environmental risk assessment.
Before pesticides are registered they must undergo laboratory testing on animals for short-term (acute) and long-term (chronic) health effects. Laboratory animals are purposely fed doses high enough to cause toxic effects. Small planktonic crustaceans (Daphnia), fish, and algae are the organisms most commonly tested for the evaluation of toxic effects of pesticides. The quantitative structure-toxicity relationship (QSTR) method is valuable for reducing expensive and time-consuming experiments and for reducing animal testing [2]. Two-dimensional (2D) and three-dimensional (3D) molecular structure has a considerable influence on properties of pesticides such as absorption, distribution, metabolism, and excretion (ADME). The QSTR method allows prediction of environmental toxicity from the molecular structure and fills an important gap in risk assessment studies (REACH) [3].
The QSTR method involves representations of molecules or molecular patterns in the form of numerical descriptors that capture the structural features and properties of molecules, generally known as molecular descriptors. Molecular descriptors describe: chemical properties (electrophilicity, hydrogen bonding), physicochemical properties (lipophilicity, polar surface area), 2D structure (topological, connectivity, and information indices, 2D frequency fingerprints), and 3D structure (RDF, WHIM, GETAWAY, geometrical descriptors). The correlation between the toxicity of a molecule and its molecular descriptors is most often expressed by a linear equation calculated by multiple linear regression (MLR) or partial least squares (PLS) [4]. Computational neural networks (CNN) are usually applied if a nonlinear and highly complex relationship between the structure and the observed toxicity is assumed [5].
There are many commercial and free academic packages developed for calculation of molecular descriptors. Most molecular descriptors can be calculated using commercial software packages such as CODESSA [6] and DRAGON [7]. The limitations of most of these packages are their high price and calculated molecular descriptors that are hard to interpret in terms of structural features. To overcome this, we have developed PyDescriptor, a new script implemented with the commonly used visualization software PyMOL for calculation of a large and diverse set of easily interpretable 1D- to 3D-descriptors. These descriptors are easy to interpret in terms of structural moieties, applicable for representing local environment or structure, simple to understand, independent of experimental properties, and sensitive to changes in molecular conformation. PyMOL is a free, open source molecular graphics tool for 3D visualization of proteins, small molecules, density, surfaces, and trajectories [8]. PyDescriptor is a useful addition to the currently existing molecular descriptor calculation software. Its advantages are that it is free and open source and that it works on all major platforms (Windows, Linux, MacOS). The script is freely available for academic use [9].
In the present paper we have generated QSTR models, using molecular descriptors calculated by PyDescriptor, for estimation of the toxicity of 43 pesticides to an aquatic vertebrate, the bluegill sunfish (Lepomis macrochirus) [1].
Table 1. Experimentally obtained toxicity endpoints of pesticides for Lepomis macrochirus and values estimated by Eq. (1).

Toxicity data
Toxicity data for an aquatic vertebrate, the bluegill sunfish (Lepomis macrochirus), were retrieved from the literature. The toxicity of the 43 pesticides is expressed as LC50 (the lethal concentration that kills 50 % of the animals in a test population, in mol L⁻¹). LC50 values were converted to logarithmic form (log LC50) (Table 1).
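The conversion of the endpoint can be illustrated with a minimal sketch. It assumes a base-10 logarithm, which the text does not state explicitly, and the concentrations used are hypothetical values rather than entries from Table 1.

```python
import math

def log_lc50(lc50_mol_per_l: float) -> float:
    """Return log10 of an LC50 value given in mol/L (base 10 is an assumption)."""
    if lc50_mol_per_l <= 0:
        raise ValueError("LC50 must be a positive concentration")
    return math.log10(lc50_mol_per_l)

# Hypothetical concentrations in mol/L, not taken from Table 1.
example_lc50 = [1.0e-6, 2.5e-5]
example_log_lc50 = [log_lc50(c) for c in example_lc50]
```

On this scale a lower (more negative) log LC50 corresponds to a more toxic compound, which is the convention needed to read the signs of regression coefficients.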

Calculation of molecular descriptors
Molecular descriptors were calculated using the open source software PaDEL [9] and a new in-house built PyMOL plugin (PyDescriptor) [8], followed by extensive objective and subjective feature selection to remove redundant descriptors.
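The text does not specify how the objective feature selection was carried out. A common approach, shown here only as a hedged sketch, is to drop near-constant descriptors and then one member of each highly inter-correlated pair. The function name, the thresholds (`var_tol`, `r_max`) and the toy descriptor matrix are all assumptions made for illustration.

```python
import numpy as np

def objective_feature_selection(X, names, var_tol=1e-8, r_max=0.95):
    """Drop near-constant columns, then one of each highly correlated pair.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    # Step 1: keep only descriptors that actually vary across compounds.
    varying = [j for j in range(X.shape[1]) if X[:, j].var() > var_tol]
    # Step 2: keep a descriptor only if it is not strongly correlated
    # with a descriptor that has already been kept.
    selected = []
    for j in varying:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < r_max for k in selected):
            selected.append(j)
    return [names[j] for j in selected]

# Toy descriptor matrix (hypothetical values).
d1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
d1_copy = [2.1, 3.9, 6.0, 8.1, 9.9, 12.0]   # almost a multiple of d1
constant = [1.0] * 6                         # carries no information
d2 = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]
X_toy = np.column_stack([d1, d1_copy, constant, d2])
kept = objective_feature_selection(X_toy, ["d1", "d1_copy", "constant", "d2"])
```

Subjective selection, where descriptors are retained because they are chemically interpretable, cannot be automated in this way and is left to the modeller.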

Regression analysis and validation of models
For model building, the dataset was divided into training (80%) and test (20%) sets. The best QSAR models were obtained with the Genetic Algorithm in QSARINS v 2.2 [11].
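The splitting and fitting step can be sketched as follows. This is not the QSARINS workflow: the genetic-algorithm descriptor selection is not reproduced, a random split is assumed because the splitting procedure is not described, and the data are synthetic and noise-free. The helper names are hypothetical.

```python
import numpy as np

def split_train_test(n_compounds, test_fraction=0.2, seed=0):
    """Randomly split compound indices into training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_compounds)
    n_test = int(round(n_compounds * test_fraction))
    return idx[n_test:], idx[:n_test]

def fit_mlr(X, y):
    """Multiple linear regression by least squares; intercept comes first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply fitted coefficients to a descriptor matrix."""
    return coef[0] + np.asarray(X) @ coef[1:]

# Synthetic, noise-free data for 43 compounds and 3 descriptors.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(43, 3))
true_coef = np.array([1.948, -0.588, 1.223, -0.375])  # intercept, then slopes
y_all = true_coef[0] + X_all @ true_coef[1:]

train_idx, test_idx = split_train_test(len(y_all))
coef = fit_mlr(X_all[train_idx], y_all[train_idx])
y_test_pred = predict_mlr(coef, X_all[test_idx])
```

With 43 compounds an 80/20 split rounds to 34 training and 9 test compounds under this sketch.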
The models have been assessed by: fitting criteria; internal cross-validation using the leave-one-out (LOO) method and Y-scrambling; and external validation. Fitting criteria included: the coefficient of determination (R²), adjusted R² (R²_adj), cross-validated R² using the leave-one-out method (Q²_LOO), global correlation among descriptors (Kxx), the difference between the global correlation of the molecular descriptors with the response variable y and the global correlation among descriptors (ΔK), standard deviation of regression (s), and Fisher ratio (F). Internal and external validations also included the following parameters: root-mean-square error of the training set (RMSE_tr), root-mean-square error of the training set determined through the cross-validated LOO method (RMSE_cv), root-mean-square error of the external validation set (RMSE_ex), concordance correlation coefficient of the training set (CCC_tr), of the internal validation set using LOO cross-validation (CCC_cv), and of the external validation set (CCC_ex), mean absolute error of the training set (MAE_tr), of the internal validation set (MAE_cv) and of the external validation set (MAE_ex) [12], and predictive residual sum of squares determined through the cross-validated LOO method in the training set (PRESS_cv) and in the external prediction set (PRESS_ex). The analysed external validation parameters also include the coefficient of determination (R²_ex).
Robustness of the QSAR models was tested by the Y-randomisation test. New parallel models were developed based on fits to randomly reordered Y data (Y-scrambling), and the process was repeated 2000 times [12]. The applicability domain of a prediction model was investigated by a leverage plot or Williams plot (plotting residuals vs. leverage of training compounds). Detection of outliers was carried out using QSARINS for compounds with standardized residuals greater than two standard deviation units.
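Several of the listed statistics can be written down compactly. The sketch below uses the usual textbook definitions (Lin's concordance correlation coefficient, Q²_LOO computed from leave-one-out predictions, Y-scrambling by permuting the response); it is an illustration on synthetic data and is not guaranteed to match the QSARINS implementation in every detail. All function names are hypothetical.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit(X, y):
    """Ordinary least-squares coefficients (intercept first)."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def r2(y, y_hat):
    """1 - SS_res / SS_tot; equals Q2_LOO when y_hat are LOO predictions."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def ccc(y, y_hat):
    """Lin's concordance correlation coefficient."""
    my, mp = np.mean(y), np.mean(y_hat)
    cov = np.mean((y - my) * (y_hat - mp))
    return float(2 * cov / (np.var(y) + np.var(y_hat) + (my - mp) ** 2))

def loo_predictions(X, y):
    """Predict each compound from a model refitted without it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        preds[i] = (design(X[i:i + 1]) @ fit(X[mask], y[mask]))[0]
    return preds

def y_scrambling(X, y, n_iter=2000, seed=0):
    """Mean R2 of models refitted to randomly permuted responses."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_iter):
        y_perm = rng.permutation(y)
        scores.append(r2(y_perm, design(X) @ fit(X, y_perm)))
    return float(np.mean(scores))

# Synthetic training set (34 compounds, 3 descriptors); not the paper's data.
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(34, 3))
y_tr = (2.0 - 0.6 * X_tr[:, 0] + 1.2 * X_tr[:, 1] - 0.4 * X_tr[:, 2]
        + 0.2 * rng.normal(size=34))

y_fit = design(X_tr) @ fit(X_tr, y_tr)
y_loo = loo_predictions(X_tr, y_tr)
r2_tr, q2_loo = r2(y_tr, y_fit), r2(y_tr, y_loo)
r2_yscr = y_scrambling(X_tr, y_tr, n_iter=200)  # fewer iterations for speed
```

For a least-squares model the LOO residuals are never smaller in total than the fitted residuals, so Q²_LOO cannot exceed R² and RMSE_tr cannot exceed RMSE_cv; a large gap between them is the warning sign of overfitting.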

Results and discussion
The best three-descriptor QSTR model for prediction of toxicity to Lepomis macrochirus is:
log LC50 = 1.948 − 0.588 ALogP + 1.223 FP747 − 0.375 fPH3A (1)
The statistical results of the obtained QSTR model are presented in Table 2. The fitting criteria are interpreted as follows: the closer R² is to unity, the closer the calculated values are to the experimental ones. Also, a larger F statistic and a lower standard deviation mean that the model is more significant. In order to avoid overfitting, inter-correlation between the descriptors included in the equation is detected based on Kxx and ΔK. Low Kxx and ΔK ≥ 0.05 imply no chance correlation between descriptors. The minimum acceptable statistical parameters for internal and external validation include the following conditions: R²_ext ≥ 0.60; CCC_ext ≥ 0.85; RMSE_cv and MAE_cv close to zero; and RMSE_tr < RMSE_cv. Robust QSAR models should have low R²_yscr and low Q²_yscr values, with R²_yscr > Q²_yscr. In order to investigate the applicability of a prediction model and detect possible outliers, the applicability domain of the selected model was evaluated by a leverage analysis expressed as a Williams plot, in which residuals and leverage values were plotted. The Williams plot is given in Figure 1. A scatter plot of experimentally obtained toxicity versus values calculated by Eq. (1) is presented in Figure 2.
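The quantities behind a Williams plot can be sketched as below. The warning leverage is taken here as h* = 3(p + 1)/n, the definition commonly used in QSAR work; this is an assumption, as is the training-set size of 34 compounds (80% of 43). Under those assumptions a three-descriptor model gives h* ≈ 0.353. The data are synthetic and the function names hypothetical.

```python
import numpy as np

def leverages(X_train, X_query=None):
    """Diagonal of the hat matrix for query compounds (training set by default)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    core = np.linalg.inv(A.T @ A)
    Q = A if X_query is None else np.column_stack([np.ones(len(X_query)), X_query])
    # h_i = q_i^T (A^T A)^-1 q_i for every row q_i of Q
    return np.einsum("ij,jk,ik->i", Q, core, Q)

def warning_leverage(n_train, n_descriptors):
    """h* = 3(p + 1)/n, a commonly used definition (assumed here)."""
    return 3.0 * (n_descriptors + 1) / n_train

def williams_flags(X_train, y_train, residual_cutoff=2.5):
    """Return boolean arrays: (leverage above h*, |standardized residual| above cutoff)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    resid = y_train - A @ coef
    dof = len(y_train) - A.shape[1]
    s = np.sqrt(np.sum(resid ** 2) / dof)          # standard deviation of regression
    h_star = warning_leverage(len(y_train), X_train.shape[1])
    return leverages(X_train) > h_star, np.abs(resid / s) > residual_cutoff

# Synthetic training set with one structurally extreme compound at index 0.
rng = np.random.default_rng(7)
X_tr = rng.normal(size=(34, 3))
X_tr[0] = [6.0, 6.0, 6.0]
y_tr = 2.0 - 0.6 * X_tr[:, 0] + 1.2 * X_tr[:, 1] - 0.4 * X_tr[:, 2] + 0.2 * rng.normal(size=34)

h = leverages(X_tr)
h_star = warning_leverage(n_train=34, n_descriptors=3)
high_leverage, outlier = williams_flags(X_tr, y_tr)
```

The residual cutoff is left as a parameter because different thresholds (two or 2.5 standard deviation units) are in common use.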
Table 2. Statistical parameters of the obtained QSAR models.
Considering the negative coefficient of ALogP in Eq. (1), highly toxic compounds have high lipophilicity. Highly lipophilic compounds may easily pass through lipid membranes and accumulate in fat tissue, and therefore cause an enhanced toxic effect [13]. The negative coefficient of the PyDescriptor descriptor fPH3A implies that a higher frequency of occurrence of hydrogen within 3 Å of phosphorus increases the toxicity of pesticides. A QSAR study of the toxicity of phosphorhydrazide (PHA) derivatives revealed that the NH-P(X) moiety has a much higher inhibitory activity than the NH-C(X) moiety. The presence of an electron acceptor substituent around the P=X group increases the inhibitory potential of the PHA derivatives [14]. The obtained results are in accordance with previous findings of QSTR modeling of the toxicity of organic molecules to Daphnia magna [4]. The obtained PLS models suggest that higher lipophilicity and electrophilicity, and hydrogen bond donor groups, are responsible for greater toxicity. Figure 3a presents the chemical structure of the most toxic compound (35), terbufos, an aliphatic organothiophosphate insecticide. Thiophosphates are a very toxic class of organophosphorus compounds, especially if they possess reactive functional groups such as methyl, phosphate ester (P=O type) and unsubstituted phenyl groups [15]. A QSTR study of some organophosphorus compounds using quantum chemical and topological descriptors revealed that sulphur atoms in place of oxygen atoms increased toxicity [16].
Figure 3b shows the structure of the least toxic compound (6), imazapyr, an imidazolinone herbicide. Imazapyr does not contain a phosphorus atom. The positive coefficient of the fingerprint descriptor FP747 in Eq. (1) implies that higher values of this descriptor mean lower toxicity.
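The sign arguments above can be checked numerically by evaluating Eq. (1) directly. The coefficients below are those of Eq. (1); the descriptor values passed in are hypothetical and serve only to show the direction of each effect (FP747 is treated as a 0/1 fingerprint bit and fPH3A as a count, which is an assumption about their encoding).

```python
def predict_log_lc50(alogp: float, fp747: float, fph3a: float) -> float:
    """Evaluate Eq. (1): log LC50 = 1.948 - 0.588 ALogP + 1.223 FP747 - 0.375 fPH3A."""
    return 1.948 - 0.588 * alogp + 1.223 * fp747 - 0.375 * fph3a

# Hypothetical descriptor values, not taken from the dataset.
baseline = predict_log_lc50(alogp=2.0, fp747=0, fph3a=0)
more_lipophilic = predict_log_lc50(alogp=3.0, fp747=0, fph3a=0)
with_fp747 = predict_log_lc50(alogp=2.0, fp747=1, fph3a=0)
with_h_near_p = predict_log_lc50(alogp=2.0, fp747=0, fph3a=4)
```

Since a lower log LC50 means a lower lethal concentration, the more lipophilic compound and the compound with hydrogens near phosphorus are predicted to be more toxic than the baseline, while setting FP747 raises log LC50 and so lowers the predicted toxicity.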

Conclusion
In the present work, we have used PyDescriptor, an open source PyMOL plugin, to calculate easily interpretable and informative molecular descriptors. Robust QSTR models with good external predictive ability have been developed for the toxicity of pesticides to a fish, the bluegill sunfish. Since the developed models satisfy the threshold values for many statistical parameters, they could be useful for predicting the experimentally undetermined toxicity of known pesticides, as well as of new pesticides. The model can also be employed to better understand the mechanism of toxicity of the various families of pesticides to aquatic organisms, as well as to identify potential aquatic pollutants.
Our results indicate that future QSTR analyses of pesticides should apply a specific group of descriptors related to lipophilicity and to structural fragments involved in electron transfer.

Fig. 2. A scatter plot of experimentally obtained toxicity versus values calculated by the QSTR model, Eq. (1).
The obtained model has satisfactory fitting parameters and internal validation results, and low collinearity between the three descriptors. The results of Y-scrambling demonstrated that the model was not obtained by chance correlation. Model 1 may be considered predictive due to the high values of R²_ext and CCC_ext, as well as the small differences between RMSE_tr and RMSE_ex, and between MAE_tr and MAE_ex. As can be seen from the Williams plot (Figure 1), the toxicity of pesticide 30 (fonofos) predicted by model (1) must be used with reserve, because its leverage value is greater than the warning leverage (h* = 0.353). Also, the same model has generated one outlier, pesticide 2 (chlorimuron), because its standardized residual is greater than ± 2.5. The best QSTR model obtained includes the following descriptors: lipophilicity (ALogP), PaDEL fingerprint