Prediction model of Populus simonii seedlings based on growth characteristics in China

In this paper, we propose to apply the BP neural network to predict the plant height of Populus simonii seedlings. First, we explore the correlation among the section length variables of Populus simonii seedlings in four growth periods using principal component analysis and hierarchical clustering, which yields 5 principal components. In addition, we use Fuzzy C-Means (FCM) clustering to classify the Populus simonii seedlings, which fall clearly into two subpopulations. Furthermore, we use the BP neural network to establish a seedling height growth model and an aboveground biomass prediction model. In numerical experiments, the prediction accuracy of the seedling height growth models in the four periods reaches about 84.89%. Meanwhile, the prediction accuracies of stem and leaf fresh weight and stem and leaf dry weight are 91.15% and 83.79%, respectively. This paper provides an effective method for studying phenotypic characteristics and predicting the height of Populus simonii seedlings, and supplies a reference for genome-wide association analysis.


Introduction
In recent years, P. simonii has become the main species of protection forest in China owing to its adaptability, drought resistance and cold resistance. It is of great value for windbreaks and sand fixation, civil architecture and papermaking [1]. Therefore, research on the growth characteristics of P. simonii is of practical significance for seedling selection and genome-wide association analysis. Moreover, biomass is not only the energy base and nutrient source of ecosystem operation, but also an important embodiment of ecosystem productivity [2]. It is also an important index of the function of a community or ecosystem [3].
At present, there are many studies on the growth of woody plants. In 2016, Peng et al. showed that the growth rhythm of cuttage seedlings follows an S-shaped curve [4]. In 2000, Tang et al. used the modified Richards model, an exponential function and growth prediction functions to predict the tree height, volume and diameter at breast height (DBH) of P. simonii in the gully rolling region of the Shaanxi Loess Plateau [5]. In 2001, Zhang et al. used a logistic model to explore the growth law of P. simonii [6]. In this paper, we propose to apply the BP neural network to establish a prediction model of plant height from phenotypic variables (leaf number, section number, section length) and an aboveground biomass prediction model for 442 individuals, a core population that represents almost the entire geographic distribution of P. simonii in China. First, we use principal component analysis to obtain 5 principal components of the section length variables of P. simonii seedlings in four growth periods. In addition, we apply FCM clustering to classify the P. simonii seedlings into two clearly separated subpopulations, which show significant differences in phenotypic traits (plant height, leaf number and section number). Furthermore, we take the phenotypic variables (leaf number, section number, section length) as independent variables to build the prediction model of plant height. Meanwhile, the prediction accuracies of stem and leaf fresh weight and stem and leaf dry weight reach about 91.15% and 83.79%, respectively. Our research provides an effective method for studying phenotypic characteristics and predicting the height of P. simonii seedlings, and supplies a reference for genome-wide association analysis.

Plant material
In 2017, 385 individuals with phenotypic and physiological traits, sampled from 16 different localities in China, were studied. The P. simonii seedlings were grown under 16 hours of light and 8 hours of dark in a greenhouse at Beijing Forestry University, Beijing, China (40°0'N, 116°20'E). Phenotypic growth traits, including plant height, leaf number, section number and section length, were recorded about every 20 days. In particular, we recorded the data of 57 individuals with phenotypic and physiological traits. The experimental steps are as follows: 1) cut the corresponding plant tissues, quickly put them into aluminum boxes of known weight and weigh them to obtain the fresh weight (Wf); 2) put the plant tissue together with the aluminum box into an oven preheated to 105 ℃ for 30 min of fixation, then dry it to constant weight at 80 ℃ to obtain the dry weight (Wd). The relative water content of the samples can be obtained from Wf and Wd.
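The text does not give the formula that links water content to Wf and Wd. As a hedged illustration only, one common definition on a fresh weight basis is shown below; the symbol WC and the choice of this particular definition are assumptions, not taken from the paper.

```latex
\mathrm{WC} = \frac{W_f - W_d}{W_f} \times 100\%
```

Here $W_f$ and $W_d$ are the tissue weights after subtracting the known weight of the aluminum box.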

Principal component analysis
To compress the section length variables, we use principal component analysis (PCA) to reduce the variable dimension. PCA transforms the original variables onto new axes, the principal components (PCs), which are orthogonal, so that the transformed variables are uncorrelated with each other. The goal of PCA is to replace a large number of correlated variables with a smaller number of uncorrelated variables while capturing as much of the information in the original variables as possible [10]. In 2000, Destefanis et al. analyzed the first three principal components of beef characteristics, which explained about 63% of the total variability [11]. In this paper, we obtain 5 principal components of the section length variables of P. simonii seedlings in four growth periods, with a cumulative contribution rate of about 80%. Table 1 shows that the section length variables of the four stages are all integrated into five principal components. We combine the section variables with larger loadings into one principal component and thus reduce the original variables to 5 dimensions. The section length variables are thereby roughly divided into five parts from bottom to top; the last two principal components on April 29 and May 20 are more dispersed and less representative, while the other principal components are more evenly distributed.
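The way a number of components is chosen from the cumulative contribution rate can be illustrated with a minimal sketch. The eigenvalues below are made-up placeholders, not the values behind Table 1, and the helper names `contribution_rates` and `components_for_threshold` are hypothetical.

```python
from itertools import accumulate

def contribution_rates(eigenvalues):
    """Share of the total variance explained by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def components_for_threshold(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose cumulative rate reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = list(accumulate(contribution_rates(ordered)))
    for k, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return k, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues of the correlation matrix of ten section length variables.
eigenvalues = [3.0, 2.0, 1.5, 1.0, 0.8, 0.6, 0.5, 0.3, 0.2, 0.1]
k, cumulative = components_for_threshold(eigenvalues)
print(k, round(cumulative[k - 1], 2))
```

With these placeholder eigenvalues the first five components are the smallest set that passes the 80% threshold, which mirrors the kind of cut-off described above.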

Correlation analysis of phenotypic variables of P. simonii seedlings
Furthermore, we explore the relationships among the phenotypic characteristics of P. simonii and the adjacent section length variables using hierarchical clustering (see Figure 1).

Fuzzy c-means clustering
Among the 385 P. simonii samples collected in China, we apply the FCM clustering algorithm and the NbClust package (version 3.0, https://cran.r-project.org/web/packages/NbClust/index.html) in R to cluster the samples based on phenotypic characteristics, in order to analyze the differences among P. simonii seedlings from different regions. The steps of the FCM clustering algorithm are as follows:
1. Determine the optimal clustering number $c$, set the iteration stop threshold $\varepsilon$, and let the iteration counter be $t = 0$.
2. Let $X = (x_1, x_2, \ldots, x_n)$ be the $m \times n$ data matrix, where $m$ is the number of characteristic variables, $n$ is the number of samples and $x_j$ ($j = 1, 2, \ldots, n$) is the $j$th sample. We first determine the optimal clustering number $c$ ($2 \le c \le n$) and then initialize the fuzzy membership matrix $U = (u_{ij})_{c \times n}$, where $u_{ij}$ is the membership degree of sample $x_j$ in the $i$th cluster and satisfies $\sum_{i=1}^{c} u_{ij} = 1$.
3. Obtain the matrix of random initial clustering centers $P = (p_1, p_2, \ldots, p_c)$, where $c$ is the number of clustering centers. The objective function of the clustering analysis is defined as
$$J(U, P) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{b} \, \lVert x_j - p_i \rVert^{2}, \qquad (1)$$
where $b > 1$ is the fuzzy weighting exponent, subject to the additional conditions
$$\sum_{i=1}^{c} u_{ij} = 1, \qquad u_{ij} \in [0, 1].$$
Solving equation (1) under these conditions, we have
$$u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\lVert x_j - p_i \rVert}{\lVert x_j - p_k \rVert} \right)^{2/(b-1)} \right]^{-1}, \qquad p_i = \frac{\sum_{j=1}^{n} u_{ij}^{b} \, x_j}{\sum_{j=1}^{n} u_{ij}^{b}}. \qquad (2)$$
4. Update the fuzzy membership matrix $U(t)$ and the cluster center matrix $P(t)$ according to equation (2).
5. If the change between two successive iterations is smaller than $\varepsilon$, stop the calculation and output the fuzzy membership matrix $U$ and the cluster center matrix $P$. Otherwise, let $t = t + 1$ and return to step 4.
We determine the optimal number of clusters with the NbClust package in R. The results are shown in Figure 2. Seven of the 26 clustering evaluation indexes suggest two clusters ($N_c = 2$). The bar graph is drawn with the NbClust package.
Next, we use the Euclidean distance to perform FCM clustering with a preset maximum number of iterations. Through repeated simulations and calculations, the seedlings of P. simonii are divided into two distinct subpopulations (see Figure 3 and Table 2).
On further analysis, we find that the samples of the first subpopulation are mainly distributed in Shaanxi, Hebei, Liaoning and Qinghai provinces. These provinces are mainly located in the middle and lower reaches of the Yellow River. The samples of the second subpopulation are mainly from Shaanxi, Qinghai and Henan provinces, all located in the upper and middle reaches of the Yellow River. In the four growth periods, the samples of the first subpopulation are characterized by lower plant height (with averages of only 13.37 cm, 20.38 cm, 21.49 cm and 21.87 cm, respectively), fewer leaves and fewer sections. The samples of the second subpopulation have higher plant height (with averages of 18.31 cm, 30.03 cm, 31.02 cm and 32.09 cm, respectively). Concretely, on April 29 there is a difference of about 5 cm in plant height between the two subpopulations. By June 17, this difference reaches 10.22 cm. The differences in leaf number and section number fluctuate between 2 and 4 across the four growth periods. Therefore, the phenotypic characteristics of P. simonii seedlings from different geographical locations are significantly different.
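The clustering procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the fuzzy weighting exponent `b = 2`, the stop rule on the largest membership change, the default `max_iter` and the six toy samples (plant height in cm, leaf number) are all hypothetical.

```python
import math
import random

def fcm(samples, c=2, b=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-Means with Euclidean distance.

    samples: list of equal-length feature lists; c: number of clusters;
    b: fuzzy weighting exponent (> 1); eps: iteration stop threshold.
    Returns (U, P): the c x n membership matrix and the c cluster centers.
    """
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    # Random initial memberships, normalised so each sample's column sums to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= col
    P = []
    for _ in range(max_iter):
        # Cluster centers: membership-weighted means of the samples.
        P = []
        for i in range(c):
            w = [U[i][j] ** b for j in range(n)]
            total = sum(w)
            P.append([sum(w[j] * samples[j][d] for j in range(n)) / total
                      for d in range(dim)])
        # Membership update from the distances to the current centers.
        new_U = [[0.0] * n for _ in range(c)]
        for j in range(n):
            dist = [math.dist(samples[j], P[i]) for i in range(c)]
            zero = [i for i in range(c) if dist[i] == 0.0]
            if zero:
                # The sample coincides with a center: give it full membership there.
                for i in zero:
                    new_U[i][j] = 1.0 / len(zero)
                continue
            for i in range(c):
                new_U[i][j] = 1.0 / sum((dist[i] / dist[k]) ** (2.0 / (b - 1.0))
                                        for k in range(c))
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(c) for j in range(n))
        U = new_U
        if change < eps:
            break
    return U, P

# Hypothetical toy data: [plant height (cm), leaf number] for six seedlings.
data = [[13.0, 8.0], [14.0, 9.0], [12.5, 8.0], [18.5, 11.0], [19.0, 12.0], [18.0, 11.0]]
U, P = fcm(data, c=2)
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(len(data))]
print(labels)
```

Each sample is assigned to the cluster in which its membership is largest; which of the two clusters receives index 0 depends on the random initialisation.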

Prediction model based on BP neural network
In this part, we establish a prediction model based on the BP neural network to study the relationship between plant height and the phenotypic characteristics of P. simonii (leaf number, section number, section length). The steps of the BP neural network algorithm are as follows:
1. Network initialization. Assign each connection weight a random value in the interval $(-1, 1)$. Set the error function $e$, the calculation accuracy $\varepsilon$ and the maximum number of learning iterations $M$.
2. Randomly select the $k$th input sample and the corresponding expected output.
3. Compute the input and output of each neuron in the hidden layer and the output layer, and from them the error term $\delta_o(k)$ of each neuron in the output layer.
4. Use $\delta_o(k)$ of each neuron in the output layer and the output of each neuron in the hidden layer to modify the connection weights $w_{ho}(k)$.
5. Determine whether the network error reaches the preset error value. The algorithm ends when the error reaches the preset accuracy or the number of learning iterations exceeds the preset maximum. Otherwise, select the next learning sample and its corresponding expected output, and return to step 3 for the next round of learning.
Through the above BP neural network algorithm, we obtain good prediction results for the seedling height of P. simonii (see Figure 4), which show that the phenotypic variables of P. simonii seedlings are closely related to their growth. The prediction accuracy over the four stages is 84.89%. Numerical experiments indicate that the BP neural network offers both high accuracy and good stability.
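The training loop described above can be sketched as a small network with one hidden layer. This is an illustration under assumptions rather than the network used in this study: the sigmoid hidden units, the linear output neuron, the learning rate `lr`, the hidden layer size and the synthetic inputs scaled to [0, 1] are all hypothetical choices.

```python
import math
import random

def forward(w_ih, w_ho, x):
    """Forward pass: sigmoid hidden layer, linear output neuron."""
    h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + row[-1])))
         for row in w_ih]
    y = sum(w * hj for w, hj in zip(w_ho, h)) + w_ho[-1]
    return h, y

def train_bp(inputs, targets, hidden=4, lr=0.1, eps=1e-3, max_epochs=2000, seed=0):
    """Train by per-sample error backpropagation; returns weights and final error."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Random weights in (-1, 1); the last entry of each row is a bias term.
    w_ih = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_ho = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    order = list(range(len(inputs)))
    error = float("inf")
    for _ in range(max_epochs):
        rng.shuffle(order)  # visit the learning samples in random order
        for k in order:
            x, d = inputs[k], targets[k]
            h, y = forward(w_ih, w_ho, x)
            # Error terms of the output neuron and of each hidden neuron.
            delta_o = d - y
            delta_h = [delta_o * w_ho[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Modify hidden-to-output and input-to-hidden connection weights.
            for j in range(hidden):
                w_ho[j] += lr * delta_o * h[j]
                for i in range(n_in):
                    w_ih[j][i] += lr * delta_h[j] * x[i]
                w_ih[j][-1] += lr * delta_h[j]
            w_ho[-1] += lr * delta_o
        # Global error over all samples; stop once it reaches the preset accuracy.
        error = sum((d - forward(w_ih, w_ho, x)[1]) ** 2
                    for x, d in zip(inputs, targets)) / (2 * len(inputs))
        if error < eps:
            break
    return w_ih, w_ho, error

# Synthetic example: two scaled phenotypic inputs and a made-up height target.
grid = [0.0, 0.5, 1.0]
inputs = [[a, b] for a in grid for b in grid]
targets = [0.5 * a + 0.3 * b + 0.1 for a, b in inputs]
w_ih, w_ho, error = train_bp(inputs, targets)
print(f"final training error: {error:.5f}")
```

Scaling the inputs before training keeps the sigmoid units away from saturation, which is why the toy inputs lie in [0, 1].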

Aboveground biomass prediction model of Populus simonii seedlings
The physiological characteristics of P. simonii seedlings can be estimated from the observed phenotypic characteristics. We use the BP neural network to predict the fresh and dry weight of stem and leaf (Figures 5 and 6); the numerical results are shown in Table 3. The predictions are evaluated with three common numerical criteria: prediction accuracy, average absolute error and relative error. For the aboveground biomass of poplar seedlings, the prediction accuracies of stem and leaf fresh weight and stem and leaf dry weight are 93.18% and 83.36%, respectively. The average absolute errors are 1.6224 and 0.8116, and the relative errors are 0.0258 and 0.0091, respectively. Such a model can save time and cost in scientific experiments on P. simonii.
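The exact formulas of the evaluation criteria are not reproduced in the text. The sketch below uses common textbook definitions of the mean absolute error and the mean relative error, together with a hypothetical accuracy defined as one minus the mean relative error; this accuracy definition is an assumption and may differ from the one behind Table 3. The observed and predicted values are made up for illustration.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over all samples."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_relative_error(observed, predicted):
    """Average of |observed - predicted| / |observed|; observed values must be nonzero."""
    return sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / len(observed)

def assumed_accuracy(observed, predicted):
    """Hypothetical accuracy: one minus the mean relative error (an assumption)."""
    return 1.0 - mean_relative_error(observed, predicted)

# Made-up fresh weights (g): observed values and model predictions.
observed = [20.0, 25.0, 40.0]
predicted = [22.0, 24.0, 38.0]

mae = mean_absolute_error(observed, predicted)
mre = mean_relative_error(observed, predicted)
acc = assumed_accuracy(observed, predicted)
print(f"MAE={mae:.4f}  MRE={mre:.4f}  accuracy={acc:.4f}")
```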

Discussion
In this paper, we first use principal component analysis to combine the section length variables into five principal components and analyze the correlation among phenotypic variables, then use FCM clustering to divide the P. simonii seedlings into two subpopulations. In addition, for P. simonii seedlings in four periods, non-linear regression models are established to predict plant height, with a prediction accuracy of 84.89%. Finally, we establish a model of the aboveground biomass of P. simonii seedlings. With these models, the plant height of P. simonii in the four periods and the aboveground biomass can be predicted from phenotypic data alone (leaf number, section length and section number). This reveals the growth regularity of P. simonii seedlings, saves time in scientific experiments and provides a reference for the selective breeding of P. simonii seedlings.