Mid-long term runoff forecasting model based on RS-RVM

Mid-long term hydrological runoff forecasting faces two key problems: the selection of key forecasting factors and the construction of forecasting models. Taking Danjiangkou Reservoir as an example, this paper first identifies candidate sea-air physical factors such as atmospheric circulation, sea surface temperature and the Southern Oscillation. Rough set theory is then used to establish the data decision table and reduce the factors, and the relevance vector machine method is adopted to establish a mid-long term runoff forecasting model based on the reduced factor set. The model simulates and predicts the reservoir runoff in September and October, during the autumn floods, from 1952 to 2008, and is compared with a model based on the support vector machine. The results show that the relevance vector machine has better robustness and generalization performance. Under the standard of 20% annual variation, the simulation accuracy for September and October reaches 93.9% and 95.9%, respectively, and all trial forecasts are up to standard. Moreover, the model better reflects the high-flow and low-flow characteristics of the forecast years.


Introduction
High-precision mid-long term hydrological forecasting plays an important role in reservoir safety and flood control, effective drought relief, scientific scheduling of water resources, and improving the stability and efficiency of hydropower generation. It is also an indispensable component of efficient reservoir operation. Mid-long term runoff changes are complex and are driven by many physical factors, including atmospheric circulation, the ocean, the underlying surface, astronomical and geophysical factors, and human activities. [1][2][3][4] Owing to the limitations of current science and technology, the specific physical mechanisms by which these factors affect the hydrological process are still unclear, and establishing a complete physically driven process model will take time. Therefore, starting from the physical causes that affect the long-term runoff change process in the basin, finding the key predictors that effectively reflect mid-long term runoff changes, and then developing a prediction method that fully expresses the relationship between the predictors and mid-long term runoff, are the keys to improving the accuracy of mid-long term hydrological forecasts.
At present, mid-long term hydrological forecasting generally selects factors by the linear correlation coefficient between the factors and the forecast object. However, the relationship between the forecasting factors and the forecast object may be nonlinear, and the forecasting factors themselves may be approximately linearly related, that is, multicollinear. This makes the extraction of forecast information imperfect, which in turn leads to unstable forecast results or deviations in forecast accuracy. Commonly used mid-long term hydrological forecasting techniques rely on statistical methods such as multiple regression, autoregression, and time series analysis. [5][6][7][8] These techniques assume linear changes in hydrological processes. In fact, the mid-long term hydrological process is highly dynamic and highly nonlinear. If it is treated only as a linear or approximately linear problem, forecast bias will inevitably occur.
To improve the accuracy of mid-long term hydrological forecasting, this paper addresses two aspects, the selection of key forecasting factors and the construction of forecasting models, and uses rough set (RS) theory and the relevance vector machine (RVM) method from statistical learning theory to mine key forecast information. The new approach establishes a hydrological forecasting model that can fully absorb the information of multiple forecasting factors and adaptively eliminate information redundancy. Finally, a case study of runoff forecasting for the Danjiangkou Reservoir during the autumn floods is carried out.

Study area
This paper takes the Danjiangkou Reservoir as an example and forecasts the reservoir runoff in September and October during the autumn floods.
As the source of the South-to-North Water Diversion Project, Danjiangkou Reservoir is located in the well-known autumn rain area of western China. It has a total capacity of 30 billion cubic meters, a usable storage of 16.36 to 19 billion cubic meters and a controlled basin area of 95,217 square kilometers, which is 60% of the catchment area of the Hanjiang River Basin (Figure 1).

Data used
The data used in this paper mainly include inflow runoff data and climate hydrological data of Danjiangkou Reservoir, as shown below.
(2) Global 500 hPa monthly mean height reanalysis data provided by the US National Centers for Environmental Prediction (NCEP) and the US National Center for Atmospheric Research (NCAR). Period: 1948-2008; spatial resolution: 2.5°×2.5°.

Rough set method (RS)
Rough set theory (RST) is a data analysis tool proposed by the Polish scientist Pawlak in 1982 to deal with fuzzy and uncertain information. [9,10] It has gradually attracted the attention of scholars all over the world since 1990 and has become one of the most active research fields in information science. RST can not only effectively analyze and deal with inaccurate, inconsistent and otherwise incomplete information, but also discover hidden knowledge and reveal potential laws. This paper mainly uses the concepts of approximation quality and reduction in rough set theory to identify key factors among massive hydrometeorological factors. The approximation quality $r_P(X)$ gives a contradictory measure of the subset of selected data sets. If $r_P(X) = 0$, then the knowledge $X$ is completely independent of $P$.
When an attribute is removed from a specified set of conditional attributes, the importance of the attribute can be defined by calculating the change in the dependency. For an attribute $p \in P$, the attribute importance $\mathrm{sgf}(p, Q)$ is:

$$\mathrm{sgf}(p, Q) = r_P(Q) - r_{P \setminus \{p\}}(Q) \tag{2}$$

The greater the dependency change, the more important $p$ is. Therefore, attribute selection refers to excluding attributes that have no significant impact on the current pattern classification task.
Commonly used attribute reduction algorithms include the discernibility matrix algorithm, the quick reduction algorithm, the attribute reduction algorithm, and the genetic algorithm.
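The dependency and attribute importance described above can be illustrated with a minimal Python sketch on a symbolic decision table. The toy table, its attribute names (`circulation`, `sst`, `runoff`) and the function names are hypothetical and are not taken from the paper's data; dependency is computed in the classical Pawlak sense, as the share of objects in the positive region.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, cond_attrs, decision):
    """Share of objects whose equivalence class has a single decision value."""
    positive = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def significance(rows, cond_attrs, decision, attr):
    """Drop in dependency when attr is removed from the condition attributes."""
    reduced = [a for a in cond_attrs if a != attr]
    return dependency(rows, cond_attrs, decision) - dependency(rows, reduced, decision)


# Hypothetical toy decision table (illustrative values only).
rows = [
    {"circulation": "high", "sst": "warm", "runoff": "wet"},
    {"circulation": "high", "sst": "cold", "runoff": "wet"},
    {"circulation": "low", "sst": "warm", "runoff": "dry"},
    {"circulation": "low", "sst": "cold", "runoff": "dry"},
]
conditions = ["circulation", "sst"]

full_dependency = dependency(rows, conditions, "runoff")
importance = {a: significance(rows, conditions, "runoff", a) for a in conditions}
print(full_dependency, importance)
```

In this toy table, removing `circulation` destroys the classification while removing `sst` changes nothing, so only `circulation` would be retained by a reduction.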

Relevance vector machine method (RVM)
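A brief introduction to the theoretical basis of RVM for regression is provided in this section. A more detailed description is available in the papers by Tipping. [12,13] The idea of learning machines was first proposed by Turing (1950). Vapnik (1995) discussed the features of learning machines and proposed the Support Vector Machine (SVM) based on statistical learning. [14] Tipping (2000) put forward a sparse Bayesian learning model similar in form to SVM. [15] However, it can achieve more accurate prediction while using dramatically fewer basis functions than SVM. [16] When applied to regression prediction, it can also output the distribution function of the predicted variable, because its training takes place in a Bayesian probabilistic framework.

Given a data set $\{x_n, t_n\}_{n=1}^{N}$ ($x_n$ is the input vector, $t_n$ is the independent target value and $N$ is the total number of data patterns), the output for RVM is as follows:

$$t_n = y(x_n; w) + \varepsilon_n, \qquad y(x; w) = \sum_{i=1}^{N} w_i K(x, x_i) + w_0$$

where $\varepsilon_n$ are independent samples from some noise process which is further assumed to be mean-zero Gaussian with variance $\sigma^2$. The likelihood of the complete data set can be written as:

$$p(t \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\lVert t - \Phi w\rVert^2\right\} \tag{5}$$

where $t = (t_1, \dots, t_N)^T$, $w = (w_0, \dots, w_N)^T$ and $\Phi$ is the $N \times (N+1)$ design matrix whose $n$-th row is $\phi(x_n)^T = [1, K(x_n, x_1), \dots, K(x_n, x_N)]$. With many parameters in the model, we would expect maximum likelihood estimation of $w$ and $\sigma^2$ from (5) to lead to severe over-fitting. To avoid this, we impose some additional constraint on the parameters. We encode a preference for smoother (less complex) functions by making the popular choice of a zero-mean Gaussian prior distribution over $w$:

$$p(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$$

where $\alpha$ is a vector of $N+1$ hyperparameters.

Having defined the prior, Bayesian inference proceeds by computing, from Bayes' rule, the posterior over all unknowns given the data:

$$p(w, \alpha, \sigma^2 \mid t) = \frac{p(t \mid w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2)}{p(t)}$$

Then, given a new test point $x_*$, predictions are made for the corresponding target $t_*$ in terms of the predictive distribution:

$$p(t_* \mid t) = \int p(t_* \mid w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2 \mid t)\, dw\, d\alpha\, d\sigma^2$$

After finding the most probable hyperparameters $\alpha_{MP}$ and $\sigma^2_{MP}$, we can compute the predictive distribution:

$$p(t_* \mid t, \alpha_{MP}, \sigma^2_{MP}) = \int p(t_* \mid w, \sigma^2_{MP})\, p(w \mid t, \alpha_{MP}, \sigma^2_{MP})\, dw$$

Since both terms in the integrand are Gaussian, this is readily computed, giving:

$$p(t_* \mid t, \alpha_{MP}, \sigma^2_{MP}) = \mathcal{N}(t_* \mid y_*, \sigma_*^2), \qquad y_* = \mu^T \phi(x_*), \qquad \sigma_*^2 = \sigma^2_{MP} + \phi(x_*)^T \Sigma\, \phi(x_*)$$

where $\mu$ and $\Sigma$ are the mean and covariance of the posterior over the weights. So the predictive mean is intuitively $y(x_*; \mu)$, or the basis functions weighted by the posterior mean weights, many of which will typically be zero. The predictive variance (or 'error-bars') comprises the sum of two variance components: the estimated noise on the data and that due to the uncertainty in the prediction of the weights.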
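The following Python sketch illustrates how such a sparse Bayesian regression can be trained and used for prediction. It is not the authors' implementation: it assumes the iterative re-estimation of the hyperparameters described in Tipping's work (posterior covariance $\Sigma = (\sigma^{-2}\Phi^T\Phi + A)^{-1}$ with $A = \mathrm{diag}(\alpha)$, posterior mean $\mu = \sigma^{-2}\Sigma\Phi^T t$, then $\alpha_i \leftarrow \gamma_i/\mu_i^2$ with $\gamma_i = 1 - \alpha_i\Sigma_{ii}$), a Gaussian kernel of the form $\exp(-\lVert x - x_i\rVert^2 / \text{width}^2)$, and a pruning threshold `alpha_cap`. The function names, the noisy sinc test data and all numeric settings are hypothetical.

```python
import numpy as np


def design_matrix(X, centers, width):
    """Bias column plus Gaussian kernels exp(-||x - x_i||^2 / width^2) (assumed form)."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-sq_dist / width**2)])


def posterior(Phi, t, alpha, sigma2):
    """Gaussian posterior over the weights for fixed hyperparameters."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma


def fit_rvm(Phi, t, n_iter=300, alpha_cap=1e6):
    """Re-estimate alpha and sigma^2, pruning basis functions whose alpha diverges."""
    n = len(t)
    keep = np.arange(Phi.shape[1])      # indices of basis functions still in the model
    alpha = np.ones(Phi.shape[1])       # one hyperparameter per kept weight
    sigma2 = 0.1 * np.var(t)            # assumed starting guess for the noise variance
    for _ in range(n_iter):
        P = Phi[:, keep]
        mu, Sigma = posterior(P, t, alpha, sigma2)
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = np.where(gamma > 0, gamma / np.maximum(mu**2, 1e-300), np.inf)
        resid = t - P @ mu
        sigma2 = resid @ resid / max(n - gamma.sum(), 1e-9)
        mask = alpha < alpha_cap
        keep, alpha = keep[mask], alpha[mask]
    mu, Sigma = posterior(Phi[:, keep], t, alpha, sigma2)
    return keep, mu, Sigma, sigma2


def predict(Phi_new, keep, mu, Sigma, sigma2):
    """Predictive mean and variance (noise plus weight uncertainty)."""
    P = Phi_new[:, keep]
    mean = P @ mu
    var = sigma2 + np.einsum("ij,jk,ik->i", P, Sigma, P)
    return mean, var


# Hypothetical test problem: noisy samples of sin(x)/x.
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 100)[:, None]
truth = np.sinc(x[:, 0] / np.pi)
t = truth + rng.normal(0.0, 0.1, size=100)

Phi = design_matrix(x, x, width=2.0)
keep, mu, Sigma, sigma2 = fit_rvm(Phi, t)
mean, var = predict(Phi, keep, mu, Sigma, sigma2)
print(len(keep), "basis functions kept out of", Phi.shape[1])
```

The indices in `keep` that remain after pruning correspond to the relevance vectors (plus, possibly, the bias term), and `var` provides the error bars discussed above.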

Research on mid-long term hydrological forecasting model based on RS-RVM
This paper proposes a method that combines RS with RVM. RS serves as a front-end system that pre-processes the input data, and RVM then makes predictions from the information retained by RS. The information prediction system based on RS-RVM is shown in Figure 2.

Reduction and determination of predictors
On the basis of preliminarily identifying predictors, this paper used rough set theory to establish the data decision table and reduce the factors.
Rough set theory generally requires the values in a decision table to be symbolic. However, the forecasting factors and runoff data used in this paper are all numerical, so the data need to be preprocessed first. When a conventional discretization algorithm, such as equal width or equal frequency, is used to convert a numerical attribute into a symbolic attribute, information loss is inevitable, and the result of the computation depends heavily on the quality of the discretization. To solve this problem, this paper uses the neighborhood relation model to granulate the data of each factor. Given a non-empty finite set $U = \{x_1, x_2, \dots, x_n\}$ on a real space, for any object $x_i \in U$, its $\delta$-neighborhood is:

$$\delta(x_i) = \{x \mid x \in U,\ \Delta(x, x_i) \le \delta\}$$

where $\Delta$ is a distance function. The value range of $\delta$ is between 0.05 and 0.5, and $\delta = 0.125$ is used in this paper. The 39 factors for September and the 44 factors for October are granulated, and a decision table for each monthly forecast is formed from the granulated data.
If the number of attributes in a decision table is $N$, finding all reductions of the decision system requires testing $2^N - 1$ subsets of attributes. When the number of attributes is large, the amount of calculation is intolerable. For this reason, a forward greedy algorithm based on attribute importance is used to reduce the granulated decision table. The steps of the numerical attribute reduction algorithm based on the neighborhood rough set model are as follows:

Input: decision table matrix and neighborhood radius; output: attribute reduction table and importance.
Step 1: Granulate the decision table.
Step 2: Initialize the reduction matrix.
Step 3: Calculate the importance of all remaining attributes.
Step 4: Select the attribute with the highest importance value and add it to the reduction matrix.
Step 5: If the dependency value of the reduction matrix does not change after the new attribute is added, proceed to step 6; otherwise, go to step 3.
Step 6: The program ends.

According to the calculation, the number of factors after reduction for the runoff forecasts in September and October is seven and nine, respectively. The basic forecasting factors for September and October are shown in Tables 1 and 2.
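The greedy reduction steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code. It assumes the condition attributes are already scaled to [0, 1], that the decision attribute has been turned into class labels, that the distance function is Euclidean, and that the dependency of an empty attribute set is taken as 0. Unlike Step 5, the sketch checks the gain before adding an attribute, so a non-improving attribute is never included. The toy data and function names are hypothetical.

```python
import numpy as np


def neighborhood_dependency(X, y, attrs, delta):
    """Share of samples whose delta-neighborhood on attrs holds one decision class."""
    sub = X[:, attrs]
    dist = np.sqrt(((sub[:, None, :] - sub[None, :, :]) ** 2).sum(axis=2))
    consistent = 0
    for i in range(len(y)):
        neighbors = dist[i] <= delta
        if np.all(y[neighbors] == y[i]):
            consistent += 1
    return consistent / len(y)


def forward_greedy_reduct(X, y, delta=0.125, tol=1e-12):
    """Add the most important attribute until dependency stops increasing."""
    remaining = list(range(X.shape[1]))
    reduct, importance = [], []
    current = 0.0  # assumed dependency of the empty attribute set
    while remaining:
        gains = [
            neighborhood_dependency(X, y, reduct + [a], delta) - current
            for a in remaining
        ]
        best = int(np.argmax(gains))
        if gains[best] <= tol:
            break
        reduct.append(remaining.pop(best))
        importance.append(gains[best])
        current += gains[best]
    return reduct, importance


# Hypothetical toy table: attribute 0 separates the classes, the others do not.
X = np.array([
    [0.0, 0.5, 0.10],
    [0.1, 0.9, 0.80],
    [0.9, 0.5, 0.15],
    [1.0, 0.1, 0.80],
])
y = np.array([0, 0, 1, 1])

reduct, importance = forward_greedy_reduct(X, y, delta=0.125)
print(reduct, importance)
```

Each pass evaluates every remaining attribute once, so the cost grows roughly quadratically with the number of attributes instead of exponentially.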
It can be seen from Tables 1 and 2 that the correlation of each retained factor passes the two-sided significance test at the 0.05 level (the critical value of the correlation coefficient is 0.273), indicating a significant correlation between each factor and the forecast object. The multiple correlation coefficient of the factors is 0.908 for September and 0.929 for October, indicating that the selected factors have high predictive power. Meanwhile, the sum of the attribute importance of the factors for runoff in September and October reaches 0.995 and 0.992, respectively, indicating that the basic factor set after reduction contains nearly all the forecasting information while using the fewest predictors.

Relevance vector machine modeling
The relevance vector machine method was adopted to establish the mid-long term runoff forecasting model based on the reduced factor set. The structure of the model is shown in Figure 3. The input of the model is the reduced factor set, and the output is the runoff during the autumn floods (September and October). Because the factors differ in magnitude, the input data need to be normalized first to eliminate the influence of their different dimensions and units.

In the relevance vector machine, selecting different kernel functions forms different algorithms. Commonly used kernel functions include the polynomial kernel, the Gaussian radial basis kernel, the B-spline kernel and so on. Experience has shown that the Gaussian radial basis kernel function has good nonlinear processing ability. Therefore, the Gaussian radial basis kernel function is selected in this study. The specific formula is shown above. Based on cross-validation, the width of the Gaussian kernel function for the runoff forecasts in both September and October is set to 5.
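As an illustration of the preprocessing and kernel step, the Python sketch below scales each factor and then evaluates a Gaussian kernel matrix. The normalization formula is not recoverable from the text, so min-max scaling to [0, 1] is assumed here; the kernel form $\exp(-\lVert a - b\rVert^2 / \text{width}^2)$ is also an assumed parameterization, and the function names and sample values are hypothetical.

```python
import numpy as np


def fit_minmax(X_train):
    """Per-factor minimum and span, estimated on the training years only."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero for constant factors
    return lo, span


def apply_minmax(X, lo, span):
    """Scale factors with statistics from the training period (assumed min-max form)."""
    return (X - lo) / span


def gaussian_kernel(A, B, width):
    """Kernel matrix with entries exp(-||a - b||^2 / width^2) (assumed form)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / width**2)


# Hypothetical factor values: three years, two factors with different units.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
lo, span = fit_minmax(X_train)
X_scaled = apply_minmax(X_train, lo, span)
K = gaussian_kernel(X_scaled, X_scaled, width=5.0)
print(X_scaled)
print(K)
```

Fitting the scaling on the simulation period and reusing it for the forecast years keeps information from the forecast years out of the model inputs.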

Forecast result analysis
According to the established model, the runoff in September and October was simulated for 1952 to 2000 and trial-forecast for 2001 to 2008. It can be seen from the figures that, for the mid-long term forecasting of runoff of the Danjiangkou Reservoir in September and October during the autumn floods, both the relevance vector machine model and the support vector machine model achieve good simulation, and the simulated curves of both models are consistent with the measured values.
To evaluate the accuracy and performance of the prediction model, the following error criteria are used: the correlation coefficient (R), the root mean square error (RMSE), and the Nash efficiency coefficient (E). In addition, the scheme for assessing the accuracy of mid-long term forecasts in the Forecasting norm for hydrology intelligence (SL250-2000) [17] is used as an evaluation criterion: for quantitative forecasting, the permissible error is 10% of annual variation for water level (flow), 20% of annual variation for other elements, and 30% of annual variation for the occurrence time of an element's extremum. Together, these criteria are expected to fully reflect the performance of the model in terms of accuracy, efficiency and error response. The correlation coefficient (R), root mean square error (RMSE), and Nash efficiency coefficient (E) are shown below.
Correlation coefficient (R):

$$R = \frac{\sum_{i=1}^{n}(Q_{o,i} - \bar{Q}_o)(Q_{f,i} - \bar{Q}_f)}{\sqrt{\sum_{i=1}^{n}(Q_{o,i} - \bar{Q}_o)^2 \sum_{i=1}^{n}(Q_{f,i} - \bar{Q}_f)^2}} \tag{14}$$

Root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Q_{o,i} - Q_{f,i})^2} \tag{15}$$

Nash efficiency coefficient (E):

$$E = 1 - \frac{\sum_{i=1}^{n}(Q_{o,i} - Q_{f,i})^2}{\sum_{i=1}^{n}(Q_{o,i} - \bar{Q}_o)^2} \tag{16}$$

In the above three formulas, $n$ is the length of the time series, $Q_{o,i}$ is the measured runoff, $Q_{f,i}$ is the predicted runoff, and $\bar{Q}_o$ and $\bar{Q}_f$ are the means of the measured runoff and the predicted runoff, respectively. The larger the correlation coefficient (R) and the Nash efficiency coefficient (E), and the smaller the root mean square error (RMSE), the better the prediction. Tables 3 and 4 show the accuracy evaluation results for the correlation coefficient (R), root mean square error (RMSE), and Nash efficiency coefficient (E). Table 5 shows the accuracy evaluation results under the Forecasting norm for hydrology intelligence (SL250-2000). [17]

The accuracy evaluation results under the Norm also reflect the advantages of the RVM model. Under the standard of 10% annual variation, the fitting qualification rates of the RVM and SVM models for historical runoff in September reach 83% and 77%, respectively. In the 8 years of the forecast period, the RVM is qualified for 5 years and the SVM for 4 years. The fitting qualification rates of the RVM and SVM models for historical runoff in October reach 81% and 73%, respectively. In the 8 years of the forecast period, the RVM is qualified for 6 years and the SVM for 4 years.

Under the standard of 20% annual variation, the fitting qualification rates of the RVM and SVM models for historical runoff in September reach 93.9% and 87.8%, respectively. In the 8 years of the forecast period, the RVM is qualified for all years and the SVM for 6 years. The fitting qualification rates of the RVM and SVM models for historical runoff in October reach 95.9% and 91.8%, respectively. In the 8 years of the forecast period, the RVM is qualified for all years and the SVM for 7 years.
Overall, the RVM has better robustness and generalization performance, and the accuracy of both the simulation and the trial forecast is satisfactory. Moreover, this model better reflects the high-flow and low-flow characteristics of the forecast years.
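The three error criteria and the qualification rate can be computed as in the Python sketch below. The runoff values are made up for illustration. The sketch assumes that "annual variation" means the range (maximum minus minimum) of the measured series, which is one reading of the Norm; the function names are hypothetical.

```python
import numpy as np


def correlation(obs, sim):
    """Correlation coefficient R between measured and predicted runoff."""
    d_obs, d_sim = obs - obs.mean(), sim - sim.mean()
    return (d_obs @ d_sim) / np.sqrt((d_obs @ d_obs) * (d_sim @ d_sim))


def rmse(obs, sim):
    """Root mean square error."""
    return np.sqrt(np.mean((obs - sim) ** 2))


def nash_efficiency(obs, sim):
    """Nash efficiency coefficient E (1 is a perfect fit)."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def qualification_rate(obs, sim, fraction=0.2):
    """Share of years whose absolute error is within fraction * range of obs.

    Assumption: 'annual variation' is taken as max(obs) - min(obs).
    """
    allowed = fraction * (obs.max() - obs.min())
    return np.mean(np.abs(obs - sim) <= allowed)


# Made-up measured and predicted runoff values.
obs = np.array([100.0, 200.0, 300.0, 400.0])
sim = np.array([110.0, 190.0, 330.0, 400.0])

print("R    =", correlation(obs, sim))
print("RMSE =", rmse(obs, sim))
print("E    =", nash_efficiency(obs, sim))
print("rate =", qualification_rate(obs, sim, fraction=0.2))
```

Tightening `fraction` from 0.2 to 0.1 reproduces the stricter of the two standards used in the comparison above.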

Conclusions
(1) The selection of forecasting factors extracts effective information from a large amount of forecasting information as the input to the model. Traditional statistical methods have serious shortcomings in dealing with such problems. Rough set theory has the advantages of handling incomplete information, reducing massive data and obtaining a concise expression of key knowledge. The attribute reduction algorithm based on rough set theory can directly obtain the basic forecasting factor set with the highest forecasting effect, which provides an effective source of forecasting information for subsequent model building. The results for the Danjiangkou Reservoir show that the basic factor set after reduction contains nearly all the forecasting information while using the fewest forecasting factors.
(2) As a new machine learning method in statistical learning theory, RVM can effectively deal with the nonlinear relationship between forecasting factors and forecast objects, and the nonlinear relationship among forecasting factors. The simulation results show that, under the same noise condition, the RVM performs similarly to the SVM on the sinc(x) function. However, the number of relevance vectors in the RVM is significantly smaller than the number of support vectors in the SVM, which indicates that the RVM has better robustness and generalization performance.
(3) The RVM was adopted to simulate the runoff of the Danjiangkou Reservoir in September and October from 1952 to 2000, and to conduct a trial forecast of runoff from 2001 to 2008. The results were compared with those of a model based on SVM. The four evaluation criteria show that, for the mid-long term runoff forecasting of the Danjiangkou Reservoir in September and October during the autumn floods, the RVM model has better forecast performance than the SVM model under the same conditions.