Long-term runoff prediction for reservoir based on Mahalanobis distance discrimination

Accurate and timely medium- and long-term runoff forecasts are of great significance to reservoir safety and water resources scheduling. To improve the accuracy of long-term runoff forecasts for reservoirs, the data series of Danjiangkou reservoir from 1952 to 2008 was selected, the correlation coefficient method and the AIC criterion were used to screen highly correlated and mutually independent factors, and a long-term runoff forecasting model was constructed based on the principle of Mahalanobis distance discriminant analysis. The results showed that, with a permissible error of 10%, the pass rate was 93.9% during the simulation period and 87.5% during the inspection period. The results can serve as a reference for the operation of Danjiangkou reservoir.


Introduction
Runoff forecasting is an important application of hydrology and a prerequisite for reservoir operation, flood control, drought relief and emergency water resources dispatch [1]. Because of its long lead time and low accuracy, long-term runoff forecasting still struggles to meet the needs of social and economic planning, and it has long been one of the difficult problems in hydrological research. Hydrological prediction methods can be roughly divided into two categories: process-driven and data-driven [2]. Process-driven methods are models based on the mechanisms of runoff generation and concentration, and they represent one development direction of runoff prediction. However, runoff is affected by many uncertain factors such as climate and meteorology, the underlying surface and human activities [3], so its formation mechanism and laws are not yet fully understood; the resulting prediction accuracy is low, which makes these methods difficult to apply. With the growth of data acquisition capability and computing power, data-driven models are being applied more and more widely in hydrological forecasting, and methods such as neural networks [4] and support vector machines [5] have achieved some success in practice. In recent years, an important research direction has been to build runoff forecasting models with a physical basis by exploring the relationship between future runoff and large-scale climatic and hydrological variables such as antecedent rainfall, sea surface temperature and atmospheric circulation indices [6][7][8][9]. Under global climate change, underlying surface changes and high-intensity human activities, the applicability of traditional runoff forecasting methods has gradually deteriorated, which poses a challenge to accurate hydrometeorological forecasting [10].
To further improve the accuracy and reliability of the forecast, this paper combines qualitative and quantitative forecasting methods: forecasting factors are selected on the basis of the physical causes of long-term runoff, the AIC criterion is used to screen the key factors, and a long-term runoff prediction model is constructed using Mahalanobis distance discrimination.

Single correlation coefficient
Correlation coefficients are often used in medium- and long-term hydrological prediction to test whether predictors and forecast objects are linearly correlated, which serves as a basis for factor selection [3]. The formula for the single correlation coefficient is:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

where $X_i$ and $Y_i$ are the series of the factor and the forecast object, respectively; $\bar{X}$ and $\bar{Y}$ are their mean values; $n$ is the length of the series; and $r$ is the single correlation coefficient. The significance of $r$ is commonly examined with a t-test. Once the significance level $\alpha$ is chosen, the critical value $t_\alpha$ can be read from the t-distribution table. When $t > t_\alpha$, the two series are considered linearly correlated at this confidence level; otherwise they are considered linearly uncorrelated.
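The screening step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the test statistic is assumed to be the standard t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom (the text does not print it), and the absolute value of t is compared with the critical value so that negatively correlated factors are also retained. The critical value `t_alpha` is supplied by the user from a t-distribution table.

```python
import math

def pearson_r(x, y):
    """Single (Pearson) correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, with n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Assumed standard statistic t = r * sqrt((n - 2) / (1 - r^2))."""
    if abs(r) >= 1.0:
        # Perfect correlation: the statistic is unbounded.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def is_significant(x, y, t_alpha):
    """True when |t| exceeds the tabulated critical value t_alpha."""
    r = pearson_r(x, y)
    return abs(t_statistic(r, len(x))) > t_alpha
```

In practice each candidate factor series would be passed through `is_significant` together with the runoff series, and only the factors that pass would go on to the next screening step.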

Akaike information criterion
The number of factors p has a great influence on the prediction accuracy and stability of the model. If p is too small, the fit is poor; if p is too large, the model is overly affected by accidental variations and may even over-fit, which degrades the forecast. The Akaike information criterion (AIC) is a standard for measuring the goodness of fit of a statistical model. It was proposed by the Japanese statistician Akaike in 1974, is based on the concept of entropy, and provides a way to weigh the complexity of the model against the goodness of fit to the data. The criterion function AIC(p) consists of two terms, where σ²(p) is the variance of the error, p is the number of factors, and n is the number of samples. The first term represents the goodness of fit and the second term represents the penalty for adding factors; the two are weighed against each other, and the p with the smallest AIC(p) is chosen as the reasonable number of factors.
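The trade-off between fit and penalty can be illustrated with a short sketch. The exact criterion function used by the authors is not recoverable from the text, so the code assumes the common form AIC(p) = n·ln σ²(p) + 2p, which matches the description of a goodness-of-fit term plus a penalty term. The function names and the residual variances in the usage note are hypothetical.

```python
import math

def aic(n, sigma2, p):
    """Assumed form: AIC(p) = n * ln(sigma^2(p)) + 2 * p."""
    if sigma2 <= 0:
        raise ValueError("residual variance must be positive")
    return n * math.log(sigma2) + 2 * p

def best_factor_count(n, sigma2_by_p):
    """Return the number of factors p with the smallest AIC.

    sigma2_by_p maps each candidate p to the residual variance of the
    model fitted with p factors.
    """
    return min(sigma2_by_p, key=lambda p: aic(n, sigma2_by_p[p], p))
```

For example, with n = 49 and made-up residual variances `{1: 0.40, 2: 0.30, 3: 0.25, 4: 0.245, 5: 0.244}`, `best_factor_count` returns 3: the fourth and fifth factors reduce the variance too little to offset the penalty.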

Mahalanobis distance discrimination
The Mahalanobis distance was proposed by the Indian statistician Mahalanobis. It is a distance measure that takes the covariance of the data into account. Mahalanobis distance discriminant analysis is a statistical method that classifies a newly acquired sample on the basis of the observed samples: the similarity between the new sample and each class of observed samples is measured by the Mahalanobis distance, and the new sample is assigned to the class to which it is closest.
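A minimal sketch of distance discrimination is given below, assuming NumPy is available. It uses the squared Mahalanobis distance d²(x, G) = (x − μ)ᵀ Σ⁻¹ (x − μ) between a sample x and a class G with mean μ and covariance Σ. The text does not say whether the covariance is estimated per class or pooled across classes, so a per-class estimate is assumed here; the pseudo-inverse is used only to keep the sketch robust when a covariance matrix is nearly singular. All function names are hypothetical.

```python
import numpy as np

def fit_classes(X, labels):
    """Estimate (mean, inverse covariance) for each class.

    X has shape (n_samples, n_factors); labels holds one class label per row.
    Each class needs at least two samples.
    """
    X = np.asarray(X, dtype=float)
    params = {}
    for c in sorted(set(labels)):
        mask = np.array([lab == c for lab in labels])
        Xc = X[mask]
        mean = Xc.mean(axis=0)
        cov = np.atleast_2d(np.cov(Xc, rowvar=False))
        params[c] = (mean, np.linalg.pinv(cov))  # per-class covariance (assumption)
    return params

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance from sample x to a class."""
    d = np.asarray(x, dtype=float) - mean
    return float(d @ cov_inv @ d)

def classify(x, params):
    """Assign x to the class with the smallest Mahalanobis distance."""
    return min(params, key=lambda c: mahalanobis_sq(x, *params[c]))
```

In the setting of this paper, the rows of `X` would hold the key factors for each year of the simulation period and `labels` the runoff class of that year; `classify` would then be applied to the factor values of a forecast year.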

Multiple linear regression
Multiple linear regression is a method for studying the relationship between one random variable and several other variables. A multiple regression equation is established between the factors $X_1, X_2, \ldots, X_m$ and the object $Y$:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m$$

where $b_0, b_1, \ldots, b_m$ are the regression coefficients, which are determined by the least squares method.
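The least-squares fit can be sketched with NumPy as follows. This is an illustrative implementation under the assumption that the factors are arranged as a matrix with one row per year and one column per factor; the function names are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares estimate of [b0, b1, ..., bm] in Y = b0 + sum_j bj * Xj.

    X has shape (n_samples, m); y has length n_samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept b0.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Evaluate the fitted regression equation for each row of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]
```

Solving through `numpy.linalg.lstsq` rather than inverting the normal equations explicitly is a design choice that tends to behave better when the factors are strongly correlated with one another.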

Model accuracy assessment method
The accuracy and stability of the model are evaluated with the deterministic coefficient and the mean absolute error defined in the Hydrological Information Forecasting Specification (GB/T22482-2008) [11]. The deterministic coefficient $D_C$ and the mean absolute error $e$ are given by equations (8) and (9):

$$D_C = 1 - \frac{\sum_{i=1}^{n} (Y_{c,i} - Y_{0,i})^2}{\sum_{i=1}^{n} (Y_{0,i} - \bar{Y})^2} \qquad (8)$$

$$e = \frac{1}{n} \sum_{i=1}^{n} \left| Y_{c,i} - Y_{0,i} \right| \qquad (9)$$

where $D_C$ is the deterministic coefficient; $e$ is the mean absolute error; $n$ is the length of the series; $Y_c$ is the predicted value; $Y_0$ is the measured value; and $\bar{Y}$ is the mean of the measured series.
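The two accuracy measures can be computed as in the sketch below. It assumes the usual definitions, D_C = 1 − Σ(Y_c − Y_0)² / Σ(Y_0 − Ȳ)² and e = (1/n) Σ|Y_c − Y_0|, with Ȳ the mean of the measured series; the function names are hypothetical.

```python
def deterministic_coefficient(y_obs, y_pred):
    """D_C = 1 - sum((Yc - Y0)^2) / sum((Y0 - mean(Y0))^2)."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    sse = sum((c - o) ** 2 for c, o in zip(y_pred, y_obs))
    sst = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - sse / sst

def mean_absolute_error(y_obs, y_pred):
    """e = (1/n) * sum(|Yc - Y0|)."""
    return sum(abs(c - o) for c, o in zip(y_pred, y_obs)) / len(y_obs)
```

A deterministic coefficient close to 1 means that the forecast errors are small relative to the natural spread of the measured series, while the mean absolute error expresses the typical error in the units of the runoff itself.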

Research area
The Danjiangkou reservoir is located in the middle and upper reaches of the Han River, between 106°12'~111°26' E and 31°24'~34°11' N. It is the water source for the Middle Route of the South-to-North Water Transfer Project. The reservoir serves five functions: flood control, power generation, irrigation, shipping and aquaculture. It is one of the large comprehensive-utilization reservoirs in China, with an average annual inflow volume of 39.48 billion m³ and a controlled watershed area of 95217 km².

Primary selection of forecast factors
The data used are the September runoff of Danjiangkou reservoir from 1952 to 2008, the monthly average sea surface temperature in the North Pacific, the 100 hPa and 500 hPa monthly mean height fields in the Northern Hemisphere, and 74 monthly circulation characteristic indices. The factors of the factor field were correlated and passed a significance test of

Mahalanobis distance discriminant analysis
The September inflow runoff data of the Danjiangkou reservoir from 1952 to 2008 were divided into a simulation period (1952-2000) and an inspection period (2001-2008). According to the Hydrological Information Forecasting Specification (GB/T22482-2008) [11], the inflow runoff of the Danjiangkou reservoir during the simulation period was divided into three classes according to the anomaly value; the detailed division criteria are shown in Table 1. The key predictors were selected using the AIC, and the selected key factors are shown in Table 2.
Using the established regression equation, the September inflow runoff of the simulation period from 1952 to 2000 was simulated.
Based on the key factors, the discriminant analysis results are shown in Table 3.

Forecast result analysis
The forecast results for the simulation period are shown in Figure 1. The Hydrological Information Forecasting Specification (GB/T22482-2008) gives the following criteria for assessing the accuracy of long-term forecasting models: for quantitative forecasts, the permissible error is 10% of the multi-year variation range for water level (flow), 20% of the multi-year variation range for other elements, and 30% of the multi-year variation range for occurrence time. Taking 10% of the multi-year variation range as the permissible error, 46 of the 49 years in the simulation period meet the requirement, so the pass rate is 93.9%, and the model's deterministic ...

With 10% of the multi-year variation range as the permissible error, only the error in 2003 exceeds the permissible value in the 8 years of the inspection period, so the pass rate is 87.5%; when the permissible error is 20% of the multi-year variation range, all 8 years of the inspection period meet the requirement and the pass rate is 100%.
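The pass rates quoted above follow directly from the counts: 46/49 ≈ 93.9% for the simulation period and 7/8 = 87.5% for the inspection period. The check itself can be sketched as below, under the assumption that the multi-year variable is the multi-year range of the measured series (maximum minus minimum). The function name and its parameters are hypothetical, and `ref_obs` lets the range be taken from a longer reference series than the one being assessed.

```python
def pass_rate(y_obs, y_pred, fraction=0.10, ref_obs=None):
    """Share of forecasts whose absolute error is within the permissible error.

    The permissible error is taken as `fraction` times the range (max - min)
    of the reference series, which defaults to y_obs itself (assumption).
    """
    ref = y_obs if ref_obs is None else ref_obs
    allowed = fraction * (max(ref) - min(ref))
    passed = sum(1 for o, c in zip(y_obs, y_pred) if abs(c - o) <= allowed)
    return passed / len(y_obs)
```

Calling the function with `fraction=0.10` and then `fraction=0.20` on the same series corresponds to the two permissible-error levels discussed for the inspection period.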

Conclusion
a) The forecast application results show that the prediction model based on Mahalanobis distance discrimination has good prediction accuracy and stability.
b) The key to improving forecast accuracy is to improve the representativeness of the factors rather than simply to increase their number. Using the AIC criterion to determine the number of factors helps improve the accuracy and stability of the forecast.
c) Correlation analysis among the forecasting factors, and analysis of the causal mechanism linking forecasting factors to forecast objects, should be strengthened to further improve the representativeness of the factors and reduce forecast error in the future.