Prediction of Stream Flow in Humid Tropical Rivers by Support Vector Machines

Stream flow (SF) prediction is considered very complex because the hydrological systems of surface water are complex and dynamic. Reliable SF prediction can be performed with either conceptual or data-driven models. In the modelling of hydrological processes, the support vector machine (SVM) is a novel, data-driven approach. Hence, six SVM-based models were generated in this study to predict real-time hourly SF in the Selangor River Basin from the water level and rainfall of upstream stations. These models comprised six different combinations of input variables and were trained and tested on hourly records of SF, rainfall, and water level over one year (2011). Among the SVM-based models, SVM-M6, which has nine input variables, was the most effective. Its correlation coefficient was 0.992 for the training data set and 0.953 for the testing data set, and its mean absolute error was 0.061 and 0.253, respectively.


Introduction
To predict SF, various models have been established. These models can be classified into two main types: knowledge-driven and data-driven. Each type has specific advantages and disadvantages depending on data availability and modelling conditions [1,2]. Knowledge-driven models are also known as physical or conceptual models. They are designed to simulate the interior sub-processes of the prototype, as well as the physical mechanisms that govern the natural process. These models use a mathematical structure that depends on basin features, such as the specific characteristics of rainfall (intensity and duration), the basin (area, shape, slope, land use, vegetation cover, and soil nature), and climate (temperature, humidity, and wind speed), to model and predict SF [3][4][5]. However, these models are complex and data-demanding. In some cases, conceptual models cannot predict SF accurately and reliably given the lack of required data, especially in developing countries [6]. Furthermore, modelling the physical process is complicated by the need to gather data on multiple model variables that vary spatially and temporally [7][8][9][10].
Data-driven models include those developed using artificial intelligence (AI) techniques, such as artificial neural networks, genetic algorithms, support vector machines (SVMs), and fuzzy rule-based systems. These models are adequate alternatives in many hydrological applications, especially when data are insufficient to build conceptual models [11][12][13][14].
This study mainly aims to develop SVM-based models to predict the hourly SF of a downstream area from the water level and rainfall records of upstream stations in the river basins of humid tropical regions. These models are generated from hourly records of SF, rainfall, and water level throughout one year (2011). The performance of the models is assessed based on two criteria, namely, the correlation coefficient (R) and the mean absolute error (MAE).

Methodology
In developing SVM-based models for SF prediction, we first consider data collection and analyses, followed by the selection of adequate input and output variables for the model. In small basins, these variables depend completely on the estimated lag time between the upstream and downstream stations. Thereafter, we determine the model structure. Finally, we assess the developed models according to the evaluation criteria to obtain the model that best predicts hourly SF.
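The text does not state how the lag time between stations was estimated. As an illustration only, one common approach is to pick the shift that maximizes the correlation between an upstream series and the downstream series. The sketch below assumes this cross-correlation approach; the function names and the toy series are hypothetical and are not taken from the study.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def estimate_lag(upstream, downstream, max_lag):
    """Return the lag (in time steps) that maximizes the correlation
    between upstream[t] and downstream[t + lag]."""
    best_lag, best_r = 0, float("-inf")
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]  # upstream values at time t
        y = downstream[lag:]                 # downstream values at time t + lag
        r = pearson_r(x, y)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


# Hypothetical toy series: the downstream signal is the upstream signal
# delayed by three time steps.
upstream = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 2, 6, 2, 0, 0, 0]
downstream = [0, 0, 0] + upstream[:-3]
lag = estimate_lag(upstream, downstream, max_lag=6)
print(lag)
```

With hourly records, the returned lag would be expressed in hours; real series would also need gaps and constant stretches handled before the correlation is computed.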

Case study
In this study, we investigate the Selangor River Basin, one of the main river basins in Malaysia. It is located in Selangor state and has an approximate area of 1960 km² [15]. From northeast to southwest, the Selangor River is approximately 110 km long [6,16,17]. Moreover, the Selangor River Basin provides approximately 50% of the water consumed in Selangor and Kuala Lumpur [18,19]. Figure 1 presents the location map of the Selangor River Basin in Peninsular Malaysia, as well as its topography map.
Fig. 1 Location and topography map of the Selangor River Basin [20].

Data collection and analyses
The SF data of the downstream station were obtained from the Rantau Panjang gauging station, which is located in the downstream reach of the Selangor River. All of the major tributaries of the river converge upstream of this station. Thus, the SF at the Rantau Panjang station is the best indicator of the stream flow in the study area. Water level and rainfall data were obtained from four upstream stations. The study stations were selected based on data availability and modelling requirements. Moreover, the stations that gauge rainfall and water level are very close to one another. Figure 2 displays the location of the hydrological stations and the flow paths among them in the Selangor River Basin.
Table 1. Hydrological stations and the statistical characteristics of the data used.

Fig. 2 Location of the hydrological stations in the Selangor River Basin.

Determination of model variables
In the development of AI-based models, determining the adequate input and output variables is a key issue.
In models of SF prediction, model variables are commonly selected based on a priori knowledge of river basin hydrology, which provides initial indications of potential inputs and outputs [21]. The SF in tropical rivers can be characterized as a function of several influential variables, including rainfall, water level, and the physical characteristics of the river [22]. Because this study aims to predict the hourly SF of the downstream area from the water level and rainfall records of upstream stations, we use the hourly records of water level and rainfall at the upstream stations as input variables and the SF data at the downstream station as the output variable. Eq. 2 describes the relationship between SF and the influential variables:
$$Sf(t) = f\big(X(t)\big) + e$$
where Sf(t) represents the SF; X(t) is the input vector that includes the input variables (i.e., rainfall and/or water level); f is the mapping approximated by the model; and e is the random error.
We consider three scenarios in selecting the input and output variables of the models. First, we apply the rainfall data of the upstream stations as input variables. Second, we regard the water level data of these stations as inputs. Third, we utilize both the water level and rainfall data from these stations as inputs.
In each of these three scenarios, we apply two input vectors. In the first, we use the single antecedent record of the upstream stations. In the second, we use the average of these antecedent records. This yields six input vectors. In each of them, the single antecedent record of SF at the downstream station is included as an additional input variable, and SF is predicted for a lead period equal to the lag time between the upstream and downstream stations. The estimated lag time between these stations determines the final input variables for the six input vectors. Using these vectors, we generated six SVM-based models to predict hourly SF.
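The construction of input vectors described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that "average" means the mean across upstream stations at a given hour, and the function name, the lag value, and the toy records are hypothetical.

```python
def build_patterns(upstream, downstream_sf, lag, average=False):
    """Build (inputs, targets) for lead-time prediction.

    upstream: list of per-station series (one list of hourly values each).
    downstream_sf: hourly SF series at the downstream station.
    Each input row holds the upstream values at hour t (or their mean
    across stations when average=True) followed by SF at hour t.
    Each target is SF at hour t + lag.
    """
    inputs, targets = [], []
    for t in range(len(downstream_sf) - lag):
        station_values = [series[t] for series in upstream]
        if average:
            station_values = [sum(station_values) / len(station_values)]
        inputs.append(station_values + [downstream_sf[t]])
        targets.append(downstream_sf[t + lag])
    return inputs, targets


# Hypothetical toy records for two upstream stations and one downstream station.
water_level = [
    [1.0, 1.2, 1.5, 1.3, 1.1],
    [2.0, 2.2, 2.6, 2.4, 2.1],
]
sf = [10.0, 11.0, 13.0, 12.5, 11.5]

X_single, y = build_patterns(water_level, sf, lag=2)
X_avg, _ = build_patterns(water_level, sf, lag=2, average=True)
print(X_single[0], y[0])
print(X_avg[0])
```

Passing rainfall series, water level series, or both as `upstream` would correspond to the three scenarios; combined with the two vector types, this gives six input configurations.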

Model description
SVM is a comparatively new AI modelling technique based on the statistical learning theory introduced by Vapnik in the 1970s. It aims to minimize the generalization error of the model rather than only the training error, which increases its generalization ability [23,24]. SVM was developed as a classification tool and has been applied successfully in a wide range of classification and clustering applications. More recently, it has been extended to regression and prediction applications [11,25,26].
Figure 3 presents the schematic diagram of SVM, where K(xi, x) is the output of the ith hidden node for the input vector x; it maps the input x and the support vector xi through the selected kernel function (Chen & Yu, 2007). SVM has been applied in the time-series prediction of river flow by Samsudin, Saad [6]; in SF prediction under multiple time scales by Asefa, Kemblowski [27]; in the real-time forecasting of flood stage by Yu, Chen [25]; in flood forecasting by [28]; in long-term discharge prediction by Lin, Cheng [29]; in the long-range forecast of SF by [30]; and in the monthly forecasting of SF by Guo, Zhou [31], Noori, Karbassi [32], Shabri and Suhartono [33], and Ch, Anand [34].
To generate the models, we determined approximately 8,753 patterns of hourly SF, water level, and rainfall records throughout a single year (2011). Table 1 lists the basic statistical characteristics of the hourly records obtained from the stations, such as the minimum, maximum, mean, standard deviation, and skewness. The modelling data were divided into two data sets: 75% for training with 6,580 patterns and 25% for model testing with 2,193 patterns. The training data set is used to train the models, and the testing data set is used to assess the performance of the SVM-based models [35].
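A 75%/25% division of the patterns can be sketched as below. The text does not say whether the split was chronological or random; this sketch assumes a chronological split, which keeps the testing period strictly after the training period. The function name and the placeholder patterns are hypothetical.

```python
def chronological_split(patterns, train_fraction=0.75):
    """Split time-ordered patterns into a leading training part
    and a trailing testing part."""
    n_train = round(len(patterns) * train_fraction)
    return patterns[:n_train], patterns[n_train:]


# Hypothetical placeholder: 100 time-ordered pattern indices.
patterns = list(range(100))
train, test = chronological_split(patterns)
print(len(train), len(test))
```

A chronological split avoids leaking future information into training, which matters for autocorrelated hourly series.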
In this study, we apply SVM as the modelling tool. Training was performed internally by the closed-source routines available in the AI toolbox of the Statistica software. Hence, the training algorithm was selected by the software through its built-in optimization technique; the Levenberg-Marquardt technique was adopted because it performed better than the other algorithms.
The main focus of this paper is to maximize the correlation and minimize the error, regardless of the details of the training algorithms and techniques.
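To make the role of the kernel mapping K(xi, x) in this section concrete, the sketch below evaluates a trained support vector regression model as a weighted sum of kernel outputs plus a bias. The study does not report which kernel was used, so the radial basis function kernel, the gamma value, and all support vectors and coefficients here are assumptions chosen purely for illustration.

```python
import math


def rbf_kernel(xi, x, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||xi - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, coefficients, bias, gamma=0.5):
    """Prediction of a trained SVR: sum_i c_i * K(x_i, x) + b."""
    kernel_sum = sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )
    return kernel_sum + bias


# Hypothetical trained parameters (not taken from the study).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coefficients = [2.0, -1.0]
bias = 0.5

prediction = svr_predict([0.0, 0.0], support_vectors, coefficients, bias)
print(prediction)
```

Each kernel evaluation plays the part of one hidden node in the schematic: the closer the input is to a support vector, the larger that node's contribution to the output.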

Performance evaluation criteria
The performance of the models was assessed based on two criteria: R and MAE. R is a statistical measure that indicates the strength and direction of a linear relationship between two variables [36,37]. In this study, R was used to validate the agreement between the observed and predicted hourly SFs. R² describes the proportion of variance shared by the two variables, as determined by the linear fit. R can be calculated in different ways, but the most popular is the Pearson R. This value is computed by dividing the covariance of the two variables by the product of their standard deviations, as described in the following equation:
$$R = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$
where n is the number of data pairs and x and y are the variables.
In a perfectly increasing linear relationship, R is +1. By contrast, R is −1 in a perfectly decreasing linear relationship. R values between +1 and −1 indicate the strength of the linear relationship between the variables. R = 0 signifies that the variables are not linearly related. MAE evaluates the residuals, that is, the differences between the observed and predicted SF. Theoretically, its minimum value is zero, which indicates a perfect fit of the model. However, this value is difficult to obtain. Moreover, MAE has no maximum value and is calculated using the following equation:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_{m,i} - X_{o,i}\right|$$
where X_m represents the data predicted by a model and X_o is the observed data.
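The two criteria can be computed as in the following minimal sketch. The Pearson R is written in its sum-based form, which is algebraically equivalent to the covariance divided by the product of the standard deviations. The function names and the toy values are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient in its sum-based form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def mean_absolute_error(predicted, observed):
    """Mean of the absolute differences between predictions and observations."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical toy values standing in for observed and predicted SF.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

r = pearson_r(observed, predicted)
mae = mean_absolute_error(predicted, observed)
print(r, mae)
```

R is insensitive to a constant bias in the predictions, whereas MAE is not, which is one reason to report both.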

Results and discussion
To predict hourly SF at the Rantau Panjang station, we built six AI-based models under different combinations of input variables. Table 2 presents the model structures. The six models were trained and developed by SVM to predict hourly SF. The performance of the models was assessed on the training and testing data sets, as well as on the data as a whole. The best-fit model for predicting hourly SF was determined according to its performance on the testing data set. The SVM-M6 model displays the highest R values (0.992 and 0.953) and the lowest MAE values (0.061 and 0.253) in the training and testing data sets, respectively. Figure 4 shows the correlation between the observed and predicted hourly SF in the SVM-M6 model for the training and testing data sets. The observed and predicted hourly SF agree well, with R² values of 0.986 and 0.909 for the training and testing data sets, respectively. Figure 5 compares the observed and predicted hourly SF in SVM-M6 for the period of September 2013. These flows are highly consistent. Although SVM performed well in real-time SF prediction, future work could investigate whether other AI techniques, such as fuzzy rule-based systems and genetic algorithms, achieve a higher R in the Selangor River Basin.

Fig. 4 Correlation between the observed and predicted hourly stream flow in the SVM-M6 model: (a) training data set and (b) testing data set

Table 2. Input and output variables of the AI models.