Characteristics and Influence Factors of Water Consumption in China Provincial Capital Cities by Means of Multivariate Regression Algorithm

The city is a typical area of dual natural and social water circulation. Urban water use therefore has clear dual attributes, and many factors on both the natural and social sides affect urban water consumption (UWC). This article investigates the structure and characteristics of UWC. Taking the provincial capital cities of China as the research objects, 24 index factors and data for the year 2015 were selected to construct a multivariate regression model between UWC and the index factors. The results showed that combining correlation analysis with all-subsets regression could effectively screen the predictor variables of UWC. Principal component analysis could effectively reduce the dimension of the UWC variables while preserving as much of the raw dataset information as possible. The main factors affecting UWC on the social side include the built-up area, the urban population, road cleaning area, residential electricity consumption, and per capita water consumption; the main factors on the natural side include per capita green land and precipitation.


Introduction
The urban hydrological cycle becomes more complicated under the influence of climate change and urbanization [1]. Urban water consumption (UWC) is an important part of the urban hydrological process and determines the process changes and fluxes of the urban hydrological cycle [2]. In addition, UWC is an important reference for urban water resources management and allocation, as well as an important index for urban planning, design and construction [3]. In recent years, the shortage of urban water resources has become a bottleneck restricting urban development [4]. Therefore, it is necessary to understand the influencing factors of UWC and analyze their characteristics.
However, intensive human activity makes the process of UWC more complicated [5]. UWC is not only disturbed by human activities, but also affected by natural factors such as hydrology and meteorology [6]. The growth of the urban population increases the total amount and intensity of UWC, while the water price restrains them [7]. Water consumption per capita per day is strongly influenced by meteorological factors, socioeconomic status, water supply and conservation factors [8]. From a technical point of view, water supply facilities, water use efficiency and reuse rate also have a great impact on urban water use [9].
Multiple linear regression, time series methods and artificial neural networks have often been used to forecast UWC in existing studies [10][11]. However, these methods are essentially black-box models, so it is difficult to use them to analyze how the influencing factors act on UWC [12]. Such an analysis is the premise for improving urban water use and the scientific basis for the fine management of urban water resources [13]. Data mining is the process of extracting hidden and potentially useful information and knowledge from large, incomplete, noisy, fuzzy and random real data [14], and it is applicable to the study of complex relationships that lack a clear mechanistic explanation [15]. This paper selected correlation analysis, regression analysis, and principal component analysis (PCA) from the data mining methods to study the influencing factors of UWC and their characteristics. The aim was to analyze the natural and social duality of UWC and to consider the influence of natural and social factors on UWC as comprehensively as possible.

Materials and methods
The objects of this study were the 31 provincial capitals and municipalities directly under the central government of China in 2015; Taipei was not included because of insufficient information. Data sources included the "China City Statistics Yearbook", the "China City Construction Statistical Yearbook", the "Statistical Yearbook" of each city and so on. According to the statistical data, there were six main categories of urban water consumption: residential water, water for public administration and services, water for production, free supply water, leakage, and other water supply. Figure 1 showed the water consumption structure of the sample cities in 2015. Residential water was the largest category for most cities [16]. In 2015, more than 50% of the total consumption of water for residents and public services was more than 80% in these cities.
The main influencing factors covered urban construction, social economy, climatic characteristics and so on. Specific indicators included built-up area, urban population, per capita road area, per capita income, per capita water consumption, per capita green area, tertiary industry ratio, average temperature, rainfall days, sunshine hours and so on. The detailed names and symbols of the 24 index variables were shown in table 1.
The R language, a powerful tool for data analysis, summary, exploration and mining, was the main research tool in this study. First, correlation analysis was used to test the index variables and to remove variables that were strongly correlated with others. Then, all-subsets regression was used to select the best predictor variables based on the adjusted R² value. In addition, principal component analysis (PCA) was applied to the predictor variables to extract their latent relationship structure and reduce their dimension. Finally, multivariate linear regression was carried out by the ordinary least squares (OLS) method. The data mining research route was shown in figure 2.

Correlation analysis
The corrgram() function in the R language was used to show the correlations between variables graphically (figure 3). The 24 variables were listed along the diagonal. Each cell in the lower-left triangle was a shaded rectangle with diagonal hatching. A blue rectangle indicated that the two variables of the cell were positively correlated, with hatching running from the bottom left to the upper right; the deeper the blue, the stronger the correlation. Similarly, a red rectangle indicated a negative correlation, with hatching running from the top left to the bottom right; the deeper the red, the stronger the negative correlation. The upper-right triangle showed the same correlations as pie charts, again with blue for positive and red for negative correlation. A positive correlation filled the pie clockwise from the 12 o'clock position, and a negative correlation filled it counterclockwise. The larger the filled area and the deeper the color, the stronger the correlation.
The variables ua, ie, pi, iww, pnd, rca, re, pn, ba, and wsp were strongly positively correlated with one another, and s was strongly negatively correlated with t, p, and pd. Some of these variables needed to be selectively deleted to improve the independence and effectiveness of the predictor variables. Because the variable s had good correlation with the other three variables, the variable s
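The quantity that a correlogram visualizes is the pairwise Pearson correlation coefficient. The following is a minimal sketch in plain Python, given only for illustration: the study itself used R, the toy values are hypothetical and are not the study's data, and the 0.8 cut-off for a "strong" correlation is an assumption that the text does not state.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strong_pairs(data, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The threshold is an assumed cut-off, not a value taken from the study.
    """
    names = list(data)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(data[a], data[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical toy values for three indicators (NOT the study's data).
toy = {
    "ba": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pn": [2.1, 3.9, 6.2, 8.0, 9.8],
    "s": [9.0, 7.5, 6.1, 4.0, 2.2],
}
pairs = strong_pairs(toy)
for a, b, r in pairs:
    print(a, b, r)
```

Pairs flagged in this way are candidates for screening: within each strongly correlated pair, one variable can usually be dropped with little loss of information.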

All-subsets regression
All-subsets regression examines all possible models, with the number of variables increasing from 2 to the full set, until R² reaches its maximum and becomes stable. We used the regsubsets() function in the leaps package of the R language to perform the all-subsets regression. The "nbest" value was set to 4, which meant that the four best models were shown for each number of variables. Figure 4 showed the best models from the all-subsets regression of UWC. In the graph, the horizontal axis was the intercept and the 24 predictor variables, and the vertical axis was the R² value for combinations of different numbers of variables. In this study, the optimal R² reached 0.96, indicating a good fit; the corresponding number of variables was 8. To quantify how often each variable occurred in the all-subsets regression, and to combine this with the correlation analysis when selecting the best variables, table 2 gave the occurrence proportion of the 24 predictor variables. Among them, the occurrence frequency of ua, t, s, gr, sr, tr, and pr was 0, so these seven variables were deleted. Because iww occurred less frequently than ie, iww was deleted. The occurrence rates of pi, pd, pnd, and dp were also low, so these variables were deleted as well.
Through correlation analysis and all-subsets regression, 12 of the 24 predictor variables were deleted. OLS multivariate linear regression was then used to fit the UWC, and the results for 12 variables and 24 variables were compared. With 24 variables, the multiple R-squared was 0.988, the adjusted R-squared was 0.934, and the p-value was 0.00057. With the 12 screened variables, the multiple R-squared was 0.977, the adjusted R-squared was 0.962, and the p-value was 3.018e-12. The reduced model thus had a higher adjusted R-squared and a much smaller p-value, despite a slightly lower multiple R-squared.
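The idea behind all-subsets regression can be sketched by brute-force enumeration. The plain-Python example below is illustrative only: it is not the authors' R code and does not reproduce the internals of regsubsets(); the variable names x1, x2, x3 and the toy data are hypothetical; and models are ranked by adjusted R², which is an assumption about the selection criterion. Exhaustive enumeration is practical only for a handful of variables, since 24 candidates already give 2^24 (about 16.8 million) subsets.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y[i] for i, r in enumerate(rows)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def all_subsets(data, y, nbest=4):
    """For each subset size, return up to nbest subsets ranked by adjusted R^2."""
    n = len(y)
    best = {}
    for size in range(1, len(data) + 1):
        scored = []
        for names in combinations(sorted(data), size):
            r2 = r_squared([data[v] for v in names], y)
            adj = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
            scored.append((adj, names))
        scored.sort(reverse=True)
        best[size] = scored[:nbest]
    return best

# Hypothetical toy data (NOT the study's data): y depends on x1 and x2 only.
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0],
}
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [2 * a + 3 * b + e for a, b, e in zip(toy["x1"], toy["x2"], noise)]
best = all_subsets(toy, y, nbest=4)
for size, models in best.items():
    print(size, [(names, round(adj, 4)) for adj, names in models])
```

Counting how often each variable appears among the retained models gives an occurrence proportion of the kind used for screening here.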
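The gap between multiple and adjusted R-squared follows from the standard adjustment for the number of predictors. The short Python check below is a sketch under one assumption: that the sample size is n = 31, the number of cities named in the methods section. Under that assumption the formula gives values close to the reported 0.934 and 0.962.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors plus an intercept and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# n = 31 is an assumption based on the number of sample cities.
full = adjusted_r2(0.988, n=31, p=24)     # 24-variable model
reduced = adjusted_r2(0.977, n=31, p=12)  # 12-variable model
print(round(full, 3), round(reduced, 3))
```

With 24 predictors and 31 observations only 6 residual degrees of freedom remain, so the penalty factor (n - 1) / (n - p - 1) is 5; with 12 predictors it falls to about 1.67, which is why the smaller model scores higher after adjustment.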

Regression fitting of UWC based on principal component analysis
Principal component analysis (PCA) converts correlated variables into uncorrelated composite variables, reducing the dimension of the variables while retaining as much of the original information as possible.
Using the score expressions of the principal components, the values of six principal components were calculated from the predictor variable data set, and a multivariate linear regression was then fitted with the lm() function. The six principal components were denoted z1, z2, z3, z4, z5, and z6, and the urban water consumption was denoted y. The relation was obtained as follows:
Specific indicators mainly included the built-up area, the urban population, road cleaning area, residential electricity consumption, per capita water consumption, per capita green land, precipitation, and so on, which covered factors on both the social and natural sides.
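The PCA step can be sketched as follows in plain Python. This is an illustration only, not the authors' R workflow: the toy columns are hypothetical, the variables are standardized so that the analysis runs on the correlation matrix (an assumption), and only the first principal component is extracted, by power iteration, to keep the example short.

```python
import math

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def correlation_matrix(cols):
    """Correlation matrix of already standardized columns."""
    n = len(cols[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]

def first_component(matrix, iters=500):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix (power iteration).

    Starts from an all-equal vector, so it assumes the leading eigenvector
    is not orthogonal to that starting direction.
    """
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k))
    return eigenvalue, v

# Hypothetical toy predictors (NOT the study's data).
raw = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.2, 3.9, 6.1, 8.0, 9.9, 12.1],
    [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
]
cols = [standardize(c) for c in raw]
eigenvalue, loadings = first_component(correlation_matrix(cols))
share = eigenvalue / len(cols)  # proportion of total variance explained
# Scores of the first principal component, one value per observation.
z1 = [sum(l * c[i] for l, c in zip(loadings, cols)) for i in range(len(cols[0]))]
print(round(share, 3), [round(z, 3) for z in z1])
```

Scores such as z1 are what would then serve as regressors in an OLS fit in place of the original correlated variables.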