Utilization of a Bayesian probabilistic inferential framework for contamination source identification in river environment

In the environmental event of hazardous release into river, quick and accurate identification of the contamination source is important for emergence response. Generally, given a noisy and finite set of monitoring information, determining the source items (i.e. location, strength and release time) is an ill-posed inverse problem. In this study, a Markov chain Monte Carlo method combined with advection-dispersion equation (ADE) was proposed for the source identification of contamination event in river system based on a Bayesian probabilistic inferential framework. Case study with analytical solution for one-dimensional ADE showed that the proposed methodology was effective and the mean posterior errors for all source parameters were lower than 3%. Case simulation based on two-dimensional ADE with numerical solution obtained similar results and further demonstrated the utility of the proposed approach for source identification. We hope the study will provide a helpful guidance to develop approach for contamination event source identification to support environmental risk management of river system.


Introduction
Rivers are spatially open system and inevitably vulnerable to a variety of outer risks [1]. According to the China Statistical Yearbook, a total of 9339 accidents associated with water contamination took place in China from 1997 to 2008. Among them, one of the most serious threats is the contamination injection of toxic chemical compounds into river. For instance, in November 2005, approximately 100 tons of benzene were spilled into Songhuajiang River, China and caused severe social and ecological problems [2]. Once occurring, quick and accurate identification of the contamination sources is essential to manage the emergency response and mitigate environmental consequences in the river system [3]. However, identifying source terms in environments faces challenge since it is a problem to find what happened in the past with finite set of monitoring information, such as, a limited number of observations [4,5].
In the past two decades, several deterministic and stochastic methods have been developed successively to facilitate the solution to the source identification problem for identifying the release locations of unknown sources, and estimate the release time and emission loads [2][3][4][5][6][7][8][9][10][11]. Among them, Bayesian approach has been applied widely for source identification in recent years as it has a number of distinctive attributes [10,11]. The method converts contamination source identification into the reiterative computation of posterior probability distribution of the source parameters and can quantify random errors in the concentration data and their uncertainties [5]. For example, Yang et al. [10] constructed differential evolution algorithm based on Bayesian inference to identify multi-point sudden water pollution sources. Compared with the deterministic calculations, the Bayesian framework estimates the posterior distribution of historical contaminants in an expanded stochastic space with current measurement concentrations and thus restates the ill-posed inverse problem as a well-posed problem [8,12]. Generally, the posterior distribution is composed of the prior distribution and the likelihood function. The former is the information related with an uncertain environmental parameter given the concentration history and the latter represents the probability of the data.
With rapid development of urbanization and quick increasement of population growth, the river environments are facing serious challenge on the contamination problems. Protecting ecological safety of river system has become a security issue because the demand for clean water is increasing in the world. Although contaminant source identification has been studied in river and lake pollution cases [3,4], limited publications have been found for quickly identifying contamination event source in river system with a very finite set of monitoring information.
In this study, a Bayesian probabilistic inferential framework combined with advection-dispersion equation (ADE) was proposed for source identification of contamination event in river system given a set of limited concentration data. Bayesian inference, which can provide the probability density function (PDF) of the source parameters, has been used to formulate the problem of source event determination and the Markov Chain Monte Carlo (MCMC) stochastic sampling method with Metropolis algorithm was employed to generate the posterior distributions of source parameters. Summary statistics associated with each variable can be estimated once a series of MCMC samples have been obtained. Finally, two hypothetical cases were conducted to demonstrate the developed methodology.

Bayesian formulation
By assuming the parameter vector m which indicates the leak characteristics: m (x 0 , y 0 , q 0 , t 0 ), where {x 0 , y 0 } represents the spatial location of contamination source, q 0 is its strength, and t 0 is the release duration, the posterior probability for the m vector with the concentration C at measurement point and prior information I can be formulated as following: In the posterior distribution, the role of P(C|I) is nothing more than a normalizing constant as C is the known data. Therefore, the posterior probability function P(m|C,I) is proportional to the prior probability P(m|I) and the likelihood function P(C|m,I).
2.1.1 Assignment of the prior density function P(m|I) is the prior probability for the m vector and hypothetically satisfy the independent uniform distribution in environmental hydraulic model. Generally, this probability is set as a constant: Given a bounded estimation domain ℜ , m (x 0 , y 0 , q 0 , t 0 ) ∈ ℜ , for example, the source location x 0 is assumed to be greater than zero but less than the length of river.

Assignment of the likelihood function
The likelihood function P(C|m,I) was employed herein to represent the probability between an array of measurements and a certain set of observed concentrations. Commonly, the difference between the measured concentrations (C) and modelled values (R) largely arises from the measurement and model errors [11].
where, C true,i, e meas,i, e model,i are the true value of measurement concentration, the measurement noise and the model error at the ith observation point, respectively.
Generally, the noise of measurement error is generally assumed to be normally distributed with mean of zero and variance of 2 1 σ [13], and then can be defined as following: where n is the number of observations. Likewise, the difference e model,i between the modelled concentrations and true values is also assumed to be normally distributed with mean of zero and variance of 2 2 σ : Therefore, the likelihood function is then obtained by marginalizing the joint probability density function of C and C true with respect to C true :

The posterior probability density function
Based on the prior distribution (Eq. 3) and likelihood function (Eq. 8), the relation of posterior probability density function P(m|C,I) can be expressed as following:

Markov chain Monte Carlo
There is no doubting it will be difficult to analytically compute the posterior distribution with Eq. 9. In the majority of cases, we are interested in the highest density region of the posterior distribution. Therefore, the Markov chain Monte Carlo (MCMC) simulation is adopted in this study to estimate the posterior distribution of the parameter vector m by constructing the posterior density from which the mean, median, and other statistical analysis can be done.
Essentially, the MCMC algorithm can be used to explore the state space of a random parameter using Markov chain mechanism and generate samples from the posterior distribution while spending most of the sampling steps in the highest density regions of the posterior state space [10,14]. In other words, it generates a point series as a chain, and the distribution of these points follows the posterior probability density function P(m|C,I). For a set of parameter vector m (x 0 , y 0 , q 0 , t 0 ), the posterior probability is computed in a circular manner using the MCMC algorithm until the chain has converged on the target distribution. Generally, each new Markov chain m k depends on the previous part m k-1 and will be accepted if it improves the posterior probability density of the previous set [15]. Based on the MCMC results, a statistical analysis including histogram, mean and deviation can be obtained for each parameter.

Metropolis algorithm
Metropolis algorithm, a special case of Metropolis-Hastings [10,16,17], was adopted in this study. Its proposal distribution (p) satisfies symmetrical random sampling, q(m (*) |m (i) )=q(m (i) |m (*) ), and the acceptance probability (A) of Markov chain ranges from m (i) to m (*) is as follows.
where m (i) is the sample generated in the ith iteration for parameter vector m.
The procedure flow of Metropolis algorithm is shown in Figure 1.

TWO CASES
In this study, two hypothetical cases were used to demonstrate the developed methodology for identifying contamination sources. Forward simulation of the spilled chemical compounds in a predetermined river was conducted to generate the concentration data set which was then regarded as the "measurements" from the observation points. Source identification was preformed based on the Bayesian probabilistic inferential framework combined with advection-dispersion equation, assuming that the location of spill location (x 0, y 0 ), the release duration (t 0 ) and the initial total leak mass (q 0 ) were unknown.

Case study Ⅰ: analytical solution for onedimensional advection-dispersion equation
The one-dimensional transport processes of a mass concentration, C, in the fluid flows is generally governed by the advection-dispersion equation: where t is the forward time, s; u is the flow velocity, m·s -1 ; E x represents the dispersion or diffusion tensors in x directions, m 2 ·s -1 ; k describes the reaction coefficient, d -1 ; q is the source number; M i is the source strength of the ith source; δ is the Dirac delta function; x i is the location of the ith source; L is the river length domain, m; T is the time domain, s. As descripted by Wu et al. [18], the advectiondiffusion Eq. 11 can be solved using a brute-force approach.
In this scenario, a kind of contaminant (k=0.1 d -1 ) with the total mass of 1.0×10 3 kg (q 0 ) was assumed to be released into a river at the point S and transported to the downstream river channel about 1.0×10 4 m. Two hours later (t 0 ), the contaminant plume was found and limited water samples (i.e. five) were quickly collected from five observation points (O 1 , O 2 , O 3 , O 4 and O 5 ). Point S was defined as coordinate origin, and the distance from point S to point O 1 was 5.0×10 3 m (x 0 ). The distance among each observation point was 500m. The dispersion tensors in x directions was 0.6 m 2 ·s -1 ; the mean velocity was about 0.36 m·s -1 . Then, the source parameter vector of the contaminant event that should be identified is m (x 0 , q 0 , t 0 ). Their actual values were 5.0×10 3 m, 1.0×10 3 kg, and 7.2×10 3 s, respectively.
The structural parameters ( 1 σ , 2 σ and proposed distribution step) were optimized using the iterative Expectation-Maximization method, a general iterative scheme for estimating the marginal PDF of a parameter from the joint PDF of the parameter and vector m. In specific, the optimization for the variances of measurement noise and model error was performed by using the likelihood portions of objective function. Meanwhile, the proposal distribution step was optimized using the prior PDF portion. In the presented study, the optimized values for 1 σ , 2 σ and proposed distribution step were 0.01, 0.01 and percent 5 of prior density range, respectively. Based upon the optimization results, the metropolis algorithm was conducted to run for 10 000 propositions of set of parameters. The iteration curves of source parameters showed that the Markov chain of all modeled parameters became convergence after 2000 iterations. Further, in order to verify the MCMC results, the posterior distribution was calculated directly, and the marginal distribution of each source parameter was evaluated by numerically integrating the full posterior distribution over the remaining source parameters, shown in Figure 2. The solid vertical line represents the true parameter value, and the dashed line is the mean of the MCMC samples. Shaded regions represent 95% highest probability density intervals based on the MCMC samples. It can be seen the stair-step appearance of the marginal distributions for q 0 , x 0 , t 0 was concentrated near to the initial assumed value. The posterior histogram of total leak mass q 0 had maximum probability near to 1.0×10 3 kg, which was very well resolved. The posterior probability density of source location x 0 was concentrated in the range of 4.8×10 3 and 5.3×10 3 m (actual value= 5.0×10 3 m). With regard to the spilled time t 0 , its maximum posterior probability was distributed at the range from 6.5×10 3 to 8.0×10 3 s, which included the actual value 7.2×10 3 s.
To determine the robust of the Bayesian methodology, Monte Carlo simulation was utilized to generate twenty data sets for random model parameters with errors of 10%. Results showed that the method of Bayesian computation for source event identification was steady and robust. The average errors of modeled parameters (q 0 , x 0 , t 0 ) for the twenty simulations were 2.9%, 2.5% and 2.2%, respectively. Relatively, the source strength had somewhat higher variation ranged from 0.5% to 8.5%, followed by source location x 0 (2.5%, 5.9%) and spilled time duration t 0 (2.2%, 4.7%).

Case study Ⅱ: numerical solution for twodimensional advection-dispersion equation
The two-dimensional transport processes of a mass concentration, C, in the fluid flows are governed by the advection-dispersion equation: Where t is the forward time, s; u x and u y are the depth-integrated velocity components in x and y directions, respectively, m·s -1 ; D x and D y represent the dispersion or diffusion tensors in x and y directions, respectively, m 2 ·s -1 ; K describes the reaction coefficient, d -1 ; q is the source number; s i is the spilled rate of the ith source; x i is the location of the ith source.
The above equation can be solved with finite element method by which partial differential equations can be solved approximately [19]. In the present research, ANSYS software (v10.0) which offers a comprehensive field of engineering simulation was employed to compute the modelled concentration.
In this scenario, the total spilled mass was 5.3 tons (q 0 ) and the contaminant plume was found one hour (t 0 ) later since the mass-release. Ten time histories of "measured" concentration were generated with ANSYS software (v10.0) at intervals of 5 minutes as the assumed observations. The dispersion tensors in x, y directions was 0.6 and 0.01 m 2 ·s -1 , respectively, and the mean velocity in x, y directions was about 0.36 and 0.002 m·s -1 , respectively. Then, the contaminant event source parameters to identify were x 0 , y 0 , q 0 and t 0 .
Likewise, the Expectation-Maximization approach was employed to optimize the structural parameters. With the optimization values ( 1 σ = 2 σ =0.01; percent 5 of prior density range for the proposal distribution step), the metropolis algorithm was conducted to run for 10 000 propositions of the set of parameters. Markov chain of all target parameters became converged after 800 iterations. Figure 3 shows the posterior histogram of model parameters generated from MCMC samples. Similarly, the solid vertical line and dashed line represent the true parameter value and the mean of MCMC samples, respectively. It can be seen that the stair-step appearance of the marginal x 0 , y 0 , q 0 , t 0 distribution was concentrated near to the initial assumed value.
Although the posterior histogram showed that the Bayesian methodology combined with the numerical solution for two-dimensional advection-dispersion equation could yield a good approximation for the source terms, including the total discharge mass q 0 , the source location (x 0 , y 0 ) and the contamination elapsed time t 0 , it was important to further assess the identification errors related to the contamination event source. It can be seen from Figure 3 that the chain was quickly converged after 800 iterations. Therefore, the first 800 "burn-in" samples were discarded conservatively, and the remaining 9200 samples were used to generate the posterior statistical analysis. The results demonstrated that the Bayesian methodology generated an honest identification of the contamination event in that the source items were accurately predicted according to the available context parameters, such as the data set of 'measured concentrations'. The posterior histogram of source location (x 0 , y 0 ) had maximum probability near to (484m, 33.9m) with mean error of (3.14%, 3.17%) and the mean posterior PDF for source strength q 0 concentrated between 5.2×10 3 kg and 5.5×10 3 kg with mean of 5.3×10 3 kg. Regarding the spilled time t 0 , its maximum posterior probability distributed at the range from 3.5×10 3 s to 3.9×10 3 s, which included the actual value 3.6×10 3 s. These results suggested that the agreement between the MCMC results and the posterior distribution was good.

CONCLUSIONS
During the emergency response of accidental incidents in river system, it is important to quickly determine the contamination source terms using limited measured information for protecting the water resources and minimizing the social losses. This study proposed a Bayesian probabilistic inferential framework to identify the contaminant event sources in river by combining with advectiondispersion equation given a finite monitoring data.
Regarding both measurement and model errors in the probability density function, the method was coupled with a fast stochastic sampling algorithm to promote the convergence efficiency for quickly determining the source term parameters. Results of the two case simulations showed that the proposed methodology was effective and robust for source event identification on chemical spill in river environment. The posterior mean errors for all source parameters were lower than 5%, which demonstrated the utility of the approach for inverse source identification.