Development of Seasonal ARIMA Models for Traffic Noise Forecasting

In this paper, a time series analysis approach is adopted to monitor and predict a dataset of traffic noise levels measured at a site in Messina, Italy. In general, acoustical noise is difficult to predict, since its behaviour is strongly related to the variability of the sources and to intrinsic randomness. At the analysed site the predominant source is road traffic, which has a periodic and non-stationary behaviour. Studying the time evolution of this hazardous agent is very useful for assessing its impact on human health and activities. The time series models adopted in this paper belong to the stochastic seasonal ARIMA class; these models exploit the strong periodicity registered in the acoustical equivalent levels. The observed periodicity is related to the high variability of urban traffic across the days of the week. Three different seasonal ARIMA models are proposed and calibrated on a rich dataset of 800 sound level measurements. The predictive capabilities of these techniques are encouraging. The implemented models show good forecasting performance in terms of low residuals, i.e. the differences between observed and estimated noise values. The residuals are analysed by means of statistical indexes, plots and tests.


Introduction
The health of the population of urban areas is often threatened by the action of different pollutants [1]. Among them, the most common and dangerous are gases from combustion phenomena, high intensity electromagnetic fields and acoustical noise produced by transportation infrastructures and other human activities [2]. It is therefore essential to monitor these pollutants constantly, but such monitoring is generally expensive and not always easy to implement. In addition, mitigation actions on pollution sources are usually adopted only when the levels of harmful agents are particularly high and hence have already affected the health of citizens. High and dangerous acoustic levels are mainly caused by anthropic activities, in particular vehicular traffic and other transportation infrastructures. It is therefore useful to implement analytical models that can provide a reliable assessment of pollution levels (see for instance [3][4][5][6][7][8][9][10][11][12][13]). If a reliable forecast is available, it is possible to limit the exceedance of given pollution levels by mitigation measures, also acting on the sources, before physical agents can affect the population. The authors have contributed extensively to transportation noise assessment and prediction (see for instance [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30]).
Widely adopted forecasting models are often based on the study of correlations or causal effects that influence the sources of the noise pollution levels. However, due to the nature of the physical phenomenon, in the case of acoustic noise it is very difficult to predict the effects in a restricted area by studying only the sources. Such a method can be strongly influenced either by the architecture of the area where the measurements are acquired or by other environmental interferences, which are random and variable over time. Therefore, there is increasing interest in predictive models that exploit the information contained in measurements at the receiver: these techniques include analytical models of time series. The authors have developed and improved various deterministic and stochastic models useful for modelling and predicting univariate time series, see for example [31][32][33][34][35][36][37].
In this paper, three different time series models, useful for predicting the noise level in urban areas, have been developed. The predictive models considered here are of the stochastic class: in particular, three different types of Seasonal Auto-Regressive Integrated Moving Average (SARIMA) functions have been developed. A SARIMA model can predict the evolution of noise levels for a given time interval in a specific area of interest, i.e. the area where the data used to estimate the parameters (calibration) of the forecasting function have been acquired.
A set of noise measurements recorded during the daytime in the city of Messina, Italy, is used to calibrate the models. These data consist of daily equivalent sound pressure levels ($L_{A,eq}$), averaged over sixteen hours (from 6:00 a.m. to 10:00 p.m.).
To determine the best model to describe the analysed time series, a complete statistical characterization of the 800 measured data was carried out. The noise level series is characterized by a high autocorrelation at a seven-day lag: this characteristic is due to the strong dependence of vehicular traffic on the day of the week. In fact, since traffic flows on Saturdays and Sundays are lower than during working days, the measured acoustic levels were significantly influenced by this periodicity and showed lower values during the weekend than on other days of the week. Because of this periodic behaviour, seasonal stochastic models are adopted that take into account the strong periodicity of the series studied. Due to the non-stationary nature of the data, some models adopt differencing operators, which make them of the "integrated" type. To identify the best model, a comparison between the measured data and the forecasted levels is performed. Therefore, in the final part of the paper, the analysis of the residuals is carried out both qualitatively by graphs and quantitatively using different error metrics. The three models provide a good approximation of the observed series and may indeed be useful to describe and predict the acoustic noise at the studied site.

Methods
In many scientific fields, it is useful to mathematically describe and predict the evolution over time of a given variable under study. This univariate time series can be modelled, for instance, by a deterministic decomposition model, able to extend the forecast to many periods in the future. This type of model has been widely adopted by the authors for the study of acoustic noise [31][32][33][34][35], concentration of gaseous pollutants [36] and the evolution of electricity consumption [37].
However, when the studied phenomenon presents rapid fluctuations and a short-term forecast is useful, a stochastic model can be more suitable. In this paper, three different stochastic models of the auto-regressive moving-average type have been implemented on an acoustical noise dataset, in some cases also using differencing operators. The proposed models are of the seasonal class, so they exploit the weekly periodicity present in the analysed data. Therefore, the adopted models belong to the multiplicative Seasonal ARIMA type, generally indicated by the acronym ARIMA (p, d, q)x(P, D, Q)_s, where p indicates the degree of the autoregressive polynomial, d indicates the number of differences applied and q indicates the degree of the moving average polynomial. The seasonality period is indicated by the number s, with its seasonal autoregressive (P) and moving average (Q) polynomials, and D seasonal differences.
In general, such a model is defined as a model with AR characteristic polynomial $\phi(x)\Phi(x)$ and MA characteristic polynomial $\theta(x)\Theta(x)$ [38], where:
$$\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - \dots - \phi_p x^p, \qquad \Phi(x) = 1 - \Phi_1 x^s - \Phi_2 x^{2s} - \dots - \Phi_P x^{Ps}$$
$$\theta(x) = 1 - \theta_1 x - \theta_2 x^2 - \dots - \theta_q x^q, \qquad \Theta(x) = 1 - \Theta_1 x^s - \Theta_2 x^{2s} - \dots - \Theta_Q x^{Qs}$$
The forecast is built using the latest available data in the series, so the model is able to easily track the time fluctuations of the studied series by adapting rapidly to data changes. In addition, only a few coefficients need to be estimated to construct the model function, so the principle of parsimony is respected. In this paper, the method of likelihood function maximization is adopted to estimate the model coefficients, using the 800 acoustical data measured in Messina in the calibration phase.
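As a small illustration of the multiplicative structure described above, the following Python sketch multiplies a non-seasonal and a seasonal polynomial of the form 1 - c x^s. The helper names (poly_mul, seasonal_poly) and the coefficient values are hypothetical and chosen only for illustration; they are not the estimates obtained in this study.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def seasonal_poly(coeffs, s):
    """Build 1 - c1*x^s - c2*x^(2s) - ... as a coefficient list."""
    poly = [0.0] * (len(coeffs) * s + 1)
    poly[0] = 1.0
    for k, c in enumerate(coeffs, start=1):
        poly[k * s] = -c
    return poly


# Hypothetical coefficients (assumption, for illustration only).
theta, Theta, s = 0.5, 0.8, 7

# Non-seasonal MA polynomial (period 1) times seasonal MA polynomial (period s).
ma_poly = poly_mul(seasonal_poly([theta], 1), seasonal_poly([Theta], s))

# Keep only the powers of x with a nonzero coefficient.
nonzero_lags = {lag: c for lag, c in enumerate(ma_poly) if c != 0.0}
print(nonzero_lags)
```

With one non-seasonal and one seasonal coefficient and s = 7, the product has nonzero terms only at powers 0, 1, 7 and 8, which is why a multiplicative model of this kind involves the lags 1, 7 and 8.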

Models accuracy evaluation
Model diagnostics is concerned with testing the goodness of fit of a proposed predictive technique. An effective methodology to test model performance is based on residual analysis: graphs and plots, statistical indexes and error metrics are good strategies to test model adequacy. Estimated residuals ($\hat{e}_t$), in the ideal situation, correspond to the stochastic component of a time series ($e_t$), which accounts for the irregularity of the dataset and is not deterministically predictable.
Estimated residuals may be computed in the calibration phase of the modelling process, i.e. when the measurements are available, as the difference between actual data and predicted values at a given period t:
$$\hat{e}_t = Y_t - \hat{Y}_t \qquad (3)$$
where $Y_t$ is the measured level and $\hat{Y}_t$ is the predicted level at period t. The general assumption is that the irregular term $e_t$ is normally distributed and, if the model is correctly specified and the parameter estimates are reasonably close to the true values, the estimated residuals should appear distributed like white noise. They should behave roughly like independent, normally distributed variables with zero mean [38].
In the next sections, a comparison between the residuals of the three proposed SARIMA models, calibrated in the same data range, will be presented.
In order to estimate the accuracy of the models, the statistical features of the estimated residuals will be studied. Descriptive plots of the computed residuals will be shown to perform a qualitative analysis; frequency histograms will be presented together with quantile-quantile plots and autocorrelation plots. A quantitative analysis will be performed using residual statistics, namely mean, standard deviation, median, minimum and maximum values. In addition, skewness and kurtosis indexes will be calculated to evaluate the normality of the distributions.
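The residual statistics listed above can be computed along the following lines. This is a minimal Python sketch, assuming moment-based skewness and (non-excess) kurtosis, for which a normal distribution gives 0 and 3 respectively; the paper does not state which estimators were used, and the function name and the example residuals are hypothetical.

```python
import statistics


def residual_summary(residuals):
    """Descriptive statistics of a list of residuals (observed minus predicted)."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    return {
        "mean": mean,
        "std": statistics.stdev(residuals),  # sample standard deviation (n - 1)
        "median": statistics.median(residuals),
        "min": min(residuals),
        "max": max(residuals),
        "skewness": m3 / m2 ** 1.5,          # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,            # 3 for a normal distribution
    }


# Hypothetical residuals in dBA, for illustration only.
example = [-2.0, -1.0, 0.0, 1.0, 2.0]
summary = residual_summary(example)
print(summary)
```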
To evaluate the bias of the residuals and their dispersion around the mean, two quantitative error metrics are used, the "Mean Percentage Error" (MPE) and the "Coefficient of Variation of the Error" (CVE), according to the definitions reported in [32].
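The definitions of MPE and CVE are those of [32] and are not reproduced in this text. The Python sketch below therefore relies on an assumption: MPE is taken as the mean of the percentage errors, and CVE as the root of the squared errors summed and divided by n - 1, normalised by the mean observed level. The function names and the example values are hypothetical.

```python
import math


def mpe(actual, predicted):
    """Mean Percentage Error (assumed definition): average of 100 * (A - F) / A."""
    n = len(actual)
    return 100.0 / n * sum((a - f) / a for a, f in zip(actual, predicted))


def cve(actual, predicted):
    """Coefficient of Variation of the Error (assumed definition)."""
    n = len(actual)
    sse = sum((a - f) ** 2 for a, f in zip(actual, predicted))
    mean_actual = sum(actual) / n
    return math.sqrt(sse / (n - 1)) / mean_actual


# Hypothetical observed and predicted levels in dBA.
actual = [70.0, 72.0, 68.0, 71.0]
predicted = [71.0, 71.0, 69.0, 70.0]
mpe_value = mpe(actual, predicted)
cve_value = cve(actual, predicted)
print(mpe_value, cve_value)
```

Under these assumed definitions, an MPE close to zero indicates that over- and underestimates roughly cancel out, while a small CVE indicates that the errors are small compared with the mean level.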
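An effective measure of forecast accuracy is also the "Mean Absolute Scaled Error" (MASE) [39]. The MASE for seasonal time series is computed according to the following formula:
$$\mathrm{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n} |\hat{e}_t|}{\frac{1}{n-k}\sum_{t=k+1}^{n} |Y_t - Y_{t-k}|}$$
where $\hat{e}_t$ is the estimated residual, $Y_t$ is the level observed at period t and n is the number of observations. The denominator uses as "naïve" model [40] the value measured in the series k periods before period t, assuming that period t replicates the value observed at time t-k.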
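A minimal Python sketch of the seasonal MASE follows. It assumes the usual form of the metric, in which the mean absolute error of the model is divided by the mean absolute error of the seasonal naïve forecast; whether the numerator is averaged over all n periods or only over the last n - k is a detail not stated here, and this sketch uses all n. The function name and the data are hypothetical.

```python
def seasonal_mase(actual, predicted, k=7):
    """Mean Absolute Scaled Error with a seasonal naive benchmark of period k."""
    n = len(actual)
    # Mean absolute error of the model under evaluation.
    mae_model = sum(abs(a - f) for a, f in zip(actual, predicted)) / n
    # Mean absolute error of the naive forecast "value at t equals value at t - k".
    mae_naive = sum(abs(actual[t] - actual[t - k]) for t in range(k, n)) / (n - k)
    return mae_model / mae_naive


# Hypothetical short series with period k = 2, for illustration only.
actual = [70.0, 68.0, 71.0, 69.0, 72.0, 70.0]
predicted = [70.5, 68.5, 70.5, 69.5, 71.5, 70.5]
mase_value = seasonal_mase(actual, predicted, k=2)
print(mase_value)
```

A value below 1 means that, on average, the model errors are smaller than those of the seasonal naïve forecast.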
Since the parameters of the models are estimated by likelihood maximization, Akaike's Information Criterion (AIC) is also used to evaluate model performance. This criterion suggests selecting the model that minimizes:
$$\mathrm{AIC} = -2\log(\text{maximum likelihood}) + 2k$$
where k = p + q + 1 if the model has an intercept or a constant term, and k = p + q otherwise.
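The AIC computation can be sketched in a few lines of Python. The log-likelihood values below are hypothetical, and p and q are taken here as the total numbers of autoregressive and moving average coefficients; how the seasonal coefficients are counted is not stated in the text, so this is an assumption.

```python
def aic(log_likelihood, p, q, has_intercept):
    """Akaike's Information Criterion: -2 * log-likelihood + 2 * k."""
    k = p + q + (1 if has_intercept else 0)
    return -2.0 * log_likelihood + 2 * k


# Hypothetical maximised log-likelihoods, for illustration only.
aic_without_intercept = aic(-1200.0, p=0, q=2, has_intercept=False)
aic_with_intercept = aic(-1200.0, p=1, q=1, has_intercept=True)
print(aic_without_intercept, aic_with_intercept)
```

For equal likelihood, the model with fewer estimated coefficients obtains the lower AIC and is therefore preferred.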

Data analysis and model specification
An acoustical dataset related to the city of Messina, located in Sicily, in the south of Italy, has been analysed with the proposed forecasting models. The main sources of noise pollution are car traffic and the typical anthropic activities of a medium-sized city. Indeed, the city of Messina has about 240,000 inhabitants and, among the various pollution problems typical of a medium-sized urban agglomeration, also shows persistent acoustical noise caused by vehicular traffic. In different areas of the city the local administration has set up noise monitoring stations, in which fixed and mobile sound level meters are installed. Therefore, time series of acoustical levels are available to study the noise pollution phenomenon. In this paper, the authors have considered the daily equivalent level, weighted with the "A" weighting curve, registered at the site of "Via La Farina".
The measurement of road traffic noise was carried out by the "Environmental Monitoring Service" of Messina.
The noise series used for this study refers to the daytime period, from 6:00 a.m. to 10:00 p.m., and 800 data are used in the calibration phase of the models. This time series is composed of 628 measured data and 172 data imputed using the technique described in [41]. The observational period starts on the 22nd of April 2008 and finishes on the 30th of June 2010. The summary statistics of the data are shown in Table 1. The observed mean level of about 71 dBA is very high considering the urban residential area under study and, given the low standard deviation and spread, this level is persistent during the observation period.
In Fig. 1, the analysed series is plotted in the time domain: the seasonal nature of the acoustical level is evident. The weekly periodicity is also confirmed by the correlogram of the observed data, which shows the maximum autocorrelation at a lag of seven days (Fig. 2).
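The correlogram mentioned above plots the sample autocorrelation as a function of the lag. A minimal Python sketch of this computation is given below; the series is synthetic (a hypothetical weekly pattern with lower levels on the last two days plus a small deterministic fluctuation) and does not reproduce the Messina measurements.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


# Synthetic weekly pattern in dBA (assumption, for illustration only).
week = [71.5, 71.6, 71.4, 71.7, 71.5, 69.8, 68.9]
levels = [week[t % 7] + 0.3 * math.sin(t) for t in range(140)]

r = acf(levels, 14)
# Lag (excluding lag 0) at which the autocorrelation is highest.
best_lag = max(range(1, 15), key=lambda k: r[k])
print(best_lag)
```

For a series dominated by a weekly cycle, the highest autocorrelation after lag zero is found at lag 7, which is the behaviour described for Fig. 2.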
To infer the structure of a stochastic process from a time series of that process, it is necessary to make some simplifying and reasonable assumptions; the most important of these is stationarity [38]. Thus, to build an effective seasonal ARMA model it is useful for the series to be stationary: to achieve this goal, the differencing technique is adopted. Figure 3 shows the autocorrelation plots of the series after three different differencing choices: 3(a) refers to a difference at lag one; 3(b) refers to a difference at lag seven (seasonal difference); 3(c) refers to the series after a first difference at lag one and a second difference at lag seven.
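The three differencing choices can be reproduced with a short Python sketch. The series below is synthetic (a hypothetical weekly pattern plus a linear trend) and serves only to show the effect of each operation; the function name is also hypothetical.

```python
def difference(series, lag=1):
    """Return the differenced series y_t = x_t - x_(t-lag)."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic levels in dBA: weekly pattern plus a slow linear trend (assumption).
week = [71.5, 71.6, 71.4, 71.7, 71.5, 69.8, 68.9]
levels = [week[t % 7] + 0.01 * t for t in range(28)]

d1 = difference(levels, lag=1)    # difference at lag one, as in Fig. 3(a)
d7 = difference(levels, lag=7)    # seasonal difference at lag seven, as in Fig. 3(b)
d1_7 = difference(d1, lag=7)      # lag one followed by lag seven, as in Fig. 3(c)
print(len(d1), len(d7), len(d1_7))
```

In this synthetic case the seasonal difference removes the weekly pattern and leaves a constant equal to the weekly increase of the trend, while the combination of both differences leaves a series that is essentially zero. Each difference at lag k shortens the series by k values.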

SARIMA models details
In the previous section, the time series under study has been described from the statistical point of view. In this section three different SARIMA models are designed and applied to mathematically reproduce the observed acoustical phenomenon. Two models use the differencing technique to obtain a stationary series. All three adopted models belong to the seasonal multiplicative class. Parameter estimation for all the proposed models is obtained using the maximum likelihood method implemented in the R software.

Seasonal autoregressive moving average (0,1,1)x(0,1,1)_7 model
The first model adopted is a Seasonal ARIMA with seasonal lag equal to seven days (s = 7): in the most common notation, the model is of the SARIMA (0,1,1)x(0,1,1)_7 type. Recalling what was presented in Section 2, the model can be formulated as follows:
$$Y_t = Y_{t-1} + Y_{t-7} - Y_{t-8} + e_t - \theta e_{t-1} - \Theta e_{t-7} + \theta\Theta e_{t-8}$$
where $Y_t$ is the noise level at period t, $e_t$ is the irregular term, and $\theta$ and $\Theta$ are the non-seasonal and seasonal moving average coefficients.
Looking at Fig. 3(c), a simple model seems adequate: after taking both a first difference at lag one and a second difference at lag seven, only negative autocorrelations remain, at lags 1 and 7. In Table 2, the numerical values of the two moving average coefficients are reported. These two estimated coefficients have a low standard error with respect to their absolute value, so the hypothesis of null coefficients can be rejected. Figure 4 shows that this parsimonious model is able to correctly reproduce the general behaviour of the studied seasonal series. The measured acoustical level is markedly lower than the predicted one around the 550th period but, in this kind of environmental application, a model that overestimates a pollutant level is usually preferable to one that underestimates it.
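A one-step-ahead forecast for a SARIMA (0,1,1)x(0,1,1)_7 structure can be sketched in Python as follows. The sketch assumes the sign convention in which the moving average polynomials are written as 1 - theta x and 1 - Theta x^7, so that the forecast for period t is Y(t-1) + Y(t-7) - Y(t-8) - theta e(t-1) - Theta e(t-7) + theta Theta e(t-8). The coefficient values, the function names and the zero start-up residuals are hypothetical simplifications and do not correspond to the estimates of Table 2 or to the procedure used in R.

```python
def forecast_one_step(history, residuals, theta, Theta, s=7):
    """Forecast the next value from past levels and past residuals."""
    return (history[-1] + history[-s] - history[-s - 1]
            - theta * residuals[-1]
            - Theta * residuals[-s]
            + theta * Theta * residuals[-s - 1])


def run_one_step(series, theta, Theta, s=7):
    """Apply the forecast recursively over a whole series."""
    # Start-up simplification: the first s + 1 residuals are set to zero.
    residuals = [0.0] * (s + 1)
    forecasts = [None] * (s + 1)
    for t in range(s + 1, len(series)):
        f = forecast_one_step(series[:t], residuals, theta, Theta, s)
        forecasts.append(f)
        residuals.append(series[t] - f)   # observed minus predicted
    return forecasts, residuals


# Synthetic, perfectly periodic weekly series in dBA (assumption).
week = [71.5, 71.6, 71.4, 71.7, 71.5, 69.8, 68.9]
series = week * 4
forecasts, residuals = run_one_step(series, theta=0.5, Theta=0.8)
print(forecasts[8], series[8])
```

On a perfectly periodic series the forecasts coincide with the observations and all residuals vanish; on real data the moving average terms correct each forecast using the most recent errors.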

Seasonal autoregressive (7,1,0)x(0,1,0)_7 model
The second model adopted is again a Seasonal ARIMA with seasonal lag equal to seven days (s = 7); more precisely, the model can be identified as a SARIMA (7,1,0)x(0,1,0)_7 type. In the following formulas, $Y_t$ is the measured level at period t and $y_t$ is the differenced (i.e. "stationarized") series:
$$y_t = Y_t - Y_{t-1} - Y_{t-7} + Y_{t-8}$$
$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_7 y_{t-7} + e_t$$
This second model is based on the assumption that all the useful information is contained in the seven periods of the series preceding the forecasted day. The model adopts both first and seasonal differencing and only autoregressive terms. In Table 3, it is possible to notice that the coefficients of the AR5 and AR6 terms are not statistically different from zero; thus one can conclude that the noise levels measured five and six days before do not significantly affect the forecasts of acoustical levels. The good predictive performance of this model, in one-step-ahead forecasting over the 800 observed days, is shown in Figure 5.
The third model adopted is a SARIMA (0,0,1)x(1,0,0)_7, which is slightly different from the previous ones, since the intercept m is not null. The basic model and the forecast equation are:
$$Y_t - m = \Phi\,(Y_{t-7} - m) + e_t - \theta e_{t-1}$$
$$\hat{Y}_t = m + \Phi\,(Y_{t-7} - m) - \theta \hat{e}_{t-1}$$
Fig. 6. Comparison between the observed 800 calibration data and the levels in the same periods predicted by the SARIMA (0,0,1)x(1,0,0)_7 model.
This third model differs from the others because it does not apply the differencing operator to the time series under study. In general, differencing aims to make the data stationary, and the standard assumption is that stationary series have zero mean. In the adopted model, a nonzero constant mean is accommodated by introducing the intercept term m, in this case about 70.8 dBA, as shown in Table 4.
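The role of the intercept in a SARIMA (0,0,1)x(1,0,0)_7 structure can be illustrated with a short Python sketch. It assumes that m acts as the mean level of the series, that the forecast is m + Phi (Y(t-7) - m) - theta e(t-1), and that the moving average term enters with a minus sign. The value m = 70.8 dBA is the one quoted in the text, whereas Phi, theta, the past levels and the last residual are hypothetical.

```python
def forecast_with_intercept(history, last_residual, m, Phi, theta, s=7):
    """One-step-ahead forecast for a seasonal AR(1) plus MA(1) model around the level m."""
    seasonal_part = Phi * (history[-s] - m)   # deviation from m observed s days before
    ma_part = -theta * last_residual          # correction from the latest forecast error
    return m + seasonal_part + ma_part


# Hypothetical last seven daily levels in dBA; history[-7] is the value one week back.
history = [72.8, 71.6, 71.4, 71.7, 71.5, 69.8, 68.9]
prediction = forecast_with_intercept(history, last_residual=1.0,
                                     m=70.8, Phi=0.6, theta=-0.3)
print(prediction)
```

Because no differencing is applied, the forecast is pulled towards m: when the level of the previous week equals m and the last residual is zero, the forecast is exactly m.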
The third proposed model, with the intercept term, also shows good predictive performance: in Figure 6 the general behaviour of the series is well reproduced.

Model diagnostics and residual analysis
In this section a statistical analysis of the forecasting errors, denoted in the calibration phase with the term "residuals" and defined in formula (3), is performed. In Table 5, measures of central tendency and dispersion for the three models are reported, together with the minimum and maximum values of the residuals and the skewness and kurtosis indexes. These results show that all three models are capable of good forecasting on the calibration dataset: mean and standard deviation are very low and the error distributions are almost symmetrical. In some specific periods, the three models are not able to follow drastic fluctuations of the acoustical noise, so the minimum and maximum values of the residuals exceed 5 dBA in absolute value. An outlier detection and removal analysis could improve these parameters, reducing the magnitude of the minimum and maximum residuals.
In Table 6, the values of the error metrics presented in Section 3 are shown. None of the three models appears to be better than the others: MPE and CVE are always very low, MASE is lower than 1 for all the models and the AIC values are similar.
Table 7 reports the values of the Shapiro-Wilk [42] and Jarque-Bera [43] normality tests. Both tests reject the null hypothesis of a normally distributed sample: rapid fluctuations of the acoustical noise are not well described by the three models, so the tails of the residual distributions deviate from a normal shape. In Figures 7, 8 and 9, three plots that describe the features of the residual distributions are reported for each of the three models. Apart from the tails, the residual distributions of all the models are close to normal, which is a good result, since it is characteristic of random rather than systematic forecasting errors. In Figure 7a, a residual autocorrelation at lags of one and seven days is still present. The second model shows a significant negative autocorrelation in the residuals only at a lag of 14 days (see Figure 8a). Finally, the third adopted model has autocorrelation in the residuals at lags of two and seven days (Figure 9a).
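The Jarque-Bera statistic combines the skewness S and the kurtosis K of the residuals as JB = n/6 (S^2 + (K - 3)^2 / 4) and is compared with a chi-square distribution with two degrees of freedom. The Python sketch below is a minimal illustration on hypothetical residuals, using the asymptotic p-value; it is not the routine used to produce Table 7, and the Shapiro-Wilk test is not covered because it requires tabulated coefficients.

```python
import math


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of order 2, 3 and 4 (divided by n).
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square with 2 degrees of freedom: exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical residuals in dBA, for illustration only.
jb_stat, jb_p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb_stat, jb_p)
```

A small p-value leads to rejecting normality. With samples as large as the 800 calibration data, even moderate departures in the tails of the distribution can produce a rejection.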

Conclusions
In this work a time series of urban noise levels has been analysed. The aim of this analysis was the design of a forecasting model able to monitor and predict noise pollution in any urban area. The studied time series was composed of sound level measurements registered on a large and crowded road of the city of Messina, Italy. Since the analysed series showed an evident periodic behaviour, the three proposed models were of the seasonal ARIMA type. Higher noise levels were measured during the working days of the week while, on Saturdays and Sundays, the reduction of anthropogenic activities (and of traffic flows) resulted in a lower level of noise pollution. The mean value of the measured acoustical levels (70.78 dBA) was quite high considering that the studied location was near many residential buildings.
Two of the three proposed models used the differencing technique to achieve better stationarity of the time series. The 800 daily acoustical levels of the studied time series have been used to estimate the coefficients of the adopted models in the calibration phase of the modelling procedure.
These calibration data have also been used for a quantitative comparison between the performances of the three models. The comparison was performed by means of residual analysis. The three models showed good performance in terms of low standard deviation and close-to-zero mean value of the residual distributions. Frequency histograms and Q-Q plots of the residuals showed that the residuals are approximately normally distributed. However, none of the proposed models was able to follow random and sudden fluctuations of the noise level, so the tails of the residual distributions differed from the normal shape.
Finally, thanks to the MASE error metric, it has been observed that the three proposed models offered a prediction that on average was better than the chosen naïve model. As the naïve model, taking into account the strong autocorrelation at a lag of seven periods, it was assumed that a reference forecast can be obtained by letting period t replicate the value observed at period t-7. The MASE confirmed this remark, since for the three proposed models this error metric is less than one.

Fig. 2. Correlogram plot for the first 800 days of the series. The value of autocorrelation is plotted as a function of the lag.

Fig. 3. Correlogram plots for the first 800 days of the series after differencing. (a) correlogram of the series after a difference at lag one; (b) correlogram of the series after a difference at lag 7; (c) correlogram of the series after a first difference at lag one and a second difference at lag 7.

Fig. 7. Residuals of the first model applied to the 800 calibration data: (a) correlogram plot; (b) histogram; (c) normal probability plot that describes the residuals behaviour compared to a normal distribution.

Fig. 8. Residuals of the second model applied to the 800 calibration data: (a) correlogram plot; (b) histogram; (c) normal probability plot that describes the residuals behaviour compared to a normal distribution.

Table 1. Summary statistics of the 800 observed acoustical levels measured during the calibration period.

Table 5. Summary statistics of the residuals distribution evaluated on the calibration dataset for the three models.

Table 6. MPE, CVE, MASE (error metrics) and AIC values calculated in the calibration phase, for the three different models.

Table 7. Shapiro-Wilk and Jarque-Bera normality tests performed on the residuals of the models applied to the 800 calibration data.