A concept of the air quality monitoring system in the city of Lublin with machine learning methods to detect data outliers

This paper presents a concept of an air quality monitoring system design and describes a selection of data quality analysis methods. A high level of industrialisation increases the risk of environmental hazards related to pollution, such as air pollution by gases and dust (carbon monoxide, sulphur oxides, nitrogen oxides). That is why research related to monitoring this type of phenomena is extremely important. Low-cost air quality sensors are increasingly used to monitor air parameters in urban areas. Sensors of this type are used to obtain an image of the spatiotemporal variability in the concentration of air pollutants. Despite the advantage of their low price, which is important from the point of view of economic accessibility to society, low-cost sensors are prone to produce erroneous results compared to professional air quality monitors. The described study focuses on the analysis of outliers as particularly interesting for further analysis, as well as on modelling with machine learning methods for air quality assessment in the city of Lublin.


Introduction
With the development of urban conurbations and industry, air pollution has become a serious problem. Poor air quality caused by energy sources (in particular, furnaces fired with poor-quality coal) affects not only city dwellers but has also become a serious issue for residents of villages and small towns.
Air quality has a fundamental impact on the quality of life and health of the inhabitants of the globe [1]. According to World Health Organization (WHO) reports, air pollution is a major contributor to the increase in respiratory disease. Each year, around 4.2 million people die globally due to poor air quality. In Poland, according to data published by the WHO, in 2012 more than 26 thousand deaths were recorded that were caused by toxic chemicals contained in the inhaled air [2].
Informing residents about the state of air quality in their region of residence is of great importance both for immediate response to the existing conditions and for education. Residents of a given area who know the exact locations of air quality sensors will be able to protect themselves against harmful conditions or to intervene when a source of pollution is found. However, such actions can only be taken where quality monitors are available in large numbers and work in a connected sensor network. The construction of such a system is possible assuming that the monitoring devices are cheap and generally available.
In the case of the city of Lublin, public data is available from only two stations: the first is located in the city centre on Obywatelska Street, and the second, located in the town of Wilczopole, publishes very limited data (concentrations of O3). The amount of data on the state of air quality in the Lubelskie Voivodeship is very limited: the website of the Chief Inspectorate for Environmental Protection publishes air quality data from just a few stations [3]. Residents of small towns do not even have the opportunity to learn about the state of the air they breathe. The problem intensifies in the winter months, when furnaces with very low energy efficiency, fired with hard coal, are most heavily used [4].
A positive phenomenon observed recently is the increase in public awareness of the impact of air quality on health. This implies further phenomena: the desire to switch to greener energy sources and the desire to monitor the purity of air in one's own surroundings [5]. The general availability of tools for building Internet of Things (IoT) systems from low-cost sensors that collect air quality data makes it possible to popularise applications monitoring the state of air in local communities. Considering the small amount of public data available from sensors belonging to governmental institutions, a favourable environment is created for building air quality monitoring systems based on community networks of cheap air quality monitors. In addition to the main objective of obtaining information on current air quality and atmospheric conditions, such systems will in the future enable the collection of large amounts of historical data, enabling the creation of models describing environmental changes, air quality, and atmospheric conditions. Moreover, such systems will in the future form a part of smart city systems.
The use of generally available air quality monitors is associated with their low price, but also with quality that is lower than that of professional air quality monitoring stations. Therefore, it is important to develop a system that can distinguish genuinely variable measurement values from measurement errors under highly variable atmospheric conditions. In this article, the authors describe the assumptions of a system collecting measurement data from basic sensors measuring selected gas compounds and present practical methods for capturing outliers in time series data.

Related works
Techniques for detecting outliers can be divided into four groups: (1) statistical approaches, the earliest algorithms used for outlier detection; (2) distance-based and density-based methods; (3) profiling methods; and (4) model-based approaches. In statistical techniques, data points are usually modelled using a stochastic distribution, and points are marked as outliers depending on their relationship to this model. The general idea is that a certain kind of statistical distribution is assumed to be known, and its parameters (e.g. mean and standard deviation) are computed under the assumption that all data points have been generated by that distribution. In the statistical approach, outliers are points that exhibit a low probability of being generated by the overall distribution (e.g., they deviate by more than 3 standard deviations from the mean). However, in practice, the use of statistical methods has serious limitations, related to the fact that in many cases it is difficult to determine which statistical distribution is being dealt with. Another problem is that the mean value and standard deviation are themselves highly sensitive to the outliers that are present. For this reason, such methods can be treated as global, analysing data against one specific and known distribution.
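As a minimal sketch of the statistical approach, the following example flags readings that deviate from the sample mean by more than k standard deviations (the values and threshold are illustrative, not drawn from the measured data):

```python
import numpy as np

def three_sigma_outliers(values, k=3.0):
    """Flag points deviating more than k standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0.0:
        return np.zeros(values.shape, dtype=bool)  # constant series: no outliers
    return np.abs(values - mean) > k * std

# a flat series of readings with one injected spike at index 5
readings = [21.0, 22.5, 20.8, 21.9, 22.1, 95.0, 21.4]
mask = three_sigma_outliers(readings, k=2.0)
print(np.flatnonzero(mask))  # [5]
```

Note that with k=3.0 the same spike would go undetected, because the spike itself inflates the mean and standard deviation, which is exactly the sensitivity problem mentioned above.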
Distance-based approaches detect outliers by calculating the distances between points. Cluster-based techniques can also be used to detect outliers: points that do not belong to any cluster, or clusters that are much smaller than the others. The basic assumptions of this technique are that normal data objects have a dense neighbourhood, whereas outliers are far apart from their neighbours, i.e., have a much less dense neighbourhood. Similarly, density-based approaches compare the density around a point with the density around its local neighbours; the relative density of a point compared to its neighbours is computed as an outlier score. The density-based approach was proposed because distance-based outlier detection models have problems with neighbourhoods of different densities. Density-based approaches use the following models: the reachability distance; the local reachability density (lrd) of a point p; and the local outlier factor (LOF) of a point p. The LOF detection algorithm has many variations, one of them being the top-n exploration of local outliers. In this case, the method relies on compressing data points into micro-clusters using the clustering features (CFs) of balanced iterative reducing and clustering using hierarchies (BIRCH) [6]; next, upper and lower bounds of the reachability distances, lrd values, and LOF values are derived for points within the micro-clusters. The following step is to compute upper and lower bounds of the LOF values for the micro-clusters and to sort the results with respect to the ascending lower bound. Finally, micro-clusters that cannot accommodate points among the top-n outliers (the n highest LOF values) are pruned. In the literature on methods for analysing sets for outliers, we can find further proposals for coefficients determining outliers.
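The distance-based idea can be sketched in a few lines of NumPy: for each point, the mean distance to its k nearest neighbours serves as the outlier score (this is the plain kNN-distance variant, not the BIRCH micro-cluster algorithm described above):

```python
import numpy as np

def knn_outlier_scores(points, k=2):
    """Distance-based outlier score: mean distance to the k nearest neighbours.
    Points in dense neighbourhoods score low; isolated points score high."""
    pts = np.asarray(points, dtype=float)
    # pairwise Euclidean distance matrix
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # exclude each point's self-distance
    knn = np.sort(dist, axis=1)[:, :k]      # distances to the k nearest neighbours
    return knn.mean(axis=1)

# four points on a unit square plus one far-away point
cluster = [[0, 0], [0, 1], [1, 0], [1, 1]]
outlier = [[8, 8]]
scores = knn_outlier_scores(cluster + outlier, k=2)
print(scores.argmax())  # 4: the isolated point
```

The weakness this sketch shares with all purely distance-based scores is visible immediately: the threshold separating "high" from "low" scores depends on the local density, which is what motivated LOF and its relatives.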
The local outlier correlation integral (LOCI), proposed in [7], follows the idea of LOF with the following differences: it takes the ε-neighbourhood instead of the kNN as the reference set, and it tests multiple resolutions (called "granularities") of the reference set in order to get rid of any input parameter.
In profiling methods, normal behaviour profiles are built using different data mining techniques or heuristic-based approaches, and deviations from them are considered outliers. Model-based approaches usually first characterise normal behaviour using some predictive model (e.g. replicative neural networks or supervised support vector machines) and then detect outliers as deviations from the trained model. Neural-network approaches are generally non-parametric and model-based; they generalise well to unseen patterns and are capable of learning complex class boundaries. After training, the neural network forms a classifier.

Outlier detection in air pollution models
The interest in outliers in air quality measurements stems from the desire to develop air quality models with an understanding of the role outliers play in them. Methods of detecting outliers in a data set were used in the research presented by Brendan O'Leary et al. [8]. In their study, simultaneous air pollution measurements were performed in Detroit, Michigan, USA and Windsor, Ontario, Canada in 2008 and 2009, and forty-eight outliers were identified using four independent methods: box plots, variogram clouds, difference maps, and the local Moran's I statistic. The selected outliers were used to review models of air pollution. Due to the nature of the data sent by air quality sensors, it is important to analyse time series for outliers. One such method is described in [9]. The presented algorithm, named incremental LOF, provides detection accuracy equivalent to that of the iterated static LOF algorithm, but requires significantly less computation time. In addition, the incremental LOF algorithm dynamically updates the profiles of data points. This is a highly significant property because data profiles are susceptible to variation over time. The incremental LOF algorithm was found to be computationally efficient and, at the same time, very effective in detecting outliers and changes in distribution behaviour in various data stream applications.

The IoT solution design for data collection system
The central point of the IoT system is a software integration platform that provides instruments for communication and management of smart devices. The platform consists of the following elements:
• communication layer,
• control software,
• cross-platform libraries and clients.
The first two elements of the system provide the tools essential for managing the entire system. The libraries and clients allow developing applications that meet various needs and building a monitoring system. From the technical point of view, the platform is scalable and based on microservices; it is implemented on a local hardware platform but designed primarily to run on cloud platforms. The entire IoT platform is supplemented with APIs for device management over various protocols, which enable configuring and monitoring device connectivity, as well as analysing and controlling device behaviour. The use of the integration platform provides support for the entire process of device management and data processing: from data transfer, validation, and collection to processing using machine learning and artificial intelligence algorithms. Fig. 1 presents the types of accounts registered within the integration platform. Typical tasks for client accounts are related to receiving messages and sending commands to devices, which involves the ability to view the devices associated with a given account and their status. Managing resources related to users, networks, and devices is associated with the administrator account. It is worth mentioning that device accounts are grouped by network, so the first activity belonging to the administrator when defining device accounts is to prepare the network. Then the administrator, having provided the user with the key to the given network, authorises automatic or manual addition of devices by means of a script. The third type of account is a device account. Most often, devices send messages to the system or receive commands from it. If the administrator allows automatic registration of devices, one of the device tasks may be self-registration. For sending and receiving messages to and from the platform, measuring devices may use RESTful interfaces, WebSocket, or the MQTT protocol.
All communication takes place via messages transmitted in JSON. The RESTful communication service is equipped with Swagger tooling, which provides a description of the REST interface and its methods and can be used, for example, for testing the integration platform.
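A device-side message of this kind can be sketched as below; the field names (`deviceId`, `notification`, `parameters`) are illustrative placeholders, not the platform's actual schema, and the transport (REST, WebSocket, or MQTT) is left out:

```python
import json
import time

def build_measurement_message(device_id, readings):
    """Serialise one set of sensor readings as the JSON notification
    a device would send to the integration platform."""
    return json.dumps({
        "deviceId": device_id,        # hypothetical field names for illustration
        "timestamp": int(time.time()),
        "notification": "measurement",
        "parameters": readings,
    })

msg = build_measurement_message(
    "lublin-station-1",
    {"pm25": 18.4, "pm10": 31.2, "o3": 44.0, "temperature": 7.5, "humidity": 81},
)
decoded = json.loads(msg)  # round-trips cleanly as JSON
```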

Device support
The integration platform can communicate with devices via REST interfaces, WebSocket, or the MQTT protocol. Almost any device that supports one of these protocols can be connected to the platform; the device programmer only needs to create the logic inside the device that correctly implements the platform interfaces. Devices with Python, Node.js or Java language support, such as Linux-based hardware platforms, Android Things devices, etc., can be easily connected by installing the platform client library.

Implementation of the integration platform
Due to the assumptions of the project, among which are the need to launch the system as Software as a Service (SaaS) and the wide scalability of the final solution, the integration platform and the elements of the user's environment should support a wide range of deployment methods. Among the methods of deploying the integration platform are, for example, Docker, docker-compose, manual installation, and Kubernetes. The use of the last of the mentioned options, i.e. Kubernetes, enables the system to be launched along with the integration platform as a cloud computing service at providers such as Microsoft and Google.

Storage of measurement data
The integration platform is built on a microservices architecture, and the database is connected to the system as a plug-in.
The Apache Cassandra database is used for data storage. Apache Cassandra is a distributed database created in 2008 at Facebook. It has been designed to cover all areas of the non-functional requirements. The main features of the solution are scalability, replication and data security, no single point of failure (SPOF), and transactionality and data integrity.
The plug-in that attaches the Apache Cassandra database to the integration platform allows the commands and notifications passing through the platform to be stored in the database. The plug-in consists of two parts: one is responsible for creating the schema, and the second provides the proper plug-in definition in a .yaml file used for deployment with docker-compose. In the first phase of operation, launching the service includes creating the tables and user-defined types (UDTs) from a data description transferred in JSON; the plug-in then verifies that the schema has been created, polling at a fixed interval for a set number of checks. The schema creation service must always be located on only one node. This prevents the schema from being modified concurrently, which would result in a crash and an exception in the Apache Cassandra system.
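The interval-and-retry check described above can be sketched as follows; this is a simplified illustration of the polling logic, not the plug-in's actual code, and `check_schema` stands in for whatever call verifies that the tables and UDTs exist:

```python
import time

def wait_for_schema(check_schema, attempts=5, interval=0.1):
    """Poll a schema-existence check a fixed number of times, sleeping
    `interval` seconds between attempts, as in the plug-in's start-up phase."""
    for _ in range(attempts):
        if check_schema():
            return True        # schema is in place; the service may proceed
        time.sleep(interval)
    return False               # give up after the configured number of checks

# toy check that reports success on the third call
state = {"calls": 0}
def fake_check():
    state["calls"] += 1
    return state["calls"] >= 3

ready = wait_for_schema(fake_check, attempts=5, interval=0.0)
```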

The prototype of air quality monitoring device
The prototype for basic air quality monitoring is depicted in Fig. 2.

Fig. 2. Air quality monitoring prototype.
The measuring device consists of two main blocks: the measuring block and the communication block. The measuring block contains sensors that determine the concentrations of PM 2.5 and PM 10 and an ozone concentration sensor. In addition, the temperature and humidity of the air are measured. The limited number of measuring elements is the result of a compromise between the lifespan of the power source and the price of the device. The physical dimensions of the device (160 mm in length, 100 mm in width, and 60 mm in thickness) are dictated mainly by the largest measuring component and by the power supply. In the first phase of the project, the device was equipped with sensors measuring the concentrations of O3, PM 2.5, PM 10, CO2, NOx, NO2, and SO2, and additionally the humidity and temperature of the air. To preserve the simplicity of the solution, the design of the device drew on the idea of an amateur solution published on the Instructables.com website [10]. Important factors for popularising the solution are its simplicity and the availability of all components and, what is more, their low price. In contrast to the original idea, a different particulate matter sensor was used, in this case the Nova Fitness SDS011; the remaining gas sensors are the MQ-2 and MQ-9 carbon monoxide sensors, the MiCS-2714 nitrogen compound sensor, and the MiCS-2614 ozone sensor. A very important addition to the device is the DHT11 temperature and humidity sensor. The Wi-Fi communication module and the integration with the sensors are implemented on a Raspberry Pi Zero. The whole device is powered by a 5 V power source. The method of data transmission determined the locations of the devices, which were tied to the availability of a Wi-Fi network. For testing, the devices were placed in three locations/districts of the city of Lublin (see Fig. 3): Sławin (1), Czechów (2), and Węglin (3). Additionally, Fig. 3 contains the locations of the reference monitoring systems located on Obywatelska St.
(green arrow) and the OPSIS system located at the Lublin University of Technology (blue arrow). The red circles placed around the prototype air quality monitoring stations are only for the sake of legibility of the map. The distance measured in a straight line from prototype station 1 to station 2 is roughly 3.6 km, from station 2 to station 3 it is 6.3 km, and from station 3 to station 1 it is about 4.3 km. The station closest to the Obywatelska reference station was station 2 (1.2 km), and the furthest was station 3 (6.0 km).
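The SDS011 used in the prototype reports readings over a serial link in fixed 10-byte frames. A minimal decoder for such frames, based on the sensor's published frame layout (header 0xAA/0xC0, little-endian PM values scaled by 10, modulo-256 checksum, tail 0xAB), can be sketched as below; the serial-port handling itself is omitted:

```python
def parse_sds011_frame(frame: bytes):
    """Decode one 10-byte SDS011 measurement frame.
    Returns (pm25, pm10) in ug/m3, or None for a malformed frame."""
    if len(frame) != 10 or frame[0] != 0xAA or frame[1] != 0xC0 or frame[9] != 0xAB:
        return None
    if sum(frame[2:8]) & 0xFF != frame[8]:
        return None  # checksum mismatch: discard the reading
    pm25 = (frame[3] << 8 | frame[2]) / 10.0   # bytes 2-3: PM2.5 * 10, little-endian
    pm10 = (frame[5] << 8 | frame[4]) / 10.0   # bytes 4-5: PM10  * 10, little-endian
    return pm25, pm10

# synthetic frame reporting PM2.5 = 15.3 and PM10 = 27.8 ug/m3
payload = [153 & 0xFF, 153 >> 8, 278 & 0xFF, 278 >> 8, 0x01, 0x02]
frame = bytes([0xAA, 0xC0] + payload + [sum(payload) & 0xFF, 0xAB])
print(parse_sds011_frame(frame))  # (15.3, 27.8)
```

Rejecting frames that fail the checksum is a first, purely syntactic line of defence; the outlier analysis described later handles readings that are well-formed but physically implausible.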

Fig. 3. Locations of air quality monitoring devices.

The data processing system part
Data obtained from the measuring devices was verified against reference data from the air quality testing station on Obywatelska St. A statistical description of the 2017 results, as daily averages, was made for the concentrations of O3, PM 2.5, PM 10, and nitrogen and sulphur compounds. Once the data set had been pre-processed so that it represented a point anomaly detection problem, the last step before applying the unsupervised anomaly detection algorithm was normalisation. Typical normalisation methods are min-max normalisation, where each feature is normalised to a common interval (e.g. [0, 1]), and standardisation, where each feature is transformed so that its mean is zero and its standard deviation is one [11]. In practical applications, min-max normalisation is often used, as was the case for the sample of data presented in this paper.
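The min-max step can be sketched in a few lines (the input values are illustrative, not the measured series):

```python
import numpy as np

def min_max_normalise(column):
    """Scale a feature column to the [0, 1] interval before
    feeding it to the outlier-detection step."""
    col = np.asarray(column, dtype=float)
    lo, hi = col.min(), col.max()
    if hi == lo:
        return np.zeros_like(col)   # constant column: no spread to scale
    return (col - lo) / (hi - lo)

pm25 = [12.0, 18.0, 30.0, 24.0]
scaled = min_max_normalise(pm25)   # 12 -> 0.0, 30 -> 1.0, the rest in between
```

One caveat worth noting: min-max scaling is itself driven by the extreme values, so a single gross error in a column compresses all the normal readings into a narrow sub-interval, which is another reason to detect and handle outliers explicitly.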

The prototype setup and preliminary results
Data analysis reveals relationships between the occurrence of one pollutant and another (see Table 1). The positive correlation between the occurrence of PM 2.5 and PM 10 particles is intuitive and supported by publications [12,13]. Temperature affects the concentrations of compounds such as PM 10, PM 2.5, sulphur compounds, and ozone. Particulate matter and sulphur compounds are formed in the air especially in the winter months as products of the combustion of, e.g., coal in energy-inefficient furnaces, so the lower the temperature, the more such compounds are present in the air, hence the observed negative correlation. The opposite applies to temperature and the concentration of ozone in the air: the higher the temperature, the higher the ozone concentration [14,15]. Analysing the histograms shown in Fig. 4 and after performing a normality test for the concentrations of ozone and PM 2.5 particulates, it was found that the ozone concentration is characterised by a normal distribution with a tendency to a positive skew, while the dust concentration does not show a normal distribution.
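The sign of these relationships is easy to check with a Pearson correlation; the series below are invented for illustration (they mimic the winter/summer pattern described above, and are not the measured Lublin data):

```python
import numpy as np

# illustrative daily averages: PM2.5 falls and ozone rises with temperature
temperature = np.array([-2.0,  1.0,  5.0, 12.0, 18.0, 24.0])   # degrees C
pm25        = np.array([48.0, 41.0, 30.0, 19.0, 12.0,  9.0])   # ug/m3
ozone       = np.array([20.0, 24.0, 35.0, 52.0, 68.0, 80.0])   # ug/m3

# off-diagonal entry of the 2x2 correlation matrix is Pearson's r
r_temp_pm25  = np.corrcoef(temperature, pm25)[0, 1]
r_temp_ozone = np.corrcoef(temperature, ozone)[0, 1]
print(round(r_temp_pm25, 2), round(r_temp_ozone, 2))  # strongly negative, strongly positive
```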

Analysis of outliers by means of machine learning methods
In the preceding paragraphs, various methods for detecting outlying points were described. In the case of stream data, which weather data is, distance-based methods for detecting outliers are most often used. However, in the case of LOF or LOCI, the parameter determining whether a point is an outlier is often difficult to interpret. Therefore, to be able to harmonise the data analysis system and to apply the same rules in future data analysis, the LoOP (Local Outlier Probabilities) method was used [16]. The LoOP method detects outliers based on local density but, instead of an unbounded outlier factor, provides an outlier measure in the range [0, 1]. Such a measure can be directly interpreted as the probability that the data object is an outlier. LoOP uses the neighbourhood set for local density estimation. Unlike other algorithms, it calculates this density differently: the basic assumption is that the distances to the nearest neighbours follow a Gaussian distribution. Because the distances are always positive, LoOP assumes a "half-Gaussian" distribution and uses its standard deviation, called the probabilistic set distance. This is used (as in LOF) as a local density estimate: the ratio of each instance's density to that of its neighbours gives the local anomaly result. To convert this result into a probability, normalisation and the Gaussian error function are finally applied. The idea of having a probabilistic result instead of a value that is difficult to interpret is very useful.
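The computation described above can be sketched compactly in NumPy, following the formulation in [16] (a didactic implementation, not an optimised one):

```python
import math
import numpy as np

def loop_scores(points, k=3, lam=3.0):
    """Local Outlier Probabilities (LoOP): returns a value in [0, 1] per point."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]               # k nearest neighbours
    nn_dist = np.take_along_axis(dist, nn, axis=1)
    # probabilistic set distance: lambda * "half-Gaussian" standard distance
    pdist = lam * np.sqrt((nn_dist ** 2).mean(axis=1))
    # PLOF: own probabilistic distance vs. the expected one of the neighbours
    plof = pdist / pdist[nn].mean(axis=1) - 1.0
    nplof = lam * np.sqrt((plof ** 2).mean())          # normalisation factor
    erf = np.vectorize(math.erf)
    return np.maximum(0.0, erf(plof / (nplof * math.sqrt(2.0))))

# 30 points from a standard normal cluster plus one injected outlier
cluster = np.random.RandomState(0).normal(0.0, 1.0, size=(30, 2))
data = np.vstack([cluster, [[10.0, 10.0]]])
probs = loop_scores(data, k=5)
print(probs.argmax())  # 30: the injected outlier
```

The injected point receives by far the highest probability, while the cluster points stay near zero, which is exactly the interpretability advantage argued for above.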
Outliers, in addition to indicating erroneous measurement values, may also reveal abnormal phenomena occurring in the system. For data related to air quality, outliers can be used to determine states related to the occurrence of winter or summer smog. In the case of winter smog, outliers will mainly occur for sulphur compounds and PM 2.5 and PM 10 particles [17]. In the case of summer smog, outliers will occur in the registered ozone concentrations [18]. As normal temperature values decrease, the number of PM 2.5 outliers increases, which indicates the emergence of health hazards. It is also worth noting that when the temperature rises, the phenomenon reverses and the number of outliers increases in the case of the ozone concentration. A more detailed analysis is shown in the following charts.
Using the probability measure of outliers, Fig. 6 (see above) shows outliers for PM 2.5 as circles with a radius depending on the probability measure. The highest values for outliers were recorded in the range from the beginning of the year to around the one-hundredth day, and then from around day 250 to the end of the year, which corresponds to the average heating periods in Poland. In the case of the analysis of outlier values of the ozone concentration (see above, Fig. 7), outliers accompany both winter smog episodes and high concentrations in the summer. The phenomenon of winter episodes associated with the occurrence of outliers was described in [18]; the reason is larger VOC concentrations compared to the summertime urban counterpart, leading to carbonyl photolysis. In the summer months, it is observed in the Lublin area and on the European scale that, due to higher temperatures, reduced cloud cover, and lower rainfall in Europe, there is a higher number of events associated with high ozone concentrations that exceed the information and warning thresholds [20].
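The seasonal split described above amounts to counting high-probability outliers inside and outside the heating period; a toy sketch with invented (day-of-year, LoOP probability) pairs, not the measured series:

```python
def heating_season(day_of_year):
    """True for days in the approximate Polish heating period
    (start of the year to ~day 100, and from ~day 250 onwards)."""
    return day_of_year <= 100 or day_of_year >= 250

# hypothetical (day_of_year, LoOP probability) pairs for PM 2.5
pm25_outliers = [(12, 0.94), (45, 0.88), (170, 0.35), (260, 0.91), (320, 0.97)]

# count confident outliers (probability > 0.5) per season
winter = sum(1 for day, p in pm25_outliers if p > 0.5 and heating_season(day))
summer = sum(1 for day, p in pm25_outliers if p > 0.5 and not heating_season(day))
print(winter, summer)  # 4 0
```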

Future research works
The tests found that the devices are relatively stable in the temperature range from -5 to 35℃, which in the conditions of the studied area is far from adequate, as winter temperatures can fall below -20℃. Therefore, the selected devices must operate in a much wider range of temperatures, especially well below 0℃. Furthermore, the battery life proved too short, ranging from a few to over a dozen hours in the winter months (detailed research into the efficiency of energy sources is planned for the future). Therefore, further research is needed on the optimal selection of measuring devices, measurement data acquisition intervals, and the frequency of data transmission. In the next phase of the project, it is necessary to create a model for forecasting air quality based on the smallest possible number of sensors.

Summary
Threats related to the quality of inhaled air are becoming ever more widely discussed issues, due to their impact on human health but also due to rising social awareness. There are few sources of public information on air quality in the Lubelskie Voivodeship. Therefore, it seems purposeful to create systems that can monitor air quality in any region using generally available, sometimes amateur, electronic measuring devices. However, such a solution brings many problems related to the quality of the data entering the system. In such a case, machine learning methods can be used to clean the data. The examples presented here, in which outliers were analysed for various air pollutants, were produced using the LoOP method. The applied method, in addition to the ease of interpretation of its results, is useful in the analysis of data streams. The use of methods for detecting outliers will make it possible to maintain the assumed quality of the input data and, in the further phase of the analysis, can be used to determine anomalies occurring in the composition of the air that affect the deterioration of its quality in the winter and summer seasons.