Enhanced public transport management employing AI and anonymous data collection

The paper proposes a simple, economical and expandable solution for enhancing the data collection process used in public transport and transport demand management. A non-intrusive and anonymous method is employed to collect an estimated number of passengers in vehicles and at public transport stops, along with other relevant data. Machine learning and specific algorithms are used to improve the data collection process. No specific infrastructure equipment is required.


Introduction
Nowadays, the urban areas of highly populated cities face traffic congestion, emissions, stress and a lack of solutions for improving the road traffic experience, because the road network infrastructure has limited possibilities to expand its capacity. One of the recommended solutions, also mentioned in several official EU transport policy documents, is modal shift: giving up the use of private cars in favor of public transport [1]. Making public transport attractive, reliable and comfortable is a key factor in achieving this goal. This objective may be reached via a comprehensive set of measures, including broad data collection and information processing regarding public transport management, efficiency and time scheduling, and satisfying the transport demand with a high degree of accuracy. The solution proposed in this paper is a method for anonymous data collection employing BT/Wi-Fi enabled devices, followed by a comprehensive set of statistical filtering, machine learning algorithms and other procedures, aimed at:
- obtaining an estimate of the evolution of the number of passengers transported along the route in a public transport vehicle;
- obtaining an estimate of the transport demand (passengers in stations);
- improved vehicle location, without satellite navigation support;
- support for traffic congestion behavior analysis without the need for infrastructure equipment deployment, and other benefits.

State of art
Modern public transport management (PTM) systems employ on-board and infrastructure equipment for handling vehicle positions, regulatory actions and other specific actions. The transport demand is usually managed via specific sensors installed in stations, buses and other relevant places. Also, the locations of vehicles are collected via on-board GPS-enabled transponders. All that equipment needs considerable maintenance, power supply and/or deployment over wide areas of the infrastructure. However, some alternative techniques have been tested to collect various information from Wi-Fi and BT enabled devices, as the number of different mobile phones and accessories increases daily. Still, despite its huge potential, this technique is not yet receiving enough attention. It is expected that in the near future, with the announced arrival of 5G and C-V2X, the interest in such approaches will increase.

Literature survey
Several papers in the scientific literature address this subject. In [2], the authors explore the potential of using pedestrian data for the evaluation and enhancement of public transportation efficiency. They employ a Wi-Fi/BT tablet and specific software to collect relevant origin-destination information from travelers, with the purpose of improving public transport management in terminals. In [3], N. Abedi et al. present the benefits and critical challenges regarding the usage of Bluetooth and Wi-Fi for crowd data collection and monitoring. They introduce some new concepts, such as discovery time, signal strength analysis, antenna detection range assessment and the multi-range scanning technique. They conclude that crowd data collected by scanning MAC addresses can be matched with crowd data collected by other methods in order to enhance the analysis and monitoring of crowd movement dynamics. They also consider that the implementation of scanning approaches on a large scale can deliver significant information on the space-time dynamics of people's movements. In [4], Naeim Abedi et al. present the benefits and critical challenges of using Bluetooth and Wi-Fi for crowd data collection and monitoring. They mention challenges that include antenna characteristics, the complexity of the environment and scanning features. A. Lesani et al. present in [5] the benefits and drawbacks of employing wireless data collection techniques with Wi-Fi and BT. Also, Y. Malinovskiy [6] shows the benefits of employing such technologies in public transport. Many authors conclude that this technique represents an attractive method for collecting traffic and people movement data.
The remainder of this paper is organized as follows: the next section presents the principle of the proposed solution and some experimental data collection, section IV concerns the proposed algorithms and initial testing, and section V presents the conclusions.

Concept and experimental data collection
The solution proposed here focuses on anonymous data collection and processing for improving public transport management, including location, transport demand and system usage information. Additional information, such as traffic congestion and origin-destination patterns of travelers, can be obtained via more advanced data filtering and post-processing. A BT/Wi-Fi device capable of discovering and recording MAC addresses, time/position and RSSI levels is the only equipment needed. Locally stored data can be read either online, if the sensor is communication-enabled, or downloaded at the depots when vehicles end their tours. The global information processing concept is presented in Figure 1. The process consists of collecting BT and/or Wi-Fi information regarding discoverable devices (usually not phones, but devices connected to phones, such as smart watches, fitness bracelets, headphones, car head units, TV sets etc.), consisting of MAC addresses, RSSI levels, time stamps and location stamps, or other relevant information (such as the name of the device producer, if available).
The first filtering phase establishes a dual perimeter of analysis, designated as "inside" and "outside" the public transport vehicle. Based on the specific RSSI levels received, the discovered MACs are categorized into these two stacks. Of course, a certain number of nodes situated outside the public transport vehicle might fall within the same perimeter as those inside, but a second phase of data analysis is designed to look at the timestamps and presence consistency of all inside nodes, to further eliminate nodes that do not have a permanent presence. Such nodes are expected to be MAC addresses received from pedestrians or from passengers in neighboring vehicles.
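The two filtering phases described above can be illustrated with a minimal Python sketch. It assumes each scan cycle is a dictionary mapping a MAC address to its RSSI in dBm; the function name `classify_nodes`, the -65 dBm threshold and the 60% presence ratio are hypothetical values chosen for illustration and do not come from the field tests.

```python
from collections import defaultdict

def classify_nodes(scans, rssi_threshold=-65, min_presence=0.6):
    """Split discovered MACs into 'inside' and 'outside' sets.

    scans: list of scan cycles, each a dict {mac: rssi_dbm}.
    rssi_threshold, min_presence: illustrative assumptions, not measured values.
    """
    strong_hits = defaultdict(int)   # cycles in which a MAC was above the threshold
    seen = set()
    for cycle in scans:
        for mac, rssi in cycle.items():
            seen.add(mac)
            if rssi >= rssi_threshold:       # phase 1: RSSI perimeter
                strong_hits[mac] += 1
    total = len(scans)
    # phase 2: keep only nodes with a consistent presence over the cycles
    inside = {mac for mac in seen
              if total and strong_hits[mac] / total >= min_presence}
    return inside, seen - inside

scans = [
    {"aa": -50, "bb": -55, "cc": -80},
    {"aa": -52, "cc": -85},
    {"aa": -48, "dd": -60},
]
inside, outside = classify_nodes(scans)
```

In this example only `aa` is strong in enough cycles to be kept as an inside node; `bb` and `dd` pass the RSSI test once but fail the presence test, which mirrors the case of a pedestrian or a neighboring vehicle briefly entering the perimeter.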
Before proceeding to the design of the algorithms, a set of initial data was collected on selected tram and bus lines in Bucharest, Romania, on several workdays, on the same route, with a length of 1.7 km. The first testing purpose was to see whether specific devices, such as TV sets or computers, discovered in fixed positions in buildings near the route, could serve as pinpoints (or "beacons") for an enhanced spatial mapping of the public transport route (shown in Figure 2). The first analysis concerned finding repetitive MAC addresses on the route, for different time periods (days and hours). A sample of the collected data is presented in Figure 3, where the devices that were discovered on several days in the same locations are highlighted with colors and text. As can be seen in Figure 3, there are several devices, such as BT-enabled TVs, that were discovered in several tests on the same route.
Therefore, it can be assumed that these devices could serve as spatial reference points in a machine learning process, in order to build the configuration of the route. In case of GPS location failure, these spatial reference points could serve for a relatively precise location of the public transport vehicle.
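One possible way to extract such reference points is sketched below in Python. The sketch assumes each detection is a pair (MAC, distance along the route in meters) and treats a MAC as static when it falls in the same route cell in a minimum number of runs; the names `find_static_beacons` and `estimate_position`, the 50 m cell size and the three-run minimum are hypothetical choices, not parameters reported in this paper.

```python
from collections import defaultdict

def find_static_beacons(runs, cell_m=50.0, min_runs=3):
    """Return {mac: mean_position_m} for MACs seen in the same route cell
    in at least min_runs separate runs (cell_m and min_runs are assumptions)."""
    cells = defaultdict(lambda: defaultdict(list))  # mac -> cell -> per-run mean positions
    for run in runs:
        per_run = defaultdict(list)
        for mac, pos in run:
            per_run[(mac, int(pos // cell_m))].append(pos)
        for (mac, cell), positions in per_run.items():
            cells[mac][cell].append(sum(positions) / len(positions))
    beacons = {}
    for mac, by_cell in cells.items():
        # pick the cell in which this MAC was seen in the most runs
        _, hits = max(by_cell.items(), key=lambda kv: len(kv[1]))
        if len(hits) >= min_runs:
            beacons[mac] = sum(hits) / len(hits)
    return beacons

def estimate_position(beacons, visible_macs):
    """Fallback location: mean position of the known beacons currently visible."""
    known = [beacons[m] for m in visible_macs if m in beacons]
    return sum(known) / len(known) if known else None

runs = [
    [("tv1", 410.0), ("tv1", 430.0), ("phone", 100.0)],
    [("tv1", 420.0), ("phone", 900.0)],
    [("tv1", 440.0), ("watch", 1200.0)],
]
beacons = find_static_beacons(runs)
position = estimate_position(beacons, {"tv1", "unknown"})
```

A fixed grid of cells is the simplest choice but splits a beacon that sits on a cell boundary; a real implementation would more likely cluster the positions instead.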

The proposed solution and initial testing

4.1. General aspects of technology
The experiments for the purpose of the present paper have been conducted with different Bluetooth equipment. Wi-Fi has also been tested on other occasions, for similar purposes (concerning propagation in tunnels), but this aspect is not the subject of this paper. Bluetooth is a wireless network standard designed for low power consumption and for communication in a limited personal area network (PAN) environment. This technology was not specifically designed to locate objects, but Bluetooth-enabled devices are suitable for localization because they contain a mechanism for identifying neighboring devices and communicating with them. Bluetooth access points are similar to those of Wi-Fi networks but, unlike them, Bluetooth points have a greater communication distance from one another (typically between 10 and 15 meters). The accuracy of the Bluetooth system ranges from 2 to 15 meters. One of the main advantages of the Bluetooth technology is the variable read distance: it is capable of reading at 1 / 10 / 50 m, and thus of locating communication nodes. In addition, it can locate up to 7 objects in a 3 m perimeter due to the master's connection capabilities. Frequency (channel) hopping for device communication can take up to 10 seconds. However, it is not possible to use the RSSI (received signal strength indicator) or the link quality parameters to obtain location results with sufficiently precise measurement. Prior laboratory tests have been conducted to determine some physical characteristics of the BT radio signals.
The following figures show the analysis of BLE (Bluetooth Low Energy) signals, acquired with an Aaronia Spectran HF 6065 spectrum analyzer. Figure 4 presents the spectrum of BLE signals in the range 2.110 GHz to 2.169 GHz, and Figure 5 shows the spectrogram of the signal. Figure 6 illustrates the histogram analysis of the BLE signals, visualizing the energy fingerprints of the communications.

Description of algorithms
The proposed (cluster-type) algorithm is employed by a machine learning subsystem to analyze and sort the received BT/Wi-Fi MACs. The goals of this approach include:
- discovering and memorizing MAC addresses that are repeatedly found in the same locations on the path of the vehicle, with the purpose of re-using them as milestones the next time the vehicle travels on the same route;
- discovering and separating the nodes located inside the public transport vehicle from the other received nodes; this is achieved via an analysis of RSSI parameters and of near-field versus far-field thresholds established by the user;
- performing, if needed, the tracing of specific nodes (this function is used for separating travelers entering or exiting the vehicle, for example at public transport stops); this might be helpful in obtaining information regarding origin-destination patterns of travelers, or in the analysis of service levels;
- performing specific analysis on the outer nodes (in terms of determining the traffic flowing on the road section);
- mapping the results on a specific GIS product.
Grouping (clustering) is the action of partitioning a set of objects into different groups (clusters), where the instances in a group are similar in a specific sense. Clustering is used in many fields, such as machine learning, pattern recognition systems, image analysis, bioinformatics, compression, graphics etc.
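As an illustration of the node-tracing goal listed above, the following Python sketch derives origin-destination counts from the first and last stop at which each inside node was observed. It assumes the nodes were already classified as inside the vehicle and that each sighting is tagged with a stop index; the function name `origin_destination` and the data layout are hypothetical.

```python
from collections import Counter

def origin_destination(sightings):
    """sightings: iterable of (mac, stop_index) for nodes classified as inside.
    Returns a Counter {(boarding_stop, alighting_stop): number_of_nodes}.
    Assumption: a node boards at its first sighting and alights after its last."""
    first, last = {}, {}
    for mac, stop in sightings:
        first[mac] = min(stop, first.get(mac, stop))
        last[mac] = max(stop, last.get(mac, stop))
    # nodes seen at a single stop carry no trip information and are skipped
    return Counter((first[m], last[m]) for m in first if last[m] > first[m])

sightings = [
    ("a", 1), ("a", 2), ("a", 3),
    ("b", 2), ("b", 3),
    ("c", 1), ("c", 3),
    ("d", 2),
]
od_matrix = origin_destination(sightings)
```

Since only the pair of stops is kept in the result, the MAC addresses can be discarded (or replaced by salted hashes) once the counts are computed, which fits the anonymous character of the method.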
Amongst other instruments employed in clustering, a proven stable algorithm for this type of application is the k-means algorithm. In order to ensure the convergence of the algorithm, and to give more realistic conditions to the simulation, more complex initialization techniques have to be applied. In this work the k-means method is defined based on the selection of points according to a probability distribution that penalizes nearby points, using a Gaussian Mixture Regression. This ensures a better traceability of results, because RSSI values are better modeled with Gaussian noise when simulated propagation conditions are employed. The k-means clustering algorithm is a method of determining the clusters that form specific patterns. The procedure is unsupervised training. The number k of clusters is known, being an a priori parameter. Each cluster has a centroid. The algorithm works with k clusters, so k of the points used in the training will be the centers of the k clusters. Since centroid initialization is random, several runs of the algorithm may lead to different results. The implementation of the algorithms has been performed in LabView (Figure 8).
It is possible to correlate these models with a density regression function when Gaussian Mixture models are employed for determining the common density of the data. Assume the only unknowns are comprised in the mean vectors $\mu_i$, $i = 1, 2, \ldots, n$. Thus, the coefficients $\theta_i$ and $\theta$ consist of the elements of $\mu_i$ and $\mu$, respectively. The mixture density is formed as the sum of Gaussian densities for each class, as expressed in (3), whose terms are: the probability of occurrence of the event, conditioned by the conditioning vector; the distance from the centroid to the limit of the class; and a term representing the evolution of data in time.
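Pre-multiplying both sides by $\Sigma_i$ yields:

$$\hat{\mu}_i = \frac{\sum_{k} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k} P(\omega_i \mid x_k, \hat{\mu})} \qquad (4)$$

where $\hat{\mu}$ is the final vector, resulting from all vectors $\hat{\mu}_i$ corresponding to $i = 1, 2, \ldots$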

Analysis of results
The results in (4) illustrate several aspects:
- $\hat{\mu}_i$ is formed as a weighted sum of the $x_k$, where the weight for each sample is $P(\omega_i \mid x_k, \hat{\mu}) / \sum_{k} P(\omega_i \mid x_k, \hat{\mu})$.
- For a sample where $P(\omega_i \mid x_k, \hat{\mu})$ is zero (or small), little is contributed to $\hat{\mu}_i$. The term $\hat{\mu}_i$ may be intuitively chosen, or $\omega_i$ samples can be employed instead. This aspect is presented in the following diagrams. Note that this involves updating the class means by readjusting the weights on each sample at each iteration. This procedure is similar to the k-means clustering algorithm described previously.
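The iterative readjustment of the class means can be sketched in Python for the one-dimensional case. The sketch assumes equal class priors and a shared, known standard deviation, which are simplifications introduced here; the function names, the 4 dB value of `sigma` and the sample RSSI values are hypothetical and are not taken from the LabView model.

```python
import math

def gaussian(x, mean, sigma):
    """Value of a 1-D Gaussian density with the given mean and sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def update_means(samples, means, sigma=4.0, iterations=20):
    """Repeat the weighted-mean update: each class mean becomes the sum of the
    samples weighted by the posterior probability of that class.
    Equal priors and a shared sigma are assumptions of this sketch."""
    means = list(means)
    for _ in range(iterations):
        sums = [0.0] * len(means)
        weights = [0.0] * len(means)
        for x in samples:
            likes = [gaussian(x, m, sigma) for m in means]
            total = sum(likes)
            if total == 0.0:
                continue                      # sample too far from every class
            for i, like in enumerate(likes):
                post = like / total           # posterior of class i for sample x
                sums[i] += post * x
                weights[i] += post
        means = [s / w if w > 0 else m for s, w, m in zip(sums, weights, means)]
    return means

rssi = [-52, -50, -48, -51, -49, -82, -80, -78, -81, -79]
near_mean, far_mean = update_means(rssi, means=[-60.0, -70.0])
```

With these two well-separated groups of readings, the posteriors are close to 0 or 1 and the update behaves almost like the hard assignment of k-means, which is the similarity noted above.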

Interpretation of data leading to extraction of mobility information
Indoor positioning assessment for the development of mobility models can be achieved using several technologies. In this paper, the fingerprinting method based on the signal strength (RSSI) from a Bluetooth network has been chosen as the surveilled element. Fingerprinting means that the surface of the interior space where location is desired is first mapped by measuring the power of the signal received from the Bluetooth nodes and creating a database that will be used later, when a node is to be tracked. Fingerprint clustering is an important step in data preprocessing in order to achieve optimal accuracy and efficiency, and the data needs to be collected prior to the actual location process. Clustering of the collected data was performed employing the k-means algorithm. The results were compared with those obtained from the traditional fingerprinting method. Data acquisition was performed manually using a spectrum analyzer and a computer. The collection process started with the user locating their position on the floor map, and on a map displayed on a computer. Then, the user crossed the entire surface of interest, moving in straight lines along the surface of the enclosure. At the end of each straight trajectory, the user once again marked their position on the map displayed on the computer. The acquisition rate of RSSI was three samples per second. The acquisition was made in two different premises. The collected data was then divided into a learning set and a testing set. Fingerprint clustering can be done in two distinct ways: using 3D fingerprint coordinates, or using RSSI. It can be observed that, in unexpected cases, the clusters are distributed over several levels of analysis, which is explained by the fact that the level of the analysis stage is less than the maximum allowed horizontal length. The positioning error was dependent on the method used for clustering. The tests showed that this method could serve to collect a strong database regarding the presence and movement of different devices, in the two selected "interior" and "exterior" areas of the public transport vehicle. Further field tests will concentrate on the antenna positions and patterns, in order to obtain the best setup for correct data collection.
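For orientation, a plain (non-clustered) fingerprint lookup can be sketched in Python as below. The sketch assumes a k-nearest-neighbor match in RSSI space, a matching rule this paper does not specify; the function name `locate`, the two-node RSSI vectors and the positions are hypothetical.

```python
import math

def locate(fingerprints, observed, k=3):
    """fingerprints: list of ((x, y), [rssi per node]) recorded during mapping.
    observed: [rssi per node] measured for the node to be located.
    Returns the mean position of the k fingerprints closest in RSSI space
    (k-nearest-neighbor matching is an assumption of this sketch)."""
    ranked = sorted(fingerprints, key=lambda fp: math.dist(fp[1], observed))
    nearest = ranked[:k]
    x = sum(pos[0] for pos, _ in nearest) / len(nearest)
    y = sum(pos[1] for pos, _ in nearest) / len(nearest)
    return x, y

database = [
    ((0.0, 0.0), [-40, -70]),
    ((1.0, 0.0), [-45, -65]),
    ((2.0, 0.0), [-50, -60]),
    ((3.0, 0.0), [-55, -55]),
    ((4.0, 0.0), [-60, -50]),
]
estimate = locate(database, observed=[-46, -64], k=2)
```

Clustering the database first, as done in the tests, would restrict this search to the fingerprints of the best-matching cluster instead of scanning every entry.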

Conclusion
In this paper, a methodology to determine the presence and movement patterns of passengers and other relevant elements related to a public transport system has been investigated. The approach is based on the anonymous detection of BT (or Wi-Fi) enabled devices inside and outside a public transport vehicle repeatedly traveling on the same route. Based on statistical filtering and specific algorithms, the data is first sorted to discover static, repetitive nodes on the path, to re-use them in subsequent trips as pinpoints, or location references. Secondly, a set comprised of a k-means algorithm and a Gaussian Mixture model with regression is employed to perform future selection of data from the samples: discovery of nodes inside and outside the vehicle, and tracking and clustering of these nodes to further derive origin-destination patterns and advanced public transport efficiency analysis, such as level of service. The field tests regarding the presence of discoverable BT devices showed that only devices connected to mobile phones, and mobile phones set to discoverable mode by the user, are detectable with simple software (such as BT Analyzer, Blue Scan etc., available in Android Market). Therefore, the measured rate of people carrying discoverable mobile devices proved to be around 5% in the test bed (Bucharest, Romania), still a low value for obtaining a critical mass of information, compared to the real number of persons in the area. However, considering the expansion of mobile technology and connected devices (smartwatches, BT headphones etc.), it is expected that in the very near future this percentage will increase significantly, helping this technology to become more precise. Also, because a public transport vehicle travels repeatedly on the same route, it collects significant amounts of data that, with the help of statistical filtering and mining, improve precision over time. Several field tests have also been conducted to determine the frequency of static node detection. The tests showed a permanent presence of above 76%, depending on the testing hours and days of the week. Laboratory tests have been performed with a spectrum analyzer to determine the reception parameters of a BT receiver, in order to shape the basic elements for the model and to analyze the reception characteristics of BT signals in terms of RSSI time evolution. Further, a model was developed in LabView 15, consisting of unsupervised machine learning employing a clustering method. The obtained results showed the feasibility of the proposed method and possible future development to obtain more functions and information from the collected data. The authors consider that the proposed method could be simply implemented in the public transport system with minimal investment, leading to better transport demand and efficiency management, along with the improvement of public transport comfort, to the benefit of the passengers. This could contribute in the future to the attractiveness of this mode of transportation and to a drastic reduction in the use of personal cars.

Fig. 2. The selected test route for collecting data

Fig. 3. Sample data with repetitive devices on the test route

Fig. 4. Spectral representation for the reception of BLE signals.

Fig. 5. Representing the spectrogram for the received BLE signals

Figure 7 presents the analysis of the RSSI evolution for the received signals, highlighting the maximum (max hold) and minimum (min hold) limits for channel power.

Fig. 8. Software implementation of k-means clustering algorithm (upper diagram) and k-means algorithm

Each point is associated with the cluster determined by the closest centroid. The distance between a point and a center can be calculated, for example, as the Euclidean distance, but other variants can be chosen as well. The flow of the algorithm is:
1. Randomly select k points as the initial centroids.
2. Form k clusters by assigning all points to the closest centroids.
3. Recalculate the centroids as follows: the new centroid is the center of gravity of the cluster points.
4. Repeat steps 2 and 3 until the centroids no longer change.
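The four steps of the algorithm flow can be written as a short Python sketch. It is not the LabView implementation of Figure 8: the function name `k_means`, the fixed random seed, the iteration cap and the sample points (RSSI in dBm paired with an arbitrary second feature) are assumptions made for illustration.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length tuples, using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 1: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assign to closest centroid
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [                             # step 3: center of gravity
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:                # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(-50.0, 1.0), (-52.0, 1.2), (-48.0, 0.9),
          (-80.0, 5.0), (-82.0, 5.5), (-78.0, 4.8)]
centroids, clusters = k_means(points, k=2)
```

An empty cluster keeps its previous centroid in this sketch; other policies, such as re-seeding the centroid from a random point, are equally common.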

Fig. 9. Realizing the machine learning model using the k-means clustering algorithm (right - training model; left - RSSI class based on detection thresholds).

Fig. 13. Emphasizing clustering results in two ways (left - 2D projection; right - 3D projection)

Table 1 lists the errors obtained with the clustering by different methods. The best accuracy, both in terms of 2D positioning and position identification, was obtained with 3D clustering based on k-means and the use of a Gaussian Mixture Regression. In addition, the results obtained show an improvement in the positioning time regardless of the method used.

Table 1. Assessment of clustering methods

Figure 14 presents the results obtained for the evaluation of the different methods used for clustering. The evaluated methods are: RSSI clustering and MGD (Multivariate Gaussian Distance); RSSI clustering and GMR (Gaussian Mixture Regression); 3D clustering (k-means) and GMR (proposed).