A 3M K-means algorithm for fast and practicably identifying COVID-19 close contacts

Given that the risk of the COVID-19 epidemic still exists and the flow of patients is difficult to monitor, identifying the people who have had close contact with confirmed cases is an important anti-epidemic task, whether in areas where the epidemic is developing rapidly or in areas where it has been brought under phased control. This article discusses how to quickly locate people who have been in close contact with confirmed cases and determine their risk of infection. From the perspective of the government, this work proposes a multi-snapshot multi-stage minority K-means (3M K-means) algorithm. The algorithm reduces the amount of data and considerably improves the speed of clustering by discarding risk-excluded classes and points early in the process, whereas traditional algorithms have O(N^2) computational complexity and need several days, which is impractical for the urgent COVID-19 situation. The 3M algorithm greatly cuts down the computational time, thereby making rapid warning of close contacts practicable. The method is simple, yet efficient and practicable for the urgent COVID-19 situation. Its use can help control the COVID-19 epidemic, achieve significant cost savings, and give people psychological reassurance for work resumption.


Introduction
The coronavirus disease 2019 (COVID-19) epidemic has affected more than 200 countries and regions in the world [1], and the situations in several countries have gradually passed the outbreak period. Although the most severe period has passed, effective control of the epidemic is a crucial step in preventing the disease from rebounding. Given the extremely strong infectivity of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), this disease may persist and spread in cities and densely populated urban areas for a long time until a vaccine is created. Each country's effort to control the domestic situation protects its citizens and contributes to global anti-epidemic actions.
For the first time since January, Chinese health officials reported no new deaths from COVID-19 on April 6, 2020 [2]. The number of newly confirmed cases and deaths per day in China has considerably decreased, and most of the newly confirmed cases are from abroad. Provinces and cities in China have implemented different prevention and control measures in accordance with their respective situations. For example, many provinces and cities have implemented the "health QR code" policy [3] and used the health QR code as an electronic voucher for allowing individuals to move around the local area. This voucher must be presented when entering or leaving a community, bus, office building, and other areas. The government can monitor the movement of citizens with the health QR code. If people have been to areas with a serious epidemic situation, the authorities will integrate big data information according to the specific situation and mark different colors on these people's health QR codes to remind them that they need to be isolated. For areas where the epidemic situation is serious, the regional government has organized house-to-house investigations and other actions to determine if asymptomatic infections exist to prevent the epidemic from worsening. Moreover, local governments obtain the location information of each user through mobile communication operators to monitor the urban internal flow of people. A user that flows across provinces or regions is reminded to self-isolate. With the support of current big data, governments at all levels have implemented various policies based on big data. Governments obtain data and technical support for epidemic prevention and control by dividing epidemic risk areas and obtaining people's location, mobility, contact persons, and other relevant information.
Researchers worldwide have conducted research on COVID-19 [4][5][6]. Aside from medical issues [8,9], current research focuses on predicting the development of the epidemic situation [7]. Cho also pointed out that artificial intelligence systems can help sniff out coronavirus outbreaks [10]. AI will play an important role in epidemic prevention and control.
SARS-CoV-2 is highly infectious, can spread through aerosols and infect people, and has a long incubation period. Therefore, the epidemic is extremely likely to spread. A patient who does not know that he or she carries the virus and has close contact with others before diagnosis and an asymptomatic infected person who has traveled on multiple vehicles in a short time can infect people at an exponential rate, and the resulting situation will be difficult to control. Therefore, given that the risk of the COVID-19 epidemic still exists and the flow of patients is difficult to control, identifying the people who have had close contact with confirmed cases is important for controlling the epidemic whether in areas where the epidemic is developing rapidly or in areas where the epidemic has been phase-controlled.

Algorithm
Identifying the close contacts in a big city is a big data problem. Based on our previous research on big data [11], we propose a fast method to identify close contacts in the epidemiological sense [4][5]. This method uses traffic data to screen the flow of people. Through stage-by-stage clustering, the groups of people that have been in close contact with a confirmed case can be identified. The method takes the perspective of the government, and its goal is to improve the monitoring ability of the social health system.
In the method, the government collects location data for the people in a city on a given day, either from location information or as snapshots of people's trajectories taken every T hours from 8:00 am to 10:00 pm, for a total of 14/T + 1 snapshots (see Figure 1). The concept of the snapshot is the premise of this method, which is named the multi-snapshot multi-stage minority K-means (3M K-means) algorithm. In each snapshot, a multi-stage minority K-means (2M K-means) is executed. In each stage, a minority K-means (1M K-means) is executed. The 1M K-means algorithm is a modification of the K-means algorithm. In the process of 1M K-means, if no confirmed case is present in a class during the current iteration, then this class is considered to be out of risk and is discarded. Thus, the center of this class no longer changes with iteration. If an uninfected point is repeatedly assigned to a risk-excluded class over multiple iterations, then this point can be directly discarded and will not participate in subsequent iterations.
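The snapshot count can be checked with a few lines of Python. This is only an illustrative sketch: the function name `snapshot_times`, its parameters, and the placeholder date are assumptions, not part of the original method, and the count 14/T + 1 is exact only when T divides the 14-hour window evenly.

```python
from datetime import datetime, timedelta

def snapshot_times(interval_hours, start_hour=8, end_hour=22):
    """List the snapshot times from start_hour to end_hour, one every interval_hours.

    With the 8:00 am to 10:00 pm window (14 hours), this yields 14/T + 1 snapshots
    when T divides 14 evenly.
    """
    start = datetime(2020, 1, 1, start_hour)  # arbitrary placeholder date
    span_hours = end_hour - start_hour
    count = int(span_hours // interval_hours) + 1
    return [
        (start + timedelta(hours=i * interval_hours)).strftime("%H:%M")
        for i in range(count)
    ]

times = snapshot_times(2)  # T = 2 hours
print(len(times), times[0], times[-1])
```

For T = 2 the window holds 8 snapshots, and for T = 1 it holds 15.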
After a 1M K-means run completes, most classes and points whose risk of infection has been excluded have already been discarded, and another 1M K-means run is executed in the next stage on far fewer data. When the clustering results no longer change, the 2M K-means clustering ends, and the points with a close contact risk in this snapshot are obtained. The 3M K-means clustering algorithm obtains the infection probability of each close contact by superposing the results of the 2M K-means runs over all snapshots (see Figure 2).
Fig. 2. The iterative process of the 2M K-means algorithm for one snapshot at 8:00 am on a particular day. As the number of iterations increases, the points that have been excluded from risk are gradually ignored, and fewer points need to participate in clustering. Points with close contact risk can be obtained after clustering is completed.
In the process of 1M K-means, an increasing number of points are ignored in the calculation as the number of iterations increases. In the actual calculation, although the initial points are numerous, a large amount of irrelevant data is rapidly discarded in the early iterations, which explains why this algorithm can greatly improve the operation speed.
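The text does not specify how the per-snapshot results are superposed into an infection probability. As one possible reading, the sketch below scores each person by the fraction of snapshots in which the 2M K-means run flagged them as a close contact. The scoring rule, the function name `superpose_snapshots`, and the person IDs are all assumptions made for illustration.

```python
from collections import Counter

def superpose_snapshots(at_risk_per_snapshot):
    """Combine per-snapshot close-contact sets into a per-person risk score.

    at_risk_per_snapshot: list with one set of person IDs per snapshot, each set
    holding the people flagged as close contacts in that snapshot.
    Returns {person_id: fraction of snapshots in which the person was flagged}.
    This fraction is an assumed stand-in for the infection probability.
    """
    total = len(at_risk_per_snapshot)
    counts = Counter()
    for flagged in at_risk_per_snapshot:
        counts.update(set(flagged))  # count each person at most once per snapshot
    return {person: hits / total for person, hits in counts.items()}

# Hypothetical example with four snapshots.
scores = superpose_snapshots([{"a", "b"}, {"a"}, set(), {"a"}])
print(scores)
```

A weighted scheme (for example, weighting snapshots by distance to the confirmed case) would fit the same interface; the simple fraction is used here only because it needs no extra assumptions about the data.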
Within one stage of 2M K-means, the following operation steps of the 1M K-means algorithm are executed:
(1) Before clustering, K clustering centers are given. The initial eigenvalue of each cluster is 0, indicating that the risk cannot be excluded.
(2) Clustering is performed by the K-means algorithm. For a point m whose eigenvalue is 0, the distance between the point and each class is calculated, and the point is assigned to the nearest class. The distance from point m to class n can be expressed as d = |L(m) - L(c_n)|, where c_n represents the center of class n.
(3) If no point with an eigenvalue of 1 (a confirmed case) appears in a certain class in p successive iterations, then the eigenvalue of this class is rewritten as -1, and the risk of this class is excluded. The class can be discarded directly. In the subsequent iterations, the center position of the class no longer changes.
(4) If a point whose eigenvalue is not 1 appears in the same risk-excluded class in q successive iterations, then the eigenvalue of the point is marked as -1; that is, the risk of the point is excluded and the point can be discarded. This point is ignored in the subsequent iterations.
(5) Steps 3 and 4 are repeated. When the clustering result remains the same, the 1M K-means of this stage ends.
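The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `minority_kmeans`, the default values of p and q, the use of Euclidean distance on 2-D coordinates, and the stopping rule are all choices made here. Flags follow the eigenvalue convention of the steps (1 for a confirmed case, 0 for risk not excluded, -1 for risk excluded).

```python
import numpy as np

def minority_kmeans(points, infected, centers, p=2, q=2, max_iter=100):
    """One 1M K-means stage (illustrative sketch, not the authors' code)."""
    points = np.asarray(points, dtype=float)
    infected = np.asarray(infected, dtype=bool)
    centers = np.array(centers, dtype=float)  # copy so the caller's data is untouched
    n, k = len(points), len(centers)

    point_flag = np.where(infected, 1, 0)   # step 1: 1 = confirmed, 0 = undecided
    class_flag = np.zeros(k, dtype=int)     # step 1: every class starts at 0
    empty_streak = np.zeros(k, dtype=int)   # iterations in a row without a confirmed case
    safe_streak = np.zeros(n, dtype=int)    # iterations in a row inside an excluded class
    labels = np.full(n, -1)

    for _ in range(max_iter):
        # Step 2: assign every point not yet discarded to its nearest center.
        active = point_flag != -1
        diff = points[active][:, None, :] - centers[None, :, :]
        new_labels = labels.copy()
        new_labels[active] = np.linalg.norm(diff, axis=2).argmin(axis=1)

        # Step 3: exclude classes with no confirmed case for p successive iterations.
        has_case = np.bincount(new_labels[infected], minlength=k) > 0
        live = class_flag == 0
        empty_streak = np.where(live & ~has_case, empty_streak + 1, 0)
        newly_excluded = live & (empty_streak >= p)
        class_flag[newly_excluded] = -1

        # Step 4: discard unconfirmed points that stay in the same excluded class
        # for q successive iterations.
        in_excluded = active & ~infected & (class_flag[new_labels] == -1)
        same = new_labels == labels
        safe_streak = np.where(in_excluded, np.where(same, safe_streak + 1, 1), 0)
        newly_discarded = in_excluded & (safe_streak >= q)
        point_flag[newly_discarded] = -1

        # Move only the centers of classes still at risk; excluded centers stay frozen.
        active = point_flag != -1
        for j in np.flatnonzero(class_flag == 0):
            members = active & (new_labels == j)
            if members.any():
                centers[j] = points[members].mean(axis=0)

        # Step 5: stop once nothing changed and no exclusion countdown is still running.
        changed = (not np.array_equal(new_labels, labels)
                   or newly_excluded.any() or newly_discarded.any())
        pending = (((class_flag == 0) & (empty_streak > 0)).any()
                   or ((point_flag == 0) & (safe_streak > 0)).any())
        labels = new_labels
        if not changed and not pending:
            break

    at_risk = (point_flag == 0) & (class_flag[labels] == 0)
    return labels, point_flag, class_flag, at_risk

# Hypothetical toy data: one confirmed case near the origin, one distant group.
pts = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
confirmed = [True, False, False, False, False, False]
labels, point_flag, class_flag, at_risk = minority_kmeans(pts, confirmed, [[0, 0], [10, 10]])
print(at_risk)
```

In this toy run the distant group is excluded as a class and its points are then dropped, so only the two unconfirmed points that share a class with the confirmed case remain flagged. A 2M K-means stage loop would call such a function repeatedly on the surviving points.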
In each stage of 2M K-means, the 1M K-means algorithm applies the K-means algorithm and constantly reduces the number of participating points, which can greatly improve the operation speed. In the next stage of 2M K-means, the 1M K-means algorithm works on far fewer data than in the previous stage, which further cuts down the computational time. The 2M K-means algorithm gains additional advantages when the amount of data is large: as the amount of data increases, it saves more time compared with the ordinary K-means algorithm.

Experiments and Discussions
We tested the 2M K-means algorithm and found that, for fewer than 10,000 points, the operation completes within one second. When the number of points increases to 5 million, the operation time grows only moderately, and the run finishes within 15 minutes. Table 1 shows the running time of the algorithm. The 3M algorithm targets the clustering needs of large-scale data and improves on the general clustering algorithm. With a given confirmed case, most points and clusters can be excluded quickly during the clustering process, thereby tremendously improving the calculation efficiency and making users' location data sufficient for the task. Given the limited infectious range of a confirmed case, many of the points can be ignored during this process. Thus, the amount of data involved in the calculation can be reduced to the greatest extent possible. This algorithm can remarkably improve the operation speed of the information mining algorithm, reaching about 10 times the speed of the ordinary clustering algorithm for a city with 5 million people. As the amount of data increases, the speed advantage of this algorithm over the ordinary K-means algorithm grows.
This algorithm reduces the amount of data and considerably improves the speed of clustering by discarding risk-excluded classes and points early in the process, whereas traditional algorithms have O(N^2) computational complexity and need several days, which is impractical for the urgent COVID-19 situation. The 3M algorithm greatly cuts down the computational time, thereby making rapid warning of close contacts practicable.
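To give a sense of scale for the O(N^2) claim, the following sketch counts distance evaluations for a single snapshot. The pairwise count N(N-1)/2 is exact; the values of K and the iteration count used for the K-means estimate are hypothetical, chosen only for illustration, and the function names are assumptions.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n people: n(n-1)/2, i.e. O(N^2)."""
    return n * (n - 1) // 2

def kmeans_distance_evals(n, k, iterations):
    """Distance evaluations for plain K-means: one per point, center, and iteration."""
    return n * k * iterations

n = 5_000_000                      # city population used in the experiments
pairs = pairwise_comparisons(n)
plain = kmeans_distance_evals(n, k=1000, iterations=50)  # k and iterations are hypothetical
print(f"pairwise: {pairs:.3e}, plain K-means: {plain:.3e}")
```

The 1M and 2M schemes reduce the K-means count further because discarded points and classes drop out of later iterations, so the effective N shrinks as the run proceeds.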

Conclusions
The 3M algorithm is a cost-saving "needle in a haystack" algorithm, because fewer than 1 in 1000 people in a city are diagnosed as infected. Directly calculating people's trajectories within an acceptable time is infeasible for a big city because of the volume of data. This simple, yet efficient, algorithm speeds up the clustering by nearly 80 times for a city with a population of more than 5 million people, thereby providing the list of close contacts in an acceptable time in practice.
For government agencies, this method is an effective anti-epidemic tool that can help people resume work with confidence, thereby improving the social health system and public work order in a number of countries. The method is simple, yet very practicable for the COVID-19 situation. For users, authorized data sharing enables them to assess their own movement paths clearly and accurately, take precautions, and make preparations.