K-anonymity model for privacy-preserving soccer fitness data publishing

With the development of data mining technology, more and more researchers use soccer fitness data to analyse the ranking of soccer athletes and their professional training. However, releasing soccer fitness data directly may leak the athletes' personal information, so ensuring both the utility of the data and the privacy of the players has become an issue. In this paper, we point out that soccer fitness data is exposed to the linking attack, in which an attacker uses auxiliary demographic data as background information to attack the published physical data and thereby maps the private data to an individual athlete. We then apply two k-anonymity algorithms, one based on partitioning and one based on k-means clustering, to soccer fitness data publishing in order to trade off data utility against personal privacy. Experimental results show that these methods can preserve the players' privacy while maintaining the utility of the data.


Introduction
In recent years, with the rapid development of big data and the Internet, many organizations have collected massive amounts of data in the hope that data mining technology will extract useful knowledge from it and turn it into something beneficial for the organization. There are also obstacles to data mining, and the privacy of individuals is one of them. The data mining community has therefore focused on developing techniques that enable data utility while preserving the privacy of individuals, which started a popular branch of research named "Privacy Preserving Data Publishing" [1].

Rossi [2] used data collected by GPS and proposed a multidimensional, machine-learning-based approach to injury prediction in professional soccer. However, that work does not consider publishing the collected fitness data to other parties so that they can extract more useful information. Some recent studies [3], [4] show that the simple technique of removing identifiers (e.g., Name and Social Security Number) before publishing a table does not always guarantee privacy. The linking attack also exists in soccer fitness data publishing. For example, a soccer player might be re-identified by joining the published data with another soccer website dataset on Height and Weight. Fig. 1 shows such an attack, where Bob's sensitive fitness data is determined by joining the published soccer fitness data with public soccer website data.

K-anonymity has been proposed to reduce the risk of this type of attack [4]. The main purpose of k-anonymization, under which each tuple is indistinguishable from at least k-1 other tuples, is to protect the privacy of the individuals to whom the data belongs.

The main contributions of this paper are summarized as follows. First, we point out the linking attack existing in soccer fitness data: an attacker can use public soccer website data to identify a soccer player and his sensitive fitness data. Second, to counter this type of attack, we apply two k-anonymity methods, Mondrian K-Anonymity and K-means Anonymity. We also conduct a comparative study of these approaches on soccer fitness data and analyse their results. Experimental results show that these methods can preserve the soccer players' privacy and guarantee the utility of the data.

The remainder of this paper is organized as follows. Section 2 presents a brief overview of the literature on k-anonymization. In Section 3, we formally define the k-anonymity model for traditional anonymization. Section 4 focuses on practical solutions of k-anonymity for privacy-preserving data publishing. We present the experimental results in Section 5 and conclude this paper in Section 6.
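The linking attack described in this introduction can be illustrated with a small sketch. The two tables below are invented toy data, and the names `published`, `public_site`, `link` and the attribute `sprint_30m_s` are hypothetical; only the idea of joining on Height and Weight comes from the text.

```python
# Toy illustration of a linking attack; all values are invented.
published = [  # explicit identifiers removed, quasi-identifiers kept
    {"height": 183, "weight": 76, "sprint_30m_s": 4.1},
    {"height": 178, "weight": 70, "sprint_30m_s": 4.4},
    {"height": 190, "weight": 85, "sprint_30m_s": 4.3},
]
public_site = [  # auxiliary demographic data available to the attacker
    {"name": "Bob", "height": 183, "weight": 76},
    {"name": "Tom", "height": 178, "weight": 70},
]


def link(published_rows, auxiliary_rows, qi=("height", "weight")):
    """Join both tables on the quasi-identifier and keep unique matches."""
    matches = {}
    for person in auxiliary_rows:
        key = tuple(person[a] for a in qi)
        hits = [r for r in published_rows if tuple(r[a] for a in qi) == key]
        if len(hits) == 1:  # a unique match re-identifies the person
            matches[person["name"]] = hits[0]
    return matches


reidentified = link(published, public_site)
print(reidentified["Bob"])
```

Because every (height, weight) pair in the toy published table is unique, each person listed on the public site is matched to exactly one published record.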

Related work
The issue of information disclosure has been studied extensively in the framework of statistical databases. Many disclosure limitation techniques have been designed for data publishing, including Sampling, Cell Suppression, Rounding, Data Swapping and Perturbation. However, these methods compromise the data integrity of the tables. Sweeney [3] first introduced the k-anonymity protection model, explored related attacks and provided ways in which the attacks can be thwarted.
Numerous algorithms [5], [6], [7] have been proposed in the literature for guaranteeing k-anonymity. LeFevre [6] introduced a class of algorithms for producing k-anonymous full-domain generalizations using two key ideas: bottom-up aggregation (rollup) along generalization dimensions and a priori computation. The work in [7] extended this study by using a simple greedy approximation algorithm to achieve multidimensional k-anonymity.

Problem definition
In this section, a general model, k-anonymity, is used to define the anonymization problem for soccer fitness data. Furthermore, we consider two efficient metrics to quantify the information loss incurred by the table perturbation. A table T is k-anonymous with respect to its quasi-identifier if every combination of quasi-identifier values that appears in T occurs at least k times. That is, the size of each equivalence class in T with respect to the quasi-identifier is at least k.
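The k-anonymity condition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch; the function name `is_k_anonymous` and the toy rows are assumptions made for illustration, with height and weight as the quasi-identifier as in the experiments.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """Return True if every quasi-identifier combination occurs >= k times."""
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return all(c >= k for c in counts.values())


# Invented toy table: generalized height and weight ranges as strings.
table = [
    {"height": "170-180", "weight": "60-70", "heart_rate": 61},
    {"height": "170-180", "weight": "60-70", "heart_rate": 58},
    {"height": "180-190", "weight": "70-85", "heart_rate": 64},
    {"height": "180-190", "weight": "70-85", "heart_rate": 55},
    {"height": "180-190", "weight": "70-85", "heart_rate": 60},
]

print(is_k_anonymous(table, ("height", "weight"), 2))
print(is_k_anonymous(table, ("height", "weight"), 3))
```

The toy table has equivalence classes of sizes 2 and 3, so it satisfies the condition for k = 2 but not for k = 3.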

Metrics for information loss
Information loss is a broad concept, and various metrics have been proposed for it in privacy-preserving data analysis. In order to maintain the utility of soccer fitness data, we should change the table as little as possible. That is, the information loss caused by anonymization should be minimized.
The first metric we use attempts to capture in a straightforward way the desire to maintain discernibility between tuples as much as is allowed by a given setting of k. The discernibility metric (DM) [5] can be stated mathematically as follows:

$$C_{DM}(T) = \sum_{i=1}^{m} |E_i|^2 \qquad (1)$$

In this expression, $E_i$ refers to the i-th equivalence class of tuples in table T induced by the anonymization, and m is the number of equivalence classes.
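As a small worked example of the discernibility metric, the sketch below sums the squared sizes of the equivalence classes. It assumes the common form of the metric without a separate penalty for suppressed tuples; the function name `discernibility` and the toy rows are hypothetical.

```python
from collections import Counter


def discernibility(rows, qi):
    """Sum of squared equivalence-class sizes over the quasi-identifier."""
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return sum(c * c for c in counts.values())


# Invented toy table with two equivalence classes of sizes 2 and 3.
rows = [
    {"height": "170-180", "weight": "60-70"},
    {"height": "170-180", "weight": "60-70"},
    {"height": "180-190", "weight": "70-85"},
    {"height": "180-190", "weight": "70-85"},
    {"height": "180-190", "weight": "70-85"},
]

dm_cost = discernibility(rows, ("height", "weight"))
print(dm_cost)  # 2**2 + 3**2 = 13
```

Larger equivalence classes are penalized quadratically, which is why the cost grows as more records share the same generalized quasi-identifier value.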
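Another interesting cost metric we use was originally proposed by Xu [8]. On a numeric attribute $A_i$, the normalized certainty penalty (NCP) of a tuple t whose value is generalized to the interval $[y_i, z_i]$ is defined as $NCP_{A_i}(t) = (z_i - y_i) / |A_i|$, where $|A_i|$ is the range of all tuples on attribute $A_i$. The NCP of the table can then be formally defined as follows:

$$NCP(T) = \sum_{j=1}^{|T|} \sum_{i=1}^{d} NCP_{A_i}(t_j) \qquad (2)$$

where $t_j$ is the j-th record and d represents the number of attributes.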
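The NCP computation can be sketched as follows for numeric attributes. The representation of a generalized record as a mapping from attribute to a (low, high) interval, the function names `ncp` and `ncp_percent`, and the normalization used for the percentage (dividing by the number of records times the number of attributes) are assumptions made for this illustration; attribute weights are not used.

```python
def ncp(generalized, domain):
    """Total NCP of a generalized table.

    generalized: list of records, each a dict attr -> (low, high) interval.
    domain: dict attr -> (min, max) of the attribute in the original table.
    """
    total = 0.0
    for record in generalized:
        for attr, (low, high) in record.items():
            dmin, dmax = domain[attr]
            width = dmax - dmin
            # An attribute with a single value cannot lose information.
            total += (high - low) / width if width else 0.0
    return total


def ncp_percent(generalized, domain):
    """NCP as a percentage of its maximum possible value (assumed scaling)."""
    return 100.0 * ncp(generalized, domain) / (len(generalized) * len(domain))


# Invented toy data: two records generalized to half of each domain.
domain = {"height": (170, 190), "weight": (60, 80)}
generalized = [
    {"height": (170, 180), "weight": (60, 70)},
    {"height": (170, 180), "weight": (60, 70)},
]

print(ncp(generalized, domain))          # 0.5 + 0.5 per record -> 2.0
print(ncp_percent(generalized, domain))  # 50.0
```

A record that is not generalized at all contributes 0, and a record generalized to the full domain of every attribute contributes d.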

Anonymity method
In this section, we consider methods to protect published soccer fitness data from the linking attack. Two algorithms for achieving k-anonymity are applied to the soccer fitness data.

The greedy partitioning algorithm
LeFevre [7] transforms the k-anonymity problem into a partitioning problem. The approach consists of two phases. In the first phase, multidimensional regions that cover the domain space are defined by finding a partitioning of the d-dimensional space, where d is the number of quasi-identifier attributes, such that each partition contains at least k tuples. In the second phase, the records in each partition are generalized so that they all share the same quasi-identifier value. The solution for strict partitioning is given in Algorithm 1.
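The two phases can be sketched as a short recursive procedure. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: it tries dimensions in order of raw (unnormalized) value range, cuts at the lower median, and accepts a cut only if both sides keep at least k rows. The names `mondrian` and `generalize` and the toy rows are hypothetical.

```python
from statistics import median_low


def mondrian(rows, qi, k):
    """Phase 1: split rows into partitions that each hold at least k rows."""

    def width(attr):
        values = [r[attr] for r in rows]
        return max(values) - min(values)

    # Try the dimension with the widest value range first.
    for attr in sorted(qi, key=width, reverse=True):
        split = median_low([r[attr] for r in rows])
        left = [r for r in rows if r[attr] <= split]
        right = [r for r in rows if r[attr] > split]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian(left, qi, k) + mondrian(right, qi, k)
    return [rows]  # no allowable cut on any dimension


def generalize(partition, qi):
    """Phase 2: replace each quasi-identifier value by the partition range."""
    ranges = {a: (min(r[a] for r in partition), max(r[a] for r in partition))
              for a in qi}
    return [{**r, **ranges} for r in partition]


# Invented toy rows with height and weight as the quasi-identifier.
rows = [
    {"height": 170, "weight": 60}, {"height": 172, "weight": 62},
    {"height": 175, "weight": 65}, {"height": 180, "weight": 70},
    {"height": 185, "weight": 80}, {"height": 190, "weight": 85},
]
partitions = mondrian(rows, ("height", "weight"), k=2)
anonymized = [g for p in partitions for g in generalize(p, ("height", "weight"))]
print(len(partitions), anonymized[0])
```

Because records with equal values on the chosen dimension always fall on the same side of the cut, the resulting regions do not overlap, which corresponds to the strict variant of the partitioning.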

The K-Means clustering algorithm
We can also transform the k-anonymity problem into a clustering problem. The clustering problem is to find a set of clusters from a given set of n records such that each cluster contains at least k data points, the sum of all intra-class distances is minimized and the inter-class distance is maximized. Using the k-means clustering algorithm, the anonymized table is generated in the following three steps.
Step 1: Clustering() In this step, we use Algorithm 2 to cluster the soccer fitness data into m classes. In Algorithm 2, lines 1 and 2 are the initialization process: we calculate the number of classes and randomly select m records as the centers of the classes. Lines 3 to 8 are the iteration process. At each iteration, we calculate the distance between each tuple t and each class c, pick the minimal distance and tag the tuple with the corresponding class. Finally, the cluster centers are recalculated. During this process, the Euclidean distance is used to measure the similarity between each record and the cluster centers.
Step 2: Merging() After the clustering, there may exist some groups that contain fewer than k records. To solve this issue, we introduce a merging process. In this process, the similarity of two class centers is calculated by the Euclidean distance, and each small group is merged into the cluster whose center is closest to its own, so that the number of records in the resulting cluster is at least k.
Step 3: Anonymizing() Finally, we anonymize the records in the same class so that they have the same quasi-identifier value. The procedure of this last step is given in Algorithm 4.
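The three steps above can be sketched compactly as follows. This is an illustration under stated assumptions rather than the paper's Algorithms 2 to 4: the number of classes is taken as m = n // k, the iteration count is fixed, the smallest cluster is merged first, and each record is anonymized to the (min, max) range of its cluster. All function names and the toy points are hypothetical.

```python
import math
import random


def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def cluster_points(points, k, iters=20, seed=0):
    """Step 1: plain k-means with m = n // k classes (assumed choice of m)."""
    rng = random.Random(seed)
    m = max(1, len(points) // k)
    centers = rng.sample(points, m)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(m), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its previous center.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def merge_small(clusters, k):
    """Step 2: merge each undersized cluster into its nearest neighbour."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1 and any(len(c) < k for c in clusters):
        i = min(range(len(clusters)), key=lambda j: len(clusters[j]))
        small = clusters.pop(i)
        target = min(clusters,
                     key=lambda c: math.dist(centroid(small), centroid(c)))
        target.extend(small)
    return clusters


def anonymize(clusters):
    """Step 3: give every record in a cluster the same (min, max) ranges."""
    result = []
    for c in clusters:
        ranges = tuple((min(col), max(col)) for col in zip(*c))
        result.extend([ranges] * len(c))
    return result


# Invented (height, weight) points.
points = [(170, 60), (172, 62), (175, 65), (180, 70),
          (185, 80), (190, 85), (188, 83)]
groups = merge_small(cluster_points(points, k=3), k=3)
released = anonymize(groups)
print([len(g) for g in groups])
```

If the table holds fewer than k records in total, merging ends with a single cluster that is still smaller than k, so the sketch assumes n >= k.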

Experiments
In this section, we evaluate the performance of the proposed k-anonymity algorithms. The experiments are conducted on a 2.93 GHz Intel(R) Core(TM) 2 Duo CPU with 4 GB of memory running the Windows 7 operating system.

Dataset
We use real soccer fitness data collected from the soccer players of ShanDong LuNeng over five quarters from 2016 to 2017. All the tuples in the table have 12 attributes. The athlete's name is replaced by a pseudonym, and the remaining attributes consist of one nominal attribute and numeric attributes. In this paper, we use only the height and weight as the quasi-identifiers; the other attributes are sensitive attributes, which can reveal potentially private information about the players.

Information loss
DM cost vs k (Fig. 2). Fig. 2 shows the relative changes of the DM cost of the three methods as k varies. The DM cost grows with k because, as k increases, more and more records are generalized to have the same quasi-identifier value, resulting in greater loss of information. In addition, the clustering algorithm performs better than Mondrian. This is because, when k is small, the number of records in each Mondrian partition is larger than k. The cost of the clustering gradually approaches that of Mondrian as k increases.
NCP cost vs k (Fig. 3). Fig. 3 further gives the experimental result of the relation between NCP and k. The NCP cost mainly measures the degree of generalization of the records, which makes it a good metric of data utility. In this paper, we calculate the NCP as a percentage. The NCP generally increases as k increases, so it exhibits a trade-off between data privacy and data utility. Mondrian(strict) is better than the clustering and Mondrian(relax), and Mondrian(relax) sometimes yields the same NCP percentage for nearby values of k because it produces the same partitions.

K-anonymity: A general model
Suppose a data holder wants to publish a soccer fitness data table to some recipient for data analysis.
Algorithm 1: Top-down greedy algorithm for strict multidimensional partitioning. Input: a table T and an integer k.