Large Scale Face Data Purification based on Correlation Function and Multi-Phase Grouping

Recent advances in deep learning enable high-performance artificial intelligence that matches or exceeds human capability in various applications. However, deep learning relies heavily on large-scale training data, which typically contains a large number of outlier samples that are difficult to remove. In this paper, we propose a face image purifying algorithm that combines a correlation function of deep features with a multi-phase grouping technique. The correlation function determines the principal class by measuring the similarities between all pairs of samples. The principal class is then used as a prior for the multi-phase grouping algorithm, which purifies the face data using multiple thresholds. The experimental results demonstrate that the proposed algorithm significantly improves on primitive clustering algorithms such as K-Means.


Introduction
Artificial intelligence and machine vision are at the frontier of modern industrial and information science, and face recognition is one of the earliest applications of machine vision. Recently, with the development of artificial intelligence in dynamic face recognition and facial portrait differentiation, face recognition has found a new research direction [2][3]. Furthermore, combining traditional face recognition with big data technology, that is, training face features on a large number of samples to improve recognition performance, has become an active problem in machine vision research.
However, the collected face training samples contain many non-target images. We define the non-target images within the same class of face images as noisy samples; they have features similar to those of the target images. This sample noise seriously affects the training results. How to remove the sample noise from a large number of training samples has therefore become a new focus of face recognition research [4].
Several approaches have recently been proposed for this problem. Fitzgibbon and Zisserman [5] proposed a sample purification method based on the Joint Manifold Distance (JMD); this method treats a group of face data as a subspace and performs face clustering by calculating the JMD. Wu et al. [6] established a clustering model using the probabilistic constraints of a hidden Markov random field. They classified all face sample images into K disjoint clusters, where K must be given in advance. Because an accurate value of K cannot be obtained in advance, the results are very unstable. Zhang et al. [7] clustered face shapes into seven classes and then subdivided them into smaller classes according to the facial features extracted by the ASM method. This method relies on the facial contour; however, because facial expressions vary and lighting, occlusion, and other factors interfere, the deep features of a human face are not easy to extract. The method therefore has some limitations.
In addition, some studies combine intelligent algorithms with clustering methods. The authors of [8] first extracted deep features of face images using a convolutional neural network (CNN) and then combined these features with a clustering method (e.g., K-Means or hierarchical clustering) to cluster the face images. However, this method has a problem similar to that of [6]: K-Means requires the number of clusters to be estimated in advance, and its results are sensitive to the initial values. Hierarchical clustering with different types of feature similarity can partition the data into different clusters, but the samples cannot be re-classified afterwards, so the purification result is not satisfactory. Even so, compared with methods that use a clustering algorithm alone, methods that combine an intelligent algorithm with a clustering algorithm improve the accuracy and efficiency of sample purification to a certain extent.
In this paper, we propose an algorithm that combines a correlation function of deep features with a multi-phase grouping technique to solve the problem of sample purification. First, we use the VGG Face [1] network to extract high-level feature information from the face images, and we define the correlation function used to compute the principal class. Then, we choose appropriate multiple thresholds to divide the data into groups. Finally, we obtain the purified sample images by filtering out the noisy samples in each group.
The remainder of the paper is organized as follows. Section 2 describes the feature extraction of face images and the similarity measure. Section 3 introduces the proposed face image purifying algorithm in detail, including the definition of the correlation function, the analysis of the principal class, multi-phase grouping, and threshold setting. Section 4 shows the experimental results.

Feature Extraction
The CNN is an efficient intelligent algorithm that has developed rapidly in recent years. Because of its deep structure, strong learning ability, and hierarchical nonlinear mapping, the CNN has been widely used in facial feature extraction [1][2][3] and has become the main method for face recognition [9]. In this paper, we build the face feature extraction network on the VGG Face CNN [1]. The network has 13 convolution layers; each convolution layer contains a linear operator followed by one or more nonlinear operators, such as ReLU. The last three blocks are fully connected (FC) layers. We use the fc7 layer to extract the face image features; [1] reports a face recognition accuracy of 97% with this method on the YouTube Faces dataset. We therefore combine it with our proposed face image purification algorithm to refine the face data.

Similarity measure
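We use the CNN to extract the high-level features of each face image. The distance between any two samples $x_i$ and $x_j$ can then be expressed as $\mathrm{dist}(x_i, x_j)$, and the similarity between them, $\mathrm{eachsim}(x_i, x_j)$, is given by Eq. (1), where $\mathrm{dist}(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$ after normalization.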
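As an illustration of the normalized Euclidean distance described above, the following minimal sketch uses only the Python standard library. The mapping from distance to similarity is an assumption made here for illustration (two unit vectors are at most 2 apart, so `1 - d/2` falls in [0, 1]); it is not necessarily the form of Eq. (1). The helper name `l2_normalize` is hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm else list(v)

def dist(x_i, x_j):
    """Euclidean distance between two feature vectors after L2 normalization."""
    a, b = l2_normalize(x_i), l2_normalize(x_j)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def eachsim(x_i, x_j):
    """ASSUMED similarity: 1 - d/2, so identical directions give 1 and opposite directions give 0."""
    return 1.0 - dist(x_i, x_j) / 2.0
```

Because the vectors are normalized first, scaling a feature vector does not change its similarity to any other vector under this sketch.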
Face image purifying algorithm
When $P(c_i)$ is the maximum, $c_i$ is taken as the target category, and the images of the other categories are taken as noisy samples.
To find, automatically and without supervision, the face image samples that are densely distributed in the current data set, we define the correlation function as the accumulation of the similarity of each sample with all the other samples. The correlation function for any $x_i$ is

$$\mathrm{Correlation}(x_i) = \sum_{j=1,\, j \neq i}^{N} \mathrm{eachsim}(x_i, x_j) \qquad (3)$$

where $\mathrm{eachsim}(x_i, x_j)$ is the similarity between the two feature vectors $x_i$ and $x_j$, and $N$ is the number of samples.

The data could instead be divided into several classes by pre-setting the number of clusters, but this may cause data of the same class to be split into several categories, or to be assigned to another cluster and thus become noise. Once the number of clusters is fixed, misclassified samples cannot be recovered, which degrades the data purification. The correlation function reflects the correlation of each sample with the whole data set and can estimate the principal class among the sample classes. This avoids splitting samples of the same class into other classes and reduces the probability of misclassification.
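The correlation function can be sketched as a sum of pairwise similarities, with the sample of highest correlation taken as the principal sample. The sketch below is illustrative only: the `similarity` function assumes the mapping `1 - d/2` on normalized vectors, which stands in for Eq. (1) and is not confirmed by the paper, and the function names are hypothetical.

```python
import math

def similarity(a, b):
    """ASSUMED stand-in for Eq. (1): 1 - d/2 on L2-normalized vectors."""
    na = math.sqrt(sum(c * c for c in a)) or 1.0
    nb = math.sqrt(sum(c * c for c in b)) or 1.0
    d = math.sqrt(sum((p / na - q / nb) ** 2 for p, q in zip(a, b)))
    return 1.0 - d / 2.0

def correlation(features):
    """Correlation of each sample: sum of its similarities to all other samples."""
    n = len(features)
    return [
        sum(similarity(features[i], features[j]) for j in range(n) if j != i)
        for i in range(n)
    ]

def principal_index(features):
    """Index of the sample whose correlation is maximal."""
    scores = correlation(features)
    return max(range(len(scores)), key=scores.__getitem__)
```

Computing all pairwise similarities costs O(N^2) evaluations, so for data sets of the size used in the experiments a vectorized implementation would likely be preferable to this plain-Python version.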

Analysis of principal class
where $f_{max}$ is the training feature vector at which the correlation function is maximized; it is taken as the target sample vector.

Multi-phase grouping
In the data set $R$, we classify the main component samples by $f_{max}$. Let the set of classified main component samples be $A$; the remaining samples belong to the set $B$, so that $R = A \cup B$, where $A$ and $B$ are both proper subsets of the data set.
Let the first threshold be $T_1$. We compare the similarity between each $x_i$ and $f_{max}$ by Eq. (1): if the similarity is greater than $T_1$, the sample belongs to $A$; otherwise, it belongs to $B$. This can be expressed as

$$x_i \in \begin{cases} A, & \mathrm{eachsim}(x_i, f_{max}) > T_1 \\ B, & \text{otherwise} \end{cases}$$

We thus initially purify the samples of the principal class by $T_1$ and divide $R$ into the principal class $A$ and the group $B$ to be filtered. $A$ contains the $G$ feature vectors assigned to the principal group, and $B$ contains the $D$ feature vectors assigned to the group to be filtered, with $G + D = N$.

However, changes in the face or an unclear image lead to diverse face images, which makes it more difficult to characterize multiple principal samples comprehensively. In this paper, we use the centroid of the feature vectors together with strict thresholds to solve the problem of multi-feature fusion when deciding whether each sample is a main component target sample. Let $O_1$ be the centroid of $A$; then $O_1$ can be expressed as

$$O_1 = \frac{1}{G} \sum_{x \in A} x$$

To minimize the loss of principal class samples, we check for principal sample images that may have been left in the set $B$ and purify the group $B$ again. Given the thresholds $T_2$ and $T_3$, we compare $O_1$ with each feature vector in $B$ and filter out the target samples, which form the second purified set of the principal class. We then purify the principal class group and update the centroid to $O_2$. Finally, using the threshold $T_3$, the principal class is purified again.

We divide the samples into different groups to be purified by the different thresholds, and we classify the samples of the principal class from each group, so that the purity of the samples improves continuously. Multi-phase grouping not only enlarges the inter-class distance but also minimizes the intra-class distance. The flowchart of the proposed algorithm is shown in Figure 1.
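The three phases can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation, and it rests on several assumptions: the similarity is taken as `1 - d/2` on normalized vectors (a stand-in for Eq. (1)), the second phase recovers samples from $B$ whose similarity to $O_1$ exceeds $T_2$, and the third phase keeps only samples whose similarity to $O_2$ exceeds $T_3$. All function names are hypothetical.

```python
import math

def normalize(v):
    """L2-normalize a vector (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm else list(v)

def similarity(a, b):
    """ASSUMED stand-in for Eq. (1): 1 - d/2 on L2-normalized vectors."""
    a, b = normalize(a), normalize(b)
    return 1.0 - math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))) / 2.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def multi_phase_grouping(features, f_max, t1, t2, t3):
    """Return (purified indices, noise indices) after three thresholding phases."""
    feats = [normalize(x) for x in features]
    everyone = range(len(feats))

    # Phase 1: split R into A (similar to f_max) and B (everything else).
    group_a = [i for i in everyone if similarity(feats[i], f_max) > t1]
    if not group_a:
        return [], list(everyone)
    in_a = set(group_a)
    group_b = [i for i in everyone if i not in in_a]

    # Phase 2: recover principal samples left in B by comparing with centroid O1.
    o1 = centroid([feats[i] for i in group_a])
    group_a = group_a + [i for i in group_b if similarity(feats[i], o1) > t2]

    # Phase 3: update the centroid to O2 and purify the principal class with T3.
    o2 = centroid([feats[i] for i in group_a])
    purified = sorted(i for i in group_a if similarity(feats[i], o2) > t3)
    kept = set(purified)
    return purified, [i for i in everyone if i not in kept]
```

The threshold values passed to this sketch depend on the assumed similarity scale, so the range of 0.7 to 0.8 reported in the paper does not necessarily carry over.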

Thresholds setting
Our thresholds strictly filter the noisy data and prevent it from contaminating the target samples. Figure 2 shows the distribution of the similarity between each pair of samples. The horizontal axis represents the similarity value, and the vertical axis represents the number of samples. Similarities below 0.6 are low, and this part represents most of the noisy data. Similarities above 0.7 are high, and this part represents the target samples. In this paper, we choose the thresholds between 0.7 and 0.8 empirically.
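A similarity distribution of the kind shown in Figure 2 can be tabulated with a simple histogram. The sketch below assumes similarity values in [0, 1] and equal-width bins; the function names and the bin count are illustrative choices, not taken from the paper.

```python
def similarity_histogram(sims, n_bins=10):
    """Count similarity values (assumed to lie in [0, 1]) in equal-width bins."""
    counts = [0] * n_bins
    for s in sims:
        # Clamp so that s == 1.0 falls in the last bin and negatives in the first.
        idx = max(0, min(int(s * n_bins), n_bins - 1))
        counts[idx] += 1
    return counts

def fraction_above(sims, threshold):
    """Fraction of similarity values strictly greater than the threshold."""
    return sum(1 for s in sims if s > threshold) / len(sims) if sims else 0.0
```

Inspecting where the histogram separates into a low-similarity mass and a high-similarity mass is one way to narrow down a candidate threshold range before tuning it experimentally.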

Experimental results and analysis
The proposed algorithm is evaluated using the precision and recall defined in [9]. The precision and recall of class $i$ in the $j$-th cluster are defined as

$$P(i, j) = N_{ij} / N_j, \qquad R(i, j) = N_{ij} / N_i$$

where $N_{ij}$ is the number of samples of class $i$ in the $j$-th cluster, $N_j$ is the number of samples in the $j$-th cluster, and $N_i$ is the number of target samples in class $i$. The F-measure of class $i$ is

$$F(i) = 2PR / (P + R) \qquad (11)$$

The experimental database was selected from the open face data set MS-Celeb-1M [10], which is provided by Microsoft Research and includes images of 1,000,000 celebrities. In our experiment, we randomly select 10,000 face images from the data.
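The three measures follow directly from the counts $N_{ij}$, $N_j$, and $N_i$. The sketch below is a small helper for computing them; the function name and the convention of returning 0 for empty denominators are choices made here, not part of the paper.

```python
def precision_recall_f(n_ij, n_j, n_i):
    """Precision, recall, and F-measure of class i in cluster j.

    n_ij: samples of class i in cluster j
    n_j:  samples in cluster j
    n_i:  target samples of class i
    """
    precision = n_ij / n_j if n_j else 0.0
    recall = n_ij / n_i if n_i else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, a cluster of 100 images that contains 80 of the 160 target images has a precision of 0.8 and a recall of 0.5.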
In our experiment, we keep $T_1$ unchanged and adjust $T_2$ and $T_3$ to obtain satisfactory purification results. Figure 3 shows the F-measure results using different thresholds. The F-score reflects the overall result of the face image purification, precision reflects the purity of the data, and recall reflects the loss of target samples. The purification result is best when the threshold is increased to about 0.7.
We select a subset of the samples whose similarities are high for comparison with the K-Means method and calculate the recall rate of the principal samples; the result is shown in Figure 4.

Conclusion
Data purification is an important step in big data analysis and application. To reduce data loss and purify face images as much as possible, we proposed a simple yet effective purification algorithm for face data. The algorithm purifies the target face samples in large data sets by selecting appropriate thresholds. The experimental results indicate that the proposed algorithm helps improve precision and recall. The main highlight of this paper is that the proposed algorithm is more accurate and stable than traditional clustering algorithms.
Consider the set of feature vectors for the training sample set, where $f_i$ is the feature vector of the $i$-th face image in the training samples and $N$ is the number of samples in the training sample set. We obtain the similarity $\mathrm{eachsim}(f_i, f_j)$ of any two different training feature vectors by Eq. (1), where $i \neq j$. We then compute the correlation function $\mathrm{Correlation}(f_i)$ of each training feature vector by Eq. (3) and obtain the main component sample by maximizing it:

$$f_{max} = \arg\max_{f_i} \mathrm{Correlation}(f_i)$$

Figure 1. Flowchart of our face purifying algorithm.

Figure 2. The distribution of similarity between each two samples.

Figure 3. The F-measure results using different thresholds.

Figure 4. Recall rate of different clustering results

3.1 Defining the correlation function
th i category image in the face data set is 1 ( )