Link prediction in author collaboration network based on BP neural network

Recently, more and more authors have been encouraged to collaborate because collaboration often produces good results. However, an author collaboration network contains experts in many research directions across many fields, and it is difficult for individual authors to decide which collaborators best match their expertise. This paper uses the existing relationships among authors to predict new relationships that may arise, and recommends to each author the collaborators they may be interested in. The data source is four years of DBLP data, from 2001 to 2004. After data cleaning, a training set and a test set are constructed and a BP neural network is used to build the model. The article also compares its performance with Logistic Regression, SVM and Random Forest. The experiments show that the BP neural network achieves better results, and that it is feasible to predict links in the author collaboration network.


INTRODUCTION
Link prediction is a very important research direction in social networks: platforms such as Facebook, Twitter and Flickr recommend friends to their users through link prediction. A link may mean a friend relationship in a social network, a collaboration relationship in a collaboration network, or an interaction in a protein network. Any kind of relationship between two nodes in a network can be considered a link. Link prediction can be defined as follows: given a snapshot $G = \langle V, E \rangle$ of a social network at some time and two nodes $V_i$ and $V_j$, predict the probability of a link between $V_i$ and $V_j$ [1]. The task falls into two categories: the first predicts links that will appear in the future, and the second predicts hidden links that exist but are unobserved. This article addresses the former.
Link prediction has two main approaches: a score-based approach and a machine learning approach. The score-based approach treats link prediction as a regression problem: it calculates a similarity score for each pair of nodes and sorts the pairs by score, and the resulting order determines the likelihood of forming future links.
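As an illustration of the score-based approach, the following minimal sketch ranks unlinked node pairs by their number of common neighbors. The choice of common neighbors as the similarity score, the function name `common_neighbor_scores` and the toy edge list are assumptions made for this example; they are not taken from this article.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Rank every unlinked node pair by its number of common neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v not in neighbors[u]:  # only score pairs without an existing link
            scores[(u, v)] = len(neighbors[u] & neighbors[v])

    # Higher score first: the order is read as the likelihood of a future link.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collaboration network (undirected edges).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
ranking = common_neighbor_scores(edges)
```

In this toy network the pairs with two shared neighbors are ranked ahead of the pairs with one or none, which is the sorting step the score-based approach relies on.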
The score-based approach is simple and effective, but it is sensitive to the weights assigned to different features. The machine learning approach can use a variety of attributes to predict the formation of links, does not require the weight of each feature to be set manually, and is easy to extend. Therefore, machine learning methods have been widely used in link

Dataset
The dataset used in this article is DBLP (Digital Bibliography and Library Project), an open, author-centered database that integrates English-language research literature in the computer science field. Each author's research output, including papers published in international journals and conferences, is listed by year. The DBLP data set is very large.

Features
For the feature vector, the author draws on the research of other scholars [5-7] and selects the features shown in Figure 1.

Performance metrics
The F rate combines precision and recall through a weight ∂, which lies between 0 and 1.
When ∂ = 0.5, precision and recall are equally important; when ∂ > 0.5, precision is more important. In this paper, precision and recall are treated as equally important, that is, ∂ = 0.5, and the F rate is then called the F1 rate.
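The weighting of precision and recall can be sketched in code. The example below assumes the F rate is the weighted harmonic mean 1 / (weight / precision + (1 - weight) / recall), which matches the behaviour described here (weight 0.5 gives F1, a larger weight favours precision); the exact formula used in this article is not recoverable from the text, so this form and the function name `precision_recall_f` are assumptions.

```python
def precision_recall_f(y_true, y_pred, weight=0.5):
    """Return (precision, recall, F) for binary labels 0/1.

    ASSUMPTION: F is the weighted harmonic mean of precision and recall,
    F = 1 / (weight / precision + (1 - weight) / recall).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0

    f_rate = 1.0 / (weight / precision + (1.0 - weight) / recall)
    return precision, recall, f_rate


# Hypothetical labels: 1 = a new collaboration appears, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f(y_true, y_pred, weight=0.5)
```

With weight 0.5 the result coincides with the usual F1 = 2PR / (P + R); with weight 1.0 it reduces to precision alone.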

Experiment result
In the experiment, the input layer, hidden layer and output layer of the BP neural network had 10, 50 and 2 neurons, respectively. The F1 rate is also at its maximum at 100 iterations (Fig. 3).
Therefore, we finally chose 100 iterations and a learning rate of 0.003. The BP neural network is also compared with Logistic Regression, SVM and Random Forest on the same test set, and the result is shown in Figure 4. The recall rate of the BP neural network is the highest, and its F1 rate is higher than that of Logistic Regression and approximately the same as that of Random Forest.

prediction. The machine learning approach regards the link prediction problem as a classification problem, and then uses SVM, decision trees, Boosting, Logistic Regression and other methods [2]. A neural network (NN) is a powerful machine learning approach for nonlinear prediction, composed of independent processing units called neurons. Each processing unit sums its weighted inputs and then applies a linear or non-linear function to the result to determine its output. Among neural network algorithms, the BP (backpropagation) neural network has attracted considerable attention for its strong non-linear mapping capability. BP neural networks have been widely used in various fields in recent years, including predicting the quality or robustness of plastic parts from key process variables and material grade changes [3]. In addition, accurate short-term load forecasting (STLF) with BP neural networks has played an important role in national and regional power system management [4]. However, few scholars have used neural network algorithms in the field of link prediction, so the author applies a BP neural network to predict new links in the author collaboration network and explores whether it can achieve good results in this field.

MATEC Web of Conferences 139, 00073 (2017) DOI: 10.1051/matecconf/201713900073 ICMITE 2017

THE BASIC KNOWLEDGE ON BP NEURAL NETWORK
This article uses a typical three-layer BP neural network, consisting of an input layer, a hidden layer and an output layer. Adjacent layers are fully connected by edges, and each edge corresponds to a weight w. In addition, every neuron outside the input layer has a bias b, an input value z obtained by weighted summation, and an output value a obtained by applying the non-linear activation function to z. The activation function used in this article is the Sigmoid function:

$$f(z) = \frac{1}{1 + e^{-z}}$$

The input value of neuron j in layer l is expressed as:

$$z_j^l = \sum_i w_{ij}^l a_i^{l-1} + b_j^l$$

The variables i and j denote neurons, ij denotes the connection from neuron i to neuron j, and $w_{ij}^l$ is the weight of that connection. $a_i^{l-1}$ is the output of neuron i in layer l-1, that is, the i-th input of neuron j in layer l. $b_j^l$ is the bias of neuron j in layer l. The output of neuron j in layer l is then expressed as:

$$a_j^l = f(z_j^l)$$

The number of layers and the number of neurons in each layer of a BP neural network must be specified manually. The network first initializes the weight of each edge and the bias of each neuron, and then performs forward propagation to obtain a predicted value for each sample in the training set. Then, according to the error between the true value and the predicted value, the weight of each connecting edge and the bias of each neuron are updated by backpropagation. The stopping condition can be one of the following: a. the number of iterations reaches a given value; b. the error falls below a given threshold. In this article, the stopping condition is reaching the given number of iterations, that is, the number of parameter updates reaches the given value.

Assuming that a given training set contains m training samples, we can use the gradient descent algorithm to solve the BP neural network. For a single training sample (x, y), the cost function (the difference between the predicted value and the true value) is defined as:

$$J(W, b; x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$$

For the whole training set, the cost function is:

$$J(W, b) = \frac{1}{m} \sum_{k=1}^{m} J\left(W, b; x^{(k)}, y^{(k)}\right) + \frac{\lambda}{2} \sum_{l} \sum_{i} \sum_{j} \left( w_{ij}^l \right)^2$$

It consists of two parts: the first is the mean squared error term and the second is the regularization term with coefficient $\lambda$, which mainly serves to prevent overfitting. In this article, the link prediction problem is transformed into a classification problem. The value of y is 0 or 1, and $h_{W,b}(x)$ lies in the range [0, 1] of the sigmoid function. The optimization problem of the BP neural network is to obtain the W and b that minimize the cost function. In general, random values near 0 are used to initialize the weights W and biases b, which are then updated by an optimization algorithm such as gradient descent:

$$w_{ij}^l := w_{ij}^l - \alpha \frac{\partial J(W, b)}{\partial w_{ij}^l}$$

$$b_j^l := b_j^l - \alpha \frac{\partial J(W, b)}{\partial b_j^l}$$

A new cost function value is obtained with the updated W and b, and the cost gradually approaches its minimum. In the above formula, α is the learning rate, which can be understood as the step size of each gradient update and thus governs how fast the cost function decreases. Its value is very important. If it is too large, the steps may be too big and overshoot the minimum of the cost function. If it is too small, the cost function may converge very slowly. In general, α is between 0.001 and 0.3. Moreover, when the sample size is relatively large, updating W and b with the whole data set each time is relatively inefficient. Therefore, this article uses Stochastic Gradient Descent (SGD), in which each iteration uses only a randomly selected part of the data set. In order to evaluate the performance of the BP neural network for link prediction in the author collaboration network, this paper compares its results with those of three algorithms: Logistic Regression, SVM and Random Forest.
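The training procedure described in this section can be sketched as follows: a three-layer network with sigmoid activations, a squared-error cost with a weight-decay term, and gradient updates driven by the learning rate. This is a minimal illustration under stated assumptions, not the implementation used in this article: the initialisation range, the regularization coefficient `lam`, the per-sample update in shuffled order (rather than a random subset per iteration), the function names and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # Random weights near 0 and zero biases (assumed initialisation range).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(inputs, weights, biases):
    # Output of each neuron j: f(sum_i w_ij * a_i + b_j).
    return [sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def sgd_step(x, y, layers, alpha, lam):
    """One gradient update on a single sample; returns the squared-error cost."""
    (w1, b1), (w2, b2) = layers
    hidden = forward(x, w1, b1)
    output = forward(hidden, w2, b2)
    # Error terms; the sigmoid derivative is a * (1 - a).
    delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, y)]
    delta_hid = [sum(w2[k][j] * delta_out[k] for k in range(len(w2))) * h * (1.0 - h)
                 for j, h in enumerate(hidden)]
    # Gradient step with learning rate alpha; lam * w is the weight-decay gradient.
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= alpha * (delta_out[k] * hidden[j] + lam * row[j])
        b2[k] -= alpha * delta_out[k]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] -= alpha * (delta_hid[j] * x[i] + lam * row[i])
        b1[j] -= alpha * delta_hid[j]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, y))


def train(samples, n_in, n_hidden, n_out, alpha, lam, epochs, seed=0):
    """Train for a fixed number of epochs (the stopping condition used here)."""
    rng = random.Random(seed)
    layers = [init_layer(n_in, n_hidden, rng), init_layer(n_hidden, n_out, rng)]
    history = []
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)  # visit the samples in random order
        losses = [sgd_step(x, y, layers, alpha, lam) for x, y in order]
        history.append(sum(losses) / len(losses))
    return layers, history


# Hypothetical toy data: two features, one-hot targets for classes 0 and 1.
toy = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [1.0, 0.0]),
       ([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [0.0, 1.0])]
layers, history = train(toy, n_in=2, n_hidden=4, n_out=2,
                        alpha=0.5, lam=1e-4, epochs=2000)
```

The toy learning rate of 0.5 is chosen only so that the small example converges quickly; `history` records the mean cost per epoch and can be inspected to see the cost approaching its minimum.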
Considering the time cost, the author experimented only with data from 2001 to 2004, and removed records that have only one author. The four-year data was divided into two periods, 2001-2002 and 2003-2004; features were constructed from the 2001-2002 data and labels from the 2003-2004 data. If two authors have no collaboration relationship in 2001-2002 but collaborate in 2003-2004, the label is y=1, indicating that a new link appears between the author pair. Otherwise the label is y=0, indicating that no new link appears between the authors. Note that, during feature construction, a selected author pair must have no collaboration relationship in 2001-2002; the task is then to predict whether the pair will collaborate in 2003 or 2004. The author also notes that some authors are active only in 2001-2002 and no longer appear in 2003-2004; conversely, some authors are active only in 2003-2004 and do not appear in 2001-2002. In the author collaboration network, this causes a mismatch between features and labels. To avoid this situation, this article considers only authors who appear in both time periods. In addition, the number of samples with y=0 is far greater than the number with y=1, which is the imbalanced learning problem. Therefore, in the experiment, the author randomly selected a subset of the negative samples (those with y=0) so that their number equals that of the positive samples (those with y=1); both classes contain 5,504 samples. As a result, the social network constructed in this article is an undirected, unweighted graph with 8,363 nodes. When splitting the data into training and test sets, 70% of the samples are randomly assigned to the training set and the remaining 30% to the test set, giving 7,707 samples in the training set and 3,302 samples in the test set.
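The sample construction described above can be sketched as follows. This is a minimal illustration, assuming each record is simply a list of author names; the function names `coauthor_pairs` and `build_samples`, the fixed random seed and the toy records are hypothetical and are not taken from the DBLP data.

```python
import random
from itertools import combinations


def coauthor_pairs(papers):
    """Unordered author pairs that share at least one paper."""
    pairs = set()
    for authors in papers:
        if len(authors) < 2:  # single-author records are dropped
            continue
        pairs.update(combinations(sorted(set(authors)), 2))
    return pairs


def build_samples(papers_early, papers_late, train_fraction=0.7, seed=0):
    early = coauthor_pairs(papers_early)  # period used for features
    late = coauthor_pairs(papers_late)    # period used for labels

    # Keep only authors that appear in both periods.
    authors_early = {a for pair in early for a in pair}
    authors_late = {a for pair in late for a in pair}
    shared = sorted(authors_early & authors_late)

    # Candidate pairs must have no collaboration in the early period.
    candidates = [p for p in combinations(shared, 2) if p not in early]
    positives = [p for p in candidates if p in late]      # y = 1
    negatives = [p for p in candidates if p not in late]  # y = 0

    # Randomly undersample the negatives to match the number of positives.
    rng = random.Random(seed)
    negatives = rng.sample(negatives, min(len(positives), len(negatives)))

    samples = [(p, 1) for p in positives] + [(p, 0) for p in negatives]
    rng.shuffle(samples)
    split = int(train_fraction * len(samples))
    return samples[:split], samples[split:]


# Hypothetical toy records standing in for the two periods.
early_papers = [["A", "B"], ["C", "D"], ["E", "F"], ["G"]]
late_papers = [["A", "C"], ["B", "D"], ["E", "F"], ["A", "B"]]
train_set, test_set = build_samples(early_papers, late_papers)
```

Feature computation for each pair is omitted here; only the labelling, balancing and 70/30 split are shown.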
Fig. 1. Features computed for each author pair ($V_i$, $V_j$)

The metrics used to evaluate and compare algorithms are precision, recall and F1 rates. Several terms must be defined before introducing the specific indicators. TP (true positive) means a positive sample is predicted as positive by the model; FN (false negative) means a positive sample is wrongly predicted as negative by the model. For a negative sample, a correct prediction is a true negative (TN); otherwise it is a false positive (FP). P represents a positive sample, namely there is a collaboration relationship between the authors. N represents a negative sample, namely there is no collaboration relationship between the authors. Precision is expressed by the formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

and recall by:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In practice, the number of iterations (epoch) and the learning rate (eta) both affect the performance of the BP neural network, so this article carried out several experiments. With epoch=200, the learning rates 0.001, 0.003, 0.01, 0.03 and 0.1 were tested in turn. As the learning rate increases, the precision of the model on the test set increases while the recall decreases, because the model predicts part of the negative samples as positive (Fig. 2). In terms of F1, the BP neural network performs better at a learning rate of 0.003. Therefore, the author chose a learning rate of 0.003 and ran experiments with 100, 200, 300, 400 and 500 iterations to obtain the performance metrics of the BP neural network under different numbers of iterations. The recall of the model on the test set is highest when the number of iterations is 100.

Fig. 2. BP neural network classification results at different learning rates (epoch=200)
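The two-stage tuning used in the experiments (first choose the learning rate at a fixed number of iterations, then choose the number of iterations at that learning rate) can be sketched as below. The function `two_stage_search` and the scoring function `toy_f1` are hypothetical: `toy_f1` is an artificial stand-in for "train the network and return the F1 rate on the test set" and does not reproduce any measured result from this article.

```python
import math


def two_stage_search(evaluate_f1, learning_rates, epoch_grid, fixed_epochs=200):
    """Pick the learning rate at a fixed epoch count, then the epoch count at that rate."""
    best_lr = max(learning_rates, key=lambda lr: evaluate_f1(lr, fixed_epochs))
    best_epochs = max(epoch_grid, key=lambda epochs: evaluate_f1(best_lr, epochs))
    return best_lr, best_epochs


def toy_f1(learning_rate, epochs):
    # Artificial score for illustration only; NOT a result from the experiments.
    return 0.8 - 0.05 * abs(math.log10(learning_rate / 0.003)) - 0.0001 * epochs


best_lr, best_epochs = two_stage_search(
    toy_f1,
    learning_rates=[0.001, 0.003, 0.01, 0.03, 0.1],
    epoch_grid=[100, 200, 300, 400, 500],
)
```

In a real run, `evaluate_f1` would train the network with the given settings and score it on held-out data; searching one parameter at a time keeps the number of training runs small at the cost of not exploring every combination.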