Research on a Network Similarity Search Algorithm for Biological Networks

Biological network databases are growing exponentially, and finding a target network accurately in such a database has become a difficult problem. This paper proposes a new network similarity search algorithm. The Top k similar networks are computed by two methods, the two returned sets are filtered by their overlap coefficient, and a weighted reordering algorithm then merges and reorders them so that a precise set of similar networks is returned. Query accuracy is assessed by comparing the edge correctness (EC) and largest common connected subgraph (LCCS) values of the returned ranked networks, and query time is compared with that of other algorithms. The results show that the proposed algorithm outperforms the compared algorithms in both query accuracy and query speed.


INTRODUCTION
Networks are widely used in bioinformatics [1], chemical informatics [2], biomedicine [3], social network analysis [4], and other application fields [5]. High-throughput biological technology has produced large numbers of biological networks, such as compound structure networks [6], biological pathways [7], transcription regulation networks [8], protein-protein interaction networks [9], and protein-DNA interaction networks [10]. Finding a target network among so many biological networks is difficult, and researchers have developed a variety of network search techniques. C-tree [11] is an indexing technique for kNN queries based on network edit distance; GString [12] is a semantic approach; GraphGrepSX [13] is an indexed subgraph similarity search method based on a suffix tree structure; SIGMA [14] is a set-based network similarity search (NSS) method; RINQ [15] is a reference-based index query method; NeMa [16] is a neighborhood-based subgraph search method; MAGE [17] is a pattern matching system built on a random walk with restart (RWR) algorithm; and REFBSS [18] is an improvement of RINQ. However, these algorithms suffer from several problems: the query network is restricted, the query may return an empty result, the query time is too long, and the query precision is not high enough. In view of these deficiencies, this paper proposes a new algorithm that combines two Top k network similarity searches in order to improve the accuracy of the returned similar networks and reduce query time.

Network Database and Query Network
A network can be regarded as a directed graph N = (V, E), where V is the set of nodes and E is the set of edges. The network database is the data store that holds the biological networks. It can be expressed as D = {N_1, N_2, ..., N_n}, which contains n networks, where N_i is the i-th network in the database. The set of query networks is expressed as T = {Q_1, Q_2, ..., Q_q}, where Q_j is the j-th query network.

Subnet Definitions
A network N' = (V', E') is a subnet of a network N = (V, E) if the node set V' is a subset of V and the edge set E' is a subset of E; in short, N' is a subnet of N. For biological networks, the nodes are biological molecules and the edges are intermolecular interactions. For subnet partitioning, this paper uses the MFinder [19] algorithm. Two nodes can be connected in two ways, unidirectionally or bidirectionally, so there are two types of two-node graphs, as shown in figure 1(a). There are 13 types of three-node graphs, as shown in figure 1(b), and 199 types of four-node graphs. Because the number of subgraph types grows exponentially with the number of nodes, this paper partitions the query and target networks only into 2-node, 3-node and 4-node subnets.
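The two-node case described above can be illustrated with a short sketch. It is not the MFinder algorithm; it only counts how many connected node pairs of a directed network are unidirectional and how many are bidirectional. The helper name `count_two_node_types` and the example edges are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]  # directed edge (source, target)


def count_two_node_types(edges: Iterable[Edge]) -> Dict[str, int]:
    """Count connected node pairs by connection type (hypothetical helper)."""
    edge_set = {(u, v) for u, v in edges if u != v}  # ignore self-loops
    counts = {"unidirectional": 0, "bidirectional": 0}
    seen = set()
    for u, v in edge_set:
        pair = frozenset((u, v))
        if pair in seen:
            continue  # each unordered pair is classified only once
        seen.add(pair)
        if (v, u) in edge_set:
            counts["bidirectional"] += 1
        else:
            counts["unidirectional"] += 1
    return counts


# Illustrative network: a<->b is bidirectional, b->c and c->d are unidirectional.
example_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]
example_counts = count_two_node_types(example_edges)
print(example_counts)
```

Counting 3-node and 4-node types requires canonical labelling of 13 and 199 subgraph classes respectively, which is why a dedicated motif tool is used for those sizes.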

Fig. 1. Subgraph types for two nodes (a) and three nodes (b).

Subnet Frequency Calculation
Subnet frequency calculation is a relatively complex process. In a subgraph with n nodes, a single node is connected to at most n-1 edges, so up to n-1 edges must be considered when calculating the probability P of a subgraph. The probability that the subgraph occurs is equal to the sum of the probabilities of its individual edges. The calculation formula is as follows:

Cosine Similarity
Let SubQ and SubN denote the k-node subnets of network Q and network N respectively. The cosine similarity of network Q and network N is then computed over their corresponding subnet probabilities:

$$\cos(Q, N) = \frac{\sum_{i=1}^{n} SubQ_i \cdot SubN_i}{\sqrt{\sum_{i=1}^{n} SubQ_i^2}\,\sqrt{\sum_{i=1}^{n} SubN_i^2}}$$

Here n is the number of subnet types. Taking 2-node subgraphs as an example, there are only two possible ways to connect a two-node graph, so n = 2. Suppose the probabilities of subnet A in the target network and the query network are 2/3 and 1, and the probabilities of subnet B in the target network and the query network are 1/3 and 0. The cosine similarity of the two networks is then:

$$\cos(Q, N) = \frac{1 \cdot \frac{2}{3} + 0 \cdot \frac{1}{3}}{\sqrt{1^2 + 0^2}\,\sqrt{\left(\frac{2}{3}\right)^2 + \left(\frac{1}{3}\right)^2}} = \frac{2}{\sqrt{5}} \approx 0.894$$
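The two-node example can be checked numerically with a minimal sketch. It assumes each network is summarised as a vector of subnet probabilities in a fixed subnet-type order; the function name `cosine_similarity` and the zero-vector convention are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(sub_q: Sequence[float], sub_n: Sequence[float]) -> float:
    """Cosine similarity between two subnet-probability vectors."""
    if len(sub_q) != len(sub_n):
        raise ValueError("both vectors must cover the same subnet types")
    dot = sum(q * n for q, n in zip(sub_q, sub_n))
    norm_q = math.sqrt(sum(q * q for q in sub_q))
    norm_n = math.sqrt(sum(n * n for n in sub_n))
    if norm_q == 0 or norm_n == 0:
        return 0.0  # assumed convention when a network has no subnets of this size
    return dot / (norm_q * norm_n)


# Probabilities of subnet types (A, B) from the two-node example in the text.
query_probs = [1.0, 0.0]
target_probs = [2 / 3, 1 / 3]
similarity = cosine_similarity(query_probs, target_probs)
print(round(similarity, 3))  # 0.894
```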

Network Alignment Quality Index
An important indicator for measuring network similarity is network alignment (NA). Network alignment is divided into local network alignment (LNA) and global network alignment (GNA). LNA is mainly concerned with biological information, such as functional consistency and biological relevance, whereas GNA considers both biological information and topological information. The following two indicators are the most commonly used measures of topological similarity.

Edge Correctness (EC)
EC is the percentage of edges in network N_i that are aligned to edges in network N_j, so its value lies in the range [0, 1]. For two networks N_i = (V_i, E_i) and N_j = (V_j, E_j), an alignment of the two networks can be expressed as an injective function f: V(N_i) -> V(N_j). EC is defined as follows:

$$EC = \frac{\left|\{(u, v) \in E_i : (f(u), f(v)) \in E_j\}\right|}{|E_i|}$$
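As a minimal sketch of this definition, the following computes EC for directed edge sets under a given alignment f, represented as a dictionary. The function name `edge_correctness` and the toy networks are hypothetical, and the alignment is assumed to be supplied by some other procedure.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]  # directed edge (source, target)


def edge_correctness(edges_i: Set[Edge], edges_j: Set[Edge],
                     f: Dict[Hashable, Hashable]) -> float:
    """Fraction of edges of N_i that f maps onto edges of N_j."""
    if not edges_i:
        return 0.0  # assumed convention for a network without edges
    conserved = sum(
        1 for u, v in edges_i
        if u in f and v in f and (f[u], f[v]) in edges_j
    )
    return conserved / len(edges_i)


# Toy example: two of the three edges of N_i are preserved by the alignment.
edges_i = {(1, 2), (2, 3), (3, 1)}
edges_j = {("a", "b"), ("b", "c"), ("c", "d")}
alignment = {1: "a", 2: "b", 3: "c"}
ec_value = edge_correctness(edges_i, edges_j, alignment)
print(ec_value)
```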

Largest Common Connected Subgraph (LCCS)
LCCS is the number of edges in the largest connected subgraph of the first network that is isomorphic to a subgraph of the second network. Unlike EC, the LCCS value ranges over [0, |E_i|], where |E_i| is the total number of edges in the first network. For two networks with the same EC value, the one with the larger LCCS value is rated higher, and the larger the LCCS value, the denser the network.
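One way to make this concrete is to evaluate LCCS relative to a fixed alignment f, which is an assumption of this sketch rather than something the text specifies. The edges of the first network that f maps onto edges of the second network form a common subgraph, and the sketch returns the edge count of its largest connected piece. Connectivity is treated as weak (edge direction ignored), and the name `lccs_size` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]  # directed edge (source, target)


def lccs_size(edges_i: Set[Edge], edges_j: Set[Edge],
              f: Dict[Hashable, Hashable]) -> int:
    """Edge count of the largest connected conserved subgraph under f."""
    conserved = [
        (u, v) for u, v in edges_i
        if u in f and v in f and (f[u], f[v]) in edges_j
    ]
    adjacency = defaultdict(set)  # undirected view of the conserved edges
    for u, v in conserved:
        adjacency[u].add(v)
        adjacency[v].add(u)

    visited = set()
    best = 0
    for start in list(adjacency):
        if start in visited:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        # Both endpoints of a conserved edge lie in the same component.
        edge_count = sum(1 for u, _ in conserved if u in component)
        best = max(best, edge_count)
    return best


edges_i = {(1, 2), (2, 3), (3, 1), (4, 5)}
edges_j = {("a", "b"), ("b", "c"), ("d", "e")}
alignment = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
lccs_value = lccs_size(edges_i, edges_j, alignment)
print(lccs_value)
```

In this toy case the conserved edges split into the pieces {1, 2, 3} with two edges and {4, 5} with one edge, so the value is 2.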

Algorithm Overall Flow
The network similarity algorithm is divided into three parts, as shown in figure 3. The first part is the cosine similarity calculation: the cosine similarity between the query network and each target network is computed, and the Top k networks by similarity are returned as the set D_1. In the second part, the alignment parameters between the query network and the target networks are computed using EC and LCCS, and the Top k' (k' = k) networks by these parameters are returned as the set D_2. The third part has two steps. The first step computes the overlap ratio between the Top k networks obtained by cosine similarity and the Top k' networks obtained by the EC and LCCS values; if it exceeds the threshold π, the algorithm proceeds to the next step, otherwise it terminates and the query fails. The second step assigns weights to the Top k and Top k' networks, re-ranks the networks returned by the two methods, and finally returns the ranked Top k 2 network set.

Top k Networks by Cosine Similarity
To compute the Top k networks by cosine similarity, the query and target networks are first partitioned into subnets. Because the number of subnet types grows exponentially with the number of subnet nodes, this paper, in consideration of time and computational complexity, uses only 2-node, 3-node and 4-node subgraphs. After partitioning, the subnets are used to compute the cosine similarity between the query network and each target network. The cosine similarity value serves as the similarity criterion, and the Top k most similar networks are returned, represented as D_1 = {N_1, N_2, ..., N_k}.

Top k Networks by EC and LCCS Values
EC and LCCS values are important indicators for evaluating network alignment and reflect, to some extent, the degree of similarity between networks, so this paper uses them as the second measure of network similarity. First, the EC value between the query network and each target network is computed. Because EC is a percentage, for most small-scale networks the differences between EC values are small owing to limited calculation precision, and similarity cannot be judged accurately from EC alone. When the EC differences are small, this paper therefore uses the LCCS value as a supplement to EC. The networks are then ranked by the combined EC and LCCS criterion, and the Top k' similar networks are returned, denoted D_2 = {N_1', N_2', ..., N_k'}, where k' = k; the prime merely distinguishes the data of D_2 from that of D_1.
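The ranking rule described above can be sketched as a sort with a tie-break. The paper does not say how small an EC difference must be before LCCS is consulted, so this sketch assumes EC values that agree after rounding to `decimals` places count as tied; `decimals`, `Scored` and `top_k_by_ec_lccs` are all hypothetical names.

```python
from typing import List, NamedTuple, Sequence


class Scored(NamedTuple):
    name: str    # identifier of a target network
    ec: float    # edge correctness against the query network
    lccs: int    # LCCS value against the query network


def top_k_by_ec_lccs(candidates: Sequence[Scored], k: int,
                     decimals: int = 2) -> List[str]:
    """Rank by EC (descending); near-equal EC values fall back to LCCS."""
    ranked = sorted(
        candidates,
        key=lambda c: (-round(c.ec, decimals), -c.lccs),
    )
    return [c.name for c in ranked[:k]]


candidates = [
    Scored("N1", 0.801, 5),
    Scored("N2", 0.804, 9),   # EC ties with N1 after rounding; LCCS decides
    Scored("N3", 0.600, 12),
    Scored("N4", 0.950, 3),
]
d2 = top_k_by_ec_lccs(candidates, k=3)
print(d2)  # ['N4', 'N2', 'N1']
```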

Overlap and Weighted Reordering Algorithm
(1) Overlap coefficient. Because the data sets D_1 and D_2 differ, the Overlap is used to quantify the difference between them. As its definition shows, the more networks D_1 and D_2 have in common, the greater the Overlap value; a higher overlap between the Top k networks returned by the two methods indicates higher accuracy. Overlap reaches its maximum of 1 when the two methods return exactly the same Top k networks. If the Overlap value is below a threshold close to zero (the empirical value π = 0.1 is used in this paper), the query is considered to have failed: at least one of the two methods has an obvious error, and the query is abandoned.
(2) Weighted reordering algorithm. Once the Top k similar networks returned by the two algorithms satisfy the Overlap condition, the two Top k results must be merged into a single new ranking. This paper proposes a weighting scheme to reorder the similar Top k networks D = {N_1, N_2, ..., N_k}. A weight is set for each network N_i in the intersection of D_1 and D_2 and used to compute its reordering value Order_i, where i_D1 denotes the rank in the network set D_1 of the i-th network of the new sequence (and likewise i_D2 for D_2), and k_D1 denotes the size k of D_1 (and likewise k_D2 for D_2). The networks are sorted by their Order_i values from small to large, which gives the new sequence of similar networks.
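The two steps above can be sketched together, with loud caveats. Neither formula is reproduced in the text, so both are assumptions of this sketch: Overlap is taken to be the standard overlap coefficient |D_1 ∩ D_2| / min(|D_1|, |D_2|), and the reordering value is taken to be Order_i = i_D1 / k_D1 + i_D2 / k_D2, which is only a stand-in built from the symbols the text names. Treating an Overlap exactly equal to π as a pass is also an assumption.

```python
from typing import List, Optional, Sequence

PI_THRESHOLD = 0.1  # empirical threshold value stated in the text


def overlap_coefficient(d1: Sequence[str], d2: Sequence[str]) -> float:
    """Assumed definition: |D1 ∩ D2| / min(|D1|, |D2|)."""
    s1, s2 = set(d1), set(d2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))


def weighted_reorder(d1: Sequence[str], d2: Sequence[str]) -> List[str]:
    """Re-rank networks common to D1 and D2 by an assumed Order value."""
    rank1 = {name: pos for pos, name in enumerate(d1, start=1)}
    rank2 = {name: pos for pos, name in enumerate(d2, start=1)}
    k1, k2 = len(d1), len(d2)
    common = rank1.keys() & rank2.keys()
    # Assumed stand-in score: sum of rank positions normalised by list size.
    order = {name: rank1[name] / k1 + rank2[name] / k2 for name in common}
    # Smaller Order value means a better combined rank; name breaks ties.
    return sorted(common, key=lambda name: (order[name], name))


def combine(d1: Sequence[str], d2: Sequence[str]) -> Optional[List[str]]:
    """Return the merged ranking, or None when the query is abandoned."""
    if overlap_coefficient(d1, d2) < PI_THRESHOLD:
        return None
    return weighted_reorder(d1, d2)


d1 = ["N3", "N7", "N1", "N9"]  # ranked by cosine similarity (illustrative)
d2 = ["N7", "N2", "N3", "N5"]  # ranked by EC and LCCS (illustrative)
overlap = overlap_coefficient(d1, d2)
merged = combine(d1, d2)
print(overlap, merged)  # 0.5 ['N7', 'N3']
```

Here N7 scores 2/4 + 1/4 = 0.75 and N3 scores 1/4 + 3/4 = 1.0, so N7 is ranked first under the assumed score.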

The Data Source
This paper uses four real data sets. The first is a molecular structure data set taken from the NCI/NIH AIDS antiviral drug screening data (http://dtp.cancer.gov). The other three are biological pathway data sets that can be downloaded from the WikiPathways website: one is the Bos Taurus pathway data set and the other two are Homo Sapiens pathway data sets. Homo Sapiens I and II differ in how the network model is trained: for Homo Sapiens I, the query networks are selected at random from the data set during training, whereas Homo Sapiens II can train on at most 30 data sets. The numbers of vertices and edges of the networks used in the experiments are listed in table 1.

EC and LCCS Values of the Returned Results
Figures 4 and 5 show the EC and LCCS values computed between the query networks and the networks in the final returned results. For EC, on the AIDS data set the values for 2-node subnets differ little from those of the C-tree algorithm, while the EC values for 3-node and 4-node subnets are significantly higher than those of C-tree. On the Bos Taurus, Homo Sapiens I and Homo Sapiens II data sets, the EC values for 2-node, 3-node and 4-node subnets are all significantly higher than those of C-tree. For LCCS, on the AIDS and Homo Sapiens I data sets the 2-node value is similar to that of C-tree, and on Homo Sapiens II the 2-node value is slightly lower than that of C-tree; in all remaining cases the LCCS values are higher than those of C-tree. Overall, whether measured by EC or by LCCS, the similarity search performance of the proposed algorithm is better than that of the C-tree algorithm.

Average Query Time
This study also analyses query time; the results are shown in figure 6. The 2-node, 3-node and 4-node variants are each compared with the C-tree algorithm on the four data sets, and their results are better than those of C-tree. The proposed algorithm therefore not only improves the precision of network similarity search but also, to a certain extent, reduces its query time.

CONCLUSION
This paper uses cosine similarity and the combination of EC and LCCS values to obtain two Top k sets of similar networks, judges the two sets by their Overlap, and then merges them with a reverse weighted sorting algorithm. The algorithm is also compared with several other algorithms, and the following conclusions can be drawn: (1) the algorithm improves the accuracy of network search; (2) the algorithm reduces query time; (3) the algorithm avoids the situation in which traditional methods, limited by the query conditions, return an empty result for the query network, because it always returns the Top k networks by similarity.

FUNDING
The work described in this paper was partially supported by the National Key Research and Development Project (Project No. SQ2017YFNC050022-06) and the Hunan Education Department Scientific Research Project (Project Nos. 17K044 and 17A092).