An improved semantic similarity algorithm based on HowNet and CiLin

Abstract. This paper explores an improved method for calculating the semantic similarity of words that combines HowNet and CiLin. First, we design an improved sememe similarity calculation based on HowNet that comprehensively considers the influence of each part of the sememe description on the overall meaning, and we improve the HowNet-based word similarity by changing the specific calculation method for each part. At the same time, we adopt different strategies for the different results obtained in the CiLin similarity calculation. Experiments on the RG data set show that the Pearson coefficient of the improved method reaches 0.87.


Introduction
Word similarity calculation is one of the basic problems of natural language processing. Current research on resource integration can be roughly divided into two modes [1]: the first is the fusion of semantic description resources with statistics from large-scale corpora; the second is integration at the ontology level. In the following sections, after introducing the existing concepts of HowNet and CiLin, we first add the depth of the sememe and the density at the position of the sememe to the calculation of HowNet's sememe similarity. Second, we revise the way the similarity value of each part of a meaning is calculated in the HowNet similarity. Then we make different choices in different situations to calculate the word similarity of CiLin. Finally, we improve the weight distribution of the existing method for fusing HowNet and CiLin. The performance of the improved method is verified by the experiments at the end of the paper.

Introduction to HowNet
HowNet mainly involves concepts such as the sememe and semantic similarity. The sememe is the atomic concept used to explain meanings, and it is the most basic unit of HowNet. A meaning can be understood as a concept that is used to explain a word, and a word can have multiple meanings. The semantic expression (DEF) is the main body of a meaning item; it is composed of sememes combined with knowledge description symbols and is used to explain the meaning item. The basic data classification in HowNet is shown in Fig.1.

Introduction to CiLin
CiLin is a computable Chinese lexicon used to divide and categorize Chinese synonyms and similar words. It has been expanded by the Information Retrieval Research Laboratory of Harbin Institute of Technology, and the expanded version has a five-layer tree structure, as shown in Fig.2. The first layer consists of large classes, 12 in total according to concept category, coded A~L. The second layer consists of medium classes, 95 in total, each coded by its large-class code followed by a lowercase letter. The third layer consists of small classes, each represented by its medium-class code followed by a two-digit decimal code. The fourth layer is the word group, which corresponds to a paragraph within a small class. The fifth layer is the atomic word group, which corresponds to a line within a paragraph. CiLin is stored as text. Each atomic word group is one line, starting with an 8-character code that runs from the large class down to the atomic word group, followed by one or more words that express the concept.
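The layered coding described above can be illustrated with a small parser. This is only a minimal sketch: the text specifies the first three layers (one uppercase letter, one lowercase letter, two digits), while the layout assumed here for layers four and five (one letter, then two digits, then one trailing marker character), the sample code "Aa01A01=" and the function names are assumptions made for illustration.

```python
from typing import List

# Cumulative prefix lengths for the five layers of an 8-character code.
# Layers 1-3 follow the description in the text; the split of layers 4-5
# (one letter, then two digits) and the trailing marker are assumptions.
LAYER_ENDS = [1, 2, 4, 5, 7]


def split_code(code: str) -> List[str]:
    """Return the code prefix that identifies the node at each of the five layers."""
    if len(code) != 8:
        raise ValueError("expected an 8-character CiLin code")
    return [code[:end] for end in LAYER_ENDS]


def common_layer(code_a: str, code_b: str) -> int:
    """Count how many layers two codes share from the top (0 means only the root)."""
    shared = 0
    for prefix_a, prefix_b in zip(split_code(code_a), split_code(code_b)):
        if prefix_a != prefix_b:
            break
        shared += 1
    return shared
```

Two words whose codes share more layers sit closer together in the tree, which is the structural information the CiLin similarity relies on.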

Sememe similarity calculation based on HowNet
Liu Qun and Li Sujian [2] considered the similarity of two words to be the possibility that they can replace each other in different contexts without changing the syntactic and semantic structure of the text, and proposed the formula:
$$Sim(p_1, p_2) = \frac{\alpha}{dis(p_1, p_2) + \alpha} \qquad (1)$$
In the formula, p1 and p2 represent two sememes, dis(p1,p2) represents the distance between the two sememes, and the adjustment parameter α represents the path length at which the similarity is 0.5. Li Lei [3] and Zhu Xinhua [4] improved the edge weight formula, given as Equation (2). In that formula, i(p,q) represents the path between the sememe nodes p and q, where p is the current node and q is the parent of p; θ is an adjustment parameter, set to 4 here; Max represents the total number of sememe nodes in the sememe tree; c1 is 0.7 and c2 is 0.3.
From the edge weight formula in Equation (2), the distance formula for the sememe nodes p1 and p2 is obtained, where G represents the common parent of the two sememes. We substitute this distance into formula (1) to find the similarity between p1 and p2.
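The last step, turning a sememe distance into a similarity, can be sketched as follows. The sketch assumes the formula of Liu Qun and Li Sujian in the form α / (dis + α), which matches the statement that α is the distance at which the similarity equals 0.5. The default α = 1.6 is the value this paper reports later. The distance is taken as an input because the weighted-edge distance formula is not reproduced here, and the function name is hypothetical.

```python
def sememe_similarity(distance: float, alpha: float = 1.6) -> float:
    """Map a sememe distance to a similarity in (0, 1] as alpha / (distance + alpha)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha / (distance + alpha)
```

With this form, identical sememes (distance 0) have similarity 1, and the similarity decreases monotonically toward 0 as the distance grows.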

Calculation and improvement of word similarity in HowNet
For two words W1 and W2, assume that W1 has m meanings S11, S12, S13, ..., S1m and W2 has n meanings S21, S22, S23, ..., S2n. The similarity between the two words is reduced to the similarity between their meanings, and the maximum similarity over all pairs of meanings is taken as the similarity of W1 and W2.
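Written out with the symbols used above, and taking Sim(S1i, S2j) as a name for the similarity between two meanings (a notational assumption), the rule reads:

```latex
\[
Sim(W_1, W_2) = \max_{1 \le i \le m,\; 1 \le j \le n} Sim(S_{1i}, S_{2j})
\]
```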
According to Liu Qun [2] , we derive the semantic similarity between the meanings.
In this paper, the improved calculation method for each of the four parts of the sememe description proceeds as follows:
a) Set the similarity value of the specified part to -1 and traverse the sememes of S2.
b) Match the specified part of S1 and S2 (for example, the specified part is the symbolic sememe part).
c) If the specified parts are identical, the similarity is 1. If the specified parts are different, the partial similarity is calculated by the formula proposed in 3.1. If one of the two is a specific word, the similarity is directly assigned γ. Take the maximum value in this step.
d) Repeat (b) and (c) until all sememes of the specified part have been matched. If the maximum value is still -1 at this time, a similarity value is assigned to this part.
e) Determine the length of the specified part of S1 and S2. If S2 is longer than S1, we multiply the excess unmatched part of S2.
According to the contribution of each part of the sememe description to the word semantics, the weights satisfy β1+β2+β3+β4=1 and β1>=β2>=β3>=β4. In this paper, β1 is 0.5, β2 is 0.2, β3 is 0.17, β4 is 0.13, α is 1.6, γ is 0.2 and δ is 0.2.
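How the four partial similarities might be combined with the weights β1 to β4 can be sketched as follows. The combination rule is an assumption: the sketch uses the rule of Liu Qun and Li Sujian [2], in which the i-th term is βi times the product of the first i partial similarities, so that a low similarity in a more important part also dampens the contribution of the later parts. The paper's own combination formula is not reproduced in this section, and the function name is hypothetical.

```python
from typing import Sequence

# beta1..beta4 as given in the text.
BETAS = (0.5, 0.2, 0.17, 0.13)


def meaning_similarity(part_sims: Sequence[float],
                       betas: Sequence[float] = BETAS) -> float:
    """Combine per-part similarities as sum_i beta_i * prod_{j<=i} part_sims[j]."""
    if len(part_sims) != len(betas):
        raise ValueError("need one partial similarity per weight")
    total = 0.0
    running_product = 1.0
    for beta, sim in zip(betas, part_sims):
        running_product *= sim  # product of the first i partial similarities
        total += beta * running_product
    return total
```

For example, partial similarities of (1.0, 0.5, 1.0, 1.0) give 0.5 + 0.2·0.5 + 0.17·0.5 + 0.13·0.5 = 0.75 under this assumed rule.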

Calculation and improvement of the similarity of CiLin
For CiLin, this paper draws on the information-content-based similarity calculation of Peng Qi [12] and on the definition of information content given by Seco [5] for WordNet.
$$IC(C) = 1 - \frac{\log(hypo(C) + 1)}{\log(maxnode)} \qquad (6)$$
where hypo(C) represents the number of lower nodes of concept C and maxnode represents the total number of nodes.
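The information content can be computed directly from the two counts named above. This is a minimal sketch assuming Seco's definition, IC(C) = 1 - log(hypo(C) + 1) / log(maxnode); the function and parameter names are hypothetical.

```python
import math


def information_content(hypo_count: int, max_nodes: int) -> float:
    """Seco-style information content from the number of lower nodes of a concept."""
    if max_nodes < 2:
        raise ValueError("max_nodes must be at least 2")
    if not 0 <= hypo_count < max_nodes:
        raise ValueError("hypo_count must lie in [0, max_nodes)")
    return 1.0 - math.log(hypo_count + 1) / math.log(max_nodes)
```

Under this definition a leaf node (no lower nodes) has the largest information content, 1, and a node that dominates nearly the whole tree has an information content close to 0.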
First, this paper improves the similarity calculation of CiLin. From equation (7), the difference between two concepts that are identical or synonymous takes its minimum, 0, and the similarity is 1. When the two concepts are leaf nodes and their nearest common parent is the root node, the two concepts differ the most. According to formula (6), the number of lower nodes of the root node is approximately the total number of nodes, so the information content of the root node is about 0. The number of lower nodes of a leaf node is 0, so the information content of a leaf node is 1, and the similarity is 0. When the average similarity is greater than a small constant α, each pair of nodes has a certain correlation, but the correlation is not high. The following formula is proposed for the similarity. In the formula, maxdis represents the maximum value of dis(C1,C2), and len represents the number of groups in which the word is included. In this paper α is 0.2. When the maximum similarity is greater than the constant β, indicating that the similarity of one pair of nodes reaches a high level, the word similarity is directly defined as the maximum similarity. If neither condition holds, the minimum similarity is defined as the word similarity. Finally, the formula for the similarity calculation is given below.

Word similarity calculation combining HowNet and CiLin
We consider both the HowNet and the CiLin word similarity: for two words W1 and W2 we calculate the HowNet similarity s1 and the CiLin similarity s2, and assign the weight λ1 to the HowNet similarity and the weight λ2 to the CiLin similarity. The comprehensive similarity is then calculated from the two weighted similarities. According to which resource contains each of the two words, the following situations are distinguished:
(1) If both words are included only in HowNet, the HowNet similarity is used: λ1=1, λ2=0.
(3) If one word is included only in CiLin and the other word is included only in HowNet, we look for synonyms in CiLin. If there is no synonym, we record the CiLin similarity as 0.2. If there is a synonym, we calculate the HowNet similarity using that synonym.
(4) If one word is included only in HowNet and the other word is included in both HowNet and CiLin, we look for synonyms in CiLin. If there is no synonym, the similarity is determined by the HowNet similarity. If there are synonyms, the highest similarity found over the synonyms serves as the CiLin similarity.
(5) If one word is included only in CiLin and the other word is included in both HowNet and CiLin, we look for synonyms of the word contained only in CiLin. If there is no synonym, the similarity is determined by the CiLin similarity. If there are synonyms, the maximum similarity found over the synonyms serves as the HowNet similarity.
(6) If the words are included neither in HowNet nor in CiLin, the similarity cannot be calculated.
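The weighting step can be sketched as below. Several points are assumptions rather than statements of the paper: that the comprehensive similarity is the weighted sum λ1·s1 + λ2·s2, that the two weights sum to 1 (as all weight pairs reported in the experiments do), and that a word pair covered only by CiLin falls back to the CiLin similarity by symmetry with case (1). The synonym lookups of cases (3) to (5) are left out, and the function name is hypothetical.

```python
from typing import Optional


def combined_similarity(s1: Optional[float], s2: Optional[float],
                        lam1: float, lam2: float) -> Optional[float]:
    """Fuse the HowNet similarity s1 and the CiLin similarity s2.

    None marks a similarity that is unavailable for the word pair.
    """
    if s1 is None and s2 is None:
        return None  # case (6): the similarity cannot be calculated
    if s2 is None:
        return s1  # case (1): lambda1 = 1, lambda2 = 0
    if s1 is None:
        return s2  # assumed symmetric counterpart of case (1)
    if abs(lam1 + lam2 - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return lam1 * s1 + lam2 * s2
```

Keeping the weights as parameters lets the same function serve both the case where both words appear in both resources and the mixed cases, which the experiments tune separately.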

Determination of the value of λ1 and λ2 weighting factors
This paper uses the widely used data set of 30 word pairs published by Miller & Charles (MC) [6] and the word data set published by Rubenstein & Goodenough (RG) [7] as test cases. When both words are included in HowNet and CiLin, 4 different combinations of the similarity weights λ1 and λ2 are tested on the 30 word pairs of the MC30 data set, and the resulting Pearson values are shown in Table 1.
From Table 1, we can see that when both words are included in HowNet and CiLin, the Pearson value is highest when the weight of the CiLin similarity is 0.9 and the weight of the HowNet similarity is 0.1. When one word is included only in CiLin or only in HowNet and the other word is included in both, we set 4 different weight combinations and compare the calculated word similarity with the manual evaluation values of RG65. We choose the weights that are most in line with the actual situation. The experimental results are shown in Table 2.
In Table 2, when λ1 and λ2 are (0.7, 0.3) and (0.6, 0.4), the similarity between the words "pillow" and "pillow" differs greatly from the manual evaluation value. When λ1 and λ2 are (0.4, 0.6) and (0.5, 0.5), they differ from the manual evaluation value by 0.004; however, for other word pairs such as "mountain" and "slope", the word similarity of (0.4, 0.6) reaches 0.558, while that of (0.5, 0.5) differs considerably from the manual evaluation value. Therefore, when the words are not both included in CiLin and HowNet, λ1 = 0.4 and λ2 = 0.6 give the better word similarity.
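The Pearson values used to compare weight combinations measure the linear correlation between the computed similarities and the manual evaluation values. A minimal sketch of that computation, using only the standard library; the function name is hypothetical, and no values from the MC or RG data sets are reproduced here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

To select weights, one would compute the fused similarity of every word pair for each candidate (λ1, λ2) and keep the combination whose scores correlate best with the manual ratings.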

Comparative experiment
The comparison between Table 3 and Table 4 shows that the Pearson coefficient of the word similarity obtained by fusing HowNet and CiLin is better than that of the other experiments. The fused method better reflects the differences between words, and the obtained word similarity is more reasonable.

Conclusion
This paper proposes an improved semantic similarity method based on HowNet and CiLin. We combine CiLin and HowNet to make full use of the information content of words in different knowledge bases, so that each knowledge base complements what the other is missing, corrects its rough points and eases the limitations of the knowledge description language. However, the similarity of some words in the experiment is still not ideal. The HowNet similarity can still be improved, and the fusion of the two resources can be considered more carefully, so that the computed similarity has higher reliability.