High-order neural network in entity relationship extraction

In this paper, a high-order neural network is proposed for extracting entity relations from natural language. In this network, different parameters absorb non-overlapping information from separate portions of the data, which makes the parameters easier to interpret. The network can alleviate the overfitting problem to some degree. On the task of entity relation extraction, it gives results no worse than current methods.


Introduction
Information extraction is an important task, especially in the field of natural language processing (NLP). Extracting entity relations from natural language is a significant part of information extraction. Recently, progress in the performance of relation extraction models appears to have hit a bottleneck. The end-to-end relation extraction method was rediscovered and combined with LSTM to improve performance [1], showing performance comparable to CNN-based models on nominal relation classification (SemEval-2010 Task 8). This end-to-end model proved effective in SemEval-2017 Task 10 [2], yet the same bottleneck still seems to exist. Many causes are possible, such as inappropriate representation of entities and relations, inaccurate entity recognition, or overfitting of the models. Neural relation extraction with selective attention over instances [3] is a well-performing model for this task. Its trained model describes the training set well, with a precision above 95%, but achieves a precision below 80% on the test set. Precision and recall on the test set do not rise together even though the model describes the training set almost perfectly, which indicates severe overfitting.
As a traditional task, relation extraction has been studied for decades by many researchers, and the arrival of deep neural networks (DNNs) [4], especially approaches based on convolutional neural networks [5], divides these studies into two stages. Before DNNs, hand-built patterns [6] and handcrafted features [7] were the popular approaches; they can be very complicated, but they can be understood. DNN-based models generally work well on relation extraction tasks, but many valuable traditional ideas were abandoned along the way. Approaches based on LSTM [1] and CNN [3] do not seem to combine flexibly with methods based on traditional handcrafted features, which is a great loss. This happens easily when the parameters of DNN-based approaches are difficult to understand.
In this paper, we propose a neural model whose parameters can largely be explained, and which improves relation extraction performance through its inherent ability to resist overfitting. Unlike a traditional neural network, the network in this paper has two parts (not layers), which are called orders in the remainder of this paper. The first order employs a CNN-based neural network. In the second order, the parameters of the first order serve as input for training new parameters, under the supervision of the first order. Higher orders are not constructed in the experiments because they are very time-consuming, but they are easy to construct according to the theory. Once each order is constructed and fully trained, the parameters in different orders acquire different meanings. Moreover, these meanings are stochastically uncorrelated with one another. With this method, each group of parameters in each order can describe a different feature, although not necessarily a feature we are familiar with. Meanwhile, the model has a built-in ability to resist overfitting, because the higher orders can always supervise the training of the lower orders, much like human introspection.
Figure 1 shows the structure of the second-order neural network. The lower part of the figure depicts the training process of the first order. Once the parameters of the first-order network are trained, the record of parameter changes is fed into the second-order network. The output of the second-order network is the kernel of the first-order network, and it updates the first-order network. When testing the model, we use only the first-order network with the kernel output by the second-order network.
The experimental results show a real improvement in relation extraction once a second order is constructed, even with the simplest mean method. There is a further improvement when a more complex model is employed in the second order. Moreover, the proposed model still performs better when the training dataset is reduced, which may be significant for cross-sentence N-ary relation extraction [8] or other tasks with small training datasets.
The contributions of this paper can be summarized as follows. An approach whose parameters can be explained is proposed, and it works well on the relation extraction task. A new way of overcoming the overfitting problem is presented. The proposed model is highly extensible and portable.

First-order Neural Network
For a given task, denote the training data by $D$ and the whole neural network simply by $M(\cdot)$. The output of the network is then
$$M(D; \theta),$$
where $\theta$ denotes all the parameters.
In the training stage, an optimization objective (cost function) is used to steer the parameters in the desired direction. Denoting this function by $J(\cdot)$, our goal is to find the best parameter $\theta^{*}$:
$$\theta^{*} = \arg\min_{\theta} J(M(D; \theta)).$$
Once we obtain $\theta^{*}$, training is finished.
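The training goal described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy linear `model` stands in for the network M(.), mean squared error stands in for the cost function, and plain gradient descent stands in for whatever optimizer the experiments actually use. The sketch also keeps every intermediate parameter value, since later orders consume the record of parameter changes.

```python
def model(x, theta):
    """Toy stand-in for M(.): a line with slope theta[0] and intercept theta[1]."""
    return theta[0] * x + theta[1]


def cost(data, theta):
    """Mean squared error over (x, y) pairs, standing in for the cost function."""
    return sum((model(x, theta) - y) ** 2 for x, y in data) / len(data)


def train(data, theta, lr=0.05, steps=500):
    """Plain gradient descent; returns the final parameters and every snapshot."""
    history = [tuple(theta)]
    for _ in range(steps):
        # Analytic gradients of the mean squared error for the toy linear model.
        g0 = sum(2 * (model(x, theta) - y) * x for x, y in data) / len(data)
        g1 = sum(2 * (model(x, theta) - y) for x, y in data) / len(data)
        theta = (theta[0] - lr * g0, theta[1] - lr * g1)
        history.append(theta)
    return theta, history


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # samples of y = 2x + 1
theta_best, history = train(data, (0.0, 0.0))
```

The returned `history` is the kind of parameter record that a higher order could take as input.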

Second-order Neural Network
Usually a 1st-order neural network can satisfy the task, but for a complex task it is so simple that we have to stack too many layers. Let us make a change.
For a given task, the training data is split into two parts, $D_1$ and $D_2$. Denote a neural network simply by $M_1(\cdot)$; its output is $M_1(D_1; \theta(t))$, where $\theta(t)$ denotes all the parameters. As before, a cost function $J(\cdot)$ guides the training steps in the training stage, and our goal is to find the best parameter $\theta(t_o)$:
$$\theta(t_o) = \arg\min_{\theta(t)} J(M_1(D_1; \theta(t))).$$
While $\theta(t_o)$ is the best parameter for $M_1(\cdot)$, the best parameter for the whole task, denoted $\theta^{*}$, is expressed through coefficients $\alpha_i$ and vectors $e_i$, where $\alpha_i$ is the component coefficient of $e_i$, and $e_1, e_2, \ldots, e_n$ are a group of linearly independent vectors spanning the whole space that $\theta(t_o)$ can occupy. This group of vectors describes what the model cannot learn from the training data $D_1$, and the value of the parameter $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is learned from $D_2$. Although it is not always convenient to find such a group of vectors $e_1, e_2, \ldots, e_n$, we can always fit them, which is something all machine learning methods, including neural networks, do well. Suppose $\theta(t)$ is initialized with $\theta(0)$ at the start of training and stops at $\theta(t_o)$. Because $\theta(0)$ is random, $\theta(t)$ can be treated as a stochastic process with $t \in [0, t_o]$.
For regularity, let $t_0 = 0$ and $\theta_1 = \theta(0)$. The independence of the elements of $\{\theta_i - \theta_{i-1}\}$ then ensures that such an $\alpha$ exists with high probability.
Once $M_2(\cdot)$ is trained, an approximation of $\theta^{*}$ can be computed with it.
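The interplay of the two orders can be sketched on a toy problem. This is a minimal illustration under loudly stated assumptions, not the paper's implementation: the model is a single scalar slope, the names `d1` and `d2` stand for the two parts of the training data, and the second order is assumed to re-weight the recorded parameter increments with coefficients `alpha` fitted on the second part. That particular combination rule (start point plus weighted increments) is a hypothetical choice made only so the sketch is runnable.

```python
def cost(data, theta):
    """Mean squared error of the toy model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(data, theta):
    """Derivative of the cost with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in data) / len(data)


def first_order(d1, theta0, lr=0.05, steps=20):
    """Train on the first data part and record every parameter snapshot."""
    snapshots = [theta0]
    for _ in range(steps):
        snapshots.append(snapshots[-1] - lr * grad(d1, snapshots[-1]))
    return snapshots


def combine(theta0, deltas, alpha):
    """Assumed combination rule: start point plus re-weighted increments."""
    return theta0 + sum(a * d for a, d in zip(alpha, deltas))


def second_order(d2, snapshots, lr=0.2, steps=200):
    """Fit the coefficients alpha on the second data part (illustrative only)."""
    deltas = [b - a for a, b in zip(snapshots, snapshots[1:])]
    alpha = [1.0] * len(deltas)  # alpha = 1 reproduces the first-order result
    for _ in range(steps):
        g = grad(d2, combine(snapshots[0], deltas, alpha))
        # Chain rule: d(cost)/d(alpha_i) = d(cost)/d(theta) * delta_i.
        alpha = [a - lr * g * d for a, d in zip(alpha, deltas)]
    return combine(snapshots[0], deltas, alpha)


d1 = [(1.0, 2.0), (2.0, 4.0)]  # first part: slope 2.0
d2 = [(1.0, 2.2), (2.0, 4.4)]  # second part: slope 2.2
snapshots = first_order(d1, theta0=0.0)
theta_first = snapshots[-1]
theta_second = second_order(d2, snapshots)
```

In this toy setting the first order alone settles near the slope of `d1`, while the re-weighted result also reflects `d2`, which the first order never saw.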

Higher order Neural Network
Although a higher-order neural network may consume a great deal of computing resources, it is worthwhile when the complexity of each single order is reduced. In addition, the meaning or feature of the parameters in each order is related to different data, which makes it clearer how to solve different problems.
Going from the 2nd-order neural network to the Nth-order neural network is a natural extension. The training data is split manually into N packages $D_1, D_2, \ldots, D_N$, and N first-order neural networks are chosen, denoted $M_1(\cdot), M_2(\cdot), \ldots, M_N(\cdot)$ respectively. The parameters of $M_i(\cdot)$ are denoted $\theta_i$. With $J(\cdot)$ as the cost function, the best value $\theta_i^{*}$ of $\theta_i$ is obtained order by order, where, for $j = 2, \ldots, i$, the training data are those produced by $M_{j-1}(\cdot)$.
This iteration has a great advantage: different parameters are trained on different data, after which they can collaborate with one another efficiently. This property makes it possible for people to understand the meaning of the parameters. Moreover, the parameters of the ith order learn only the knowledge that has not been learnt by the first i-1 orders, because information that has already been captured produces no gradient for the higher order. As a bonus, the approach is compatible with other machine learning models, which means any order of the model may be replaced by a classic machine learning method. Although the cost does not grow too quickly, because the training data is split into smaller packages, a network of very high order is still not recommended.
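As a small illustration of the data-splitting step, the sketch below cuts a list of training samples into N contiguous, near-equal packages, one per order. The helper name `split_into_packages` and the contiguous, near-equal policy are assumptions; the text only says that the split is done manually.

```python
def split_into_packages(samples, n_orders):
    """Split samples into n_orders contiguous packages of near-equal size."""
    if n_orders < 1:
        raise ValueError("n_orders must be at least 1")
    base, extra = divmod(len(samples), n_orders)
    packages, start = [], 0
    for i in range(n_orders):
        # The first `extra` packages receive one additional sample each.
        size = base + (1 if i < extra else 0)
        packages.append(samples[start:start + size])
        start += size
    return packages


packages = split_into_packages(list(range(10)), 3)
```

Each package would then feed one order, so no sample is used by more than one order.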

Experiments
Our experiments are intended to demonstrate that the proposed model can alleviate the overfitting problem and achieve better performance.

Dataset and Evaluation Metrics
The model is evaluated on the dataset developed by Riedel et al. in 2010 [9], which was also used by Hoffmann [10], Surdeanu [11] and Lin [3]. The dataset was created by aligning Freebase relations with the New York Times (NYT) corpus. Entity mentions are found using the Stanford entity tagger [12]. To demonstrate effectiveness, we separate the dataset into two parts: the part produced before 2007 is used for training, and the part produced in 2007 is used for testing. The model is evaluated using the precision-recall curve, which is commonly used in entity relation extraction tasks.
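A precision-recall curve of the kind used for this evaluation can be computed as in the following sketch. It assumes the usual ranking-based procedure: predicted relation facts are sorted by confidence, and precision and recall are measured after each prediction. The function name, the input layout and the toy scores are illustrative assumptions, not the evaluation script used in the experiments.

```python
def precision_recall_curve(scored, total_positives):
    """Return (recall, precision) points from (confidence, is_correct) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    curve, correct = [], 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct += 1
        # Precision over the top `rank` predictions, recall over all true facts.
        curve.append((correct / total_positives, correct / rank))
    return curve


predictions = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
curve = precision_recall_curve(predictions, total_positives=4)
```

Here `total_positives` is the number of true relation facts in the test set, which may exceed the number of correct predictions the model ever makes, so recall need not reach 1.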

First order settings
The first order employs the CNN+ATT model proposed by Lin et al. in 2016 [3] and inherits its parameter settings and preprocessed data, which include the word embeddings produced by the word2vec tool and the relation instance sentences. Words appearing fewer than 100 times are deleted, and multi-word entity mentions are concatenated.
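The two preprocessing steps mentioned above can be sketched as follows. This is only an illustration: the experiments inherit already preprocessed data, so the helper names, the underscore used to join a multi-word mention, and the whitespace tokenization are all assumptions. The toy example uses a threshold of 2 instead of 100 so that the effect is visible on two sentences.

```python
from collections import Counter


def join_mentions(sentence, mentions):
    """Concatenate each multi-word entity mention into a single token."""
    # Longer mentions first, so a short mention cannot split a longer one.
    for mention in sorted(mentions, key=len, reverse=True):
        sentence = sentence.replace(mention, mention.replace(" ", "_"))
    return sentence


def drop_rare_words(token_lists, min_count):
    """Delete every token that occurs fewer than min_count times overall."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[t for t in toks if counts[t] >= min_count] for toks in token_lists]


sentences = ["Barack Obama was born in Hawaii", "Barack Obama visited New York"]
mentions = ["Barack Obama", "New York"]
tokens = [join_mentions(s, mentions).split() for s in sentences]
filtered = drop_rare_words(tokens, min_count=2)
```

With the threshold from the text, `min_count=100` would be passed instead.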

Second order settings
In this experiment, three different models are employed in the second order, one at a time, so that we can compare their performance.
Mean model. The output $\theta^{*}$ is the mean of the recorded parameters $\theta_i$:
$$\theta^{*} = \frac{1}{n-d+1} \sum_{i=d}^{n} \theta_i.$$
Here we only consider the last $(n-d+1)$ terms of $\theta_i$, because the parameters are very random at the start of training.
Max model. The output takes the maximum in each dimension:
$$\theta^{*}_k = \max_{d \le i \le n} \theta_{i,k}.$$
Here $\theta^{*}_k$ and $\theta_{i,k}$ are the $k$-th dimension of $\theta^{*}$ and $\theta_i$ respectively. Again we only consider the last $(n-d+1)$ terms of $\theta_i$, because the parameters are very random at the start of training.
CNN model. Because of the limits of our experimental conditions, we employ the simplest CNN, which has only a single two-dimensional kernel and one convolutional layer. As shown in Figure 2, taking the sequence $\{\theta_i\}$ as input, $\theta^{*}$ is generated through a convolutional layer, a max pooling layer and a non-linear layer. In particular, the convolution produces a matrix of the same size as the input layer, which means the kernel can run out of range. When that happens, all out-of-range inputs are treated as zero.
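The second-order variants can be sketched in a few lines. The sketch assumes that the first order records n parameter snapshots, each a list of numbers, and that only snapshots d to n (1-based) are kept. The function names are illustrative, and `conv_same` is a one-dimensional simplification that shows only the zero-for-out-of-range rule, not the actual CNN used in the experiments.

```python
def tail(snapshots, d):
    """Keep snapshots d..n (1-based), that is, the last n-d+1 of them."""
    return snapshots[d - 1:]


def mean_model(snapshots, d):
    """Average each parameter dimension over the kept snapshots."""
    kept = tail(snapshots, d)
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_model(snapshots, d):
    """Take the maximum of each parameter dimension over the kept snapshots."""
    kept = tail(snapshots, d)
    return [max(column) for column in zip(*kept)]


def conv_same(sequence, kernel):
    """1-D sliding-window sum with output the same length as the input.

    Positions that fall outside the sequence are treated as zero.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(sequence)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < len(sequence):
                acc += weight * sequence[pos]
        out.append(acc)
    return out


snapshots = [[9.0, -9.0], [1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]  # n = 4 snapshots
theta_mean = mean_model(snapshots, d=2)  # the noisy first snapshot is ignored
theta_max = max_model(snapshots, d=2)
smoothed = conv_same([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Dropping the first d-1 snapshots keeps the noisy early part of training out of both aggregates.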

Effect of HNN
As shown by the precision-recall curves in Figure 3, the HNN models significantly outperform the BGRU+ATT model over the entire range of recall. HNN_MAX and HNN_MEAN have the same performance, and HNN_CNN is even better. Compared with BGRU+ATT, HNN_CNN improves by more than 3% at each recall level. At low recall the precision is impressive, but it drops quickly as recall grows. When recall reaches approximately 0.32, the precision of every model is below 50%.
Figure 4 shows the results of the proposed model trained on half of the data used for the models in Figure 3. The gap between HNN_CNN and BGRU+ATT is larger than in Figure 3, which means that the performance of HNN_CNN degrades no more than that of BGRU+ATT. That is to say, the overfitting problem is indeed alleviated to some degree when HNN is employed.

Further explanation
When people think about a question, they seldom reach the best answer on the first attempt. However, when they go back and rethink the same question, they can improve greatly. It is worth mentioning that the rethinking process always follows the original line of thought. The second order of the proposed model is equipped with this ability. When finding the relationship between target entities, a model is easily affected by the various modifiers in a sentence, for instance appositives and parentheses. As a result, the model easily ends in a locally optimal solution. The second-order network examines each optimization step while the first-order network is being trained, which lets the whole model achieve a better result.

Conclusion and future works
In this paper, we develop a new neural network, HNN, for the task of relation extraction in NLP. The experimental results suggest that our model significantly outperforms the best CNN model on relation extraction, which means the overfitting problem is alleviated.
In the future, we will explore the following directions. Our model is very time-consuming, although it achieves better performance; we will optimize its structure and computation to alleviate this problem. We also believe there is a good chance that higher-order neural networks can perform better, and we will develop them if we can find a method that alleviates the time cost.
Figure 3: Performance comparison of each model trained with the whole dataset.
Figure 4: Performance comparison of each model trained with half of the dataset.
$M_1(\cdot)$ with $\theta(t_o)$ is not the best model for the whole task, because the training data $D_2$ makes no contribution to $\theta(t_o)$; $\theta(t_o)$ only tells what the model can learn from $D_1$. The best parameter, however, can still be given an expression, denoted $\theta^{*}$. For $M_2(\cdot)$ to collaborate with $M_1(\cdot)$ efficiently, $D_2$ should supervise the training result of $M_2(\cdot)$ through the intermediary $M_1(\cdot)$, and $M_1(\cdot)$ should produce enough training data for $M_2(\cdot)$ by choosing different initialization values $\theta(0)$.