Chinese named entity recognition model based on BERT

Nowadays, most deep learning models ignore Chinese word-formation habits and global document information when processing Chinese tasks. To address this problem, we construct the BERT-BiLSTM-Attention-CRF model. In the model, we embed a BERT pre-trained language model that adopts the Whole Word Mask strategy, and we add a document-level attention mechanism. Experimental results show that our method achieves good results on the MSRA corpus, with an F1 of 95.00%.


Introduction
In 1991, a paper on company name recognition was published at the IEEE Conference on Artificial Intelligence [1]. Since then, named entity recognition has developed as a branch of natural language processing research. Early named entity recognition was mainly based on dictionaries [2] and rule-based matching [3]. It was later dominated by machine learning methods, including the Hidden Markov Model, the Maximum Entropy Model, the Support Vector Machine, and the Conditional Random Field.
Nowadays, computer performance has improved continuously, and named entity recognition based on deep learning has therefore become a research hotspot in natural language processing.
In recent years, related research has improved named entity recognition models. Collobert et al. [4] proposed using convolutional neural networks for named entity recognition. Huang et al. [5] proposed a bidirectional long short-term memory network combined with hand-crafted features, reaching an F1 of 84.83% on the CoNLL-2003 dataset. Ma et al. [6] added a convolutional neural network to extract character-level representations of words, raising the F1 to 91.21%. Luo et al. [7] used the BiLSTM-CRF model combined with an attention mechanism, reaching an F1 of 91.14% on the BioCreative IV dataset. Wu et al. [8] proposed joint word segmentation training with the CNN-BiLSTM-CRF model while processing samples with pseudo-labels, further improving entity recognition performance. Peters et al. [9] used BiLSTM to generate contextual representations from pre-trained character-level language models and achieved good results. Straková et al. [10] applied the BERT pre-trained model to named entity recognition, reaching an F1 of 93.38% on the CoNLL-2003 English dataset.

Our framework
Our model framework is BERT-BiLSTM-Attention-CRF, which consists of a BERT embedding layer, a bidirectional long short-term memory network layer, an attention layer, and a conditional random field layer. The structure of the model is shown in Figure 1.

BERT model
BERT is a general pre-trained language model proposed by the Google AI team in 2018 [11]. It represents words as vectors so that the similarity between words can be measured. BERT uses a bidirectional Transformer as its encoder, so that the prediction of each word can draw on contextual information. The model also introduces a "masked language model" task and a "next sentence prediction" task to capture word-level and sentence-level feature representations. The masked language model masks 15% of the tokens in the corpus so that the model learns a representation for each word. The BERT model is shown in Figure 2. Unlike the original BERT, which uses the Single Word Mask (SWM) [12], we use the Whole Word Mask (WWM) [13], which accords with Chinese word-formation habits, when pre-processing sentences.
In the Whole Word Mask, if any character of a word is masked, the remaining characters belonging to that word are masked as well, as shown in Table 1.
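The masking rule can be sketched as follows. This is a minimal illustration of the whole-word constraint, not the authors' implementation: the function name, the per-word masking decision, and the way word boundaries are passed in (`word_spans`) are all assumptions for exposition.

```python
import random

def whole_word_mask(tokens, word_spans, mask_rate=0.15, mask_token="[MASK]"):
    """Illustrative Whole Word Mask: when a word is selected for masking,
    every character belonging to that word is masked together.
    `word_spans` lists (start, end) character indices for each word."""
    masked = list(tokens)
    for start, end in word_spans:
        if random.random() < mask_rate:          # decide per word, not per character
            for i in range(start, end):
                masked[i] = mask_token           # mask the whole word
    return masked
```

Because the decision is made per word, a word is always either fully masked or left intact, which is the property distinguishing WWM from single-character masking.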

BiLSTM model
LSTM (Long Short-Term Memory) is an improved model of the Recurrent Neural Network (RNN). It alleviates the gradient explosion and gradient vanishing problems that occur when an RNN processes long sequences. LSTM adds a memory cell together with a forget gate, an input gate, and an output gate [14], alleviating the forgetting problem on long sequences. The structure is shown in Figure 3. The LSTM calculation formulas are as follows:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

Here $f_t$ is the forget gate, which decides what information to discard from the current cell state; $i_t$ is the input gate, which decides how much of the newly acquired information is used to update the state; $o_t$ is the output gate, which determines how much information enters the hidden state variable $h_t$; $W_f$, $W_i$, $W_o$ and $U_f$, $U_i$, $U_o$ are their weights; $b_f$, $b_i$, $b_o$ are their biases; $\sigma$ is the sigmoid activation function; $\tanh$ is the hyperbolic tangent activation function.
Since a unidirectional LSTM can only learn the preceding context and cannot use the following context, it limits entity recognition performance. BiLSTM (Bidirectional Long Short-Term Memory), proposed by Graves et al. [15], consists of a forward LSTM and a backward LSTM. The basic idea is to propagate each input sequence both forward and backward, and then concatenate the two outputs at each time step. The structure is shown in Figure 4.
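The forward/backward propagation and per-step concatenation can be sketched in plain NumPy. This is a pedagogical sketch under assumed conventions (stacked gate ordering `[f, i, o, g]`, weight shapes), not the TensorFlow implementation used in the experiments:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates are stacked as [forget, input, output, candidate]."""
    z = W @ x + U @ h + b
    d = h.shape[0]
    f = 1.0 / (1.0 + np.exp(-z[:d]))          # forget gate
    i = 1.0 / (1.0 + np.exp(-z[d:2*d]))       # input gate
    o = 1.0 / (1.0 + np.exp(-z[2*d:3*d]))     # output gate
    g = np.tanh(z[3*d:])                      # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm(xs, params_fwd, params_bwd, d):
    """Run a forward and a backward LSTM; concatenate outputs per time step."""
    h, c = np.zeros(d), np.zeros(d)
    fwd = []
    for x in xs:                              # left-to-right pass
        h, c = lstm_step(x, h, c, *params_fwd)
        fwd.append(h)
    h, c = np.zeros(d), np.zeros(d)
    bwd = []
    for x in reversed(xs):                    # right-to-left pass
        h, c = lstm_step(x, h, c, *params_bwd)
        bwd.append(h)
    bwd.reverse()                             # realign with the input order
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
```

Each output vector has dimension $2d$, so the downstream layers see both left and right context at every position.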

Attention model
Considering the characteristics of entity naming in Chinese texts and the uneven distribution of entities, we introduce a document-level attention mechanism that focuses on the global information of the document, using the cosine distance as the similarity score. We use $D = (s_1, s_2, \ldots, s_n)$ to represent a document containing $n$ sentences, and each sentence $s_i = (w_1, w_2, \ldots, w_m)$ contains $m$ words. The output sequence of the BiLSTM is fed into the attention matrix so as to obtain the degree of correlation between the current character and all other characters, and thereby a global feature representation of the target character. The calculation formulas are:

$$\alpha_{i,j} = \frac{\exp\left(score(h_i, h_j)\right)}{\sum_{k} \exp\left(score(h_i, h_k)\right)}$$
$$score(h_i, h_j) = \cos\left(W h_i, W h_j\right)$$
$$g_i = \sum_{j} \alpha_{i,j} h_j$$

where $\alpha_{i,j}$ is the attention weight between character $i$ and character $j$ in the document, $h_j$ is the output of the BiLSTM layer, $score$ is the cosine-distance similarity of the two characters, $W$ is a parameter matrix learned during training, and $g_i$ is the output of the attention layer.
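A vectorized sketch of this document-level attention follows. It is a minimal illustration assuming the cosine similarity is taken between linearly projected BiLSTM outputs; the function name and shapes are ours, not the paper's code:

```python
import numpy as np

def document_attention(H, W):
    """Document-level attention sketch.
    H: (N, d) BiLSTM outputs for all N characters in the document.
    W: (d, d) learned projection used inside the cosine score."""
    P = H @ W.T                                        # project each h_i
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    S = (P @ P.T) / (norms * norms.T + 1e-8)           # cosine similarity scores
    A = np.exp(S - S.max(axis=1, keepdims=True))       # numerically stable softmax
    A /= A.sum(axis=1, keepdims=True)                  # rows sum to 1
    return A @ H                                       # g_i = sum_j alpha_ij * h_j
```

Each row of the attention matrix mixes information from the whole document into the representation of one character, which is what lets the model exploit global context.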

CRF model
In named entity recognition, the dependency between adjacent tags is also a factor that cannot be ignored. A CRF obtains an optimal prediction sequence through the relations between adjacent tags, which compensates for this shortcoming of the BiLSTM. For an input sequence $X = (x_1, x_2, \ldots, x_n)$, assume $P$ is the output matrix of the BiLSTM with size $n \times k$, where $n$ is the number of words, $k$ is the number of tags, and $P_{i,j}$ is the score of the $j$-th tag for the $i$-th word. For a prediction sequence $y = (y_1, y_2, \ldots, y_n)$, the calculation formulas are as follows:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
$$p(y \mid X) = \frac{\exp\left(s(X, y)\right)}{\sum_{\tilde{y} \in Y_X} \exp\left(s(X, \tilde{y})\right)}$$
$$\ln p(y \mid X) = s(X, y) - \ln \sum_{\tilde{y} \in Y_X} \exp\left(s(X, \tilde{y})\right)$$
$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$

where $A$ is the transition score matrix, $A_{i,j}$ is the score of transitioning from tag $i$ to tag $j$, and the size of $A$ is $(k+2) \times (k+2)$ because start and end tags are added. $\tilde{y}$ denotes a candidate tag sequence and $Y_X$ the set of all possible tag sequences. $s(X, y)$ is the score function of the predicted sequence $y$, $p(y \mid X)$ is the probability of the predicted sequence $y$, $\ln p(y \mid X)$ is its log-likelihood, and $y^{*}$ is the output sequence with the maximum score.
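The scoring and decoding steps can be sketched as below. For simplicity the sketch omits the extra start/end tags, so $A$ is $k \times k$ here; this is an assumption for illustration, not the exact formulation above:

```python
import numpy as np

def crf_score(P, A, y):
    """Score of tag sequence y: emission scores P[i, y_i] plus
    transition scores A[y_i, y_{i+1}] (start/end tags omitted)."""
    s = P[np.arange(len(y)), y].sum()
    s += sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def viterbi(P, A):
    """Return the highest-scoring tag sequence by dynamic programming."""
    n, k = P.shape
    dp = P[0].copy()                       # best score ending in each tag
    back = np.zeros((n, k), dtype=int)     # backpointers
    for i in range(1, n):
        scores = dp[:, None] + A + P[i][None, :]   # prev tag x current tag
        back[i] = scores.argmax(axis=0)
        dp = scores.max(axis=0)
    y = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):          # follow backpointers
        y.append(int(back[i][y[-1]]))
    return y[::-1]
```

Viterbi decoding finds $y^{*}$ in $O(nk^2)$ time instead of enumerating all $k^n$ sequences.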

Experimental data set
The commonly used labeling schemes for named entity recognition are BIO, BIOE, and BIOES. We adopt the BIO scheme, which has 7 tags, namely "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", where O marks a non-entity token, B the first character of a named entity, and I a subsequent character of a named entity; PER denotes a person name, ORG an organization name, and LOC a location name. We use the MSRA corpus for our experiments. This dataset was released by Microsoft Research Asia and is a widely used public Chinese dataset. The MSRA dataset contains these 7 tags covering three entity categories: location, organization, and person.
Our experiment identifies and evaluates person names, location names, and organization names. The MSRA corpus is split into 46,364 sentences for training and 4,365 sentences for testing.
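The BIO scheme can be illustrated with a small helper that labels a character sequence given entity spans (the function and span format are ours, purely for illustration):

```python
def to_bio(chars, entities):
    """Label a character sequence in BIO mode.
    `entities` is a list of (start, end, type) spans with end exclusive."""
    tags = ["O"] * len(chars)                 # default: non-entity
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"            # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"            # subsequent characters
    return tags
```

For example, `to_bio(list("张三在北京"), [(0, 2, "PER"), (3, 5, "LOC")])` yields `["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]`.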

Evaluation rule
Our experiments use precision, recall, and F1 as indicators of model accuracy. The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 P R}{P + R}$$

where $TP$ is the number of correctly identified named entities, $FP$ is the number of incorrectly identified named entities, $FN$ is the number of unidentified named entities, $P$ is the precision, and $R$ is the recall.
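These three metrics reduce to a few lines of arithmetic; the following sketch (with zero-division guards added by us) computes them from the entity counts:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from counts of correctly identified (tp),
    incorrectly identified (fp), and missed (fn) entities."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, 8 correct, 2 wrong, and 2 missed entities give precision, recall, and F1 all equal to 0.8.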

Experiments environment
We use TensorFlow 1.14.0 to build the experimental model. Our computer has 32 GB of memory, an NVIDIA GeForce RTX 2070 graphics card, and Python 3.6.9.
In our experiment, the Adam optimizer is used, the maximum input sequence length is 128, LSTM_dim is set to 200, batch_size is 16, and learning_rate is 5e-5. To prevent over-fitting, drop_out_rate is set to 0.5.

Analysis of results
To demonstrate the effectiveness of the model, we compare BERT-BiLSTM-Attention-CRF (Whole Word Mask) with previous models. The experimental results are shown in Table 2. From Table 2, we can see that the precision, recall, and F1 of our model on the MSRA dataset achieve the best results, and the F1 of our method reaches 95.00%. First, the results on the MSRA dataset show that the F1 of BiLSTM-CRF is 3.54% higher than that of LSTM-CRF, indicating that the bidirectional structure of BiLSTM acquires contextual sequence features better than a unidirectional structure. Comparing the BERT-BiLSTM-CRF (SWM) model with the BiLSTM-CRF model shows that the BERT pre-trained language model significantly improves named entity recognition, raising precision by 8.59%. When the mask strategy of BERT in the BERT-BiLSTM-CRF model is changed to the Whole Word Mask, its F1 on the MSRA dataset increases by 0.41%, indicating stronger feature extraction. Our model introduces document-level attention on the basis of BERT(WWM)-BiLSTM-CRF, and its F1 reaches 95.00%, indicating that the attention mechanism enhances the model's feature extraction under global information.
The F1 over the first 20 training epochs is shown in Figure 5. As the figure shows, the neural network models BiLSTM-CRF and LSTM-CRF, which do not use BERT, have a low F1 at the beginning of training and reach a stable level only after many iterations, yet they remain lower than the three BERT-based models. The three models BERT-BiLSTM-CRF (SWM), BERT-BiLSTM-CRF (WWM), and BERT-BiLSTM-Attention-CRF (WWM) reach a high level after one round of training, with F1 around 90%; as the number of training rounds increases, the F1 continues to rise, eventually reaching a high and stable level.
We also compare the performance of the BERT-BiLSTM-Attention-CRF (WWM) model with other existing models on the MSRA corpus, as shown in Table 3. As the table shows, the CNN-BiLSTM-CRF model uses convolutional neural networks and a bidirectional LSTM to extract character feature sequences. The DC-BiLSTM-CRF model uses DC-LSTM to learn sentence features, combined with a self-attention mechanism for entity recognition. Lattice-LSTM-CRF uses a lattice LSTM for character feature extraction. The BERT-IDCNN-CRF model uses BERT (SWM) for word embedding, combined with iterated dilated convolutions to extract sentence features; its F1 reaches 94.41%, which is still below the performance of our model. Therefore, the BERT-BiLSTM-Attention-CRF (WWM) model performs better.

Conclusion
In our paper, word vectors are obtained with the BERT (WWM) pre-trained language model and then fed into the BiLSTM, attention, and CRF layers to construct the BERT-BiLSTM-Attention-CRF model. Verified on the MSRA corpus and compared with other existing models, the BERT-BiLSTM-Attention-CRF model achieves the best performance, with an F1 of 95.00%. The BERT model with the Whole Word Mask strategy and the document-level attention mechanism are the main advantages of our model: the extracted sequence features conform to Chinese word-formation habits, and the model learns both word-level structural features and contextual semantic information. In the future, we plan to apply the model to entity recognition in professional domains.