Chinese-Lao Bilingual Named Entity Alignment Research

Chinese-Lao bilingual named entity (NE) alignment is of great significance. This paper proposes three entity alignment methods. First, we propose an alignment method based on the similarity of bilingual entity fuzzy matching. Second, we use the similarity of bilingual entity word sequence patterns to propose a method that matches Lao entities against Chinese entity patterns. Third, by mining the knowledge information words of Chinese entities, we build a naive Bayes bilingual NE alignment model to align Chinese and Lao named entities in a comparable corpus. Finally, we propose rules that combine the advantages of the three methods to achieve the best results.


Introduction
A named entity is an entity that has a specific meaning in the text [1]. Named entities are important information in Natural Language Processing and an important support for information retrieval, automatic question answering, Machine Translation, topic discovery, and other information processing research [2]. For bilingual sequences, bilingual entity alignment aims to establish the correspondence between named entities in the source language and named entities in the target language. It is an important task in multilingual information processing fields such as Machine Translation and cross-language information retrieval [3].
In recent years, economic, political and cultural exchanges between China and Laos have been deepening, which both promotes the development of Lao information processing technology and places higher demands on it. However, research on the Lao language is still very weak.
Lao and Chinese named entities are quite similar. For example, neither language has special features such as capitalization to help identify named entities. Moreover, neither uses spaces to delimit words within a sentence, and both use the same order of subject, predicate and object. Lao also has its own characteristics. For example, in a local Lao personal name the first name comes first and the last name second; otherwise, the last name comes first and the first name second. In Lao sentences, the adverbial is generally placed at the end. A Lao location name is generally preceded by a special word that marks it. A personal name preceded by Mr. denotes a man, a personal name preceded by Mrs. denotes a woman, and so on [4][5].

Selection of candidate equivalent entities
The candidate equivalent entities of a Chinese named entity are mainly the unknown words and the recognized similar entities in the Lao parallel text of the Chinese text. If Lao parallel text is scarce, comparable texts with a similarity greater than 0.4 are also considered.
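The text-selection rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LaoText` record, the precomputed `similarity` field, and the `min_parallel` cutoff that decides when parallel text counts as scarce are all assumptions introduced here. Only the 0.4 threshold comes from the text.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.4  # value stated in the text


@dataclass
class LaoText:
    """Hypothetical record for one Lao text paired with a Chinese text."""
    text_id: str
    is_parallel: bool   # True if the text is a parallel translation
    similarity: float   # assumed precomputed document similarity in [0, 1]


def select_texts(texts, min_parallel=1):
    """Return the Lao texts to search for candidate equivalent entities.

    Parallel texts are preferred. If fewer than `min_parallel` exist
    (an assumed reading of "scarce"), comparable texts whose similarity
    exceeds the threshold are added.
    """
    parallel = [t for t in texts if t.is_parallel]
    if len(parallel) >= min_parallel:
        return parallel
    comparable = [
        t for t in texts
        if not t.is_parallel and t.similarity > SIMILARITY_THRESHOLD
    ]
    return parallel + comparable
```

How the document similarity itself is computed is not specified in this section, so the sketch takes it as an input.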
In the process of selecting candidate equivalent entities, we select the corpus first to find out the potential text set of

NE alignment based on similarity of bilingual entity fuzzy matching
We use machine translation to compare the Chinese named entity with each selected candidate entity, and we also translate the Lao candidate entity into Chinese by machine translation and compare it with the Chinese source entity. The higher of the similarities from the two translation processes is taken as the final similarity. If the similarity is greater than a certain threshold, we consider the two to be equivalent entities. Candidate entities are selected using rules combined with unknown words. At present, machine translation handles named entities poorly, but some translation results can still be used as a reference. The correct parts of the translation results can inform entity screening, and the similarity can be obtained by counting the syllables and words that accord with the features of transliteration and paraphrase. The Lao candidate entity with the highest similarity, provided that it exceeds the threshold, is the Lao equivalent entity.
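The two-direction comparison can be sketched as below, under several assumptions. The translators `zh_to_lo` and `lo_to_zh` are hypothetical callables standing in for a machine translation system. The first direction is assumed to translate the Chinese entity into Lao before comparison. The character-level ratio from Python's `difflib` is only a stand-in for the syllable and word counts described above. The threshold of 0.6 is an arbitrary placeholder, since the text does not give a value.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in [0, 1], not the paper's syllable/word count."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def bidirectional_similarity(zh_entity, lo_entity, zh_to_lo, lo_to_zh):
    """Take the higher similarity of the two translation directions."""
    forward = string_similarity(zh_to_lo(zh_entity), lo_entity)
    backward = string_similarity(lo_to_zh(lo_entity), zh_entity)
    return max(forward, backward)


def best_equivalent(zh_entity, lo_candidates, zh_to_lo, lo_to_zh, threshold=0.6):
    """Return the highest-scoring candidate if it exceeds the threshold."""
    best_score, best_candidate = 0.0, None
    for candidate in lo_candidates:
        score = bidirectional_similarity(zh_entity, candidate, zh_to_lo, lo_to_zh)
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_candidate if best_score > threshold else None
```

Taking the maximum of the two directions means that one usable translation is enough to accept a pair, which suits the observation that machine translation of named entities is only partly reliable.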
The Chinese and Lao equivalent entities are determined by calculating their pronunciation consistency ratio. The specific calculation formula is as follows:

NE pattern recognition
Because Chinese and Lao pronunciation rules differ, transliteration features are not fully applicable. Some Lao syllables have no corresponding pronunciation in Chinese, which greatly hinders named entity alignment based on transliteration. Some entities fit neither the transliteration nor the paraphrase features. For example, some personal names appear in English, numeric, or abbreviated form, such as Vill Wannarot. This paper argues that, although Chinese and Lao differ, the entity patterns of Chinese and Lao entities in parallel and comparable corpora should be the same or similar. By compiling statistics on high-frequency instance patterns for some common names, translating the patterns into Lao, and adjusting the word order according to Lao language habits, we can obtain high-quality patterns that meet the needs of personal name and place name matching. Lao words that fit a pattern are equivalent entities. For example: Huang Hua served as China's foreign minister from 1976 to 1982.
Translate the sentence into Lao, move the adverbial of time to the end of the sentence according to Lao language habits, and generate a pattern. Compare this pattern with the Lao word sequences in the comparable corpus. An entity that fits the pattern is the equivalent Lao entity.
MATEC Web of Conferences 100, 02052 (2017), GCMM 2016, DOI: 10.1051/matecconf/201710002052
We can obtain many Lao patterns by manually adjusting Chinese patterns. Lao entities mined through patterns have high accuracy in type judgement. For example, with the most commonly used short pattern, [Mr. X says], the accuracy of matching Lao names in the corpus is 100%. If the Lao entity needs to be aligned with a Chinese entity, the pattern must contain sufficient context information. An equivalent pattern is shown as follows:
Chinese: Headmaster Yellen attends the meeting
Generated pattern: Mr. X headmaster attend meeting
Headmaster can be replaced by similar feature words that signal a name, such as manager, doctor, etc.
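Pattern matching of this kind can be sketched as a sliding-window comparison over a segmented word sequence. The sketch makes several assumptions. English glosses stand in for Lao tokens. Word segmentation is assumed to have been done already. The slot `X` is assumed to fill exactly one token. The `TITLE` placeholder and the `TITLE_WORDS` list are hypothetical names used to illustrate the feature-word substitution.

```python
# Feature words mentioned in the text; the list name is hypothetical.
TITLE_WORDS = ["headmaster", "manager", "doctor"]


def expand_titles(pattern, title_slot="TITLE"):
    """Produce one concrete pattern per feature word."""
    return [
        [title if token == title_slot else token for token in pattern]
        for title in TITLE_WORDS
    ]


def match_pattern(pattern, tokens):
    """Return the slot filler for every position where `pattern` matches.

    `pattern` is a list of tokens in which "X" marks a one-token entity slot.
    """
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        filler = None
        for expected, actual in zip(pattern, window):
            if expected == "X":
                filler = actual
            elif expected != actual:
                break  # literal token mismatch, try the next window
        else:
            if filler is not None:
                hits.append(filler)
    return hits


if __name__ == "__main__":
    patterns = expand_titles(["Mr.", "X", "TITLE", "attend", "meeting"])
    sentence = ["yesterday", "Mr.", "Yellen", "doctor", "attend", "meeting"]
    for p in patterns:
        print(p, match_pattern(p, sentence))
```

A longer pattern with more literal context matches less often but ties the extracted name to a specific Chinese source sentence, which is the trade-off the paragraph above describes.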
candidate entities, then we consider the similar entities and unknown words recognized from the text set as a set of candidate equivalent entities.
Fig. 1. Flow chart of screening candidate equivalent entities