Transition-based neural network dependency parsing of Tibetan

To improve the performance of Tibetan natural language processing applications such as machine translation and sentiment analysis, this article proposes a neural network-based method for Tibetan dependency parsing. A portion of the corpus annotated with Qinghai Normal University's part-of-speech tag set is converted, via a mapping between the two tag sets, into a corpus annotated with the national standard part-of-speech tag set. At the same time, a Tibetan dependency treebank in CoNLL format is constructed, and a shift-reduce method combined with a neural network is used to systematically study Tibetan dependency parsing. This improves the quality of Tibetan dependency analysis, with an accuracy reaching a UAS of 94.59%.


Introduction
Tibetan dependency parsing is one of the core technologies of Tibetan natural language processing. It analyzes the components of Tibetan sentences and lays an effective foundation for high-level applications such as sentiment classification, machine translation, and entity relation extraction; at the same time, it places requirements on lexical analysis, the basic task of Tibetan natural language processing, and thus serves as a bridge between the two levels. Although research on Tibetan natural language processing has advanced rapidly with the development of network information technology and has produced fruitful results, such as Tibetan word segmentation, part-of-speech tagging [18], and semantic dependency analysis [2] [10], research on Tibetan syntactic analysis still faces many challenges, and dependency parsing methods based on deep learning are essentially blank. It is therefore necessary to draw on and consolidate earlier statistics-based methods [1] [16] and to use appropriate neural network methods to extract and analyze the features of various Tibetan sentence patterns.
The most common approaches in dependency parsing today are transition-based [17] and graph-based [9]. The graph-based method builds a directed complete graph over the words and solves for its maximum spanning tree to obtain the optimal dependency analysis. Transition-based dependency parsing models the decoding of an entire sentence as a finite automaton problem (analogous to the parsing component of a compiler). Tibetan sentence patterns include declarative, exclamatory, rhetorical, interrogative, and imperative sentences, whose boundaries are basically marked by the wedge-shaped shad symbol '།'. We adopt the transition-based dependency analysis method, which begins in an initial state, moves continuously from one state to the next, and finally reaches a terminal state; the dependency tree corresponding to the terminal state is the analysis result. A state consists of a stack of words being processed, a buffer holding the words still to be processed, and a memory of the dependency arcs generated so far. The transitions include shifting a word and generating dependency arcs. The goal of transition-based dependency parsing is to train a classifier that predicts the next transition (action) for a given state, so that the structure of the whole sentence can be expressed through the dependencies between its words.

Related research
Tibetan has a history of more than 1,500 years since the creation of its script. After a long evolution, Tibetan has gradually entered electronic information technology and achieved good results. Experts and scholars at home and abroad have studied Tibetan natural language from various perspectives, including lexical analysis, syntactic analysis, semantic analysis, machine translation, and sentiment classification. Dependency parsing is likewise indispensable for Tibetan. At present, only one discriminative Tibetan dependency parsing method has been proposed [16]: a perceptron is used to train the parsing model, and a bottom-up CYK algorithm decodes the sentence to generate the maximum spanning tree. Xia Wuji et al. proposed projection-based semantic dependency analysis of Tibetan [2], built a semantic dependency treebank, and classified the dependency arcs with a maximum entropy model. Dependency parsing of Tibetan compound sentences [1] has also been proposed; its method uses maximum entropy to extract features and completes parameter estimation with the resulting maximum entropy model. The study of Tibetan dependency parsing based on deep learning is essentially blank, and for this reason this paper carries out a transition-based dependency analysis of Tibetan.
Neural network-based Tibetan dependency parsing is faster than the discriminative approach. To better support Tibetan natural language processing technology, this paper conducts a study of transition-based neural network Tibetan dependency parsing, analyzing whether a dependency exists between words and predicting its type, taking the first step toward deep language understanding. In this article we use low-dimensional (50-dimensional), dense word embeddings, but this representation alone is not sufficient. From a statistical point of view, too many feature weights would otherwise be estimated inaccurately: lexicalized and higher-order interaction features are important for the performance of these systems, yet there is not enough data to weight most of them properly. For this reason, introducing additional supporting features, such as part-of-speech features, also provides significant support for improving dependency parsing performance.

Tibetan dependency syntax analysis and difficulties
First, the Tibetan text corpus is preprocessed by word segmentation and part-of-speech tagging [18]; the annotations fall into two categories, the Qinghai Normal University annotation specification and the national standard tagging specification. The dependency labels used in Tibetan dependency parsing are the 36 dependency types defined by Huaquecairang et al.

Tibetan dependency syntax analysis
The dependency syntactic structure of Tibetan relies on its parts of speech, dependency labels, and phrase structure, as shown in Figure 1 for the example "དགེ ་�ན་�ི ས་ང་ལ་ ད�ད་�ོ མ་མ�བ་�ོ ན་གནང་།" (The teacher will guide my thesis.). In this example each arrow (dependency arc) points to its child node: ROOT is the root node, SUB the subject-predicate relation, GZCX the case-auxiliary relation in Tibetan grammar, OBJ the object relation, GENI the genitive-auxiliary relation in Tibetan grammar, AUXIW the auxiliary-verb structure relation, and PUNCT the punctuation relation in Tibetan. All of the above follow the dependency labeling standard of Qinghai Normal University. The part of speech (POS tag) of each word and its phrase structure are marked above the Tibetan characters in the figure.

Difficulties existing
The main task of Tibetan lexical analysis is to separate Tibetan text into meaningful words, determine the category of each word, perform shallow disambiguation, and identify longer proper nouns. This paper uses both the 67-category Tibetan part-of-speech tagging standard of Qinghai Normal University and the 92-category national standard for Tibetan part-of-speech tagging, and establishes a mapping between them so that corpora can be converted from one to the other.
Usually the important information of a Tibetan sentence is located at or near the end of the sentence, so the core components of a sentence are computed from parts taken from the front and from the end of the sentence. Recognizing the core verb (the ROOT node) is therefore of great importance for dependency parsing of Tibetan sentences.
Regarding the dependency relations of case auxiliaries, the previously proposed discriminative Tibetan dependency parsing method [16] and the dependency parsing of Tibetan compound sentences [1] both treat Tibetan case auxiliaries as the parent nodes of the words they attach to. In contrast, this paper treats case auxiliaries as child nodes, as shown in Figures 2 and 3 for the example "ང་ཚ� ་ཨང་བཞི ་བའི ་ �ལ་�་�གས་�བ་སོ ང་།" (we joined the fourth group). Figure 2 shows a case auxiliary as a parent node, while Figure 3 shows it as a child node; the dependencies shown in Figure 3 are more comprehensive. Using the arc-standard algorithm, one of the transition systems, as the basis of the transition parser, a sequence of transitions is applied to the Tibetan text, generating dependency arcs and predicting the next transition action. The transition process is described in detail below.

Tibetan dependency parsing based on transitions
First, each transition is decided from the current configuration (state); after the transition, the state is updated and the next decision is made. Decisions are made greedily: at each step the transition currently considered best is selected, which loses only a little accuracy while greatly improving speed.
Configuration: c = (S, B, A), where S is the stack holding the words being processed (Stack), B is the input buffer queue (Buffer), and A is the set of dependency arcs analyzed so far (including dependency labels). Assume a sentence is w_1, w_2, ⋯, w_n, where w_i is the i-th word. The initial state is S = [ROOT], B = [w_1, w_2, ⋯, w_n], A = ∅. If the buffer of a state is empty and S = [ROOT], then this is the terminal state, and the transition process ends.
Transition: there are three transitions: LEFT-ARC, RIGHT-ARC, and SHIFT. S_i denotes the i-th element from the top of the stack (last in, first out), and b_i the i-th element of the buffer queue (first in, first out). The three operations are: LEFT-ARC(l): when the stack holds at least 2 elements, add a dependency arc S_1 → S_2 with label l, then remove S_2 from the stack.
RIGHT-ARC(l): when the stack holds at least 2 elements, add a dependency arc S_2 → S_1 with label l, then remove S_1 from the stack.
SHIFT: when the buffer queue holds at least 1 element, remove b_1 from the queue and push it onto the top of the stack.
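The arc-standard system just described can be sketched in Python; the class and method names below are illustrative, not taken from the paper's implementation:

```python
class Config:
    """Arc-standard parser configuration c = (S, B, A)."""

    def __init__(self, words):
        self.stack = [0]                              # S: word indices, 0 is ROOT
        self.buffer = list(range(1, len(words) + 1))  # B: w_1 ... w_n
        self.arcs = []                                # A: (head, dependent, label)

    def left_arc(self, label):
        # add arc S_1 -> S_2 with label l, then remove S_2 (second-topmost)
        assert len(self.stack) >= 2
        s1, s2 = self.stack[-1], self.stack[-2]
        self.arcs.append((s1, s2, label))
        del self.stack[-2]

    def right_arc(self, label):
        # add arc S_2 -> S_1 with label l, then remove S_1 (topmost)
        assert len(self.stack) >= 2
        s1, s2 = self.stack[-1], self.stack[-2]
        self.arcs.append((s2, s1, label))
        self.stack.pop()

    def shift(self):
        # remove b_1 from the buffer and push it onto the stack
        assert self.buffer
        self.stack.append(self.buffer.pop(0))

    def is_terminal(self):
        # terminal state: empty buffer and S = [ROOT]
        return not self.buffer and self.stack == [0]
```

For a three-word subject-verb-object sentence, the transition sequence SHIFT, SHIFT, LEFT-ARC(SUB), SHIFT, RIGHT-ARC(OBJ), RIGHT-ARC(ROOT) reaches the terminal state with the full set of arcs.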
Table 1 shows the whole transition process for the Tibetan sentence "ང་ཚ� ས་�ི ས་འཁོ ར་�ོ བ་�ོ ང་�ེ ད་བཞི ན་ཡོ ད།", translated as: (We are learning computer.). Through the transition process, the segmented Tibetan text and the dependency labels between sentence components are fed into the neural network classifier.

Dependency analysis model based on neural network
A standard neural network with one hidden layer is established, as shown in Figure 4. The embedded representations of the elements selected from S^w, S^t, and S^l are fed to the input layer; that is, the vectorized word forms, part-of-speech tags, and dependency labels are the input. Assume the sizes of the selected sets are n_w, n_t, and n_l. For S^w = {w_1, ⋯, w_{n_w}}, vectorizing the n_w words gives x^w = [e^w_{w_1}, e^w_{w_2}, ⋯, e^w_{w_{n_w}}]; x^t and x^l are obtained in the same way. These are connected to a fully connected layer with the cube activation function, followed by another fully connected layer, and the classification probabilities are obtained with softmax.
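A minimal sketch of the forward pass of such a classifier, using NumPy. All sizes here are illustrative assumptions, not the paper's trained model: a vocabulary of 5,000 embedding rows, 200 hidden units, and 73 output classes (assuming 36 labels gives 2 × 36 arc transitions plus SHIFT), with randomly initialized weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 48 feature positions, 50-dim embeddings,
# 200 hidden units, 73 transition classes (2 * 36 labels + SHIFT).
n_features, d_embed, d_hidden, n_classes = 48, 50, 200, 73

E  = rng.normal(0, 0.01, (5000, d_embed))                 # shared embedding table
W1 = rng.normal(0, 0.01, (d_hidden, n_features * d_embed))
b1 = np.zeros(d_hidden)
W2 = rng.normal(0, 0.01, (n_classes, d_hidden))

def forward(feature_ids):
    """Predict transition probabilities from 48 feature ids."""
    x = E[feature_ids].reshape(-1)       # concatenate the 48 embeddings
    h = (W1 @ x + b1) ** 3               # cube activation function
    z = W2 @ h
    z -= z.max()                         # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return p

probs = forward(rng.integers(0, 5000, n_features))
```

The cube activation lets products of up to three input features appear in the hidden units, which is one motivation given in the text for modeling feature combinations without enumerating them by hand.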

Experiment
To support transition-based neural network Tibetan dependency parsing, more than 10,000 dependency-annotated Tibetan sentences are divided into training, test, and validation sets. The neural network described in the previous section takes 48 features; since it is nonlinear (including the cube activation), it can also extract relationships between features, such as whether they co-occur. Traditional methods, by contrast, can only combine these 48 features manually to represent co-occurrence, resulting in hundreds of combinations, that is, hundreds of features, each of which may occur very rarely in any sentence; the feature vectors are therefore sparse, training is time-consuming, and generalization is relatively poor.
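A simplified sketch of how such features might be extracted from a parser configuration. Only 12 of the 48 positions are shown here (the top 3 stack and first 3 buffer positions, as words and POS tags); the full template in the text also covers children of stack words and their arc labels. The example tokens are hypothetical romanized placeholders.

```python
PAD = "<NULL>"  # padding symbol for missing positions

def extract_features(stack, buffer, words, pos):
    """Return word and POS features for the top 3 stack
    and first 3 buffer positions (12 features total)."""
    s = [stack[-1 - i] if i < len(stack) else None for i in range(3)]
    b = [buffer[i] if i < len(buffer) else None for i in range(3)]
    word_feats = [words[i] if i is not None else PAD for i in s + b]
    pos_feats  = [pos[i]   if i is not None else PAD for i in s + b]
    return word_feats + pos_feats

# Index 0 stands for ROOT; the tokens are placeholders.
words = ["<ROOT>", "we", "computer", "learn"]
pos   = ["<ROOT>", "PRON", "NOUN", "VERB"]
feats = extract_features([0, 1], [2, 3], words, pos)
```

Each feature string is then mapped to an embedding id and fed to the classifier, so one dense vector replaces the hundreds of sparse manual combinations discussed above.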

Experimental data
The corpus uses the CoNLL format (from the Conference on Computational Natural Language Learning) for the Tibetan dependency treebank, as shown in Table 2.

Only the first eight columns are used in this text: 1) ID, the ordinal number of the current word in the sentence, starting from 1; 2) FORM, the current word or punctuation mark; 3) LEMMA, the lemma or stem of the current word (or punctuation); 4) CPOSTAG, the coarse-grained part of speech of the current word; 5) POSTAG, the fine-grained part of speech of the current word; 6) FEATS, additional features; this column was not used in the experiment and is replaced with an underscore; 7) HEAD, the head of the current word (the ID of its parent); 8) DEPREL, the dependency relation between the current word and its head.
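A minimal reader for this format might look as follows; the function name and the tab-separated layout are assumptions consistent with standard CoNLL conventions, not the paper's actual tooling. The sample tokens in the usage are romanized placeholders.

```python
def read_conll(text):
    """Parse CoNLL-format sentences: one token per line, blank line
    between sentences, '_' for empty columns. Only the 8 columns
    described in the text are kept."""
    cols = ["id", "form", "lemma", "cpostag", "postag", "feats", "head", "deprel"]
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")[:8]
        tok = dict(zip(cols, fields))
        tok["id"], tok["head"] = int(tok["id"]), int(tok["head"])
        current.append(tok)
    if current:
        sentences.append(current)
    return sentences

sample = (
    "1\tnga\t_\tPRON\tPRON\t_\t3\tSUB\n"
    "2\tkha\t_\tNOUN\tNOUN\t_\t3\tOBJ\n"
    "3\tbyed\t_\tVERB\tVERB\t_\t0\tROOT\n"
)
sents = read_conll(sample)
```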
In CoNLL format each word occupies one line, columns without a value are filled with an underscore '_', and sentences are separated by a blank line (see Table 2); in addition, a '_' symbol is appended after the dependency columns as an extra feature column.

Experimental results and analysis
The training set consisted of 10,023 sentences, the validation set of 500 sentences, and the test set of 500 sentences, all in the CoNLL format described above.
The main evaluation measure of dependency parsing used here is the unlabeled attachment score (UAS), the proportion of words whose head is predicted correctly, ignoring dependency labels, computed by the neural network model on a Tibetan test corpus in CoNLL format. The results are shown in Table 3.
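UAS as defined here can be computed with a short helper; the function name and input layout (lists of head indices per sentence) are illustrative:

```python
def uas(gold_heads, pred_heads):
    """Unlabeled attachment score: fraction of words whose
    predicted head index matches the gold head index.
    Both arguments are lists of per-sentence head lists."""
    assert len(gold_heads) == len(pred_heads)
    total = sum(len(s) for s in gold_heads)
    correct = sum(
        g == p
        for gs, ps in zip(gold_heads, pred_heads)
        for g, p in zip(gs, ps)
    )
    return correct / total
```

For example, if one three-word sentence has gold heads [2, 0, 2] and the parser predicts [2, 0, 1], two of three heads are correct and UAS is 2/3.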
The experimental results show that the dependency parsing accuracy of the neural network-based method is 94.59% (UAS), leaving room for improvement. This experiment is the first to use a neural network method for Tibetan dependency parsing. The small size of the treebank does not yet cover all Tibetan sentence patterns, and lexical analysis has not been fully exploited, which causes analysis errors in model training.