Tibetan Information Extraction Technology Integrated with Event Feature and Semantic Role Labelling

Abstract: We integrate semantic information, built on syntactic analysis, to extract information from Tibetan text. Experimental analysis shows that a syntactic analysis model integrated with semantic information, together with the proposed evaluation scheme, can be applied successfully to the Tibetan information extraction task.


Introduction
Information extraction refers to the automatic extraction of the main information from a text and its presentation in a structured form. As a key technology in the field of information processing, information extraction [1] has been widely used in information retrieval [2], automatic question answering [3], text mining and so on. A Tibetan information extraction algorithm that combines syntactic information and semantic parsing can be used in information security systems, for example Tibetan public opinion monitoring, Tibetan text link detection, and Tibetan hot topic recognition and tracking, and it has good academic value and broad application prospects. It also benefits the automatic generation of Tibetan databases and knowledge bases, Tibetan question answering systems, Tibetan information retrieval research, etc.

Related Research
In the late 1950s, Zellig Harris proposed extracting and indexing information from scientific and technical literature [4]. Since then, and through the ACE evaluation conferences from 2000 onward, information extraction has developed over several decades.
Internationally, research on information extraction has moved from particular domains to the open domain, which makes the extracted information more comprehensive. Data formats have moved from regular, standard text to irregular, everyday text. Data types have moved from news reports and scientific papers to website texts of all kinds, Weibo, etc. Methods have moved from manual modelling to automatic modelling with machine learning. Information extraction has achieved notable results [6]. Heng Ji et al. [7] studied cross-document event extraction and tracking, as well as its evaluation standard. Alan Ritter et al. [8] studied open-domain event extraction on sources such as Twitter, applying event extraction technology to real-time and irregular social media, which makes the technology more practical.
In China, research on Chinese information extraction started relatively late, but it has produced some results. Yuan Yulin [9] studied the construction of semantic resources for event extraction. Zhao Yanyan, Liu Ting et al. [10] proposed event type recognition based on trigger words and binary classification. Xu Ronghua, Zhu Qiaoming et al. [11] defined an event fusion framework, TEFF, and presented theme events in hierarchical form according to the roles of the various meta events within them. Ding Xiao, Qin Bing, Liu Ting et al. [12] researched event extraction in the field of music and applied the results in the Language Technology Platform (LTP) of Harbin Institute of Technology [13].
Tibetan sentence components include subject, predicate, object, attributive and adverbial. Sentences are mainly predicate-centred, with a subject-predicate structure. Tibetan word order is subject, object (direct object, indirect object), predicate; that is, Tibetan has an SOV syntactic structure, which differs from Chinese. When adjectives, numerals or demonstrative pronouns act as modifiers, they are placed after the head word. When personal pronouns and nouns act as modifiers, they take an added particle and are placed in front of the head word. Verbal and adjectival modifiers generally precede the head word.
For example, the characteristic form of Tibetan verb phrases includes V + UT + UM + UE (UT is the tense auxiliary, UM the modal auxiliary, and UE the sentence-final mood particle). The order and number of case auxiliaries (whose grammatical function is equivalent to that of postpositions), and the connection between a case auxiliary and the noun in front of it, reflect the grammatical characteristics of the Tibetan language.
In order to construct the syntax treebank of Tibetan phrases, Tibetan notional words and function words are each refined and tagged, as shown in table 1.
Table 1. Part-of-speech tag set of the Tibetan phrase treebank
Category                Tag
Common noun             NN
Name                    NR
Noun of time            NT
Name of organization    NO
Name of place           NS
Rhetoric word           NE

Examples of Tibetan phrase treebank integrated with semantic information
Following the tags in table 1 and the phrase-level annotation specification of the Penn Chinese Treebank (CTB) of the University of Pennsylvania, this paper constructs a Tibetan phrase syntax treebank for Tibetan events. The phrase syntax treebank is as shown in the figure. Current Tibetan treebank construction relies mostly on dependency syntax, and what dependency syntax extracts is the dependencies between words. This paper focuses instead on Tibetan phrase syntax: taking the samples in figures 1 and 2 as the standard, Tibetan content words and function words are labelled manually. A corpus of 2000 sentences is used as the training corpus to decode the test corpus; the output is then proofread manually, and bootstrapping is used to expand the Tibetan phrase syntax treebank gradually.
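The bootstrapping procedure described above (train on the labelled seed, parse new sentences, proofread, retrain) can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `train` and `proofread` are hypothetical callables standing in for parser training and manual correction, and the batch size is an illustrative value that does not come from the paper.

```python
from typing import Callable, List, Sequence

Tree = str  # a bracketed phrase-structure tree, e.g. "(S (NP ...) (VP ...))"


def bootstrap_treebank(
    seed_trees: Sequence[Tree],
    raw_sentences: Sequence[str],
    train: Callable[[Sequence[Tree]], Callable[[str], Tree]],
    proofread: Callable[[str, Tree], Tree],
    batch_size: int = 500,  # hypothetical value, not from the paper
) -> List[Tree]:
    """Grow a treebank by repeatedly training, parsing and proofreading."""
    treebank = list(seed_trees)
    for start in range(0, len(raw_sentences), batch_size):
        batch = raw_sentences[start:start + batch_size]
        parser = train(treebank)  # retrain on every tree collected so far
        for sentence in batch:
            draft = parser(sentence)  # automatic parse of an unlabelled sentence
            treebank.append(proofread(sentence, draft))  # manual correction step
    return treebank
```

Retraining once per batch rather than once per sentence keeps the number of training runs small while still letting later batches benefit from the corrected trees of earlier ones.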

Sample of Tibetan information extraction integrated with semantic feature
4.1 Trigger word recognition
Information extraction must first identify the trigger word in the sentence. Tibetan verbs generally occur at the end of the sentence and form the core of the whole sentence, so in terms of position the trigger word is relatively easier to identify than other components.
EITCE 2017
Many verbs, however, do not appear in the training corpus and are out-of-vocabulary words. Take the example (respect for teachers): if the verb (respect) does not appear in the training corpus as a trigger word, the event is difficult to identify. The meanings of (respect) and (esteem) are close, however. On this basis, this paper uses semantic similarity calculation over a Tibetan verb dictionary and a trigger word dictionary to extend the event trigger words automatically and to cover as many types of event trigger word as possible.
The semantic similarity of Tibetan words is described with a word vector space model built from the context words of each Tibetan word, since context provides abundant linguistic information about a word. First, a group of feature words is selected. Then, from the frequency of these feature words in the corpus and their TF-IDF values, the relevance between the feature words and each word is calculated to obtain a feature vector. Finally, the similarity between two vectors is used as the similarity of the two words.
A root trigger word and its information category form a pair (trigger, type), for example (chairman, Person/Respect). These pairs make up the lookup table "Tibetan root trigger word - information category", which has two columns: the first column holds the root trigger word, and the second holds the information category of that trigger. Each trigger word corresponds to exactly one information category, and the trigger words in the table cover all information categories.
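The similarity-based trigger extension described above can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the use of cosine similarity, the context window size, the smoothed IDF weighting, the 0.8 threshold and all function names are assumptions, and the English glosses stand in for Tibetan words.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence


def context_vectors(sentences: Sequence[Sequence[str]],
                    feature_words: Sequence[str],
                    window: int = 2) -> Dict[str, List[float]]:
    """Build one TF-IDF weighted context vector per word (assumed weighting)."""
    features = set(feature_words)
    cooc: Dict[str, Counter] = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in features:
                    cooc.setdefault(word, Counter())[sent[j]] += 1
    n_words = len(cooc)
    # "Document frequency": in how many words' contexts each feature occurs.
    df = Counter(f for counts in cooc.values() for f in counts)
    idf = {f: math.log((1 + n_words) / (1 + df[f])) + 1.0 for f in feature_words}
    return {w: [c[f] * idf[f] for f in feature_words] for w, c in cooc.items()}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_triggers(trigger_table: Dict[str, str],
                    vectors: Dict[str, List[float]],
                    candidates: Sequence[str],
                    threshold: float = 0.8) -> Dict[str, str]:
    """Give an unseen verb the category of its most similar root trigger."""
    extended = dict(trigger_table)
    for verb in candidates:
        if verb in extended or verb not in vectors:
            continue
        best: Optional[str] = None
        best_sim = 0.0
        for trig in trigger_table:
            if trig in vectors:
                sim = cosine(vectors[verb], vectors[trig])
                if sim > best_sim:
                    best, best_sim = trig, sim
        if best is not None and best_sim >= threshold:
            extended[verb] = trigger_table[best]
    return extended


# Toy usage with English glosses (hypothetical data, not from the paper).
corpus = [["students", "respect", "teacher"],
          ["students", "esteem", "teacher"],
          ["students", "eat", "food"]]
vecs = context_vectors(corpus, ["students", "teacher", "food"])
table = extend_triggers({"respect": "Person/Respect"}, vecs, ["esteem", "eat"])
```

In the toy corpus, "esteem" shares the contexts of "respect" and inherits its category, while "eat" falls below the threshold and is left out. Because each added verb copies the category of a single root trigger, the one-category-per-trigger property of the table is preserved.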
The information categories are labelled with different symbols, and the parameters of the identification model, including the state transition matrix, are obtained by maximum likelihood estimation, where $Init_i$ is the total number of occurrences of state $i$.
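As an illustration of maximum likelihood estimation for a state transition matrix, the sketch below uses the standard relative-frequency estimate $a_{ij} = \mathrm{count}(i \to j) / \sum_k \mathrm{count}(i \to k)$. This is an assumption about the general technique, not a reproduction of the paper's formula; the function name and the state labels in the usage note are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def estimate_transitions(
    sequences: Sequence[Sequence[str]],
) -> Dict[str, Dict[str, float]]:
    """Relative-frequency (maximum likelihood) estimate of P(next | current)."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        # Count every adjacent pair of states in the labelled sequence.
        for cur, nxt in zip(seq, seq[1:]):
            pair_counts[cur][nxt] += 1
    transitions = {}
    for cur, nxt_counts in pair_counts.items():
        total = sum(nxt_counts.values())  # number of transitions out of `cur`
        transitions[cur] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return transitions
```

For example, `estimate_transitions([["O", "T", "O"], ["O", "O", "T"]])` estimates the probability of moving from state "O" to state "T" as 2/3, because two of the three transitions leaving "O" go to "T".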
The Chinese Proposition Bank (CPB) of the University of Pennsylvania is a resource for Chinese shallow semantic labelling built on the Penn Chinese Treebank. The labelled data of the Penn Chinese Treebank comes mainly from Xinhua news, Sinorama news and magazine, and Hong Kong news. CPB contains more than 20 semantic roles, and the same semantic role has different meanings for different target verbs. The core semantic roles run from Arg0 to Arg5. Arg0 usually denotes the agent of the action, Arg1 usually denotes what is affected by the action, and so on. The remaining semantic roles are additional roles, expressed with the prefix ArgM followed by a tag that gives the semantic category of the argument; for example, ArgM-LOC represents location and ArgM-TMP represents time.
With the "predicate - argument role" structure, for each predicate in the sentence (verbal predicate, nominal predicate, etc.), the corresponding semantic roles of the predicate are marked. The main argument roles are the agent ARG0-SUB and the object ARG1-OBJ; the auxiliary argument roles are time ARGM-TEMP, place ARGM-LOC and manner ARGM-MNR, as shown in table 3. This integrates semantic information into the syntactic structure tree at the phrasal level, as shown in figure 1 and figure 2. The Berkeley Parser is used for training and as the syntactic analyzer.
(2) Result assessment scheme
In this paper, the Tibetan information extraction results are compared with the standard library for analysis; precision (P), recall (R) and the F value are used for comprehensive evaluation.
$$P = \frac{\text{number of correctly extracted attributes}}{\text{number of attributes actually extracted}} \times 100\%$$
$$R = \frac{\text{number of correctly extracted attributes}}{\text{number of attributes that should be extracted}} \times 100\%$$
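The evaluation measures can be computed as in the following minimal sketch. The text does not state the F formula, so the balanced F1 measure, F = 2PR / (P + R), is assumed here; the function name and the counts in the usage note are illustrative.

```python
from typing import Tuple


def precision_recall_f(correct: int, extracted: int, expected: int) -> Tuple[float, float, float]:
    """Return (P, R, F) as percentages.

    correct   -- number of correctly extracted attributes
    extracted -- number of attributes the system actually extracted
    expected  -- number of attributes that should have been extracted
    """
    precision = 100.0 * correct / extracted if extracted else 0.0
    recall = 100.0 * correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F measure (F1); an assumption, since the text gives no formula.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value
```

With 8 correct attributes out of 10 extracted and 16 expected, this gives P = 80%, R = 50% and F of about 61.5%.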

Experimental method and results
This experiment compares the methods of part-of-speech tagging, trigger word recognition, part-of-speech tagging integrated with trigger word, and semantic role labelling integrated with trigger word.
(1) Part-of-speech tagging: uses only the part-of-speech tags of Tibetan words; for example, a verb is labelled VV and a noun is labelled NN.
(2) Trigger word recognition: takes the output of the trigger recognition model as the input of the information extraction model to obtain the extraction results.
(3) Part-of-speech tagging integrated with trigger word: combines the former two methods and is compared with each single method.
(4) Semantic role labelling integrated with trigger word: the method used in this paper, with the trigger word as the event feature and semantic role information as the text feature of Tibetan, as shown in figures 1 and 2, and labelled in the treebank.
The experiment results are shown in table 5.
Table 5. Comparison of the methods of Tibetan information extraction

Result analysis
From table 5, it can be seen that when only the part-of-speech tagging method (method 1) or only the trigger word recognition method (method 2) is used, the F value is low; dictionary information and part-of-speech tagging alone therefore contribute little to Tibetan information extraction. The reason is that both the dictionary and part-of-speech tagging focus on objective knowledge and do not integrate Tibetan linguistics into the model. The F values of the former two methods do not improve significantly, which also suggests that prior knowledge or a manually built model alone has little effect on performance.
Semantic role labelling integrated with trigger words, the method used in this paper, improves overall precision and recall for Tibetan information extraction, which shows that integrating Tibetan syntactic and semantic linguistic information also helps Tibetan information processing tasks. In addition, information extraction depends heavily on the accuracy of syntactic analysis. The Tibetan information extraction model based on semantic role labelling relies on the accuracy of the Tibetan phrase syntax analysis model, especially for long sentences.

Conclusion and following work
The Tibetan information extraction model integrated with semantic information adds semantic information to training on the basis of Tibetan phrase syntax, so as to support Tibetan information extraction. This paper provides a solution for Tibetan information extraction from the perspective of semantic information, including Tibetan semantic information classification, Tibetan semantic information labelling, and phrase syntax analysis of the Tibetan language. It may also serve as a reference for information extraction in other languages, including Mongolian, Uygur and other ethnic minority languages. The next step is to strengthen research at the lexical and syntactic level and to study an integrated Tibetan lexical and syntactic model, so as to realize a Tibetan syntax analysis model based on syllable sequences that serves Tibetan information extraction.