Laos Organization Name Using Cascaded Model Based on SVM and CRF

Based on the characteristics of Laos organization names, this paper proposes a two-layer model that combines conditional random fields (CRF) and support vector machines (SVM) for Laos organization name recognition. The first layer uses CRF to recognize simple organization names, and its result supports the decisions of the second layer. The second layer, based on a driven tagging method, uses SVM and CRF to recognize complicated organization names. Finally, the results of the two layers are combined, and a subsequent processing step corrects recognition results of low confidence. In an open test on real-world text, the approach based on SVM and CRF proves effective at recognizing organization names: the recall rate reaches 80.83% and the precision rate reaches 82.75%.

References [4][5] adopted the HMM approach for NER. This model rests on a stringent independence assumption, whereas in fact most data cannot be treated as a series of independent elements. Reference [6] adopted SVM for NER. References [7][8] adopted CRF for organization name recognition; the results are better, but there is still room for improvement.
Reference [9] established a role-tagging approach. Its weakness is that the role set has a great impact on the recognition result, so repeated experiments are needed to choose the right role set. Reference [10] combined machine learning with manually built knowledge for organization name recognition.
In this paper, Laos organization names are divided into two categories: simple organization names and complicated organization names. A simple organization name consists of a single word, such as court, congress, or republic. A complicated organization name consists of multiple words. Unlike in Chinese, a complicated organization name in Laos is post-modified: the feature word sits at the left boundary of the name and the modifiers follow it, as in the Lao name for the Confucius Institute in Laos. Such a name is therefore defined as having the form S + P, where S is the feature word of the organization name, such as company or university, and P is a rear word of the organization name. That is, a complicated Laos organization name is composed of one feature word followed by one or more rear words.
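The S + P form can be illustrated with a minimal sketch. The feature-word set, the English stand-in tokens, and the names `OrgName` and `split_complicated_name` are assumptions made for illustration; real input would be segmented Lao text, and the feature words would come from the table extracted from the training corpus.

```python
from typing import List, NamedTuple, Optional

# Hypothetical feature-word list (English glosses used as stand-ins for Lao words).
FEATURE_WORDS = {"company", "university", "institute", "factory"}


class OrgName(NamedTuple):
    feature_word: str      # S: the feature word at the left boundary
    rear_words: List[str]  # P: one or more rear words that follow it


def split_complicated_name(tokens: List[str]) -> Optional[OrgName]:
    """Split a segmented name into S + P, or return None if it does not fit the form."""
    if len(tokens) < 2 or tokens[0] not in FEATURE_WORDS:
        return None
    return OrgName(feature_word=tokens[0], rear_words=tokens[1:])
```

With the feature word first, a glossed name such as `["institute", "confucius"]` splits into S = `institute` and P = `["confucius"]`, while a one-word name such as `["court"]` is not treated as a complicated name.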

Resources needed for Laos organization name recognition
The lists required for organization name recognition are extracted automatically from the training corpus. The details are as follows:
(1) Feature word table D_f
Feature words are the words that characterize an organization name, such as "factory", "university", and "company". In a Laos organization name the feature word comes first, at the left boundary, so this table can serve as the trigger condition for organization name recognition.
(2) Rear word table D_b
Rear words are the words of an organization name other than the feature word. Location names and common nouns make up a large proportion of them, but overall the rear words are complex and highly random.
(3) Left and right boundary word tables
The left boundary word is the first word of an organization name, such as "representative" or "admitted"; the right boundary word is the last word of an organization name, such as "director" or "host". Different boundary words indicate the boundary of an organization name to different degrees. Therefore, when compiling the boundary word tables, it is necessary to count how many times each word occurs as a boundary word and to divide the words into levels according to that count.
(4) Simple organization name table
This table is used mainly for simple organization name recognition. The words in it are treated as candidate simple organization names, for example: post office.
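The extraction of these tables can be sketched as follows, assuming the training corpus has already been reduced to a list of segmented organization names. The function names `build_tables` and `to_levels` and the cutoff values are hypothetical; the paper does not state how many levels are used or where the level boundaries lie.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_tables(org_names: List[List[str]]) -> Tuple[Counter, Counter, Set[str]]:
    """Collect feature words, rear words and simple names from segmented organization names."""
    feature_words: Counter = Counter()  # D_f: first word of each multi-word name
    rear_words: Counter = Counter()     # D_b: every word after the feature word
    simple_names: Set[str] = set()      # one-word organization names
    for tokens in org_names:
        if len(tokens) == 1:
            simple_names.add(tokens[0])
        elif tokens:
            feature_words[tokens[0]] += 1
            rear_words.update(tokens[1:])
    return feature_words, rear_words, simple_names


def to_levels(counts: Counter, cutoffs: Tuple[int, ...] = (10, 3)) -> Dict[str, int]:
    """Map each word to a level: 0 for the most frequent band, larger numbers for rarer words."""
    # The cutoffs are assumed values; a word's level is the number of cutoffs it falls below.
    return {word: sum(1 for c in cutoffs if n < c) for word, n in counts.items()}
```

`to_levels` can be applied to any counter of boundary-word occurrences, so the same helper serves both the left and the right boundary word table.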

Laos organization name recognition using cascaded model based on CRF and SVM
The Laos organization name recognition model is divided into two layers. The first layer uses CRF to recognize simple organization names and passes its results to the second layer. The second layer is based on the driven tagging method, which combines SVM and CRF to identify complicated organization names: SVM identifies the left boundary of an organization name, and CRF then labels the words that follow each recognized left boundary word. The recognition results of the two layers are then combined. Figure 1 shows an example of organization name recognition converted into a sequence labeling task (see also Figure 2).

After the left boundary is determined, CRF labels the words behind it. Because organization names make up only a small proportion of the text, labeling every word would waste resources, so driven tagging is adopted: labeling is driven by the left boundary and applied only to the candidate words. The candidate words are determined as follows. Assume the longest organization name has length n. Each time a left boundary is determined, that word is directly labeled "L", and the n - 1 words that follow it become candidate words, unless punctuation, another left boundary, or the start of a new line is encountered first. Only the candidate words are then tagged; all other words are directly labeled as non-organization-name components. This strategy reduces training and tagging time to a certain extent and improves recognition efficiency, and because redundant information is reduced, recognition accuracy also improves. In addition to the atomic features used in the first layer, the second layer uses further atomic features.
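The candidate-word rule of driven tagging can be sketched in a few lines. This is a minimal illustration under stated assumptions: the left-boundary decisions (`is_left`) are taken as given, standing in for the SVM output; the mark "C" for candidate words, the punctuation set, and the use of a "\n" token for a line break are hypothetical choices, with only the "L" label taken from the text.

```python
from typing import List

PUNCTUATION = {".", ",", ";", ":", "!", "?"}  # assumed punctuation set


def driven_candidates(tokens: List[str], is_left: List[bool], n: int) -> List[str]:
    """Mark left boundaries "L", the words in their windows "C", and all other words "O"."""
    marks = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        if not is_left[i]:
            i += 1
            continue
        marks[i] = "L"
        j = i + 1
        # Take at most n - 1 following words, stopping early at punctuation,
        # at a line break, or at another left boundary.
        while j < len(tokens) and j < i + n:
            if tokens[j] in PUNCTUATION or tokens[j] == "\n" or is_left[j]:
                break
            marks[j] = "C"
            j += 1
        i = j  # resume scanning where the window ended
    return marks
```

Only the positions marked "C" would then be passed to the CRF for rear-word labeling, which is where the saving in training and tagging time comes from.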
This method is suited to texts in which organization names appear in complete form; for other kinds of material it needs some adjustment. If incomplete organization names make up a certain proportion of the text, two methods are used for recognition: first the method of this paper, and second CRF applied directly. The two recognition results are then compared, and where they differ, the one with the higher confidence is taken as the final result.
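Comparing the two recognition results might look like the sketch below. It assumes, for illustration only, that both systems emit one (label, confidence) pair per token and that ties go to the cascaded model; the paper does not specify the granularity of the comparison, and `merge_by_confidence` is a hypothetical name.

```python
from typing import List, Tuple

Labeled = Tuple[str, float]  # (label, confidence) for one token


def merge_by_confidence(cascade: List[Labeled], plain_crf: List[Labeled]) -> List[str]:
    """Keep labels on which both systems agree; otherwise keep the more confident one."""
    if len(cascade) != len(plain_crf):
        raise ValueError("both label sequences must cover the same tokens")
    merged = []
    for (label_a, conf_a), (label_b, conf_b) in zip(cascade, plain_crf):
        # Assumed tie-breaking: the cascaded model wins when confidences are equal.
        if label_a == label_b or conf_a >= conf_b:
            merged.append(label_a)
        else:
            merged.append(label_b)
    return merged
```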

Subsequent processing
Subsequent processing includes two parts. The first part builds a probabilistic model: for recognized strings whose confidence falls below a certain threshold, a credibility score is calculated. A suitable credibility threshold is selected through experiments; strings whose credibility is higher than the threshold are accepted as organization names, and the others are identified as non-organization names. The credibility T(org) of an organization name is computed from the credibility T(S) of its feature word and the credibility T(P) of its rear words. In addition, words in a parallel relationship (joined by words such as "and" or "versus") should be labeled consistently before and after the connecting word; if the labels are inconsistent, the label with the higher confidence is used for both.
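The consistency rule for parallel words can be sketched as below. The representation of the input as (text, label, confidence) items, the English coordinators, and the name `enforce_parallel_consistency` are assumptions for illustration; the credibility formula itself is not reproduced here.

```python
from typing import List, Tuple

COORDINATORS = {"and", "versus"}  # English stand-ins for the Lao connecting words

Item = Tuple[str, str, float]  # (text, label, confidence)


def enforce_parallel_consistency(items: List[Item]) -> List[Item]:
    """Give the items on both sides of a coordinator the label of the more confident side."""
    result = list(items)
    for k in range(1, len(result) - 1):
        if result[k][0] not in COORDINATORS:
            continue
        left, right = result[k - 1], result[k + 1]
        if left[1] == right[1]:
            continue  # already consistent
        if left[2] >= right[2]:
            result[k + 1] = (right[0], left[1], right[2])
        else:
            result[k - 1] = (left[0], right[1], left[2])
    return result
```

The relabeled item keeps its original confidence in this sketch, so a later step can still see that the label was weakly supported.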
The second part extracts context frames of organization names from the training corpus, such as (admitted to, candidates to) + Organization Name + (school, reading, work). The frames are pruned according to how often they occur. Recognition results whose confidence is lower than a threshold are matched against the frames: those that match are accepted as organization names, and the others are identified as non-organization names.
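Frame matching for low-confidence results can be sketched as follows. The sketch assumes a frame is a pair of single context words (one before and one after the name), that both sides must match, and that 0.5 is the confidence threshold; all three are illustrative assumptions, as are the function names.

```python
from typing import List, Set, Tuple

Frame = Tuple[str, str]  # (word before the name, word after the name)


def matches_frame(tokens: List[str], start: int, end: int, frames: Set[Frame]) -> bool:
    """Return True if the words around tokens[start:end] form a known frame."""
    if start <= 0 or end >= len(tokens):
        return False  # a context word is missing on one side
    return (tokens[start - 1], tokens[end]) in frames


def recheck(tokens: List[str], start: int, end: int, confidence: float,
            frames: Set[Frame], threshold: float = 0.5) -> bool:
    """Accept confident candidates directly; low-confidence ones must match a frame."""
    if confidence >= threshold:
        return True
    return matches_frame(tokens, start, end, frames)
```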

Experiment analysis
The data used in this paper were collected from Lao-language news websites and processed with a word segmentation program. Part of the data was manually tagged by Lao language experts and Lao students and is used as the experimental corpus; the remaining part is used as the test corpus. Test results are evaluated with three common measures: correct rate (precision), recall rate, and F-measure.
(1) Correct rate (P): P is the number of correct named entities in the answer file divided by the total number of named entities in the answer file.
As can be seen from the experimental results, driven tagging with SVM and CRF gives the best recognition performance, and training time is shortened because redundant information is reduced. However, because recognition in this paper depends on correct word segmentation and part-of-speech tagging, word segmentation errors lower the recognition accuracy.
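The three evaluation measures can be computed from entity counts as in the sketch below. The definition of P follows the text; recall and the balanced F-measure use their standard definitions, which is an assumption about the exact variants the experiments used. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f(correct: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Return (P, R, F) from counts of correct, predicted and gold-standard entities."""
    p = correct / predicted if predicted else 0.0  # correct / entities in the answer file
    r = correct / gold if gold else 0.0            # correct / entities in the reference annotation
    f = 2 * p * r / (p + r) if p + r else 0.0      # balanced F-measure (assumed beta = 1)
    return p, r, f
```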

Conclusion
In this paper, a two-layer model based on CRF and SVM is established to recognize organization names. According to the different characteristics of simple and complicated organization names, different recognition methods are used at the different layers.