Analysis on types of spelling errors in true Tibetan characters

Spelling error checking is a challenging research topic with a wide range of applications, including text editing, word processing, and language teaching. In Tibetan, an alphabetic language, spelling errors fall into three types: non-true character errors, true character errors, and punctuation misuse. To study true Tibetan syllable spelling errors in more depth, this article analyzes their types on the basis of Tibetan word formation rules, grammar, and semantic features, laying a foundation for research on Tibetan spelling error checking.


Introduction
With the rapid growth of Tibetan text available online, Tibetan spelling error checking has become an urgent need and has attracted considerable interest in both research and applications. The more detailed and thorough the analysis of spelling error types, the more effective the design of spell-checking strategies can be. Analyzing the errors found in Tibetan texts, and summarizing and categorizing their regularities and commonalities, is therefore essential for developing effective spell-checking methods. Tibetan spelling involves three aspects: non-true characters, true characters, and punctuation. In recent years, researchers have studied spelling checking for Tibetan non-true characters and obtained many valuable results [1][2][3]. True-character spelling checking is an equally important part of Tibetan text spelling checking, and scholars have begun to pay attention to it. An analysis of the error types is the basic groundwork for true-character spelling checking, but no such analysis has been published, which holds back the development of spell-checking technology for Tibetan text. This article takes Tibetan word formation rules, grammar, and semantics as its starting point, analyzes the types of spelling errors in Tibetan true characters, and provides data support for research on true-character spelling checking technology.

Research status
In 1967, the British linguist Corder [4] first proposed the concept of error analysis. He systematically analyzed the errors in a collected text corpus and studied their nature and types, which opened the era of text error analysis. Because language itself is complex, text errors are of many types and are difficult to analyze. To support in-depth work on error types, the Association for Computational Linguistics (ACL) has a Special Interest Group on Natural Language Learning (SIGNLL), whose CoNLL conference hosts shared tasks. The goal of the CoNLL-2014 shared task [5] was to automatically detect all types of grammatical errors in short English texts written by non-native English speakers and to return the corrected text. Inspired by this shared task for English, much research on the analysis of error types has been carried out in China, and the field has received extensive attention. The International Conference on Natural Language Processing and Chinese Computing (NLPCC) added a Chinese grammatical error correction task whose goal is to detect and correct grammatical errors in Chinese sentences written by non-native Chinese speakers [6]. In the NLPCC 2018 evaluation, six teams from Alibaba, Peking University, and other institutions achieved good results. In 2018, Tan et al. analyzed five types of error that ESL learners often make (noun singular and plural errors, verb form errors, subject-predicate inconsistency errors, article errors, and preposition errors) and proposed a grammatical error correction method based on LSTM and N-grams [7]. In 2020, Liang et al. classified and analyzed the spelling errors of English learners and designed an automatic spelling check system for the corresponding error types [8].
Since the beginning of the 21st century, scholars have begun to analyze Tibetan spelling errors, focusing mainly on the error types of non-true characters. In 2009, Dorje Dolma elaborated on the diversity of spelling errors in Tibetan texts and used an n-gram model to check Tibetan syllables [9]. In 2011, Guan Bai analyzed the types of errors in Tibetan characters and designed a corresponding method for proofreading Tibetan syllables [10]. In 2013, Zhu Jie et al. defined five types of Tibetan text errors and, on that basis, discussed the spelling check of Tibetan syllables, the error check of Sanskrit transliteration, the check of continuous relations, and the error check of Tibetan words in a text proofreading system [2]. In 2017, Liu et al. used predetermined rules to count the types of non-true character spelling errors in a corpus of more than 90 million syllables from Tibetan web pages and analyzed the causes of those errors [3]. No comparable analysis of error types exists yet for true-character spelling checking, and that is the gap this article addresses.

Classification of spelling errors in Tibetan text
Tibetan text is built hierarchically: letters form syllables, syllables form words, words form phrases, and phrases form sentences. Spelling errors can therefore occur at the letter level, the word level, the grammatical level, and the semantic level, as well as in punctuation. Non-true character errors are misspellings that do not conform to the word formation rules of Tibetan grammar. For example, the letter " ཁ " in " ཁགོ ངས " stands in the pre-added position, which it cannot occupy, so the syllable is a non-true character error. True character errors are syllables that comply with the Tibetan word formation rules but are wrong in their context. For example, in the sentence " ང་�ོ ང་མ་ཡི ན། " (I am a student), each Tibetan character is correct individually. However, according to the meaning of the sentence, the word " �ོ ང " should be " �ོ བ ", which is a true character error. Our analysis shows that punctuation errors in Tibetan texts mainly involve the syllable separator and the single vertical stroke, and are of two types: missing and redundant. For example, " �སོ ང་། " lacks the separator between the syllables " � " and " སོ ང ", which is a missing punctuation error. The sentence " ང་�་་�ང་�་སོ ང། " (I go to the street) contains both types. Two separators appear between the syllables " � " and " �ང ", which is a redundant punctuation error. The syllable " སོ ང " and the single vertical stroke " ། " lack a syllable separator between them, which is a missing punctuation error.
Letter-level spelling errors are non-true character errors: whether a syllable conforms to the word formation rules of the grammar is judged from the syllable alone, independently of context. Word-level, grammatical-level, and semantic-level errors are true character errors: the syllable conforms to the word formation rules, and what is judged is whether it is correct in its context.
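The two punctuation error types described above can be illustrated with a minimal Python sketch. The function name and the error labels are hypothetical. The sketch assumes that the syllable separator is U+0F0B and the single vertical stroke is U+0F0D, and it flags a missing separator only where the stroke directly follows the letter ང (U+0F44), the case shown in the example; it cannot find a separator missing between two syllables, which requires knowledge of syllable boundaries.

```python
import re

TSHEG = "\u0F0B"  # syllable separator
SHAD = "\u0F0D"   # single vertical stroke
NGA = "\u0F44"    # the letter nga

def find_punctuation_errors(text):
    """Return (error_type, index) pairs for two simple punctuation errors."""
    errors = []
    # Redundant: two or more separators in a row.
    for m in re.finditer(TSHEG + "{2,}", text):
        errors.append(("redundant_separator", m.start()))
    # Missing (assumed rule): vertical stroke directly after final nga.
    for m in re.finditer(NGA + SHAD, text):
        errors.append(("missing_separator", m.start() + 1))
    return sorted(errors, key=lambda e: e[1])
```

Each returned index is the position in the string where the redundant run starts or where the separator would have to be inserted.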

Classification of spelling errors in True Tibetan characters
At present, spelling errors in true characters receive the most attention in research on Tibetan spell checking, and this research is of great significance and value to Tibetan NLP in general. By analyzing Tibetan grammar and a Tibetan corpus, we identified four types of true character spelling errors: word formation errors, grammatical errors, semantic errors, and joint errors (see Table 1).

Word formation error
Tibetan word formation refers to the way a single letter or a single syllable forms a word, either on its own or in combination with other well-formed Tibetan characters. Word formation errors are caused by similarity of form or of sound. This article divides them into eight categories: pre-added character errors, upper-added character errors, root character errors, lower-added character errors, vowel errors, post-added character errors, post-post-added character errors, and mixed component errors. For example, in the sentence " མཚ� �་ནང་གི ་�་�ང་། " (A boat in the lake), the word " �་�ང " (boat) is wrongly written as " �ི ་�ང " (knife).
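The eight categories can be made concrete with a small sketch that compares a written syllable with the intended one, component by component. This is an illustration under stated assumptions, not the method of this article: the `Syllable` class and the function name are hypothetical, the decomposition of a syllable into its components is assumed to be supplied by some other tool, and a "mixed component error" is taken here to mean that more than one component differs. Components are given in Latin transliteration purely for readability.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class Syllable:
    """Components of one syllable; absent components are None."""
    prefix: Optional[str] = None         # pre-added character
    superscript: Optional[str] = None    # upper-added character
    root: Optional[str] = None           # root character
    subscript: Optional[str] = None      # lower-added character
    vowel: Optional[str] = None
    suffix: Optional[str] = None         # post-added character
    second_suffix: Optional[str] = None  # post-post-added character

def classify_word_formation_error(written, intended):
    """Name the category of the difference, or None if identical."""
    differing = [f.name for f in fields(Syllable)
                 if getattr(written, f.name) != getattr(intended, f.name)]
    if not differing:
        return None
    if len(differing) == 1:
        return differing[0] + "_error"
    # Assumption: several differing components count as "mixed".
    return "mixed_component_error"

# Illustrative pair chosen for this sketch: only the vowel differs.
intended = Syllable(root="g", subscript="r", vowel="u")
written = Syllable(root="g", subscript="r", vowel="i")
print(classify_word_formation_error(written, intended))
```

Such a comparison needs the intended syllable, so it is suited to labeling errors in an annotated corpus rather than to detecting them in raw text.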

Grammatical errors
Tibetan grammar consists of two parts: the "Thirty Odes" and the "Character Organization Law". The "Thirty Odes" covers 10 kinds of non-free function words, each with its own adding rules. Spelling errors related to this part mainly concern the addition of non-free function words: the form of the function word to be added depends on the post-added character of the preceding syllable. Spelling errors related to the "Character Organization Law" mainly concern verb tense: the choice of verb tense depends on a key time word or on certain other words in the sentence. Following these two parts, this paper divides grammatical errors into adding errors of non-free function words and verb tense errors. For example, " དེ " (should be " ཏེ ") in the sentence " འ�མ་ད�ལ་དེ ་བཤད། " (said with a smile) violates the adding rule for this function word and is an adding error of a non-free function word. As another example, in the sentence " �ོ ན་ཆད་ཨ་ཕས་ལས་ཀ་དེ ་�བ་�ོ ང་། " (the job that father did before), the time word " �ོ ན་ཆད " (before) at the beginning of the sentence determines that the verb should be in the past tense, " བ�བས ". The verb " �བ " is in the present tense, which constitutes a verb tense error.
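An adding rule of this kind can be sketched as a lookup on the final letter of the preceding syllable. The sketch below covers only the function word from the example (" ཏེ ", " དེ ", and the third variant སྟེ). Its rule table is an assumption taken from the commonly taught formulation (ཏེ after ན ར ལ ས, དེ after ད, སྟེ otherwise) and should be checked against a grammar reference before use. The function names are hypothetical, and the final letter is assumed to be supplied by a separate syllable analysis.

```python
from typing import Optional

TE = "\u0F4F\u0F7A"         # ཏེ
DE = "\u0F51\u0F7A"         # དེ
STE = "\u0F66\u0F9F\u0F7A"  # སྟེ

# Assumed rule table, keyed by the final letter of the preceding syllable.
TE_AFTER = {"\u0F53", "\u0F62", "\u0F63", "\u0F66"}  # ན ར ལ ས
DE_AFTER = {"\u0F51"}                                # ད

def expected_particle(final_letter: Optional[str]) -> str:
    """Variant expected after the given final letter (None = no suffix)."""
    if final_letter in TE_AFTER:
        return TE
    if final_letter in DE_AFTER:
        return DE
    return STE  # assumed for all remaining finals and for no suffix

def check_particle(final_letter: Optional[str], written: str) -> Optional[str]:
    """Return the expected variant if `written` is wrong, else None."""
    expected = expected_particle(final_letter)
    return None if written == expected else expected
```

In the example sentence the preceding syllable ends in ལ, so `check_particle("\u0F63", DE)` returns the variant " ཏེ ", which agrees with the correction given in the text.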
Tibetan makes use of abbreviation, which shortens a long word into fewer syllables or into a single syllable. A valid abbreviation stays faithful to the original and does not change its theme or central idea. For example, " བ�་ཤི ས " (Auspicious or Tashi) is abbreviated as " བ�ི ས ", but " བ�ི ས " already exists as the future tense of " �ི ད ", meaning to lead, guide, or quote. This abbreviation violates the requirement of fidelity to the original, because the abbreviated form is ambiguous with an existing word of a different meaning. It is therefore an abbreviation error.
A predicate redundancy error occurs when a further predicate is added after an object and a predicate have already formed a phrase. For example, the " ཐག་གཅོ ད " in " ཐག་གཅོ ད་�ས " is composed of the object ( ཐག ) and the predicate ( གཅོ ད ): " ཐག " means able, and " གཅོ ད " is an action verb related to the object. There is no need to add the predicate " �ས " after it (the correct usage is " ཐག་བཅད "), so this is a predicate redundancy error. Such errors occur frequently in Tibetan texts; " བེ ད་�ོ ད་བཏང ", " �་བ�ེ ད་�ས ", etc., are all predicate redundancy errors.
Literal translation errors often appear in translated text. They violate the requirement that a literal translation preserve both the content and the form of the original. For example, the literal translation of "草原上鲜花盛开" (flowers are in bloom on the grassland) is " �་ཐང་�ེ ང་�་མེ ་ཏོ ག་བཞད། ". Although this sentence contains no grammatical errors, the influence of Chinese introduces a semantic problem (it should be translated as " �་ཐང་�་མེ ་ཏོ ག་བཞད། "), which makes it a literal translation error.

Conclusion
Spelling errors arise in the use of any language. Taking Tibetan word formation rules, grammar, and semantics as its starting point, this article analyzes the types of Tibetan true character spelling errors. It divides spelling errors in Tibetan text into non-true character errors, true character errors, and punctuation errors, and then divides true character errors into first-level types: word formation errors, grammatical errors, semantic errors, and concatenated errors. Word formation errors, grammatical errors, and semantic errors are further divided into second-level types. These results lay a foundation for the downstream task of Tibetan spelling checking. Building on them, we will study spell-checking methods for the different error types in order to improve the performance of automatic spell checking of Tibetan text.