Selection of acoustic modeling unit for Tibetan speech recognition based on deep learning

The selection of the modeling unit is the primary problem of acoustic modeling in speech recognition, and the choice of acoustic modeling unit directly affects overall recognition performance. To address the selection of the acoustic modeling unit in Tibetan speech recognition, this paper analyzes the deficiencies of the existing acoustic modeling units and designs a Tibetan character segmentation and labeling model and algorithm flow. In experiments, the model and algorithm performed well, with the accuracy of Tibetan character segmentation and labeling reaching 99.98%.


Introduction
Automatic speech recognition is a key technology for human-computer interaction. In recent years, deep learning-based speech recognition has advanced by leaps and bounds [1][2] and is widely used in fields such as voice search, personal digital assistants, and in-vehicle entertainment systems [3].
The selection of modeling units is the primary problem facing acoustic modeling in Tibetan speech recognition, and it underpins the whole recognition process. In Tibetan speech recognition systems, researchers have considered modeling units of different granularity, including words and syllables [4], vowels [5][6][7][8], and phonemes [9][10][11]. Tibetan not only has a large vocabulary but also many variant forms. If words or syllables are used as modeling units, the demands on the corpus are too high and data sparsity results: it is difficult to cover all modeling units in the training data, so many units lack enough training samples to ensure a reliable acoustic model. If vowels or phonemes are used as modeling units, the differences in pronunciation of the same phoneme in different positions cannot be distinguished. To solve these problems, this paper proposes using the Tibetan character as the modeling unit and presents the flow of its segmentation and labeling algorithm.

Tibetan character selected as the acoustic modeling unit for speech recognition
The Tibetan script is a phonetic writing system with regular spelling rules, in which letters are combined both horizontally and vertically. Usually, if one knows the pronunciation pattern of the Tibetan letters, one can spell out the pronunciation of the corresponding Tibetan characters. Tibetan letters are stacked vertically to form a unit known as a character [12] or precomposed character stack [13]. Specifically, a Tibetan character is defined as any single letter or stacked combination that includes the base character, head letter, subjoined letter, and vowel. For example, in the Tibetan syllable བ�ི གས, each of བ, �ི , ག and ས is one character, so the syllable is composed of four characters.

Tibetan character segmentation and labeling model
The Tibetan character segmentation and labeling model in this paper consists mainly of preprocessing, segmentation, and labeling modules, as shown in Figure 1. To segment and label Tibetan characters with this model, the system first reads the text and preprocesses it. Segmentation and labeling are then carried out with the help of the composite component library, the diacritic base letter library, and the knowledge rule library.
The preprocessing module consists of submodules for digit normalization, Sanskrit normalization, non-Tibetan character filtering, and contraction reduction.
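As a rough illustration of the non-Tibetan character filtering step only, the sketch below keeps code points in the Unicode Tibetan block (U+0F00 to U+0FFF) and collapses everything else into single spaces. The function name `keep_tibetan` and the decision to separate the surviving runs with a space are assumptions made for illustration; the paper does not describe how its filter is implemented, and the other preprocessing steps are not covered here.

```python
def keep_tibetan(text: str) -> str:
    """Drop every character outside the Unicode Tibetan block.

    Hypothetical helper: runs of removed characters are replaced by a
    single space so that the surviving Tibetan runs stay separated.
    """
    # Replace each non-Tibetan character with a space.
    masked = "".join(ch if "\u0f00" <= ch <= "\u0fff" else " " for ch in text)
    # Collapse repeated spaces and trim both ends.
    return " ".join(masked.split())


if __name__ == "__main__":
    # Mixed Latin, Tibetan (with a trailing tsheg), and ASCII digits.
    print(keep_tibetan("abc \u0f63\u0f7a\u0f42\u0f66\u0f0b 123"))
```

Filtering by code-point range is the simplest option; a production filter would probably also need a policy for punctuation and digits that the paper handles in its normalization steps.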
Segmentation and labeling consist mainly of the character segmentation and character labeling modules. Character segmentation refers to the longitudinal segmentation of Tibetan syllables; for example, ལེགས is divided into ལེ, ག and ས. Character labeling refers to labeling each segmented character according to the pronunciation of the same character in different positions; for example, ལེགས is marked as ལེ, sག and fས.
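The segmentation step can be approximated with Unicode properties alone. The following is a minimal sketch, not the library-driven procedure of the paper: it assumes that every combining mark (subjoined letter or vowel sign, Unicode general category M*) belongs to the preceding base letter, and that syllables are delimited by the tsheg (U+0F0B). The names `split_stacks` and `segment_text` are hypothetical.

```python
import unicodedata

TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def split_stacks(syllable: str) -> list[str]:
    """Split one syllable into characters (stacks).

    A new stack starts at every non-mark code point; combining marks
    are attached to the stack that precedes them.
    """
    stacks: list[str] = []
    for ch in syllable:
        if stacks and unicodedata.category(ch).startswith("M"):
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks


def segment_text(text: str) -> list[list[str]]:
    """Split text on the tsheg and segment every non-empty syllable."""
    return [split_stacks(s) for s in text.split(TSHEG) if s]


if __name__ == "__main__":
    # The syllable from the text: base la + vowel e, then ga, then sa.
    print(split_stacks("\u0f63\u0f7a\u0f42\u0f66"))
```

This relies on the text being stored as base letters followed by subjoined letters and vowel signs, which is the normal Unicode encoding of Tibetan; it does not consult the composite component library mentioned in the model.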
The composite component library contains the 476 composite components counted in [14]. In Tibetan, the pronunciation of the same character can differ with its position. For example, ག is pronounced 'ka', but when it appears in the base letter position of a multi-component syllable (དགའ) it is pronounced 'ga'; the pronunciation of some 40 characters, including ད, བ, ཟ and ཞ, changes in this way. The modeling unit is the same ག, yet the speech features differ markedly. Therefore, a library of base characters with diacritics was constructed. Beyond the differences among base letters themselves, there are also pronunciation differences between base letters and suffixed letters, and between prefixed letters and base letters. To distinguish the same character in different positions, the text is first segmented into Tibetan characters, and then the prefixed letter, diacritic base letter, suffixed letter, and second suffixed letter are marked.

Fig. 2. Flow chart of Tibetan character segmentation and labeling algorithm.

Tibetan character segmentation and labeling algorithm flow
The structure of a Tibetan syllable is (prefixed letter +) base letter / combined construction (+ suffixed letter)(+ second suffixed letter) [15]. Tibetan characters can be divided into four categories: single-component, double-component, three-component, and four-component Tibetan characters. Based on the characteristics of Tibetan phonetics, they can be further divided into thirteen subcategories. The algorithm flow shown in Figure 2 is designed to segment and label these thirteen subcategories.
As shown in Figure 2, the algorithm takes a Tibetan text as input and outputs the text segmented and labeled by character. In the flow chart, CM denotes the combined construction, DB denotes the diacritic base character library, S denotes ག, ང, བ and མ, DB1 denotes ད, ཞ and ཟ in the diacritic base character library, and F denotes ས.
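To make the positional structure concrete, the sketch below assigns a role to each character of an already segmented syllable and prefixes it with a marker. It is a heavily simplified stand-in for the flow chart, under several assumptions that are not taken from the paper: the base is the first character that carries a subjoined letter or vowel sign; if no character does, a three-character syllable is read as base + suffix + second suffix only when it ends in an S letter followed by F, and as prefix + base + suffix otherwise. The markers `s` and `f` follow the ལེགས example; the marker `p` for prefixed letters and the names `find_base` and `label_syllable` are hypothetical, and the diacritic base libraries (DB, DB1) are not modeled.

```python
# Sets named after the flow-chart symbols described in the text.
S = {"\u0f42", "\u0f44", "\u0f56", "\u0f58"}  # ga, nga, ba, ma
F = "\u0f66"                                  # sa


def find_base(stacks: list[str]) -> int:
    """Return the index of the base character (simplified heuristic)."""
    n = len(stacks)
    if not 1 <= n <= 4:
        raise ValueError("expected 1 to 4 characters per syllable")
    # A character with attached marks (length > 1) is taken as the base.
    for i, stack in enumerate(stacks[:2]):
        if len(stack) > 1:
            return i
    if n <= 2:
        return 0  # assumed: base (+ suffix)
    if n == 3:
        # base + suffix + second suffix only for an S letter followed by F
        return 0 if stacks[1] in S and stacks[2] == F else 1
    return 1      # four characters: prefix + base + suffix + second suffix


def label_syllable(stacks: list[str]) -> list[str]:
    """Prefix each character with a positional marker (base is unmarked)."""
    base = find_base(stacks)
    markers = {-1: "p", 0: "", 1: "s", 2: "f"}
    labeled = []
    for i, stack in enumerate(stacks):
        offset = i - base
        if offset not in markers:
            raise ValueError("syllable does not fit the assumed structure")
        labeled.append(markers[offset] + stack)
    return labeled


if __name__ == "__main__":
    print(label_syllable(["\u0f63\u0f7a", "\u0f42", "\u0f66"]))
```

Real Tibetan spelling has exceptions that this heuristic does not resolve, which is presumably why the paper relies on a knowledge rule library and thirteen subcategories rather than position alone.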

3.1 Experimental data description
Since no shared Tibetan text corpus is currently available, this paper uses the research group's existing Tibetan corpus to build three sample corpora for Tibetan character segmentation and labeling. For ease of description, they are denoted C1, C2, and C3. The C1 corpus contains 220,000 sentences, 4.35 million words, and 553 characters; the C2 corpus contains 240,000 sentences, 5.34 million words, and 561 characters; the C3 corpus contains 260,000 sentences, 4.1 million words, and 563 characters. The corpus sources are Tibetan textbooks for elementary and middle schools, Weirenwang.com, China Tibetans.com, and Qiong mai Tibetan Literature.

Experiment 1
To verify the segmentation and labeling of characters, we designed the first set of experiments to examine segmentation on the different data sets and to count the size of the character set. From Table 1, it can be seen that the accuracy of the Tibetan character segmentation and labeling model reached 99.98%; the 0.02% error is due to two syllables in the text that are incomplete and mislabeled. This shows that the Tibetan character segmentation and labeling model proposed in this paper achieves good segmentation and labeling results.

Experiment 2
Characters are divided into four levels according to their frequency of use in modern Tibetan, across prefixed characters, base characters, diacritic base characters, suffixed characters, and second suffixed characters. Figure 3 shows the relationship between the number of characters and their frequency of use.

Fig. 3. Frequency distribution table of Tibetan characters
It can be seen from Figure 3 that fewer than 4% of characters account for as much as 49.61% of usage when composing Tibetan text, and these 4% are all diacritic characters. The other 90.61% of characters account for only 47.10% of usage, and they are all invariant characters. This shows that, in Tibetan, a character without positional labels cannot distinguish the differences in pronunciation of the same phoneme in different positions. The segmentation and labeling of Tibetan characters therefore resolves these differences in phoneme pronunciation.
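A coverage figure of this kind (the share of usage contributed by the most frequent fraction of character types) can be computed from character counts as in the following sketch. The function name `coverage_of_top` and the toy counts are illustrative assumptions only; they do not reproduce the paper's corpus or its figures.

```python
import math
from collections import Counter


def coverage_of_top(counts: Counter, fraction: float) -> float:
    """Share of all occurrences covered by the most frequent `fraction`
    of distinct character types (at least one type is always included)."""
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


if __name__ == "__main__":
    # Toy data: three character types with 6, 3, and 1 occurrences.
    toy = Counter({"x": 6, "y": 3, "z": 1})
    print(coverage_of_top(toy, 0.5))
```

Applied to the labeled output of the segmentation model, the same count would show how usage is distributed between position-sensitive and invariant characters.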

Summary
Tibetan character segmentation and labeling are the groundwork for selecting acoustic modeling units for Tibetan speech recognition. This paper designs a Tibetan character segmentation and labeling model and proposes the corresponding algorithm flow. Experiments show that the model and algorithm flow achieve good character segmentation and labeling performance, with an accuracy of 99.98%. This satisfies practical needs and lays the foundation for subsequently building a character-based acoustic model for Tibetan speech recognition. Building on this work, we plan to study neural network acoustic models based on Tibetan characters in order to improve the performance of Tibetan speech recognition.