A language model for Amdo Tibetan speech recognition

We built a language model based on the Transformer network architecture, which uses attention mechanisms to dispense with recurrence and convolutions entirely. Through the transliteration of Tibetan into the International Phonetic Alphabet (IPA), the language model was trained with the syllables and phonemes of Tibetan words as modeling units, predicting the corresponding Tibetan sentences from the context semantics of the IPA sequence. Combined with an acoustic model, the resulting Tibetan speech recognition system was compared with end-to-end Tibetan speech recognition.


Introduction
The research on Tibetan language models is still in its infancy [1], and although there is some research based on deep neural networks [2][3], little of it targets speech recognition. In speech recognition, deep learning algorithms can achieve end-to-end recognition with words or phrases as the modeling unit [4][5][6][7]. Neural network models outperform traditional models in speech recognition, but they depend on large volumes of data, requiring a great deal of speech data for training to realize their potential. Tibetan is a minority language spoken by a relatively small population in China. It is mainly divided into three dialects: U-Tsang, Kamba, and Amdo. The speech data required to train an end-to-end Tibetan speech recognition model is therefore much harder to collect than a text corpus. As a result, Tibetan speech recognition still uses syllables or phonemes as modeling units, and the combination of acoustic and language models performs better. This paper presents a language model for Amdo Tibetan speech recognition, describes how to transliterate Tibetan sentences into the corresponding IPA, and trains language models using syllables or phonemes as modeling units for the speech recognition task.

Transformer component
The Transformer was originally used in the field of machine translation [8]. It differs from RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) structures: it uses a self-attention mechanism to relate different positions of a sequence when computing the representation of each word, and it processes the sequence in parallel. Its entire model framework is built with attention mechanisms and feed-forward neural networks, and the Transformer's training speed and performance are much better than an RNN's [9].

Multi-head attention
First, the dot product of the queries with all keys is divided by the scaling factor $\sqrt{d_k}$, and then a softmax function is applied to obtain the weights on the values, giving the scaled dot-product attention:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The scaling factor plays an adjustment role: without it, the dot products grow large in magnitude, pushing the softmax function into regions with extremely small gradients. Multi-head attention applies this attention function in parallel over $h$ projected versions of the queries, keys, and values and concatenates the results. The output matrix is:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\mathrm{model}}}$.
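As a minimal illustration, the following Python sketch implements scaled dot-product attention and a multi-head wrapper with NumPy; the shapes and the way the projection matrices are supplied are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # divide by the scaling factor sqrt(d_k)
        weights = softmax(scores, axis=-1)  # attention weights over the values
        return weights @ V

    def multi_head_attention(X, W_q, W_k, W_v, W_o):
        # X: (seq, d_model); W_q, W_k, W_v: lists of h per-head projection matrices
        heads = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
                 for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
        return np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, project back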

Feed-forward neural network and positional encoding
The feed-forward neural network consists of two linear transformations with a ReLU activation in between.
$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$

where $x$ represents the input, $W_1$ and $W_2$ are the parameter matrices of the two linear transformations, and $b_1$ and $b_2$ are the corresponding bias terms.
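The positional encoding itself is not restated above; the sketch below assumes the sinusoidal scheme of the original Transformer [8], with the model dimension of 512 used in this paper.

    import numpy as np

    def positional_encoding(seq_len, d_model=512):
        pos = np.arange(seq_len)[:, None]     # token positions 0 .. seq_len-1
        i = np.arange(d_model)[None, :]       # embedding dimensions
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])  # sine on even dimensions
        pe[:, 1::2] = np.cos(angle[:, 1::2])  # cosine on odd dimensions
        return pe                             # added to the input embeddings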

Tibetan phonetic transcription
In order for the language model to predict Tibetan sentences from the context semantics of IPA, and then combine with the acoustic model for speech recognition, we need to transliterate Tibetan into the corresponding IPA sequence as the input of the Transformer; the Tibetan words are used as the output of the Transformer to train the model. Tibetan script is a two-dimensional (horizontal and vertical) phonetic script composed of consonants and vowels. A syllable is composed of up to 7 basic components according to strict Tibetan grammar rules. In spelling order, they are the Prefix, Superscript, Root Consonant, Subscript, Vowel sign, Suffix, and Second Suffix [10]. There is a many-to-one mapping between Tibetan words and their corresponding phonetic symbols. Tibetan syllables are separated by a tsek '་'. Usually a Tibetan word is one syllable [11], consisting of one or more consonants and a monophthong, or a monophthong combined with a final consonant [12]. In this paper, four components of the Tibetan syllable (Prefix, Superscript, Root Consonant, Subscript) are used as consonant phonemes, while the Vowel sign, Suffix, and Second Suffix are used as vowel phonemes. For example, the Tibetan sentence "བོ ད་�ད་�་བ�ོ ད" (Tibetan speech) is transcribed as "wot ʰkat ʰma ɟʷot". In practice, however, the recognition result of the acoustic model is not that accurate. For example, the "�་བ�ོ ད" speech signal in a certain context is sometimes recognized as "ʁʰma ɟot" or even "ma ɟot". This is because the acoustic characteristics of some Tibetan words are similar, or the surrounding speech affects the acoustic characteristics of the current word. To address this, we transcribed a small part of the training data in a broad way, ignoring as many details as possible, or transcribed it according to the acoustic characteristics of the context.
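A dictionary-based transcription step could look like the following hypothetical sketch; the mapping table and the "broad" simplification (dropping aspiration marks) are illustrative assumptions, since the paper's actual component-to-IPA rules are not reproduced here.

    TSEK = "\u0f0b"  # the Tibetan syllable delimiter tsek '་'

    # Hypothetical syllable-to-IPA table, to be filled from the transcription rules.
    syllable_to_ipa = {}

    def transcribe(sentence, broad=False):
        ipa = []
        for syllable in sentence.strip(TSEK).split(TSEK):
            phones = syllable_to_ipa.get(syllable, "<unk>")
            if broad:
                phones = phones.replace("\u02b0", "")  # e.g. ignore aspiration detail
            ipa.append(phones)
        return " ".join(ipa)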

Model architecture
The model framework is mainly composed of an encoder and a decoder. The encoder is composed of 6 network blocks; all blocks share the same structure but not their parameters. The decoder is likewise composed of 6 network blocks, whose structure matches the encoder's except for one additional sublayer described below. To optimize the training process, the entire network uses residual connections followed by layer normalization.
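A minimal sketch of the residual connection with layer normalization applied around each sublayer, i.e. LayerNorm(x + Sublayer(x)); the learnable gain and bias of layer normalization are omitted for brevity.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)     # per-position normalization

    def with_residual(x, sublayer):
        return layer_norm(x + sublayer(x))  # add the input back, then normalize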

Encoder
Each encoder block can be divided into two sublayers: a self-attention sublayer and a feed-forward neural network sublayer.

Fig. 2. Encoder architecture.
After a Tibetan sentence is transcribed into an IPA sequence, it is vectorized by the embedding layer. The vector dimension is 512. When the transcribed IPA units are syllables or phonemes, the sequence length is set to 25 or 60 respectively. The positional encoding layer then adds relative or absolute position information to each vector in the sequence. Next, each vector is linearly transformed into three vectors q, k, and v and passed to the self-attention layer, where multi-head self-attention with 8 heads extracts features, allowing the model to jointly attend to information from different representation subspaces at different positions while considering the whole sequence. The results of the heads are concatenated and compressed into one vector sequence by a linear transformation, passed to the feed-forward neural network layer, and then fed into the next encoder block. Finally, the encoder's output sequence is converted into the attention matrices (K, V) and sent to the decoder.
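The data flow of one encoder block can be sketched with standard PyTorch modules as below; this mirrors the description above (512-dimensional vectors, 8 heads, residual connections with layer normalization) but is not the paper's code, and the inner feed-forward width of 2048 is an assumption following [8].

    import torch

    d_model, heads, seq_len = 512, 8, 25  # length 25 for syllables, 60 for phonemes
    attn = torch.nn.MultiheadAttention(d_model, heads, batch_first=True)
    ffn = torch.nn.Sequential(torch.nn.Linear(d_model, 2048),
                              torch.nn.ReLU(),
                              torch.nn.Linear(2048, d_model))
    norm1, norm2 = torch.nn.LayerNorm(d_model), torch.nn.LayerNorm(d_model)

    x = torch.zeros(1, seq_len, d_model)  # embedded and position-encoded IPA input
    attn_out, _ = attn(x, x, x)           # self-attention: q, k, v all come from x
    x = norm1(x + attn_out)               # residual connection + layer normalization
    x = norm2(x + ffn(x))                 # feed-forward sublayer, same pattern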

Decoder
Similarly, word embedding and positional encoding are applied before input to the decoder. Each decoder block consists of a self-attention sublayer, an encoder-decoder attention sublayer, and a feed-forward neural network sublayer; the difference from the encoder is the additional masked multi-head attention. When training the model, the first sublayer of the decoder uses masked multi-head self-attention, so that the input only contains the word information before the current position; this achieves sequential decoding, and the current output can only depend on the part already generated. The encoder-decoder attention layer creates its Q matrix from the output of the self-attention sublayer and obtains its K and V matrices from the output of the encoder. The input to the decoder therefore includes not only the output of the encoder but also the output of the previous decoding step. Finally, at each decoding step, a linear layer and a softmax layer output one Tibetan word. At test time, the first input to the decoder is the start symbol SOS, and the Tibetan output of each step is fed to the bottom decoder at the next time step, until the special end symbol EOS is produced.
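The masking that restricts each position to the words before it can be sketched as follows; future positions receive a score of negative infinity so that the softmax assigns them zero weight.

    import numpy as np

    def causal_mask(seq_len):
        # True marks the future positions that must be hidden from position i
        return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    def masked_scores(scores):
        # scores: (seq_len, seq_len) raw attention scores Q K^T / sqrt(d_k)
        return np.where(causal_mask(scores.shape[0]), -np.inf, scores)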

Experiment
The corpus used to train the language model consists of 200 MB of Tibetan text and the corresponding phonetic transcriptions obtained through transliteration. To test the performance of the speech recognition system when the language model is combined with the acoustic model, we used 10 hours of speech of primary and secondary school Tibetan textbooks read by 1 female and 1 male speaker (channel: mono, sample depth: 16 bit, sampling rate: 16 kHz). We trained the language model on 2 Tesla M60 GPUs with the following hyperparameters: Adam optimizer (learning rate $= 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$), num_heads = 8, num_blocks = 6, hidden_units = 512, dropout = 0.2; a configuration sketch is given after this section. The performance of the Transformer-based language model on the task is reported as WER (Word Error Rate) in the following table:

[Table: WER of the Transformer-based language model for syllable and phoneme modeling units]

We combined the language model with a CNN+CTC acoustic model trained on the speech data. The results on the test dataset for the different modeling units are as follows:

[Table: recognition results on the test set for different modeling units]
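For reference, the stated hyperparameters correspond to a setup like the following PyTorch sketch; the model construction is a placeholder, since the paper's training code is not available.

    import torch

    model = torch.nn.Transformer(d_model=512, nhead=8,
                                 num_encoder_layers=6, num_decoder_layers=6,
                                 dropout=0.2)
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=0.0001, betas=(0.9, 0.98), eps=1e-8)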

Conclusion
The experimental results show that Tibetan speech recognition combining an acoustic model with a language model performs better than end-to-end speech recognition, and that a system using phonemes as the modeling unit outperforms one using syllables. Due to the small scale of the training data, the generalization ability of the model is weak and it cannot yet be applied in practice. There are also remaining problems and room for improvement in the phonetic transcription; for example, Sanskrit-derived Tibetan words that conform to Tibetan grammar rules still cannot be transliterated accurately.