The New Algorithm for Speech Control in the Cockpit

Speech technologies have been developed intensively in recent years, especially automatic speech recognition as an additional input method for human-machine interfaces and technical devices. Most known speech control algorithms have a low probability of correct recognition. Widespread methods such as Markov models and neural networks, which require large processing power, recognize words with a probability of no more than 85–92 %. Such accuracy is not sufficient for voice control on board a modern aircraft. This article is devoted to the problem of improving the accuracy of automatic speech recognition. A version of a word recognition algorithm based on the classical approach of comparison with patterns is suggested. To improve recognition accuracy, a new method of calculating the similarity measure between the word being recognized and the pattern, based on the Fisher z-transformation, is described. The article also presents a modification of the algorithm that takes into account the fixed relations with the patterns of other words and adjusts words to the pattern using elements of dynamic programming. The use of fixed relations between words provides additional information, which positively affects recognition. Experimental results of testing the developed algorithm on a large amount of speech data are presented.


Introduction
In recent years much research has appeared on improving the cockpit interface with modern audio technologies, for example: control of onboard equipment based on automatic speech recognition [1], creation of a surround sound effect to enhance the information content of sound indication [2], and analysis of the characteristics of speech signals in order to ensure safety [3]. The creation of a cockpit voice interface is complicated by many factors: the quality of existing recognition methods [1,4,5], where the percentage of correctly recognized words rarely reaches 90–95 %; the impact of overloads [6,7]; the presence of strong acoustic noise [8,9]; and occupational hearing disorders among helicopter crews [10].

Formulation of the problem
We propose several algorithms that increase the probability of correct recognition, which is critical for voice control on board an aircraft, where any mistake affects flight safety.
Our algorithms are based on the traditional recognition method [5], which compares the parametric portraits of words with patterns. The algorithmic implementation is described in [11]. The conceptual simplicity of this method is useful for evaluating the effectiveness of the proposed new algorithmic solutions. To increase the recognition rate we propose the following new approaches:
- A modified measure of closeness between the parametric portrait of the word being recognized and the patterns. The new measure is the mean of the Fisher z-transformation of the correlation coefficients between the frames of the word and the frames of the pattern.
- Use of information about the fixed relations of the word being recognized not only with the appropriate pattern but also with all other members of the dictionary, which increases the probability of successful recognition.
The initial version of the algorithm, for the simple case of operator-dependent recognition, is shown in [12]. This paper proposes the final version of the algorithm and gives experimental results for operator-independent recognition. The effectiveness of the considered algorithms increases significantly when words are adapted in the time domain [13] using a dynamic programming approach [4,14]. The algorithmic implementation is described in detail in [15].

Recognition based on comparison with standards
The algorithm applies the classical scheme of finding the maximum value of the measure of closeness between the word being recognized and all standards (patterns) of the dictionary. Let us consider the version used in this study. Let there be a recording of the words in the time domain. For a single word such a record has the form

$$x_k, \quad k = 1, 2, \dots, N_k, \qquad (1)$$

where $x_k$ is the amplitude of the microphone signal and $N_k$ is the number of discrete values of the speech signal.
In our case, the sampling frequency is $f = 22050$ Hz, which corresponds to the sampling interval $\tau = 1/f = 1/22050$ s $\approx 0.05$ ms.
We apply to record (1) the spectral-time transformation, which divides the record on the time scale into $N_t$ intervals of duration 20…40 ms. On each interval, using the fast Fourier transform algorithm, a Hann window and averaging over frequencies, we compute $N_f = 30 \dots 40$ logarithms of the signal spectral density values [5,11]. As a result, we get the parametric portrait of a word in the form of a matrix

$$\{x_{ij}\}, \quad i = 1, \dots, N_f, \quad j = 1, \dots, N_t, \qquad (2)$$

where the columns correspond to the $N_t$ quantization intervals in time and the rows contain the values of the $N_f$ frequency components. We denote this matrix by $\{x_{ij}\}$.
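The spectral-time transformation described above can be illustrated with a minimal NumPy sketch. It is not the implementation from [11]: the function name `parametric_portrait`, the non-overlapping frames, the 30 ms frame length, the 32 uniform bands and the small constant that guards the logarithm are all assumptions made for illustration.

```python
import numpy as np

def parametric_portrait(signal, fs=22050, frame_ms=30.0, n_bands=32):
    """Return an (n_bands, n_frames) matrix of log band energies.

    Assumptions (not taken from the paper): non-overlapping frames,
    uniform frequency bands, natural logarithm.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))    # samples per time interval
    n_frames = len(signal) // frame_len               # N_t
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)                    # Hann window
    portrait = np.empty((n_bands, n_frames))
    for j in range(n_frames):
        frame = signal[j * frame_len:(j + 1) * frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum of the frame
        bands = np.array_split(power, n_bands)        # N_f uniform frequency bands
        energy = np.array([b.mean() for b in bands])  # averaging over frequencies
        portrait[:, j] = np.log(energy + 1e-12)       # guard against log(0)
    return portrait
```

Each column of the returned matrix corresponds to one time interval and each row to one frequency component, matching the layout of portrait (2).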
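In our experiments we use both a uniform frequency scale and the Mel scale, which is widespread in the analysis of speech signals [4,5]. A commonly used form of the conversion between the two scales is

$$m = 1127 \ln\left(1 + \frac{f}{700}\right),$$

where $f$ is the sound frequency in Hz and $m$ is the sound frequency on the Mel scale.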
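As a small illustration of the Mel scale, the sketch below converts between Hz and Mel. The constants 1127 and 700 belong to one commonly used variant of the formula and are an assumption here. The function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed variant: 1127 * ln(1 + f / 700))."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)
```

With this variant, 1000 Hz maps to approximately 1000 Mel, and band edges that are equally spaced in Mel become progressively wider in Hz.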
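To create the standard of a word, we take $E$ realizations of this word and calculate the mean of the $E$ parametric portraits:

$$x^{l}_{ij} = \frac{1}{E} \sum_{e=1}^{E} x^{l,e}_{ij}, \qquad (3)$$

where $x^{l,e}_{ij}$ is an element of the portrait of realization $e$ of word $l$. The resulting portrait (3) is taken as the standard of the word. If $M$ words are to be recognized, then using formula (3) we obtain $M$ standards

$$\{x^{l}_{ij}\}, \quad l = 1, \dots, M, \qquad (4)$$

where each standard is a matrix of dimension $N_f \times N_t$. Let it be required to recognize a word received as the time record (1). We then create its parametric portrait (2) and calculate the measure of proximity (distance) between portrait (2) and each of the standards (4). As this measure we choose, for example, the estimate of the correlation coefficient:

$$\hat{r}_{lx} = \frac{\sum_{i=1}^{N_f} \sum_{j=1}^{N_t} (x_{ij} - \bar{x})(x^{l}_{ij} - \bar{x}^{l})}{\sqrt{\sum_{i=1}^{N_f} \sum_{j=1}^{N_t} (x_{ij} - \bar{x})^2 \, \sum_{i=1}^{N_f} \sum_{j=1}^{N_t} (x^{l}_{ij} - \bar{x}^{l})^2}}, \qquad (5)$$

where $x^{l}_{ij}$ is the element of standard number $l$ located at the intersection of row $i$ and column $j$, and $\bar{x}$, $\bar{x}^{l}$ are the means over all elements of the portrait and of the standard, respectively.
The result of recognition is defined by the maximum correlation coefficient among all $M$ standards:

$$\hat{r}_{\max} = \max_{l} \hat{r}_{lx}, \qquad (6)$$

$$l_{\max} = \arg\max_{l} \hat{r}_{lx}, \qquad (7)$$

where $l_{\max}$ is the index corresponding to the maximum correlation coefficient and determining the recognition result.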
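The basic scheme (averaging realizations into a standard, correlating the portrait with every standard, and taking the maximum) can be sketched as follows. The function names are hypothetical, and the sketch assumes that all portraits have already been brought to the same $N_f \times N_t$ shape.

```python
import numpy as np

def build_standard(portraits):
    """Average E portraits of one word into its standard, as in (3)."""
    return np.mean(np.stack(portraits), axis=0)

def correlation(a, b):
    """Pearson correlation estimate over all elements of two equal-shape matrices."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def recognize(portrait, standards):
    """Return the index of the best-matching standard and all M correlations."""
    r = np.array([correlation(portrait, s) for s in standards])
    return int(np.argmax(r)), r
```

Note that `recognize` returns the whole vector of correlations, not only the winner. The traditional scheme discards the non-maximal values after the maximum is found.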
There are other distance measures, such as the Euclidean distance between the matrices of parametric portraits, but experience shows that they give similar results [1]. Therefore, we take this algorithm as the baseline.

Recognition on the basis of comparison with other standards
The traditional recognition scheme considered above is widespread and consists in finding the standard with the maximum measure of closeness to the word being recognized. It reflects the natural process of understanding speech, in which a person chooses the standard most appropriate to the word being recognized and does not care about correlations with other words. This scheme is based on the assumption that the measure of closeness of a word to "its" standard is greater than to the other, "foreign", standards in the dictionary. For machine recognition, in order to increase the probability of correct recognition, we offer other schemes of comparison. It is advisable to use the additional information contained in the fixed relations between different words. Note that such a change in the formulation of the problem should not lead to a significant increase in the amount of calculation, because the recognition scheme discussed above still requires a comparison with all the standards. Let us formulate a new recognition algorithm that uses the results of comparison with all the standards, which in the traditional scheme are discarded after the maximum is located.
Within the basic algorithm (1)–(7) it is advisable to additionally take into account the values of the correlation coefficients between the different words. Let the training database contain $M$ words with $E$ realizations of each. We form $M$ standards in accordance with (4).
Further, according to (5), we calculate the correlation coefficients between the different words of the training database. When recognizing an unknown word $x$, we calculate from formula (5) the estimates of the correlation coefficients of its parametric portrait (2) with all $M$ standards and so obtain a vector of dimension $M$: $\{\hat{r}_{lx}\}$, $l = 1, \dots, M$. In equations (10) and (11) we search for the minimum over all words in the dictionary, $k = 1, 2, \dots, M$.
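The idea of comparing a word's whole correlation vector with the fixed relations of each dictionary word can be sketched as below. Since the exact expressions (8)–(11) are not reproduced here, the sketch makes two loudly labeled assumptions: the fixed relations are stored as a matrix `R[k, l]` holding the mean correlation of the training realizations of word `k` with standard `l`, and the decision rule is the minimum squared Euclidean distance between the word's correlation vector and the rows of `R`. All function names are hypothetical.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation estimate over all elements of two equal-shape matrices."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def relation_matrix(training, standards):
    """R[k, l]: mean correlation of the realizations of word k with standard l.

    ASSUMPTION: this is one plausible way to store the fixed relations.
    """
    M = len(standards)
    R = np.empty((M, M))
    for k in range(M):
        for l in range(M):
            R[k, l] = np.mean([corr(p, standards[l]) for p in training[k]])
    return R

def recognize_by_relations(portrait, standards, R):
    """Pick the word k whose row of fixed relations is closest to the word's vector.

    ASSUMPTION: squared Euclidean distance, minimized over k = 1..M.
    """
    r_x = np.array([corr(portrait, s) for s in standards])  # vector of dimension M
    d = np.sum((R - r_x) ** 2, axis=1)                       # distance to each row of R
    return int(np.argmin(d)), d
```

The extra cost over the traditional scheme is small: the vector `r_x` is computed in both cases, and `R` is computed once from the training base.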

Modified proximity measure
We introduce a modified measure of closeness between the words being recognized and the standards, which significantly increases the probability of correct recognition.
Estimates of correlation coefficients have a specific asymmetric distribution [14], so statistically more stable results can be obtained by substituting into formulas (8)–(11), instead of the correlation coefficients, their Fisher z-transformation, which has an approximately normal distribution:

$$z = \frac{1}{2} \ln \frac{1 + \hat{r}}{1 - \hat{r}}, \qquad (12)$$

where $\hat{r}$ is the estimate of the correlation coefficient. Furthermore, let us calculate the correlation coefficient not over the whole parametric portrait of the word and its standard, as in the base formula (5), but separately for each time interval (frame), followed by averaging of the values. Then, for a partition into $N_t$ frames, we calculate for each frame $j$ of the word being recognized the estimate of the correlation coefficient with frame $j$ of standard $l$, going through all the standards $l = 1, 2, \dots, M$. In this case, instead of expression (5) we apply the formula

$$\hat{r}_{lxj} = \frac{\sum_{i=1}^{N_f} (x_{ij} - \bar{x}_{j})(x^{l}_{ij} - \bar{x}^{l}_{j})}{\sqrt{\sum_{i=1}^{N_f} (x_{ij} - \bar{x}_{j})^2 \, \sum_{i=1}^{N_f} (x^{l}_{ij} - \bar{x}^{l}_{j})^2}}, \qquad (13)$$

where $\hat{r}_{lxj}$ is the estimate of the correlation coefficient between frame $j$ of the word being recognized and frame $j$ of standard $l$, and $\bar{x}_{j}$, $\bar{x}^{l}_{j}$ are the means over the $N_f$ elements of the corresponding frames.
Then we go to the Fisher z-transformation. For this purpose, we substitute the estimate of the correlation coefficient (13) into formula (12):

$$z_{lxj} = \frac{1}{2} \ln \frac{1 + \hat{r}_{lxj}}{1 - \hat{r}_{lxj}}. \qquad (14)$$

As the final measure of closeness we take the average, over all $N_t$ intervals, of the Fisher z-transformation (14) of the correlation coefficients (13) between the frames of the word and of the standard:

$$\bar{z}_{lx} = \frac{1}{N_t} \sum_{j=1}^{N_t} z_{lxj}. \qquad (15)$$

The two proposed algorithms, (8)–(11) and (12)–(15), are applicable in conjunction with the traditional method of pre-adjusting the words being recognized on the length scale, which reduces the effect of variations in pronunciation between different speakers [10]. The method uses dynamic programming [4,14,15].
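The modified measure (per-frame correlation, Fisher z-transformation, averaging over frames) can be sketched in NumPy as follows. The function names are hypothetical. The clipping of the correlation to the open interval (-1, 1) is an added assumption that keeps the transformation finite when a frame matches its standard exactly. The sketch also assumes that no frame is constant, since a constant frame would give a zero denominator.

```python
import numpy as np

def frame_correlations(portrait, standard):
    """Correlation of each frame (column) j of the word with frame j of the standard."""
    x = portrait - portrait.mean(axis=0)   # center each frame over the N_f rows
    s = standard - standard.mean(axis=0)
    num = np.sum(x * s, axis=0)
    den = np.sqrt(np.sum(x ** 2, axis=0) * np.sum(s ** 2, axis=0))
    return num / den                       # vector of length N_t

def mean_fisher_z(portrait, standard, eps=1e-6):
    """Average Fisher z-transformation of the per-frame correlations."""
    # ASSUMPTION: clip r away from +/-1 so that arctanh stays finite.
    r = np.clip(frame_correlations(portrait, standard), -1.0 + eps, 1.0 - eps)
    z = np.arctanh(r)                      # equals 0.5 * ln((1 + r) / (1 - r))
    return float(z.mean())

def recognize_z(portrait, standards):
    """Return the index of the standard with the largest averaged z and all scores."""
    scores = np.array([mean_fisher_z(portrait, s) for s in standards])
    return int(np.argmax(scores)), scores
```

Because `np.arctanh` stretches values near 1, frames that match the standard very closely contribute more strongly to the average than they would with a plain mean of correlation coefficients.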

Experimental evaluation of the effectiveness of the developed algorithms
The proposed algorithms were tested on recordings of seven different speakers and a dictionary of 20 isolated words. Every speaker pronounced 600 words (20 words with 30 realizations of each). The standards were formed from similar recordings of an eighth speaker, so this is the speaker-independent case of recognition with a small training base consisting of only one speaker. The recognition results for all speakers and each algorithm are shown in Table 1, where the algorithm variants are:
1 - basic recognition with the correlation coefficient (6) as the measure of closeness;
2 - the modified proximity measure with the averaged Fisher z-transformation (15) of the correlation coefficient;
3 - the same as variant 2, but with pre-adjustment of the words on the length scale;
4 - comparison with the standards of "foreign" words, together with the modified proximity measure and pre-adjustment of the words on the length scale.
Note that each of the algorithmic approaches listed above reduces the average error rate. For example, the modified proximity measure reduces the number of errors by 4.4%, pre-adjusting the words on the length scale together with the modified measure reduces the error rate by a further 2.3%, and comparison with the "foreign" standards improves the results by another 1.1%. Thus, the combined use of the three algorithms (modified proximity measure, duration adjustment, comparison with "foreign" standards) allows us to recognize correctly 96.5% of all words. Note that in this experiment we used a small training base consisting of only one speaker.

Conclusions
New approaches to improving the quality of voice recognition have been developed:
- a modified measure of closeness between the parametric portrait of the word being recognized and the patterns; the new measure is the mean of the Fisher z-transformation of the correlation coefficients between the frames of the word and the frames of the pattern;
- recognition based on a comparison with the "foreign" standards, using the fixed relations between different words.
Experimental results show the effectiveness of the proposed algorithms, especially when combined with the method for adjusting word length based on dynamic programming.
For a dictionary of 20 isolated words with 30 realizations of each, and for seven speakers in the speaker-independent case, the combined use of the considered algorithms gives up to 96.5% correct recognitions. In this study, the possibility of increasing the quality of recognition by expanding the training base was deliberately not used, in order to better estimate the effectiveness of the proposed algorithms.

Table 1. Recognition results (number of errors, %)