Speech Recognizing for Presentation Tool Navigation Using Back Propagation Artificial Neural Network

The Backpropagation Artificial Neural Network (ANN) is a well-known technique in Artificial Intelligence that has been applied to complex speech recognition problems in health [1], [2], education [4] and engineering [3]. Many kinds of presentation tools are in use today; one popular example is Microsoft PowerPoint. Moving between slides in a presentation tool would be easier if it could be done through speech, that is, the sound produced directly by the user during the presentation. This study uses a research and development approach to create a simulation that uses a Backpropagation ANN to recognize the spoken numbers one to five and navigate the slides of a presentation tool. The Backpropagation ANN consists of one input layer, one hidden layer with 100 neurons and one output layer. The simulation is built using the Neural Network Toolbox of Matlab R2014a. Speech samples were taken from five different people and stored in wav format. The results show that the Backpropagation ANN can be used for navigation through speech, with a 96% accuracy rate based on the network training result. The simulation achieves 63% accuracy on 100 new speech samples from various sources.


Background
The Artificial Neural Network (ANN) is a branch of Artificial Intelligence that has been widely applied and has been shown to solve various complex problems. Backpropagation is one of the popular ANN approaches and is widely applied in speech recognition research, for example, speech recognition for newborn babies [1], speech recognition of cardiac abnormalities [2], speech recognition to control robots [3] and speech recognition for learning a foreign language [4].
Presentation tools usually consist of slides that are displayed on a screen, and the user navigates them with a keyboard or another control device such as a mouse, wireless mouse or wireless pointer. These control devices are intended to help users give presentations. However, using a keyboard or mouse to move between slides still limits the user's movement, because there is a maximum distance between the device and the laptop or computer in use. A wireless pointer, meanwhile, provides only next and previous commands and cannot jump directly to a specific slide that the user wants. Moving between slides would be easier through speech, the sound produced directly by the user during the presentation. The easier the device is to use, the better it supports the presentation process. Therefore, this research aims to create software for presentation tool navigation through speech, using a Backpropagation ANN.

Fast Fourier transform algorithm
The Fast Fourier Transform (FFT) is a method for transforming the speech signal from the time domain into the frequency domain, so that the digitally recorded speech is represented by its frequency spectrum. The FFT algorithm is designed to perform complex multiplications and additions, even though the input data may be real valued [5].
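As an illustration of the transform described above, the following is a minimal sketch of a radix-2 FFT applied to a real-valued signal. It is not the preprocessing code used in this study; the function names, the synthetic test signal and the choice to keep only the non-redundant half of the spectrum are assumptions made for the example.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Complex multiplication and addition, even though the input is real.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitude_spectrum(samples):
    """Magnitudes of the non-redundant half of the spectrum of a real signal."""
    spectrum = fft(samples)
    return [abs(c) for c in spectrum[: len(samples) // 2 + 1]]

# Hypothetical test signal: a sine wave with 4 cycles over 32 samples.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
mags = magnitude_spectrum(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

For this synthetic signal the largest magnitude falls in frequency bin 4, matching the 4 cycles in the input. A speech recording would be processed the same way after trimming or padding it to a power-of-two length.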

Artificial neural network
An artificial neural network is an information-processing system that has certain performance characteristics in common with biological neural networks. Artificial neural networks have been developed as generalizations of mathematical models of human cognition or neural biology [6].
A neural net consists of a large number of simple processing elements called neurons, units, cells or nodes. Each neuron is connected to other neurons by means of directed communication links, each with an associated weight. The weights represent information being used by the net to solve a problem [6]. Each neuron has an internal state, called its activation or activity level, which is a function of the inputs it has received. Typically, a neuron sends its activation as a signal to several other neurons. For example, consider a neuron Y, illustrated in Figure 1, that receives inputs from neurons X_1, X_2, and X_3. The activations (output signals) of these neurons are x_1, x_2, and x_3, respectively. The weights on the connections from X_1, X_2, and X_3 to neuron Y are w_1, w_2, and w_3, respectively. The net input, y_in, to neuron Y is the sum of the weighted signals from neurons X_1, X_2, and X_3, i.e.,
y_in = w_1 x_1 + w_2 x_2 + w_3 x_3.
The activation y of neuron Y is given by some function of its net input, y = f(y_in) [6].
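The computation performed by neuron Y can be sketched in a few lines. The numeric activations and weights below are made-up values for illustration, and the hyperbolic tangent is only one possible choice for the activation function f.

```python
import math

def net_input(activations, weights):
    """Weighted sum of the incoming signals: y_in = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, activations))

# Hypothetical values for x_1, x_2, x_3 and w_1, w_2, w_3.
x = [0.5, -1.0, 2.0]
w = [0.2, 0.4, 0.1]

y_in = net_input(x, w)   # 0.1 - 0.4 + 0.2
y = math.tanh(y_in)      # y = f(y_in), with f assumed to be tanh here
```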

Common activation functions
Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit that is a binary (1 or 0) or bipolar (1 or -1) signal [6].
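A step function of this kind might be written as follows. The threshold parameter and the convention that a net input exactly equal to the threshold maps to 1 are assumptions of this sketch; texts differ on that boundary case.

```python
def binary_step(net, threshold=0.0):
    """Binary step: 1 if the net input reaches the threshold, otherwise 0."""
    return 1 if net >= threshold else 0

def bipolar_step(net, threshold=0.0):
    """Bipolar step: 1 if the net input reaches the threshold, otherwise -1."""
    return 1 if net >= threshold else -1
```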

Binary Sigmoid function (Logsig)
During feedforward, each input unit (X_i) receives an input signal and broadcasts this signal to each of the hidden units Z_1, …, Z_p. Each hidden unit then computes its activation and sends its signal (z_j) to each output unit.
Each output unit (Y_k) computes its activation (y_k) to form the response of the net for the given input pattern [6]. During training, each output unit compares its computed activation y_k with its target value t_k to determine the associated error for that pattern with that unit. Based on this error, the factor δ_k (k = 1, …, m) is computed. δ_k is used to distribute the error at output unit Y_k back to all units in the previous layer (the hidden units that are connected to Y_k). It is also used (later) to update the weights between the output layer and the hidden layer.
After all of the δ factors have been determined, the weights for all layers are adjusted simultaneously. The adjustment to the weight w_jk (from hidden unit Z_j to output unit Y_k) is based on the factor δ_k and the activation z_j of the hidden unit; the adjustment to a weight from input unit X_i to hidden unit Z_j is based on the factor δ_j and the activation x_i of the input unit [6]. Backpropagation learning has emerged as the standard algorithm for the training of multilayer perceptrons, against which other learning algorithms are often benchmarked [7].
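The feedforward pass, the computation of the δ factors and the weight adjustment can be put together in a small sketch. It assumes the binary sigmoid activation in both layers, a bias stored as the last entry of each weight list, a learning rate of 0.5, and arbitrary starting weights; none of these values come from the network trained in this study.

```python
import math

def logsig(x):
    """Binary sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, v, w):
    """v[j] holds the weights into hidden unit j, w[k] those into output k (bias last)."""
    z = [logsig(sum(vj[i] * x[i] for i in range(len(x))) + vj[-1]) for vj in v]
    y = [logsig(sum(wk[j] * z[j] for j in range(len(z))) + wk[-1]) for wk in w]
    return z, y

def train_step(x, t, v, w, lr=0.5):
    """One backpropagation update; returns the squared error before the update."""
    z, y = forward(x, v, w)
    # Output factors: delta_k = (t_k - y_k) * f'(y_in_k), where f' = y_k (1 - y_k).
    delta_k = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(len(y))]
    # Hidden factors: the output errors are distributed back through w_jk.
    delta_j = [sum(delta_k[k] * w[k][j] for k in range(len(y))) * z[j] * (1 - z[j])
               for j in range(len(z))]
    # Weights are adjusted only after every delta factor has been computed.
    for k, wk in enumerate(w):
        for j in range(len(z)):
            wk[j] += lr * delta_k[k] * z[j]
        wk[-1] += lr * delta_k[k]
    for j, vj in enumerate(v):
        for i in range(len(x)):
            vj[i] += lr * delta_j[j] * x[i]
        vj[-1] += lr * delta_j[j]
    return sum((t[k] - y[k]) ** 2 for k in range(len(y)))

# Hypothetical pattern and starting weights: 2 inputs, 2 hidden units, 1 output.
x, t = [0.0, 1.0], [1.0]
v = [[0.1, -0.2, 0.05], [0.3, 0.4, -0.1]]
w = [[0.2, -0.3, 0.1]]
errors = [train_step(x, t, v, w) for _ in range(20)]
```

Repeating the update on the same pattern should make the recorded error shrink, which is the behaviour the δ-based adjustment is meant to produce.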

Research Method
This study uses a research and development approach to build a simulation that recognizes the spoken numbers one, two, three, four and five. Speech samples were taken from five different people and stored in wav format. A total of 75 speech samples are used for network training, 15 for each number.

Feature extraction
In this stage, software for preprocessing is built using FFT, as shown in Figure 6.

Network training
The next stage is to design the software and implement the Backpropagation ANN algorithm to learn the speech patterns. The Backpropagation ANN is treated here as an autoassociative network type, in which the input range is processed by the same network that produces the range of output results. The general development of the Backpropagation ANN can be seen in Figure 7. The overall system of this application follows the path shown in Figure 8.
The Backpropagation ANN consists of one input layer, one hidden layer and one output layer. An activation function is applied between the input layer and the hidden layer, and again between the hidden layer and the output layer. The activation function used in this research is Tansig. A multilayer network learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent [8].
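For reference, Tansig is the hyperbolic tangent sigmoid, with outputs in the range (-1, 1), while Logsig is the binary sigmoid, with outputs in (0, 1). The sketch below writes both in plain Python rather than Matlab, so the function definitions are illustrative re-implementations and not toolbox code.

```python
import math

def logsig(n):
    """Binary sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def tansig(n):
    """Hyperbolic tangent sigmoid: output in (-1, 1), numerically equal to tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0

# The two functions are related by tansig(n) = 2 * logsig(2n) - 1.
sample = 0.7
gap = abs(tansig(sample) - (2.0 * logsig(2.0 * sample) - 1.0))
```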

Results and discussion
The Backpropagation ANN that was built has one hidden layer with 100 neurons. The activation function used is Tansig. There are two ways to decide when the training process stops: limiting the number of iterations or setting a target Mean Square Error (MSE).
The simulation is built using the Neural Network Toolbox of Matlab R2014a. Matlab is an interactive, matrix-based system for scientific and engineering numeric computation and visualization. Its strength lies in the fact that complex numerical problems can be solved easily and in a fraction of the time required with a programming language. The basic Matlab program is further enhanced by the availability of numerous toolboxes [9].
In the simulation, the number of iterations is limited to 1000 and the MSE goal is set to 1.00e-06. Network training results can be seen in Figure 9. Training ran for 23 iterations, with a reported training time of 0 seconds.
Figure 10 shows the confusion matrix. It indicates that the accuracy of the network in determining the output is 96%.
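The two stopping rules can be sketched as a generic training loop. The function train_until and the stand-in epoch function are hypothetical and exist only for this example; only the limits of 1000 iterations and an MSE of 1.00e-06 are taken from the text, and the toolbox may apply further stopping conditions that are not modelled here.

```python
def train_until(step, max_epochs=1000, mse_goal=1e-6):
    """Call step() once per epoch; stop at the MSE goal or at the epoch limit."""
    mse = float("inf")
    for epoch in range(1, max_epochs + 1):
        mse = step()
        if mse <= mse_goal:
            return epoch, mse, "goal"
    return max_epochs, mse, "max_epochs"

# Hypothetical stand-in for one training epoch: the error halves on every call.
state = {"mse": 1.0}

def fake_epoch():
    state["mse"] *= 0.5
    return state["mse"]

epochs, final_mse, reason = train_until(fake_epoch)
```

With this artificial error curve the loop stops on the MSE goal long before the iteration limit is reached; a slower curve would instead end with the reason "max_epochs".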
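An accuracy figure of this kind is obtained from a confusion matrix by dividing the number of correctly classified samples (the diagonal) by the total number of samples. The matrix below is a made-up three-class example, not the matrix in Figure 10.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical counts: rows are true classes, columns are predicted classes.
example = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
acc = accuracy(example)  # 28 correct out of 30
```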
To measure the accuracy of the Backpropagation ANN, it is necessary to test new speech samples using the simulation program shown in Figure 11. This program provides menus to record speech, play speech, save speech and open speech that has been saved. After speech has been recorded, the program displays the result as a number and opens the intended slide in the presentation tool. The speech is recorded by the internal microphone of the PC or laptop. Figure 12 shows part of the Matlab code that instructs the presentation tool to go to the slide indicated by the analysis of the speech input. In Figure 13, the program records a new speech sample and saves it as Sample_a2.wav. This speech instruction is to open slide 2, and the program succeeded in opening slide 2. In Figure 14, the program records another new speech sample and saves it as Sample_a3.wav. This speech instruction is to open slide 3, and the program succeeded in opening slide 3. In total, 100 new speech samples from different sources were tested. Some results of the testing can be seen in Table 1.
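The step from network output to slide number can be sketched as follows, assuming that the network has one output neuron per spoken number and that the largest activation wins. The output encoding is not stated in this paper, so both the encoding and the function name slide_from_outputs are assumptions; the call that actually moves the presentation tool to the slide is omitted.

```python
def slide_from_outputs(outputs):
    """Return the 1-based slide number of the largest output activation."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return best + 1

# Hypothetical output activations for one utterance; the second one is largest.
slide = slide_from_outputs([-0.8, 0.9, -0.7, -0.9, -0.6])
```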

The results show that the simulation recognized 63 speech samples correctly and displayed the proper slide, while 37 speech samples were not recognized properly by the system and led to wrong slides.

Conclusion
Based on the test results and discussion, it can be concluded that the Backpropagation ANN can be used for presentation tool navigation through speech, with a 96% accuracy rate based on the network training result. The simulation achieves 63% accuracy on 100 new speech samples from various sources.