Syllables sound signal classification using multi-layer perceptron in varying number of hidden-layer and hidden-neuron

Processing syllable sound signals remains a challenging task because the signal is non-stationary, speaker-dependent, context-dependent, and dynamic. When classifying with a multi-layer perceptron (MLP), choosing a suitable number of hidden neurons and hidden layers is crucial for obtaining the best classification result. This paper presents a speech signal classification method that uses an MLP with various numbers of hidden layers and hidden neurons to classify Indonesian Consonant-Vowel (CV) syllable signals. Five feature sets were generated using the Discrete Wavelet Transform (DWT), Renyi Entropy, Autoregressive Power Spectral Density (AR-PSD), and statistical methods. Each syllable was segmented to a fixed length to form a CV unit. The results show that the average recognition rates of the combined WRPSDS feature set with 1, 2, and 3 hidden layers were 74.17%, 69.17%, and 63.03%, respectively.


Introduction
One of the main goals in speech recognition is to obtain the best accuracy when recognizing or classifying the speech signal uttered by a speaker. Speech recognition technology is currently used in many applications, such as smart phones and security systems. However, these systems still have difficulty distinguishing syllables or words that sound similar. The common stages of a speech recognition system are pre-processing, feature extraction, and recognition. In the recognition stage, selecting suitable parameters for the classifier is crucial for an optimal classification result.
Previous researchers have applied many classification methods to speech signals, including the Hidden Markov Model (HMM), Support Vector Machines (SVM), the Gaussian Mixture Model (GMM), and the Multi-layer Perceptron (MLP) or Artificial Neural Network (ANN) [1-8]. An MLP or ANN is trained by running a learning algorithm until it converges. Its performance depends on the weights and on the ability of the underlying network to implement the desired function with a sufficient number of hidden neurons. Generally, a hidden layer is not needed when the data set contains only a small number of samples. However, there is no proven or generally accepted theory for determining the number of neurons in a hidden layer for function approximation, even though this number influences the behaviour of the network.
Several previous studies have used an MLP classifier for speech classification [2,3,8,9]. In [5], three discrete wavelet families (db, sym, and coif) with different numbers of coefficients were evaluated with two classifiers (GMM and MLP). In terms of computation time, the MLP performed better than the GMM [5]. In [8], an MLP was used to classify Hindi CV syllables. In [3], an MLP was used to classify syllables using a combination of Discrete Wavelet Transform (DWT) and statistical features, with Haar, Coiflet, and Daubechies as mother wavelets. The experimental results showed that Daubechies was the most effective mother wavelet compared with Haar and Coiflet. In [2], an MLP was used to classify syllable sounds using DWT, Renyi Entropy (RE), and Autoregressive Power Spectral Density (AR-PSD) features.
This paper presents the classification of Indonesian CV syllable sound signals using an MLP with a varying number of hidden neurons, with features generated by DWT, RE, AR-PSD, and statistical methods [10]. Five feature sets are examined in this study. Feature Set 1 is the combination of DWT and statistical features (WS). The wavelet used is db2 at the 7th level of decomposition [11]. Feature Set 2 consists of the RE features. Feature Set 3 is the combination of AR-PSD and statistical features in the frequency and time domains (PSDS) [2]. Feature Set 4 is the combination of AR-PSD and RE features after selection with the Correlation-based Feature Selection (CFS) method (RPSDS). Feature Set 5 is the combination of WS, RE, and PSDS (WRPSDS).
MATEC Web of Conferences 154, 03015 (2018) https://doi.org/10.1051/matecconf/201815403015 ICET4SD 2017
A database of 360 CV syllable utterances taken from six different speakers was created. It is formed by four Indonesian consonants (/k, g, l, r/), each followed by one of three vowels (/a, i, u/). The consonants /g/ and /k/ are velar stops [1,12], while /l/ and /r/ are alveolar. The recordings were sampled at 8 kHz with 16 bits per sample, mono.
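The composition of the database can be checked with a few lines of arithmetic. The sketch below enumerates the consonant-vowel classes named in the text. It assumes, which the text does not state, that the 360 utterances are spread evenly over classes and speakers.

```python
from itertools import product

# Consonants and vowels as listed in the text.
CONSONANTS = ["k", "g", "l", "r"]
VOWELS = ["a", "i", "u"]

# Every consonant-vowel pairing is one syllable class.
SYLLABLES = [c + v for c, v in product(CONSONANTS, VOWELS)]

N_UTTERANCES = 360  # total database size (from the text)
N_SPEAKERS = 6      # number of speakers (from the text)

# Assumption: utterances are balanced across classes and speakers.
per_class = N_UTTERANCES // len(SYLLABLES)
per_speaker_per_class = per_class // N_SPEAKERS

print(len(SYLLABLES), per_class, per_speaker_per_class)
```

Under that balance assumption each of the twelve classes has 30 utterances, i.e. five repetitions per speaker, which matches the twelve output neurons used later by the classifier.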
After recording, the next step was cropping, which is essentially the application of a rectangular window to the signal. An earlier acoustic study (Sharma et al. 2013) found that the duration covering all relevant acoustic parameters was about 60 ms [1]. Therefore, each CV unit in this study was cropped manually to about 60 ms, starting from the release burst of the consonant and ending in the steady state of the following vowel. The significant event regions of the CV unit /ka/ are shown in Fig. 1a [13]. The next phase was peak normalization. Peak normalization removes the variation in signal magnitude caused by differences in recording conditions, such as speaker distance and loudness [14].
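The cropping and peak-normalization steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `onset_index` argument (the manually located release burst) are hypothetical, and the signal is assumed to be a plain list of samples at 8 kHz.

```python
def crop_cv_unit(signal, onset_index, fs=8000, duration_ms=60):
    """Cut a fixed-length segment (rectangular window) starting at onset_index."""
    n_samples = int(round(fs * duration_ms / 1000))  # 60 ms at 8 kHz -> 480 samples
    return signal[onset_index:onset_index + n_samples]


def peak_normalize(signal):
    """Scale the signal so that its largest absolute sample equals 1."""
    peak = max(abs(s) for s in signal)
    if peak == 0:
        return list(signal)  # silent segment: nothing to scale
    return [s / peak for s in signal]


# Synthetic stand-in for a recorded utterance (not real speech data).
recording = [0.5 * ((-1) ** n) * (n % 7) for n in range(1000)]
cv_unit = peak_normalize(crop_cv_unit(recording, onset_index=100))
```

At the stated sampling rate, a 60 ms unit corresponds to 480 samples, so every CV unit has the same length before feature extraction.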

Fig 1a. Example of syllable /ka/ and its significant event regions

Wavelet
The wavelet transform (WT) is a signal processing method that decomposes a signal into several frequency bands using a low-pass filter and a high-pass filter. In this part, features were extracted using the DWT at the 7th level of decomposition. In each decomposition step of the DWT, only the lower frequency band, also called the approximation, is decomposed further. With a sampling frequency of 8 kHz, decomposing to the 7th level gives a highest frequency band of 2000-4000 Hz and a lowest frequency band of 0-31.25 Hz. Further decomposition levels would not noticeably improve the recognition rate, because a very low frequency band carries no discriminatory information [15].
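The band limits quoted above follow from halving the band at every level. The sketch below computes the nominal (ideal-filter) band edges for a dyadic decomposition. The function name and the `D`/`A` labels for detail and approximation bands are illustrative choices, not taken from the paper.

```python
def dwt_band_edges(fs, levels):
    """Nominal frequency bands (Hz) of a dyadic DWT, assuming ideal half-band filters."""
    bands = {}
    upper = fs / 2.0  # Nyquist frequency
    for level in range(1, levels + 1):
        lower = upper / 2.0
        bands[f"D{level}"] = (lower, upper)  # detail band produced at this level
        upper = lower                        # only the approximation is split again
    bands[f"A{levels}"] = (0.0, upper)       # final approximation band
    return bands


bands = dwt_band_edges(fs=8000, levels=7)
for name, (lo, hi) in bands.items():
    print(f"{name}: {lo:g}-{hi:g} Hz")
```

Real wavelet filters such as db2 are not ideal brick-wall filters, so neighbouring bands overlap somewhat in practice. The edges are therefore only nominal.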
In the DWT, selecting a suitable mother wavelet is crucial for an optimal classification result. Previous research [1,3,11] found that Daubechies 2 (db2) is one of the most effective mother wavelets. The Daubechies wavelet of class D-2N can be written as:
\[ \psi(x) = \sqrt{2} \sum_{k=0}^{2N-1} (-1)^{k} \, h_{2N-1-k} \, \varphi(2x - k) \]
where h_0, …, h_{2N-1} ∈ ℝ are the constant filter coefficients satisfying the required conditions, and φ is the (Daubechies) scaling function. The DWT produced a signal in the frequency domain. The moving-average feature was calculated over every twenty samples of the signal magnitude, up to the last sample. As an additional feature, statistics of the frequency-domain signal were calculated [3].
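For db2 (N = 2) the filter has four coefficients with a well-known closed form. The sketch below builds them, derives the high-pass (wavelet) filter with the standard alternating-flip relation g_k = (-1)^k h_{2N-1-k}, and checks the usual orthonormality conditions numerically. The variable names are illustrative, and the specific conditions checked are the textbook ones, since the paper does not list them.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Low-pass (scaling) filter h_0..h_3 of db2: N = 2, so 2N = 4 taps.
h = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM,
     (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]

# High-pass (wavelet) filter via the alternating flip: g_k = (-1)^k * h_{2N-1-k}.
g = [((-1) ** k) * h[len(h) - 1 - k] for k in range(len(h))]

sum_h = sum(h)                                    # expected: sqrt(2)
energy_h = sum(c * c for c in h)                  # expected: 1 (unit energy)
shift_orth = h[0] * h[2] + h[1] * h[3]            # expected: 0 (orthogonal to its 2-shift)
sum_g = sum(g)                                    # expected: 0 (zeroth vanishing moment)
moment1_g = sum(k * g[k] for k in range(len(g)))  # expected: 0 (first vanishing moment)

print(sum_h, energy_h, shift_orth, sum_g, moment1_g)
```

The two vanishing moments of the wavelet filter are what the "2" in db2 refers to.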

Renyi Entropy
The Renyi entropy (RE) is a generalization of the Shannon entropy, the collision entropy, the Hartley entropy, and the min entropy. The generalized entropy of order α for a discrete variable X is defined as:
\[ H_{\alpha}(X) = \frac{1}{1-\alpha} \log \left( \sum_{i=1}^{n} p_i^{\alpha} \right) \]
where p_i is the probability that X takes the possible outcome o_i, with outcomes o_1, o_2, ..., o_n. The order of the entropy, α, is constrained to α ≠ 1. In the limiting case α → 1, it converges to the Shannon entropy [16], [17].
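A small numerical sketch makes the special cases concrete. It assumes the standard definition H_α = log(Σ p_i^α) / (1 - α) with natural logarithms and takes a ready-made probability vector as input. How the probabilities are estimated from a speech segment is not described in the text and is left out.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha >= 0; alpha = 1 is handled as the Shannon limit."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if abs(alpha - 1.0) < 1e-12:
        return shannon_entropy(p)
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)


p = [0.5, 0.25, 0.25]
hartley = renyi_entropy(p, 0.0)       # log of the number of possible outcomes
near_shannon = renyi_entropy(p, 1.0001)
collision = renyi_entropy(p, 2.0)     # -log of the sum of squared probabilities
```

For a uniform distribution all orders give the same value, log n. For any other distribution the entropy decreases as α grows.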

Autoregressive Power Spectral Density (AR-PSD)
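In this study, the PSD was estimated using the Yule-Walker AR algorithm. The AR model of order P can be defined as:
\[ x(n) = -\sum_{k=1}^{P} a_k \, x(n-k) + e(n) \]
where a_k are the AR coefficients and e(n) is the driving noise input. Then, using a 256-point xpp(t) with a Hamming window, the AR-PSD estimate can be formulated as follows:
\[ P_{AR}(f) = T \sum_{m=-\infty}^{\infty} r_{xx}(m) \, e^{-j 2 \pi f m T} = \frac{T \sigma^{2}}{\left| 1 + \sum_{k=1}^{P} a_k \, e^{-j 2 \pi f k T} \right|^{2}} \]
where r_xx is the autocorrelation sequence extrapolated by the AR model from the biased autocorrelation estimates of the data series, T is the sampling period, and σ² is the variance of the driving noise input.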
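The Yule-Walker approach can be illustrated with a short pure-Python sketch. It assumes the sign convention x(n) = -Σ a_k x(n-k) + e(n), solves the Yule-Walker equations with the Levinson-Durbin recursion on the biased autocorrelation, and evaluates T·σ² / |1 + Σ a_k e^{-j2πfkT}|². The function names are illustrative, and the model order used in the paper is not stated, so order 2 is an arbitrary choice.

```python
import cmath
import math


def hamming(n):
    """n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def biased_autocorr(x, max_lag):
    """Biased autocorrelation estimate r(0..max_lag), normalised by len(x)."""
    n = len(x)
    return [sum(x[i] * x[i + m] for i in range(n - m)) / n
            for m in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns (a, sigma2) with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + refl * a[m - k]
        new_a[m] = refl
        a = new_a
        err *= 1.0 - refl * refl               # updated prediction-error variance
    return a, err


def ar_psd(a, sigma2, freqs_hz, fs):
    """AR power spectral density at the given frequencies."""
    period = 1.0 / fs
    out = []
    for f in freqs_hz:
        denom = sum(a[k] * cmath.exp(-2j * math.pi * f * k * period)
                    for k in range(len(a)))
        out.append(period * sigma2 / abs(denom) ** 2)
    return out


# Synthetic 256-point frame (a 500 Hz tone), not real speech data.
fs = 8000
window = hamming(256)
frame = [math.sin(2 * math.pi * 500 * n / fs) * window[n] for n in range(256)]
a, sigma2 = levinson_durbin(biased_autocorr(frame, 2), order=2)
psd = ar_psd(a, sigma2, [0.0, 500.0, 4000.0], fs)
```

Using the biased autocorrelation estimate keeps the autocorrelation matrix positive semi-definite, so the prediction-error variance stays non-negative during the recursion.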

Single Hidden Layer in Multi-layer Perceptron (MLP)
When classifying with an MLP, selecting suitable parameters and a suitable architecture is crucial for an optimal classification result [18], [19]. The architecture used in this section consists of three layers: an input layer, a hidden layer, and an output layer. The input layer receives the features of one feature extraction method (WS, RE, PSDS, RPSDS, or WRPSDS). The hidden layer consists of 1-20 hidden neurons. The output layer consists of twelve neurons, which represent the classification result for the syllables. To estimate the reliability of the classification results, the data were verified. The verification technique used on the test set was k-fold cross validation or the hold-out method [20].
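A k-fold split for a data set of this shape can be sketched in a few lines. The version below is stratified and unshuffled, both of which are assumptions: the text does not say whether the folds were balanced per class or how samples were ordered. The labels are placeholders for the twelve syllable classes with 30 utterances each.

```python
def stratified_kfold(labels, k=10):
    """Yield (train_indices, test_indices) with each class spread evenly over folds."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        # Deal the samples of each class to the folds in round-robin order.
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in range(k) if f != i for j in folds[f])
        yield train, test


# Placeholder labels: 12 classes x 30 utterances = 360 samples.
labels = [c for c in range(12) for _ in range(30)]
splits = list(stratified_kfold(labels, k=10))
```

With k = 10, each fold holds 36 test samples (three per class) and 324 training samples, and every sample is tested exactly once. The reported accuracy is then the average over the folds.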

Two Hidden Layers in MLP
In this experiment, two hidden layers were used. In the previous experiment with a single hidden layer, the best result for WRPSDS was obtained with 55 nodes. Therefore, a 55-55 node architecture was used in this part.

Three Hidden Layers in MLP
In this part of the experiment, three hidden layers were used. Following the recommendation of previous research [9], the node architecture for the three hidden layers was 20-20-15.
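To compare the size of the three architectures, the number of trainable parameters can be counted. The sketch below assumes fully connected layers with one bias per neuron, 62 input features (the WRPSDS count reported in the results), and twelve output neurons. The helper name is illustrative.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


N_FEATURES = 62  # WRPSDS feature count reported in the text
N_CLASSES = 12   # one output neuron per syllable

architectures = {
    "1 hidden layer (55)": [N_FEATURES, 55, N_CLASSES],
    "2 hidden layers (55-55)": [N_FEATURES, 55, 55, N_CLASSES],
    "3 hidden layers (20-20-15)": [N_FEATURES, 20, 20, 15, N_CLASSES],
}
param_counts = {name: count_parameters(sizes)
                for name, sizes in architectures.items()}

for name, count in param_counts.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions the three-hidden-layer network is the smallest of the three in parameter count, so depth and size vary together across the compared architectures.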

Result and Discussion
In this study, the numbers of features generated by WS, RE, PSDS, RPSDS, and WRPSDS were twenty-nine, twenty, thirteen, nineteen, and sixty-two, respectively. After feature extraction, the next phase was classification using an MLP with backpropagation (MLP-BP). The learning rate and the momentum were 0.3 and 0.2, respectively. In the previous study [14], the experiment was carried out for 0 to 20 hidden neurons. In this study, the experiment was continued for 20-60 hidden neurons, as shown in Fig. 2.
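The role of the two training parameters can be illustrated with a single weight-update step. The sketch assumes the classic gradient-descent-with-momentum rule, Δw(t) = -η·∂E/∂w + μ·Δw(t-1), with η = 0.3 and μ = 0.2 as stated above. The exact update rule of the training tool is not given in the text, and the gradient values are made up for illustration.

```python
LEARNING_RATE = 0.3  # eta, from the text
MOMENTUM = 0.2       # mu, from the text


def momentum_step(weights, grads, prev_deltas,
                  lr=LEARNING_RATE, momentum=MOMENTUM):
    """One update: new delta = -lr * gradient + momentum * previous delta."""
    deltas = [-lr * g + momentum * d for g, d in zip(grads, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Two consecutive steps with the same (made-up) gradient.
weights = [1.0, -2.0]
grads = [0.5, -1.0]
weights1, deltas1 = momentum_step(weights, grads, [0.0, 0.0])
weights2, deltas2 = momentum_step(weights1, grads, deltas1)
```

When the gradient keeps the same sign, the momentum term enlarges the second step by 20% of the first, which speeds up movement along consistent directions.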
Figure 2 shows the percentage accuracy for WS, RE, PSDS, RPSDS, and WRPSDS for varying numbers of hidden neurons. WRPSDS has the highest average accuracy, but at the ninth hidden neuron the score of WS is higher than that of WRPSDS. This indicates that the number of hidden neurons influences the classification result, but accuracy does not increase linearly with it. Table 1 shows the percentage classification scores using ten-fold cross validation (Ten-FCV) for 1, 2, and 3 hidden layers. The average recognition rates with WRPSDS for 1, 2, and 3 hidden layers were 74.17%, 69.17%, and 63.03%, respectively. This indicates that the MLP architecture with 1 hidden layer (55 nodes) classifies better than the architectures with 2 hidden layers (55-55 nodes) and 3 hidden layers (20-20-15 nodes).
For the vowel /a/, the highest score was 72.5%, obtained with the WRPSDS features. For the vowel /i/, the highest classification score was 70%. For the vowel /u/, the highest score was 79.99%, which was the highest among the three vowels. Some feature sets (RE, PSDS, RPSDS) performed better with the 2-hidden-layer architecture, but overall the 1-hidden-layer architecture was still better.

Conclusion
In this paper, the classification of Indonesian syllable sounds using a multi-layer perceptron with varying numbers of hidden layers and hidden neurons, combined with Wavelet, Renyi entropy (RE), and AR-PSD features, was proposed and implemented. Based on the experimental results presented in this paper, it can be concluded that the MLP architecture with 1 hidden layer (55 nodes) combined with WRPSDS gives a better classification score than the architectures with 2 hidden layers (55-55 nodes) and 3 hidden layers (20-20-15 nodes), as shown by accuracies of 74.17%, 69.17%, and 63.03%, respectively. Some feature sets, namely RE, PSDS, and RPSDS, performed better with the 2-hidden-layer architecture, but overall the 1-hidden-layer architecture was still better. Recommended future work for this research is to use a bigger syllable dataset, to apply the method to Indonesian stop consonants or other places of articulation (such as labial and dental), to use different combinations of feature extraction techniques, and to use a different testing procedure for the classification process.

Fig 2. Accuracy for each feature extraction method with various number of hidden neuron in MLP

Table 1. Accuracy for each feature extraction method with various numbers of hidden layers in MLP