Comparative Study of Machine Learning Approach on Malay Translated Hadith Text Classification based on Sanad

Sanad is one of the important parts used to determine the authenticity of a hadith. However, very little research has been found on the classification of Malay translated Hadith based on sanad. Some studies have applied machine learning to hadith classification based on sanad, but with different objectives and in different languages. This research examines how machine learning techniques can be used to classify Malay translated Hadith documents based on sanad. In this paper, Support Vector Machine (SVM), Naive Bayes (NB) and k-Nearest Neighbor (k-NN) classifiers are applied to Malay translated hadith based on sanad and their performance is compared. Performance is evaluated using accuracy and response time, two metrics commonly used in text classification. The results show that SVM has the highest accuracy and k-NN has the best response time (the time taken to classify the data) compared to the other classifiers. In future work, we plan to extend this paper with an analysis of interclass similarity and to test on a larger dataset.


Introduction
Sunnah (Hadith) is the second fundamental source in Islam after the Qur'an [1][2] and serves as a reference for Muslims in all activities of their lives [3]. According to [2], hadith relate the actions and sayings of the Prophet Muhammad as transmitted by trustworthy narrators. Hadith is essential for understanding the Qur'an and Islamic jurisprudence [4]. However, [5] notes that most academics in computer science have overlooked hadith compared to the Qur'an. A hadith has two main components, known as the Isnad or Sanad (the chain of narrators) and the Matn (the actual narrative or main text) [4][5][6]. The sanad consists of a chronological list of narrators, each stating the one from whom they heard the hadith, all the way back to the main narrator of the matn, and is followed by the matn itself [4]. Sanad is essential in every hadith: it is used in the first step of checking the authenticity of a hadith [2][5][7]. To date, there is a need for automatic classification of Malay translated hadith [2] based on sanad.
Machine learning is a broad area of artificial intelligence focused on the design and development of algorithms that identify and learn patterns in the data provided as input [8]. Text classification is an important problem that attracts many researchers in machine learning and information retrieval [6]. The author of [8] also describes text classification as a key problem within information retrieval that is influenced by machine learning. However, very little research has been found on classifying Malay translated Hadith based on sanad. Classification of hadith based on sanad has been done by [6][1][4], with different objectives and in different languages. Work on Malay translated hadith includes a review study by [9] to identify the authenticity of narrators' names, an improved Malay hadith retrieval system by [2], and retrieval of Malay hadith text through a mobile application by [3]. In this novel approach, machine learning techniques are used to classify Malay translated Hadith documents based on sanad.
This paper is structured as follows. Section I introduces Hadith and how the machine learning approach fits into the picture. Section II gives background information on machine learning and Malay translated hadith. Section III describes the approach used in this paper. Section IV discusses the results, and Section V contains the conclusion of the paper.

Machine learning approach (Text classification)
In machine learning, there are three basic types of algorithm: (1) supervised, (2) semi-supervised and (3) unsupervised learning, as shown in Figure 1. Supervised learning requires learning a function from training data provided as input. In text classification, the training data consist of document-class pairs that give the proper class for each document, as assigned by human specialists. Unsupervised learning differs from supervised learning in that no labelled training data are provided. Semi-supervised learning combines a large amount of unlabelled data with a small amount of labelled data [8]. The main purpose of text classification is to assign documents to a fixed number of predefined categories. Each document can be assigned to multiple categories, exactly one category, or no category at all. Document classification is regarded as a supervised learning task because the aim is to use machine learning to automatically classify documents into categories based on previously labelled documents [10]. Based on [10], the author also [...] [12] in automatic text classification. This paper focuses on supervised classification using the three most popular techniques as stated by [10][11][12], shown in Figure 2.

Support Vector Machine
The Support Vector Machine (SVM) is a classification technique that was introduced by Vapnik and first applied to text classification by Joachims [8][13][14]. It is a supervised learning algorithm that analyses data and identifies patterns used for classification [15]. The core idea of SVM is to determine the most appropriate boundary, a separating hyperplane [14]. From the training data, SVM constructs a hyperplane that separates the data into two categories (positive and negative), and each item to be classified is placed in one of these two categories [13][15]. According to [13], SVM is a powerful classifier based on the structural risk minimisation principle. SVM is also popular in text classification, where it has been reported to perform better than other methods [14][16][17].
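The idea of a separating hyperplane can be illustrated with a minimal sketch. It is not the implementation used in this paper, which does not state its toolkit or kernel. The sketch assumes a linear SVM trained by subgradient descent on the regularised hinge loss; the feature vectors, the function names and the parameter values are all hypothetical and chosen only for illustration.

```python
# Toy binary feature vectors: one column per hypothetical narrator token.
# Label +1 stands for Sahih Bukhari (SB), -1 for Sunan At-Termizi (ST).
rows = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 1],
]
targets = [1, 1, -1, -1]


def decision_value(weights, bias, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(rows, targets, epochs=200, learning_rate=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss (assumed setup)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            margin = y * decision_value(weights, bias, x)
            # Regularisation step: shrink the weights slightly.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:
                # Point is misclassified or inside the margin: move towards it.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict_svm(model, x):
    """Return +1 (positive category) or -1 (negative category)."""
    weights, bias = model
    return 1 if decision_value(weights, bias, x) >= 0 else -1


model = train_linear_svm(rows, targets)
predictions = [predict_svm(model, x) for x in rows]
```

The two categories map directly onto the positive and negative sides of the hyperplane, which is why a two-class task such as SB versus ST fits the basic SVM formulation without extension.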

Naive Bayes
Naive Bayes is a classification technique that uses a statistical approach based on conditional probabilities for pattern recognition problems [16]. It applies Bayes' theorem [15][16][18] with strong independence assumptions. The classifier is called "naive" because the algorithm assumes that all terms occur independently of each other [19]. Under this assumption, features neither depend on nor affect one another in the classification task. Although this assumption severely limits its applicability, it makes the computation of the Bayesian classification approach more efficient. The classifier can be trained efficiently, and its parameters estimated, without requiring a large amount of training data. Naive Bayes classifiers often work much better in complex real-world situations than one might expect given their apparently oversimplified assumptions. Under some specific conditions, the naive Bayes classifier has been reported to perform surprisingly well in many real-world classification applications [18].
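A minimal sketch of how the independence assumption turns classification into a product of per-term probabilities is given below. It assumes a multinomial model with add-one (Laplace) smoothing, which the paper does not specify; the narrator tokens are hypothetical placeholders rather than names from the dataset, and the function names are illustrative only.

```python
import math
from collections import Counter, defaultdict

# Toy training documents: (class label, sanad tokens). Tokens are placeholders.
training = [
    ("SB", ["narrator_a", "narrator_b", "narrator_c"]),
    ("SB", ["narrator_a", "narrator_d"]),
    ("ST", ["narrator_e", "narrator_f", "narrator_c"]),
    ("ST", ["narrator_e", "narrator_g"]),
]


def train_nb(docs):
    """Estimate class priors and per-class token counts from labelled documents."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    priors = {label: n / len(docs) for label, n in class_counts.items()}
    return priors, token_counts, vocabulary


def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    priors, token_counts, vocabulary = model
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(token_counts[label].values())
        score = math.log(prior)
        for token in tokens:
            # Independence assumption: each token contributes its own factor.
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            p = (token_counts[label][token] + 1) / (total + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(training)
print(predict_nb(model, ["narrator_a", "narrator_c"]))  # prints SB
print(predict_nb(model, ["narrator_e"]))                # prints ST
```

Training only counts tokens per class, which is why the parameters can be estimated efficiently from a small labelled set.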

K-Nearest Neighbor
The K-Nearest Neighbor (k-NN) classifier is an on-demand (lazy) classifier: classification is done only at the moment a new document is given to the classifier. The classification decision is computed from the classes of the k "nearest" neighbours of the new document, using a distance function in a predefined metric space. This is accomplished as follows: a) determine the k nearest neighbours of the new document in a given training set of documents; b) use the classes of these nearest neighbours to determine a class for the new document. The algorithm concentrates on specific features of the document to be classified [8]. It is effective and easy to implement [18] and is among the most widely accepted algorithms for pattern recognition [16].
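Steps a) and b) can be sketched in a few lines. The paper does not name the distance function or the value of k, so this sketch assumes Jaccard distance between sets of sanad tokens and k = 3; the tokens are hypothetical placeholders and the function names are illustrative.

```python
from collections import Counter

# Toy training set: (class label, sanad tokens). Tokens are placeholders.
training = [
    ("SB", ["narrator_a", "narrator_b", "narrator_c"]),
    ("SB", ["narrator_a", "narrator_d"]),
    ("ST", ["narrator_e", "narrator_f", "narrator_c"]),
    ("ST", ["narrator_e", "narrator_g"]),
]


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; 0 means identical token sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_predict(training, tokens, k=3):
    # Step a): find the k training documents closest to the new document.
    neighbours = sorted(
        training, key=lambda item: jaccard_distance(item[1], tokens)
    )[:k]
    # Step b): take a majority vote over the neighbours' classes.
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(training, ["narrator_a", "narrator_b"]))  # prints SB
print(knn_predict(training, ["narrator_e", "narrator_f"]))  # prints ST
```

No model is built in advance: all the work happens inside `knn_predict`, which is what makes the classifier lazy.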

Malay translated hadith
The text of Malay hadith comes from translated versions of Arabic hadith [9]. According to [7], six books contain reliable collections of Hadith text: Bukhari, Muslim, Abu Dawud, Termizi, Nasai and Ibn Majah. At first, writing down the narration of Hadith was prohibited for religious reasons. However, after the death of the Prophet, Umar Ibn Abd al-Aziz initiated the project of writing down Hadith to guarantee the integrity and uniformity of the text, fearing that some hadiths might be lost [6].

Methodology
Figure 3 shows the text classification process in this research, which follows the framework in [8]. A Malay translated hadith dataset is used for this experiment. The dataset contains only the sanad; the matn is removed.

Document in test set (sanad)
The data used in the experiment come from the Lidwa Pusaka [20] website for the Sahih Bukhari hadith dataset and the Mutiara Hadis [21] website for the Sunan At-Termizi hadith dataset. 100 hadiths are chosen randomly for this experiment and divided into two categories: 50 hadiths from Sahih Bukhari and another 50 from Sunan At-Termizi. The data are labelled manually, and the labels belong to two classes: 1. Sahih Bukhari (SB) and 2. Sunan At-Termizi (ST). Figure 4 shows a sample of sanad from Sahih Bukhari (SB).
Figure 5 shows a sample of sanad from Sunan At-Termizi (ST).

Document representations
The dataset is represented as in Figure 6. The rows refer to the hadith documents: 100 hadiths from Sahih Bukhari and Sunan At-Termizi. The columns refer to the sanad data, for which 272 entries are used. The documents are assigned to two classes: Sahih Bukhari (SB) and Sunan At-Termizi (ST).
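A row-by-column representation of this kind can be sketched as follows. The sketch assumes a binary encoding in which each column marks whether a given sanad token appears in a document; the paper does not state the exact encoding, and the tokens and the `build_matrix` helper are hypothetical.

```python
# Toy labelled documents: (class label, sanad tokens). Tokens are placeholders,
# not names taken from the actual dataset.
documents = [
    ("SB", ["narrator_a", "narrator_b", "narrator_c"]),
    ("SB", ["narrator_a", "narrator_d"]),
    ("ST", ["narrator_e", "narrator_f", "narrator_c"]),
    ("ST", ["narrator_e", "narrator_g"]),
]


def build_matrix(docs):
    """Return (vocabulary, rows, labels) for a binary document-term matrix."""
    vocabulary = sorted({token for _, tokens in docs for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    rows, labels = [], []
    for label, tokens in docs:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[index[token]] = 1  # 1 = token appears in this document's sanad
        rows.append(row)
        labels.append(label)
    return vocabulary, rows, labels


vocabulary, rows, labels = build_matrix(documents)
for label, row in zip(labels, rows):
    print(label, row)
```

With the real data, the same layout would give 100 rows, one per hadith, and one column per distinct sanad entry.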

Conclusions
The experimental results on Malay translated Hadith documents based on sanad show that the SVM classifier gives the best accuracy and the k-NN classifier gives the best response time. However, this paper only covers the accuracy and response time of the classifiers used, without a detailed analysis of interclass similarity. In future work, we plan to extend this paper with an analysis of interclass similarity and to test on a larger dataset.

Figure 8. Response time (seconds)
Figure 8 shows the response times in seconds: 0.16 s for SVM, 0.01 s for NB and 0.00 s for k-NN. k-NN shows the best response time (the time taken to classify the data) compared to the other classifiers.
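The two metrics can be computed together, as in the sketch below. The paper does not describe how response time was measured, so timing the classification calls with Python's `time.perf_counter` is an assumption; `evaluate`, `rule_classifier` and the toy test documents are hypothetical and stand in for a real classifier and test set.

```python
import time

# Toy test set: (true label, sanad tokens). Tokens are placeholders.
test_docs = [
    ("SB", ["narrator_a", "narrator_b"]),
    ("SB", ["narrator_c", "narrator_d"]),
    ("ST", ["narrator_e", "narrator_f"]),
    ("ST", ["narrator_e", "narrator_g"]),
]


def rule_classifier(tokens):
    """Stand-in classifier used only to exercise the evaluation code."""
    return "SB" if "narrator_a" in tokens else "ST"


def evaluate(predict, docs):
    """Return (accuracy, elapsed seconds) for a predict(tokens) -> label function."""
    start = time.perf_counter()
    predictions = [predict(tokens) for _, tokens in docs]
    elapsed = time.perf_counter() - start  # time spent classifying only
    correct = sum(1 for (label, _), p in zip(docs, predictions) if p == label)
    return correct / len(docs), elapsed


accuracy, elapsed = evaluate(rule_classifier, test_docs)
print(f"accuracy = {accuracy:.2f}, response time = {elapsed:.4f} s")
```

A reported value of 0.00 s reflects rounding to two decimal places, so a higher-resolution timer is useful when classifiers on a small dataset differ by less than 0.01 s.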