Discretization method to optimize logistic regression on classification of student’s cognitive domain

The accuracy of determining a student's class has often received little attention in educational data mining. This paper therefore studies how to improve the performance of a classification method to reach a higher level of accuracy. To that end, we optimize logistic regression using the equal frequency discretization method, testing the student data with three, four, and five intervals. For logistic regression, we implement two regularization types: lasso and ridge. To evaluate the results, we use the random sampling technique and measure four classifier metrics: F1, precision, accuracy, and recall. The experimental results show that this method can be applied to optimize logistic regression. For both logistic regression_lasso and logistic regression_ridge, three intervals achieve the highest accuracy, improving it by about 9% and 9.4%, respectively.


Introduction
Nowadays, research in the education area focuses on process enhancement to improve the educational environment. Researchers have made many efforts to achieve this goal. One example is the implementation of educational data mining [1]. Data mining covers many tasks, for example clustering [2], classification, association analysis, etc. [3].
Specifically, researchers focus on the classification of educational data. In [4], classification with four methods (Decision Tree, Rule Induction, Naïve Bayes, and Neural Network) is used to predict students' academic performance. A decision tree is also explored to predict effective learning in [5]. Guo et al. likewise predict student performance with a classification method, in their case deep learning [6]. Next, classification is used to predict student dropout factors with Rule Induction and Decision Tree in [7]. Unlike the others, the research in [8] focuses on improving Particle Swarm Classification to classify the question level in an examination. Additionally, L. Ge et al. [9] apply SVM as a classifier to predict extraversion and introversion traits in students. Almost all of this previous research is aimed at prediction. As far as we know, classification methods need a labeled dataset because classification is a form of supervised learning.
In contrast, other research generates predictive information using linear regression. In [10], the authors build a model to forecast the grade level of student achievement, which helps the teacher intervene early with poorly performing students. Another study applies linear regression to predict students' performance in the final examination [11]. Performance prediction is also studied in [12]; to achieve this goal, the research employs patterns of student activity, and one of the classifiers in the bag is linear regression. Next, automated marker quality is offered by applying this method in [13]. Lastly, prediction is also performed in [14], where it is aimed at the student's psychomotor domain. Nevertheless, none of this research studies performance enhancement. Also, linear regression is usually applied to data in the form of continuous variables.
Therefore, our research explores logistic regression as a classification method to handle categorical variables. Furthermore, our research also improves the accuracy of the classifier to generate the most valid information. Finally, this information is useful for deciding how many classes to use in the data labeling.

Methods
In this section, we illustrate the proposed framework, which consists of the following steps.

Step 1: Extracting features based on category
In this step, we extract features of the student's cognitive domain to improve the performance of educational data mining [15]. This step produces ten features: the number of main items answered by the student (MID), the percentage of right answers among all answers to the main items (MID%_true), the time elapsed to answer the main items (Time_MID), the student's score on the main items (Score_MID), the number of guidance/scaffolding items answered by the student (GID), the percentage of right answers among all answers to the guidance/scaffolding items (GID%_true), the time elapsed to answer the guidance/scaffolding items (Time_GID), the student's score on the guidance/scaffolding items (Score_GID), the number of accessed hints (Hint), and the sum of MID and GID (Total_score).

Step 2: Applying the discretization method
The extracted student data are continuous variables, which are discretized into three, four, and five intervals. Here, we use the discretization method called equal frequency. This method is intended to optimize the classification process so that the classification performance is better than before. Furthermore, this information can be used to make the best decision about how many classes our data set should have.
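As an illustration of equal frequency discretization, the following minimal sketch assigns each value to one of k intervals so that the intervals hold roughly the same number of observations. The function name `equal_frequency_bins`, the rank-based handling of ties, and the sample scores are assumptions made for this example; they are not taken from the experiments.

```python
def equal_frequency_bins(values, k):
    """Assign each value to one of k bins that hold roughly equal counts.

    Assumption: bins are formed from ranks, so tied values may fall into
    different bins. The source does not specify how ties are handled.
    """
    n = len(values)
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        # Map rank 0..n-1 onto bin 0..k-1 in equal-sized chunks.
        labels[idx] = rank * k // n
    return labels


# Hypothetical continuous feature (e.g. a score), for illustration only.
scores = [35, 82, 47, 90, 58, 61, 73, 66, 95]
for k in (3, 4, 5):
    print(k, equal_frequency_bins(scores, k))
```

With nine values and three intervals, each interval receives three observations; with four or five intervals, the interval sizes differ by at most one.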

Step 3: Applying logistic regression
We propose logistic regression as the classification method. This step determines how many intervals yield the best classification performance. The logistic regression learning algorithm acts as the learner: it fits a logistic regression model to the data.
Logistic regression is a regression model with a categorical dependent variable [16]. It can be understood simply as finding the β parameters that best fit the equation:

$$y = \begin{cases} 1 & \text{if } \beta_0 + \beta_1 x + \varepsilon > 0 \\ 0 & \text{otherwise} \end{cases}$$

where ε is an error distributed according to the standard logistic distribution. In machine learning in particular, logistic regression is an important algorithm used to model the probability that a random variable Y is 0 or 1 given experimental data. Logistic regression evaluates the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function with the following formula:

$$\sigma(t) = \frac{e^{t}}{e^{t} + 1} = \frac{1}{1 + e^{-t}}$$

Here, we assume that t is a linear function of the single explanatory variable x, so t is expressed as follows:

$$t = \beta_0 + \beta_1 x$$

Consequently, the logistic function can be written as:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

This method is used for predicting a categorical dependent variable, such as the one produced in the previous step.
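To make the logistic function concrete, the sketch below evaluates 1 / (1 + e^(-t)) with t = β0 + β1·x for a single explanatory variable. The coefficient values and the 0.5 decision threshold are hypothetical choices for illustration, not estimates from the student data.

```python
import math


def logistic(t):
    """Logistic function 1 / (1 + e^-t), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)  # same value, expressed as e^t / (e^t + 1)


def predict_probability(x, beta0, beta1):
    """Probability that Y = 1 for a single explanatory variable x."""
    return logistic(beta0 + beta1 * x)


# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
beta0, beta1 = -4.0, 0.08
for x in (25, 50, 75):
    p = predict_probability(x, beta0, beta1)
    # Assumed decision rule: predict class 1 when p is at least 0.5.
    print(x, round(p, 3), int(p >= 0.5))
```

The probability crosses 0.5 where β0 + β1·x = 0, which for these assumed coefficients is near x = 50.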
In computer science, and especially in machine learning, regularization is used to prevent overfitting or to solve an ill-posed problem by introducing additional information. Here we use two regularizations: lasso [17] and ridge [17][18][19].
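The difference between the two penalties can be sketched with a small gradient-based fit of binary logistic regression. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function `fit_logistic`, the penalty strength `lam`, the learning rate, the unpenalized intercept, and the toy data are all hypothetical. Ridge adds lam·Σw² to the mean log-loss, while lasso adds lam·Σ|w| and is handled here with a soft-thresholding (proximal) step.

```python
import math


def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def soft_threshold(w, amount):
    """Shrink w toward zero by `amount`; the proximal step for an L1 penalty."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0


def fit_logistic(X, y, penalty="ridge", lam=0.1, lr=0.1, epochs=2000):
    """Binary logistic regression fitted by (proximal) gradient descent.

    penalty="ridge": adds lam * sum(w_j ** 2) to the mean log-loss.
    penalty="lasso": adds lam * sum(|w_j|) to the mean log-loss.
    Assumption: the intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        for j in range(d):
            if penalty == "ridge":
                # Gradient of lam * w^2 is 2 * lam * w.
                w[j] -= lr * (grad_w[j] + 2.0 * lam * w[j])
            else:
                # Lasso: plain gradient step, then shrink toward zero.
                w[j] = soft_threshold(w[j] - lr * grad_w[j], lr * lam)
    return w, b


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 0.5], [0.4, 0.1], [0.35, 0.9], [0.8, 0.3], [0.9, 0.8], [0.7, 0.2]]
y = [0, 0, 0, 1, 1, 1]
for penalty in ("lasso", "ridge"):
    w, b = fit_logistic(X, y, penalty=penalty)
    print(penalty, [round(v, 3) for v in w], round(b, 3))
```

Lasso can set weights exactly to zero when the penalty is strong enough, whereas ridge only shrinks them; this is the practical distinction between the two regularization types.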

Step 4: Analysing results
In this step, we analyze the experimental results. We compare the results with each other to find the best one, which is indicated by the optimal value of every metric related to classification performance.

Result and Discussion
This section describes the execution of the proposed framework and analyzes the experimental results. For the first result, we visualize logistic regression applied to the student data discretized into three, four, and five intervals. Here, we set the regularization parameter of logistic regression to lasso and ridge. The first result is presented in Figure 2. Furthermore, we explore the results based on the parameters of logistic regression. We find that ridge has a higher correlation value (r) than lasso for all intervals of the discretization method. For lasso, the r values for three, four, and five intervals are 0.47, 0.07, and 0.23, respectively. For ridge, three, four, and five intervals achieve 0.52, 0.37, and 0.30, respectively. Thus, the highest correlation is achieved by discretization with three intervals for both parameters of logistic regression.
To assess the performance of logistic regression as a classifier, we employ four metrics: accuracy (Acc), F1, precision (Prec), and recall (Rec). Tables 1-3 show the results for all intervals and all parameters of logistic regression. We evaluate the classifier model using the percentage split technique.
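The four metrics can be computed from true and predicted class labels as in the sketch below. Because the classes here are discretization intervals (more than two), precision, recall, and F1 must be averaged over classes; macro averaging is an assumption of this example, as the averaging scheme is not stated. The label vectors are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro averaging (unweighted mean over classes).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical labels for three classes (three intervals).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(classification_metrics(y_true, y_pred))
```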
In addition, the training data use several sizes (10%, 20%, 30%, 40%, 50%, and 60%), with 2-3 training/testing iterations. Table 2 shows the performance of the discretization method with five intervals applied to logistic regression with lasso and ridge regularization.
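The evaluation protocol, a repeated random percentage split over several training-set sizes, can be sketched as follows. The helper names, the fixed seed, the sample count of 50, and the choice of three repeats are assumptions made for illustration; the source states only the training sizes and that 2-3 iterations were used.

```python
import random


def random_split(n_samples, train_fraction, rng):
    """Randomly split sample indices into a training part and a testing part."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = max(1, round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


def evaluation_plan(n_samples, train_fractions, repeats, seed=0):
    """Yield (fraction, repeat, train_idx, test_idx) for every split to evaluate."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    for fraction in train_fractions:
        for rep in range(repeats):
            train_idx, test_idx = random_split(n_samples, fraction, rng)
            yield fraction, rep, train_idx, test_idx


# Hypothetical data set of 50 students, three repeats per training size.
for fraction, rep, train_idx, test_idx in evaluation_plan(
    n_samples=50, train_fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), repeats=3
):
    # A classifier would be trained on train_idx and scored on test_idx here.
    print(fraction, rep, len(train_idx), len(test_idx))
```

In practice, the metrics from the repeats at each training size would be averaged before being reported.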
For lasso, the highest accuracy is about 0.846, achieved with a training set size of 60%. The highest F1, precision, and recall are approximately 0.942, 0.983, and 0.964, respectively. Ridge reaches its highest accuracy, F1, precision, and recall at around 0.833, 0.95, 0.987, and 0.964, respectively.
The experimental result of the discretization method with four intervals is shown in Table 3. Here, lasso attains its highest accuracy, F1, precision, and recall at about 0.756, 0.969, 0.94, and 1, respectively. Then, the highest of accuracy, F1, precision, and recall around 1,

Table 1. Performance of logistic regression for discretization with three intervals.

Table 2. Performance of logistic regression for discretization with four intervals.

Table 3. Performance of logistic regression for discretization with five intervals.