DESIGN OF NEURO-FUZZY DECISION TREES

To improve the classification accuracy of fuzzy decision trees, we propose a procedure for adapting their parameters by means of neural network training. In the forward cycle, fuzzy decision trees are built with the fuzzy ID3 algorithm; in the feedback cycle, the parameters of the fuzzy decision trees are adapted by a stochastic gradient algorithm that traverses the tree from the leaves back to the root nodes. Under this strategy, the hierarchical structure of the fuzzy decision trees remains fixed.


Introduction
Domestic and foreign literature describes decision trees as a powerful methodology for solving classification and regression problems [1][2][3][4][5]. As a data mining tool (the discovery of hidden knowledge in data), decision trees are used to search for and retrieve classification rules that are interpretable and clear to humans. We should note that many packages for intelligent data analysis already contain methods for constructing decision trees, which makes them a natural tool for decision support systems.

Principles of constructing decision trees
A decision tree is a tree whose leaves contain values of the target function and whose remaining nodes contain branching conditions that determine which edge to follow (for example, "Sex is male"). If the condition is true for a given observation, the transition is made along the left edge; if it is false, along the right. Usually, each node checks one independent variable. Sometimes two independent variables are compared with each other in a tree node, or a function of one or more variables is evaluated.
If the variable checked in a node takes categorical values, then a branch emanating from the node corresponds to each possible value. If the variable is numeric, the node checks whether its value is greater or less than some constant. Sometimes the numerical range is divided into intervals, and the node checks which interval the value falls into.
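This branching logic can be sketched in code (a minimal illustration, not tied to any particular package; the field names such as "kind", "attr", and "threshold" are assumptions of this sketch):

```python
def evaluate(node, observation):
    """Walk the tree from `node` down to a leaf and return the leaf's class."""
    while "label" not in node:
        if node["kind"] == "categorical":
            # One branch per possible value of the categorical variable.
            node = node["branches"][observation[node["attr"]]]
        else:
            # Numeric check: left edge if value <= threshold, right otherwise.
            if observation[node["attr"]] <= node["threshold"]:
                node = node["left"]
            else:
                node = node["right"]
    return node["label"]

tree = {
    "kind": "categorical", "attr": "sex",
    "branches": {
        "male": {"kind": "numeric", "attr": "age", "threshold": 30,
                 "left": {"label": "A"}, "right": {"label": "B"}},
        "female": {"label": "C"},
    },
}

print(evaluate(tree, {"sex": "male", "age": 25}))   # -> A
print(evaluate(tree, {"sex": "female", "age": 40})) # -> C
```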
The leaves of the tree correspond to values of the dependent variable, i.e. to classes. Figure 1 shows an iris classification tree. The classification contains three classes (marked red, blue, and green in Figure 1) and uses four parameters: the length and width of the sepals (SepalLen, SepalWid) and the length and width of the petals (PetalLen, PetalWid).

Figure 1. Iris classification tree.
As we can see, each node contains a class label (the class whose elements reach this node in the greatest number), the number of observations N, and the count of each class. Non-leaf nodes also contain a transition condition to one of the child nodes; the sample is divided according to these conditions. As a result, the tree classifies the initial data (exactly the data on which it was trained) almost perfectly, misclassifying only 6 of 150 observations.

The main methods that use decision trees
Classification and Regression Trees (CART) was the first such method, introduced in 1984 by four well-known scientists in the field of data analysis: Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Table 1) [2].
The essence of the algorithm is the straightforward construction of a decision tree [6]. On the first iteration, we consider all possible (in a discrete sense) hyperplanes that divide the feature space into two parts. For each candidate split, the number of observations of each class in each sub-space is counted. We then select the split that best isolates observations of one of the classes in one of the sub-spaces. This split becomes the root of the decision tree, and the two sub-spaces become its leaves at this iteration.
On subsequent iterations, we take the worst leaf (in terms of the mixture of observations of different classes) and perform the same splitting operation on it. As a result, that leaf becomes a node with a split of its own and two new leaves.
We continue until we reach a limit on the number of nodes, or until the overall error (the number of misclassified observations over the whole tree) stops improving from one iteration to the next. However, the resulting tree will be overfitted (tailored to the training sample) and, accordingly, will not perform well on other data. To avoid overfitting, one can use test samples (or cross-validation) and perform reverse analysis (so-called pruning), in which the tree is cut back depending on the result on the test sample [7]. This is a relatively simple algorithm that yields a single decision tree. It is convenient for preliminary data analysis, for example, to check for relationships between variables.
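The greedy split search described above can be illustrated for a single numeric variable (a simplified sketch of the CART idea, using the misclassification count as the purity criterion; the function names and toy data are assumptions of this sketch):

```python
from collections import Counter

def best_split(points):
    """points: list of (value, class_label) pairs. Try every threshold
    between consecutive sorted values and return (threshold, errors) for
    the split with the fewest misclassified observations, where each side
    predicts its majority class."""
    def errors(group):
        counts = Counter(label for _, label in group)
        return len(group) - max(counts.values()) if group else 0

    xs = sorted(points)
    best = None
    for i in range(1, len(xs)):
        t = (xs[i - 1][0] + xs[i][0]) / 2
        left = [p for p in xs if p[0] <= t]
        right = [p for p in xs if p[0] > t]
        err = errors(left) + errors(right)
        if best is None or err < best[1]:
            best = (t, err)
    return best

data = [(1, "a"), (2, "a"), (5, "a"), (6, "b"), (7, "b")]
print(best_split(data))  # -> (5.5, 0): the split x <= 5.5 separates the classes
```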
Random Forest is a method invented after CART by one of the four scientists, Leo Breiman, in co-authorship with Adele Cutler [3]. The method is based on a committee (ensemble) of decision trees.
The essence of the algorithm is that on each iteration a random subset of variables is sampled, and a decision tree is constructed on this new sample. In addition, "bagging" takes place: a random two-thirds of the observations is sampled for training, and the remaining one-third is used to evaluate the result. This operation is repeated hundreds or thousands of times. The resulting model is produced by the "voting" of the set of trees obtained during the simulation.
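The bagging-and-voting scheme can be sketched as follows (a toy illustration with one numeric feature and single-split "stumps" in place of full trees; the bootstrap of full sample size and all names are simplifications, not Breiman and Cutler's implementation):

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree: the threshold minimizing misclassifications,
    predicting the majority class on each side."""
    def majority(labels):
        return Counter(labels).most_common(1)[0][0] if labels else None
    best = None
    xs = sorted(sample)
    for i in range(1, len(xs)):
        t = (xs[i - 1][0] + xs[i][0]) / 2
        left = [lab for x, lab in xs if x <= t]
        right = [lab for x, lab in xs if x > t]
        err = (sum(l != majority(left) for l in left)
               + sum(r != majority(right) for r in right))
        if best is None or err < best[0]:
            best = (err, t, majority(left), majority(right))
    return best[1:]  # (threshold, left_label, right_label)

def random_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        # Bagging: each stump is trained on its own bootstrap sample.
        sample = [rng.choice(data) for _ in range(len(data))]
        stumps.append(fit_stump(sample))
    def predict(x):
        # The committee "votes"; the majority wins.
        votes = [left if x <= t else right for t, left, right in stumps]
        return Counter(votes).most_common(1)[0][0]
    return predict

data = [(1, "a"), (2, "a"), (3, "a"), (6, "b"), (7, "b"), (8, "b")]
predict = random_forest(data)
print(predict(2.5), predict(7.5))
```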
Stochastic Gradient Boosting is a data analysis method introduced by Jerome Friedman in 1999 [4]. It solves the regression problem (which can include classification) by constructing a committee (ensemble) of "weak" predictive decision trees.
On the first iteration, a decision tree limited in its number of nodes is constructed. Then the difference is computed between the target variable and the value predicted by this tree multiplied by the learnrate (the "weakness" coefficient of each tree). The next iteration is fitted to this difference (the residual), and the process continues while the result keeps improving; in effect, each step tries to correct the mistakes of the previous tree. It is better to evaluate on held-out data (not involved in the fitting), because overfitting is possible on the training data. The procedure often arrives at a local solution (for example, the hyperplane selected on the first step maximally divides the space at that step, but does not lead to the optimal solution).
A common disadvantage of traditional decision trees is the requirement that the input data be certain, which is usually met by applying average values of the input parameters of the analyzed technology. This can lead to significantly shifted point estimates of project performance indicators. It is also clear that the requirement of deterministic input data is an unjustified simplification of reality, because any technology is characterized by many uncertainties: uncertainty of the input data, of the external environment, of the nature, options, and model of project realization, and of the requirements for technology effectiveness. These uncertainty factors determine technology risk, i.e. the danger of loss of resources, revenue shortfalls, or additional costs.
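The residual-fitting loop of stochastic gradient boosting described above can be sketched as follows (a minimal illustration with a one-threshold regression stump as the "weak" tree; the names and the toy data are assumptions of this sketch):

```python
def fit_reg_stump(xs, ys):
    """One-split regression stump: the threshold minimizing squared error,
    predicting each side's mean."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        t = (xs[order[k - 1]] + xs[order[k]]) / 2
        left = [ys[i] for i in range(len(xs)) if xs[i] <= t]
        right = [ys[i] for i in range(len(xs)) if xs[i] > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def gradient_boost(xs, ys, n_rounds=100, learnrate=0.3):
    """Each round fits a stump to the current residuals and adds a
    learnrate-scaled fraction of its prediction to the ensemble."""
    stumps, pred = [], [0.0] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lm, rm = fit_reg_stump(xs, residuals)
        stumps.append((t, lm, rm))
        pred = [p + learnrate * (lm if x <= t else rm)
                for x, p in zip(xs, pred)]
    return lambda x: sum(learnrate * (lm if x <= t else rm)
                         for t, lm, rm in stumps)

predict = gradient_boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
print(round(predict(1), 3), round(predict(4), 3))  # -> 1.0 3.0
```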

Neuro-fuzzy trees
To improve classification accuracy, the author suggests using neuro-fuzzy decision trees, which can adapt their parameters by means of neural network training. In the forward cycle, fuzzy decision trees are built with the fuzzy ID3 algorithm [5]. In the feedback cycle, the parameters of the fuzzy decision trees are adapted by a stochastic gradient algorithm that traverses the tree from the leaves back to the root nodes.
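As an illustration of the feedback cycle, the following sketch adapts one parameter of a triangular membership function (its peak α) by stochastic gradient descent on a squared error. A finite-difference gradient is used for brevity, and all names, values, and the single-parameter setting are assumptions of this sketch, not the authors' implementation:

```python
def sgd_adapt_center(xs, targets, alpha0, a_min, a_max, lr=0.5, epochs=200):
    """Adapt the peak alpha of a triangular membership function so that
    mu(x) approaches the target degrees; the support [a_min, a_max] and
    the tree structure stay fixed, as in the strategy described above."""
    alpha, h = alpha0, 1e-5

    def mu(x, a):
        if x <= a_min or x >= a_max:
            return 0.0
        return (x - a_min) / (a - a_min) if x <= a else (a_max - x) / (a_max - a)

    for _ in range(epochs):
        for x, target in zip(xs, targets):
            err = lambda a: (mu(x, a) - target) ** 2
            # Finite-difference estimate of the error gradient w.r.t. alpha.
            grad = (err(alpha + h) - err(alpha - h)) / (2 * h)
            alpha -= lr * grad  # stochastic gradient step per training pair
    return alpha

# The training pair asks for full membership at x = 6, so alpha moves
# from 4.0 toward 6.0 while a_min and a_max stay fixed.
alpha = sgd_adapt_center([6.0], [1.0], alpha0=4.0, a_min=0.0, a_max=10.0)
print(round(alpha, 2))  # -> 6.0
```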
As initial data we use so-called triangular fuzzy numbers, with a membership function of the type shown in Figure 2. Such a number models the statement: "Parameter A is approximately equal to α and clearly lies in the range [a_min, a_max]".
In general, a fuzzy number is a fuzzy subset of the universal set of real numbers with a normal and convex membership function. This description lets experts take the parameter interval [a_min, a_max] and the most expected value α as input information, from which the corresponding triangular number A = (a_min, α, a_max) is built. Selecting three significant points of the initial data is quite common in investment analysis; these points are often associated with the subjective probabilities of realizing the corresponding "pessimistic", "normal", and "optimistic" scenarios. In what follows, we call the parameters (a_min, α, a_max) the significant points of the triangular fuzzy number A.
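The membership function of a triangular fuzzy number is easy to state in code (a small sketch; the numbers in the example are illustrative):

```python
def triangular(a_min, alpha, a_max):
    """Membership function of the triangular fuzzy number A = (a_min, alpha, a_max):
    1 at the most expected value alpha, falling linearly to 0 at the ends
    of the interval [a_min, a_max]."""
    def mu(x):
        if x <= a_min or x >= a_max:
            return 0.0
        if x <= alpha:
            return (x - a_min) / (alpha - a_min)
        return (a_max - x) / (a_max - alpha)
    return mu

# "Parameter A is approximately 15 and clearly lies in [10, 25]"
mu = triangular(10, 15, 25)
print(mu(15), mu(12.5), mu(20))  # -> 1.0 0.5 0.5
```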
We should note that the attributes of technological innovation projects are classified as subjective or objective. Subjective attributes include qualitative features such as technical level, advantages of the enterprise, innovation risk, and project management; we estimate them by linguistic values, represented by fuzzy numbers, on the basis of expert interviews.
Objective (quantitative) attributes include investment cost plans, etc. These quantitative features are reduced to a common scale in order to make them compatible with the linguistic values of the subjective features. Typical descriptions of technology project attributes are given in Table 2.
Fuzzification is the conversion of the numerical values of attributes into linguistic terms, in order to compress the information and present it in a human-understandable form convenient for decision-making. One way to determine the membership functions of these linguistic variables is expert opinion or human perception. To automate this procedure, one can use statistical methods or fuzzy clustering based on the training of a self-organizing neural network. Let us consider the second method.
Let there be a data set X that must be converted into k linguistic variables T_j, j = 1, 2, …, k. For simplicity, assume that each T_j has a triangular membership function. The parameters to be defined for each attribute form k centers {a_1, a_2, …, a_k}. A neural network algorithm, Kohonen's self-organizing map, is an effective method for determining these centers [3].
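A one-dimensional sketch of this center-finding step (a simplified, winner-takes-all version of Kohonen training, without the neighborhood function of a full self-organizing map; the cost data are illustrative):

```python
def kohonen_centers(data, k=3, epochs=50, lr0=0.5):
    """Find k centers {a_1, ..., a_k} for one numerical attribute: on each
    step the center nearest to the sample is pulled toward it, with a
    learning rate that decays over the epochs."""
    lo, hi = min(data), max(data)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # spread evenly
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        for x in data:
            winner = min(range(k), key=lambda j: abs(x - centers[j]))
            centers[winner] += lr * (x - centers[winner])
    return sorted(centers)

costs = [12, 14, 13, 48, 52, 50, 95, 99, 97]  # three visible cost clusters
print([round(c) for c in kohonen_centers(costs)])
```

The three returned centers settle near the cluster means (about 13, 50, and 97 for this data) and can serve as the peaks of the triangular terms.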
Let us consider a numerical attribute of the project, investment costs, for the group of examples in Table 2. Obviously, its linguistic terms can be described as «low», «average», and «high». The second column of Table 3 shows the degree of proximity of the attribute «investment costs» to these three membership functions.
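The fuzzification of a numeric value into these three terms can then be sketched with triangular and shoulder membership functions built on the cluster centers (the centers and the example value below are illustrative assumptions):

```python
def fuzzify(x, centers):
    """Degrees of membership of value x in the terms «low», «average»,
    «high»; the peaks are the cluster centers, with shoulder functions
    at the ends so that the degrees always sum to 1."""
    a, b, c = centers
    if x <= a:
        return {"low": 1.0, "average": 0.0, "high": 0.0}
    if x >= c:
        return {"low": 0.0, "average": 0.0, "high": 1.0}
    if x <= b:
        w = (x - a) / (b - a)
        return {"low": 1.0 - w, "average": w, "high": 0.0}
    w = (x - b) / (c - b)
    return {"low": 0.0, "average": 1.0 - w, "high": w}

# Investment costs of 30 against centers for «low», «average», «high»:
degrees = fuzzify(30, (13, 50, 97))
print({term: round(d, 2) for term, d in degrees.items()})
```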
To describe linguistic terms and the corresponding numerical values, we assume that the membership functions of these linguistic terms are known.
The measure of similarity between linguistic terms can be determined from their membership functions by a function f, where M is the sum of the degrees of membership in the conversion of a fuzzy set to its final state.
Using the described function f, we can calculate the degree of membership of each of the two linguistic terms of «advantages of the enterprise».
The value of a fuzzy attribute, for example the attribute «technical level», can be represented functionally by a set of membership functions (Table 3). For a given set of functions, we find new fuzzy sets, which are treated as the result of clustering the initial data and describe the membership functions of the set.

Conclusion
Thus, the developed method of constructing neuro-fuzzy decision trees makes it possible to dispense with weighted average estimates of the input data and provides neural network adaptation of the parameters by a stochastic gradient algorithm that traverses the tree from the leaves back to the root nodes. In the forward cycle, fuzzy decision trees are constructed with the fuzzy ID3 algorithm; in the feedback cycle, their parameters are adapted. Under this strategy, the hierarchical structure of the fuzzy decision tree remains fixed.
In conclusion, we note that the proposed approach of applying the back-propagation algorithm directly to the structure of fuzzy decision trees improves the accuracy of their training without sacrificing interpretability.