A Chinese text classification system based on the Naive Bayes algorithm

In this paper, addressing the characteristics of Chinese text classification, we use ICTCLAS (the Chinese Lexical Analysis System of the Chinese Academy of Sciences) for word segmentation, perform data cleaning and stop-word filtering, and apply the information gain and document frequency feature selection algorithms to select document features. On this basis, we implement a text classifier based on the Naive Bayes algorithm, and evaluate the system experimentally on the Chinese corpus of Fudan University.


Chinese text preprocessing
We first preprocess the Chinese text, including structural processing, word segmentation, and stop-word removal. We then extract the metadata (features) that represent the text and save it in a structured form as the core representation of the document.

Segmentation preprocessing
In the organization of text, Chinese differs greatly from European and American languages, of which English is representative. In Western languages, words are separated by spaces, so no segmentation processing is needed. In Chinese text, words are written continuously without delimiters, which makes word segmentation difficult. Chinese word segmentation technology faces two major problems: ambiguous segmentation and unknown-word recognition [1]. Removing stop words, by contrast, is not technically complicated: build a stop-word dictionary and match each token produced by segmentation against its entries; if the match succeeds, remove that token.
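The dictionary-matching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the tokens have already been produced by a segmenter such as ICTCLAS, and the stop-word list shown is a small sample chosen for the example.

```python
# A small sample stop-word dictionary (illustrative only; a real system
# would load a full stop-word list from a file).
STOP_WORDS = {"的", "了", "是", "在", "和"}

def remove_stop_words(tokens):
    """Keep only tokens that do not match an entry in the stop-word dictionary."""
    return [t for t in tokens if t not in STOP_WORDS]

# Tokens as they might come out of a segmenter:
tokens = ["文本", "分类", "的", "系统", "是", "有效", "的"]
print(remove_stop_words(tokens))  # ['文本', '分类', '系统', '有效']
```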

Text feature selection
Feature selection improves the efficiency of text classification and reduces computational complexity. Text feature selection is usually carried out by scoring candidate terms.
The commonly used methods are document frequency, information gain, cross entropy, mutual information, the chi-square statistic, and so on. This paper uses the information gain and document frequency methods to score terms.

Information gain
Information gain is an entropy-based evaluation method: it is defined as the difference in the information entropy of the text before and after a given feature is observed.
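The definition above can be made concrete with a small sketch. This is an assumption-laden illustration (the helper names and the tiny corpus are invented for the example): it computes IG(t) = H(C) − [P(t)·H(C|t) + P(t̄)·H(C|t̄)], i.e. the entropy of the class distribution before observing the term minus the expected entropy after.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(~t) H(C|~t)].
    docs: list of token sets; labels: parallel list of class labels."""
    n = len(docs)
    classes = set(labels)
    # Entropy of the class distribution before observing the term.
    h_c = entropy([labels.count(c) / n for c in classes])
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    # Expected entropy after splitting on presence/absence of the term.
    h_cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            p = len(subset) / n
            h_cond += p * entropy([subset.count(c) / len(subset) for c in subset and classes])
    return h_c - h_cond

docs = [{"足球", "比赛"}, {"股票", "市场"}, {"足球", "联赛"}, {"市场", "经济"}]
labels = ["sport", "finance", "sport", "finance"]
print(information_gain(docs, labels, "足球"))  # 1.0 — the term fully determines the class here
```

A term that appears only in one class (like "足球" above) removes all uncertainty about the class, so its information gain equals the full prior entropy H(C).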

Document frequency
Document frequency (DF) is the number of documents in which a feature term appears. The idea is that terms whose DF value falls below a certain threshold are low-frequency words that contain little or no category information. Removing these low-frequency words from the feature space reduces its dimensionality and improves classification accuracy.

Text classification
Text classification uses a data set of texts tagged with categories, called the training set, to train a classifier; the trained classifier is then used to classify texts whose categories are unknown. Frequently used text classification algorithms include the KNN algorithm, the SVM algorithm, and Bayesian algorithms.
In many settings, the Naïve Bayes classification algorithm is comparable to decision tree and neural network classifiers; it can be applied to large databases, and the method is simple, with high classification accuracy and speed.
In theory, Bayesian classification has the smallest error rate of all classification algorithms: under the premise that the class-conditional independence assumption holds, it is the optimal classifier. In many cases, however, this simple independence assumption does not hold; practice has shown that, even so, Naive Bayes still obtains good classification results in many domains.
The Naive Bayes method [3] decomposes a training document into a feature vector and a decision class variable.
It assumes that each component of the feature vector is conditionally independent of the others given the class; that is, each component acts on the decision variable independently. This conditional independence assumption between attributes greatly simplifies the computation of the joint probability, as shown in formula (2).
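The paper's formula (2) is not reproduced in this excerpt; the standard form of the factorization it refers to is the following. For a document with feature vector $(t_1, \dots, t_n)$ and class $c$, conditional independence gives

```latex
P(t_1, t_2, \dots, t_n \mid c) = \prod_{i=1}^{n} P(t_i \mid c)
```

so the classifier assigns the class that maximizes the posterior:

```latex
c^{*} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(t_i \mid c)
```

Without the independence assumption, estimating the joint probability $P(t_1, \dots, t_n \mid c)$ would require data for every combination of feature values; factorizing it reduces the problem to estimating each $P(t_i \mid c)$ separately from term counts.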
The key to the classifier construction module is the training process module, which uses the Bayes classification algorithm to build the concrete classifier. Training is generally time-consuming, so the system trains on all texts once and stores the feature statistics in a configuration file. After training once, testing reads the relevant information directly from the configuration file, without the need to train again, saving time.
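The train-once-then-persist workflow described above can be sketched as follows. This is a minimal multinomial Naive Bayes with Laplace smoothing, persisted as JSON; the function names, file format, and smoothing choice are assumptions for the illustration, not details taken from the paper.

```python
import json
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect the count statistics a Naive Bayes classifier needs.
    docs: list of token lists; labels: parallel list of class labels."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)
        vocab.update(tokens)
    return {"class_counts": dict(class_counts),
            "term_counts": {c: dict(tc) for c, tc in term_counts.items()},
            "vocab": sorted(vocab),
            "n_docs": len(docs)}

def save_model(model, path):
    """Persist the trained statistics so testing can skip retraining."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f, ensure_ascii=False)

def load_model(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def classify(model, tokens):
    """Pick argmax_c log P(c) + sum_i log P(t_i | c), with Laplace smoothing."""
    v = len(model["vocab"])
    best, best_lp = None, float("-inf")
    for c, n_c in model["class_counts"].items():
        tc = model["term_counts"][c]
        total = sum(tc.values())
        lp = math.log(n_c / model["n_docs"])
        for t in tokens:
            lp += math.log((tc.get(t, 0) + 1) / (total + v))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

In use, `train_nb` plus `save_model` run once over the whole training set; later runs call `load_model` and `classify` directly, mirroring the configuration-file scheme the text describes.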