Review of the Classification of Massive Chinese Texts Based on Spark

As the Internet develops rapidly, the volume of text is also growing rapidly. Whether it is the content of online emails exchanged by people, online novels and other literary content, news reports, personal blogs, Weibo posts or comments, the amount of text is constantly increasing. However, most of this data is not classified or processed, which produces a lot of spam, junk information, meaningless articles and advertisements. These not only consume a large amount of Internet resources, but also degrade users' online experience and reduce their work and study efficiency. Therefore, it is vital to accurately classify large amounts of text, judge its nature according to the classification result, and handle it in a targeted way. This paper reviews the classification of massive texts based on the Spark framework.


Introduction
With the development of Internet technology and social media, massive amounts of network text data have been generated. However, most of this data has not been processed or classified, which results in bad network behaviors such as spam and advertisement pushing, makes it difficult for people to extract useful information from massive data, and wastes a great deal of users' time and energy on processing spam. Therefore, how to efficiently classify massive text data has important theoretical significance and application value [1], and how to efficiently extract valuable information from massive text has become a research hotspot [2].
Text classification, as a key technology for text processing, has been widely used to improve information retrieval and utilization [3]. At present, classification algorithms such as K-nearest neighbor (KNN) [4], Naive Bayes [5], Maximum Entropy [6], Support Vector Machine (SVM) [7], artificial neural networks [8], decision trees [9], and rough sets [10] are widely used in practical applications. Research on these text classification methods has focused on small- and medium-scale data. Traditional classification systems are ill-suited to large-scale data and to scenarios that require high real-time performance. Therefore, many researchers have combined big data frameworks with traditional machine learning in order to solve the problem that traditional text classification cannot handle massive texts [11].
The MapReduce framework is the most widely used big data parallel computing framework, and much attention has been paid to parallel text classification algorithms built on it. The disadvantage of MapReduce is that it stores intermediate results on HDFS during parallel computing, which incurs a large amount of I/O overhead. The Spark framework, by contrast, is a parallel framework based on in-memory computing: it does not write intermediate results to disk during execution (data is spilled to disk only when memory is insufficient), so its execution efficiency is comparatively high [12].

Current situation of text classification
The Bayes classifier is the most classic method in classifier research and has been widely used in many scenarios. For example, it has been found that when a Bayes classifier is used for spam filtering, phrasal features and other attribute features can improve the classification accuracy to as high as 95% [13]. Some studies use language models [14] to estimate the relevant probabilities and have achieved good results. In addition, online Bayes methods have also been widely used in text classification and information filtering [15]. However, the Bayes method requires that the vocabulary terms in a document be independent of each other. This conditional-independence assumption rarely holds for real text, so there is often a gap between the actual effect and the theoretical estimate.
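To make the conditional-independence assumption concrete, the following is a minimal standard-library sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is an illustration of the general method, not the implementation used in any of the cited studies; the token data is invented for demonstration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model.
    docs: list of (tokens, label) pairs."""
    class_counts = Counter()            # documents per class (for the prior)
    word_counts = defaultdict(Counter)  # term counts per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, class_counts, word_counts, vocab):
    """Return the label with the highest log posterior (Laplace smoothing).
    The product over tokens embodies the independence assumption."""
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[label].values())
        for t in tokens:
            score += math.log((word_counts[label][t] + 1)
                              / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# toy training data (invented for illustration)
train = [(["cheap", "pills", "buy"], "spam"),
         (["meeting", "agenda", "notes"], "ham"),
         (["buy", "now", "cheap"], "spam"),
         (["project", "meeting", "report"], "ham")]
model = train_nb(train)
print(classify_nb(["cheap", "buy", "now"], *model))  # → spam
```

Because each token contributes an independent factor to the posterior, correlated terms (common in real Chinese text) are double-counted, which is exactly the gap between theory and practice noted above.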
The support vector machine is another excellent classification algorithm. Developed from statistical learning theory, it is based on VC-dimension theory and the structural risk minimization principle. It has many advantages for small-sample, nonlinear and high-dimensional pattern recognition problems, and has been applied to pattern recognition, regression estimation, probability density estimation and so on. In the field of text classification, SVM classifiers offer good classification performance and generalization ability and are widely used [16][17][18][19][20][21]. The Liblinear classifier, designed by Chih-Jen Lin et al. of National Taiwan University on the basis of the L2 soft-margin SVM, effectively solved the speed problem of traditional SVM classifiers and further accelerated the application of SVMs to text classification. However, when the amount of data is large, high-dimensional sparse text vectors bring a higher VC dimension, which widens the gap between the classifier's expected risk and empirical risk and reduces classification performance and accuracy.
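The soft-margin idea can be illustrated with a minimal subgradient-descent trainer for a linear SVM on the hinge loss. This is a didactic sketch in pure Python, not the Liblinear algorithm (which uses far more efficient coordinate-descent solvers); the toy data and hyperparameters are invented for demonstration.

```python
import random

def train_linear_svm(data, dim, epochs=100, lr=0.1, lam=0.01):
    """Soft-margin linear SVM trained by stochastic subgradient descent
    on the hinge loss with L2 regularization."""
    w = [0.0] * dim
    b = 0.0
    random.seed(0)
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:                # y in {-1, +1}
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]   # L2 shrinkage step
            if margin < 1:               # point inside the margin: hinge gradient
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# tiny linearly separable toy set (invented)
data = [([2.0, 2.0], 1), ([1.5, 2.5], 1),
        ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w, b = train_linear_svm(data, dim=2)
print(predict(w, b, [2.0, 1.0]), predict(w, b, [-2.0, -2.0]))
```

With millions of sparse TF-IDF dimensions, the per-update cost and the number of support-vector-like margin violations both grow, which is why the scalability concern raised above matters.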
Recently, research on deep learning has received great attention. Deep learning algorithms have achieved remarkable results in image recognition and speech recognition [22][23]. For text data, deep learning is mainly used in natural language processing and semantic mining, for example in word vector algorithms, convolutional neural networks (CNN) [24][25], and recurrent neural networks (RNN). The idea of training a language model with a neural network was first proposed by Xu Wei of Baidu IDL in 2000 [26][27], in a paper describing a method for constructing a binary language model with a neural network. Subsequently, Bengio et al. published an article at NIPS [28] giving a classical algorithm for training a language model with a neural network. The neural network then entered a period of rapid development in NLP, and recent deep learning results have created another huge wave. Text tasks are usually handled with a CNN, whose convolution and pooling structure can extract local structure, followed by a fully connected network to aggregate and output information. The RNN, with its memory function, provides a way to handle context in NLP.
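The convolution-plus-pooling structure mentioned above can be shown in a few lines. This is a pure-Python sketch of one filter of a text CNN: a filter slides over consecutive windows of token embeddings, a ReLU is applied, and max-over-time pooling keeps the strongest activation. The embeddings and filter weights are invented; a real model would learn them.

```python
def conv1d_maxpool(embeddings, filt, width):
    """One text-CNN feature: slide a filter over consecutive windows of
    token embeddings, apply ReLU, then max-pool over time."""
    feats = []
    for i in range(len(embeddings) - width + 1):
        # flatten the window of `width` embeddings into one vector
        window = [v for emb in embeddings[i:i + width] for v in emb]
        act = sum(f * v for f, v in zip(filt, window))
        feats.append(max(0.0, act))      # ReLU activation
    return max(feats)                    # max-over-time pooling

# toy 2-d embeddings for a 4-token sentence, one filter of width 2
emb = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
filt = [1.0, -1.0, 1.0, -1.0]            # covers width * dim = 4 values
print(conv1d_maxpool(emb, filt, width=2))  # → 1.0
```

A full classifier would use many such filters of several widths and feed the pooled features into a fully connected layer, as the text describes.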
However, in the field of text classification, semantics-based convolutional neural networks and word vectors have not achieved the results that theory suggests. The main reason is that deep neural networks spend a great deal of computation on semantic recognition, while the gain in semantic precision does not translate into higher classification accuracy; in practice, the traditional bag-of-words (BOW) model can achieve comparable or better accuracy. In addition, because deep learning is computationally expensive, clusters are usually required for training, so the computing power of distributed computing engines and clusters is also a factor affecting classification performance [29].

PREPROCESSING OF TEXT
Text preprocessing mainly includes text formatting, word segmentation, and removal of stop words. Before preprocessing, the corpus must be merged: the training set texts and the test set texts are each merged into a single file, with one line representing one record, and the text content is formatted during merging. The merged text is uploaded to the HDFS distributed file system as an input file. Since formatting, segmentation and stop-word removal each operate on a single text, the preprocessing module is naturally parallel [30]. Currently, the commonly used Chinese word segmentation tools include ICTCLAS from the Chinese Academy of Sciences [31], IKAnalyzer [32] and Paoding [33]. ICTCLAS, which uses the N-shortest-path algorithm, is adopted in this paper; it won first place in the 973 evaluation in 2002, as shown in Table 1 [34][35][36][37][38]. While segmenting each record, a stop-word list is used to remove stop words from the text. Segmentation yields both single characters and phrases; since phrases are selected as the feature items in this paper, the single characters remaining after segmentation are also removed during preprocessing [39][40][41]. Text preprocessing is implemented on a Spark cluster, where each data item in the RDD is one line of the text. The text is segmented, stop words are removed, and term frequencies form an attribute dictionary; these distributed operations are performed on the Worker nodes [42]. The execution flow of preprocessing under Spark is shown in Figure 1.
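The per-record cleaning step can be sketched as a pure function over one line, which is exactly what makes it naturally parallel (in Spark it would be the body of a `map` over the RDD). This is an illustrative stdlib sketch: whitespace splitting stands in for real ICTCLAS segmentation, and the stop-word list and records are toy examples.

```python
STOP_WORDS = {"的", "了", "是", "在", "中"}   # toy stop-word list (illustrative)

def preprocess(line):
    """Clean one record: segment it (whitespace split stands in for
    ICTCLAS), then drop stop words and single characters, keeping
    phrases as feature items."""
    tokens = line.strip().split()            # placeholder for real segmentation
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

# each line is one pre-segmented record, as produced by the merge step
records = ["大数据 的 文本 分类 是 重要 技术",
           "Spark 在 内存 计算 中 表现 良好"]
for r in records:
    print(preprocess(r))
```

On a cluster, the same function would be applied per line via an RDD transformation, with the stop-word list broadcast to the Worker nodes.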

Text vectorization
Common text vectorization algorithms include word frequency statistics [43][44], the TF-IDF algorithm [45][46], LDA [47][48], and Word2vec [49][50][51]. TF-IDF is the most common algorithm and is frequently combined with Spark. Li Tao et al. proposed an improved feature-weighting calculation process on Spark in the article "Study on Efficient Web Text Classification System under Spark Platform". Classic TF-IDF weight calculation is very costly for massive text classification, taking dozens of hours or even days, which is clearly unacceptable for scenarios with high real-time requirements; this motivates introducing the memory-based distributed computing model Spark. Luo Yuanshuai also combined TF-IDF as a word-vector algorithm with Spark in the article "Study on Parallel Text Classification Algorithm Based on Random Forest and Spark".
He proposed the following process: a complete text vectorization pass first reads the segmented text RDD fenci, then uses the feature lexicon RDD features to filter the text content, counts the TF values of the words in the feature lexicon, and on this basis counts the IDF values. After the TF and IDF statistics are complete, the text is vectorized in combination with the feature lexicon RDD features to obtain the vector space model RDD TF-IDF [38].
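The TF and IDF statistics described above can be sketched serially in a few lines of standard-library Python; in the Spark version each step would be an RDD transformation over partitions. This uses the common raw-count TF and log(N/df) IDF definitions, which may differ in detail from the weighting in the cited papers; the corpus is a toy example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute sparse TF-IDF vectors for a tokenized corpus.
    TF is the raw term count per document; IDF = log(N / df)."""
    n = len(corpus)
    df = Counter()                       # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

corpus = [["spark", "text", "classification"],
          ["spark", "memory", "computing"],
          ["text", "mining", "classification"]]
vecs = tf_idf(corpus)
print(vecs[0]["classification"])  # term in 2 of 3 docs: tf=1, idf=log(3/2)
```

The two global aggregations (document count and df) are what force a shuffle in the distributed version; everything else is per-document and embarrassingly parallel.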
Yu Pingping et al. published "Efficient KNN Chinese Text Classification Algorithm Based on Spark". The authors show that the usual parallelization of KNN text classification reduces classification accuracy, so the relevance between words should be introduced when computing the similarity between the training samples and the samples to be tested; this improves classification accuracy while achieving parallelization under the Spark computing framework and reducing computation time [3].
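As background for the KNN discussion, the following is a minimal stdlib sketch of KNN text classification with cosine similarity over sparse term-weight vectors. It shows the baseline scheme only, not the word-relevance refinement of the cited paper; the training vectors and labels are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, train, k=3):
    """Vote among the k training samples most similar to the query."""
    ranked = sorted(train, key=lambda s: cosine(query, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# toy labeled TF-IDF-style vectors (invented)
train = [({"spark": 1.0, "cluster": 1.0}, "tech"),
         ({"memory": 1.0, "computing": 1.0}, "tech"),
         ({"novel": 1.0, "story": 1.0}, "fiction"),
         ({"poem": 1.0, "story": 1.0}, "fiction")]
print(knn_classify({"spark": 1.0, "computing": 1.0}, train, k=3))  # → tech
```

In the parallel setting, the similarity computations for one query are independent across training partitions, so Spark can compute partial top-k lists per partition and merge them, which is where the accuracy concerns raised by the authors arise.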
Yan Jiaming et al. proposed a weighted naive Bayes algorithm in the article "Research and Application of Text Classification Based on Cloud Computing" [52], an improvement on the naive Bayes algorithm. On this basis, it is further improved into a weighted naive Bayes algorithm based on cosine similarity, which reduces traditional naive Bayes's over-dependence on the conditional independence assumption. When that assumption cannot be satisfied in practice, for example when the data attributes are strongly correlated, the classification effect deteriorates. Moreover, if the distribution of the training set does not reflect the distribution of all the data, the prior probabilities obtained are not very reliable, and their accuracy must be improved through adjustment.
From the above, it can be seen that under the Spark framework the traditional text classification algorithms are usually only slightly adjusted or optimized before being run on a distributed cluster, yet this greatly improves processing efficiency, which is a major contribution to production, daily life and learning [52][53].

Conclusion
With the rapid development of the Internet, text classification technology has been driven by demand and has developed greatly. Currently, many algorithms are mature or have been verified, but the classification of massive Chinese texts is still in a developing stage. During classification, word segmentation may cause semantic deviation and classification errors; incorrect classification may also be caused by inaccurate keyword extraction due to embedded English words; and a text may be associated with multiple classes, making it difficult to classify explicitly. With the emergence of massive texts, text classification technology has become particularly important. Reviewing previous work on massive Chinese text classification based on Spark, it is found that although progress has been made in this direction, it remains difficult to improve both accuracy and efficiency on massive data. Therefore, segmentation, vectorization and classification of massive texts based on the Spark framework are still worth studying.

Figure 1. Preprocessing Structure Diagram of Spark Text Classification

Table 1. Test Results of ICTCLAS in the 973 Evaluation

In addition, Luo Yuanshuai et al. proposed combining random forest with Spark in "Study on Parallel Text Classification Algorithm Based on Random Forest and Spark" to classify massive Chinese texts. Ren Yitian et al. published "Study on the Parallelization Technology of Massive Text Classification Based on Support Vector Machine", which combines the SVM algorithm with Spark to classify massive texts. Parallelizing SVMs with Apache Spark requires well-designed RDDs: an RDD is generated from the input training data set, and multiple transformations and caching operations are performed in later calculations. Through these transformations, the intermediate variables for the SVM model and the data grouped by label can be computed.
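The group-by-label transformation mentioned above can be sketched serially with the standard library; in Spark it would correspond to a key-by-label transformation followed by grouping or aggregation. This is an illustrative stand-in, not the cited paper's code, and the samples are invented.

```python
from collections import defaultdict

def group_by_label(samples):
    """Emulate the grouping transformation: partition training samples by
    label so that, e.g., each one-vs-rest SVM can see its own split."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return dict(groups)

# toy (feature-vector, label) training records
samples = [([0.1, 0.9], "sports"), ([0.8, 0.2], "finance"),
           ([0.2, 0.7], "sports")]
grouped = group_by_label(samples)
print(sorted(grouped))        # labels present
print(len(grouped["sports"]))  # samples in the "sports" group
```

Caching the grouped RDD after this step is what lets the iterative SVM training reuse it without recomputation, which is the Spark advantage discussed in the introduction.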