Grid text classification method based on DNN neural network

With the rapid development of network technology, the electric power Internet of Things (IoT) must handle a large volume of electronic text and a large number of distributed data access and analysis requests. To complete accurate and efficient data analysis and to build, on the existing basis, a data and service standard system covering the entire chain of the energy and power business, the system must implement retrieval, information extraction and classification of massive electronic texts in the power grid. To this end, a DNN classification model is constructed to classify the text information of the power grid, and the effectiveness of the method is verified by experiments on data from the substation information system.


Introduction
With the rapid development of network technology, the power IoT faces the access of large amounts of distributed data, the concurrency of multiple services, the need for joint data analysis, and a sharp increase in electronic text (such as grid customer information and grid business data). To complete accurate and efficient data analysis, build a data and service standard system covering the entire chain of the energy and power business, expand the data model, and support unified management of all data on the existing basis, the system must implement retrieval, information extraction and classification of massive electronic texts in the power grid.
Text classification is the task of assigning a given text to one of a fixed set of predefined categories based on the characteristics of the text. It is one of the main research problems of natural language processing (NLP). Typical applications include spam detection, automatic web page classification [1], sentiment classification [2], and personalized news recommendation [3]. The earliest approach classified documents by matching words in the document [4], but it relied mainly on manual work, was inefficient, and produced classification results that did not meet requirements. On this basis, the vector space model and knowledge engineering methods were studied, but accuracy remained low. With the development of machine learning, algorithms such as the SVM [6], the Bayesian network [7], and the decision tree were also applied to text classification. More recently, the rapid development of artificial intelligence (AI) has driven new progress in text classification, making it an important branch of NLP within AI. Neural networks, such as convolutional neural networks (CNN) and deep neural networks (DNN) [10], are increasingly applied to text classification. This paper uses a DNN to classify grid text.

Natural language processing (NLP)
Natural language processing (NLP) is a technology that enables computers to understand the natural language used by humans, supporting functions such as human-computer interaction and language translation [11]. It is an important branch of artificial intelligence and spans three fields: artificial intelligence, linguistics and computer science. From the perspective of linguistics, languages can be divided into formal languages and natural languages. A formal language is a human-created language that can be processed by machines, such as a programming language or chemical notation. Languages that evolved naturally, such as human languages, are natural languages. Compared with formal languages, they lack a fixed format and contain many ambiguous and similar statements, so they cannot be understood directly by machines. Understanding sentences, learning expressions, and choosing the right context in human language are all highly complex tasks for machines. NLP is the discipline that studies how to process natural language to achieve human-computer interaction.
Technologies related to NLP include named entity recognition, part-of-speech tagging, dependency parsing, text semantic similarity analysis, document analysis, text classification and machine translation. Text classification is the focus of this paper. Because natural language evolved from conversations among large numbers of people over a long time, it is an "empirical" phenomenon that can be described with statistical models. Therefore, by collecting large-scale real language text to form a corpus and analyzing that corpus with statistical techniques, language text can be classified. Text classification is generally divided into three steps: text preprocessing, text feature extraction and text classification.

DNN
DNN stands for deep neural network [13]. The neurons of a DNN are fully connected and contain no convolutional units. The depth of a DNN refers specifically to the number of layers in the network. The original neural network, called the perceptron, had only an input layer, an output layer, and one hidden layer, and could not perform complex operations. To overcome this shortcoming, the multilayer perceptron with multiple hidden layers was later invented. Multilayer perceptrons use functions such as the sigmoid to simulate the response of neurons to excitation and use the backpropagation algorithm for training. However, as the number of layers increases, the vanishing gradient problem becomes very serious, and the optimization is more likely to fall into a local optimum. To address the local optimum problem, Hinton adopted a pre-training method, which allowed neural networks to reach 7 layers [13]. In addition, using ReLU and similar functions in place of the sigmoid alleviates the vanishing gradient problem; together these constitute the basic form of the current DNN. The structure of a three-layer DNN model is shown in Figure 1.
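The effect of the activation function on gradient size can be illustrated with a small numerical sketch. It is a simplification that assumes all weights equal 1 and every layer sees the same pre-activation value, so the backpropagated factor is just the local derivative raised to the number of layers; the function names are illustrative only.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z: float) -> float:
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chained_grad(grad_fn, z: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights assumed to be 1)."""
    return grad_fn(z) ** depth

# Seven layers, matching the depth mentioned in the text.
sigmoid_factor = chained_grad(sigmoid_grad, 0.0, 7)  # 0.25 ** 7, about 6.1e-05
relu_factor = chained_grad(relu_grad, 1.0, 7)        # stays at 1.0
```

Even in the sigmoid's best case the factor shrinks by at least a factor of four per layer, whereas an active ReLU unit passes the gradient through unchanged.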
The output of layer $L_2$ is used as the input of layer $L_3$ and passed into the activation function. Letting $a^{(2)}_0 = 1$, the output of each neuron in layer $L_3$ is given by Equation 1.
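For reference, the standard fully connected formulation of such a layer output can be written as below. The symbol names are assumptions chosen for illustration and are not taken from the original figure: $a^{(2)}_j$ is the output of neuron $j$ in layer $L_2$, $n_2$ is the number of neurons in $L_2$, $w^{(2)}_{ij}$ is the weight from neuron $j$ in $L_2$ to neuron $i$ in $L_3$, and $f$ is the activation function. With the constant input fixed at 1, the weight $w^{(2)}_{i0}$ plays the role of the bias.

```latex
a^{(3)}_i = f\!\left( \sum_{j=0}^{n_2} w^{(2)}_{ij}\, a^{(2)}_j \right),
\qquad a^{(2)}_0 = 1
```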

Method construction
In this section, a text classification model based on DNN neural network is proposed to solve the problems of text classification in the power grid industry. The model is mainly divided into three parts: preprocessing stage, feature extraction and text classification. Figure 2 shows the three-layer framework of the model.

Preprocessing stage
In the text classification process, because grid data is highly diverse, most of the stored data is unstructured and cannot be processed directly by a computer. The text must therefore be preprocessed and transformed into a form that a computer can recognize. This paper uses the ICTCLAS Chinese lexical analysis system of the Chinese Academy of Sciences for word segmentation and uses a vector space model (VSM) to represent the text. A vector space model represents data as vectors, which reduces the difficulty of text classification. Consider a text $X$ in the document set $Y$, where $Y$ contains $N$ documents. From the vector space model, $X' = \{(x_i, w_i)\}_{i=1}^{n}$ is obtained for text $X$, where $n$ is the number of words in $X$, $x_i$ is the $i$-th word in $X$, and $w_i$ is the feature weight corresponding to $x_i$. The weight is given by Equation 2:

$$w_i = \mathrm{tf}_i \times \log\frac{N}{m_i} \qquad (2)$$

where $\mathrm{tf}_i$ is the number of occurrences of $x_i$ in document $X$ and $m_i$ is the total number of texts in set $Y$ in which $x_i$ appears. After normalization, $w_i$ is given by Equation 3:

$$w_i = \frac{\mathrm{tf}_i \times \log(N/m_i)}{\sqrt{\sum_{j=1}^{n} \left(\mathrm{tf}_j \times \log(N/m_j)\right)^2}} \qquad (3)$$
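The weighting step can be sketched as follows, assuming the weight takes the common TF-IDF form (term count times $\log(N/m)$) followed by length normalization. The tokenized documents are hypothetical placeholders; in the paper the tokens would come from ICTCLAS word segmentation, which is not reproduced here.

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """Return {word: normalized weight} for one tokenized document.

    Assumes `doc` is a member of `corpus`, so every word in it has a
    document frequency of at least 1.
    """
    n_docs = len(corpus)
    tf = Counter(doc)  # occurrences of each word in this document
    doc_freq = {w: sum(1 for d in corpus if w in d) for w in tf}
    raw = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:  # every word appears in every document
        return {w: 0.0 for w in raw}
    return {w: v / norm for w, v in raw.items()}

# Hypothetical, already segmented texts.
corpus = [
    ["transformer", "maintenance", "ticket"],
    ["server", "maintenance", "schedule"],
    ["customer", "service", "ticket"],
]
weights = tfidf_vector(corpus[0], corpus)
```

Words that occur in fewer documents, such as "transformer" here, receive a larger weight than words shared across several documents.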

Feature extraction
The text vector $X' = \{(x_i, w_i)\}_{i=1}^{n}$ is obtained from the preprocessing module. Suppose the category set corresponding to the document set $Y$ is $C = \{c_j\}_{j=1}^{l}$, where $l$ is the number of categories. Because the volume of grid data is very large, the number of features after preprocessing is often considerable. Classifying the text directly, without any further processing, would reduce both the accuracy and the efficiency of the classification model. We therefore extract features from $X'$, selecting the feature vectors that are most conducive to classification, to improve the efficiency and accuracy of the subsequent classification. This paper uses improved mutual information (MI) for feature selection and extraction. Standard MI only considers the relationship between a word $x_i$ and a text category $c_j$; this paper considers that the choice of features is also influenced, to a certain extent, by the frequency of $x_i$ in the entire text set $Y$. The improved MI is given by Equation 4, whose terms are the proportion of documents in $Y$ that belong to $c_j$, a control threshold, and the proportion of texts containing the word $x_i$ that belong to the text category $c_j$. The latter is given by Equation 5, in which $h$ is the number of texts belonging to the category $c_j$, Su is the total number of words belonging to the category $c_j$, and the remaining term is the number of all words belonging to the category. A suitable feature selection threshold is then set, and words whose mutual information values exceed the threshold are selected as the text features for classification.
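The thresholding step can be illustrated with standard pointwise mutual information, $\log\left(P(x_i \mid c_j) / P(x_i)\right)$, estimated from document counts. This sketch deliberately uses the standard form rather than the improved MI of Equation 4, and the documents, labels and threshold value are hypothetical.

```python
import math

def mutual_information(word, category, docs, labels):
    """Standard pointwise MI: log(P(word | category) / P(word))."""
    in_cat = [d for d, y in zip(docs, labels) if y == category]
    with_word = sum(1 for d in docs if word in d)
    both = sum(1 for d in in_cat if word in d)
    if both == 0:
        return float("-inf")  # word never occurs in this category
    p_word_given_cat = both / len(in_cat)
    p_word = with_word / len(docs)
    return math.log(p_word_given_cat / p_word)

def select_features(docs, labels, threshold):
    """Keep words whose best per-category MI exceeds the threshold."""
    vocab = sorted({w for d in docs for w in d})
    cats = sorted(set(labels))
    return [w for w in vocab
            if max(mutual_information(w, c, docs, labels) for c in cats) > threshold]

# Hypothetical segmented texts and category labels.
docs = [
    ["transformer", "maintenance"],
    ["breaker", "maintenance"],
    ["customer", "complaint"],
    ["customer", "maintenance"],
]
labels = ["equipment", "equipment", "service", "service"]
selected = select_features(docs, labels, threshold=0.5)
```

In this toy data "maintenance" appears in both categories, so its score stays below the threshold and it is dropped, while the category-specific words are kept.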

Text Categorization
Assume that the feature vector of text $X$ obtained after the above preprocessing and feature extraction is $X'' = \{(x_i, w_i)\}_{i=1}^{k}$, where $k \le n$. The text classification model is trained on a training set of texts with known category labels. This paper uses a DNN as the text classification model. The pseudocode of the algorithm is defined as follows. The input is the text set $Y$; each text $X$ in $Y$ is preprocessed and its features are extracted to obtain the feature vector $X''$, which is fed to the input nodes of the DNN. The output is the classification prediction set $C_Y$ produced by the classification model for all texts in $Y$.
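The prediction step of such a classifier can be sketched as a forward pass through fully connected layers. The sketch assumes ReLU on the hidden layers and a softmax output, and it uses small hand-picked weights purely for illustration; in practice the weights would be learned by backpropagation on the labeled training set, which is omitted here.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, bias, v):
    """One fully connected layer: weights[i][j] links input j to output i."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def predict_proba(layers, features):
    """Forward pass: ReLU on hidden layers, softmax on the last layer."""
    a = features
    for weights, bias in layers[:-1]:
        a = relu(dense(weights, bias, a))
    weights, bias = layers[-1]
    return softmax(dense(weights, bias, a))

def classify(layers, features):
    """Index of the most probable category."""
    probs = predict_proba(layers, features)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical network: 3 input features, 2 hidden units, 2 categories.
layers = [
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, -1.0], [-1.0, 2.0]], [0.0, 0.0]),
]
label_a = classify(layers, [0.9, 0.1, 0.0])
label_b = classify(layers, [0.0, 0.3, 0.8])
```

Applying `classify` to the feature vector of every text in the set yields the list of predicted categories.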

Experimental verification
The experimental data comes from the substation information system provided by the State Grid. According to the relevant requirements of the power grid, the data is divided into five categories: grid equipment maintenance operation tickets, information system maintenance schedules, information system maintenance work tickets, information system maintenance operation tickets, and customer service work tickets. The total number of texts is 3000, an average of 600 per category. In each category, 70% of the texts are selected as the training set for training the model, and the remaining 30% are used as the test set to evaluate the performance of the classification model. After training and testing, the results show that the average classification accuracy reaches more than 91%.
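The per-category 70/30 split described above can be sketched as follows. The integer labels stand in for the five ticket categories, and the function name, seed and shuffling are assumptions for illustration; the text does not specify how the texts in each category were chosen.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), keeping train_fraction of each class for training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_fraction)
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx

# 5 categories with 600 texts each, as in the experiment (3000 texts in total).
labels = [category for category in range(5) for _ in range(600)]
train_idx, test_idx = stratified_split(labels)
```

With 600 texts per category this gives 420 training and 180 test texts in each category, or 2100 and 900 overall.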

Conclusion
This paper is motivated by the need of the grid system to build a data and service standard system covering the entire chain of the energy and power business, expand the data model, and support unified management of all data. To retrieve and extract information from the massive electronic texts in the power grid system, this paper constructs a DNN classification model to classify grid text information. The validity of the method is verified by experiments on data from the substation information system provided by the State Grid.