An entity recognition algorithm based on Hadoop for power big data

With the arrival of the big data era, traditional entity recognition technologies can no longer complete data preprocessing effectively, because power grid data is large in scale and complex in volume and type. Hadoop technologies, which have risen in recent years, handle big data processing better. This paper therefore proposes a Hadoop-based entity recognition algorithm for power big data. It applies a discretization algorithm that selects the discrete points with higher information accuracy, and it puts forward a discretization evaluation index. Finally, we carry out entity recognition on the monitoring data of wind turbines on the Hadoop platform. Experimental results show that the proposed algorithm performs well in the correctness and breakpoint-number experiments and achieves a good speedup ratio. The proposed algorithm can be applied to entity recognition for power big data.


Introduction
With the advance of information and communication technology, digitization and informatization have penetrated deeply into every aspect of our lives. Informatization in electric power enterprises has also developed rapidly, and the demand for extracting useful power information from large-scale data processing keeps growing. How to capture power big data in support of enterprise decision-making is an important problem for grid enterprises in the big data era, and it arises at the data preprocessing stage. Entity recognition has always been a key technology in data quality management research and plays a vital role in improving the quality of data preprocessing. In power big data, complex data types and inconsistent data are common. Entity recognition technology will therefore find wider application in power big data in the future.
Entity recognition for power big data accurately identifies the different names or attributes that belong to the same entity in a given data set and clusters them. It allows each entity to be identified so that it is more valuable in power grid decision-making. It is different from. In [1], the authors propose a big data entity recognition algorithm based on parallel machines. This algorithm uses n-grams to solve the problem of the same object having different properties, and it achieves good results for large-scale entity recognition in a short time. Many traditional entity recognition technologies exist, focusing mainly on text in the form of phrases or on relational data. Research on entity recognition for other types of data has only just started. Reference [2] presents a two-stage associated entity recognition model that fully considers both the pattern characteristics and the attribute characteristics of an entity. It also proposes an incremental algorithm that verifies and corrects the recognition results iteratively to ensure their accuracy.
Existing methods focus mainly on the effectiveness of recognition. Few studies address the efficiency of entity recognition for large-scale data. Most of these methods target tuples and strings, whereas discrimination methods for XML data, graph data, and unstructured data have received less attention [3][4][5][6]. In addition, these algorithms lack both a theory for evaluating the quality of big data entity recognition results and public test data sets.
Hadoop is a distributed infrastructure platform for processing large data. At its base is the Hadoop Distributed File System (HDFS), which is responsible for storing files across all the nodes of the Hadoop cluster. We present a big data entity recognition algorithm based on information accuracy (ERBIA) in the context of electric power big data. First, the algorithm calculates the distribution of the similarity degree of the class attribute and the values of an attribute under a discretization scheme. Then ERBIA selects the discrete points with high information accuracy. Next, we propose an improved discretization evaluation index to make the final decision and obtain the results. Finally, we run multiple sets of comparative experiments on real data sets and random data on the Hadoop platform, and obtain a processing scheme with better effectiveness and efficiency for power big data.

Description of the entity recognition discretization scheme for power big data
The chief problem in data processing is the expression of knowledge. To facilitate data integration and improve the efficiency of data preprocessing, this paper adopts a contingency table for the formal definition of big data attributes. Each group of data is formally defined by the following components:
the data, a non-empty finite set, which we call the attribute domain;
the range of values of the information function $f$, which holds the effective information;
$C$, the attribute domain, with $C \neq \varnothing$;
the information function associated with the table, where $f_a$ is the information function of attribute $a$.
According to the above definition, the power big data set $S$ can be expressed as a list of $N$ attribute-domain elements, each relation in the power big data set being a property value. Attribute $i$ has a value $a_i$ drawn from the value set $V$, and its domain is $C_i$. In the set $S$, the values $a_i$ can be expressed as: We assume that the data set has a continuous attribute $a$ and that each continuous attribute has a discretization scheme $R$. The set of threshold values divides the attribute domain into non-intersecting intervals over the range of values of attribute $a$. We arrange the values in order to form the corresponding breakpoint set $\{c_0, c_1, \ldots, c_n\}$. Because the breakpoint set and the discretization scheme correspond to each other, either of the two can be used to express the attribute discretization. From this correspondence we can establish the discretization scheme $D$ of an attribute $a$, shown in Table 1.
Table 1. Discretization scheme D corresponding to an attribute a.

Attribute category; Discretization intervals; Decision attribute

From the above definition, the proposed discretization algorithm for entity recognition on big data sets essentially consists of choosing appropriate continuous intervals for the attribute sets of the data. This avoids a problem of traditional entity recognition methods, which usually either rely on the pattern features of a single entity or measure the correlation of the attributes of a single entity type in the data; the difficulty lies in integrating the two effectively. We therefore introduce a big data entity recognition algorithm based on information accuracy on the Hadoop platform.
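The correspondence between a breakpoint set and a discretization scheme described in this section can be illustrated with a minimal Python sketch. It is not the ERBIA implementation: the function name `discretize`, the half-open interval convention, and the sample temperatures and breakpoints are all assumptions made for illustration.

```python
from bisect import bisect_right


def discretize(values, breakpoints):
    """Map each continuous value to the index of the interval it falls in.

    breakpoints is the ordered set {c_0, ..., c_n}. Interval k is assumed to be
    [c_k, c_{k+1}), with the last interval closed on the right.
    """
    if len(breakpoints) < 2 or sorted(breakpoints) != list(breakpoints):
        raise ValueError("need at least two breakpoints in ascending order")
    n_intervals = len(breakpoints) - 1
    codes = []
    for v in values:
        if not breakpoints[0] <= v <= breakpoints[-1]:
            raise ValueError(f"{v} lies outside the attribute range")
        k = bisect_right(breakpoints, v) - 1
        # The upper bound c_n belongs to the last interval.
        codes.append(min(k, n_intervals - 1))
    return codes


# Hypothetical temperature readings (Celsius) and breakpoint set.
temperatures = [41.5, 55.0, 63.2, 70.0]
breaks = [40.0, 55.0, 65.0, 70.0]
print(discretize(temperatures, breaks))
```

Because the integer codes are fully determined by the breakpoint set, storing either the breakpoints or the resulting intervals is enough to describe the scheme.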

Big data entity recognition algorithm based on information accuracy on the Hadoop platform
Traditional attribute discretization algorithms are mainly used for decision-making in areas such as knowledge discovery, and they evaluate the effect of a discretization chiefly by means of information entropy. Information entropy measures the amount of information, so it can delimit the discretization intervals finely and make the information in each interval more distinct. The disadvantage of an evaluation index based on information entropy is that, although the information in each class is described more concisely, the intervals are differentiated too finely and the scale of the computation becomes too large. This in turn affects the algorithm's efficiency and hardware consumption and hampers subsequent data processing [7][8][9]. Therefore, in view of the attributes of large-scale power data, this paper proposes a big data entity recognition algorithm based on information accuracy (ERBIA), built on information theory.

Definition of information accuracy
The essence of attribute discretization for power big data is to set demarcation points within the range of values of the attribute, so that the attribute domain is divided into intervals. Each interval of attribute values is then represented by an integer value. It has been proved in [10] that the importance values of the attributes are statistically independent of each other. In the information table, the total accuracy of the discrete points can be defined as: When the amount of data tends to infinity, all $Q_i$ are equal, and we denote their common value by $Q$. And

Improved discretization evaluation index
Having defined information accuracy, we put forward an improved discretization evaluation index based on information entropy. We use it to measure how effectively a discretization scheme discretizes a given attribute.
Traditional information entropy is defined as follows: In it, $|X|$ denotes the cardinality of $X$, and $n_i$ is the number of instances of attribute $i$.
In the discretization scheme of this paper, the information entropy on each interval $d$ is expressed as: If a discrete point $d$ divides the collection $X$ into two subsets, the information entropy of point $d$ with respect to $X$ can be defined as: The proposed improved discretization evaluation index is defined as follows: A greater value of $H(X)$ means that the continuous attribute discretization preserves the information more accurately and that the division produced by the discretization scheme is of higher quality.
In this paper, we use $\log_2(n)$ as an operator to keep the number of discrete intervals within a reasonable range as far as possible, so that the interval division is neither too coarse nor too fine. When the range $X$ is zero, the distribution over all class intervals is even, and $H(X)$ takes its minimum value.
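To make the entropy-based reasoning above concrete, the following minimal Python sketch computes the textbook Shannon entropy of a set of class labels and the weighted entropy of the two subsets produced by a candidate cut point. It assumes the standard definitions and does not reproduce the improved evaluation index of this paper; the function names and the sample data (using the 00 and 11 decision codes) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy -sum((n_i/|X|) * log2(n_i/|X|)) over the class counts n_i."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_entropy(values, labels, d):
    """Weighted entropy of the two subsets obtained by cutting the values at d."""
    left = [c for v, c in zip(values, labels) if v <= d]
    right = [c for v, c in zip(values, labels) if v > d]
    total = len(labels)
    return (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)


def best_cut(values, labels):
    """Candidate cut point with the lowest weighted entropy (textbook criterion)."""
    candidates = sorted(set(values))[:-1]
    if not candidates:
        raise ValueError("need at least two distinct values to place a cut")
    return min(candidates, key=lambda d: split_entropy(values, labels, d))


# Hypothetical temperatures with decision codes 00 (normal) and 11 (unqualified).
temps = [50.0, 52.0, 60.0, 62.0]
codes = ["00", "00", "11", "11"]
print(entropy(codes), best_cut(temps, codes))
```

A cut that separates the classes cleanly drives the weighted entropy to zero, which is why purely entropy-driven criteria tend to keep adding breakpoints unless a term such as $\log_2(n)$ limits the number of intervals.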

Experimental analysis
To validate the effectiveness of the proposed information-accuracy-based big data entity recognition algorithm, we take one company's online monitoring data of grid wind turbines as an example and analyze the algorithm in terms of breakpoint number, correctness, and speedup ratio. In the Eclipse environment, we apply the ERBIA algorithm to discretize the attribute data and show the result in Table 3. For the same set of data, the ERBIA discretization algorithm requires the same computational effort as the conventional one. The conventional algorithm, however, computes an overall average, which makes the evaluation of individual attributes deviate more coarsely and causes the decision results to diverge from actual operation.

Speedup ratio
The speedup ratio measures the performance and effect of parallelization. It is defined as the ratio of the running time on a single node to the running time on the cluster. This paper uses a test data set of 2 GB and runs it on clusters of 2, 4, 6, and 8 nodes. The experimental data is shown in Table 5. As the number of nodes increases, the running time declines significantly and the operation speed of the algorithm improves. We conclude that the proposed algorithm obtains a good speedup and is well suited to big data environments.
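The speedup ratio defined above is a simple quotient, sketched below in Python. The running times are hypothetical placeholders chosen only to show the calculation; they are not the measurements reported in Table 5.

```python
def speedup(single_node_seconds, cluster_seconds):
    """Speedup ratio: single-node running time divided by cluster running time."""
    if single_node_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("running times must be positive")
    return single_node_seconds / cluster_seconds


# Hypothetical running times in seconds; NOT the values of Table 5.
single_node = 1200.0
cluster_runs = {2: 660.0, 4: 350.0, 6: 250.0, 8: 200.0}

for nodes, seconds in sorted(cluster_runs.items()):
    print(f"{nodes} nodes: speedup = {speedup(single_node, seconds):.2f}")
```

A speedup that grows with the node count while staying below the node count itself is the usual pattern for a parallel job with some fixed coordination overhead.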

Conclusion
Traditional entity recognition algorithms can only identify simple relationships, such as naming. With the arrival of the power big data era, the problem of relationships between complex data attributes in big data entity recognition has become pressing [11][12]. In this paper we proposed the ERBIA algorithm to address the shortcomings of existing entity recognition algorithms. We proposed a discretization scheme based on information accuracy and put forward an improved discretization evaluation index to evaluate the algorithm. Finally, we carried out experiments on a Hadoop cluster. The experimental results showed the validity of the algorithm and its advantages in the number of discrete breakpoints and in speedup ratio. Our next focus is the analysis of redundancy and correlation in large data sets. We expect the preprocessing of large data sets to provide support for final decision-making in the power grid.

Table 3. Part of the monitoring data attribute values discretized by the ERBIA algorithm.

So the first problem to solve is the selection of demarcation points. In this paper, the standard for selecting classification points is defined as information accuracy. We assume that there is an information table $S$, and the information accuracy is taken over the number of values of $a_i$.

Table 2. Part of the values of the monitoring data attributes.

This article selects operation monitoring data from December 2015, in which several operating parameters serve as class attributes. As decision conditions, we choose six different temperatures of the wind turbines as input data to measure the effect of the discretization. They are . To make the results easy to present, the decision results are expressed with three codes: 00 for normal, 10 for qualified, and 11 for unqualified. Part of the captured monitoring data attribute values is given in Table 2 (in degrees Celsius).

Table 4. Comparison of the two algorithms in breakpoint number.

For the same set of data, we first process the data with the CAIM discretization algorithm. CAIM is a global, static, top-down supervised discretization algorithm whose goal is to maximize the correlation between class and attribute while minimizing the number of breakpoints. It therefore has an advantage in the number of breakpoints, so we compare the proposed ERBIA algorithm with CAIM on that measure. Table 4 shows that the ERBIA algorithm produces significantly fewer breakpoints.
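The CAIM criterion is not spelled out in the text. As a hedged illustration of the baseline, the sketch below evaluates the criterion in the form it is commonly stated in the discretization literature: for each interval, the square of the largest class count is divided by the interval total, and the results are averaged over the intervals. The function name `caim` and the two sample count tables are assumptions for illustration, not data from the experiments.

```python
def caim(quanta):
    """CAIM value of a discretization scheme.

    quanta[r][i] is the number of samples of class i that fall in interval r.
    The value is the mean over intervals of max_r**2 / M_r, where max_r is the
    largest class count in interval r and M_r is the interval total.
    """
    n_intervals = len(quanta)
    if n_intervals == 0:
        raise ValueError("at least one interval is required")
    total = 0.0
    for row in quanta:
        interval_total = sum(row)
        if interval_total:  # empty intervals contribute nothing
            total += max(row) ** 2 / interval_total
    return total / n_intervals


# Hypothetical counts for two classes under two candidate schemes.
separated = [[4, 0], [0, 4]]   # two intervals, each containing a single class
merged = [[4, 4]]              # one interval mixing both classes
print(caim(separated), caim(merged))
```

Dividing by the number of intervals penalizes schemes with many breakpoints, which is the property that makes CAIM a natural reference point for a breakpoint-count comparison.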

Table 5. Speedup for different numbers of cluster nodes.