Resident user load classification method based on improved Gaussian mixture model clustering

Traditional clustering methods are limited by the "hard assignment" of cluster membership, and on massive data sets they struggle to deliver clustering efficiency and clustering accuracy at the same time. To address this, a load classification method that combines Gaussian mixture model (GMM) clustering with principal component analysis (PCA) is proposed. The load data are first reduced in dimensionality by PCA and then fed into the GMM clustering algorithm to classify large-scale load datasets. The method is applied to the loads in the Canadian AMPds2 public dataset and compared with K-Means, Gaussian mixture model clustering and other methods. The results show that the proposed method not only classifies loads more effectively and finely, but also saves computational cost and improves computational efficiency.


Introduction
Load classification refers to the processing of load data from a large number of power devices to extract typical load profiles [1], which can be applied to electricity consumption behavior analysis, load forecasting, tariff setting, demand-side response, etc. Accurate and effective load classification supports the precise marketing of power supply departments, and it is of great significance to real-time dispatching, improving the economic efficiency of enterprises, and saving energy [2][3]. Load classification has been a research hotspot in recent years. Existing methods can be divided mainly into artificial neural network methods [4][5][6][7] and cluster analysis methods [3][4]. Among the latter, load curve cluster analysis generally uses K-Means clustering. The papers [8][9][10][11][12][13][14] use clustering algorithms such as K-Means, K-Medoids, and hierarchical clustering to classify and identify daily load curves, electricity consumption trajectories, and typical electricity consumption patterns of commercial and residential customers, and all of them realize load classification effectively. The paper [15] compared several clustering algorithms and found that partitional clustering was more efficient but less accurate. This is also true for hierarchical clustering. Both hierarchical and partitional clustering suffer from the problem of "hard assignment" (i.e., each point is assigned to exactly one cluster), and neither adapts well to large-scale data sets: hierarchical clustering has high time complexity, and partitional clustering may fall into local optima. Gaussian mixture model (GMM) clustering instead assigns cluster membership according to probability, which is called "soft classification". It avoids the "hard assignment" problem, retains more information, and gives better clustering quality on large-scale data sets.
The K-Means algorithm assumes that each cluster is approximately spherical in shape and approximately equal in size. In contrast, GMM clustering allows more flexible cluster shapes. The Gaussian mixture model has been widely used in speech recognition, image recognition and other fields, but has rarely been applied to the classification of load states. Therefore, to improve clustering quality and clustering efficiency and to save computational cost, this paper proposes a hybrid PCA-GMM load state classification method that combines the "soft classification" and flexible cluster shapes of GMM clustering with PCA. The dimensionality of the load data is reduced by PCA, and the reduced data are then used as the input to the GMM clustering algorithm, which yields an accurate and effective classification of load states. To illustrate the effectiveness of the proposed method, PCA-GMM clustering and other methods are applied to the AMPds2 dataset [19]. The results show that the proposed method has better clustering quality and clustering efficiency and effectively reduces the computational cost.

BIC
The number of GMM components (clusters) is estimated using BIC-based model selection. BIC is defined in Equation (7), and the optimal number of clusters is found by evaluating candidate cluster numbers in turn:
$C_{BIC} = n_p \ln(n) - 2 \ln(L)$    (7)
In Equation (7): $C_{BIC}$ is the BIC value; $n_p$ is the number of free parameters of the model; $n$ is the number of samples; $L$ is the maximum value of the estimated model likelihood function.
Assuming that the errors or disturbances of the model are normally distributed, BIC can be expressed (up to an additive constant) as:
$C_{BIC} = n \ln(S_{RSS} / n) + n_p \ln(n)$    (8)
In Equation (8): $S_{RSS}$ is the residual sum of squares of the estimated model. $C_{BIC}$ is an increasing function of $S_{RSS}$ and $n_p$; that is, larger residuals and additional unknown parameters both increase $C_{BIC}$. Therefore, in judging the number of load classes, the model with the lowest BIC value is preferred.
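The BIC-based choice of the number of clusters can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes scikit-learn's GaussianMixture (whose bic method returns the criterion for a fitted model), the helper name select_components_by_bic is hypothetical, and the data are synthetic blobs standing in for load data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_components_by_bic(X, candidates, seed=0):
    """Fit one GMM per candidate cluster count; return (best_k, bic_by_k)."""
    bic_by_k = {}
    for k in candidates:
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=3, random_state=seed).fit(X)
        bic_by_k[k] = gmm.bic(X)  # lower BIC is preferred
    best_k = min(bic_by_k, key=bic_by_k.get)
    return best_k, bic_by_k


# Synthetic stand-in data (assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

best_k, bic_by_k = select_components_by_bic(X_demo, range(1, 7))
print("BIC per candidate:", bic_by_k)
print("selected number of clusters:", best_k)
```

Because each added component increases $n_p$, the penalty term grows with the number of clusters, so a larger model is selected only when it raises the likelihood enough to offset the penalty.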

DBI
The Davies-Bouldin index (DBI), also known as the classification adequacy index, evaluates the quality of a clustering result.
Suppose there are m time series, and these time series are clustered into n clusters. The m time series form the input matrix X, and the number of clusters n is passed to the algorithm as the parameter N. The index is calculated as:
$DBI = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}$
The formula measures the mean, over all clusters, of each cluster's maximum similarity to any other cluster, so a lower DBI indicates a better clustering. In the formula, $S_i$ is the average distance between the data in cluster i and the centroid of that cluster, which represents the degree of dispersion of the time series in cluster i, and $M_{ij}$ is the distance between cluster i and cluster j.
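The DBI calculation can be sketched directly in NumPy. The sketch assumes Euclidean distances and takes the distance between two clusters to be the distance between their centroids, as in the usual definition of the index; the function name davies_bouldin_index and the four-point data set are illustrative only.

```python
import numpy as np


def davies_bouldin_index(X, labels):
    """Mean over clusters of the worst-case (S_i + S_j) / M_ij ratio.

    Requires at least two clusters with distinct centroids.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # S_i: average distance of the members of cluster i to its centroid.
    S = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # M_ij: distance between the centroids of clusters i and j (assumption).
    M = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    n = len(ids)
    ratios = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                ratios[i, j] = (S[i] + S[j]) / M[i, j]
    return ratios.max(axis=1).mean()


# Two tight clusters whose centroids are 10 apart; each S_i equals 1.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
dbi = davies_bouldin_index(points, [0, 0, 1, 1])
print(dbi)  # (1 + 1) / 10 = 0.2
```

Running the function for several candidate cluster numbers and keeping the number with the smallest value mirrors how the index is used for the non-GMM methods.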

Principal component analysis (PCA)
PCA is a dimensionality reduction method that maps multi-dimensional data onto a small number of new axes, the principal components, chosen so that as much of the variance in the data as possible is retained. The low-dimensional output is easy to visualize and conveys the important relationships in the data intuitively, and PCA achieves good dimensionality reduction quality at a high reduction rate. Each point in the low-dimensional space obtained by PCA represents an object; in this paper, each point obtained after dimensionality reduction represents the load characteristic of one day.
The PCA algorithm flow is as follows:
Input: n-dimensional sample set $D = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$ and the number of dimensions n' to reduce to.
• Centralize all samples: $x^{(i)} \leftarrow x^{(i)} - \frac{1}{m} \sum_{j=1}^{m} x^{(j)}$.
• Calculate the covariance matrix $XX^T$, where the columns of X are the centralized samples.
• Perform eigenvalue decomposition on the matrix $XX^T$.
• Take out the eigenvectors $(w_1, w_2, \ldots, w_{n'})$ corresponding to the n' largest eigenvalues. After all the eigenvectors are standardized, they form the eigenvector matrix W.
• For each sample $x^{(i)}$ in the sample set, transform it into a new sample $z^{(i)} = W^T x^{(i)}$.
• Obtain the output sample set $D' = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$.
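The steps above can be sketched in NumPy as follows. The sketch stores samples as rows rather than columns, so the matrix written as $XX^T$ in the text appears as Xc.T @ Xc in the code; the function name pca_reduce and the small data set are illustrative assumptions.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X (m samples, n features) onto n_components axes.

    Returns (Z, W): Z has shape (m, n_components), W has shape (n, n_components).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # centralize all samples
    scatter = Xc.T @ Xc                  # n x n matrix (samples are rows here)
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                # unit-norm eigenvectors of the largest eigenvalues
    Z = Xc @ W                           # row-wise form of z = W^T x
    return Z, W


# Points lying exactly on the line y = 2x, so one component captures everything.
X_line = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Z, W = pca_reduce(X_line, 1)
print(Z.shape)       # (4, 1)
print(np.abs(W[:, 0]))  # direction proportional to (1, 2), up to sign
```

The sign of each eigenvector is arbitrary, so projections may be mirrored between runs or libraries without affecting the clustering that follows.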

Data set description
This article uses the AMPds2 data set released by Canadian scholars in 2016. The data were collected from a household in Vancouver, Canada, with a total living area of 199 m² and a basement area of 100 m². The data set covers the period from April 1, 2012 to April 1, 2014, with 1051200 records in total. The sampling interval of the power data is 1 min; these data include the main meter data and sub-metering data such as active power, reactive power, voltage and current. The sampling interval of the external environment data is 1 h; these data include temperature, air pressure, wind speed, etc.

Data pre-processing
Errors, losses, and anomalies can occur in the data during measurement, recording, and transmission, so the data must be pre-processed. First, the maximum value of the total load of the AMPds2 dataset was estimated, and values exceeding twice the estimated maximum were treated as outliers and deleted. Then the missing values were filled with the arithmetic mean of the data in the two hours before and after. Finally, a sliding filtering algorithm was used to reduce noise in the data: a data buffer is created in RAM to store N sampled data in order; each time a new datum is read, the earliest one collected is discarded and the arithmetic mean of the N data in the buffer is taken as the filtering result. An overview of the AMPds2 data set is shown below.
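The three pre-processing steps can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, missing values are represented as NaN, the fill step averages the valid samples within half_window positions on either side (120 positions would correspond to two hours at 1 min sampling), and the filter averages over however many samples the buffer holds until it has filled.

```python
from collections import deque

import numpy as np


def remove_outliers(load, estimated_max):
    """Mark values above twice the estimated maximum as missing (NaN)."""
    load = np.asarray(load, dtype=float).copy()
    load[load > 2.0 * estimated_max] = np.nan
    return load


def fill_missing(load, half_window):
    """Replace each NaN by the mean of the valid neighbours around it."""
    filled = load.copy()
    for i in np.flatnonzero(np.isnan(load)):
        lo, hi = max(0, i - half_window), min(len(load), i + half_window + 1)
        neighbours = load[lo:hi]
        valid = neighbours[~np.isnan(neighbours)]
        if valid.size:  # leave the gap if no valid neighbour exists
            filled[i] = valid.mean()
    return filled


def sliding_mean_filter(load, n):
    """Sliding filter: mean of a buffer holding the n most recent samples."""
    buffer = deque(maxlen=n)  # the oldest sample is dropped automatically
    out = []
    for value in load:
        buffer.append(value)
        out.append(sum(buffer) / len(buffer))
    return np.array(out)


raw = [1.0, 2.0, 100.0, 4.0, 5.0]           # 100.0 is an obvious spike
cleaned = fill_missing(remove_outliers(raw, estimated_max=10.0), half_window=1)
smoothed = sliding_mean_filter(cleaned, n=2)
print(cleaned)   # [1. 2. 3. 4. 5.]
print(smoothed)  # [1.  1.5 2.5 3.5 4.5]
```

A larger buffer length N smooths more strongly but also flattens short load peaks, so N is a trade-off between noise reduction and preserving the shape of the daily curve.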

Comparison of GMM and other cluster classification results
GMM clustering is a "soft classification" method with flexible cluster shapes. Here we visually compare K-Means clustering and GMM clustering. To determine the optimal number of clusters, GMM clustering often uses the BIC criterion, while K-Means clustering most commonly relies on a clustering validity index. Among such indices, the DBI is simple and varies over a small range, so it is the more widely used. Therefore, all clustering methods involving GMM use the BIC criterion to determine the number of clusters, and the other clustering methods use the DBI.
Figure 4 shows that GMM clustering divides the electricity load into 13 categories, while K-Means clustering yields only 3. In terms of the number of classes, the GMM classification is finer than that of K-Means and contains more information. Combining Figure 4 with the date distribution of the GMM and K-Means clustering results (Figure 5), the K-Means results reveal only that the electricity consumption of the building is low with small fluctuations in summer and fluctuates greatly in spring and autumn. The GMM results provide further detail: for example, in July and August, when the temperature is highest, and in January and December, when it is lowest, the electricity consumption habits are the same and the fluctuations are relatively stable.
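The difference between the hard assignment of K-Means and the soft classification of GMM can be illustrated with a small sketch. It assumes scikit-learn's KMeans and GaussianMixture and uses two overlapping synthetic blobs rather than the AMPds2 data; the query point midway between the blobs is chosen only to make the contrast visible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): two overlapping 2-D blobs.
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
blob_b = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(200, 2))
X = np.vstack([blob_a, blob_b])
midpoint = np.array([[2.0, 0.0]])  # lies between the two blobs

# Hard assignment: K-Means returns exactly one cluster label for the point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_label = kmeans.predict(midpoint)

# Soft classification: GMM returns one membership probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_membership = gmm.predict_proba(midpoint)

print("K-Means label:", hard_label)
print("GMM membership probabilities:", soft_membership)
```

For a point near a cluster boundary the GMM probabilities are spread over both clusters, which is the additional information that a single K-Means label cannot express.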

Load classification method based on PCA-GMM
The specific process of the PCA-GMM classification method is shown in Figure 6. First, the pre-processed 24-dimensional load data are mapped to a 2-dimensional space by PCA, as shown in Figure 7. Each of the 728 points in the figure represents a data object (that is, the information of one 24-dimensional daily load curve). The data clusters after dimensionality reduction are elliptical in shape, indicating that GMM-based clustering is more suitable for classifying these building loads than other clustering methods. Second, the reduced-dimensional load data are input into the GMM clustering algorithm. The figure shows that the number of clusters is 6. In general, the identified clusters are compact internally and clearly separated from one another. The clustering effect is good, and the regions occupied by the clusters correspond to the date, time, and seasonal distribution characteristics of the load.
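The overall PCA-GMM flow can be sketched as follows. This is a minimal sketch under several assumptions that the text does not confirm: the 24-dimensional daily curves are taken to be hourly means of the 1 min data, scikit-learn's PCA and GaussianMixture stand in for the authors' implementation, the function names to_daily_profiles and pca_gmm_classify are hypothetical, and the input is synthetic daily curves rather than AMPds2.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def to_daily_profiles(minute_power):
    """Average 1-min samples into hourly means: shape (days, 24). Assumed step."""
    return np.asarray(minute_power, dtype=float).reshape(-1, 24, 60).mean(axis=2)


def pca_gmm_classify(daily_load, max_clusters=6, seed=0):
    """Reduce daily curves to 2-D with PCA, then cluster with a BIC-selected GMM."""
    Z = PCA(n_components=2).fit_transform(daily_load)
    models = [GaussianMixture(n_components=k, n_init=3, random_state=seed).fit(Z)
              for k in range(1, max_clusters + 1)]
    best = min(models, key=lambda m: m.bic(Z))  # lowest BIC wins
    labels = best.predict(Z)
    # Typical daily load curve of each class: mean of its member curves.
    profiles = {int(c): daily_load[labels == c].mean(axis=0)
                for c in np.unique(labels)}
    return Z, labels, profiles


# Synthetic daily curves (assumption): afternoon peak, night peak, flat low load.
rng = np.random.default_rng(0)
hours = np.arange(24)
shapes = [1.0 + 2.0 * np.exp(-((hours - 15) / 2.0) ** 2),
          1.0 + 2.0 * np.exp(-((hours - 21) / 2.0) ** 2),
          np.full(24, 0.5)]
daily_load = np.vstack([s + rng.normal(scale=0.05, size=(60, 24)) for s in shapes])

Z, labels, profiles = pca_gmm_classify(daily_load)
print("points in 2-D space:", Z.shape)
print("number of classes found:", len(profiles))
```

Clustering in the 2-dimensional space means each GMM component needs only a 2 x 2 covariance matrix instead of a 24 x 24 one, which is where the saving in computation and storage comes from.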
The following are the classification results of the PCA-GMM method. From the analysis of Figures 8 and 9, we can conclude the following:
• From the shape of the typical daily load curves, the peak of the user's daily load curve over the two years is mainly in the afternoon or early morning. The curves of class 1 and class 4 are relatively similar in shape: the peak of the all-day electricity load is concentrated in the afternoon, and the magnitude of class 1 is higher. The curves of class 3 and class 5 are similar in shape: the all-day load is at a high level, and the peak period of electricity consumption is at night. The all-day load of class 2 is lower.
• From the perspective of working days and non-working days, the user did not show obvious patterns.
• From the distribution over seasons and temperature changes, class 1 is distributed throughout the year, with little electricity consumption at night and large consumption during the day, which conforms to the normal electricity consumption pattern of users; class 2, with the lowest electricity load, is mainly distributed in spring and autumn, when the climate is mild and the demand for air conditioning is low; class 3 is distributed in March, April, October, and November, with a large load throughout the day; class 4, like class 1, is distributed throughout the year, with a low load at night and a heavy load during the day; class 5, with large electricity consumption at night, is distributed in August and September, when the temperature is higher.
The above analysis shows that the classification result of the PCA-GMM method conforms to the way the electricity load varies with season and temperature, and that it captures the correlation and dependence between the data. It contains more information than the K-Means result, and the clustering results are more refined.

Conclusion
In this paper, a load classification method based on Gaussian mixture model clustering and principal component analysis is proposed, with residential load classification taken as an example. The pre-processed large-scale load dataset is reduced in dimensionality by PCA and then input to the GMM clustering algorithm to achieve load pattern classification. The results of the case study show that:
- Compared with K-Means, GMM and other methods, the proposed PCA-GMM method classifies the load patterns of each type of building both effectively and accurately, with more refined classification results, higher clustering efficiency and much lower storage requirements.
- Using this method to classify load patterns will help the power supply sector get a better grasp of users' load characteristics, formulate reasonable tariff policies, and propose energy saving strategies. It can also provide more targeted services to various users and motivate consumers to actively participate in the demand side management system. At the same time, it is of crucial importance in guiding the rolling planning of the power grid, real-time dispatching, and reliability assessment of operation planning.
MATEC Web of Conferences 355, 02024 (2022) ICPCM2021 https://doi.org/10.1051/matecconf/202235502024