Performance Evaluation of Hadoop-based Large-scale Network Traffic Analysis Cluster

As Hadoop has gained popularity in big data era, it is widely used in various fields. The self-design and self-developed large-scale network traffic analysis cluster works well based on Hadoop, with off-line applications running on it to analyze the massive network traffic data. On purpose of scientifically and reasonably evaluating the performance of analysis cluster, we propose a performance evaluation system. Firstly, we set the execution times of three benchmark applications as the benchmark of the performance, and pick 40 metrics of customized statistical resource data. Then we identify the relationship between the resource data and the execution times by a statistic modeling analysis approach, which is composed of principal component analysis and multiple linear regression. After training models by historical data, we can predict the execution times by current resource data. Finally, we evaluate the performance of analysis cluster by the validated predicting of execution times. Experimental results show that the predicted execution times by trained models are within acceptable error range, and the evaluation results of performance are accurate and reliable.


Introduction
With the rapid development of cloud computing, Hadoop [1] as an advanced big data processing tool, has become the first choice for many researchers and companies.In order to analyze the massive traffic data efficiently, we developed a Hadoop-based Large-scale Network Traffic Analysis Cluster (HBLSNTAC).It consists of one master (running NameNode, ResourceManager), one backup master (also work as client access server, running SecondaryNameNode) and nine slaves (running DataNode, NodeManager, ApplicationMaster).HBLSNTAC stores massive network traffic data collected from operators of several cities in China in HDFS.Moreover, users run various off-line statistical analysis applications using the massive data on the cluster.These applications are aimed at analyzing the basic statistical characteristic of network traffic, such as the time distribution or geographical distribution of network behavior of mobile phone users.
The performance status would greatly affect the work efficiency of HBLSNTAC.A poorer performance means a lower work efficiency, which would slow down the whole analysis work schedule.If we can reliably evaluate performance of HBLSNTAC, the cluster manager could get an intuitive overview of performance status and then make timely adjustment to keep the cluster steady and in high efficiency.However, the performance of HBLSNTAC is affected by various factors, such as hardware device, software versions, Hadoop configuration parameters which are the main factor of performance, and so on.The total amount of Hadoop configuration parameters is excessive, let alone the complex interactive relationships among these parameters.So it is hard to accurately evaluate the performance of HBLSNTAC.
Since several aspects of factors would influence the performance of HBLSNTAC, it is very difficult for normal user to collect all the information of influenced factors.So normal user barely knows the actual performance of HBLSNTAC.Meanwhile, the execution time of the off-line analysis application is directly depended on the performance of HBLSNTAC.A poorer performance would cause a longer execution time.If normal user launches an application at improper timing, for example, when the performance is poor, it would end up lengthening the execution time of his own application, and further worsening the work efficiency of HBLSNTAC.Obviously, that is a dilemma we avoid to get in.
Thus, there is a pressing need for a performance evaluation system to evaluate the performance of HBLSNTAC.This paper aims to design and implement a performance evaluation system to HBLSNTAC.The main contributions of this paper are: z We propose a performance evaluation system of HBLSNTAC, and the evaluation results are verified to be reliable.Thus we could have a quantitative understanding of performance and then guide user behaviors.z Since the Hadoop configuration parameters are excessive and complex, we focus on the resource data, which is the essence of configuration parameters.And besides average value, we add three statistical dimensions of resource data, i.e., maximum, minimum and standard deviation.z We apply principle component analysis and multiple linear regression in modeling phase.After training by historical data, the prediction of this model is relatively accurate with an average error less than 10%.The remainder of this paper is organized as follows.Firstly, we introduce the related work in section 2. Then we propose the performance evaluation system and detailed introduce the design of the system in section 3. The experimental results and corresponding analysis are presented in section 4. Finally, we draw the conclusion and give the outlook of future work.

Related Work
Hadoop is a popular open source project.It has shown its significance in big data and cloud computing areas.Since increasing users and companies have applied Hadoop, the study and research of its performance has become more important and globalized.
Performance Optimization is a research focus for Hadoop cluster, because it could improve the work efficiency of cluster.Microsoft IT SES Enterprise Data Architect Team proposes various optimization techniques to maximize performance of Hadoop MapReduce jobs in [2].Paper [3] proposes a statistic analysis method to identify the relationship among workload characteristics, configuration parameters and Hadoop cluster performance.They both optimize Hadoop configuration parameters in different situations, therefore shorten the execution times of applications to improve the efficiency.
Performance evaluation is another research hot topic.HiBench [4] is a representative and comprehensive benchmark suite for Hadoop, which contains ten applications, classified into four categories.It quantitatively evaluates and characterize the Hadoop development.Paper [5] proposes a performance evaluation for Hadoop cluster on virtual environment, it builds a Hadoop performance model and then examine several influence factors of cluster performance.
In summary, the existed researches of performance of Hadoop cluster mostly focus on Hadoop configuration parameters.However, the configuration parameters are excessive and complex.So in this paper, we study on resource data, which is the essence of Hadoop configuration parameters.Especially, we build a performance model and identify the relationship between resource data and execution times of applications, then we evaluate performance of HBLSNTAC by the validated performance model.

The Design of Performance Evaluation System
In this section, we introduce the design of performance evaluation system.The framework of system is presented in Figure 1.Firstly we set execution times of three benchmark applications as the performance benchmark of HBLSNTAC, and select total of 40 metrics of resource data, which cover the main resources of HBLSNTAC, such as CPU, memory, disk and network.Then we model historical data of execution times of three benchmark applications and resource data through principal component analysis and multiple linear regression.After that, we predict the execution times by the trained models.Finally, we evaluate the performance of HBLSNTAC based on the predicting of execution times.

Performance Benchmark
Hadoop applications can be classified into four categories, CPU intensive, memory intensive, disk I/O intensive and network I/O intensive [6].Since applications could never finish successfully unless there is sufficient memory resource, we exclude memory intensive applications.Moreover, Network I/O is not a bottleneck in our experiment, because the scale of cluster is small and all nodes are connected to the same switch.As a result, we select a CPU intensive application, i.e., Pi; a disk I/O intensive application, i.e., TeraSort; and a both CPU and disk I/O intensive application, i.e., WordCount.Furthermore, we use the execution time of three selected Hadoop benchmark applications, Pi, WordCount and TeraSort, as the performance benchmark of HBLSNTAC.
More importantly, the main function of HBLSNTAC is running off-line applications to analyze the basic statistical characteristic of network traffic.Both WordCount and TeraSort applications have statistical approaches as counting and sorting separately.So it makes them much more suitable to evaluate the actual performance of HBLSNTAC than other applications.

Statistical Resource Data
Hadoop has more than 200 configuration parameters [7], and the interactive relationships among parameters are complex.Although many parameter researches have proposed various parameter tuning strategies, but with the difference of Hadoop cluster environments, not all the strategies are suitable for any environment.So it is unachievable to propose a universal full-cover parameter tuning strategy for all environments.Thus we can't evaluate performance of HBLSNTAC through configuration parameters without that universal parameter tuning strategy.
Resource data of server, such as CPU, memory, disk and network, is the underlying data of Hadoop cluster.Application on Hadoop cluster would not be launched if not enough resources are allocated to it.Furthermore, most of configuration parameters are specified by resources.For instance, most of configuration parameters of YARN are used for allocations of CPU and memory resources.In other words, resource data is the essence of configuration parameters.The parameters tuning essentially means resources reallocation.
Thus, to conduct a more simplified and essential research of performance of HBLSNTAC, we study on resource data of server instead of configuration parameters.Based on the study of Hadoop resources, combining with accumulated management experience of HBLSNTAC, we select 7 categories of resource data, total of 40 metrics.The list of metrics is shown in Table 1.Since the resource of Disk, Network and HDFS, are subdivided into two smaller categories, we now have ten small categories.These ten small categories are the common resource data, which are studied frequently within Hadoop researches.Most of the researches only study the average value of resource data.It is generally known that average value does not reflect all of the statistical characteristics of the data.Therefore, for each small category, we add three statistical dimensions, maximum, minimum and standard deviation besides average.Then we have 40 metrics totally.Now, we not only study on the central tendency, but also study on the dispersion degree of resource data, which would improve the reliability of modeling analysis.

Table 1. List of Resource Data Metrics
Here, the customized Load metric is used to describe the extent of a single server load exceeds the average server load of whole cluster, which would be expressed as i Load .It is easy to understand that for metrics of CPU, Memory, Disk, Network and Open files, the larger the value is, the heavier the server load is.For server i , assume that n R is the value of one of metrics, avg R is the average value of the same metric of the cluster.Then we have n L , i.e., the value of the customized load of a single metric, which could be computed by (1) , and i Load , i.e., the value of the customized server load of server i , which could be computed by (2) .

Modeling Analysis
The purpose of modeling analysis is to figure out the relationship between performance benchmark, i.e., the execution times of three benchmark applications, and the recourse data.After training the model by historical data, we could predict the execution times by current resource data.

Principal Component Analysis
Principal Component Analysis (PCA) uses dimensionality reduction technique to make a set of possibly correlated original indicators into relatively fewer comprehensive and linearly uncorrelated indicators by linear combination, and retain most of the information of original target [8].
If we have the original data matrix as: where m is the number of sites, n is the number of variables in the index system, is the value of index variable j of site i .
(1) Standardization of original data Usually the variables have different units of measurement (i.e., pounds, feet, rate, etc), that would cause different weight for each variable in the analysis.In order to make each variable to receive equal weight, it is necessary to standardize the original data.
(2) Normalized correlation matrix of data where ij r is the correlation coefficient of index i and index j .

E
are the parameter of model and H is the deviations of model.

Performance Evaluation
After verifying the trained model reliable, we could evaluate the performance of the HBLSNTAC by the prediction of the three execution times.We propose a score criterion by theses three execution times, then we select a typical practical application, whose execution time is a reflection of the practical performance of HBLSNTAC.By studying the relationship between the score and the execution time of practical application, we can find out several patterns between them.Finally we could evaluate the performance of HBLSNTAC by these patterns, thus to help us manager the HBLSNTAC in a better way.

Experiment Setup
We setup a simple classical Hadoop cluster that consists of one master (running NameNode and ResourceManager) and three slaves (running DataNode, NodeManager and ApplicationMaster).All servers are connected to a 100Mbps Ethernet switch.Hardware and software configurations of the servers in this cluster are the same, as shown in Table 2.

Modeling and Prediction
We run three benchmark applications for 25 times separately.The input of modeling is the resource data within 3 minutes before application get launched.And the output of modeling is the predictions of execution times of applications.It means that we would finally predict the execution time of an application by the resource data within 3 minutes before this application get launched.

Principal Component Analysis
We get total of 40 metrics of resource data, meaning 40 dimensions for data analysis, that is a quiet large number of dimensions.Moreover, there would be correlations between these metrics.Therefore, we would reduce dimensions from total 40 dimensions to much more fewer uncorrelated dimensions by Principal Component Analysis.The main results of three benchmark applications by Principal Component Analysis using SPSS software [10] are shown in Table 3, Table 4, Table 5 and Figure 2.
For Pi application, we can find out from Table 3 and Figure 2a that the cumulative contribution rate of first 6 Principal Components (PCs) is 98.422%, meaning they retain 98.422% information of original 40 metrics.Here, we have reduced dimensions from 40 metrics to 6 PCs, which we would express as Similarly, for WordCount application, the first 5 PCs expressed as

Multiple Linear Regression
The purpose of MLR is to model the relationship between principal components and execution time of benchmark application, then make a prediction of execution time by a new set of principal components.
As for Pi application, the multiple linear regression equation of principal components as  As for TeraSort application, the multiple linear regression equation of principal components as

Prediction Verification
After training model by historical data, we need to verify the model by current data.We run three benchmark applications for 8 times separately to verify the prediction result of the trained models.The results are shown in Figure 3, Figure 4 and Figure 5.The relative error between predicted execution time and actual execution time of Pi, WordCount and TeraSort are 7.62%, 7.97% and 8.18% respectively.It means that these models can accurately predict the execution time of benchmark applications.The trained models, whose prediction average errors are all less than 10%, are verified to be reliable.As a result, they can be used for performance evaluation of HBLSNTAC.

Performance Evaluation
We convert the prediction of each application into a centesimal score, then set the average of the three scores as the evaluation score of the performance of HBLSNTAC.The main function of HBLSNTAC is running off-line statistical applications.And a longer execution time of application means a poorer performance of HBLSNTAC.So we select a typical practical application, whose goal is to find out the top 3 visited websites of 6 time segments of the whole day for each user, and set its execution time as a reflection of the performance of HBLSNTAC.Then study the relationship between the evaluation score and execution time of this practical application.

One Single Practical Application
Relationship between evaluation score and execution time of one single practical application is shown in Figure 6.Roughly, the lower the score is, the longer the execution time is, thus the poorer the performance is, and vice versa.More closely, we can see that there is turning point near 60.When score is higher than 60, a decrease of 5 points of score would cause an increase of 50 seconds of execution time.When score is lower than 60, the same decrease of 5 point of score would cause an increase of 250 seconds of execution time, which is 5 times larger than the increase when score is higher than 60.It suggests that when the performance of HBLSNTAC reaches at a certain poor degree, such as 60 in this case, it is not a good choice to launch an application, because that would cause a much larger increase of execution time.

Multiple Practical Applications
Since the HBLSNTAC is running multiple application parallelly for most of the time, we further study the relationship between evaluation score and total execution time of multiple practical applications.We launch two practical application successively, and launch the second application before the first application completed.The circumstance before the first application launched is the same as one single practical application, so the evaluation score here is predicted before the second application launched.The result is shown in Figure 7.The trend of curve in Figure 7 is about the same as in Figure 6.However, comparing with Figure 6, the turning point is advanced from 60 to 65.It suggests that the greater the number of running applications is, the sooner that certain poor degree is coming.When score is higher than 65, a decrease of 5 points of score would cause an increase of 630 seconds of execution time.When score is lower than 65, the same decrease of 5 points of score would cause an increase of 11210 seconds of execution time, which is more than 17 times larger than the increase when score is higher than 65.Comparing with the circumstance of single practical application, the number of running applications doubles from 1 to 2, however, the increasing range is more than three times from an increase of 5 times to an increase of 17 times.It suggests that the increasing speed of total execution time, in order words, the worsening speed of performance, is more faster than the increasing speed of number of running applications.

Summary
The evaluation results provide a much more straightforward understanding of performance of HBLSNTAC.We used to only have a qualitative understanding of the performance, but with no idea of how exactly the patterns are.Now, we can get a quantitative understanding of performance with the clear evaluation scores.With accumulation of more experimental evaluation results, we can propose a comprehensive user guide strategy.It can guide using behavior of common users, such as suggest the proper timing for users to launch applications and forbid users from launching applications at improper timing, so that to keep HBLSNTAC in a continual good performance status.

Conclusion and Future Work
In this paper, we have proposed a performance evaluation system of HBLSNTAC.This system associates execution time of three benchmark applications and statistical resource data of servers, then model and analyze these two parts by principal component analysis and multiple linear regression.An evaluation score of performance of HBLSNTAC would be marked by the trained models.
The experimental results show that the prediction of execution time of application is reliable with average error less than 10%.Furthermore, the evaluation score can quantitatively reflect the performance of HBLSNTAC, so it could be further used for user behavior guidance.The performance evaluation system is validated to be feasible and reliable.
As for the future work of next stage, firstly, with the expansion of application scenarios of HBLSNTAC, we would include more kinds of applications, such as machine learning and web searching, besides Pi, WordCount and TeraSort.Moreover, the modeling analysis approaches are both linear in this paper, we would change and choose nonlinear modeling approaches to analyze the nonlinear relationship between the execution time of application and resource data.Then study the pros and cons of linear and nonlinear modeling approaches.

Figure 1 .
Figure 1.Framework of Performance Evaluation System . The component matrix would not be shown here due to space limitation.

Figure 2 .
Figure 2. Scree Plots of Pi, WordCount and TeraSort (from left to right)

Figure 6 .
Figure 6.Relationship between Evaluation Score and execution time of One Single Practical Application

Figure 7 .
Figure 7. Relationship between Evaluation Score and Total execution time of Two Practical Applications The eigenvalues of matrix R are ^ǹ O i "

Table 2 .
Configurations of Experimental Servers

Table 3 .
Total Variance Explained of Pi

Table 4 .
Total Variance Explained of WordCount

Table 5 .
Total Variance Explained of TeraSort