Research on task offloading based on deep reinforcement learning in mobile edge environment

With the rapid development of Internet technology and mobile terminals, users' demand for high-speed networks is increasing. Mobile edge computing introduces a distributed caching approach to cope with the impact of massive data traffic on communication networks, in order to reduce network latency and improve the quality of user service. In this paper, a deep reinforcement learning algorithm is proposed to solve the task offloading problem of multi-service nodes. Experiments are carried out on the iFogSim simulation platform with the Google Cluster Trace dataset. The results show that the task offloading strategy based on the DDQN algorithm performs well in terms of energy consumption and cost, which verifies the application prospects of deep reinforcement learning algorithms in mobile edge computing.


Introduction
With the rapid development of smart devices, cloud computing has been unable to meet the growing demand for data processing. On the one hand, because all business data must be transmitted through the core network, it places a heavy load on the core network during peak periods. On the other hand, because of the relative distance between smart devices and cloud data centers, there is a large network delay, which seriously affects the quality of service of delay-sensitive applications. In response to these problems, MEC (mobile edge computing) provides services through an open platform that integrates network, computing, storage, and core application capabilities on the side close to the physical entity or data source. MEC executes applications on the edge side, so network services respond faster, meeting the industry's basic needs for real-time processing, smart applications, and security and privacy protection. Therefore, this paper designs a task offloading strategy based on deep reinforcement learning that can effectively reduce the energy consumption of each service node and the cost paid by users [1].

MEC structure
An MEC system usually consists of three parts: the cloud data center layer, the edge server layer, and the terminal device layer. As shown in Figure 1, the terminal device layer includes various types of sensors, computers, and mobile phones with a certain amount of processing capability; the edge server layer divides all edge servers into regions by relative distance, with each region containing a heterogeneous edge server of moderate performance; the cloud data center layer contains a large number of high-performance physical servers that form a cluster to serve users [2]. When a task from a mobile terminal needs to be offloaded, the whole mobile application is first divided, by a sorting algorithm, into sub-tasks that exchange data with each other but can be executed independently. Some of these sub-tasks must be executed locally; the others can be offloaded and are usually computation-heavy data processing tasks. In this paper, all computing devices in the MEC are represented by (D, B, G), where D represents a cloud data center with massive computing resources, composed mainly of one physical machine cluster; B = {b_i | i ∈ [1, m]} denotes the set of m edge servers; and G = {g_j | j ∈ [1, n]} denotes the set of n mobile terminal devices [3].
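To make the (D, B, G) notation concrete, the following minimal Python sketch models the three-layer cluster; the class and field names (Device, MECCluster, mips, idle_fraction, max_power) are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    mips: float           # processing capacity (hypothetical unit)
    idle_fraction: float  # k_i: share of full-load power drawn when idle
    max_power: float      # P_i^max: power draw at full load, in watts

@dataclass
class MECCluster:
    cloud: Device                                         # D: the cloud data center
    edges: list[Device] = field(default_factory=list)     # B: m edge servers
    mobiles: list[Device] = field(default_factory=list)   # G: n mobile terminal devices

    @property
    def devices(self):
        # All 1 + m + n computing devices, in the order used by the models below
        return [self.cloud] + self.edges + self.mobiles
```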

Performance metrics
In this paper, an energy consumption model and a cost model are designed to evaluate the results of the task offloading strategy based on deep reinforcement learning. The models are as follows.

Power model: For the total energy consumption generated by all computing devices, including smartphones and remote servers, over a certain period of time, this paper first defines the power model of the i-th computing device as

$$P_i(u) = k_i P_i^{max} + (1 - k_i) P_i^{max} u \quad (1)$$

where $k_i$ represents the percentage of energy consumed by the computing device in the idle state, $P_i^{max}$ represents the energy consumed by the i-th computing device at full load, and $u$ is the CPU utilization. In addition, the load of a computing device changes constantly over time. Assuming that $u_i(t)$ is a function giving the CPU utilization of the i-th device per unit time, the total energy consumption of the (1+m+n) computing devices from time 0 to time t is

$$E(t) = \sum_{i=1}^{1+m+n} \int_0^t P_i(u_i(\tau)) \, d\tau \quad (2)$$

Cost model: Users need to pay for the computing resources provided by remote servers. This paper uses a dynamic price model based on resource surplus: when less of a resource remains, its price is higher. Users therefore tend to choose service nodes with a lower unit price as the offloading target, which reduces user cost while improving resource utilization [4]. The cost accrued by a device in one charging interval is

$$CC = UT \times RPM \times LU \times TM \quad (3)$$

where CC indicates the cost that the current device has generated, UT indicates the unit interval time for fee calculation, RPM indicates the price of a unit of computing resource, LU indicates the fraction of its computing resources that the current device has used, and TM indicates the total computing resources of the current device. Since the computing resources of the local device belong to the user and are not charged as a service provided by the operator, the total cost over all (1+m) charging devices is

$$Cost = \sum_{i=1}^{1+m} CC_i \quad (4)$$
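To show how these models combine, the following Python sketch implements formulas (1)-(4) as reconstructed above. It builds on the MECCluster sketch from the previous section; the function names, the discretized integral, and the utilization traces u[i][k] are assumptions for illustration.

```python
def power(dev, u):
    # Formula (1): linear power model, an idle floor plus a utilization-proportional part
    return dev.idle_fraction * dev.max_power + (1 - dev.idle_fraction) * dev.max_power * u

def total_energy(cluster, u, t, dt=1.0):
    # Formula (2): energy of all 1+m+n devices over [0, t], with the integral
    # approximated by a discrete sum over steps of length dt
    devs = cluster.devices
    return sum(power(devs[i], u[i][k]) * dt
               for i in range(len(devs))
               for k in range(int(t / dt)))

def device_cost(ut, rpm, lu, tm):
    # Formula (3): cost in one interval = interval time * unit price * used resources
    return ut * rpm * lu * tm

def total_cost(intervals):
    # Formula (4): sum over the 1+m charging devices (cloud + edge servers only;
    # mobile terminals are user-owned and free). `intervals` holds one
    # (ut, rpm, lu, tm) tuple per charging device.
    return sum(device_cost(*args) for args in intervals)
```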

Algorithm design

DDQN algorithm
The DDQN algorithm is a value iteration algorithm that combines deep learning and reinforcement learning. It uses an experience pool and a target network to address the non-stationary distribution problem and the model instability problem [5]. At the same time, in order to minimize the influence of overestimation, the DDQN algorithm splits the work of selecting the optimal action and estimating its value between two networks [6]. The target value and loss function are therefore calculated as

$$y_t = r_t + \gamma Q(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta^-)$$

$$L(\theta) = \mathbb{E}\left[(y_t - Q(s_t, a_t; \theta))^2\right]$$

where $Q(s_t, a_t; \theta)$ is the output of the current network MainNet, used to calculate the Q value of the current state-action pair, and $Q(\,\cdot\,; \theta^-)$ is the output of the target network TargetNet, used to calculate the target Q value after taking the optimal-value action.
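The following PyTorch sketch shows how this double-Q target and loss can be computed; main_net, target_net, and the batch layout are assumptions for illustration, not the paper's implementation.

```python
import torch

def ddqn_loss(main_net, target_net, batch, gamma=0.9):
    # `batch` is assumed to be (states, actions, rewards, next_states, done flags)
    # sampled from the experience pool; both nets map states to per-action Q values.
    s, a, r, s_next, done = batch
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; θ)
    with torch.no_grad():
        # MainNet selects the action, TargetNet evaluates it:
        # this split is what reduces overestimation relative to DQN.
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * q_next * (1 - done)                 # target value y_t
    return torch.nn.functional.mse_loss(q, y)               # L(θ) = E[(y_t - Q)^2]
```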

MDP model
State space: In order to jointly consider the characteristics of the sub-tasks in the MEC and of the server resources, this paper defines the state space at time step t as $s_t = (M, u_1^t, u_2^t, \ldots, u_i^t, \ldots, u_{2+m}^t)$, where M indicates the CPU resources required to deploy the current sub-task and $u_i^t$ represents the CPU utilization of the i-th computing device at time step t. Since a sub-task can only be executed on the local mobile terminal device or on a remote server, only (2+m) computing devices need to be considered for each sub-task [7]: one cloud data center, one local device, and m edge servers.
Action space: In order to offload each sub-task to an appropriate computing device, the action space of the deep reinforcement learning agent is specified to correspond to the set of available computing devices, and $x_{ij} \in \{0, 1\}$ is used to indicate whether the i-th sub-task is offloaded to the j-th computing device [8]. Therefore, for a cluster containing (2+m) computing devices, the action space for sub-task i is $a_t = (x_{i1}, x_{i2}, \ldots, x_{i(2+m)})$.
Reward function: This paper considers the energy consumption and cost of all equipment in the MEC to evaluate the quality of an offloading decision. The reward function is calculated as

$$r_t = -\left(\alpha \sum_i E_i + \beta \sum_i cost_i\right)$$

where $E_i$ and $cost_i$ represent the energy consumption and cost of the i-th device, respectively, and α and β are the weights they occupy, with α + β = 1. In addition, since the mobile terminal device belongs to the user privately, modules deployed on local devices incur no fee; $\sum_i cost_i$ therefore covers only the total cost of the remote services provided by the cloud data center and the edge servers [9]. A small sketch of this reward calculation is given below.
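This sketch computes the reward as reconstructed above, under the assumed sign convention that lower energy and cost are better, so the agent receives the negated weighted sum; the function and parameter names are illustrative.

```python
def reward(energies, costs, remote_ids, alpha=0.5, beta=0.5):
    # energies[i]: energy consumption of device i; costs[i]: cost of device i.
    # remote_ids: indices of the cloud data center and edge servers; local
    # devices are user-owned, so their cost is excluded from the sum.
    assert abs(alpha + beta - 1.0) < 1e-9  # the weights sum to 1
    e_total = sum(energies)
    c_total = sum(costs[i] for i in remote_ids)
    return -(alpha * e_total + beta * c_total)
```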

Simulation experiment

Simulation environment
This paper uses iFogSim to simulate the MEC-based task offloading problem, and compares the energy consumption and cost of each algorithm in a large-scale heterogeneous cluster to assess the quality of the offloading decisions [10]. The algorithms implemented include Mobile, a policy that prioritizes the local device; Edge, a policy that prioritizes edge servers; DQN, a policy based on deep reinforcement learning; and DDQN, a policy based on improved deep reinforcement learning. The simulated device cluster mainly includes one cloud data center, 60 edge servers, and a number of mobile terminal devices, in which all edge servers are divided equally into 10 regions, and each mobile terminal device can send only one application offload request at a time. The simulated device configurations and average performance-to-power ratios are set with reference to SPEC (Standard Performance Evaluation Corporation) data; the larger the ratio, the less energy the device consumes for the same performance. Detailed information is shown in Table 1.
In order to simulate the offloading process after a mobile application has been split into different sub-tasks, this paper constructs the sub-task dependency graph of an online shopping application. As shown in Figure 4, the application mainly consists of sub-tasks such as edge, front end, login, accounts, orders, shipping, catalogue, cart, and payment, where edge must be executed on the mobile device and the remaining sub-tasks can be offloaded or not according to the decision. Online shopping applications usually place very high demands on offloading decisions: on the one hand, computation-heavy data processing modules need to be offloaded to remote servers to minimize the energy consumption of the mobile device; on the other hand, computing modules need to stay as close as possible to their data sources to reduce the delay caused by data transfer between modules. The parameters of all deep reinforcement learning models are set uniformly to ensure the fairness of the training results: the experience memory size is M=100000, the learning rate of the SGD optimizer is α=0.005, the batch size is K=32, the update period of the target network parameters is C=50, and the discount factor is γ=0.9.
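For reference, the training hyperparameters listed above can be collected into a single configuration object; the key names below are illustrative, only the values come from the paper.

```python
DRL_CONFIG = {
    "replay_memory_size": 100_000,  # M: experience pool capacity
    "optimizer": "SGD",
    "learning_rate": 0.005,         # α: SGD learning rate
    "batch_size": 32,               # K: mini-batch size
    "target_update_period": 50,     # C: steps between TargetNet parameter copies
    "discount_factor": 0.9,         # γ: reward discount
}
```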

Analysis of experimental results
In order to reflect the changes in the resource utilization of mobile applications over different time periods, this paper uses the Google Cluster Trace dataset to simulate the change of each module's utilization over time. In addition, to ensure that the strategies generated by each deep reinforcement learning algorithm are efficient, this paper first selects part of the Google dataset to train each neural network, and then uses other data to test the trained network models and compare the generality and efficiency of each strategy. Figure 3 shows the loss on mini-batch samples for the DQN and DDQN algorithms during training; the smaller the loss, the better the network model fits. The figure shows that, with the same parameters, the DDQN algorithm converges faster and is more stable than the DQN algorithm. Figures 4 and 5 are the resource consumption graphs generated by each algorithm during task offloading. As the number of applications increases, the offloading strategies generated by all algorithms increase in energy consumption and cost. Among them, the strategy based on the Mobile algorithm achieves good results in cost but only average results in energy consumption; the strategy based on the Edge algorithm performs averagely in both energy consumption and cost; because the DQN algorithm converges slowly and is unstable, its offloading strategy performs worst in both energy consumption and cost when training is limited to 300 iterations; and the offloading strategy based on the DDQN algorithm performs better than the DQN algorithm in terms of both energy consumption and cost. It consumes the least energy of all the algorithms and is second only to Mobile in terms of cost.

Summary
In this paper, deep reinforcement learning is applied to solve the task offloading problem in mobile edge computing. Experiments on the iFogSim simulation platform compare the strengths and weaknesses of each algorithm in terms of energy consumption and cost. According to the experimental results, the offloading strategy based on the DDQN algorithm has the best overall performance compared with the Mobile, Edge, and DQN algorithms, which demonstrates the application prospects of deep reinforcement learning algorithms in mobile edge computing.