A Reinforcement Learning Approach to Call Admission Control in HAPS Communication System

The large changes of link capacity and of the number of users caused by the movement of both the platform and the users in a communication system based on a high altitude platform station (HAPS) result in a high handover dropping rate and reduced resource utilization. To solve these problems, this paper proposes an adaptive call admission control strategy based on a reinforcement learning approach. The goal of this strategy is to maximize the long-term gains of the system by introducing cross-layer interaction and service degradation. To admit different traffic types adaptively, the access utility of handover traffic and new call traffic is designed for the different states of the communication system. Numerical simulation results show that the proposed call admission control strategy can enhance bandwidth resource utilization and the performance of handover traffic.


Introduction
A wireless communication system based on HAPS is a new type of wireless communication system currently being studied worldwide. Due to the influence of stratospheric winds and the limitations of position- and attitude-keeping techniques, a HAPS is usually in a quasi-stationary state, also known as perturbation. The perturbation of the HAPS causes dynamic changes of the cells, forcing cell-edge users to hand over frequently, which brings additional challenges to the call admission control strategy. The purpose of studying call admission control strategies is to ensure low new call blocking probability and handover dropping probability without degrading system performance. Since an overlapping cellular coverage area can improve system performance and increase system flexibility [1], Li uses the location information of the HAPS and the users to construct the overlapping area on which the handover strategy relies, which helps identify and block new calls that may cause a handover failure, resulting in a near-zero handover dropping rate [2]. These methods can adapt to dynamic changes of the platform position, but cannot deal with changes of platform attitude.
A call admission control scheme for adaptive resource allocation is introduced in [3], which reduces the new call blocking rate and the handover dropping rate through service degradation. However, since fixed admission parameters are set for new and handover calls, it adapts poorly to changes of the system state. In [4], the system resource allocation is optimized to maximize the number of users and improve system performance. Because this method only deals with the current call request and system state, its adaptability to dynamic states is poor.
The call admission control problem can be modelled as a Markov decision process (MDP) or a semi-Markov decision process (SMDP), which can adapt to dynamic changes of the system state [5]. Methods such as policy iteration and dynamic programming can solve this kind of problem [6], but current research assumes that the link capacity is constant. In fact, due to signal fading, interference, user mobility and the use of adaptive modulation and coding (AMC), the link capacity is time-varying in actual communication networks (such as Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE), etc.) [7]. The quasi-stationary state of the HAPS also causes the link capacity to change in a HAPS communication system.
Although the above research can guarantee handover performance to a certain extent, the dynamic changes of the users in a cell and of the link capacity caused by perturbation and user mobility are considered inadequately, as are the requirements of different traffic types. In addition, it is difficult to obtain the exact model parameters of the MDP. To solve these problems, we introduce reinforcement learning, which does not need exact model parameters, to maximize the long-term benefits of the system while handling different traffic types. Combining cross-layer interaction with the idea of service degradation, we put forward an adaptive call admission control strategy based on reinforcement learning.

System model
Thornton pointed out that the carrier-to-interference ratio changes greatly from cell centre to edge, because the power rolls off quickly in the edge zone [8]. Adaptive modulation and coding can therefore achieve very good spectral efficiency. Meanwhile, the call admission control strategy should not only consider the change of link capacity, but also guarantee the QoS of different traffic types.
The multi-beam antennas of the HAPS produce elliptical beams that are projected onto the ground as approximately circular cells. For a deviation $\theta$ of the user direction from the beam boresight, the beam directivity gain $D(\theta)$ can be approximated as
$$D(\theta) = G_{\max}\,\max\{\cos^{n}\theta,\; S_f\},$$
where $G_{\max}$ is the directivity gain at the centre of the cell, $n$ is the main-lobe decay rate and $S_f$ is the sidelobe parameter. Assuming that the centre of cell $i$ has coordinates $(x_i, y_i)$ and that user $q$ in cell $i$ has coordinates $(x_q, y_q)$, as shown in Figure 1, the deviation $\theta_{i,q}$ from user $q$ to the beam boresight of cell $i$ can be expressed as
$$\theta_{i,q} = \arctan\!\left(\frac{\sqrt{(x_q - x_i)^2 + (y_q - y_i)^2}}{h}\right),$$
where $h$ is the height of the HAPS. Due to the quasi-stationary state of the HAPS, the directivity gain of a user changes dynamically within the cell even when the user is still, which changes the signal to interference plus noise ratio (SINR). In addition, the fading characteristics of the wireless channel also cause changes of the SINR. We assume that the radio channel of the multi-user OFDMA HAPS communication system is flat fading and that the channel conditions are stable and independent within each call admission control period. The subcarriers used in a cell are allocated equal power. The instantaneous SINR of subcarrier $n$ for user $q$ in cell $i$ can then be expressed as
$$\gamma_{i,q}^{n} = \frac{P_i^n\,|H_{i,q}^n|^2\,G_{i,q}(\theta_{i,q})}{N_q^n + \sum_{j\neq i}^{N_{cc}} P_j^n\,|H_{j,q}^n|^2\,G_{j,q}(\theta_{j,q})},$$
where $P_i^n$ is the transmit power allocated to subcarrier $n$ in cell $i$, $N_q^n$ is the Gaussian white noise power, and $N_{cc}$ is the number of co-channel cells. $H_{i,q}^n$ is the channel response of user $q$ on subcarrier $n$ of cell $i$, which generally includes path loss, small-scale fading and large-scale fading. $G_{i,q}(\theta_{i,q})$ is the beam gain of user $q$ in cell $i$, and $G_{j,q}(\theta_{j,q})$ is the interference gain of user $q$ from interfering cell $j$; $\theta_{j,q}$ and $\theta_{i,q}$ can be calculated by (2).
For the AMC mechanism based on multiple quadrature amplitude modulation (MQAM), the instantaneous bit rate of subcarrier $n$ in cell $i$ can be expressed as
$$r_i^n = \frac{B}{N}\log_2\!\left(1 + \frac{\gamma_{i,q}^n}{\Gamma^*}\right), \qquad \Gamma^* = -\frac{2}{3}\ln(5\,\mathrm{BER}),$$
where $\Gamma^*$ is the signal-to-noise ratio gap between the spectral efficiency of the MQAM-based AMC mechanism and the Shannon capacity [9], and $B$ and $N$ are the channel bandwidth and the number of subcarriers of the cell.
According to the changing state of the channel environment, the instantaneous bit rate of an OFDM subcarrier changes dynamically, so the link capacity can be updated from the instantaneous states of the subcarriers. It is assumed that a single subcarrier carries at least 1 bit of information, giving a basic transmission capability of $R_B$ bps. If a subcarrier can carry $M$ bits of information at a certain time, its transmission capacity is $M R_B$ bps. A subcarrier which is not allocated is counted at the basic capacity. Therefore, the total reachable link capacity $\xi_i$ of cell $i$ can be expressed as
$$\xi_i = \sum_{n=1}^{N_{occ}} M_n R_B + (N - N_{occ}) R_B,$$
where $N_{occ}$ is the number of occupied subcarriers. Assuming that the reachable link capacity has $L$ states in total, the link capacity is represented by $\xi_l$, $l = 1, \ldots, L$.
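The chain from beam geometry to link capacity described above can be sketched in code. This is an illustrative sketch only: the numerical parameters (maximum gain, decay rate, sidelobe floor, bandwidth) are assumptions, not values from the paper, and the gain model follows the $D(\theta) = G_{\max}\max\{\cos^n\theta, S_f\}$ approximation given earlier.

```python
import math

def beam_gain_db(theta, g_max_db=34.8, n=4.0, sidelobe_db=-10.0):
    """Beam directivity gain D(theta) = G_max * max(cos^n(theta), S_f), in dB."""
    main_lobe_db = g_max_db + 10 * n * math.log10(max(math.cos(theta), 1e-9))
    return max(main_lobe_db, g_max_db + sidelobe_db)

def boresight_deviation(user_xy, cell_xy, h):
    """theta_{i,q} = arctan(horizontal distance to cell centre / platform height h)."""
    d = math.hypot(user_xy[0] - cell_xy[0], user_xy[1] - cell_xy[1])
    return math.atan2(d, h)

def amc_bits(sinr_linear, ber=1e-3):
    """Bits per symbol under MQAM AMC: floor(log2(1 + SINR/Gamma*)),
    with the SNR gap Gamma* = -(2/3) * ln(5 * BER)."""
    gamma_gap = -math.log(5 * ber) / 1.5
    return int(math.log2(1 + sinr_linear / gamma_gap))

def link_capacity(sinrs, n_total, r_b):
    """Reachable capacity xi_i: occupied subcarriers carry M_n * R_B bps
    (at least 1 bit each); unallocated subcarriers count at the basic rate R_B."""
    occupied = sum(max(amc_bits(s), 1) * r_b for s in sinrs)
    return occupied + (n_total - len(sinrs)) * r_b

# A user 500 m from boresight under a platform at 20 km altitude:
theta = boresight_deviation((500.0, 0.0), (0.0, 0.0), h=20_000.0)
print(beam_gain_db(theta))
print(link_capacity([10.0, 50.0, 200.0], n_total=8, r_b=15_000))
```

Because the per-subcarrier SINR values change with platform perturbation and fading, re-evaluating `link_capacity` at each admission control period yields the discretized capacity states $\xi_l$ used by the controller.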

Call Admission Control scheme based on reinforcement learning
Reinforcement learning (RL) obtains an optimal strategy through learning agents that interact with the environment. Q-learning is a model-free reinforcement learning algorithm that approaches the optimal solution mainly through MDP modelling and iteration. The call admission control (CAC) problem is modelled as a discrete-time MDP; the following subsections present the state space and action set, the reward function design, the implementation of Q-learning, and the adaptive CAC scheme.

System state space and action sets
The system state is described by a vector containing, for each traffic type $k$, the number of calls $x_k$ and the average rate $rb_k$, together with the link capacity state. Three load states are distinguished. In state $S_1$, available subcarriers can be allocated to a new call request or a handover request. In state $S_2$ there is no available subcarrier, but the sum of the link rates of all traffic in service is less than the link capacity of the system, which means that a call request may still be admitted by redistributing subcarriers to release some of them. In state $S_3$ the sum of the link rates exceeds the link capacity, so some traffic must be degraded or interrupted. Whatever state the cell is in, when a call request of traffic type $k$ arrives, the system must choose to reject or accept the request, so the action set can be expressed as $A = \{\text{reject}, \text{accept}\}$.
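A minimal sketch of this state and action representation follows; the field names and sample numbers are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacState:
    counts: tuple        # x_k: number of ongoing calls of each traffic type k
    capacity_level: int  # l: index of the discretized link capacity state xi_l
    arrival_type: int    # k: traffic type of the arriving new/handover request

ACTIONS = (0, 1)  # 0 = reject the arriving request, 1 = accept it

def load_region(state, rates, n_free, capacity):
    """Classify the cell into S1/S2/S3 from the number of free subcarriers
    and the aggregate link rate of the traffic in service."""
    total_rate = sum(x * rb for x, rb in zip(state.counts, rates))
    if n_free > 0:
        return "S1"                       # free subcarriers are available
    return "S2" if total_rate <= capacity else "S3"

s = CacState(counts=(3, 1, 2, 0), capacity_level=4, arrival_type=2)
print(load_region(s, rates=(64, 384, 512, 128), n_free=0, capacity=2000))
```

Making the state hashable (`frozen=True`) lets it serve directly as a key in the tabular Q-function of the learning controller.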

Reward Function
The goal of maximizing the reward of the system can be expressed as the sum of the utility functions of the traffic in service. We consider four types of traffic: voice, fixed-rate data, multimedia and best-effort traffic. The priorities of these four traffic types are $\varepsilon_k$ ($k = 1, \ldots, 4$), and the priority of handover traffic is $\varepsilon_0$. Voice and fixed-rate data traffic belong to constant bit rate (CBR) traffic with strict guaranteed bit rate (GBR) limits, so the equivalent bandwidth $rc_k$ must reach a fixed value. The requirements of multimedia traffic on bit error rate and delay are not high; it belongs to the GBR class with a minimum rate guarantee, so its equivalent bandwidth lies in a range $[rc_{\min}, rc_{\max}]$. Best-effort traffic generally has no constant bandwidth requirement, and its equivalent bandwidth is represented by $rc_e$. Since the QoS requirements of the different traffic types are measured by their equivalent bandwidth requirements, the utility function of a new call request is designed so that higher-utility traffic types are admitted in preference to others, with $J_k$ the blocking probability weighting factor of traffic type $k$. When the state is $S_3$, the reward value includes both the access revenue and the inherent loss caused by degradation.
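The shape of such a reward can be sketched as follows. This is a heavily hedged illustration: the paper's exact utility expression is not reproduced here, and the functional form and all constants below are assumptions that only mirror the structure described above (priority-weighted access revenue, a blocking penalty weighted by $J_k$, and a degradation loss in state $S_3$).

```python
def reward(action, priority, blocking_weight=0.0, degradation_loss=0.0):
    """Illustrative reward r(s, a) for one admission decision.

    Accepting earns the priority-weighted access revenue minus any
    inherent loss from degrading traffic in state S3; rejecting is
    charged the blocking-probability weighting factor J_k.
    """
    if action == 1:                          # accept the request
        return priority - degradation_loss
    return -blocking_weight                  # reject the request

# Accepting a priority-4 call in S3 at the cost of degrading others:
print(reward(1, priority=4.0, degradation_loss=1.5))
# Rejecting the same call instead:
print(reward(0, priority=4.0, blocking_weight=2.0))
```

Under this shape, handover traffic (priority $\varepsilon_0$) can be favoured over new calls simply by assigning it a larger priority value, which is consistent with the paper's goal of protecting handover performance.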

Implementation process of reinforcement learning
We use the Q-value iteration method to implement reinforcement learning in this paper. The iterative formula is [10]
$$Q_{t+1}(s,a) = (1-\alpha_t)\,Q_t(s,a) + \alpha_t\!\left[r(s,a) + \gamma \max_{a'} Q_t(s',a')\right], \tag{10}$$
where $\alpha_t \in [0,1)$ is the learning rate, which depends on $T(s,a)$, the number of visits to the state-action pair. As $\alpha_t$ tends to 0, $Q_t(s,a)$ converges to the optimal value $Q^*(s,a)$. The optimal action set can be obtained by repeated learning and decision making. To balance exploration and exploitation, the standard $\varepsilon$-greedy algorithm is used to select the action of the controller.
(1) Parameter initialization: set $Q_0(s,a)$ equal to $r(s,a)$ and initialize the learning rate $\alpha$, the discount factor $\gamma$ and the exploration rate $\varepsilon_0$. (2) Obtain the current state: when a new call arrives or a call leaves the cell, the controller collects the number of calls of each traffic type in the current cell, the number of occupied subcarriers, and the current cell link capacity. (3) Select an action: in the exploration stage, a random decision is made to accept or reject the arriving request; in the exploitation stage, the action with the maximum reward value in the current state is selected. (4) Update the value $Q(s,a)$: calculate $Q(s,a)$ according to the reward function $r(s,a)$ of the current state-action pair. (5) Update parameters: the parameters $\alpha$ and $\varepsilon$ are gradually reduced by an anti-scaling function after each iteration; return to (2).

Adaptive subcarrier allocation strategy
To ensure the long-term gains of the system, an effective subcarrier resource allocation strategy must guarantee the QoS requirements while keeping resource utilization high. Because CBR traffic has the most stringent QoS requirements, its service should not be downgraded and its AMC scheme is always applied in full. The AMC scheme of VBR and BE traffic, in contrast, is controlled by the adaptive subcarrier allocation strategy in order to accept more users and reduce the call blocking probability.
Suppose that the equivalent bandwidth of user $q$ of traffic type $k$ corresponds to a required number of subcarriers $n^r_{k,q}$, and that the number of allocated subcarriers is $n^a_{k,q}$. We use $N_{occ}$ to denote the total number of occupied subcarriers.
(1) $N_{occ} + n^r_{k,q} \le N$: the cell is in a light-load state, so no distinction is made between traffic types, and the number of allocated subcarriers equals the requirement of the traffic, namely $n^a_{k,q} = n^r_{k,q}$. (2) $N_{occ} + n^r_{k,q} > N$: the cell is in a heavy-load state and must distinguish traffic types. For CBR traffic, the number of allocated subcarriers still equals the requirement; for VBR or BE traffic, the number of allocated subcarriers is discounted from the number requested by a factor associated with the current load state, the link capacity, the new call blocking rate and the handover dropping rate of the cell. Assuming that the minimum number of subcarriers allocated to a VBR or BE user is $n^{\min}_{k,q}$, the number of subcarriers that can be released by service degradation is $N_{rel} = \sum_{k,q}\,(n^a_{k,q} - n^{\min}_{k,q})$. Two sub-cases follow. (2a) $n^r_k \le N - N_{occ} + N_{rel}$: traffic $k$ can be accessed through service degradation, so in order to serve more traffic the cell allows access to traffic $k$. (2b) $n^r_k > N - N_{occ} + N_{rel}$: although service degradation can release some subcarriers, it is not enough to accept traffic $k$. To reduce the handover dropping rate of GBR traffic (including CBR and VBR), the system can force call termination of the BE traffic currently occupying the largest number of subcarriers in order to release subcarriers. If the arriving traffic belongs to the GBR class and its required number of subcarriers $n^r_k$ can be satisfied after these terminations, the system allows this handover to access the cell; otherwise the access is refused.
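The heavy-load admission test above can be sketched as follows. The function names and the `be_pool` parameter (subcarriers recoverable by terminating BE calls) are illustrative assumptions; the decision thresholds mirror the case analysis in the text.

```python
def releasable(allocated, minimum):
    """N_rel: subcarriers freed by downgrading each VBR/BE user from its
    current allocation n_a to its minimum n_min."""
    return sum(a - m for a, m in zip(allocated, minimum))

def admit(n_req, n_total, n_occ, allocated, minimum, is_gbr_handover, be_pool=0):
    """Admission test: request n_req subcarriers from a cell with n_total
    subcarriers, n_occ of them occupied. 'allocated'/'minimum' describe
    the degradable VBR/BE users; 'be_pool' is what BE terminations free."""
    free = n_total - n_occ + releasable(allocated, minimum)
    if n_req <= free:
        # Fits directly, or fits once VBR/BE users are degraded.
        return "admit-with-degradation" if n_req > n_total - n_occ else "admit"
    # Not enough even after degradation: only a GBR handover may
    # force termination of BE calls to release further subcarriers.
    if is_gbr_handover and n_req <= free + be_pool:
        return "admit-after-be-termination"
    return "reject"

# 30 of 32 subcarriers occupied; two degradable users hold 8 and 6
# subcarriers with minima 5 and 4, so degradation can release 5 more:
print(admit(n_req=6, n_total=32, n_occ=30,
            allocated=[8, 6], minimum=[5, 4], is_gbr_handover=False))
```

Note that a new (non-handover) BE or VBR request that exceeds the degradation headroom is simply rejected, which is exactly how the strategy protects GBR handovers at the expense of best-effort traffic.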

Simulation analysis
The simulation parameters of the HAPS communication system are consistent with [11]. The mapping between AMC modes and SINR used in this system follows the IEEE 802.16 air interface and can be found in [12]. The simulation traffic parameters are shown in Table 1. It is assumed that the relationship between the handover call arrival rate $\lambda_h$ and the new call arrival rate $\lambda_n$ is $\lambda_h = \rho \lambda_n$, in which $\rho$ is a random variable with mean and variance of 0.5. To verify the performance of the proposed algorithm, a guard channel strategy (with 10% guard channels), the adaptive bandwidth allocation with handover priority strategy of [3], and the proposed algorithm are compared under the same conditions. Five metrics are used to evaluate the algorithms: handover call dropping probability, new call blocking probability, bandwidth utilization, forced call termination probability and average system revenue. The forced call termination probability covers calls dropped because of link capacity reduction or the arrival of a GBR handover request.
The performances of the three strategies are shown in Figure 2. Compared with the guard channel strategy, the proposed method performs better in handover dropping probability, new call blocking probability and bandwidth utilization, at the price of a slightly higher forced call termination probability. Although the adaptive bandwidth allocation strategy has the lowest new call blocking probability and the highest bandwidth utilization, its handover dropping probability and forced call termination probability are the worst. In comparison, the performance of the proposed method is superior overall, especially in handover dropping probability and bandwidth utilization. Figure 2(e) shows that the proposed method always achieves the highest average system revenue, because it takes full account of the changing system state and of both user and system performance in order to secure long-term revenue.

Conclusions
How to effectively admit various types of traffic while ensuring their QoS requirements and improving the utilization of system resources has always been a hot issue in wireless communication research. By introducing cross-layer interaction and service degradation, the proposed strategy considers the utility and blocking rate of the different traffic types and maximizes the long-term gains of the system, balancing handover performance against system resource utilization. In future work, we also need to consider the dynamic interference between cells to further improve system performance.

Figure 2. The performances of the three strategies (guard channel, adaptive bandwidth allocation, proposed method) versus call arrival rate (calls/min): (d) forced call termination probability, (e) average system revenue.
