Regret of Multi-Channel Bandit Game in Cognitive Radio Networks

This paper studies how to evaluate the rate of convergence to Nash equilibrium solutions in channel selection under incomplete information. Regret is used to reflect the convergence rates of online algorithms. The process of selecting an idle channel for each secondary user (SU) is modeled as a multi-channel bandit game, and the maximal averaged regret (MAR) is defined. Two existing online learning algorithms are used to obtain the Nash equilibrium for each SU, and their MARs are used to evaluate their performance. When the multi-channel bandit game has a pure strategy Nash equilibrium, the MARs are finite. A cooperation mechanism is also needed to calculate the MARs. Simulation results show that the MARs are finite and that the online algorithm with the greater convergence rate has the smaller MAR.
Keywords: cognitive radio networks; adversarial bandit problem; congestion game; online learning; dynamic spectrum access.


Introduction
With the emergence of new wireless services and applications, the demand for spectrum increases. However, some studies show that many licensed spectrum bands are not utilized efficiently [1]. To alleviate this contradiction, cognitive radio users called secondary users (SUs) have been proposed and are allowed to access spectrum bands belonging to primary users (PUs) as long as those bands are sensed to be idle. The choice of which channel to sense and access affects how well the idle spectrum is used. Therefore, the channel selection problem is important to each SU.
First, some works on channel selection under incomplete information are introduced. In [2], the problem of how to select a channel set for an SU to access at a time is investigated under the assumption that the statistical information of the PUs' traffic follows a stationary and simple distribution that is unknown to SUs in advance. This work is mainly based on classical bandit models [3]. Due to the stochastic nature of cognitive radio networks (CRNs), the real primary traffic distribution is not always stationary. The channel set selection problem under this scenario is analyzed in [4]. Furthermore, a similar problem in which multiple SUs each need to select and access a channel at a time is studied in [5], considering the effect of the interactions among SUs on individual benefits. These previous works study the channel selection problem with incomplete information using bandit models.
From the perspective of game theory, the channel selection problem can also be modeled as a congestion game. A congestion game is a kind of non-cooperative game in which the utility a player obtains from a certain resource depends on the total number of players using the same resource [6], [7]. Some works in game theory have shown that congestion games are in fact a kind of potential game [8]. Potential games are used to study the channel selection problem in [9] and the network selection problem in [10], where online learning algorithms are applied to obtain the Nash equilibrium (NE) solutions. Potential games always have potential functions, which guarantee at least one pure strategy NE [8]. Although these works claim that the NE can be found under incomplete information, they need more information than our model because the potential functions must be given before searching for the NE. These works also do not consider the convergence rate of the online algorithm. An online algorithm with a greater convergence rate reduces the search time and increases the time available for data transmission. Therefore, the problem of how to evaluate the convergence rate of an online algorithm under incomplete information should be studied.
In this paper, the channel selection problem with incomplete information is modeled as a multi-channel bandit game from the view of bandit models. The maximal averaged regret (MAR) is used to evaluate the convergence rates of online algorithms.

Model and Problem Formulation
Consider a CRN with N SUs, S SU base stations, and M (M=S<N) primary channels which belong to PUs. All SUs and PUs operate in a slotted transmission structure. Each PU has only one channel. If an SU node wants to communicate with an SU base station, it must use the idle slots of the primary channels. Each primary channel serves exactly one SU base station. For example, let N=3 and M=S=2. When SU n=1 communicates with SU base station A, it can only utilize the idle slots of channel m=1, which belongs to PU 1. Similarly, if SU n=2 communicates with SU base station B, it can only utilize the idle slots of channel 2, which belongs to PU 2. Here, we assume that connecting to different SU base stations does not affect the further communication of each SU with its destination. Spectrum sensing is assumed to be perfect for each SU, and all primary channels are assumed to be idle during the whole simulation time for ease of analysis. This assumption is reasonable because the idle time of some primary channels is much longer than the data transmission durations of secondary users that have little data to transmit (such as temperature sensors). The effect of primary traffic on the convergence rates of online algorithms will be studied in our future work.
At a time, each SU selects one of the M channels to sense, and accesses it if the channel is sensed to be idle and the number of SUs using it is not too large. Each SU is rational and selfish, with the goal of maximizing its own averaged transmission rate during the process of channel selection. At each slot, each SU selects a channel and receives a reward if the channel is accessed, which is analogous to a gambler selecting the arm of a slot machine and receiving a reward. If an SU transmits successfully, the reward is the channel transmission rate given by (1); if the transmission fails, the reward is 0. In (1), which gives the channel transmission rate of SU n at slot t in channel m, W is the channel bandwidth, P is the transmit power, and $\sigma^2$ is the thermal noise level, all of which are the same for all SUs for simplicity. Note that $g_{n,m}^t$ denotes the channel gain at slot t, which changes over time but stays constant within each slot.
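The reward rule above can be sketched in a few lines of Python. Equation (1) is not reproduced in this text, so the Shannon-type form W·log2(1 + P·g/σ²) used below is an assumption based on the listed symbols (W, P, σ², and the channel gain), and the function names are hypothetical.

```python
import math

def transmission_rate(W, P, gain, noise):
    """Channel rate for one SU in one slot.

    ASSUMPTION: a Shannon-type rate W * log2(1 + P * gain / noise);
    the exact form of (1) is not available in the text.
    """
    return W * math.log2(1.0 + P * gain / noise)

def reward(W, P, gain, noise, success):
    """Reward is the channel rate on success and 0 on failure."""
    return transmission_rate(W, P, gain, noise) if success else 0.0
```

For instance, with W = 1 and P·g/σ² = 1 the assumed rate is one unit per slot, and a failed transmission always yields a reward of 0.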
When more than one SU accesses the same channel simultaneously, the individual transmission success probability decreases because of mutual interference. Therefore, the rewards obtained by SUs are affected by the number of SUs using the same channel at the same time. Let $s^t = (s_1^t, \ldots, s_N^t)$ be a pure strategy profile at slot t for all SUs, where $s_n^t$ denotes the selection strategy of SU n at slot t. Let $c^t(s^t) = (c_1^t, \ldots, c_M^t)$ denote the total number of SUs in each channel corresponding to the strategy profile $s^t$ at slot t. The total number $c_m^t$ affects the successful transmission probability $p(c_m^t)$ of the SUs that use channel m at slot t: when $c_m^t$ increases, $p(c_m^t)$ decreases. In this paper, we adopt a common MAC protocol (slotted Aloha) [7], and the specific expression of $p(c_m^t)$ is given in (2). When only one SU is in channel m, that is, $c_m^t = 1$, it transmits its data successfully with probability 1. As mentioned before, we model the channel selection problem as a bandit problem. However, the classic bandit model does not fit our situation because the rewards an SU obtains from accessing a channel are not drawn independently from a fixed and unknown distribution.
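As a small illustration of the congestion vector, the sketch below counts the SUs on each channel for a given pure strategy profile. Equation (2) is not reproduced in this text, so `success_probability` uses 1/c purely as a stand-in assumption: it equals 1 for a single SU and decreases as the channel fills, which are the only two properties the text states. All names are hypothetical.

```python
from collections import Counter

def channel_counts(profile, num_channels):
    """Return (c_1, ..., c_M): the number of SUs on each channel.

    `profile` holds one channel label per SU, with channels labeled 1..M.
    """
    counts = Counter(profile)
    return [counts.get(m, 0) for m in range(1, num_channels + 1)]

def success_probability(c):
    """ASSUMPTION: placeholder 1/c, not the slotted Aloha expression in (2)."""
    return 1.0 / c if c > 0 else 0.0

# Example: four SUs playing the profile (1, 2, 2, 1) on two channels.
counts = channel_counts((1, 2, 2, 1), 2)
```

Here `counts` is `[2, 2]`, so under the placeholder each SU would succeed with probability 0.5.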
Here, we use a variant of the classic bandit model called the adversarial bandit [11], which is a non-stochastic bandit problem, to model our scenario. The adversarial bandit is also used in [4] to find the optimal channel. We assume that all SUs use the same online algorithm at the same time to find their equilibrium channels.
In our model, we use T to denote a slot at which each user makes a selection using the online learning algorithm. This structure is illustrated in Fig. 1. At slot T, each SU selects a channel to access based on a specific distribution over the M channels. This distribution is decided by the updating rule of the online learning algorithm. After all SUs have completed their channel selections at slot T, the pure strategy profile is represented by $s^T$.
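The selection step at slot T can be sketched as independent sampling, one draw per SU from that SU's own distribution. This is a minimal illustration; the function name and the use of Python's `random.Random.choices` are choices made here, not part of the original model.

```python
import random

def select_channels(distributions, rng):
    """Draw one channel per SU and return the pure strategy profile s^T.

    `distributions[n]` is SU n's probability vector over channels 1..M.
    """
    profile = []
    for probs in distributions:
        channels = range(1, len(probs) + 1)
        profile.append(rng.choices(channels, weights=probs, k=1)[0])
    return tuple(profile)

# Example: four SUs, two channels, uniform distributions at T = 1.
rng = random.Random(0)
profile = select_channels([[0.5, 0.5]] * 4, rng)
```

Passing the random generator explicitly keeps the sketch reproducible across runs.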

Figure 1. An illustration of the process of calculating regrets under incomplete information in our model
The following M-1 slots are used to calculate the regret for each channel that was not selected at slot T. To express the idea clearly, let $(j, s_{-n}^T)$ denote the strategy profile in which SU n selects channel j while the other SUs keep their strategies $s_{-n}^T$; when j is the channel that SU n selected at slot T, this profile is $s^T$. When SU n does not select channel j at slot T, it may care about how much reward it has lost by not playing the pure strategy j at slot T, under the condition that all the other SUs keep their selection strategies $s_{-n}^T$ at slot T unchanged. If SU n knows its own payoff function and the strategy profile $s^T$ at slot T, it can calculate the regret of not selecting channel j, $R_n^T(j,T)$, as follows [11]:

$$R_n^T(j,T) = U_n(j, s_{-n}^T) - U_n(s^T), \qquad (3)$$

where $U_n(\cdot)$ is the payoff of SU n. Under incomplete information, calculating this regret requires cooperation with other SUs. The cooperation among SUs helps each SU to find the equilibrium channel, as shown in Fig. 1. Therefore, each SU has an incentive to cooperate with others. The cooperation process works as follows. If SU n does not select channel j at slot T, it selects channel j at slot t+1 while the other SUs select the channels they chose at slot T, namely $s_{-n}^T$. In other words, the channel selection process is repeated at slot t+1, with SU n selecting channel j and the other SUs keeping $s_{-n}^T$ unchanged. SU n repeats this process for M-1 slots because M-1 channels were not chosen at slot T. Hence, the next M-1 slots are used to calculate the regrets of not selecting each channel j ($1 \le j \le M$) at slot T. The order in which the M-1 channels are tried is decided by each SU. All SUs know M. When SU n completes the regret calculations for the M-1 channels, SU n+1 is informed and repeats the same process for M-1 slots, and so on until SU N completes this process. The next time all SUs use the online algorithm to select a channel is at slot t+N(M-1). For example, in Fig. 1, the second selection time decided by the online algorithm is T=2. In our model, if the strategy profile $(j, s_{-n}^T)$ is the same as $s^T$, strategy j does not need to be tried again. The total number of channel selections decided by the online algorithm is Ta. If the online algorithm makes Ta selections in total, the averaged regret of SU n for channel j ($1 \le j \le M$) after Ta selections is

$$\bar{R}_n(j) = \frac{1}{T_a} \sum_{T=1}^{T_a} R_n^T(j,T). \qquad (4)$$

Here, we define the MAR of SU n as $\max_{1 \le j \le M} \bar{R}_n(j)$, the maximal averaged regret, and use the MAR to evaluate the performance of online learning algorithms.
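The regret bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the regret is taken as the payoff of the unilateral deviation (j, s_{-n}) minus the payoff actually obtained, the payoff function is supplied by the caller (in the model it is only observable through the cooperation slots), and `toy_payoff` is a hypothetical stand-in that ignores channel gains. SUs are indexed from 0 and channels are labeled from 1.

```python
def regret(payoff, n, j, profile):
    """Reward SU n lost by not playing channel j, others kept fixed."""
    deviated = list(profile)
    deviated[n] = j
    return payoff(n, tuple(deviated)) - payoff(n, tuple(profile))

def maximal_averaged_regret(payoff, n, channels, history):
    """Average the per-selection regrets over Ta selections, then take the max over channels."""
    Ta = len(history)
    averaged = {
        j: sum(regret(payoff, n, j, s) for s in history) / Ta
        for j in channels
    }
    return max(averaged.values())

def toy_payoff(n, profile):
    """ASSUMPTION: illustrative payoff 1 / (number of SUs sharing SU n's channel)."""
    return 1.0 / profile.count(profile[n])

# Two SUs that always collide on channel 1: SU 0 regrets not using channel 2.
mar = maximal_averaged_regret(toy_payoff, 0, (1, 2), [(1, 1), (1, 1)])
```

In this toy run the averaged regret for channel 2 is 0.5, and the regret for the channel actually played is 0, which matches the remark that a strategy identical to the one already played need not be tried again.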

Online Learning Scheme
In this section, two existing online learning algorithms are used to find the NE solution, and the MAR is calculated for all SUs under each algorithm. The first is based on the algorithm Exp3 [11]. The second is based on the stochastic learning algorithm [12].

Online Learning Algorithm Based on Exp3
Each SU calculates the selection probability distribution $P_n(T) = (p_{n,1}(T), \ldots, p_{n,M}(T))$ over M channels based on (5), according to the corresponding weight value for each channel. The weight value is updated by the normalized reward based on (8). Here, the normalized reward is the ratio between the actual reward and the maximal probability reward, which is constrained by the hardware of SUs. The maximal probability reward in our simulations is produced over all random communication conditions.

In the cooperation step of Algorithm 1, the selection strategy of the other SUs, $s_{-n}^T$, is always unchanged. SU n can obtain the rewards for each channel in the set of channels M. If channel m has been selected at slot T, the reward is $U_n(m, s_{-n}^T)$. When SU n completes its calculation of $R_{n,m}(T)$, SU n+1 is informed and continues, until SU N.
6: Cooperation end.
7: T=T+1 and go back to Step 2.
8: Calculate the MAR based on (4).
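For orientation, the sketch below shows the textbook Exp3 step: a weight-based distribution mixed with uniform exploration, followed by an exponential weight update on the importance-weighted reward. Equations (5) to (8) are not reproduced in this text, so this is the standard form of Exp3 and may differ in detail from the variant used here; the parameter b is assumed to play the role of the exploration rate, and the function names are hypothetical.

```python
import math

def exp3_distribution(weights, b):
    """Mix the normalized weights with uniform exploration of rate b."""
    total = sum(weights)
    M = len(weights)
    return [(1.0 - b) * w / total + b / M for w in weights]

def exp3_update(weights, probs, chosen, normalized_reward, b):
    """Update only the chosen channel's weight.

    `normalized_reward` is assumed to lie in [0, 1]; dividing by the
    selection probability gives the usual importance-weighted estimate.
    """
    M = len(weights)
    estimate = normalized_reward / probs[chosen]
    updated = list(weights)
    updated[chosen] = weights[chosen] * math.exp(b * estimate / M)
    return updated

# One step with two channels, unit initial weights, and b = 0.01.
w = [1.0, 1.0]
p = exp3_distribution(w, 0.01)
w = exp3_update(w, p, chosen=0, normalized_reward=1.0, b=0.01)
```

After this step the weight of channel index 0 has grown while the other is unchanged, so the next distribution leans slightly toward the rewarded channel.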

Stochastic Learning Algorithm
Each SU selects a channel at random according to a dynamic distribution $P_n(T) = (p_{n,1}(T), \ldots, p_{n,M}(T))$ over M channels and uses the normalized reward to update the selection probability distribution for the next round based on (11). At the first slot (T=1), the probability distribution $P_n(T)$ of channel selection is assumed to be the uniform distribution over M channels.
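A probability update of this kind can be sketched with the linear reward-inaction rule that is commonly used for stochastic learning automata. Equation (11) is not reproduced in this text, so the rule below is an assumption rather than the exact update of [12]; b is taken as the learning rate, the reward is assumed to be normalized to [0, 1], and the function name is hypothetical.

```python
def sla_update(probs, chosen, normalized_reward, b):
    """Move probability mass toward the chosen channel in proportion to its reward.

    ASSUMPTION: linear reward-inaction form
    p_m <- p_m + b * r * (1[m == chosen] - p_m).
    """
    return [
        p + b * normalized_reward * ((1.0 if m == chosen else 0.0) - p)
        for m, p in enumerate(probs)
    ]

# Start from the uniform distribution over two channels (T = 1).
P = [0.5, 0.5]
P = sla_update(P, chosen=0, normalized_reward=1.0, b=0.01)
```

The update keeps the vector a valid distribution, because the increments sum to zero, and a zero reward leaves the distribution unchanged.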

The Relationship between NE and MAR
In this subsection, we use the definition of regret to evaluate the convergence rates of online algorithms. Based on Theorem 1, the MAR is reduced if the selection strategy converges to the equilibrium channel quickly.
Theorem 1: When the multi-channel bandit game has a pure strategy NE, the MAR of online algorithms for each SU is finite.
Proof: The total number of selections decided by the online algorithm is Ta. We assume channel J is the equilibrium strategy for SU n. At a certain selection time T', each SU selects its optimal channel J and does not change its selection after slot T'. In other words, when the total Ta is large enough,

Simulation Results
The performance of the online learning algorithms is evaluated in this section. There are N=4 SUs and M=2 primary channels in a CRN covering a 500 m × 500 m area. S=2 SU base stations are located at (0,250) and (500,250) and are accessible to all SUs. The locations of the 4 SUs are selected randomly and then fixed during the simulations. All SUs adopt the Flat/Light tree density channel model proposed in [13]. The transmission power is $10^{-3}$ W and the noise power level for all SUs is assumed to be $10^{-12}$ W. The bandwidth for all SUs is assumed to be 1 Hz and the learning rate is b=0.01 for the two algorithms. In the figures, Ta is shown on the abscissa and ranges from 1000 to 20000. Each value of Ta is run for 100 simulations. Based on (2), we obtain the NE in which SU 1 and SU 4 select channel 1 while SU 2 and SU 3 select channel 2. The NE strategy can be written as (1, 2, 2, 1). First, we are interested in the behavior of the SUs in different channels. In Fig. 2, we show the individual accumulated transmission rates that each SU obtains from channel 1 and channel 2 using algorithm 1 (A1) and algorithm 2 (A2). From Fig. 2, we can see that each SU has a dominant strategy which is consistent with the pure strategy NE (1, 2, 2, 1). From Fig. 2(a), SU 1 obtains more individual profit from channel 1 under both A1 and A2. The same result is shown in Fig. 2(d) for SU 4. Therefore, channel 1 is the equilibrium solution for both SU 1 and SU 4. In Fig. 2(b), SU 2 receives more individual profit from selecting channel 2, and so does SU 3, as shown in Fig. 2(c).
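Whether a profile such as (1, 2, 2, 1) is a pure strategy NE can be checked by brute force over unilateral deviations, as sketched below. The payoff used here is a hypothetical stand-in that depends only on how many SUs share a channel; it ignores the location-dependent channel gains of the simulation, so it accepts any even split and does not by itself single out (1, 2, 2, 1).

```python
def is_pure_nash(payoff, profile, channels):
    """Return True if no SU can gain by switching channels on its own."""
    for n in range(len(profile)):
        current = payoff(n, tuple(profile))
        for j in channels:
            deviated = list(profile)
            deviated[n] = j
            if payoff(n, tuple(deviated)) > current:
                return False
    return True

def sharing_payoff(n, profile):
    """ASSUMPTION: illustrative payoff 1 / (number of SUs on SU n's channel)."""
    return 1.0 / profile.count(profile[n])

balanced = is_pure_nash(sharing_payoff, (1, 2, 2, 1), (1, 2))
crowded = is_pure_nash(sharing_payoff, (1, 1, 1, 1), (1, 2))
```

Under this toy payoff the balanced profile passes the check, while the profile with all four SUs on channel 1 fails because any SU gains by moving to the empty channel.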
In Fig. 3, all MARs are finite. To evaluate the convergence rates of the two online algorithms, we compare their MARs for all SUs in Fig. 3. From Fig. 3, we can see that the MAR of the equilibrium strategy for each SU under A1 is larger than under A2. Due to its lower convergence rate, each SU using A1 selects a non-equilibrium channel more frequently, which increases the MAR. The differences in MAR between SUs using the same online algorithm depend on the different locations of the SUs.

Conclusion
In this paper, we have modeled the channel selection problem under incomplete information from the perspective of the connection between the bandit model and the game model. When the channel selection game has a pure strategy NE, the MAR of the selection strategies is finite. Two existing online learning algorithms are used to obtain the equilibrium channel for all SUs. The stochastic learning algorithm outperforms the Exp3 based online learning algorithm in terms of convergence rate to the Nash equilibrium solution, as reflected by its MAR being smaller than that of Exp3. This work provides a connection between non-cooperative games and bandit problems, which helps to characterize the convergence rates of different online algorithms without knowing the payoff functions or the other SUs' selection strategies.

Algorithm 1: Exp3 based online learning algorithm
1: Initialize total times Ta, parameter 0<b<1, $R_n(0)=0$, $R_{n,m}(0)=0$, and the weight value
Cooperation begin: From n=1 to N, each SU n selects channel m from the set of channels M according to the predetermined order, and the selection strategy of other SUs $s_{-n}^T$ is always unchanged.

Algorithm 2: Stochastic learning algorithm
1: Initialize total times Ta, parameter 0<b<1, $R_n(0)=0$, $R_{n,m}(0)=0$ for each channel $m \in M$ and each SU $n \in N$, and $P_n(1)$ is the uniform distribution over M channels.
2: Let each SU select a channel randomly according to the probability distribution $P_n(T)$.
Cooperation begin: From n=1 to N, each SU n selects channel m from the set of channels M according to the predetermined order, and the selection strategy of other SUs $s_{-n}^T$ is always unchanged. SU n can obtain the rewards for each channel from the set of channels M. If channel m has been selected at slot T, the reward is $U_n(m, s_{-n}^T)$ and
$R_{n,m}(T) = R_{n,m}(T-1) + U_n(m, s_{-n}^T)$ (12)
When SU n completes its calculation of $R_{n,m}(T)$, SU n+1 continues, until SU N.
6: Cooperation end.
7: T=T+1 and go back to Step 2.
8: Calculate the MAR based on (4).
During the cooperation, if SU n has selected channel m at slot T, SU n has no need to try again. Therefore, we think ( , )

passes slot T'. This is because $(J, s_{-n}^T)$ is the same as $s^T$ at slot T=T' and the reward of each access to the optimal channel J is random for SU n. When no SU changes its selection, the result of searching for the optimal channel converges to an NE for SU n. Therefore,

Figure 2. Individual profits obtained from different channels using two online learning algorithms.

Figure 3. Comparison of 4 SUs' MAR of two selection strategies