Reinforcement Learning Based Network Selection for Hybrid VLC and RF Systems

For a hybrid indoor network scenario with LTE, WLAN and Visible Light Communication (VLC), selecting the network intelligently based on user service requirements is essential for ensuring a high user quality of experience. To tackle the challenges posed by the dynamic environment and complicated service requirements, we propose a reinforcement learning solution for indoor network selection. In particular, a transfer learning based network selection algorithm, i.e., reinforcement learning with knowledge transfer, is proposed by revealing and exploiting context information about the features of traffic, networks and network load distribution. Simulations show that the proposed algorithm has an efficient online learning ability and achieves better performance with faster convergence than the traditional reinforcement learning algorithm.


Introduction
Visible light communication (VLC), as an emerging wireless access technology, has been regarded as a promising member of the 5G era with tremendous value and potential [1]. It offers several advantages, such as high data rate, huge bandwidth, no electromagnetic interference and high security. In a heterogeneous wireless access environment for indoor communication, where LTE, WLAN and VLC are available simultaneously, selecting the network intelligently based on user service requirements is essential for ensuring high user Quality of Experience (QoE). However, different wireless technologies vary in coverage, data transmission rate and other features. Meanwhile, end users are no longer satisfied with basic data communication, and emerging services such as virtual reality and ultra-high definition video pose higher requirements on both uplink and downlink performance. Because of these two concerns, selecting the optimal access network is a challenging task.
In this paper, a reinforcement learning solution is proposed for indoor network selection. Specifically, context information is leveraged to tackle network selection in two respects. On the one hand, the asymmetric downlink and uplink performance requirements of traffic are explicitly revealed and modeled. On the other hand, some distinguishing features of the networks, as well as the stationary distribution law of the network load, are used to assist the algorithm design. In particular, such information enables us to introduce knowledge transfer into reinforcement learning, providing an effective algorithm for network selection in a dynamic and unknown environment.
Our main contributions are twofold. First, we propose a fine-grained network selection model that takes the diverse traffic requirements and the network performance of both uplink and downlink into account. This view is important since many newly emerging traffic types, such as virtual reality, need customized performance requirements. Although there is extensive research on network selection, e.g., [2][3], a utility design that differentiates the uplink and downlink requirements of different traffic types, as proposed in this paper, seems to be absent. Second, a transfer learning [4] based algorithm is used for network selection. Even though some works such as [5] have studied context-aware network selection, they do not rely on knowledge transfer. Compared with existing work using reinforcement learning [6][7], the introduction of transfer learning could significantly enhance the algorithm performance. This method may provide a new perspective on endowing solutions for related problems with context awareness [8].

System Model
We consider an indoor heterogeneous wireless access environment which consists of $N$ networks, $\mathcal{N} = \{1, 2, \ldots, N\}$, of LTE, WLAN and VLC. For simplicity, we use the term "network" to represent a base station (BS) in LTE or an access point (AP) in WLAN and VLC. We assume that a user is located in the overlapping area of the $N$ wireless networks and is equipped with multi-homing capability. In a slotted system with epoch duration $l$ seconds, the user can dynamically change its access network, but only one access network can be selected in any given slot.
We use throughput as the main performance metric of the networks. The maximum instantaneous rate of a user, which is determined by the SNR (signal-to-noise ratio) according to the Shannon formula, constitutes the upper bound of its throughput. Meanwhile, the multi-user access behavior determines the network load distribution and thus affects the throughput achieved by each user in the network. Therefore, for a given slot, the achieved instantaneous throughput of user $i$ in network $n$ is a function $f(\cdot)$ of the instantaneous rate and the network load $K_n$ (the total number of users in network $n$). The function $f(\cdot)$ can be modeled depending on the specific network. In the following, the uplink and downlink throughput models of LTE, WLAN and VLC are given.
1. LTE: OFDMA is the downlink multiple access technology of LTE. According to the model in [2], the downlink throughput under weighted proportional fairness is expressed in terms of the total weight of the users, taken over the set of users in network $n$. In the uplink, LTE uses SC-FDMA based MAC protocols with fair subcarrier sharing. Hence, the throughput of user $i$ depends roughly on the total number of users sharing the same network.
2. WLAN: In the 802.11 WLAN MAC protocols, the distributed coordination function (DCF) gives uplink users a fair access opportunity. Hence, a low rate user that captures the channel holds it for a long time and thus penalizes high rate users. The uplink throughput of a WiFi user is expressed in terms of the packet size $L$. The throughput a user can obtain on the downlink is related to the scheduling mechanism of the access point. According to [10], when a round-robin (RR) scheme is used, the downlink throughput can also be derived by replacing the uplink quantities with their downlink counterparts.

3. VLC: We consider an all-optical VLC network in which downstream data transmission and illumination are combined. Currently, there is no common view on the MAC protocol for VLC. Most existing works assume that the system uses TDMA with RR scheduling, and the achieved downlink throughput of user $i$ assigned to the $n$-th VLC AP follows from this assumption [11]. Note that intensity modulation with direct detection (IM/DD) is used in VLC, so only real-valued signals can be transmitted to receivers. Thus, at least half of the sub-carriers must be used to realize the Hermitian conjugate of the complex-valued symbols after modulation. Consequently, the formula is divided by 2.
Using visible light in the uplink may not be practical, as it is constrained by the power of user equipment and may be uncomfortable for users. Referring to [12], we use infrared in the uplink. The infrared link is mainly limited by its low transmission power, which often leads to a low data rate (up to 4 Mbps or 1.152 Mbps in [9]). As visible light and infrared (IR) light exhibit very similar qualitative behavior, the uplink throughput model can be derived in a similar way.
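The throughput models described above can be illustrated with a minimal Python sketch. The closed forms below are common textbook choices that match the verbal descriptions (weighted proportional sharing, equal sharing, the DCF equal-throughput effect, and round-robin TDMA with the factor 1/2); they are assumptions for illustration and are not taken from the expressions in [2], [10] or [11]. All function and parameter names are hypothetical.

```python
import math


def shannon_rate(bandwidth_hz, snr_linear):
    """Upper bound on the instantaneous rate (bit/s) from the Shannon formula."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def lte_downlink(rate, weight, weights_in_network):
    """Assumed weighted proportional fairness: rate share equals weight share."""
    return rate * weight / sum(weights_in_network)


def lte_uplink(rate, num_users):
    """Assumed fair subcarrier sharing: the rate is split evenly among users."""
    return rate / num_users


def wlan_uplink(rates_in_network, packet_bits=1.0):
    """Assumed DCF model: each user sends one packet of packet_bits per round,
    so all users obtain the same throughput and slow users penalize fast ones."""
    round_time = sum(packet_bits / r for r in rates_in_network)
    return packet_bits / round_time


def vlc_downlink(rate, num_users):
    """Assumed TDMA with round-robin scheduling; the factor 1/2 accounts for
    the Hermitian symmetry required by IM/DD."""
    return rate / (2.0 * num_users)
```

For example, under these assumptions `wlan_uplink([54.0, 6.0])` gives each of the two users the same throughput, which is below the rate of the slow user, illustrating how a low rate user penalizes a high rate one.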

Reinforcement learning based network selection framework

Problem formulation
Considering the diverse features of various traffic types, we propose a general utility model that differentiates uplink and downlink performance requirements, which, to our knowledge, is still absent in current network selection research. Note that we mainly focus on throughput, but this model can be easily extended to incorporate other performance metrics. The achieved utility $u(\cdot)$ is defined as a function of both the uplink and the downlink throughput.

1. Uplink dominant traffic: For traffic such as sending files or backing up files to the cloud, the uplink throughput is the main factor affecting performance, whereas the downlink throughput matters little since it only carries control and feedback messages (it need only be no less than a small threshold). As an example, the utility can be defined in a similar way to a file transfer utility. Here, the threshold is the minimal downlink throughput requirement, and a logarithmic function of the uplink throughput models the utility-throughput relation [2], with two parameters that depend on the specific maximal and minimal throughput demands of the user.
When the traffic shows a threshold effect on throughput, a piecewise function of the downlink throughput plus the basic uplink throughput requirement is used.

Here, $c$ is a constant. Hence, it is reasonable to select the network providing the best average performance. However, since we have no prior knowledge of the average performance of the available networks, we have to learn the optimal selection through interaction with the environment. Mathematically, this learning problem can be formulated as finding a network selection policy $\pi^*$ that maximizes the long-term average reward, that is, a series of actions that maximizes the total expected return

$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} u(t)\right],$$

where $u(t)$ is the reward received in slot $t$ and $\gamma \in (0,1)$ is the discount factor, which reflects the importance of future returns relative to current ones.
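As a rough illustration of how a per-slot reward and the discounted return fit together, the following Python sketch assumes a logarithmic utility of the form `alpha * log(beta * uplink)` that is granted only when the downlink meets a minimal threshold. This functional form, the clamping at zero, and all names and default values are assumptions made for the example, not the utility defined in the paper.

```python
import math


def uplink_dominant_utility(ul, dl, dl_min=0.1, alpha=1.0, beta=1.0):
    """Assumed utility for uplink dominant traffic.

    Returns 0 when the downlink is below the minimal requirement dl_min;
    otherwise a log utility of the uplink throughput, clamped at 0.
    """
    if dl < dl_min or ul <= 0.0:
        return 0.0
    return max(0.0, alpha * math.log(beta * ul))


def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * u(t) over a finite sequence of per-slot rewards."""
    total = 0.0
    for t, u in enumerate(rewards):
        total += (gamma ** t) * u
    return total
```

With `gamma = 0.5`, three consecutive rewards of 1 give a return of 1 + 0.5 + 0.25, which shows how later slots count less than the current one.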

Algorithm design
Q learning is the most commonly used reinforcement learning algorithm for the above problem in femto and small cell networks [13]. In the Q learning algorithm, the controller (learner) learns how to optimize its decisions through historical experience. However, the standard Q learning algorithm may converge slowly and perform poorly because of exploration. In particular, when the available strategy set is relatively large, random exploration incurs significant cost on bad strategies. The idea of transfer learning [4] provides a feasible way to enhance the Q learning algorithm. More specifically, transfer learning enables us to speed up the algorithm convergence by using prior knowledge or context information. We notice that the following observations may be useful.
Observation 1: Not all networks are inherently suitable for all traffic types. There may be a mismatch between the downlink/uplink features of a network and the traffic requirements. For instance, VLC has poor uplink throughput because of the inherent limitation mentioned above; thus, it is not suitable for traffic with a strict requirement on uplink performance. Other prior rules could also be applied, such as fee and privacy considerations.
Observation 2: Network load distribution is space-time dependent. The recent literature [14] has revealed that the traffic/load follows spatial and temporal distribution laws, which means that information about the load dynamics of networks may be exploited. For example, the load dynamics at a specific location during a fixed period of weekdays are generally the same. With these observations, we propose the Q learning algorithm with knowledge transfer, as shown in Algorithm 1.
Algorithm 1 Q learning with knowledge transfer
For each slot $t$, based on the traffic type, select network $a(t)$ from the refined action set as follows:
8: • With probability $\varepsilon$, choose an action at random;
10: Receive the reward $u(t)$.
The context consists of the current traffic type $s$, the available network set and the time period index $i$. The traffic type belongs to the set of traffic types, e.g., the three types defined in Section 3, and the available network set is a subset of the maximal available network set introduced in Section 2. Note that since the available networks may change with location, we use the available network set to indicate the "location" instead of exact coordinates. One day is divided into several time periods. For example, the daytime of weekdays from 8:00 to 17:00 could be divided into 9 periods, each lasting 1 hour. The load distribution law is assumed to stay unchanged within each time period. Specifically, observation 1 enables us to decrease the size of the action set according to the traffic type. That is, choices that are not suitable are removed from the Q learning action set. This is realized by selecting the traffic type-dependent action set, i.e., the refined action set. If a learning record already exists for the current context, the learned Q table is used; otherwise, the Q table is initialized with a zero vector, as shown in the 1st to 5th lines of the algorithm. Accordingly, the initial exploration probability $\varepsilon$ is chosen according to which of the two cases applies.
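The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the Q table is kept as one value per network within a context (a single-state simplification), the step size `alpha` and the two exploration probabilities `eps_new` and `eps_reuse` are illustrative values, and the rule that VLC is unsuitable for uplink dominant traffic is the example given in observation 1.

```python
import random

# Assumed prior rule from observation 1: networks unsuitable per traffic type.
UNSUITABLE = {"uplink_dominant": {"VLC"}}


def refined_actions(traffic_type, available):
    """Remove networks that are known to be unsuitable for the traffic type."""
    banned = UNSUITABLE.get(traffic_type, set())
    return [n for n in available if n not in banned]


def period_index(hour, start=8, end=17):
    """Map a weekday hour to a 1-hour period index, or None outside the range."""
    return hour - start if start <= hour < end else None


class TransferQLearner:
    def __init__(self, alpha=0.1, gamma=0.9, eps_new=0.3, eps_reuse=0.05, seed=0):
        self.alpha, self.gamma = alpha, gamma
        self.eps_new, self.eps_reuse = eps_new, eps_reuse
        self.store = {}  # context -> Q table (network -> value)
        self.rng = random.Random(seed)

    def start(self, traffic_type, available, period):
        """Load the stored Q table for this context, or start from zeros."""
        self.context = (traffic_type, frozenset(available), period)
        self.actions = refined_actions(traffic_type, available)
        if self.context in self.store:
            self.q = self.store[self.context]
            self.eps = self.eps_reuse  # assumed: explore less when reusing
        else:
            self.q = {a: 0.0 for a in self.actions}
            self.store[self.context] = self.q
            self.eps = self.eps_new

    def select(self):
        """Epsilon-greedy selection over the refined action set."""
        if self.rng.random() < self.eps:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[a])

    def update(self, action, reward):
        """Single-state Q learning update with discount factor gamma."""
        target = reward + self.gamma * max(self.q.values())
        self.q[action] += self.alpha * (target - self.q[action])
```

A typical use would call `start(...)` once when the context changes, then alternate `select()` and `update(action, reward)` in every slot. Because the Q tables are keyed by context, returning to the same traffic type, network set and time period on a later day resumes from the previously learned values.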

Performance Evaluation
We consider an indoor scenario composed of LTE femtocells, WLAN and VLC with single cell overlap. In the LTE, WLAN and VLC standards, the instantaneous rate achieved by a user is discrete; it is determined by the user's location and varies over time with fading. Similar to the literature [15], we use a set of discrete achievable peak rates.
We use the uplink dominant traffic to examine the convergence performance of the proposed algorithm. Since observation 1 has revealed that the VLC uplink can hardly support a high uplink performance requirement, we can remove VLC to reduce the action space. Standard Q learning combined with this reduced action set is denoted by algorithm 2. Standard Q learning combined with the reuse of learning experience suggested by observation 2 is denoted by algorithm 3. Finally, the proposed Q learning with knowledge transfer (i.e., Q learning using observations 1 and 2), given in the algorithm table, is denoted by algorithm 4. In addition, random selection in each slot and the standard Q learning (algorithm 1) are added for comparison.
As we can see in Fig. 1, random selection obtains the lowest, constant average reward. The other four Q learning based algorithms converge after a certain number of iterations. We can observe that: i) algorithm 3 converges much faster than algorithm 1 (the standard Q learning); ii) algorithm 2 achieves a significant gain in average reward compared with algorithm 1 (the standard Q learning); and iii) the proposed algorithm performs the best in terms of both convergence speed and average reward. These results indicate that observation 2 improves the convergence speed and observation 1 improves the achieved performance. Note that the fast convergence (less than 50 iterations) of the proposed algorithm is important for practical applications. The performance comparison for different maximal numbers of users in each network in Fig. 2 further shows that the proposed algorithm is the best. Moreover, its performance gain grows as the maximal user number increases.
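Convergence comparisons of this kind are often read off a running average of the per-slot reward. The Python sketch below shows one way such a curve and a simple convergence point could be computed; the helper names, the tolerance and the definition of convergence are assumptions for illustration, and the paper does not state how the curves in Fig. 1 were produced.

```python
def running_average(rewards):
    """Average reward over the first k slots, for k = 1..len(rewards)."""
    averages, total = [], 0.0
    for k, r in enumerate(rewards, start=1):
        total += r
        averages.append(total / k)
    return averages


def iterations_to_converge(averages, tol=0.01):
    """First index from which the curve stays within tol of its final value."""
    final = averages[-1]
    for k in range(len(averages)):
        if all(abs(v - final) <= tol for v in averages[k:]):
            return k
    return len(averages)
```

Applying both helpers to the reward traces of two learners gives a direct, if crude, way to compare how quickly each one settles and at what average reward.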

Conclusion
In this letter, we studied the indoor network selection problem in a dynamic environment, taking into account both the uplink and downlink performance requirements of traffic. We first formulated network selection, differentiating the network performance and traffic requirements of uplink and downlink, as a learning problem. On this basis, we exploited context information by resorting to transfer learning and proposed a reinforcement learning algorithm with knowledge transfer. The simulation results revealed that the introduction of transfer learning could significantly improve both the convergence speed and the achieved performance of the reinforcement learning based network selection algorithm.

2. Downlink dominant traffic: In contrast, downloading files and watching online video mainly depend on the downlink throughput and can be classified as downlink dominant traffic. Since most existing works focus on this traffic type, the utility can be easily derived by explicitly using the downlink throughput in existing utility models. For instance, the file download utility can use the above model with the uplink throughput replaced by the downlink throughput. Video traffic shows a threshold effect on throughput.
The refined action set is selected in the 7th line of the algorithm. Observation 2 indicates that the load distribution laws of the same time period across different weekdays are approximately the same; thus, we can reuse the experience learned in the past. In the algorithm, the context-specific learning experience, in the form of Q tables, is stored in a database. Once a learning record is found for the current context, the learned Q table is reused.

Table 1. Parameter set