Traffic model using a novel sniffer that ensures user data privacy

Nowadays, network traffic is changing because of new protocols, devices and applications. Therefore, it is necessary to analyze its impact on services and resources. Network traffic classification is an important prerequisite for tasks such as traffic engineering and quality-of-service provisioning. In this paper, we analyze the packet size variable of the traffic in a university campus network, using data collected with a novel sniffer that ensures user data privacy. We separate the collected data by traffic type, protocol and application. Finally, we estimate a traffic model that represents this traffic by means of a Poisson process and compute its associated numerical parameters.


Introduction
Understanding the behavior of network traffic is a crucial prerequisite for planning traffic engineering and applying quality of service, as well as for traffic modeling and prediction.
Additionally, network traffic is changing because of convergence (voice, data and video). Applications are heterogeneous and complex, and the number of mobile devices accessing the networks is increasing exponentially. New protocols such as IPv6 are present on the Internet, and technologies such as the Internet of Things (IoT) will allow the connection of millions of new devices. The Cisco Systems forecast and trends study [1] predicts that by 2022 the number of devices connected to IP networks will be more than three times the global population, smartphone traffic will exceed PC traffic, and traffic from wireless and mobile devices will account for 71 percent of total IP traffic.
In packet-based networks, such as the Internet or Local Area Networks (LANs), information is transmitted in discrete packets [2]. To analyze and model network traffic, we can consider two stochastic variables: the packet size and the inter-arrival time [3]. This study focuses on packet size (packet length).
Network traffic is normally measured by active polling or passive monitoring [4]. The active method generates new traffic and injects it into the network, while the passive method consists of monitoring and capturing the existing network traffic. In this work, we use the passive method to capture traffic, analyze the packet headers and produce statistics. One drawback of this method concerns the privacy of the captured data, because traditional sniffers save the entire packet: headers and payload. Passive measurement can be performed at various levels, such as byte, packet, flow, and session [5]. We work at the packet level because most network problems occur at this level, it is independent of the protocols, and it avoids dealing with encrypted payload.
In this work, we propose to develop a sniffer that ensures user data privacy, and we use it to analyze the traffic of a university campus network and estimate a model for that traffic.
The rest of the paper is organized as follows: section 2 reviews related works; section 3 presents the novel sniffer; section 4 shows the data collection, classified by traffic type, protocol and application according to the packet size variable. Section 5 presents the traffic model that characterizes the realistic traffic analyzed. The paper ends with the conclusions in section 6.

Related works
Many works have analyzed network traffic based on packet size, using methods such as statistical analysis, pattern recognition, the length of application messages, packet flows, user behavior, etc. Additionally, these studies have suggested models to simulate realistic network traffic.
In [6], Sinha et al. observed that Internet traffic was bimodal, with packet sizes of 40 and 1500 bytes, unlike the data in [7], which was tri-modal with packet sizes around 40, 765 and 1500 bytes. Wu et al. in [8] analyzed flow records and classified them by application using machine learning. A study for identifying network traffic based on message size analysis is presented in [9], where a Gaussian model is proposed to characterize application-level protocols. Lee et al. in [10] present a study of the self-similarity of traffic using the bandwidth frequency distribution. In [11], a work that classifies network traffic using three approaches, based on transport layer ports, host behavior and flow features, is presented. In [12], Zhang et al. evaluate the amount of UDP and TCP traffic in terms of flows, packets and bytes. A study of the Internet data traffic generated in a university campus, together with a model to predict that traffic, is presented in [13]. Cao et al. in [14] demonstrate that the number of active connections has an effect on traffic characteristics.
MATEC Web of Conferences (2019), CSCC 2019, https://doi.org/10.1051/matecconf/2019 292 2920 0 0
Regarding traffic modelling, Vicari presents in [15] a model for Internet traffic from the user perspective, using distribution functions applied to data. In [16], Maheshwari et al. design a Hidden Markov model for network traffic and validate it for different packet sizes. A study on modeling TCP/IP traffic over a wireless network is presented in [17]. Mueller in [18] specifies a traffic model based on object sizes at the application layer, applied to wireless networks.

Proposed sniffer
One of the critical issues in capturing network traffic is the use of a sniffer, because sniffers normally capture the entire packet, including headers and payload. Network administrators therefore need some kind of confidentiality agreement to avoid problems arising from inappropriate use of user information. This motivates our work. We propose a sniffer that guarantees the privacy of the information by not capturing the payload of the packets. In addition, with the deployment of IPv6, our sniffer should be able to distinguish IPv6 from IPv4 traffic in a dual-stack environment. Finally, the sniffer should have low resource consumption, which allows a more efficient capture of the data. The sniffer, called TinySniff, is written in C and runs under the Linux operating system. It is portable, lightweight software that consumes a small amount of resources (memory and CPU). It can capture traffic in LAN and WLAN scenarios and stores the captured headers in flat files, in text format.
TinySniff is designed to capture the following packet header fields for further analysis: total length (IPv4) or payload length (IPv6), source address, destination address, protocol (IPv4) or next header (IPv6), source port, and destination port, as shown in figure 1. An example of the data in TXT format is shown in figure 2.
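The header-only capture described above can be illustrated with a short sketch. It is written in Python rather than C for brevity and is not TinySniff's actual code: the function name `extract_header_fields`, the dictionary keys, and the assumption of an untagged Ethernet frame (no 802.1Q VLAN tag, no IPv4 fragmentation handling, no IPv6 extension headers) are ours, for illustration only. It reads only the fields listed above and never stores the payload.

```python
import ipaddress
import struct

ETH_HEADER_LEN = 14        # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
PROTO_TCP, PROTO_UDP = 6, 17


def extract_header_fields(frame: bytes):
    """Return the header fields of one Ethernet frame, or None if not IP.

    Hypothetical helper: only header bytes are read; the payload is ignored.
    """
    if len(frame) < ETH_HEADER_LEN:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    ip = frame[ETH_HEADER_LEN:]

    if ethertype == ETHERTYPE_IPV4 and len(ip) >= 20:
        l4_offset = (ip[0] & 0x0F) * 4             # IHL, in 32-bit words
        (length,) = struct.unpack("!H", ip[2:4])   # total length
        protocol = ip[9]                           # protocol
        src, dst = ip[12:16], ip[16:20]
    elif ethertype == ETHERTYPE_IPV6 and len(ip) >= 40:
        l4_offset = 40                             # fixed IPv6 header size
        (length,) = struct.unpack("!H", ip[4:6])   # payload length
        protocol = ip[6]                           # next header
        src, dst = ip[8:24], ip[24:40]
    else:
        return None

    # TCP and UDP both start with source port and destination port.
    src_port = dst_port = None
    ports = ip[l4_offset:l4_offset + 4]
    if protocol in (PROTO_TCP, PROTO_UDP) and len(ports) == 4:
        src_port, dst_port = struct.unpack("!HH", ports)

    return {
        "length": length,
        "src": str(ipaddress.ip_address(src)),
        "dst": str(ipaddress.ip_address(dst)),
        "protocol": protocol,
        "src_port": src_port,
        "dst_port": dst_port,
    }
```

Because the returned record contains only lengths, addresses, protocol numbers and ports, writing it to a text file cannot leak application data, which is the privacy property the sniffer is meant to provide.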

Data collection and analysis
We implement a scenario to capture realistic traffic in a university campus network, shown in figure 3. We install TinySniff on a desktop computer running Ubuntu Linux 16.04 LTS. Its technical specifications are: AMD FX-8300 eight-core processor, 24 GB of RAM, and two Ethernet network interface cards (NICs). One NIC is used for PC management and the other for traffic capture. We connect the capture NIC to a gigabit port of an access-layer Cisco switch and configure this port in trunking mode to access all VLAN traffic.
This work analyzes the packet size variable; the packet length is usually between 40 and 1500 bytes. To analyze the packet size, we group packets into intervals of 10 bytes (0-10, 11-20, 21-30, etc.). Figure 4 shows the behavior of packet size according to traffic type (IPv4, IPv6, ARP). Figures 5 and 6 present the packet size variable for the IPv4 and IPv6 protocols, respectively. The analysis of IPv4 applications (over TCP and UDP) by packet size is shown in figures 7 and 8.
From Figs. 5 to 8, we can see a bimodal traffic distribution, with 48.32% of packets around 60 bytes in size and 38.42% around 1500 bytes. All traffic types contribute to the first mode, while only IPv4 traffic contributes to the second. Within the IPv4 traffic, TCP dominates over UDP and contributes to both modes.
The IPv4 traffic is bimodal too, with 40.27% of packets around 60 bytes and 45.13% around 1500 bytes. TCP packets are the main factor in this behavior, with 41.66% around 60 bytes and 49.28% around 1500 bytes. HTTP, SSL and TLS are the main applications; they represent more than 95% of total IPv4 TCP packets and contribute 41.66% of packets around 60 bytes and 49.28% around 1500 bytes. UDP packets are concentrated mainly around 1400 bytes (38.88%), and the main application behind this behavior is GQUIC. Other UDP applications contribute packets sparsely distributed between 60 and 300 bytes.
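The 10-byte grouping of packet sizes can be sketched as follows. This is an illustrative Python sketch, not the processing code used for the figures; the function names `bin_label` and `size_histogram` are hypothetical, and we assume the first interval is 0-10 and every later interval covers ten values (11-20, 21-30, and so on).

```python
from collections import Counter


def bin_label(size: int, width: int = 10) -> str:
    """Map a packet size in bytes to its interval label, e.g. 60 -> '51-60'."""
    # Round up to the next multiple of `width`; size 0 falls in the first bin.
    upper = max(width, -(-size // width) * width)
    lower = 0 if upper == width else upper - width + 1
    return f"{lower}-{upper}"


def size_histogram(sizes, width: int = 10) -> dict:
    """Return the percentage of packets that fall in each interval."""
    counts = Counter(bin_label(s, width) for s in sizes)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For example, `size_histogram([60, 60, 1500, 40])` assigns half of the packets to the interval 51-60 and a quarter each to 31-40 and 1491-1500.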

Traffic modelling
Based on the analysis of the network traffic in the previous section, we estimate models using the Poisson probability distribution function, by traffic type, protocol and application. For the total traffic presented in fig. 4, the fitted model is a mixture of two Poisson distributions with parameters λ1 = 84.38 and λ2 = 1457.11. The probability that the length of a packet follows the first distribution is 0.545, while the probability that it follows the second distribution is 0.455. The model is therefore the sum of the two Poisson distributions, weighted by these probabilities, as in (1), where x is the occurrence of the packet size variable. Fig. 9 shows the histogram of the data and the simulated model for the total network traffic.
For IPv4 network traffic, the parameters are λ1 = 90.61 and λ2 = 1458.72. The probability that the length of a packet follows the first distribution is 0.469, while the probability that it follows the second distribution is 0.531. The model is shown in (2). For IPv6 network traffic, the model is as in (3), with parameters λ1 = 1083.92 and λ2 = 103.86. The probability that the length of a packet follows the first distribution is 0.0505, while the probability that it follows the second distribution is 0.9495. Figs. 10 and 11 show these simulated models.
Additionally, we present models for the TCP and UDP protocols over IPv4 and IPv6. Table 4 summarizes the parameters of the models, where λ1 represents the average occurrence in interval 1, λ2 represents the average occurrence in interval 2, P1 is the probability of a packet following the first distribution, and P2 is the probability of a packet following the second distribution.
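Written out, a two-component Poisson mixture with the total-traffic parameters quoted above would take the following form. This is a sketch that assumes the standard Poisson probability mass function; the name f(x) for the mixture and the use of P_1 and P_2 for the two probabilities are notation chosen here for illustration.

```latex
f(x) = P_1 \frac{\lambda_1^{x} e^{-\lambda_1}}{x!}
     + P_2 \frac{\lambda_2^{x} e^{-\lambda_2}}{x!},
\qquad P_1 = 0.545,\ \lambda_1 = 84.38,\quad P_2 = 0.455,\ \lambda_2 = 1457.11
```

Because P_1 + P_2 = 1 and each Poisson term sums to one over x = 0, 1, 2, ..., f(x) is itself a valid probability distribution over packet sizes.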
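To make the fitted parameters concrete, the sketch below evaluates a two-component Poisson mixture for the total, IPv4 and IPv6 values quoted in the text. It is an illustration under the assumption that each model is a probability-weighted sum of two standard Poisson distributions; the names `MODELS`, `mixture_pmf` and `mixture_mean` are hypothetical. The Poisson term is computed in log space so that large packet sizes do not overflow the factorial.

```python
import math

# (P1, lambda1, lambda2) as quoted in the text; P2 = 1 - P1.
MODELS = {
    "total": (0.545, 84.38, 1457.11),
    "ipv4": (0.469, 90.61, 1458.72),
    "ipv6": (0.0505, 1083.92, 103.86),
}


def poisson_log_pmf(x: int, lam: float) -> float:
    """log of lam**x * exp(-lam) / x!, using lgamma(x + 1) = log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def mixture_pmf(x: int, p1: float, lam1: float, lam2: float) -> float:
    """Probability of packet size x under the two-component mixture."""
    return (p1 * math.exp(poisson_log_pmf(x, lam1))
            + (1.0 - p1) * math.exp(poisson_log_pmf(x, lam2)))


def mixture_mean(p1: float, lam1: float, lam2: float) -> float:
    """Mean packet size: the weighted average of the two Poisson means."""
    return p1 * lam1 + (1.0 - p1) * lam2
```

Calling `mixture_pmf(x, *MODELS["total"])` for each x between 0 and 1500 gives a curve that can be compared with the measured histogram of packet sizes.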
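The text does not state how the mixture parameters were estimated. One common way to obtain λ1, λ2, P1 and P2 from a list of packet sizes is the expectation-maximization (EM) algorithm, sketched below as an assumption rather than as the procedure actually used. The function name `fit_two_poisson`, the fixed iteration count, and the initialization from the lower and upper halves of the sorted data are illustrative choices.

```python
import math


def _log_pois(x, lam):
    """Log of the Poisson probability of x for rate lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def fit_two_poisson(sizes, iterations=50):
    """Estimate (P1, lambda1, lambda2) of a two-Poisson mixture with EM."""
    data = sorted(sizes)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two packet sizes")
    half = n // 2
    # Start from the means of the smaller and larger halves of the data.
    lam1 = max(sum(data[:half]) / half, 1e-6)
    lam2 = max(sum(data[half:]) / (n - half), 1e-6)
    p1 = 0.5
    for _ in range(iterations):
        # E-step: probability that each packet belongs to component 1.
        resp = []
        for x in data:
            d = (math.log(1.0 - p1) + _log_pois(x, lam2)
                 - math.log(p1) - _log_pois(x, lam1))
            if d > 700:          # component 2 overwhelmingly more likely
                resp.append(0.0)
            elif d < -700:       # component 1 overwhelmingly more likely
                resp.append(1.0)
            else:
                resp.append(1.0 / (1.0 + math.exp(d)))
        # M-step: re-estimate the weight and the two rates.
        w1 = sum(resp)
        w2 = n - w1
        if w1 == 0 or w2 == 0:   # one component emptied; keep last estimate
            break
        lam1 = max(sum(r * x for r, x in zip(resp, data)) / w1, 1e-6)
        lam2 = max(sum((1.0 - r) * x for r, x in zip(resp, data)) / w2, 1e-6)
        p1 = w1 / n
    return p1, lam1, lam2
```

Applied separately to the packet sizes of each traffic type or protocol, such a routine would return one (P1, λ1, λ2) triple per category, with P2 = 1 - P1.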

Conclusions
This paper presents results on the stochastic behavior of the packet size variable, using network traffic measurements from a university campus network. The results show a bimodal traffic distribution with packets around 60 and 1500 bytes. IPv4 packets, mainly TCP packets, have the largest impact on this behavior, and the applications that drive this trend are HTTP and SSL. Network administrators can use these results to design better networks and optimize network traffic in order to define security policies, provision QoS, and ensure efficient utilization of resources.
We developed models that characterize the network traffic using mixtures of Poisson distributions, which provide the best statistical fit to the packet size variable of the dataset considered in this paper. These models simulate the data by traffic type, protocol and application. The research community can use the distribution parameters presented here to build traffic models and apply them in other studies in the areas of computer networking and traffic engineering.