DualBLESS: Bufferless Router with Dual Ejection Ports for 2D and 3D NoC

In this paper, we propose a 1-cycle bufferless router with dual ejection ports (called DualBLESS) for 2D and 3D Network-on-Chip. To achieve high performance, the router uses a simple Route Computer module and a MUX module in place of the Flit Ejector module and the MUX module of a baseline bufferless router. Simulation results under six synthetic workloads show that the two proposed DualBLESS routers achieve lower average packet latency and higher throughput than the baseline 2D and 3D bufferless routers.


INTRODUCTION
As transistors continue to scale with Moore's law, the number of cores on a single chip keeps increasing in order to use the transistors efficiently. Network-on-Chip (NoC) has been shown to be feasible and easy to scale for supporting a large number of cores [1].
Previous on-chip network designs commonly assumed that each router in the network contains buffers to store the packets transmitted within the network. Although buffers can improve the network bandwidth utilization and reduce the packet loss and misroute rates, they have drawbacks: buffers take up a significant fraction of NoC power [2] and area [3]. Thus, bufferless NoC designs have emerged, and several recent works have investigated how to eliminate the in-router buffers [4,5]. A bufferless router contains only pipeline registers, so it can reduce the power and area of the router significantly. Bufferless routers can be separated into two classes: the drop-based router [5] and the deflection-based router [6,7]. The drop-based bufferless router needs additional logic to deal with dropped packets, which increases hardware cost. In a deflection-based bufferless router, packets are deflected to another output port if the output port with the shortest distance to the destination node is not available. Because deflection routing is a non-shortest-path routing algorithm, livelock must be avoided. At present, deflection routing uses packet priority to avoid livelock. However, this scheme results in a very long critical path, which limits the frequency of the router. CHIPPER [7] was proposed to eliminate the complex logic of the output port allocator. But the CHIPPER router still adopts two pipeline stages and introduces a complex rule to avoid livelock. BLESS_PERM [8] is a simple 1-cycle high-performance bufferless router, which uses a simple permutation network to replace the serialized allocator and switch in order to reduce the critical path length. However, the BLESS_PERM router still has low throughput and high latency when the injection rate increases.
In this paper, we propose a 1-cycle bufferless router with dual ejection ports for 2D and 3D NoC (called DualBLESS). This router adds an additional ejection port to the baseline 2D and 3D routers [8,9], so that it can eject two flits in one cycle. The DualBLESS router decreases the flit deflection rate, which decreases packet latency and improves network throughput. Experimental results indicate that, under synthetic workloads, the 2D and 3D DualBLESS routers have lower average packet latency than the baseline 2D and 3D bufferless routers.
The rest of the paper is organized as follows. The baseline 2D and 3D bufferless routers are introduced in Section 2. Section 3 proposes the 1-cycle 2D and 3D bufferless routers with dual ejection ports. In Section 4, simulation results are presented and analyzed, followed by the conclusion in Section 5.

Baseline 2D bufferless router
This paper adopts the BLESS_PERM router as the baseline 2D bufferless router. The structure of the baseline 2D bufferless router is shown in Figure 1.
This router works as follows: the flit ejector judges whether input flits have arrived at the destination node or not. If the destination of the flit is the current node, the flit ejector produces the selection signal of the

Shaojun Wei
The Department of Electronic Systems, Tsinghua University, Beijing, China

Each stage includes two 2*2 permutation cells. When two flits contend for the same output of a permutation cell, the permutation cell routes the flit with the higher priority to its shortest-path output port and misroutes the other flit to the other output. When two or more flits arrive at the destination node at the same time, the 4-to-1 multiplexer selects only one flit for the destination node and misroutes the other flits, which increases flit latency and reduces the throughput. Therefore, it is necessary to redesign the router so that it can accept multiple flits at the same node in a cycle and thereby improve the performance of the router.

Baseline 3D bufferless router
The baseline 3D bufferless router is a 1-cycle router with a 3-stage permutation network [9]. The structure of the baseline 3D bufferless router is shown in Figure 2. This router works similarly to the baseline 2D bufferless router. The difference is that the 4-to-1 multiplexer is replaced by a 6-to-1 multiplexer and the permutation network contains 3 stages.

DualBLESS router for 2D NoC
Although the BLESS_PERM router can enhance the performance of the bufferless router, it can only eject [...] 4. The header of the flit has 32 bits, and the data of the flit has 96 bits. There are four fields in the header of the flit. Both the destination address and the source address use relative addressing. In each address field, six bits are the row address and the other six bits are the column address. The seven-bit hop counter records the number of hops the flit has traveled through the network. A flit with a larger hop count has priority and can be routed to its desired port. When a flit goes through a router, both address fields and the hop count field are updated in the Header Update module in Figure 3. The empty bit in the flit shows whether the flit is valid or not. Each flit of a packet contains a header field, so flits can be routed independently in the network and may arrive at the destination out of order. Therefore, a reassembly buffer is needed in the destination node to reorder the flits of the arriving packet.
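The header layout described above (two 12-bit addresses, a 7-bit hop counter and a 1-bit empty flag, 32 bits in total) can be illustrated with a small packing routine. This is only a sketch: the bit ordering of the fields, the names `FlitHeader` and `FIELD_LAYOUT`, and the treatment of the relative addresses as raw unsigned bit patterns are assumptions made for illustration, not details given in the text.

```python
from dataclasses import dataclass

# Field order (most significant first) is an assumption; only the widths
# come from the description: 6+6 destination, 6+6 source, 7 hops, 1 empty.
FIELD_LAYOUT = [
    ("dst_row", 6), ("dst_col", 6),
    ("src_row", 6), ("src_col", 6),
    ("hops", 7), ("empty", 1),
]


@dataclass
class FlitHeader:
    dst_row: int
    dst_col: int
    src_row: int
    src_col: int
    hops: int
    empty: bool

    def pack(self) -> int:
        """Pack the fields into one 32-bit word."""
        word = 0
        for name, width in FIELD_LAYOUT:
            value = int(getattr(self, name))
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "FlitHeader":
        """Inverse of pack(): peel fields off from the least significant end."""
        values = []
        for _, width in reversed(FIELD_LAYOUT):
            values.append(word & ((1 << width) - 1))
            word >>= width
        empty, hops, src_col, src_row, dst_col, dst_row = values
        return cls(dst_row, dst_col, src_row, src_col, hops, bool(empty))
```

The field widths sum to exactly 32 bits, which matches the stated header size; the 96-bit data part of the flit is not modeled here.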
Because flits are not buffered in the router, the deflection routing algorithm is deadlock-free. When the shortest-path port is occupied by another flit, the flit is deflected to another port, away from the destination node, which may produce livelock so that the flit never arrives at the destination. Thus it is necessary to restrict the deflection to avoid livelock. Each permutation cell computes the shortest-path port according to the destination addresses of its two flits. The flit with the larger hop count is routed through its shortest-path port, and the other flit is deflected to the other port. The two stages of permutation cells guarantee that the higher-priority flit is routed through its shortest-path port, which avoids livelock.
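The arbitration performed by a single 2*2 permutation cell, as described above, can be sketched as follows. The names `Flit` and `permutation_cell`, the encoding of the desired output as 0 or 1, and the tie-breaking rule (the first input wins on equal hop counts) are assumptions for illustration; the text only states that the flit with the larger hop count gets its shortest-path port.

```python
from typing import NamedTuple, Optional, Tuple


class Flit(NamedTuple):
    hops: int    # hop counter from the flit header; larger means higher priority
    wants: int   # desired (shortest-path) output of this cell: 0 or 1


def permutation_cell(
    a: Optional[Flit], b: Optional[Flit]
) -> Tuple[Optional[Flit], Optional[Flit]]:
    """Return (out0, out1) for a 2*2 cell with inputs a and b (None = empty)."""
    if a is None and b is None:
        return (None, None)
    # Pick the higher-priority flit; ties go to input a (assumed rule).
    if b is None or (a is not None and a.hops >= b.hops):
        winner, loser = a, b
    else:
        winner, loser = b, a
    outs = [None, None]
    outs[winner.wants] = winner      # winner always gets its desired output
    outs[1 - winner.wants] = loser   # loser takes the remaining output
    return (outs[0], outs[1])
```

Note that the loser is deflected only if it wanted the same output as the winner; if the two flits want different outputs, both are satisfied.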
In Figure 3, the Route Computer module also decides whether the local node can inject a flit into the network or not. If the number of input flits is less than four, the local node can inject a flit into the empty input port of the permutation cell. The order of output ports of the permutation cells is South, North, West, and East,

DualBLESS router for 3D NoC
Although the bufferless router with the permutation network can enhance the clock frequency of the baseline 3D bufferless router, the router can only eject one flit in a cycle. A router in the center of the network may have six flits arriving at the destination node at the same time. Since this router can only eject one flit in a cycle, too many flits are deflected away from the destination node. This may increase the flit latency and can easily drive the network into saturation. Therefore, we redesign the flit ejector module and the multiplexer to add a second flit ejection port, so that the router can eject two flits in a cycle. The structure of the 3D DualBLESS router with two ejection ports is shown in Figure 6. This router works similarly to the 2D DualBLESS router.

Experiment results for 2D router
We evaluate the performance of the two DualBLESS routers under six synthetic workloads using a cycle-accurate NoC simulator developed in VHDL. The simulations are performed on an 8*8 2D mesh. Each packet contains 4 flits. For synthetic workloads, we use four traffic patterns: uniform random, transpose, bit complement and bit reverse. We measure throughput and average packet latency in the experiment. The throughput of the network is defined at the saturation point of the network, that is, the maximum accepted traffic. The formula is as follows:
Throughput = N_total_received_flits / (N_nodes × T_measure_time)
The unit of throughput is flits/cycle/node, where N_total_received_flits represents the total received flits of all nodes in the network, N_nodes is the number of nodes in the network and T_measure_time is the total measurement time in the simulation.
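The throughput definition above (total received flits normalized by the number of nodes and the measurement time) can be computed as in the following minimal sketch. The function and parameter names are hypothetical, and the sample numbers in the usage note are made up for illustration; they are not results from the experiments.

```python
def throughput(total_received_flits: int, num_nodes: int, measure_cycles: int) -> float:
    """Accepted traffic in flits/cycle/node.

    total_received_flits: flits received by all nodes during measurement
    num_nodes:            number of nodes in the network (64 for an 8*8 mesh)
    measure_cycles:       length of the measurement window in cycles
    """
    if num_nodes <= 0 or measure_cycles <= 0:
        raise ValueError("num_nodes and measure_cycles must be positive")
    return total_received_flits / (num_nodes * measure_cycles)
```

For instance, a hypothetical run in which 64 nodes receive 6400 flits over 1000 cycles gives `throughput(6400, 64, 1000)`, that is, 0.1 flits/cycle/node.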
The packet latency is defined as the time between the source node producing the first flit and the destination node receiving the last flit. The latency includes the time the flit waits in the buffer queue before being injected into the network. The formula of the average packet latency is shown as follows:

Experiment result for 3D DualBLESS router
We evaluate the performance of the two 3D bufferless routers under synthetic workloads using a cycle-accurate NoC simulator developed in VHDL. The simulations are performed on a 4*4*4 3D mesh network. Each packet contains 4 flits. We use six synthetic traffic patterns: uniform random, transpose, bit complement, bit reverse, shuffle and tornado. Other configurations are the same as for the 2D DualBLESS router. We measure throughput and average packet latency in the experiment.
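The paper does not spell out how each traffic pattern maps a source node to a destination node. The sketch below uses the definitions commonly found in the interconnection-network literature for a 64-node network with 6-bit node indices, which is an assumption about what the simulator does. Tornado is omitted because it is defined per dimension and depends on how node indices map to mesh coordinates, which is not specified here. All function names are hypothetical.

```python
import random

ADDRESS_BITS = 6                 # 64 nodes: 8*8 or 4*4*4 (assumed flat index)
NUM_NODES = 1 << ADDRESS_BITS
MASK = NUM_NODES - 1


def uniform_random(src: int, rng=random) -> int:
    """Any node with equal probability (may equal src; simulators differ here)."""
    return rng.randrange(NUM_NODES)


def transpose(src: int) -> int:
    """Swap the upper and lower halves of the index bits."""
    half = ADDRESS_BITS // 2
    return ((src << half) | (src >> half)) & MASK


def bit_complement(src: int) -> int:
    """Invert every index bit."""
    return ~src & MASK


def bit_reverse(src: int) -> int:
    """Reverse the order of the index bits."""
    return int(format(src, f"0{ADDRESS_BITS}b")[::-1], 2)


def shuffle(src: int) -> int:
    """Rotate the index bits left by one position."""
    return ((src << 1) | (src >> (ADDRESS_BITS - 1))) & MASK
```

For example, source node `0b000110` is sent to `0b110000` under transpose and to `0b111001` under bit complement.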

CONCLUSION
In this paper, we propose a 1-cycle high-performance bufferless router with dual ejection ports for 2D and 3D NoC. The router uses a simple Route Computer module and a MUX module instead of the Flit Ejector module in the BASEBLESS router to achieve high performance. Compared with the baseline 2D/3D routers, the proposed router achieves lower packet latency and higher throughput.
Keywords: DualBLESS; bufferless router; dual ejection ports
DOI: 10.1051/
© Owned by the authors, published by EDP Sciences, 2015

[...] ejector also judges whether the local node can inject a flit or not. If the number of input flits is less than 4, the local node can inject a flit. The permutation network of the router includes two stages.

Figure 1. Baseline 2D bufferless router

The Route Computer module judges how many flits have arrived at the destination. If only one flit arrives at the destination node, the Route Computer module produces the selection signals of the MUX module. If more than one flit arrives at the destination node, the Route Computer module determines the two flits that have larger hop counts than the other flits and produces the corresponding selection signals of the MUX module. The MUX module accepts the signals from the Route Computer module to eject one or two flits in a cycle. The work flow of the MUX module is shown in Figure 5.
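The selection performed by the Route Computer module can be sketched in software as follows. This is an illustrative model only, not the hardware design: the names `Flit`, `select_ejections` and `can_inject`, the `at_destination` flag (standing in for the address comparison) and the tie-breaking by port index are assumptions. The injection check follows the rule stated in the paper that the local node may inject when fewer than four input flits are present.

```python
from typing import List, NamedTuple, Optional


class Flit(NamedTuple):
    hops: int             # hop counter from the flit header
    at_destination: bool  # True if this router is the flit's destination


def select_ejections(inputs: List[Optional[Flit]], max_eject: int = 2) -> List[int]:
    """Return the input-port indices whose flits are ejected this cycle.

    At most max_eject flits are chosen, preferring larger hop counts.
    Ties are broken by lower port index (assumed; not stated in the paper).
    """
    candidates = [
        i for i, f in enumerate(inputs) if f is not None and f.at_destination
    ]
    # Stable sort keeps port order among flits with equal hop counts.
    candidates.sort(key=lambda i: inputs[i].hops, reverse=True)
    return sorted(candidates[:max_eject])


def can_inject(inputs: List[Optional[Flit]], num_ports: int = 4) -> bool:
    """The local node may inject if fewer than num_ports input flits are present."""
    return sum(f is not None for f in inputs) < num_ports
```

Setting `max_eject=1` models the single ejection port of the baseline router, which makes the difference in deflected flits easy to compare in a simple experiment.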

Figure 3. DualBLESS router with two output ports