A scalable ASIP for BP Polar decoding with multiple code lengths

Abstract: In this paper, we propose a flexible and scalable BP Polar decoding application-specific instruction-set processor (PASIP) that supports multiple code lengths (64 to 4096) and any code rate. High throughput and sufficient programmability are achieved by the single-instruction-multiple-data (SIMD) based architecture and specially designed Polar decoding acceleration instructions. The synthesis result using 65 nm CMOS technology shows that the total area of PASIP is 2.71 mm². PASIP provides a maximum throughput of 1563 Mbps (for N = 1024) at a clock frequency of 400 MHz. Comparison with state-of-the-art Polar decoders reveals PASIP's high area efficiency.


Introduction
Polar codes have been proved to be the first class of error correction codes that can achieve the Shannon capacity [1]. In 2016, 3GPP adopted the Polar code as the channel coding scheme for the control channel of the Enhanced Mobile Broadband (eMBB) scenario in the 5G standard [2]. Polar codes have therefore drawn increasing attention in recent years.
Basically, Polar codes can be decoded by two kinds of algorithms: the successive cancellation (SC) algorithm and the belief propagation (BP) algorithm. SC decoders suffer from high decoding latency and low throughput caused by their serial processing nature. Compared with SC decoders, BP decoders provide higher throughputs because of their inherently high parallelism. However, BP decoders suffer from higher computational complexity and require larger memories for implementation. There have been several BP decoding proposals in the literature for solving these problems. Yuan et al. proposed an early stopping criterion that reduces the average BP decoding complexity by about 30% [3]. Park et al. [4] and Sha et al. [5] each reduced the memory requirement of BP decoding by about half, through double-column processing and combined-stage processing, respectively.
The BP Polar decoders mentioned above are all implemented as application-specific integrated circuits (ASICs), each supporting only one fixed code length. However, eMBB control channel coding requires support for multiple code lengths [6], and ASICs lack flexibility for this application scenario. Pamuk proposed an FPGA implementation for Polar decoding with multiple code lengths [7], but the throughput of this FPGA-based design is only about 50 Mbps at a clock frequency of 160 MHz, which is too low for 5G communication systems. An application-specific instruction-set processor (ASIP) is a promising solution for a predefined application scope that requires high performance and sufficient flexibility at the same time [8], and is thus more suitable for Polar decoding in complex application scenarios.
In this paper, we propose a flexible and scalable Polar decoding ASIP (PASIP) that supports multiple code lengths and any code rate. PASIP adopts a single-instruction-multiple-data (SIMD) architecture: the operands are divided among multiple parallel windows and processed in parallel to enhance the decoding throughput. A specially designed instruction set containing Polar decoding acceleration instructions ensures both the high throughput and the sufficient programmability of PASIP. PASIP is synthesized using 65 nm CMOS technology, and the synthesis results show that it achieves much higher area efficiency than the sum of multiple single-code BP Polar decoders.
The rest of this paper is arranged as follows. Section 2 introduces some necessary background on the Polar BP decoding algorithm. Section 3 presents the architecture of PASIP in detail. Section 4 analyses the performance of PASIP. Section 5 provides the synthesis results and comparisons. Finally, Section 6 concludes the paper.

BP decoding algorithm
The BP Polar decoding procedure can be described using a factor graph [1]. Figure 1 (a) shows a unified factor graph with the code length N = 8 [7]. The factor graph contains n computational stages (n = log2 N), each consisting of N/2 basic computational units (BCUs). The structure of the BCU is shown in Figure 1 (b). The factor graph also contains n + 1 columns (Col.), each consisting of N nodes. The coordinates (i, j) denote the j-th node in Col.i. Each node saves one intermediate result. All nodes are initialized before decoding begins. The nodes in Col.0 (the leftmost column) are set to infinity or 0 according to the locations of the frozen bits and the information bits, respectively. The nodes in Col.n are initialized with the log-likelihood ratios (LLRs) of the received data. All other nodes are set to 0.
As shown in Figure 1 (a), the BP Polar decoding procedure can be divided into three steps: the right-to-left propagation (LP), the left-to-right propagation (RP), and the final result hard decision. During LP, the nodes are updated column by column with right-to-left messages (Li,j) calculated as Formulas (1) and (2); during RP, they are updated with left-to-right messages (Ri,j) calculated as Formulas (3) and (4).
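The update equations themselves did not survive into this copy. A reconstruction in the standard scaled min-sum form, consistent with the kernel f(x, y), the 0.9375 scaling, and the hard decision described in Section 3 (the exact index convention of the original Formulas (1)-(5) may differ), is, for a BCU with left nodes (i, j1), (i, j2) and right nodes (i+1, j1), (i+1, j2):

```latex
\begin{align}
L_{i,j_1} &= f\!\left(L_{i+1,j_1},\; L_{i+1,j_2} + R_{i,j_2}\right) \tag{1}\\
L_{i,j_2} &= f\!\left(L_{i+1,j_1},\; R_{i,j_1}\right) + L_{i+1,j_2} \tag{2}\\
R_{i+1,j_1} &= f\!\left(R_{i,j_1},\; L_{i+1,j_2} + R_{i,j_2}\right) \tag{3}\\
R_{i+1,j_2} &= f\!\left(R_{i,j_1},\; L_{i+1,j_1}\right) + R_{i,j_2} \tag{4}\\
\hat{u}_j &= \begin{cases}0, & L_{0,j} + R_{0,j} \ge 0\\ 1, & \text{otherwise}\end{cases} \tag{5}
\end{align}
```

where f(x, y) = 0.9375 · sign(x) · sign(y) · min(|x|, |y|) is the scaled min-sum kernel.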

PASIP design

Top level architecture and pipeline structure
The top level architecture and the pipeline structure of PASIP are shown in Figure 2. PASIP adopts the SIMD architecture, and accelerates the decoding calculation through data-level parallelization. P = 16 homogeneous parallel windows are adopted in this paper.
PASIP consists of three submodules: the data path, the memory subsystem, and the control path. The decoding computation is executed in the data path, which is composed of 16 homogeneous soft-in-soft-out (SISO) modules. The memory subsystem consists of a memory array, an address generation unit (AGU), and permutation networks. There are 8P 128-bit × 32 memory banks in the memory array (8 in each parallel window, 128 in total) for saving all N(n + 1) nodes of the factor graph. The AGU calculates the address signals, including the memory access addresses, the write enable signal (wen), the sliding window number (w), and the sliding window address (t). The permutation networks handle the data shuffling between the memories and the SISOs. The other blocks belong to the control path, which mainly provides the programmability of PASIP.
The pipeline of PASIP contains six stages, including an instruction fetch (IF) stage, an instruction decoding (ID) stage, an address generation (AG) stage, an execution (EXE) stage, and a write-back (WB) stage. During the BP decoding process, the intermediate results of one stage are part of the inputs of the next stage. To avoid the potential read-after-write (RAW) data hazard, a data forwarding technique is introduced: the permuted SISO outputs are fed back to the input of the EXE stage.

BCU segmentation and data path design
As shown in Figure 1, BCUs in the same computational stage have no data dependencies. Thus, one stage of BCUs can be divided among multiple parallel windows and computed in parallel.
The BCUs are segmented using parallel windows (PWs) and sliding windows (SWs) according to the segmentation scheme shown in Figure 3. One position in the SW contains k = 32 BCUs. The number of positions in one SW is referred to as the sliding window length (SWL). One SISO handles only one SW position at a time, so one stage of computation takes SWL clock cycles in total. In this way, the fixed hardware is able to support multiple code lengths.

The computation logic of the SISO module is designed according to Formulas (1) to (5). Since Formulas (1) and (2) have the same format as Formulas (3) and (4), the same logic serves both propagation directions. The function f(x, y) is implemented as shown in Figure 4 (b), and the truth table of its selector is shown in Table 1. The multiplication by 0.9375 is replaced by x + (-x >> 4) for easier hardware implementation.
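The selector of Table 1 and the scaled kernel f(x, y) can be modelled as follows. This is a behavioural sketch of the description above, not the gate-level design; Python's arithmetic right shift on negative integers matches the hardware shift used here.

```python
def selector(x: int, y: int) -> int:
    """Selector of Table 1: computes sign(x) * sign(y) * min(|x|, |y|)
    using only the four sign comparisons listed in the truth table."""
    if x >= y:
        return y if x >= -y else -x
    return x if x >= -y else -y


def f(x: int, y: int) -> int:
    """Kernel f(x, y) of Figure 4 (b): the selector output scaled by
    0.9375, with the multiplication realized as s + (-s >> 4),
    i.e. s - s/16, as in the hardware."""
    s = selector(x, y)
    return s + (-s >> 4)
```

For example, f(16, 20) gives 15 (= 0.9375 × 16), and f(20, -7) gives -7, since the correction term vanishes for small magnitudes.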
Guard bits are inserted before inverse operations and add operations in order to eliminate data overflow. The saturation block shown in Figure 4 (c) is introduced to saturate the computation results before write-back. If the sign bits are not equal, the data overflows, and the saturation result is +127 or -128 according to the most significant bit (MSB) of the 10-bit input; otherwise, the saturation result is the low 8 bits of the 10-bit input.
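The saturation behaviour can be sketched as a small value-level model (our own model of the block in Figure 4 (c), operating on ordinary signed integers rather than raw bit vectors):

```python
def saturate(v: int) -> int:
    """Saturate a 10-bit two's-complement intermediate value to 8 bits:
    on overflow the result clamps to +127 or -128 according to the sign
    of the input; otherwise the low 8 bits (the value itself) are kept."""
    if v > 127:
        return 127
    if v < -128:
        return -128
    return v
```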
The final result hard decision is implemented as shown in Figure 4 (d). According to Formula (5), the MSB of the sum of R0,j and L0,j can be taken directly as the final result.

Table 1. Truth table of the selector.
Conditions         Output
x ≥ y, x ≥ -y      y
x ≥ y, x < -y      -x
x < y, x ≥ -y      x
x < y, x < -y      -y

Memory subsystem design
According to the requirements mentioned above, we propose a conflict-free memory management scheme that is applicable to any code length N. We allocate the data nodes at fixed positions in the memories and shuffle them using the permutation networks between the memories and the SISOs. The N nodes in one column are separated among the 16 PWs, and PW p saves only nodes [Np/16, ..., N(p+1)/16 - 1] of each column (0 ≤ p ≤ 15). All PWs share the same data allocation pattern. Figure 6 gives a data allocation example in PW 0 when N = 4096; each column takes up only 4 memory banks. The columns linked by a line are accessed together in the SW marked on the line. The data allocation pattern of each column is also the same; Figure 6 shows the data arrangement inside Col.12 as an example. The first half of the nodes is arranged simply in order, while the other nodes are arranged in a shifted order. The nodes filled with the same colour are accessed in the same clock cycle when used as the left column, and the nodes with the same memory address are accessed in the same clock cycle when used as the right column.

As the data nodes are allocated at fixed positions in the memories, the read address and the write address of one node are the same. From Figure 6, we can see that the memory access addresses are related to the corresponding sliding window number w and sliding window address t. Table 2 summarizes the address calculation methods for the memory banks in one PW under different conditions, wherein ad1, ad2, ad3, and ad4 are calculated as Formulas (6) to (9). As all PWs share the same data allocation pattern, four access addresses are enough to access all 128 memory banks at a time.
The permutation networks are controlled by a 3-bit signal, and there are 8 different permutation patterns. The value and the meaning of each bit in the control signal are shown in Table 3. Figure 7 gives a permutation example when w = 0, t = 0, and N = 4096.

Control path and programmability design
The programmability of PASIP is realized by the control path hardware and a specially designed instruction set.
The program memory stores the instructions, which are decoded by the instruction decoder into control signals for the other hardware modules. The instruction set contains several kinds of instructions: system instructions (such as RESET and NOP), move instructions (in charge of data exchange among registers, memories, and outer storages), and, most importantly, the Polar decoding acceleration instructions.
From the decoding process shown in Figure 1, we extract three Polar decoding acceleration instructions; the instruction names and the corresponding functions are listed in Table 4. These three instructions introduce two control signals, "aw" and "as", which control the sliding window number w and the sliding window address t, respectively. There are four operators for "aw" and "as": "++" means increasing by 1, "--" means decreasing by 1, "clr" means setting to 0, and "set" means setting to the maximum value. With these instructions and control signals, PASIP provides a fully programmable decoding procedure, as shown in Figure 3.

The control path also contains a parameter register file. The decoding settings and the code configurations, including the total stage number of the factor graph n, the code length N, the parallel window length, the SWL, and the hardware loop settings, are saved in the parameter registers. Support for decoding with multiple code lengths is realized by adjusting these parameter registers using move instructions. Figure 8 shows a slice of an assembly program for Polar BP decoding when SWL = 0, where "rep = xx" makes the instruction self-repeat "xx" times.
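The decode schedule that these instructions program can be sketched as follows. This is our own behavioural model (the actual instruction mnemonics of Table 4 are not reproduced here): LP sweeps the computational stages right-to-left, RP sweeps them left-to-right, and every stage is repeated over all SWL sliding-window positions, one position per clock cycle.

```python
def bp_schedule(n: int, swl: int, iterations: int):
    """List the (step, stage, sliding-window address) operations of a
    full BP decode: per iteration, LP over stages n-1..0 and RP over
    stages 0..n-1, each stage taking swl cycles, then a hard decision."""
    ops = []
    for _ in range(iterations):
        for stage in reversed(range(n)):   # right-to-left propagation (LP)
            ops += [("LP", stage, t) for t in range(swl)]
        for stage in range(n):             # left-to-right propagation (RP)
            ops += [("RP", stage, t) for t in range(swl)]
    ops.append(("HARD_DECISION",))
    return ops
```

One iteration thus costs 2·n·SWL cycles, which is the cycle count used in the performance analysis of Section 4.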

Performance analysis
The performance of PASIP shows two phases. From Subsection 3.2, we know that PASIP can handle at most 512 BCUs per clock cycle. When N ≤ 1024, one propagation through a stage finishes within one clock cycle, so SWL = 1. The characteristic of this phase is summarized as Formula (10): the throughput (TP) increases as N increases, and the maximum TP is reached when N = 1024. When N > 1024, SWL = N/1024, which is always larger than 1. The decoding throughput in this phase is calculated as Formula (11), and thus TP decreases as N increases.
The performance and the supported code length range discussed above apply when PASIP adopts P = 16, k = 32, and 128-bit × 32 memories; PASIP itself is scalable. The supported minimum code length is 2k, and the supported maximum code length is determined by the depth of the memories, so the supported range can be extended by decreasing k or simply increasing the memory depth. The maximum throughput is determined by Pk, so PASIP can satisfy the higher throughput requirements of future high-speed communication standards by increasing P or k.
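The scaling relations above can be summarized in a small model. This is our own sketch inferred from the text: the SWL expression follows from P·k BCUs being handled per cycle, and it is not copied from Formulas (10) and (11).

```python
import math


def pasip_scaling(N: int, P: int = 16, k: int = 32):
    """Sketch of PASIP's scaling relations: a stage of N/2 BCUs needs
    SWL = max(1, (N/2)/(P*k)) cycles, and the minimum supported code
    length is 2k (one sliding-window position)."""
    n = int(math.log2(N))                 # number of computational stages
    swl = max(1, (N // 2) // (P * k))     # sliding window length
    return {"n": n, "SWL": swl, "N_min": 2 * k}
```

With the default P = 16 and k = 32, this reproduces the two phases of Section 4: SWL = 1 for all N ≤ 1024 and SWL = N/1024 beyond that.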

Synthesis results and comparison
PASIP is synthesized with Synopsys Design Compiler using a 65 nm CMOS technology.
The synthesis results show that the total area is 2.71 mm² when PASIP adopts P = 16 and k = 32, wherein the memory subsystem takes up 74.67%, the SISO modules take up 19.43%, and the control path only 5.90%. Table 6 compares PASIP with state-of-the-art Polar decoders, wherein all areas, frequencies, and throughputs are normalized to a 65 nm process.
Compared with the multi-code SC Polar decoder proposed by Coppolino et al. [9], PASIP supports a smaller code length range and occupies a larger total area, mainly because of its larger memory consumption. However, owing to its high parallelism, PASIP provides much higher throughputs and better area efficiency than this multi-code SC Polar decoder.
PASIP is also compared with other BP Polar decoders. To the best of our knowledge, there were no other CMOS implementations of multi-code BP Polar decoders in the literature as of June 2018; therefore, PASIP is compared with the single-code BP Polar decoders [5, 10] instead. PASIP occupies a larger total area and has lower area efficiency than the single-code decoders supporting N = 1024, mainly because PASIP devotes a larger memory area to supporting larger code lengths (2048 and 4096). On the other hand, PASIP achieves much higher flexibility than the single-code BP decoders: it supports multiple code lengths (from 64 to 4096) with only a little hardware overhead compared with the single-code decoder [5] that supports only N = 4096. PASIP occupies a much smaller total area than the sum of multiple single-code decoders, and thus achieves better area efficiency.

Conclusion
In this paper, we have proposed PASIP, a scalable Polar decoding ASIP in 65 nm CMOS technology. The SIMD architecture and the specially designed Polar decoding acceleration instructions ensure both high performance and programmability. PASIP supports multiple code lengths, from 64 to 4096, with a total area of 2.71 mm². It provides a maximum throughput of 1563 Mbps at the maximum frequency of 400 MHz when N = 1024. PASIP achieves higher flexibility and higher area efficiency than the sum of multiple single-code BP Polar decoders, and is thus more practical for future base stations and terminals.