High Performance Systolic Array Core Architecture Design for DNA Sequencer

This paper presents a high performance systolic array (SA) core architecture design for a Deoxyribonucleic Acid (DNA) sequencer. The core implements the Smith-Waterman (SW) algorithm with affine gap penalty scoring. This time-consuming local alignment algorithm guarantees optimal alignment between DNA sequences, but it requires quadratic computation time when performed on standard desktop computers. The use of a linear SA decreases the time complexity from quadratic to linear. In addition, with the exponential growth of DNA databases, the SA architecture is used to overcome the timing issue. In this work, the SW algorithm has been captured in Verilog Hardware Description Language (HDL) and simulated using the Xilinx ISIM simulator. The proposed design has been implemented on a Xilinx Virtex-6 Field Programmable Gate Array (FPGA) and achieves a 90% reduction in core area.


Introduction
DNA is made up of thymine (T), guanine (G), cytosine (C) and adenine (A) nucleotides (NTs) [1]. Biological processes such as mutation produce unknown DNA sequences because they alter the natural DNA sequences [2]. Sequence alignment is therefore important, as it facilitates the detection of mutated DNA sequences by finding regions of similar NTs.
Aligning DNA sequences is a computationally intensive operation. Realistic execution times cannot be achieved on desktop computer systems because sequence databases are growing exponentially, as shown in Fig. 1. Hence, a faster computation platform is needed to overcome this problem. Reconfigurable hardware in the form of FPGAs has been suggested to enhance DNA sequence alignment performance [3]. FPGAs are attractive platforms for speeding up DNA sequence alignment computation compared to a General Purpose Processor (GPP).
SA architecture has been introduced by researchers to further improve the DNA sequence alignment process. An SA is a pipelined network arrangement in which the processing task is divided among several processors. It is built up from a row of data processing units called cells or Processing Elements (PEs). One advantage of the SA architecture is that data is streamed across the array through the local connections between the cells [4]. Sequence alignment algorithms can take advantage of the SA to realize parallelism on FPGAs. The SW algorithm, a type of Dynamic Programming (DP) algorithm, is used to align DNA sequences, where the common regions between two or more subsequences of DNA sequences represent the optimal sequence alignment [6][7]. DP-based algorithms are usually exhaustive because of their accurate analysis. In contrast, heuristic sequence alignment methods such as FASTA give sub-optimal alignments, which reduces the computational burden of DP analysis at the cost of uncertain accuracy in the result [8]. Thus, the sequence alignment algorithm can be implemented in an SA to speed up its computation, with an FPGA used to improve the DNA sequencer computation process.
FPGA implementations of DNA sequence alignment architectures are extensively reported in [9], [10], [11], [12], [13] and [14]. All of these architectures are based on SA and realize the SW algorithm with a linear gap penalty. They are differentiated by the gap penalty used and by additional features included in their designs, such as an additional algorithm for the trace back (TB) step. For further information on the related works, please refer to [13].
The remainder of this paper presents general information on the SW algorithm with affine gap, SA and FPGA. Next, it gives a comparative timing performance evaluation of the proposed design against other FPGA platforms. Finally, it concludes this work.

Sequence Alignment Algorithms
Pairwise Sequence Alignment (PSA) is one of the alignment analyses used to align DNA sequences. It investigates the relationship between a newly discovered query sequence and subject sequences taken from databases. In 1981, T. F. Smith and M. S. Waterman introduced an algorithm, known as the SW algorithm, to find the best local alignment between these sequences [15]. There are two ways of penalizing insertion and deletion gaps: linear and affine.
Gotoh introduced a more complex SW algorithm that is suitable for the affine gaps caused by mutation in DNA sequences [16], as shown in (1). Implementing this algorithm in hardware requires more logic resources and computation time. The time complexity of this algorithm is O(m×n), where m and n are the lengths of the subject and query sequences respectively. In (1), d is the penalty for opening a gap, e is the penalty for extending a gap and γ(ri,sj) is the substitution matrix score for the subject sequence ri (r1,r2,r3 … rm) and the query sequence sj (s1,s2,s3 … sn).
The alignment between each NT pair of r, indexed by i, and s, indexed by j, is computed in the matrix form F(i,j). Each element of this matrix takes the maximum score over three neighbouring cells: the diagonal cell, M(i,j), the top cell, Ix(i-1,j), and the left cell, Iy(i,j), as shown in Fig. 2.
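Since equation (1) is not reproduced in this text, the following is only a minimal sketch of a commonly used Gotoh-style affine gap recurrence, not necessarily the exact formulation of (1). The function name `sw_affine_score` is hypothetical, the default scores are taken from the caption of Fig. 4 (match=+5, mismatch=-2, d=-12, e=-2), and d and e are assumed to be negative values that are added to the score.

```python
NEG_INF = float("-inf")


def sw_affine_score(r, s, match=5, mismatch=-2, d=-12, e=-2):
    """Return (best_score, (i, j)) of the local alignment of r and s.

    Assumed recurrence (a common textbook form, not necessarily equation (1)):
        Ix(i,j) = max(F(i-1,j) + d, Ix(i-1,j) + e)   gap entered from the top cell
        Iy(i,j) = max(F(i,j-1) + d, Iy(i,j-1) + e)   gap entered from the left cell
        F(i,j)  = max(0, F(i-1,j-1) + gamma(ri,sj), Ix(i,j), Iy(i,j))
    """
    m, n = len(r), len(s)
    # Boundary values F(0,0), F(i,0), F(0,j) are zero; gap states start at -inf.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            gamma = match if r[i - 1] == s[j - 1] else mismatch
            Ix[i][j] = max(F[i - 1][j] + d, Ix[i - 1][j] + e)
            Iy[i][j] = max(F[i][j - 1] + d, Iy[i][j - 1] + e)
            F[i][j] = max(0, F[i - 1][j - 1] + gamma, Ix[i][j], Iy[i][j])
            if F[i][j] > best:
                # Remember the maximum score as the TB starting point.
                best, best_pos = F[i][j], (i, j)
    return best, best_pos


# Eight matches (8 x 5) and one opened gap (-12) give a score of 28.
print(sw_affine_score("AACCGGTT", "AACCTGGTT"))
```

The nested loops make the O(m×n) cost of the software version explicit: every one of the m×n matrix elements is visited once.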
The boundary values F(0,0), F(i,0) and F(0,j) are set to zero as there is no alignment in either r or s, as shown in Fig. 4. TB starts after the matrix F(i,j) has been completely filled with scores, by locating the maximum score as the starting point. The next score is the highest value among its neighbouring cells (top, left, diagonal). This process continues until it reaches the origin, as shown in Fig. 4. TB is used to obtain the optimal alignment between r and s, as shown in Fig. 3.

Fig. 3. Optimal sequence alignment [17].
Fig. 4. Alignment matrix of local alignment with affine gap penalty with its optimal sequence alignment, match=+5, mismatch=-2, d=-12, e=-2 [17].

Systolic Arrays
Parallel architectures such as SA have been used in computing technologies to accelerate algorithms on FPGAs [18]. A linear SA consists of an array of PEs, with each PE holding one DNA character of the query DNA sequence, as shown in Fig. 5. The highest score is stored in the SA after the subject sequence has been streamed in. If r is related to s, the score will be high. The internal architecture of a PE implements the affine gap penalty SW algorithm, as shown in Fig. 6. The main task of a PE is the computation of the alignment algorithm, comparing the scores of the top, left and diagonal elements as described in the previous section based on (1). Each PE performs one calculation by comparing one DNA character of the query sequence with one DNA character of the subject sequence. In each clock cycle, the PE generates one alignment matrix element. Over successive clock cycles, each PE generates one column of the matrix, where the column size is determined by m. The execution of the full alignment between the two sequences using the SW algorithm in O(m+n-1) time complexity is illustrated in Fig. 7.
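The linear SA behaviour described above can be illustrated with a small cycle-level software model. This is a sketch under stated assumptions, not the Verilog core itself: the function name `systolic_sw_score` and the register names are hypothetical, each PE is assumed to use a common Gotoh-style affine gap update, and d and e are assumed to be negative values that are added to the score. PE j holds query character s[j], and the subject sequence r enters PE 0 one character per clock cycle.

```python
NEG_INF = float("-inf")


def systolic_sw_score(r, s, match=5, mismatch=-2, d=-12, e=-2):
    """Model a linear systolic array; return (best_score, clock_cycles)."""
    m, n = len(r), len(s)
    f_out = [0] * n          # F value each PE produced on its last active cycle
    ix_reg = [NEG_INF] * n   # top-gap state kept inside each PE
    iy_out = [NEG_INF] * n   # left-gap state passed to the right neighbour
    diag_reg = [0] * n       # left neighbour's F from the cycle before last
    char_out = [None] * n    # subject character forwarded to the right
    best = 0
    cycles = m + n - 1       # last PE finishes the last subject character
    for t in range(cycles):
        # Update right to left so every PE still sees its left neighbour's
        # values from the previous clock cycle, as registers would provide.
        for j in range(n - 1, -1, -1):
            if j == 0:
                c = r[t] if t < m else None      # stream subject into PE 0
                left_f, left_iy = 0, NEG_INF     # boundary column F(i,0) = 0
            else:
                c = char_out[j - 1]
                left_f, left_iy = f_out[j - 1], iy_out[j - 1]
            char_out[j] = c
            if c is None:
                continue                         # PE is idle this cycle
            gamma = match if c == s[j] else mismatch
            ix = max(f_out[j] + d, ix_reg[j] + e)
            iy = max(left_f + d, left_iy + e)
            f = max(0, diag_reg[j] + gamma, ix, iy)
            diag_reg[j], f_out[j], ix_reg[j], iy_out[j] = left_f, f, ix, iy
            best = max(best, f)
    return best, cycles


# 8 subject characters and 9 PEs finish in 8 + 9 - 1 = 16 clock cycles.
print(systolic_sw_score("AACCGGTT", "AACCTGGTT"))
```

In this model the PEs active in a given cycle lie on one anti-diagonal of the alignment matrix, which is why the cycle count grows as m+n-1 rather than m×n.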

FPGA, the accelerator
The affine gap penalty SW algorithm is used as the backbone of the SA to speed up the computation process in aligning DNA sequences, where the FPGA acts as an accelerator. Different FPGA vendors have different FPGA architectures. Altera uses ALMs (Adaptive Logic Modules) while Xilinx uses Slices to represent their logic resources [20]. Slices are a combination of Flip Flops (FFs) and Look Up Tables (LUTs). The LUTs can be either 4-input or 6-input depending on the device. Virtex-4 uses 4-input LUTs whereas Virtex-6 uses 6-input LUTs. Altera's FPGA devices also use 4-input or 6-input LUTs as the building block of the ALM. Thus, normalization is required to compare speed performance fairly between different logic resources, which will be discussed in the next section.
The results were normalized in terms of Slices, as the proposed core architecture is implemented on a Virtex-6 FPGA. The performance of the core architecture is evaluated in terms of slices per PE (Slice/PE), the total number of PEs that fit in the device (#PE), peak frequency (Peak Freq), and Peak Cell Updates Per Second (Peak CUPS). Slice/PE is the number of slices occupied by one PE. #PE is calculated by dividing the total slices of the device by Slice/PE. Peak Freq is the maximum clock frequency in Hertz (Hz), set by the longest path between two FFs in a PE, and can be taken from the timing report generated by the Xilinx software. Peak CUPS gives the peak throughput based on #PE. It is calculated by multiplying the peak frequency by #PE.

Result and discussion
The Verilog HDL used can be targeted at a variety of FPGA platforms. The score obtained from the elementary matrix operation can be compared with the waveform from the proposed design simulated in the Xilinx ISIM simulator, as shown in Fig. 8. This ensures that the best-matched score from the core architecture is the same as that of the alignment matrix in Fig. 4. The parameters used for the core architecture were the same as the parameters for the matrix alignment operation. In this work, the proposed core architecture is implemented on a Virtex-6 FPGA, in which the 6-input LUT is the building block of the device. For a fair performance comparison, the equivalent 6-input LUT count is first determined based on (2) for Stratix IV [20] and on (3) for Virtex-4 and Spartan III [21], before the equivalent Slices are determined using (4) [22]. The Logic Cell (LC), Slice and ALM values can be retrieved from the User Guide and Device Handbook respectively. Then, the equivalent slices are compared to the reference slices, as shown in Table 1, for the normalized Slice/PE calculation. The Virtex-6 FPGA has 11,640 slices, so 776 PEs can be implemented in the device at 15 slices/PE. The proposed design operated at 600 MHz for the sequence alignment computation shown in Fig. 4. The comparison against other FPGA platforms is reviewed in Table 2. Usually, the bottleneck of FPGA designs in terms of area is the number of LUTs [23], and in terms of timing it is Peak CUPS [24]. Based on Table 2, the proposed design ranks second in terms of speed. However, it has the smallest slice count compared to the previous works. Because a smaller core area allows more PEs to be implemented, the speed performance moves to second place.
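As a quick arithmetic check, the figures quoted above can be combined using the definitions of #PE and Peak CUPS given earlier. The helper name `peak_metrics` is hypothetical; the input values (11,640 slices, 15 slices/PE, 600 MHz) are the ones stated in the text.

```python
def peak_metrics(total_slices, slices_per_pe, peak_freq_hz):
    """Return (#PE, Peak CUPS) from the device size and PE cost."""
    num_pe = total_slices // slices_per_pe   # whole PEs that fit in the device
    peak_cups = num_pe * peak_freq_hz        # one cell update per PE per cycle
    return num_pe, peak_cups


num_pe, peak_cups = peak_metrics(11_640, 15, 600_000_000)
print(num_pe)              # 776
print(peak_cups / 1e9)     # 465.6 (GCUPS)
```

The result of about 465.6 GCUPS is consistent with the 465 GCUPS figure reported for the proposed design.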

Conclusions
This paper reviewed current FPGA platforms implementing the SA architecture for DNA sequence alignment. The proposed design is implemented on a Xilinx Virtex-6 FPGA using the affine gap penalty SW algorithm. The results show that the proposed design has the smallest core area and achieves 465 GCUPS in comparison with the other architectures. Although it ranks second in terms of speed, its core area is the smallest. This architecture design is suitable for high speed algorithm computation, specifically for early detection of diseases such as cancer using genetic profiles.