A Method to Detect Hazards in Pipeline Processor

Abstract. To improve processor throughput, pipelining is widely used to exploit instruction-level parallelism. However, pipelining also introduces data hazards, which have a great influence on performance. This paper proposes a method called supply-matching to detect and resolve data hazards efficiently. The bypassing and stalling logic can be realized easily with this method. Furthermore, an RTL description of instructions is introduced to reduce resource utilization. A case study was conducted on a five-stage microprocessor based on the PowerPC architecture, implemented with different approaches. Experimental results show that our method requires fewer resources and achieves better performance.


Introduction
Pipelining is widely used in microprocessor design to improve performance. A pipelined processor can have several instructions in execution during the same clock cycle, each in a different stage. However, a pipelined architecture gives rise to hazards, including data hazards, control hazards and structural hazards, and the problem becomes more complex as the pipeline depth increases. Forwarding and stalling are effective ways to resolve hazards in processors for embedded systems. However, the complexity of detecting hazards increases rapidly as the number of instructions increases, and many combinations of instructions may lead to hazards in unanticipated ways. An effective method is therefore needed to resolve the large number of hazards that arise from a deep pipeline and a large instruction set.
Motivated by these issues, many studies have proposed methods for pipeline processor design. Amit Pandey and Yu Qiaoyan used a class-based method to detect data hazards [1,2]. This method divides the whole instruction set into several parts so that a big problem is broken up into smaller ones. P. Bernardi and D. Boyang proposed an SBST algorithm [3]. Jiajing Lu designed a dynamic scheduling algorithm to improve pipeline efficiency, which adds only one single-instruction buffer and some combinational logic [4]. Schönherr J, Schreiber I, and Fordran E proposed a method that uses symbolic model checking to detect hazards in pipelined processors [5].
In this work, we aim to resolve the large number of hazards that arise from a deep pipeline and a large instruction set. First, Section 2 briefly introduces the architecture and the controllers and discusses our method of generating the datapath from the RTL description. Next, Section 3 introduces a method called supply-matching to detect and resolve data hazards completely and efficiently. Section 4 then discusses control hazards and structural hazards. Finally, Section 5 presents an experiment that verifies our method.

Architecture
In this paper, the processor adopts an architecture with five pipeline stages, as shown in Fig. 1 (italic words are the names of core units; other words are the names of interfaces and data).
Fig. 1. The architecture of a 5-stage-pipeline microprocessor
In the IF stage, instructions are fetched from program memory. The PC, the Next PC (NPC) and the Instruction Memory (IM) are assigned to the IF stage. The ID stage decodes the instruction and reads or writes the register file; immediate extension is also done in this stage. The EXE stage executes arithmetic and logic operations. The MEM stage reads or writes memory. In the WB stage, the execution result of the instruction is written back to the register file. Four pipeline registers are distributed along the pipeline to hold data temporarily between stages.

Notations and datapath
An RTL description of instructions is introduced in this paper to reduce resource utilization. The notation is defined as follows:
(1) A.B means port B of core unit A.
(2) X_Y means a pipeline register. For example, IF_ID means the pipeline register between the IF stage and the ID stage.
The RTL description makes it easier to construct the datapath. A datapath is a collection of functional units (such as ALUs or multipliers), registers and buses. Following the RTL rules, the data flow of each instruction is clear and all units are linked. The data flows of all instructions are then merged vertically so that data flows repeated across the instruction set are removed. A MUX is added wherever a core unit has multiple inputs. The MUX control signal is generated by the controller described in Section 2.3. Fig. 2 shows the result of applying this method to the ADD, SUBF, STW, LWZ and B instructions of the PowerPC instruction set.
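The merging step can be illustrated with a minimal Python sketch. It assumes that each instruction's RTL description is written as a list of (source, destination) connections in the A.B and X_Y notation; the port names (ALU.A, ID_EXE.EXT and so on) and the two data flows are hypothetical and are not taken from Fig. 2.

```python
from collections import defaultdict

# Hypothetical EXE-stage data flows, written as (source, destination) pairs.
# Port names are illustrative assumptions, not the ports of the actual design.
RTL_FLOWS = {
    "ADD": [("ID_EXE.A", "ALU.A"), ("ID_EXE.B", "ALU.B"), ("ALU.C", "EXE_MEM.ALU")],
    "LWZ": [("ID_EXE.A", "ALU.A"), ("ID_EXE.EXT", "ALU.B"), ("ALU.C", "EXE_MEM.ALU")],
}

def merge_datapath(rtl_flows):
    """Merge the data flows of all instructions, dropping repeated connections."""
    sources = defaultdict(set)
    for flows in rtl_flows.values():
        for src, dst in flows:
            sources[dst].add(src)
    return {dst: sorted(srcs) for dst, srcs in sources.items()}

def mux_inputs(datapath):
    """Return the destinations fed by more than one source; each needs a MUX."""
    return {dst: srcs for dst, srcs in datapath.items() if len(srcs) > 1}
```

In this sketch only ALU.B ends up with two sources, so it is the only port that would need a function-multiplexer.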

Controller design
A three-controller architecture consisting of a function-controller, a bypass-controller and a stall-controller is used in this paper, as shown in Fig. 1. The function-controller is responsible for decoding the instruction and creating the function-signals that tell the core units what to do. For example, if the instruction is ADD, the function-controller creates signals such as DM_Wr, which denotes whether Data Memory is written. The function-controller also determines the data source of the core units by creating the select signals of the function-multiplexers. These function-signals are fully determined because they appear directly in the RTL description of every instruction, so the final logic of the signals can be created by integrating the RTL of all instructions directly.
The bypass-controller is mainly in charge of the bypass-signals, which choose the right data source as input to the bypass-multiplexers. The design of the bypass-controller is the key to realizing the bypassing technique. The logic that generates the bypass-signals is more complicated than the logic of the function-signals, because hazard detection becomes more difficult as the number of instructions or the pipeline depth increases. This paper introduces an effective method to detect hazards in the next section, which makes the bypass-signal logic much easier to derive. The stall-controller is responsible for stalling the pipeline, since some hazards cannot be resolved by bypassing. In these situations, the stall-controller generates stall-signals to stall the IF stage and the ID stage. As for the pipeline registers, ID_EXE is cleared and IF_ID remains unchanged. The PC also keeps its value of PC+4 so that instructions are still executed in the correct order.
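The effect of a stall-signal on the front of the pipeline can be sketched as a single clock-edge update. This is a behavioural Python illustration under simplifying assumptions, not the Verilog of the design: the function name clock_edge and the dictionary imem are hypothetical, instructions are represented by plain strings, and None stands for a cleared ID_EXE register (a bubble).

```python
def clock_edge(pc, if_id, id_exe, imem, stall):
    """Return the next (pc, if_id, id_exe) after one clock edge.

    pc     -- address of the next instruction to fetch
    if_id  -- instruction currently held in the IF_ID register
    id_exe -- instruction currently held in the ID_EXE register
    imem   -- hypothetical instruction memory: address -> instruction
    stall  -- stall-signal from the stall-controller
    """
    if stall:
        # PC and IF_ID keep their values; ID_EXE is cleared (bubble).
        return pc, if_id, None
    # Normal flow: fetch at pc, advance pc by 4, move IF_ID into ID_EXE.
    return pc + 4, imem[pc], if_id
```

When the stall is released, the instruction held in IF_ID proceeds to ID_EXE and the fetch resumes from the unchanged PC, so no instruction is skipped or repeated.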

Problem definition
Data hazards are the most frequently occurring hazards in a pipeline processor. Forwarding and stalling are effective ways to resolve them. However, the complexity of detecting hazards increases rapidly as the number of instructions increases, and many combinations of instructions may lead to hazards in unanticipated ways. Data hazard detection must therefore be complete, so that every hazardous combination is considered.

Solution
This paper proposes a method called supply-matching to detect and resolve data hazards completely and efficiently. The method can be divided into two steps:
(1) Build the Tuse-Tnew matrix of all instructions of the specified instruction set. All values of Tuse for the registers and all values of Tnew for the processor can then be obtained by synthesizing all records of the Tuse-Tnew matrix.
(2) Build the register strategy matrix from the Tuse-Tnew matrix. A Tuse-Tnew record can be uniquely determined for a specific register. According to the supply-matching model described in Section 3.2.1, any data hazard can be detected and resolved by using formula (1).

Supply-matching model
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Therefore, data hazard detection can be transformed into detecting the relationship between data demand and data supply. The model follows these basic principles:
(1) The provider is the pipeline register that saves the execution result of the preceding instruction.
(2) Instruction decoding must not take place earlier than the ID stage, whether the centralized or the distributed decoding mode is adopted.
(3) Forwarding has higher priority when both forwarding and stalling can be used to resolve a data hazard.
In addition, two parameters called Tuse and Tnew are defined.
Tuse is the number of clock cycles, counted from when the instruction enters the ID stage, after which a functional unit will use the value saved in a register. Tuse is a static value, and an instruction can have several Tuse values, one per operand. Tuse ≥ 0.
Tnew is the minimum number of clock cycles after which an instruction in a stage after the ID stage will produce the result that is to be written back to a register. Tnew is a dynamic value: it decreases by 1 each time the instruction moves to the next pipeline stage and stays at 0 once it reaches 0, so an instruction has a different Tnew in different stages. Tnew ≥ 0.
The management of Tnew and Tuse in the pipeline processor is shown in Fig. 3.
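The dynamic behaviour of Tnew can be sketched in a few lines of Python. The sketch assumes the five-stage pipeline of this paper and takes as input the Tnew value an instruction has when it is in the EXE stage; the function name tnew_by_stage is hypothetical.

```python
# Stages after ID in the five-stage pipeline, in pipeline order.
STAGES_AFTER_ID = ["EXE", "MEM", "WB"]

def tnew_by_stage(tnew_at_exe):
    """Tnew of one instruction in each stage after ID.

    Tnew decreases by 1 per stage and saturates at 0.
    """
    return {stage: max(tnew_at_exe - i, 0)
            for i, stage in enumerate(STAGES_AFTER_ID)}
```

For example, an instruction whose result is ready two cycles after it enters EXE has Tnew values of 2, 1 and 0 in EXE, MEM and WB respectively.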

Fig. 3. The management of Tnew and Tuse in pipeline processor
Based on the above, the resolution of data hazards becomes numerical. According to the definitions of Tuse and Tnew, each instruction has a set of numbers that represent its Tuse and Tnew. For a specific instruction set, the comparison between the Tuse and Tnew of different instructions reflects their data dependence. If Tnew > Tuse, the data hazard can only be resolved by stalling, because the result becomes available too late. If Tnew ≤ Tuse, the data hazard can be resolved by forwarding. The formula is as follows:
$$\text{strategy} = \begin{cases} \text{stalling}, & T_{new} > T_{use} \\ \text{forwarding}, & T_{new} \le T_{use} \end{cases} \qquad (1)$$
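The comparison between Tuse and Tnew described above can be written as a small Python function. This is a sketch only; the function name resolve and the string results are hypothetical, and it presumes that a data dependence between the two instructions has already been established.

```python
def resolve(tuse, tnew):
    """Choose the resolution for one dependent (consumer, producer) pair.

    tuse -- Tuse of the consuming instruction in the ID stage
    tnew -- current Tnew of the producing instruction
    """
    if tuse < 0 or tnew < 0:
        raise ValueError("Tuse and Tnew must be non-negative")
    # Result arrives too late: stall. Otherwise it can be forwarded.
    return "stall" if tnew > tuse else "forward"
```

As a worked case under assumed values: a consumer with Tuse = 1 that depends on a load with Tnew = 2 gives resolve(1, 2), which is "stall"; one cycle later the load has Tnew = 1 and resolve(1, 1) is "forward".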

Tuse-Tnew matrix
The Tuse-Tnew matrix is used to find all values of Tuse and Tnew for all instructions. By the definitions of Tuse and Tnew, the Tuse-Tnew record of an instruction is fixed as long as the pipeline architecture stays the same. Therefore, the Tuse-Tnew matrix is also fixed once all instructions of a specific instruction set are taken into account. All the work then focuses on the execution semantics of each instruction and on completing the matrix row by row, rather than on a specific classification of the instructions. This greatly reduces the complexity of hazard detection even as the number of instructions increases. For example, suppose the instruction set includes the instructions ADD, SUB, ANDI, ORI, LW, SW and BEQ. For the five-stage pipeline, the Tuse-Tnew matrix is shown in Table 1. With the supply-matching model, the Tuse-Tnew matrix is easy to build. The procedure also applies to processors with a deeper pipeline or a larger instruction set.
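A Tuse-Tnew matrix for this example instruction set could be encoded as follows. The numeric values are assumptions for a textbook five-stage pipeline in which ALU results are ready at the end of EXE, load results at the end of MEM, store data is needed in MEM and branches are resolved in ID; they are not copied from Table 1. The operand names rs and rt and the helper names are likewise hypothetical.

```python
# instruction: ({source operand: Tuse}, Tnew in EXE or None if no register is written)
# All values are illustrative assumptions, not the entries of Table 1.
MATRIX = {
    "ADD":  ({"rs": 1, "rt": 1}, 1),
    "SUB":  ({"rs": 1, "rt": 1}, 1),
    "ANDI": ({"rs": 1}, 1),
    "ORI":  ({"rs": 1}, 1),
    "LW":   ({"rs": 1}, 2),
    "SW":   ({"rs": 1, "rt": 2}, None),
    "BEQ":  ({"rs": 0, "rt": 0}, None),
}

def tuse_values(matrix, operand):
    """All distinct Tuse values that occur for one source operand."""
    return sorted({tuse[operand] for tuse, _ in matrix.values() if operand in tuse})

def tnew_values(matrix):
    """All distinct Tnew values that can occur in the stages after ID."""
    values = set()
    for _, tnew in matrix.values():
        if tnew is not None:
            # Tnew counts down to 0 as the instruction moves through the pipeline.
            values.update(range(tnew, -1, -1))
    return sorted(values)
```

Synthesizing the rows in this way gives the complete value ranges that the register strategy matrix has to cover, and adding an instruction only means adding one row.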

Register strategy matrix
The register strategy matrix provides the resolution strategy for data hazards on a specific register. Based on the Tuse-Tnew matrix, a complete strategy matrix can be built for a specific register, because the Tuse-Tnew matrix covers all instructions and all pipeline stages except the stages before the ID stage (this follows from basic principle 2; if there is more than one stage before the ID stage, some additional work is needed to detect and resolve data hazards for those stages). Formula (1) is used to construct the strategy matrix. Taking the RT register as an example, the result is shown in Table 2. If Tuse = Tnew, the current instruction can get the data it needs immediately, because the preceding instruction finishes computing it at the same time, so forwarding can be used. A complete stall control signal is created by combining all the stalling conditions with a logical OR. In all other cases, data hazards can be resolved by forwarding. However, forwarding may have several candidate sources that carry data from multiple pipeline stages, so a priority among forwarding sources must be set. This paper adopts a forwarding strategy based on pipeline priority: the stage that is Tuse stages after the ID stage has the highest priority, and the priority of the stages after it decreases in turn. When there are multiple forwarding sources, the one with the highest priority is selected.
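The combination of the stalling conditions and the choice among forwarding sources can be sketched in Python. The sketch makes a simplifying assumption about priority: among the producers that match the source register, the one in the stage nearest to ID (the most recent instruction) is chosen, which is one plausible reading of the pipeline-priority strategy and may differ from the exact rule used in the design. The function name strategy and the tuple layout are hypothetical.

```python
def strategy(src_reg, tuse, in_flight):
    """Decide how the instruction in ID obtains one source register.

    src_reg   -- register number read by the instruction in ID
    tuse      -- Tuse of that operand
    in_flight -- list of (stage, dest_reg, tnew) for the instructions in
                 the stages after ID, ordered from EXE to WB; dest_reg is
                 None for instructions that write no register
    Returns ("stall", None), ("forward", stage) or ("none", None).
    """
    matches = [(stage, tnew) for stage, dest, tnew in in_flight if dest == src_reg]
    # Stall signal: logical OR of all individual stalling conditions.
    if any(tnew > tuse for _, tnew in matches):
        return ("stall", None)
    if matches:
        # Assumed priority: the matching producer nearest to ID wins.
        return ("forward", matches[0][0])
    return ("none", None)
```

The returned stage identifies the producing instruction at decision time; the data itself is taken from whichever pipeline register holds that instruction's result when the consumer actually uses it.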

Control hazard and structural hazard
Control hazards (branch hazards) are caused by branch instructions. There are two main techniques to resolve branch hazards: branch prediction and the branch delay slot. Other studies have already treated this topic thoroughly; please refer to [6,7,8] for details.
Structural hazards occur when a part of the processor's hardware is needed by two or more instructions at the same time. The strategy for resolving these hazards is simple: either stall the pipeline or duplicate the contested unit.

Framework
The methods described above were adopted to implement a five-stage PowerPC microprocessor that supports 72 instructions. It is important to implement the whole microprocessor rather than the hazard detection logic alone, because a microprocessor cannot work if only the hazard detection logic is present.
There are two decoding modes in processor design: centralized decoding and distributed decoding. In centralized decoding mode, the controllers are placed in the ID stage, and the control signals they create are passed along the pipeline. In distributed decoding mode, the controllers are distributed over multiple stages, and each controller only creates the control signals for the core units in its own stage [9,10]. In this paper, the supply-matching method uses a hybrid decoding mode, in which Tuse uses centralized decoding and Tnew uses distributed decoding. To set up a control experiment, we also implemented the supply-matching method using centralized decoding only.
The framework of the experiment is shown in Fig. 4. First, the instruction set, core units and pipeline architecture were determined according to the architecture specification. After that, the core unit modules, controllers, datapath and multiplexers were implemented in Verilog HDL, and the hazards in the pipeline were resolved with our method. Finally, binary code was obtained by compiling programs written in C or assembly with GCC. The register values after the execution of every instruction were obtained from QEMU in single-step mode. The correctness of the processor is determined by comparing the register values during simulation with the results from QEMU. The design was synthesized and implemented with ISE on a Xilinx Spartan6-6SLX150FGG484 FPGA.
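The comparison against the QEMU reference can be sketched as a trace check in Python. The sketch assumes that both the simulation and the reference run have already been exported as lists of register snapshots, one dictionary per executed instruction; the function name first_mismatch and this trace format are hypothetical, and no QEMU interface is used here.

```python
def first_mismatch(sim_trace, ref_trace):
    """Find the first instruction after which the registers differ.

    sim_trace, ref_trace -- lists of {register name: value}, one entry per
    executed instruction. Only the common prefix of the traces is compared.
    Returns (step, {register: (simulated, reference)}) or None if equal.
    """
    for step, (sim, ref) in enumerate(zip(sim_trace, ref_trace)):
        diff = {reg: (sim.get(reg), value)
                for reg, value in ref.items() if sim.get(reg) != value}
        if diff:
            return step, diff
    return None
```

Reporting the first diverging step rather than only the final state points directly at the instruction whose hazard handling is suspect.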

Result and analysis
Comparing the register values with the results from QEMU confirmed the correctness of the processor, which in turn indicates the correctness of the supply-matching method. We also compared the synthesis reports of the two microprocessors that implement the supply-matching method with different decoding modes in three respects: clock frequency (MHz), the number of flip-flops (FF), and the number of BELs (which include all basic logic primitives such as LUT, MUXCY, etc.). The results are shown in Table 3. They indicate that our method with the hybrid decoding mode achieves a higher clock frequency with fewer resources.

Conclusions
In this paper, an efficient method called supply-matching has been proposed to detect and resolve data hazards completely. With this method, the bypassing and stalling logic can be integrated and implemented easily. We conducted experiments on a microprocessor based on the PowerPC architecture with a five-stage pipeline. The experimental results show that our method achieves a higher clock frequency with fewer resources than the well-known method called stage-decoding.