Processor Pipeline Hazards
Overview
This lecture moves from the single-cycle processor we built in Project 6 to a pipelined processor that overlaps instruction execution across five stages to increase throughput. Pipelining introduces a new problem: hazards, situations where the overlap causes an instruction to use a value (or a PC) that is not yet ready. We study the four fixes you will implement in Project 7: inverting the register-file clock, data forwarding, and load stalling for data hazards, plus flushing for control hazards. Cycle-by-cycle diagrams show how each fix removes nop instructions and shortens the running time of a program.
Learning Objectives
- Explain how a 5-stage pipeline (IF, DR, EX, MEM, WB) overlaps instructions to improve throughput without shortening the latency of any single instruction
- Describe the role of pipeline registers in carrying intermediate values between stages
- Identify the four kinds of modifications needed to handle hazards: clock inversion, forwarding, load stalling, and control-hazard flushing
- Recognize a read-after-write (RAW) data hazard from a cycle/stage diagram
- Show how inverting the RegFile clock removes one nop per dependency
- Design the forwarding (bypass) logic that selects RD0/RD1 from the EX, MEM, or WB results
- Detect a load-use hazard and stall the pipeline by gating enable and clear signals on pipeline registers
- Resolve control hazards (jumps and branches) by updating the PC early and flushing wrong-path instructions
Prerequisites
- The single-cycle RISC-V processor datapath from Project 6 (PC, instruction memory, RegFile, ImmDecoder, ALU, Branch Unit, data memory)
- RISC-V assembly and machine code: addi, add, ld/sd, jal/jalr, beq/bne/blt/bge
- Digital design fundamentals: registers, multiplexers, comparators, clocked vs. combinational logic
- The RISC-V register file interface (RR0, RR1, RD0, RD1, WR, WD, WE)
See the processor design overview and the source PDF for the original handwritten lecture notes.
1. From Single-Cycle to Pipelined
In a single-cycle processor, one instruction occupies the entire datapath for one (long) clock cycle: fetch, decode, execute, memory, and write-back all happen before the next instruction begins. The clock period must be long enough for the slowest instruction to flow all the way through, so most of the hardware sits idle most of the time.
A pipelined processor splits the datapath into stages separated by registers, so that multiple instructions can be "in flight" at once — each occupying a different stage. This is the same idea as an assembly line or the laundry analogy: while one load is in the dryer, the next is in the washer.
flowchart LR
A[IF<br/>Instruction Fetch] --> B[DR<br/>Decode / RegFile Read]
B --> C[EX<br/>Execute / ALU]
C --> D[MEM<br/>Memory]
D --> E[WB<br/>Write Back]
style A fill:#cde,stroke:#333
style B fill:#cde,stroke:#333
style C fill:#fdc,stroke:#333,stroke-width:2px
style D fill:#cde,stroke:#333
style E fill:#cde,stroke:#333
The five RISC-V pipeline stages are:
| Stage | Name | Work performed |
|---|---|---|
| IF | Instruction Fetch | Read instruction from instruction memory using PC; compute PC+4 |
| DR | Decode / RegFile Read | Decode the instruction word; read source registers (RD0, RD1); produce control lines |
| EX | Execute | Perform the ALU operation; compute branch/jump target; evaluate the branch comparison |
| MEM | Memory | Load from or store to data memory |
| WB | Write Back | Write the result back into the register file |
Historical note: pipelining originated in mainframes and became common in personal-computer CPUs in the 1980s. It is the foundational technique behind all modern high-performance processors.
Throughput, not latency
Pipelining does not make a single instruction finish faster: that instruction still passes through all five stages. What improves is throughput: once the pipeline is full, one instruction completes every cycle. If the stages are balanced, so that the pipelined clock period is about 1/n of the single-cycle period, an n-stage pipeline approaches an n-times speedup in steady state.
Single-cycle (1 long cycle each):
I1: [=====]
I2:        [=====]
I3:               [=====]
Pipelined (5 short stages, overlapped):
I1: F D E M W
I2:   F D E M W
I3:     F D E M W   <- one instruction finishes per cycle once full
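The steady-state claim can be checked with a little arithmetic. The sketch below is only an illustration: it assumes perfectly balanced stages and no hazards, measures time in units of one stage delay, and uses function names that are not part of the project.

```python
def single_cycle_time(m: int, n_stages: int) -> int:
    """Stage-delays needed for m instructions, one long cycle each."""
    return m * n_stages


def pipelined_time(m: int, n_stages: int) -> int:
    """Stage-delays with ideal overlap: fill once, then one finishes per cycle."""
    return n_stages + (m - 1)


def speedup(m: int, n_stages: int = 5) -> float:
    """Ratio of single-cycle time to pipelined time for m instructions."""
    return single_cycle_time(m, n_stages) / pipelined_time(m, n_stages)


for m in (1, 3, 100, 10_000):
    print(m, round(speedup(m), 3))
```

For a single instruction the ratio is 1 (no benefit), and it climbs toward 5 as the program gets longer without ever reaching it.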
2. Pipeline Registers
Between every pair of stages sits a pipeline register that captures the intermediate values produced by one stage so the next stage can use them on the following clock edge. Without these registers, the partial results of one instruction would be overwritten by the next instruction entering the same combinational logic.
flowchart LR
IF[IF] -->|IF/DR| DR[DR]
DR -->|DR/EX| EX[EX]
EX -->|EX/MEM| MEM[MEM]
MEM -->|MEM/WB| WB[WB]
The four pipeline registers are named after the stages they sit between:
| Pipeline register | Carries (examples) |
|---|---|
| IF/DR | The fetched instruction word, PC, PC+4 |
| DR/EX | RD0, RD1, immediate, register numbers (RR0, RR1, WR), control lines (ALUOp, RFW, etc.) |
| EX/MEM | ALU result, RD1 (store data), WR, control lines (RFW, memory control) |
| MEM/WB | Memory read data, ALU result, WR, RFW |
Each pipeline register has standard control inputs we will use to manage hazards:
- EN (enable): when 0, the register holds its current value (it does not load new data). This stalls the instruction sitting in that stage.
- CLR (clear): when 1, the register's outputs are zeroed on the next edge, which injects a "bubble" (effectively a nop). This flushes the instruction.
A subscript convention is used throughout the project to indicate which stage's copy of a signal we mean. RR0_2 is the source-register-0 number as seen in the DR stage, WR_3 is the destination register number in the EX/MEM stage, RFW_4 is the register-file-write control in the WB stage, and so on. Stage 1 = IF, 2 = DR, 3 = EX/MEM, 4 = MEM/WB in this numbering.
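To make the EN and CLR behavior concrete, here is a toy software model of one pipeline register. It is a sketch, not the Digital component: the class name is hypothetical, a bubble is represented by an empty dict, and it assumes CLR takes priority when both CLR and EN=0 are applied on the same edge.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRegister:
    """Toy model of one pipeline register with EN and CLR inputs."""
    value: dict = field(default_factory=dict)  # empty dict stands for a bubble

    def clock(self, next_value: dict, en: int = 1, clr: int = 0) -> None:
        """Apply one clock edge."""
        if clr:
            # Assumption: CLR wins over EN when both are active.
            self.value = {}
        elif en:
            self.value = dict(next_value)
        # en == 0 and clr == 0: hold the current contents (stall).


reg = PipelineRegister()
reg.clock({"WR": 10, "RFW": 1})       # normal load
reg.clock({"WR": 11, "RFW": 1}, en=0) # stall: keeps WR = 10
print(reg.value)
reg.clock({"WR": 12, "RFW": 1}, clr=1)  # flush: becomes a bubble
print(reg.value)
```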
3. The Four Modifications
The heart of this lecture is a list of four types of modifications ("mods") to evolve the starter pipeline so that ordinary programs run correctly without hand-inserted nop instructions:
4 types of mods
1) Invert the clock to the RegFile (removes one nop) <- checked in lecture
2) Forwarding (data hazards)
3) LD stalling (load-use hazard)
4) Control Hazard (jumps and branches)
These build on each other. The starter pipeline already runs correctly if you insert enough nop instructions (addi zero, zero, 0) between dependent instructions. Each modification removes some of those required nops and lets the pipeline run faster, until no manual nops are needed at all.
| Mod | Hazard addressed | Mechanism |
|---|---|---|
| 1. Invert RegFile clock | Data (RAW) | Write in WB on the first half of the cycle, read in DR on the second half |
| 2. Forwarding | Data (RAW) | Route the EX/MEM or MEM/WB result back into the EX stage |
| 3. LD stalling | Load-use | Detect a load feeding the next instruction; freeze the pipeline one cycle |
| 4. Control hazard | Jumps / branches | Update PC early in EX; flush the wrong-path instructions already fetched |
4. Data Hazards (RAW)
A data hazard occurs when an instruction needs a value that an earlier, still-in-flight instruction has computed but not yet written back. The most common form is read-after-write (RAW): instruction I2 reads a register that instruction I1 writes.
Consider this dependent sequence — add needs the values addi placed into a1 and a2:
addi a1, zero, 3 # a1 = 3, written back in WB
addi a2, zero, 4 # a2 = 4, written back in WB
add a0, a1, a2 # needs a1 and a2 -> a0 should be 7
On the bare pipeline this fails. The add reaches its DR stage (where it reads the register file) before the addis reach their WB stage (where they write the register file). It reads stale values.
Why three nops "work" on the starter
The starter pipeline reads registers in DR and writes them in WB. WB is three stages after DR. So a dependent instruction must be separated by enough cycles that the producer's WB happens before the consumer's DR. The original starter therefore needs three nops between a producer and a consumer:
cycle:          1  2  3  4  5  6  7  8  9  10
addi a1,..,3:   F  D  E  M  W
addi a2,..,4:      F  D  E  M  W
nop         :         F  D  E  M  W
nop         :            F  D  E  M  W
nop         :               F  D  E  M  W
add a0,a1,a2:                  F  D  E  M  W
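With three nops, the add reaches DR in cycle 7, one cycle after a2's WB in cycle 6, so it reads the updated value. With only two nops, the add's DR would fall in cycle 6, the same cycle as a2's WB, and the read would see the new value only if the write happened early enough in that cycle. That timing is exactly what Modification 1 fixes.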
This is 00-add-3nop.s in the project test suite: it passes on the unmodified starter and exists so you can see how nops paper over hazards.
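The cycle/stage table for a hazard-free run can be generated mechanically, which is a handy way to check a hand-drawn diagram. The sketch below assumes no stalls or flushes; the labels and the `schedule` helper are illustrative names only.

```python
STAGES = ["F", "D", "E", "M", "W"]


def schedule(program: list[str]) -> dict[str, dict[str, int]]:
    """Map each (unique) instruction label to {stage: cycle}, assuming no stalls.

    Instruction i (0-based) is fetched in cycle i + 1 and advances one
    stage per cycle.
    """
    return {
        label: {stage: i + 1 + s for s, stage in enumerate(STAGES)}
        for i, label in enumerate(program)
    }


program = ["addi a1", "addi a2", "nop1", "nop2", "nop3", "add"]
table = schedule(program)

# Cycle in which the last producer writes back, and in which the consumer reads.
print("addi a2 WB:", table["addi a2"]["W"])
print("add     DR:", table["add"]["D"])
print("total cycles:", table["add"]["W"])
```

Dropping one of the nop labels from `program` moves the consumer's DR into the same cycle as the producer's WB.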
5. Modification 1: Invert the RegFile Clock
The first and simplest fix: invert the clock signal feeding the register file. This lets the WB stage write a register on the first half of a clock cycle and lets the DR stage read that same register on the second half of the same cycle. A producer's write becomes visible to a consumer's read one cycle sooner, which removes one required nop — going from three nops to two.
The handwritten timing diagram shows the regular CLK on top and the inverted CLK below. The DR stage reads on the rising edge of the inverted clock (the falling edge of the real clock), which is the second half of the cycle, after WB has already written on the first half.
            |<-- one clock cycle -->|
             ___________             ___________
CLK      ___|           |___________|           |___    (RegFile WB writes here, 1st half)
         ___             ___________             ___
~CLK        |___________|           |___________|       (RegFile DR reads here, 2nd half)
WB writes -> then DR reads the new value in the SAME cycle
With the inverted clock, the earlier nop-padded program needs only two nops, and the whole sequence finishes in 9 cycles instead of 10:
cycle:          1  2  3  4  5  6  7  8  9
addi a1,..,3:   F  D  E  M  W
addi a2,..,4:      F  D  E  M [W]            <- WB writes a2 on 1st half of cycle 6
nop         :         F  D  E  M  W
nop         :            F  D  E  M  W
add a0,a1,a2:               F [D] E  M  W    <- DR reads a2 on 2nd half of cycle 6
The two boxed cells line up in the same cycle: WB writes first, DR reads second.
This single change is what makes test cases 01-add-2nop.s, 02-jal.s, and 03-ld.s pass on the starter. All you do in Digital is route an inverted CLK into the RegFile's clock input.
Intuition: the register file is the shared "blackboard." If the writer always erases-and-rewrites at the very start of the period and the reader always copies at the very end of the period, the reader sees today's value, not yesterday's. Inverting one clock is enough to enforce that ordering.
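The "three nops become two" result follows directly from the stage positions of WB and DR. The sketch below searches for the smallest workable nop count under each timing rule; the function name and flag are hypothetical, and it models only the ordering of the write and the read, not the circuit.

```python
def nops_needed(write_then_read_same_cycle: bool) -> int:
    """Smallest number of nops between a producer and its consumer.

    The producer is fetched in cycle 1, so it writes back (WB) in cycle 5.
    With n nops in between, the consumer is fetched in cycle n + 2 and
    reads the register file (DR) in cycle n + 3.
    """
    producer_wb = 5
    n = 0
    while True:
        consumer_dr = n + 3
        if write_then_read_same_cycle:
            ok = consumer_dr >= producer_wb   # same cycle is good enough
        else:
            ok = consumer_dr > producer_wb    # must be a strictly later cycle
        if ok:
            return n
        n += 1


print("starter pipeline:", nops_needed(False))
print("inverted RegFile clock:", nops_needed(True))
```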
6. Modification 2: Forwarding (Bypassing)
Inverting the clock removed one nop, but two still remain because a producer's result is not in the register file until WB. Forwarding (also called bypassing) removes the rest by recognizing that the value the consumer needs already exists deeper in the pipeline — it just has not been written back yet. Instead of waiting for WB, we route the result directly to the EX stage where it is needed.
Look at the dependency again, this time annotated with where the value actually becomes available:
cycle:          1  2  3  4  5  6  7
addi a1,..,3:   F  D  E  M [W]
addi a2,..,4:      F  D  E [M] W
add a0,a1,a2:         F  D [E] M  W
                         ^  ^
                         |  the add needs a1, a2 at the START of its EX (cycle 5)
                         a2's ALU result exists at end of its EX (cycle 4 -> in EX/MEM reg)
When the add is in EX (cycle 5), the second addi's result is sitting in the EX/MEM pipeline register, and the first addi's result is in the MEM/WB pipeline register. Forwarding adds datapaths that carry those results back into the EX stage so the add can use them immediately — no nops at all. The handwritten note draws the blue "forwarding" wires curving from the later stages back to the add's EX inputs.
With forwarding, the entire dependent sequence collapses to back-to-back instructions:
cycle:          1  2  3  4  5  6  7
addi a1,..,3:   F  D  E  M  W
addi a2,..,4:      F  D  E  M  W
add a0,a1,a2:         F  D  E  M  W    <- gets a1,a2 by forwarding, NO nops
The forwarding datapath
We insert a multiplexer in front of each ALU input. The RD0 path gets one MUX, the RD1 path gets another. Each MUX chooses among three sources:
- The original register-file value (RD0/RD1 from the DR/EX register): selector value 0
- The ALU result in the EX/MEM stage (ALUR_3): selector value 2
- The write-back-stage result (MR_4, the output of the MEM/WB write-data MUX): selector value 1
flowchart TD
RD0[RD0 from DR/EX] --> M0[/RD0 MUX<br/>sel = FRD0\]
ALUR3[ALUR_3<br/>EX/MEM ALU result] --> M0
MR4[MR_4<br/>MEM/WB result] --> M0
M0 --> A[ALU input A]
RD1[RD1 from DR/EX] --> M1[/RD1 MUX<br/>sel = FRD1\]
ALUR3b[ALUR_3] --> M1
MR4b[MR_4] --> M1
M1 --> B[ALU input B]
The Hazard Unit logic for forwarding
The Hazard Unit computes the selector FRD0 (and symmetrically FRD1) by comparing the consumer's source register number in DR (RR0_2) against the destination register numbers of the instructions ahead of it, and only forwarding if those instructions actually write the register file (RFW):
// Forwarding selector for ALU input A (RD0).
// Stage 3 = EX/MEM (closest, highest priority), Stage 4 = MEM/WB.
if ((RR0_2 == WR_3) && RFW_3) {
    FRD0 = 2;   // forward the EX/MEM ALU result (ALUR_3)
} else if ((RR0_2 == WR_4) && RFW_4) {
    FRD0 = 1;   // forward the MEM/WB result (MR_4)
} else {
    FRD0 = 0;   // no hazard: use the register-file value RD0
}
FRD1 is identical but uses RR1_2 in place of RR0_2.
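The selector logic is easy to try out in software before wiring it up in Digital. The sketch below mirrors the pseudocode above as a plain Python function; the function name is hypothetical, and register numbers use the standard RISC-V ABI mapping (a1 = x11, a2 = x12).

```python
def forward_select(rr_2: int, wr_3: int, rfw_3: int, wr_4: int, rfw_4: int) -> int:
    """Return the forwarding MUX selector for one ALU operand.

    0 = register-file value, 1 = MEM/WB result (MR_4),
    2 = EX/MEM result (ALUR_3). Stage 3 is tested first.
    """
    if rr_2 == wr_3 and rfw_3:
        return 2
    if rr_2 == wr_4 and rfw_4:
        return 1
    return 0


A1, A2 = 11, 12  # a1 = x11, a2 = x12

# add a0, a1, a2 with addi a2 in EX/MEM and addi a1 in MEM/WB.
frd0 = forward_select(A1, wr_3=A2, rfw_3=1, wr_4=A1, rfw_4=1)
frd1 = forward_select(A2, wr_3=A2, rfw_3=1, wr_4=A1, rfw_4=1)
print("FRD0 =", frd0, "FRD1 =", frd1)
```

The same function serves both operands: pass RR0_2 to get FRD0 and RR1_2 to get FRD1.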
Priority: closest producer wins
The order of the tests matters. If both instructions ahead write the same register, the one closest to the consumer (the EX/MEM stage, stage 3) holds the more recent value and must win. The Hazard Unit checks stage 3 first.
addi a0, zero, 3 # writes a0 (older)
addi a0, zero, 4 # writes a0 (newer -> this is the value we want)
add a0, a0, a0 # must see 4 + 4 = 8, not 3
Because the second addi is closer (in EX/MEM) when the add is in EX, FRD0 = 2 fires first and forwards 4. This gives a0 = 8, which is correct.
This is test 04-add-fwd.s, worth the largest single block of points (50 pts) because it is the workhorse data-hazard fix.
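The effect of the test order can be demonstrated by running the a0 example through both orderings. This is a sketch under stated assumptions: `select`, `mux`, and `add_result` are illustrative names, and the stale register-file value of a0 is arbitrarily taken to be 0.

```python
def select(rr_2, wr_3, rfw_3, wr_4, rfw_4, stage3_first=True):
    """Forwarding selector; stage3_first=False models the swapped (buggy) order."""
    hit3 = bool(rr_2 == wr_3 and rfw_3)
    hit4 = bool(rr_2 == wr_4 and rfw_4)
    if stage3_first:
        return 2 if hit3 else 1 if hit4 else 0
    return 1 if hit4 else 2 if hit3 else 0


def mux(sel, rd, mr_4, alur_3):
    """Forwarding MUX: 0 = RegFile value, 1 = MR_4, 2 = ALUR_3."""
    return {0: rd, 1: mr_4, 2: alur_3}[sel]


A0 = 10            # a0 = x10
STALE_A0 = 0       # assumed old RegFile contents of a0
ALUR_3, MR_4 = 4, 3  # newer addi result in EX/MEM, older one in MEM/WB


def add_result(stage3_first: bool) -> int:
    """Value computed by `add a0, a0, a0` when it is in EX."""
    sel = select(A0, A0, 1, A0, 1, stage3_first)
    operand = mux(sel, STALE_A0, MR_4, ALUR_3)
    return operand + operand


print("stage 3 first:", add_result(True))
print("stage 4 first:", add_result(False))
```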
7. Modification 3: Load Stalling
Forwarding handles the common case where the needed value is produced by the ALU (available at the end of EX). But a load (ld) does not produce its value until the MEM stage. If the very next instruction needs the loaded value in its EX stage, there is simply no way to forward in time — the data does not exist yet. This is a load-use hazard, and the only fix is to stall the pipeline for one cycle so the load can reach MEM and then forward from there.
li a0, 0
li a1, 4
sd a1, (a0) # store 4 to memory[a0]
ld a2, (a0) # a2 = memory[a0] = 4 (value ready only in MEM)
addi a0, a2, 1 # needs a2 immediately -> a0 should be 5
unimp
cycle:          1  2  3  4  5  6
ld a2,(a0)  :   F  D  E [M] W      <- a2 known only at end of MEM (cycle 4)
addi a0,a2,1:      F  D [E] ...    <- needs a2 at start of EX (cycle 4): TOO EARLY
The consumer's EX and the load's MEM are in the same cycle, so the value cannot be forwarded backward in time. We must insert exactly one bubble.
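The "exactly one bubble" figure can be derived from when the value exists versus when it is needed. The sketch below assumes forwarding into EX is already in place and counts cycles relative to the producer's fetch; the helper name and its parameters are hypothetical.

```python
STAGE_INDEX = {"IF": 0, "DR": 1, "EX": 2, "MEM": 3, "WB": 4}


def bubbles_needed(ready_stage: str, distance: int) -> int:
    """Stall cycles required when results are forwarded into EX.

    ready_stage: stage at whose END the producer's value exists
                 ("EX" for ALU instructions, "MEM" for loads).
    distance:    how many instructions later the consumer is (1 = next).
    """
    produced_at_end_of = STAGE_INDEX[ready_stage]   # cycle, relative to producer's IF
    consumer_ex = distance + STAGE_INDEX["EX"]      # cycle of the consumer's EX
    # The value must exist before the consumer's EX cycle begins.
    return max(0, produced_at_end_of + 1 - consumer_ex)


print("ALU result, next instruction:", bubbles_needed("EX", 1))
print("load, next instruction:      ", bubbles_needed("MEM", 1))
print("load, one instruction apart: ", bubbles_needed("MEM", 2))
```

Only a load followed immediately by its consumer needs a bubble; one unrelated instruction in between is enough to hide the delay.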
How to stall
To stall, the Hazard Unit does two things at once:
- Freeze the front of the pipeline so the consumer (and everything behind it) re-tries next cycle: deassert EN on the PC, the IF/DR register, and the DR/EX register.
- Flush the instruction currently leaving EX so a bubble (effectively a nop) is injected forward: assert CLR on the EX/MEM register.
cycle:          1  2  3  4  5  6  7
ld a2,(a0)  :   F  D  E  M  W
addi a0,a2,1:      F  D  D  E  M  W    <- DR repeated (stalled) one cycle
                         ^
                         bubble inserted into EX/MEM; PC, IF/DR, DR/EX held
After the stall, the load's value is in MEM/WB and ordinary forwarding (Mod 2) delivers it to the addi's EX. One bubble, then full speed.
The Hazard Unit logic for load stalling
// Load-use hazard: the instruction in EX/MEM is a load (MLD_3) that writes
// a register (RFW_3) which the instruction in DR is about to read.
if (RFW_3 && MLD_3 && ((RR0_2 == WR_3) || (RR1_2 == WR_3))) {
    PC_EN      = 0;         // freeze PC
    IF_DR_EN   = 0;         // freeze IF/DR pipeline register
    DR_EX_EN   = 0;         // freeze DR/EX pipeline register
    EX_MEM_CLR = 1;         // inject a bubble into EX/MEM
} else {
    PC_EN      = EN_ORG;    // preserve the original enable behavior
    IF_DR_EN   = 1;
    DR_EX_EN   = 1;
    EX_MEM_CLR = CLR_ORG;   // preserve the original clear behavior
}
MLD_3 is a "memory load" control line indicating the EX/MEM instruction is a load. Note the requirement that the Hazard Unit preserve the original EN/CLR lines when not stalling — the EN_ORG/CLR_ORG fall-through keeps single-step debugging working as before. This is test 05-ld-stl.s (20 pts).
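The stall logic can be exercised as a small truth-table function. The sketch below transcribes the pseudocode above into Python; the function name and the dict-of-signals return type are illustrative choices, and the example register numbers (a2 = x12) follow the RISC-V ABI.

```python
def load_stall(rr0_2, rr1_2, wr_3, rfw_3, mld_3, en_org=1, clr_org=0):
    """Return the stall-related control signals for one cycle."""
    stall = bool(rfw_3 and mld_3 and (rr0_2 == wr_3 or rr1_2 == wr_3))
    if stall:
        return {"PC_EN": 0, "IF_DR_EN": 0, "DR_EX_EN": 0, "EX_MEM_CLR": 1}
    # No hazard: pass the original enable/clear lines through.
    return {"PC_EN": en_org, "IF_DR_EN": 1, "DR_EX_EN": 1, "EX_MEM_CLR": clr_org}


A2 = 12  # a2 = x12

# ld a2, (a0) followed immediately by addi a0, a2, 1
print(load_stall(rr0_2=A2, rr1_2=0, wr_3=A2, rfw_3=1, mld_3=1))
# Same register numbers, but the producer is not a load: forwarding handles it.
print(load_stall(rr0_2=A2, rr1_2=0, wr_3=A2, rfw_3=1, mld_3=0))
```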
8. Control Hazards
A control hazard arises from instructions that change the PC: jumps (jal, jalr) and taken branches (beq, bne, blt, bge). The pipeline fetches the next sequential instructions every cycle, but a jump/branch should redirect the PC. By the time the jump resolves, the pipeline has already fetched the wrong-path instructions behind it.
main:
li a0, 3
jal foo # should jump to foo
unimp # marker: should NOT execute
foo:
addi a0, a0, 4 # a0 should be 7
ret
On the bare pipeline, after jal the next instructions in IF and DR are the wrong-path unimp (and whatever follows). The starter needs four nops after a jal to keep those wrong-path slots empty until the PC is updated.
Resolving control hazards: update early, flush
Two coordinated changes remove the need for those nops:
- Update the PC early. Compute the jump/branch target and feed it back to the PC from the EX stage rather than later. The second input of the PC-branch (PCBr) MUX comes from the ALU result in EX, and the MUX selector comes from PCbr_2 (the branch/jump-taken signal as seen in DR→EX), not from a later stage.
- Flush the wrong-path instructions already in IF and DR by clearing the IF/DR and DR/EX pipeline registers, turning them into bubbles.
// Control hazard: a jump or taken branch is detected in the EX path.
if (PCbr_2 == 1) {
    IF_DR_CLR = 1;         // flush the instruction in IF/DR (wrong path)
    DR_EX_CLR = 1;         // flush the instruction in DR/EX (wrong path)
} else {
    IF_DR_CLR = CLR_ORG;   // preserve original clear behavior
    DR_EX_CLR = CLR_ORG;
}
Before flush (jal fetched, wrong path entering):
cycle:          1  2  3  4  5  6
jal foo     :   F  D  E                 <- PC redirected here (EX)
unimp(wrong):      F  D  *flush*        <- in DR/EX, cleared to a bubble
???? (wrong):         F  *flush*        <- in IF/DR, cleared to a bubble
addi a0,a0,4:            F  D  E        <- correct target fetched after redirect
Because the redirect happens in EX and the two instructions fetched after the jump are flushed, execution continues correctly at foo. This is test 06-jal-fls.s (10 pts). Conditional branches reuse the exact same flush logic — PCbr_2 is asserted only when the Branch Unit says the branch is taken — so 07-branch.s (5 pts) passes for free once the jump flush works.
flowchart TD
A[Instruction in EX] --> B{PCbr_2 == 1?<br/>jump or taken branch}
B -- yes --> C[Set PC = ALU result from EX]
C --> D[IF_DR_CLR = 1<br/>DR_EX_CLR = 1<br/>flush wrong-path]
B -- no --> E[PC = PC + 4<br/>no flush]
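The redirect-and-flush decision in the flowchart can be written as one small function. This is a sketch only: the function name, the `next_pc` key, and the example addresses are hypothetical, and it models just the PC selection and the two clear lines.

```python
def control_flush(pcbr_2: int, alu_result: int, pc_plus_4: int, clr_org: int = 0) -> dict:
    """Choose the next PC and the flush signals for one cycle."""
    if pcbr_2:
        # Jump or taken branch resolved in EX: redirect and kill the
        # two wrong-path instructions sitting in IF/DR and DR/EX.
        return {"next_pc": alu_result, "IF_DR_CLR": 1, "DR_EX_CLR": 1}
    # Sequential flow: keep fetching and preserve the original clear line.
    return {"next_pc": pc_plus_4, "IF_DR_CLR": clr_org, "DR_EX_CLR": clr_org}


# Hypothetical addresses: the jump target is 0x20, the sequential fetch is 0x10.
print(control_flush(pcbr_2=1, alu_result=0x20, pc_plus_4=0x10))
print(control_flush(pcbr_2=0, alu_result=0x20, pc_plus_4=0x10))
```

A not-taken branch leaves `pcbr_2` at 0, so the same function covers both jumps and conditional branches.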
9. Putting It Together: the Hazard Unit
All four modifications converge in a single combinational block — the Hazard Unit — that sits beside the datapath, observes the register numbers and control lines flowing through the pipeline registers, and produces:
- Forwarding selectors FRD0, FRD1 for the two EX-stage MUXes (Mod 2)
- Enable signals PC_EN, IF_DR_EN, DR_EX_EN for stalling (Mod 3)
- Clear signals EX_MEM_CLR (stall bubble), IF_DR_CLR, DR_EX_CLR (control-hazard flush) (Mods 3 and 4)
flowchart LR
subgraph Inputs
I1[RR0_2 / RR1_2]
I2[WR_3 / WR_4]
I3[RFW_3 / RFW_4]
I4[MLD_3]
I5[PCbr_2]
I6[EN_ORG / CLR_ORG]
end
HU[Hazard Unit]
subgraph Outputs
O1[FRD0 / FRD1]
O2[PC_EN, IF_DR_EN, DR_EX_EN]
O3[EX_MEM_CLR]
O4[IF_DR_CLR, DR_EX_CLR]
end
I1 --> HU
I2 --> HU
I3 --> HU
I4 --> HU
I5 --> HU
I6 --> HU
HU --> O1
HU --> O2
HU --> O3
HU --> O4
The summary signal list discussed in lecture for expanding the Hazard Unit interface includes RR0, RR1, WR_3, WR_4, MLD, RFW, plus the original EN/CLR lines that must be preserved.
Recommended implementation order
Implement and test the modifications in order — each one makes a specific test pass and builds on the previous:
| Order | Modification | Test it passes | Points |
|---|---|---|---|
| 1 | Invert RegFile clock | 00, 01, 02, 03 | 10 |
| 2 | Forwarding (FRD0/FRD1 MUXes) | 04-add-fwd | 50 |
| 3 | Load stalling | 05-ld-stl | 20 |
| 4 | Control-hazard flush (jump) | 06-jal-fls | 10 |
| 4b | Control-hazard flush (branch) | 07-branch | 5 |
| - | Full program | 08-fibrec (= 55) | 5 |
Project 7 correction announced in lecture: pull the latest in-class and test repos to get the updated Week 15 ROM directory, remove the old NEM instruction and the outdated ROM directory from your Project 7 folder, and copy in the new ROM directory. The top-level processor circuit must be named project07.dig.
10. Performance: Counting Cycles
A useful exam skill is counting how many cycles a sequence of m instructions takes on a k-stage pipeline. With no hazards/stalls:
total cycles = k + (m - 1)
             = (fill the pipeline: k) + (one more instruction completes each additional cycle: m - 1)
For our 5-stage pipeline running 3 independent instructions:
total cycles = 5 + (3 - 1) = 7
Each hazard fix changes the number of instructions (nops removed) and the number of stall bubbles, and therefore the cycle count. The lecture's running example illustrates the progression for the same dependent add:
| Version | Instructions (incl. nops) | Cycles |
|---|---|---|
| Starter (3 nops) | 2 producers + 3 nops + add = 6 | 10 |
| Inverted clock (2 nops) | 2 producers + 2 nops + add = 5 | 9 |
| Forwarding (0 nops) | 2 producers + add = 3 | 7 |
Each stall bubble adds exactly one cycle. Each flush turns an already-fetched wrong-path instruction into a bubble, so a jump or taken branch resolved in EX costs the two flushed slots and nothing beyond them.
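The counting rule is simple enough to encode and check against the table. The sketch below is illustrative: `total_cycles` and its `bubbles` parameter are assumed names, with `bubbles` covering any stall cycles added on top of the hazard-free count.

```python
def total_cycles(m: int, k: int = 5, bubbles: int = 0) -> int:
    """Cycles for m instructions on a k-stage pipeline, plus stall bubbles."""
    return k + (m - 1) + bubbles


# The running example: instruction counts include the hand-inserted nops.
print("starter, 3 nops:       ", total_cycles(6))
print("inverted clock, 2 nops:", total_cycles(5))
print("forwarding, 0 nops:    ", total_cycles(3))
# ld followed by a dependent addi: two instructions and one stall bubble.
print("load-use pair:         ", total_cycles(2, bubbles=1))
```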
Key Concepts
| Concept | Definition | Example |
|---|---|---|
| Pipelining | Overlapping instruction stages to raise throughput | 5 instructions in flight, one finishes per cycle |
| Pipeline register | Register between stages that holds intermediate values | IF/DR, DR/EX, EX/MEM, MEM/WB |
| Data hazard (RAW) | Instruction reads a register an earlier in-flight instruction writes | add a0,a1,a2 after addi a1,... |
| Clock inversion | Write RegFile on 1st half, read on 2nd half of the cycle | Removes one nop (10 → 9 cycles) |
| Forwarding | Routing a result from EX/MEM or MEM/WB back to EX | FRD0 = 2 selects ALUR_3 |
| Forwarding priority | Closest (most recent) producer wins | check WR_3 before WR_4 |
| Load-use hazard | Load result needed by the immediately following instruction | ld a2,(a0) then addi a0,a2,1 |
| Stall | Freeze front of pipeline (EN=0) and inject a bubble (CLR=1) | one-cycle stall before forwarding the load |
| Control hazard | Jump/branch changes PC after wrong-path instrs fetched | jal foo then wrong-path unimp |
| Flush | Clear a pipeline register to a bubble | IF_DR_CLR=1, DR_EX_CLR=1 on a jump |
| Hazard Unit | Combinational block producing forward/stall/flush signals | inputs RR0, WR_3, RFW_3, MLD_3, PCbr_2 |
Practice Problems
Problem 1: Identify the hazard
Classify the hazard (if any) in each pair and name the cheapest fix.
# (a)
add t0, t1, t2
sub t3, t0, t4
# (b)
ld t0, (a0)
add t3, t0, t4
# (c)
beq t0, t1, done
add t2, t3, t4
Solution
- **(a)** RAW data hazard: `sub` reads `t0` produced by `add`. Cheapest fix: **forwarding** from EX/MEM to EX (`FRD0` selects `ALUR_3`). No stall needed.
- **(b)** Load-use data hazard: `add` reads `t0` loaded by `ld`, whose value is ready only in MEM. Cheapest fix: **one-cycle stall**, then forward from MEM/WB. Forwarding alone cannot fix it because the value does not exist early enough.
- **(c)** Control hazard: `beq` may redirect the PC, but `add` is fetched on the wrong path. Fix: update PC in EX and **flush** IF/DR and DR/EX if the branch is taken (`PCbr_2 == 1`).
Problem 2: Cycle count with clock inversion
The starter pipeline needs three nops between a producer and consumer. After inverting the RegFile clock it needs two. For the sequence below, how many cycles does it take with the inverted clock (two nops between the dependent instructions)?
addi a1, zero, 3
addi a2, zero, 4
nop
nop
add a0, a1, a2
Solution
There are 5 instructions, 5-stage pipeline, no stalls: total cycles = 5 + (5 - 1) = 9. This matches the lecture's "10 cycles → 9 cycles" result: inverting the clock removed one `nop`, dropping the count from 10 to 9.
Problem 3: Forwarding selector values
Given the snapshot below, what are FRD0 and FRD1 for the instruction in EX? Use the project's selector encoding (0 = RegFile, 1 = MEM/WB result, 2 = EX/MEM result).
Instruction in EX : add a0, a1, a2 (RR0_2 = a1, RR1_2 = a2)
Instruction in EX/MEM (stage 3): addi a2, zero, 4 (WR_3 = a2, RFW_3 = 1)
Instruction in MEM/WB (stage 4): addi a1, zero, 3 (WR_4 = a1, RFW_4 = 1)
Solution
Apply the Hazard Unit logic to each source:
- **FRD0** (source `a1`): `RR0_2 == WR_3`? `a1 == a2`? No. `RR0_2 == WR_4 && RFW_4`? `a1 == a1` and `RFW_4 == 1`? Yes → `FRD0 = 1` (forward `MR_4`).
- **FRD1** (source `a2`): `RR1_2 == WR_3 && RFW_3`? `a2 == a2` and `RFW_3 == 1`? Yes → `FRD1 = 2` (forward `ALUR_3`).
So `FRD0 = 1`, `FRD1 = 2`. The `add` gets `a1 = 3` from MEM/WB and `a2 = 4` from EX/MEM, computing `a0 = 7` with no `nop`s.
Problem 4: Why priority matters
Show that swapping the order of the two if tests in the forwarding logic gives the wrong answer for the program below.
addi a0, zero, 3
addi a0, zero, 4
add a0, a0, a0
Solution
When `add` is in EX:
- The second `addi a0,...,4` is in **EX/MEM** (stage 3, the *newer* value 4).
- The first `addi a0,...,3` is in **MEM/WB** (stage 4, the *older* value 3).
**Correct logic** checks stage 3 first: the `WR_3` test matches for both sources, so `FRD0 = FRD1 = 2` and both ALU inputs receive `ALUR_3 = 4`. Result: `4 + 4 = 8`. Correct.
**Swapped logic** checks stage 4 first: the `WR_4` test also matches, so `FRD0 = FRD1 = 1` and both ALU inputs receive `MR_4 = 3`. Result: `3 + 3 = 6`. Wrong. The closest producer holds the most recent write, so its test must come first.
Problem 5: Detecting a load stall
Write the boolean condition under which the Hazard Unit should stall, and state exactly which pipeline-register control signals it sets.
Solution
Stall condition (the EX/MEM instruction is a load writing a register the DR instruction reads): `stall = RFW_3 && MLD_3 && ((RR0_2 == WR_3) || (RR1_2 == WR_3))`. When `stall` is true: `PC_EN = 0`, `IF_DR_EN = 0`, `DR_EX_EN = 0`, and `EX_MEM_CLR = 1`. When false, the Hazard Unit passes through the original `EN_ORG` / `CLR_ORG` values so normal operation (and single-step debugging) is unaffected.
Problem 6: Control-hazard flush count
For the program below, how many fetched instructions must be flushed when jal redirects the PC from the EX stage, and which pipeline registers are cleared?
main:
li a0, 3
jal foo
unimp
foo:
addi a0, a0, 4
ret
Solution
When `jal` reaches EX and asserts `PCbr_2 == 1`, two instructions have already been fetched on the wrong (sequential) path: one in the **DR/EX** register and one in the **IF/DR** register. Both are flushed: `IF_DR_CLR = 1` and `DR_EX_CLR = 1`. So **two** instructions are flushed (turned into bubbles), and the correct target `foo` is fetched right after the PC is redirected. No `nop`s are needed in the source program.
Further Reading
- Processor design overview (course guide)
- Project 7: RISC-V Pipelined Processor Hazard Unit
- Source PDF: handwritten lecture notes
- Instruction pipelining (Wikipedia)
- Hazard (computer architecture) (Wikipedia)
- Classic RISC pipeline (Wikipedia)
- Patterson & Hennessy, Computer Organization and Design (RISC-V Edition), Chapter 4: The Processor
Summary
- A 5-stage pipeline (IF, DR, EX, MEM, WB) overlaps instructions to raise throughput; pipeline registers carry intermediate values between stages.
- Hazards arise because of the overlap: a data hazard reads a not-yet-written value, a control hazard fetches wrong-path instructions after a jump/branch.
- There are four modifications to remove hand-inserted nops: invert the RegFile clock, add forwarding, add load stalling, and add control-hazard flushing, implemented in that order.
- Inverting the RegFile clock lets WB write in the first half of a cycle and DR read in the second half, removing one nop and dropping the example from 10 to 9 cycles.
- Forwarding routes a result from the EX/MEM (ALUR_3) or MEM/WB (MR_4) stage back into the EX-stage ALU inputs via FRD0/FRD1 MUXes, removing the remaining nops; the closest producer takes priority.
- Load-use hazards cannot be fixed by forwarding alone because the load's value is ready only in MEM, so the Hazard Unit stalls one cycle by deasserting PC_EN, IF_DR_EN, and DR_EX_EN and injecting a bubble with EX_MEM_CLR.
- Control hazards are resolved by updating the PC early (from the EX ALU result, selected by PCbr_2) and flushing the wrong-path instructions in IF/DR and DR/EX via IF_DR_CLR and DR_EX_CLR.
- All of this lives in a single Hazard Unit that observes register numbers and control lines (RR0, RR1, WR_3, WR_4, RFW, MLD, PCbr_2) and must preserve the original EN/CLR behavior when no hazard is present.