Processor Pipeline Hazards¶

Overview¶

This lecture moves from the single-cycle processor we built in Project 6 to a pipelined processor that overlaps instruction execution across five stages to increase throughput. Pipelining introduces a new problem: hazards, situations where the overlap causes an instruction to use a value (or a PC) that is not yet ready. We study the four fixes you will implement for Project 7 and its Hazard Unit (inverting the register-file clock, data forwarding, and load stalling for data hazards, plus flushing for control hazards) and show with cycle-by-cycle diagrams how each fix removes nop instructions and shortens the running time of a program.

Learning Objectives¶

Explain how a 5-stage pipeline (IF, DR, EX, MEM, WB) overlaps instructions to improve throughput without shortening the latency of any single instruction
Describe the role of pipeline registers in carrying intermediate values between stages
Identify the four kinds of modifications needed to handle hazards: clock inversion, forwarding, load stalling, and control-hazard flushing
Recognize a read-after-write (RAW) data hazard from a cycle/stage diagram
Show how inverting the RegFile clock removes one nop per dependency
Design the forwarding (bypass) logic that selects RD0/RD1 from the EX, MEM, or WB results
Detect a load-use hazard and stall the pipeline by gating enable and clear signals on pipeline registers
Resolve control hazards (jumps and branches) by updating the PC early and flushing wrong-path instructions

Prerequisites¶

The single-cycle RISC-V processor datapath from Project 6 (PC, instruction memory, RegFile, ImmDecoder, ALU, Branch Unit, data memory)
RISC-V assembly and machine code: addi, add, ld/sd, jal/jalr, beq/bne/blt/bge
Digital design fundamentals: registers, multiplexers, comparators, clocked vs. combinational logic
The RISC-V register file interface (RR0, RR1, RD0, RD1, WR, WD, WE)

See the processor design overview and the source PDF for the original handwritten lecture notes.

1. From Single-Cycle to Pipelined¶

In a single-cycle processor, one instruction occupies the entire datapath for one (long) clock cycle: fetch, decode, execute, memory, and write-back all happen before the next instruction begins. The clock period must be long enough for the slowest instruction to flow all the way through, so most of the hardware sits idle most of the time.

A pipelined processor splits the datapath into stages separated by registers, so that multiple instructions can be "in flight" at once — each occupying a different stage. This is the same idea as an assembly line or the laundry analogy: while one load is in the dryer, the next is in the washer.

flowchart LR
    A[IF<br/>Instruction Fetch] --> B[DR<br/>Decode / RegFile Read]
    B --> C[EX<br/>Execute / ALU]
    C --> D[MEM<br/>Memory]
    D --> E[WB<br/>Write Back]

    style A fill:#cde,stroke:#333
    style B fill:#cde,stroke:#333
    style C fill:#fdc,stroke:#333,stroke-width:2px
    style D fill:#cde,stroke:#333
    style E fill:#cde,stroke:#333

The five RISC-V pipeline stages are:

Stage	Name	Work performed
IF	Instruction Fetch	Read instruction from instruction memory using PC; compute PC+4
DR	Decode / RegFile Read	Decode the instruction word; read source registers (RD0, RD1); produce control lines
EX	Execute	Perform the ALU operation; compute branch/jump target; evaluate the branch comparison
MEM	Memory	Load from or store to data memory
WB	Write Back	Write the result back into the register file

Historical note: pipelining originated in mainframes and became common in personal-computer CPUs in the 1980s. It is the foundational technique behind all modern high-performance processors.

Throughput, not latency¶

Pipelining does not make a single instruction finish faster: that instruction still passes through all five stages. What improves is throughput: once the pipeline is full, one instruction completes every cycle. In steady state, with balanced stages and no hazards, an n-stage pipeline approaches an n-times speedup over a single-cycle machine whose one long cycle takes as long as all n stages combined.

Single-cycle (1 long cycle each):
  I1: [=====]
  I2:        [=====]
  I3:               [=====]

Pipelined (5 short stages, overlapped):
  I1: F D E M W
  I2:   F D E M W
  I3:     F D E M W      <- one instruction finishes per cycle once full
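
The arithmetic behind the throughput claim can be sketched in a few lines of Python. This is an illustration only: it assumes perfectly balanced stages, no hazards, and a stage delay of 1 time unit, and the function names are made up for this example rather than taken from the project.

```python
def single_cycle_time(m, n_stages=5, stage_delay=1.0):
    """m instructions, each taking one long cycle of n_stages * stage_delay."""
    return m * n_stages * stage_delay


def pipelined_time(m, n_stages=5, stage_delay=1.0):
    """Fill the pipeline once (n_stages cycles), then finish one instruction per cycle."""
    return (n_stages + (m - 1)) * stage_delay


def speedup(m, n_stages=5, stage_delay=1.0):
    """Ratio of single-cycle time to pipelined time for m instructions."""
    return single_cycle_time(m, n_stages, stage_delay) / pipelined_time(m, n_stages, stage_delay)


for m in (1, 3, 1000):
    print(f"{m:5d} instructions: speedup {speedup(m):.2f}x")
```

A single instruction gains nothing (1.00x), three instructions gain about 2.14x, and 1000 instructions come close to the 5x limit (about 4.98x). The speedup is a steady-state property, not a per-instruction one.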

2. Pipeline Registers¶

Between every pair of stages sits a pipeline register that captures the intermediate values produced by one stage so the next stage can use them on the following clock edge. Without these registers, the partial results of one instruction would be overwritten by the next instruction entering the same combinational logic.

flowchart LR
    IF[IF] -->|IF/DR| DR[DR]
    DR -->|DR/EX| EX[EX]
    EX -->|EX/MEM| MEM[MEM]
    MEM -->|MEM/WB| WB[WB]

The four pipeline registers are named after the stages they sit between:

Pipeline register	Carries (examples)
IF/DR	The fetched instruction word, PC, PC+4
DR/EX	RD0, RD1, immediate, register numbers (RR0, RR1, WR), control lines (ALUOp, RFW, etc.)
EX/MEM	ALU result, RD1 (store data), WR, control lines (RFW, memory control)
MEM/WB	Memory read data, ALU result, WR, RFW

Each pipeline register has standard control inputs we will use to manage hazards:

EN (enable) — when 0, the register holds its current value (it does not load new data). This stalls the instruction sitting in that stage.
CLR (clear) — when 1, the register's outputs are zeroed on the next edge, which injects a "bubble" (effectively a nop). This flushes the instruction.
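
The EN/CLR behavior can be modeled with a small Python class. This is a toy sketch, not the Digital component: the class name, the field names, and the choice to let CLR win over EN when both are asserted are assumptions made for illustration.

```python
class PipelineRegister:
    """Toy model of one pipeline register with EN and CLR inputs."""

    def __init__(self, fields):
        self.fields = tuple(fields)
        # All-zero outputs are treated as a bubble (no register write, no memory access).
        self.q = {name: 0 for name in self.fields}

    def clock(self, d, en=1, clr=0):
        """Apply one clock edge; d maps each field name to the value at the input."""
        if clr:
            # Flush: outputs become a bubble (assumed to take priority over EN).
            self.q = {name: 0 for name in self.fields}
        elif en:
            # Normal operation: capture the new values.
            self.q = {name: d[name] for name in self.fields}
        # en == 0 and clr == 0: hold the current value (stall).
        return self.q


dr_ex = PipelineRegister(["RD0", "RD1", "WR", "RFW"])
dr_ex.clock({"RD0": 3, "RD1": 4, "WR": 10, "RFW": 1})
dr_ex.clock({"RD0": 9, "RD1": 9, "WR": 11, "RFW": 1}, en=0)   # stalled
held = dict(dr_ex.q)
dr_ex.clock({"RD0": 9, "RD1": 9, "WR": 11, "RFW": 1}, clr=1)  # flushed
flushed = dict(dr_ex.q)
print(held, flushed)
```

With EN = 0 the register keeps the first instruction's values even though new data is presented. With CLR = 1 every field drops to zero, so RFW is 0 and the slot behaves like a nop.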

A subscript convention is used throughout the project to indicate which copy of a signal we mean: the number names the pipeline register whose output carries the signal. Numbering the registers 1 = IF/DR, 2 = DR/EX, 3 = EX/MEM, 4 = MEM/WB, RR0_2 is the source-register-0 number held in DR/EX (it belongs to the instruction currently in EX), WR_3 is the destination register number held in EX/MEM, RFW_4 is the register-file-write control held in MEM/WB, and so on.

3. The Four Modifications¶

The heart of this lecture is a list of four types of modifications ("mods") to evolve the starter pipeline so that ordinary programs run correctly without hand-inserted nop instructions:

4 types of mods
  1) Invert the clock to the RegFile      (removes one nop)   <- checked in lecture
  2) Forwarding                           (data hazards)
  3) LD stalling                          (load-use hazard)
  4) Control Hazard                       (jumps and branches)

These build on each other. The starter pipeline already runs correctly if you insert enough nop instructions (addi zero, zero, 0) between dependent instructions. Each modification removes some of those required nops and lets the pipeline run faster, until no manual nops are needed at all.

Mod	Hazard addressed	Mechanism
1. Invert RegFile clock	Data (RAW)	Write in WB on the first half of the cycle, read in DR on the second half
2. Forwarding	Data (RAW)	Route the EX/MEM or MEM/WB result back into the EX stage
3. LD stalling	Load-use	Detect a load feeding the next instruction; freeze the pipeline one cycle
4. Control hazard	Jumps / branches	Update PC early in EX; flush the wrong-path instructions already fetched

4. Data Hazards (RAW)¶

A data hazard occurs when an instruction needs a value that an earlier, still-in-flight instruction has computed but not yet written back. The most common form is read-after-write (RAW): instruction I2 reads a register that instruction I1 writes.

Consider this dependent sequence — add needs the values addi placed into a1 and a2:

addi a1, zero, 3     # a1 = 3, written back in WB
addi a2, zero, 4     # a2 = 4, written back in WB
add  a0, a1, a2      # needs a1 and a2 -> a0 should be 7

On the bare pipeline this fails. The add reaches its DR stage (where it reads the register file) before the addis reach their WB stage (where they write the register file). It reads stale values.

Why three `nop`s "work" on the starter¶

The starter pipeline reads registers in DR and writes them in WB. WB is three stages after DR. So a dependent instruction must be separated by enough cycles that the producer's WB happens before the consumer's DR. The original starter therefore needs three nops between a producer and a consumer:

       cycle:  1  2  3  4  5  6  7  8  9  10
addi a1,..,3:  F  D  E  M  W
addi a2,..,4:     F  D  E  M  W
nop         :        F  D  E  M  W
nop         :           F  D  E  M  W
nop         :              F  D  E  M  W
add a0,a1,a2:                 F  D  E  M  W

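The add reaches DR in cycle 7, one cycle after a2's WB completes in cycle 6, so the register file already holds both new values when the add reads it. With only two nops, the add's DR would land in cycle 6, the same cycle as a2's WB, and the read would be correct only if the write happened early enough in that cycle to be visible. That same-cycle timing is exactly what Modification 1 fixes.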

This is 00-add-3nop.s in the project test suite: it passes on the unmodified starter and exists so you can see how nops paper over hazards.

5. Modification 1 — Invert the RegFile Clock¶

The first and simplest fix: invert the clock signal feeding the register file. This lets the WB stage write a register on the first half of a clock cycle and lets the DR stage read that same register on the second half of the same cycle. A producer's write becomes visible to a consumer's read one cycle sooner, which removes one required nop — going from three nops to two.

The handwritten timing diagram shows the regular CLK on top and the inverted CLK below. The DR stage reads on the rising edge of the inverted clock (the falling edge of the real clock), which is the second half of the cycle, after WB has already written on the first half.

        |<-- one clock cycle -->|
        ___________             ___________
CLK    |           |           |           |      (RegFile WB writes here, 1st half)
    ___|           |___________|           |___

           _______             _______
~CLK  ____|       |___________|       |________   (RegFile DR reads here, 2nd half)

       WB writes  -> then DR reads the new value in the SAME cycle

With the inverted clock, the earlier nop-padded program needs only two nops, and the whole sequence finishes in 9 cycles instead of 10:

       cycle:  1  2  3  4  5  6  7  8  9
addi a1,..,3:  F  D  E  M  W
addi a2,..,4:     F  D  E  M [W]   <- WB writes a2 on 1st half of cycle 6
nop         :        F  D  E  M  W
nop         :           F  D  E  M  W
add a0,a1,a2:              F [D] E  M  W   <- DR reads a2 on 2nd half of cycle 6

The two boxed cells line up in the same cycle: WB writes first, DR reads second.

This single change is what makes test cases 01-add-2nop.s, 02-jal.s, and 03-ld.s pass on the starter. All you do in Digital is route an inverted CLK into the RegFile's clock input.

Intuition: the register file is the shared "blackboard." If the writer always erases-and-rewrites at the very start of the period and the reader always copies at the very end of the period, the reader sees today's value, not yesterday's. Inverting one clock is enough to enforce that ordering.
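
The nop counts for the starter and for the inverted clock follow from comparing two cycle numbers: the producer's WB and the consumer's DR. The sketch below works that out in Python. It assumes no forwarding and no stalls, and the helper names are hypothetical, chosen only for this example.

```python
STAGES = ["IF", "DR", "EX", "MEM", "WB"]


def stage_cycle(fetch_cycle, stage):
    """Cycle in which an instruction fetched in fetch_cycle occupies `stage` (no stalls)."""
    return fetch_cycle + STAGES.index(stage)


def read_sees_write(producer_fetch, consumer_fetch, inverted_clock):
    """True if the consumer's DR read observes the producer's WB write."""
    wb = stage_cycle(producer_fetch, "WB")
    dr = stage_cycle(consumer_fetch, "DR")
    # Starter: the read must come in a strictly later cycle than the write.
    # Inverted RegFile clock: a read in the same cycle as the write also works.
    return dr >= wb if inverted_clock else dr > wb


def min_nops(inverted_clock):
    """Fewest nops between a producer and its consumer (no forwarding)."""
    nops = 0
    while not read_sees_write(1, 2 + nops, inverted_clock):
        nops += 1
    return nops


print(min_nops(inverted_clock=False))  # starter pipeline
print(min_nops(inverted_clock=True))   # after Modification 1
```

The two printed values are 3 and 2, matching the three-nop starter and the two-nop inverted-clock versions of the running example.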

6. Modification 2 — Forwarding (Bypassing)¶

Inverting the clock removed one nop, but two still remain because a producer's result is not in the register file until WB. Forwarding (also called bypassing) removes the rest by recognizing that the value the consumer needs already exists deeper in the pipeline — it just has not been written back yet. Instead of waiting for WB, we route the result directly to the EX stage where it is needed.

Look at the dependency again, this time annotated with where the value actually becomes available:

       cycle:  1  2  3  4  5  6  7
addi a1,..,3:  F  D  E  M [W]
addi a2,..,4:     F  D  E [M] W
add a0,a1,a2:        F  D [E] M  W
                           ^
                           cycle 5: the add needs a1, a2 at the START of its EX;
                           a2's ALU result exists since the end of its EX
                           (cycle 4 -> now in the EX/MEM reg), a1's is in MEM/WB

When the add is in EX (cycle 5), the second addi's result is sitting in the EX/MEM pipeline register, and the first addi's result is in the MEM/WB pipeline register. Forwarding adds datapaths that carry those results back into the EX stage so the add can use them immediately — no nops at all. The handwritten note draws the blue "forwarding" wires curving from the later stages back to the add's EX inputs.

With forwarding, the entire dependent sequence collapses to back-to-back instructions:

       cycle:  1  2  3  4  5  6  7
addi a1,..,3:  F  D  E  M  W
addi a2,..,4:     F  D  E  M  W
add a0,a1,a2:        F  D  E  M  W   <- gets a1,a2 by forwarding, NO nops

The forwarding datapath¶

We insert a multiplexer in front of each ALU input. The RD0 path gets one MUX, the RD1 path gets another. Each MUX chooses among three sources:

The original register-file value (RD0 / RD1 from the DR/EX register) — selector value 0
The ALU result in the EX/MEM stage (ALUR_3) — selector value 2
The write-back-stage result (MR_4, the output of the MEM/WB write-data MUX) — selector value 1

flowchart TD
    RD0[RD0 from DR/EX] --> M0[/RD0 MUX<br/>sel = FRD0\]
    ALUR3[ALUR_3<br/>EX/MEM ALU result] --> M0
    MR4[MR_4<br/>MEM/WB result] --> M0
    M0 --> A[ALU input A]

    RD1[RD1 from DR/EX] --> M1[/RD1 MUX<br/>sel = FRD1\]
    ALUR3b[ALUR_3] --> M1
    MR4b[MR_4] --> M1
    M1 --> B[ALU input B]

The Hazard Unit logic for forwarding¶

The Hazard Unit computes the selector FRD0 (and symmetrically FRD1) by comparing the consumer's source register number (RR0_2, held in DR/EX while the consumer is in EX) against the destination register numbers of the two instructions ahead of it (WR_3 and WR_4). It forwards only if the matching instruction actually writes the register file (RFW):

// Forwarding selector for ALU input A (RD0).
// Stage 3 = EX/MEM (closest, highest priority), Stage 4 = MEM/WB.
if ((RR0_2 == WR_3) && RFW_3) {
    FRD0 = 2;          // forward the EX/MEM ALU result (ALUR_3)
} else if ((RR0_2 == WR_4) && RFW_4) {
    FRD0 = 1;          // forward the MEM/WB result (MR_4)
} else {
    FRD0 = 0;          // no hazard: use the register-file value RD0
}

FRD1 is identical but uses RR1_2 in place of RR0_2.
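
The selector logic and the MUX it drives can be checked in software before wiring them in Digital. The Python sketch below mirrors the pseudocode above. The function names and the use of 0 for the stale register-file values are assumptions made for illustration, and register numbers follow the standard RISC-V ABI (a1 = x11, a2 = x12).

```python
def forward_select(rr_2, wr_3, rfw_3, wr_4, rfw_4):
    """Return the forwarding MUX selector (0, 1, or 2) for one ALU operand."""
    if rr_2 == wr_3 and rfw_3:
        return 2   # EX/MEM ALU result (ALUR_3): closest producer wins
    if rr_2 == wr_4 and rfw_4:
        return 1   # MEM/WB result (MR_4)
    return 0       # no hazard: register-file value


def alu_operand(sel, rd, alur_3, mr_4):
    """Model the 3-input MUX in front of one ALU input."""
    return {0: rd, 1: mr_4, 2: alur_3}[sel]


A1, A2 = 11, 12   # a1 = x11, a2 = x12

# add a0, a1, a2 is in EX; addi a2 is in EX/MEM (stage 3), addi a1 is in MEM/WB (stage 4).
frd0 = forward_select(A1, wr_3=A2, rfw_3=1, wr_4=A1, rfw_4=1)
frd1 = forward_select(A2, wr_3=A2, rfw_3=1, wr_4=A1, rfw_4=1)

# rd=0 stands in for the stale values read from the register file.
a = alu_operand(frd0, rd=0, alur_3=4, mr_4=3)
b = alu_operand(frd1, rd=0, alur_3=4, mr_4=3)
print(frd0, frd1, a + b)
```

For the addi/addi/add example this prints `1 2 7`: operand A comes from MEM/WB, operand B from EX/MEM, and the add computes 7 without any nops.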

Priority: closest producer wins¶

The order of the tests matters. If both instructions ahead write the same register, the one closest to the consumer (the EX/MEM stage, stage 3) holds the more recent value and must win. The Hazard Unit checks stage 3 first.

addi a0, zero, 3     # writes a0 (older)
addi a0, zero, 4     # writes a0 (newer -> this is the value we want)
add  a0, a0, a0      # must see 4 + 4 = 8, not 3

Because the second addi is closer (in EX/MEM) when the add is in EX, FRD0 = 2 fires first and forwards 4. This gives a0 = 8, which is correct.

This is test 04-add-fwd.s, worth the largest single block of points (50 pts) because it is the workhorse data-hazard fix.

7. Modification 3 — Load Stalling¶

Forwarding handles the common case where the needed value is produced by the ALU (available at the end of EX). But a load (ld) does not produce its value until the end of the MEM stage. If the very next instruction needs the loaded value in its EX stage, there is no way to forward in time: the data does not exist yet. This is a load-use hazard, and the only fix is to stall the pipeline for one cycle so the load can finish MEM, after which its value is forwarded from the MEM/WB register.

li   a0, 0
li   a1, 4
sd   a1, (a0)        # store 4 to memory[a0]
ld   a2, (a0)        # a2 = memory[a0] = 4  (value ready only in MEM)
addi a0, a2, 1       # needs a2 immediately -> a0 should be 5
unimp

       cycle:  1  2  3  4  5  6
ld a2,(a0)  :  F  D  E [M] W       <- a2 known only at end of MEM (cycle 4)
addi a0,a2,1:     F  D [E] ...     <- needs a2 at start of EX (cycle 4): TOO EARLY

The consumer's EX and the load's MEM are in the same cycle, so the value cannot be forwarded backward in time. We must insert exactly one bubble.

How to stall¶

To stall, the Hazard Unit does two things at once:

Freeze the front of the pipeline so the consumer (and everything behind it) re-tries next cycle: deassert EN on the PC, the IF/DR register, and the DR/EX register.
Flush the instruction currently leaving EX so a bubble (effective nop) is injected forward: assert CLR on the EX/MEM register.

       cycle:  1  2  3  4  5  6  7
ld a2,(a0)  :  F  D  E  M  W
addi a0,a2,1:     F  D  E  E  M  W   <- EX repeated (held in DR/EX) one cycle
                        ^
                     stall detected in cycle 4: PC, IF/DR, DR/EX held; bubble inserted into EX/MEM

After the stall, the load's value is in MEM/WB and ordinary forwarding (Mod 2) delivers it to the addi's EX. One bubble, then full speed.

The Hazard Unit logic for load stalling¶

// Load-use hazard: the instruction in EX/MEM is a load (MLD_3) that writes
// a register (RFW_3) which the instruction right behind it (RR0_2 / RR1_2) needs.
if (RFW_3 && MLD_3 && ((RR0_2 == WR_3) || (RR1_2 == WR_3))) {
    PC_EN      = 0;          // freeze PC
    IF_DR_EN   = 0;          // freeze IF/DR pipeline register
    DR_EX_EN   = 0;          // freeze DR/EX pipeline register
    EX_MEM_CLR = 1;          // inject a bubble into EX/MEM
} else {
    PC_EN      = EN_ORG;     // preserve the original enable behavior
    IF_DR_EN   = 1;
    DR_EX_EN   = 1;
    EX_MEM_CLR = CLR_ORG;    // preserve the original clear behavior
}

MLD_3 is a "memory load" control line indicating the EX/MEM instruction is a load. Note the requirement that the Hazard Unit preserve the original EN/CLR lines when not stalling — the EN_ORG/CLR_ORG fall-through keeps single-step debugging working as before. This is test 05-ld-stl.s (20 pts).
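
The stall condition can be exercised in software in the same way. The Python function below is a direct transcription of the pseudocode above, offered as a sketch: the function name and the default values for EN_ORG and CLR_ORG are assumptions, and a2 is register x12 in the RISC-V ABI.

```python
def load_stall_controls(rfw_3, mld_3, wr_3, rr0_2, rr1_2, en_org=1, clr_org=0):
    """Return the stall-related control signals for one cycle."""
    stall = bool(rfw_3 and mld_3 and (rr0_2 == wr_3 or rr1_2 == wr_3))
    if stall:
        # Freeze the front of the pipeline and inject a bubble into EX/MEM.
        return {"PC_EN": 0, "IF_DR_EN": 0, "DR_EX_EN": 0, "EX_MEM_CLR": 1}
    # No hazard: pass the original enable/clear behavior through.
    return {"PC_EN": en_org, "IF_DR_EN": 1, "DR_EX_EN": 1, "EX_MEM_CLR": clr_org}


A2 = 12  # a2 = x12

# Cycle 4 of the example: ld a2,(a0) is in EX/MEM, addi a0,a2,1 reads a2.
during = load_stall_controls(rfw_3=1, mld_3=1, wr_3=A2, rr0_2=A2, rr1_2=0)

# Next cycle: EX/MEM now holds the bubble (all zeros), so the stall releases.
after = load_stall_controls(rfw_3=0, mld_3=0, wr_3=0, rr0_2=A2, rr1_2=0)
print(during)
print(after)
```

The second call shows why the stall lasts exactly one cycle: once the bubble occupies EX/MEM, MLD_3 and RFW_3 are both 0 and the condition can no longer be true.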

8. Control Hazards¶

A control hazard arises from instructions that change the PC: jumps (jal, jalr) and taken branches (beq, bne, blt, bge). The pipeline fetches the next sequential instructions every cycle, but a jump/branch should redirect the PC. By the time the jump resolves, the pipeline has already fetched the wrong-path instructions behind it.

main:
    li  a0, 3
    jal foo              # should jump to foo
    unimp                # marker: should NOT execute
foo:
    addi a0, a0, 4       # a0 should be 7
    ret

On the bare pipeline, after jal the next instructions in IF and DR are the wrong-path unimp (and whatever follows). The starter needs four nops after a jal to keep those wrong-path slots empty until the PC is updated.

Resolving control hazards: update early, flush¶

Two coordinated changes remove the need for those nops:

Update the PC early. Compute the jump/branch target and feed it back to the PC from the EX stage rather than later. The second input of the PC-branch (PCBr) MUX comes from the ALU result in EX, and the MUX selector comes from PCbr_2 (the branch/jump-taken signal as seen in DR→EX), not from a later stage.
Flush the wrong-path instructions already in IF and DR by clearing the IF/DR and DR/EX pipeline registers, turning them into bubbles.

// Control hazard: a jump or taken branch is detected in the EX path.
if (PCbr_2 == 1) {
    IF_DR_CLR = 1;          // flush the instruction in IF/DR (wrong path)
    DR_EX_CLR = 1;          // flush the instruction in DR/EX (wrong path)
} else {
    IF_DR_CLR = CLR_ORG;    // preserve original clear behavior
    DR_EX_CLR = CLR_ORG;
}

Flush timeline (jal resolves in EX during cycle 3; the two flushed slots become bubbles):
       cycle:  1  2  3  4  5  6
jal foo     :  F  D  E             <- PC redirected here (EX)
unimp(wrong):     F  D  *flush*    <- in DR/EX, cleared to a bubble
???? (wrong):        F  *flush*    <- in IF/DR, cleared to a bubble
addi a0,a0,4:           F  D  E    <- correct target fetched after redirect

Because the redirect happens in EX and the two instructions fetched after the jump are flushed, execution continues correctly at foo. This is test 06-jal-fls.s (10 pts). Conditional branches reuse the exact same flush logic — PCbr_2 is asserted only when the Branch Unit says the branch is taken — so 07-branch.s (5 pts) passes for free once the jump flush works.

flowchart TD
    A[Instruction in EX] --> B{PCbr_2 == 1?<br/>jump or taken branch}
    B -- yes --> C[Set PC = ALU result from EX]
    C --> D[IF_DR_CLR = 1<br/>DR_EX_CLR = 1<br/>flush wrong-path]
    B -- no --> E[PC = PC + 4<br/>no flush]
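
The decision in the flowchart can be written as a small Python function. This is a sketch under assumptions: the function name, the parameter names `alu_result` and `pc_plus_4`, and the example addresses are all hypothetical, chosen only to show the redirect-and-flush behavior.

```python
def control_hazard(pcbr_2, alu_result, pc_plus_4, clr_org=0):
    """Return (next_pc, IF_DR_CLR, DR_EX_CLR) for one cycle."""
    if pcbr_2 == 1:
        # Jump or taken branch in EX: redirect the PC and flush both wrong-path slots.
        return alu_result, 1, 1
    # Sequential execution: keep fetching and preserve the original clear behavior.
    return pc_plus_4, clr_org, clr_org


taken = control_hazard(pcbr_2=1, alu_result=0x40, pc_plus_4=0x10)
not_taken = control_hazard(pcbr_2=0, alu_result=0x40, pc_plus_4=0x10)
print(taken, not_taken)
```

A not-taken branch leaves PCbr_2 at 0, so the pipeline keeps its sequentially fetched instructions and pays no penalty.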

9. Putting It Together: the Hazard Unit¶

Modification 1 is only a clock-wiring change at the RegFile. The other three modifications converge in a single combinational block, the Hazard Unit, that sits beside the datapath, observes the register numbers and control lines flowing through the pipeline registers, and produces:

Forwarding selectors FRD0, FRD1 for the two EX-stage MUXes (Mod 2)
Enable signals PC_EN, IF_DR_EN, DR_EX_EN for stalling (Mod 3)
Clear signals EX_MEM_CLR (stall bubble), IF_DR_CLR, DR_EX_CLR (control-hazard flush) (Mods 3 and 4)

flowchart LR
    subgraph Inputs
        I1[RR0_2 / RR1_2]
        I2[WR_3 / WR_4]
        I3[RFW_3 / RFW_4]
        I4[MLD_3]
        I5[PCbr_2]
        I6[EN_ORG / CLR_ORG]
    end
    HU[Hazard Unit]
    subgraph Outputs
        O1[FRD0 / FRD1]
        O2[PC_EN, IF_DR_EN, DR_EX_EN]
        O3[EX_MEM_CLR]
        O4[IF_DR_CLR, DR_EX_CLR]
    end
    I1 --> HU
    I2 --> HU
    I3 --> HU
    I4 --> HU
    I5 --> HU
    I6 --> HU
    HU --> O1
    HU --> O2
    HU --> O3
    HU --> O4

The summary signal list discussed in lecture for expanding the Hazard Unit interface includes RR0, RR1, WR_3, WR_4, MLD, RFW, plus the original EN/CLR lines that must be preserved.

Recommended implementation order¶

Implement and test the modifications in order — each one makes a specific test pass and builds on the previous:

Order	Modification	Test it passes	Points
1	Invert RegFile clock	`00`, `01`, `02`, `03`	10
2	Forwarding (FRD0/FRD1 MUXes)	`04-add-fwd`	50
3	Load stalling	`05-ld-stl`	20
4	Control-hazard flush (jump)	`06-jal-fls`	10
4b	Control-hazard flush (branch)	`07-branch`	5
—	Full program	`08-fibrec` (= 55)	5

Project 7 correction announced in lecture: pull the latest in-class and test repos to get the updated Week 15 ROM directory, remove the old NEM instruction and the outdated ROM directory from your Project 7 folder, and copy in the new ROM directory. The top-level processor circuit must be named project07.dig.

10. Performance: Counting Cycles¶

A useful exam skill is counting how many cycles a sequence of m instructions takes on a k-stage pipeline. With no hazards/stalls:

total cycles = k + (m - 1)
             = (fill the pipeline: k) + (one more instruction completes each
                additional cycle: m - 1)

For our 5-stage pipeline running 3 independent instructions:

total = 5 + (3 - 1) = 7 cycles

Each hazard fix changes the number of instructions (nops removed) and the number of stall bubbles, and therefore the cycle count. The lecture's running example illustrates the progression for the same dependent add:

Version	Instructions (incl. nops)	Cycles
Starter (3 nops)	2 producers + 3 nops + add = 6	10
Inverted clock (2 nops)	2 producers + 2 nops + add = 5	9
Forwarding (0 nops)	2 producers + add = 3	7

Each stall bubble adds exactly one cycle. A flush adds no cycles of its own: it turns an already-fetched wrong-path instruction into a bubble, so the cost of a jump or taken branch is the two fetch slots that the redirect from EX has already wasted.
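
The cycle-count formula and the table above can be reproduced with a one-line function. In this Python sketch the function name and the `bubbles` parameter (the number of one-cycle stalls) are assumptions added for illustration.

```python
def total_cycles(m, k=5, bubbles=0):
    """Cycles to run m instructions (nops included) on a k-stage pipeline."""
    return k + (m - 1) + bubbles


versions = {
    "Starter (3 nops)": total_cycles(6),
    "Inverted clock (2 nops)": total_cycles(5),
    "Forwarding (0 nops)": total_cycles(3),
    "ld + dependent addi (1 stall)": total_cycles(2, bubbles=1),
}
for name, cycles in versions.items():
    print(f"{name}: {cycles} cycles")
```

The first three results (10, 9, 7) match the table. The last entry applies the same formula to the load-use pair from Section 7: two instructions plus one stall bubble take 7 cycles.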

Key Concepts¶

Concept	Definition	Example
Pipelining	Overlapping instruction stages to raise throughput	5 instructions in flight, one finishes per cycle
Pipeline register	Register between stages that holds intermediate values	IF/DR, DR/EX, EX/MEM, MEM/WB
Data hazard (RAW)	Instruction reads a register an earlier in-flight instruction writes	`add a0,a1,a2` after `addi a1,...`
Clock inversion	Write RegFile on 1st half, read on 2nd half of the cycle	Removes one `nop` (10 → 9 cycles)
Forwarding	Routing a result from EX/MEM or MEM/WB back to EX	`FRD0 = 2` selects `ALUR_3`
Forwarding priority	Closest (most recent) producer wins	check `WR_3` before `WR_4`
Load-use hazard	Load result needed by the immediately following instruction	`ld a2,(a0)` then `addi a0,a2,1`
Stall	Freeze front of pipeline (`EN=0`) and inject a bubble (`CLR=1`)	one-cycle stall before forwarding the load
Control hazard	Jump/branch changes PC after wrong-path instrs fetched	`jal foo` then wrong-path `unimp`
Flush	Clear a pipeline register to a bubble	`IF_DR_CLR=1`, `DR_EX_CLR=1` on a jump
Hazard Unit	Combinational block producing forward/stall/flush signals	inputs RR0, WR_3, RFW_3, MLD_3, PCbr_2

Practice Problems¶

Problem 1: Identify the hazard¶

Classify the hazard (if any) in each pair and name the cheapest fix.

# (a)
add  t0, t1, t2
sub  t3, t0, t4

# (b)
ld   t0, (a0)
add  t3, t0, t4

# (c)
beq  t0, t1, done
add  t2, t3, t4

Solution

- **(a)** RAW data hazard: `sub` reads `t0` produced by `add`. Cheapest fix: **forwarding** from EX/MEM to EX (`FRD0` selects `ALUR_3`). No stall needed.
- **(b)** Load-use data hazard: `add` reads `t0` loaded by `ld`, whose value is ready only in MEM. Cheapest fix: **one-cycle stall**, then forward from MEM/WB. Forwarding alone cannot fix it because the value does not exist early enough.
- **(c)** Control hazard: `beq` may redirect the PC, but `add` is fetched on the wrong path. Fix: update PC in EX and **flush** IF/DR and DR/EX if the branch is taken (`PCbr_2 == 1`).

Problem 2: Cycle count with clock inversion¶

The starter pipeline needs three nops between a producer and consumer. After inverting the RegFile clock it needs two. For the sequence below, how many cycles does it take with the inverted clock (two nops between the dependent instructions)?

addi a1, zero, 3
addi a2, zero, 4
nop
nop
add  a0, a1, a2

Solution

There are 5 instructions, 5-stage pipeline, no stalls:

total = k + (m - 1) = 5 + (5 - 1) = 9 cycles

This matches the lecture's "10 cycles → 9 cycles" result: inverting the clock removed one `nop`, dropping the count from 10 to 9.

Problem 3: Forwarding selector values¶

Given the snapshot below, what are FRD0 and FRD1 for the instruction in EX? Use the project's selector encoding (0 = RegFile, 1 = MEM/WB result, 2 = EX/MEM result).

Instruction in EX  : add a0, a1, a2     (RR0_2 = a1, RR1_2 = a2)
Instruction in EX/MEM (stage 3): addi a2, zero, 4   (WR_3 = a2, RFW_3 = 1)
Instruction in MEM/WB (stage 4): addi a1, zero, 3   (WR_4 = a1, RFW_4 = 1)

Solution

Apply the Hazard Unit logic to each source:

- **FRD0** (source `a1`): `RR0_2 == WR_3`? `a1 == a2`? No. `RR0_2 == WR_4 && RFW_4`? `a1 == a1` and `RFW_4 == 1`? Yes → `FRD0 = 1` (forward `MR_4`).
- **FRD1** (source `a2`): `RR1_2 == WR_3 && RFW_3`? `a2 == a2` and `RFW_3 == 1`? Yes → `FRD1 = 2` (forward `ALUR_3`).

So `FRD0 = 1`, `FRD1 = 2`. The `add` gets `a1 = 3` from MEM/WB and `a2 = 4` from EX/MEM, computing `a0 = 7` with no `nop`s.

Problem 4: Why priority matters¶

Show that swapping the order of the two if tests in the forwarding logic gives the wrong answer for the program below.

addi a0, zero, 3
addi a0, zero, 4
add  a0, a0, a0     # correct result: 8

Solution

When `add` is in EX:

- The second `addi a0,...,4` is in **EX/MEM** (stage 3, the *newer* value 4).
- The first `addi a0,...,3` is in **MEM/WB** (stage 4, the *older* value 3).

**Correct logic** checks stage 3 first:

if ((RR0_2 == WR_3) && RFW_3) FRD0 = 2;   // forwards 4  (correct)

Result: `4 + 4 = 8`. Correct.

**Swapped logic** checks stage 4 first:

if ((RR0_2 == WR_4) && RFW_4) FRD0 = 1;   // forwards 3  (WRONG)

Result: `3 + 3 = 6`. Wrong. The closest producer holds the most recent write, so its test must come first.

Problem 5: Detecting a load stall¶

Write the boolean condition under which the Hazard Unit should stall, and state exactly which pipeline-register control signals it sets.

Solution

Stall condition (the EX/MEM instruction is a load writing a register that the instruction right behind it, held in DR/EX, reads):

stall = RFW_3 && MLD_3 && ((RR0_2 == WR_3) || (RR1_2 == WR_3));

When `stall` is true:

PC_EN      = 0     // freeze the PC
IF_DR_EN   = 0     // freeze IF/DR register
DR_EX_EN   = 0     // freeze DR/EX register
EX_MEM_CLR = 1     // inject a bubble into EX/MEM

When false, the Hazard Unit passes through the original `EN_ORG` / `CLR_ORG` values so normal operation (and single-step debugging) is unaffected.

Problem 6: Control-hazard flush count¶

For the program below, how many fetched instructions must be flushed when jal redirects the PC from the EX stage, and which pipeline registers are cleared?

main:
    jal foo
    unimp          # wrong path
foo:
    addi a0, a0, 4

Solution

When `jal` reaches EX and asserts `PCbr_2 == 1`, two instructions have already been fetched on the wrong (sequential) path: one in the **DR/EX** register and one in the **IF/DR** register. Both are flushed:

IF_DR_CLR = 1;   // flush instruction in IF/DR
DR_EX_CLR = 1;   // flush instruction in DR/EX

So **two** instructions are flushed (turned into bubbles), and the correct target `foo` is fetched right after the PC is redirected. No `nop`s are needed in the source program.

Summary¶

A 5-stage pipeline (IF, DR, EX, MEM, WB) overlaps instructions to raise throughput; pipeline registers carry intermediate values between stages.
Hazards arise because of the overlap: a data hazard reads a not-yet-written value, a control hazard fetches wrong-path instructions after a jump/branch.
There are four modifications to remove hand-inserted nops: invert the RegFile clock, add forwarding, add load stalling, and add control-hazard flushing — implemented in that order.
Inverting the RegFile clock lets WB write in the first half of a cycle and DR read in the second half, removing one nop and dropping the example from 10 to 9 cycles.
Forwarding routes a result from the EX/MEM (ALUR_3) or MEM/WB (MR_4) stage back into the EX-stage ALU inputs via FRD0/FRD1 MUXes, removing the remaining nops; the closest producer takes priority.
Load-use hazards cannot be forwarded because the load's value is ready only in MEM, so the Hazard Unit stalls one cycle by freezing PC_EN, IF_DR_EN, DR_EX_EN and injecting a bubble with EX_MEM_CLR.
Control hazards are resolved by updating the PC early (from the EX ALU result, selected by PCbr_2) and flushing the wrong-path instructions in IF/DR and DR/EX via IF_DR_CLR and DR_EX_CLR.
The forwarding, stalling, and flushing logic all lives in a single Hazard Unit that observes register numbers and control lines (RR0, RR1, WR_3, WR_4, RFW, MLD, PCbr_2) and must preserve the original EN/CLR behavior when no hazard is present.
