← Back to Course
# Processor Pipelining ## CS 315 Computer Architecture --- ## Learning Objectives - Distinguish **single-cycle**, **multi-cycle**, and **pipelined** processors - Explain **latency** vs. **throughput** - Use the laundry analogy to reason about overlapping stages - Derive and apply the pipeline timing formula - Identify the five RISC-V pipeline stages: **IF, DR, EX, MEM, WB** - Explain why **pipeline registers** are needed between stages - Recognize **data hazards** and **control hazards** - Draw a pipeline execution diagram --- ## Three Processor Organizations | Organization | Cycles/Instruction | Clock Period | Key Idea | |---|---|---|---| | **Single-cycle** | 1 | Long (slowest insn) | Everything in one cycle | | **Multi-cycle** | Several (varies) | Short | Steps reuse hardware | | **Pipelined** | ~1 (steady state) | Short | Overlap different instructions |
Pipelining does
not
make one instruction faster — it improves
throughput
.
--- ## Latency vs. Throughput
**Latency** - Time to complete *one* task end-to-end - Pipelining does **not** reduce latency - One instruction still passes through all 5 stages
**Throughput** - Tasks completed per unit time - Pipelining **does** improve throughput - Steady state: one instruction finishes per clock cycle
Key insight: keep every functional unit busy every cycle by overlapping stages of
different
instructions.
--- ## The Laundry Analogy Four stages, 30 min each: **Wash, Dry, Fold, Put away** **Serial:** finish one load completely before starting the next ```text [W][D][F][P] [W][D][F][P] 4 loads = 240 min ``` **Pipelined:** start the next load as soon as the washer is free ```text time --> 30 30 30 30 30 30 30 Load 1 [W] [D] [F] [P] Load 2 [W] [D] [F] [P] Load 3 [W] [D] [F] [P] Load 4 [W] [D] [F] [P] ``` 4 loads pipelined = **210 min** (vs 240 serial) — gap widens with more loads. --- ## Laundry Pipeline Timing Table | Loads | Serial | Pipelined | Speedup | |---|---|---|---| | 1 | 120 min | 120 min | 1.0x | | 2 | 240 min | 150 min | 1.6x | | 4 | 480 min | 210 min | 2.3x | | 100 | 12,000 min | 3,090 min | 3.9x |
First load: full 120 min. Every
additional
load: just 30 min more (one stage time).
--- ## Pipeline Timing Formula Let **k** = stages, **t** = time per stage, **N** = tasks: ```text Total = k*t + (N - 1)*t = (fill time) + (remaining tasks at one stage-time each) ``` **Example: 100 loads** (k = 4, t = 0.5 hr) ```text Serial: 100 x 2 hrs = 200 hrs Pipelined: 4 x 0.5 + 99 x 0.5 = 2 hrs + 49.5 hrs = 51.5 hrs Speedup: 200 / 51.5 = 3.88x (approaching k = 4) ```
As N → ∞, speedup →
k
(number of stages).
--- ## Five RISC-V Pipeline Stages
flowchart LR IF["IF\nInstruction Fetch"] --> DR["DR\nDecode +\nRegFile Read"] DR --> EX["EX\nExecute\nALU + BU"] EX --> MEM["MEM\nMemory\nLoad/Store"] MEM --> WB["WB\nWrite Back\nto RegFile"] style IF fill:#e8f5e9,stroke:#00543c,stroke-width:2px style DR fill:#e8f5e9,stroke:#00543c,stroke-width:2px style EX fill:#fff3cd,stroke:#FDBB30,stroke-width:2px style MEM fill:#e8f5e9,stroke:#00543c,stroke-width:2px style WB fill:#e8f5e9,stroke:#00543c,stroke-width:2px
--- ## What Each Stage Does | Stage | Name | Hardware Used | |---|---|---| | **IF** | Instruction Fetch | PC, Instruction Memory | | **DR** | Decode + RegFile Read | Inst Decoder, Imm Decoder, Register File | | **EX** | Execute | ALU, Branch Unit | | **MEM** | Memory Access | Data Memory (RAM) | | **WB** | Write Back | Register File (write port) | The combinational logic is essentially the **same as the single-cycle processor** — we add pipeline registers to chop the long path into five short ones. --- ## Pipeline Registers Values produced in one stage are consumed in a **later cycle** — a wire is not enough. ```text IF/DR DR/EX EX/MEM MEM/WB | | | | [IF] [DR] [EX] [MEM] [WB] ``` Each pipeline register carries forward: - **Data**: `RD0`, `RD1`, immediate, ALU result, loaded value - **Control signals**: `ALUOp`, write-enable, dest reg `RD`, `WDsel`, mem lines - **`PC + 4`** for `jal`/`jalr` return-address writes in WB --- ## Why Carry PC + 4 Down the Pipeline? `jal ra, label` must write **return address = PC + 4** into `ra` at WB. By the time `jal` reaches WB, the PC has already advanced to new addresses.
flowchart LR A["IF\ncompute PC+4"] --> B["IF/DR\nholds PC+4"] B --> C["DR/EX\nholds PC+4"] C --> D["EX/MEM\nholds PC+4"] D --> E["MEM/WB\nholds PC+4"] E --> F["WB\nwrite PC+4 to ra"] style F fill:#e8f5e9,stroke:#00543c,stroke-width:2px
`PC + 4` is computed once in IF and **rides the pipeline registers** all the way to WB. --- ## Control Signals Travel with the Instruction The instruction decoder works in two parts (Project 7 refactor): 1. **Instruction word → INUM** (which instruction is this?) 2. **INUM → control signals** via a ROM (spreadsheet-populated)
Control signals are generated in
DR
but consumed across
EX
,
MEM
, and
WB
. They ride the pipeline registers with the instruction.
The implementation also keeps per-stage **"versions in flight"** — one copy of the instruction visible at each stage — for debugging. --- ## Flushing Pipeline Registers Pipeline registers have a **CLR** (clear) input.
Clearing a pipeline register converts that in-flight instruction into a
no-op (bubble)
.
Used to cancel instructions fetched on the wrong path — for example, after a **taken branch**: ```text Branch taken in EX cycle 5 -> flush IF/DR and DR/EX registers -> inject two bubbles -> correct target enters IF next cycle ``` This is the hardware mechanism for handling **control hazards**. --- ## Pipeline Execution Diagram ```text cycle: 1 2 3 4 5 6 7 addi a0,z,1 [IF] [DR] [EX] [MEM] [WB] addi a1,z,2 [IF] [DR] [EX] [MEM] [WB] add a2,a0,a1 [IF] [DR] [EX] [MEM] [WB] ``` - Read **across** a row: one instruction marching through 5 stages - Read **down** a column: multiple instructions in flight at once - Pipeline reaches steady-state when all 5 stages are occupied (needs 5+ instructions) --- ## Pipeline Diagram: Cycle-by-Cycle View
flowchart TD subgraph C3["Cycle 3"] A3["addi a0: EX"] B3["addi a1: DR"] C3i["add a2: IF"] end subgraph C4["Cycle 4"] A4["addi a0: MEM"] B4["addi a1: EX"] C4i["add a2: DR"] end subgraph C5["Cycle 5"] A5["addi a0: WB"] B5["addi a1: MEM"] C5i["add a2: EX"] end C3 --> C4 --> C5
In cycle 5: three instructions simultaneously active. After cycle 5, one instruction completes every cycle. --- ## Two Kinds of Hazards
flowchart TD H["Pipeline Hazards"] --> D["Data Hazards\ninstruction needs a result\nnot yet written back"] H --> C["Control Hazards\njumps and branches\nchange the PC"] style D fill:#e8f5e9,stroke:#00543c,stroke-width:2px style C fill:#fff3cd,stroke:#FDBB30,stroke-width:2px
Both prevent a real pipeline from reaching the ideal k× speedup. --- ## Data Hazards ```text addi a0, zero, 1 # writes a0 in WB -- cycle 5 addi a1, zero, 2 # writes a1 in WB -- cycle 6 add a2, a0, a1 # reads a0, a1 in DR -- cycle 4 <- TOO EARLY ```
Data hazard:
the consumer reads operands before the producer has written them back. Result: stale (wrong) data.
`add` reaches DR in cycle 4, but `a0` and `a1` are not written until cycles 5 and 6. --- ## Temporary Fix: Insert No-ops Pad with `nop` instructions to let producers reach WB before the consumer reaches DR: ```text addi a0, zero, 1 addi a1, zero, 2 nop # stall: wait for a0 nop # stall: wait for a1 add a2, a0, a1 # now a0 and a1 are valid ``` This **works** but **wastes two cycles**. Project 7 replaces nops with smarter hardware. --- ## Better Data Hazard Solutions (Project 7) | Technique | Idea | Cycles Saved | |---|---|---| | **Clock inversion** | WB writes 1st half-cycle, DR reads 2nd half | 1 nop eliminated | | **Forwarding** | Route EX/MEM result directly to ALU input | Most stalls removed | | **ld-stall** | Load result unavailable until after MEM; stall 1 cycle then forward | 1 cycle stall for `ld` followed immediately by dependent insn |
Hazard detection unit gates the PC enable:
stl_en = ld_stall OR en
--- ## Forwarding (Bypassing) Without forwarding: wait until WB writes to register file, then DR reads. With forwarding: route the result **earlier**, directly to the ALU input. ```text addi a0, zero, 1 EX result available after cycle 3 add a2, a0, a1 needs a0 in EX -- cycle 5 Forward: EX/MEM pipeline register -> ALU input of add (no stall needed for this case) ```
Forwarding is the primary technique for eliminating data hazard stalls.
--- ## Control Hazards When a **branch or jump** is fetched, the pipeline does not yet know the next PC. ```text Cycle 1: beq a0, a1, target [IF] Cycle 2: next insn (maybe wrong path) [IF] <- fetched speculatively Cycle 3: next+1 insn (maybe wrong path) [IF] <- fetched speculatively beq resolves in EX -> branch taken! ``` The two speculatively-fetched instructions may be on the wrong path. **Fix:** flush (clear) those pipeline registers once the branch resolves — inject bubbles. --- ## Control Hazard: Temporary Fix Insert **nop** instructions after a jump so the jump's new PC is in effect before any real instruction follows: ```text jal ra, target nop # pipeline is still fetching from old PC nop # by now, PC = target, safe to fetch real instructions ``` The real fix (Project 7): detect the taken branch in EX, **flush** IF/DR and DR/EX registers, redirect PC. --- ## Hazard Summary | Hazard | Cause | Temporary Fix | Project 7 Fix | |---|---|---|---| | **Data** | Producer not yet in WB when consumer reaches DR | Insert `nop`s | Clock inversion, forwarding, ld-stall | | **Control** | Branch/jump destination unknown when pipeline fetches next insns | Insert `nop`s after branch | Flush wrong-path instructions; redirect PC | --- ## Project 7 Overview Extend the **single-cycle processor** into a pipelined one: 1. Partition datapath into IF, DR, EX, MEM, WB 2. Insert pipeline registers (IF/DR, DR/EX, EX/MEM, MEM/WB) carrying data + control + PC+4 3. Refactor decoder: insn word → INUM → control ROM; keep per-stage "versions in flight" 4. Get basic pipeline working (using nops as needed) 5. Then add hazard handling: clock inversion, forwarding, ld-stall, flushing
flowchart LR P6["Project 6\nsingle-cycle\n(built from scratch)"] --> P7["Project 7\npipelined + hazard\nhandling (extend)"] style P7 fill:#e8f5e9,stroke:#00543c,stroke-width:2px
--- ## Key Concepts Reference | Concept | Definition | |---|---| | **Latency** | Time for one task end-to-end (unchanged by pipelining) | | **Throughput** | Tasks per unit time (improved by pipelining) | | **Pipeline register** | Stores data + control signals between stages | | **Flush** | Clear a pipeline register to inject a no-op (bubble) | | **Data hazard** | Consumer needs a result not yet written back | | **Control hazard** | Next PC unknown until branch/jump resolves | | **Forwarding** | Route result directly to later stage input | | **ld-stall** | 1-cycle stall when load result used immediately | --- ## Summary 1. **Three organizations:** single-cycle (long clock), multi-cycle (reuse hardware), pipelined (overlap stages of different instructions). 2. **Latency vs. throughput:** pipelining leaves single-instruction latency unchanged; it improves throughput toward one instruction per clock. 3. **Formula:** `Total = k*t + (N-1)*t`; speedup approaches **k** as N grows. 4. **Five stages:** IF, DR, EX, MEM, WB — same hardware as single-cycle, chopped by pipeline registers. 5. **Pipeline registers** carry data, control signals, and PC+4 forward each cycle; clearing one injects a bubble. 6. **Data hazards** (stale reads) and **control hazards** (wrong-path fetches) limit real speedup; Project 7 fixes them with forwarding, clock inversion, ld-stall detection, and flushing.