Processor Pipelining¶
Overview¶
This lecture introduces the semester's final major architecture topic: increasing processor throughput by overlapping the execution of multiple instructions instead of just trying to make a single instruction run faster. We motivate pipelining with a laundry analogy, work through the arithmetic that shows why pipelining approaches an n-times speedup, then map the idea onto a real RISC-V datapath split into five stages (IF, DR, EX, MEM, WB) separated by pipeline registers. Finally we name the two problems that keep a real pipeline from reaching its theoretical speedup — data hazards and control hazards — which become the core focus of Project 7.
Learning Objectives¶
- Distinguish single-cycle, multi-cycle, and pipelined processor organizations
- Explain the difference between latency (time for one task) and throughput (tasks completed per unit time)
- Use the laundry analogy to reason about overlapping stages of work
- Derive the pipeline timing formula and compute total time for N tasks through a k-stage pipeline
- Identify the five RISC-V pipeline stages (IF, DR, EX, MEM, WB) and what each one does
- Explain why pipeline registers are needed between stages and what they hold
- Recognize data hazards and control hazards and why they prevent the ideal speedup
- Draw a pipeline execution diagram showing which stage each instruction occupies in each clock cycle
Prerequisites¶
- Single-cycle RISC-V processor datapath: PC, instruction memory, register file, ALU, data memory (Project 6, Lab 09–11)
- Combinational vs. sequential logic and how registers hold state across a clock edge (Lab 09)
- The RISC-V instruction formats and control signals from the instruction decoder (instruction-ROM decoding lecture)
- Basic RISC-V assembly:
addi,add,ld,jal, branches (RISC-V assembly lectures) - The branch unit and PC-update logic for jumps and branches (Processor Design Part 3: /guides/processor-part-3/)
1. Three Ways to Organize a Processor¶
So far in the course we built a single-cycle processor: every instruction completes in exactly one clock cycle. That is conceptually simple, but the clock period has to be long enough for the slowest instruction to finish all of its work (fetch, decode, register read, ALU, memory access, write back) within one cycle. The fast instructions are forced to wait around for the slow ones.
There are three classic organizations:
| Organization | Cycles per instruction | Clock period | Key idea |
|---|---|---|---|
| Single-cycle | 1 | Long (set by slowest instruction) | Everything happens in one cycle |
| Multi-cycle | Several (varies) | Short | Break work into steps; reuse hardware across cycles |
| Pipelined | ~1 (in steady state) | Short | Overlap steps of different instructions |
A multi-cycle processor breaks the work of an instruction into several smaller steps and uses a shorter clock, spending a different number of cycles per instruction. That lets the clock be faster, but only one instruction is in flight at a time, so the hardware that is not currently busy sits idle.
A pipelined processor takes the same small steps but runs them on different instructions at the same time. While instruction 1 is in its memory step, instruction 2 can be in its execute step, instruction 3 in its decode step, and so on. Every piece of hardware stays busy every cycle. This is the big idea of this lecture.
Key distinction: pipelining does not make a single instruction finish faster. The latency of one instruction is unchanged (it still passes through every stage). What improves is throughput — how many instructions finish per unit time.
2. The Laundry Analogy¶
Pipelining is exactly the same idea as an assembly line, and the cleanest everyday example is doing laundry. Suppose every load of laundry goes through four steps, and assume each step takes 30 minutes:
Laundry Steps
1) Wash [W]
2) Dry [D]
3) Fold [F]
4) Put away [P]
Assume: each step takes 30 minutes.
One load (latency)¶
A single load passes through all four steps back-to-back:
One load takes 120 minutes = 2 hours. This is the latency of a load, and pipelining will never make it smaller — the clothes still have to be washed, then dried, then folded, then put away, in order.
Two loads, done serially (the naive way)¶
If you insist on completely finishing one load before starting the next, two loads take twice as long:
This is wasteful. While load 1 is in the dryer, the washer is sitting empty. We could be washing load 2 already.
The laundry pipeline¶
The insight: as soon as load 1 finishes washing and moves to the dryer, start washing load 2. Each resource (washer, dryer, folding table, drawer) is a stage, and we keep all of them busy. Drawing time left-to-right:
time --->
30 30 30 30 30 30 30
1st load [W] [D] [F] [P] done @ 120
2nd load [W] [D] [F] [P] done @ 150
3rd load [W] [D] [F] [P] done @ 180
4th load [W] [D] [F] [P] done @ 210
gantt
title Laundry Pipeline (each step = 30 min)
dateFormat X
axisFormat %s
section Load 1
Wash :0, 30
Dry :30, 60
Fold :60, 90
PutAway :90, 120
section Load 2
Wash :30, 60
Dry :60, 90
Fold :90, 120
PutAway :120, 150
section Load 3
Wash :60, 90
Dry :90, 120
Fold :120, 150
PutAway :150, 180
section Load 4
Wash :90, 120
Dry :120, 150
Fold :150, 180
PutAway :180, 210
Reading the finish times:
| Loads | Pipelined time |
|---|---|
| 1 load | 120 min (2.0 hr) |
| 2 loads | 150 min (2.5 hr) |
| 4 loads | 210 min (3.5 hr) |
Notice the pattern: the first load still takes the full 120 minutes (its latency is unchanged), but every additional load finishes just 30 minutes later — one stage time apart. Compare with the serial version, where four loads took 240 minutes. The pipeline did the same four loads in 210 minutes, and the advantage grows quickly as the number of loads grows.
3. Pipeline Timing and Speedup¶
Let us make the timing precise. Let:
k= number of stages (herek = 4)t= time per stage (heret = 30min = 0.5 hr)N= number of tasks (loads)
The first task needs all k stages to complete, taking k · t. After that, one new task finishes every t (one stage time), and there are N − 1 remaining tasks. So:
This is exactly the formula from the reference guide: Total = (stages × stage_time) + ((tasks − 1) × stage_time).
Worked example: 100 loads¶
Serial (no pipelining): each load is fully independent and takes 2 hours.
Pipelined: first load takes the full 2 hours, then 99 more loads each finish 0.5 hr apart.
So 100 loads drop from 200 hours to 51.5 hours — roughly a 4× speedup, which is the number of stages.
Worked example: 1000 loads¶
Serial would be 1000 × 2 = 2000 hrs, so the pipeline gives about 2000 / 501.5 ≈ 3.99×. As N grows large, the constant k·t startup cost (the "fill" of the pipeline) becomes negligible and the speedup approaches the number of stages.
In principle: an n-stage pipeline can speed up execution by a factor approaching n. (More carefully: the throughput improves by up to n×; the latency of a single task does not change.) In the instructor's shorthand, each stage now does
1/nof the work per clock, so the clock can be roughlyntimes faster.
flowchart LR
A["1 load<br/>latency = 120 min"] --> B["Throughput goal:<br/>1 finished load<br/>every 30 min"]
B --> C["N loads ≈ N × 30 min<br/>(for large N)"]
style B fill:#f9f,stroke:#333,stroke-width:2px
4. Mapping Pipelining onto the Processor¶
A processor instruction goes through the same kind of repeated steps as a load of laundry. We take the single-cycle datapath and partition it into stages, inserting a register between each pair of stages to hold the intermediate results. The standard classic RISC pipeline has five stages:
flowchart LR
IF["IF<br/>Instruction Fetch"] --> DR["DR<br/>Decode +<br/>RegFile Read"]
DR --> EX["EX<br/>Execute"]
EX --> MEM["MEM<br/>Memory<br/>load/store"]
MEM --> WB["WB<br/>Write Back<br/>to RegFile"]
style IF fill:#bbf,stroke:#333,stroke-width:2px
style EX fill:#f9f,stroke:#333,stroke-width:2px
| Stage | Name | What happens | Major hardware used |
|---|---|---|---|
| IF | Instruction Fetch | Use PC to read the instruction word from instruction memory; compute PC + 4 |
PC, Instruction Memory |
| DR | Decode + Register File Read | Decode the instruction word into control signals and immediates; read RD0/RD1 from the register file |
Instruction Decoder, Reg Decoder, Imm Decoder, Register File |
| EX | Execute | Perform the ALU operation; compute branch target / branch condition in the Branch Unit | ALU, Branch Unit (BU) |
| MEM | Memory | For loads/stores, read or write data memory | Data Memory (RAM) |
| WB | Write Back | Write the ALU result or loaded value back into the register file | Register File (write port) |
The hand-drawn datapath from lecture, redrawn, with the blue vertical lines marking the stage boundaries:
IF DR EX MEM WB
| | | | |
+----+ +------+ | +--------+ | +-----+ | +-----+ |
| PC |->| Inst |---+->| Inst |--\ | | ALU |-\ | | MEM | |
+----+ | Mem | | | Decode | \ | +-----+ \| +-----+ |
+------+ | +--------+ >--+-> / |
| +--------+ / | +-----+ (write
| | Reg |--/ | | BU | back to
| | Decode | | +-----+ RegFile)
| +--------+ | |
| +--------+ | |
| | Imm |-------+------------------------+
| | Decode | | |
| +--------+ | |
| +--------+ | |
+->| Reg |<------+------------------------+
| File |
+--------+
^ ^ ^ ^ ^
| pipeline | pipeline | pipeline | pipeline |
| reg | reg | reg | reg |
IF/DR DR/EX EX/MEM MEM/WB
The wires that loop back along the bottom carry control signals and operand values forward through every later stage. A key example mentioned in lecture: PC + 4 is carried all the way down the pipeline so that jump-and-link (jal/jalr) can write the correct return address (PC + 4) into the register file in the WB stage, even though the PC has already moved on by then.
Single-cycle vs pipelined datapath¶
The combinational logic is essentially the same as the single-cycle processor. We do not add new functional units — we add pipeline registers that chop the long single-cycle path into five short ones, so a faster clock can drive each short segment.
5. Pipeline Registers¶
In a single-cycle processor a value computed early (say, the decoded RD0 in DR) is consumed later in the same cycle, so it just flows down a wire. In a pipeline, the consuming stage runs in a later clock cycle, while the producing stage has already moved on to the next instruction. We need somewhere to store the intermediate values until the next stage is ready.
That is the job of pipeline registers (also called pipeline latches). They sit between adjacent stages:
On each clock edge, every pipeline register captures its stage's outputs and presents them to the next stage. Things that must be carried through the registers include:
- The instruction's data values: register reads
RD0/RD1, the immediate, the ALU result, the loaded memory value. - The instruction's control signals:
ALUOp, write-enable for the register file, the destination register numberRD, memory load/store lines,WDsel, etc. Each control line is produced once in DR and must be carried forward to the stage that actually uses it. - The carried
PC + 4forjal/jalr.
Why control signals travel with the instruction¶
This was the part the instructor stressed about the Project 7 decoder refactor. The decoder now works in two parts:
- Instruction word → INUM (an instruction number identifying which instruction it is).
- A standard ROM (populated from a spreadsheet) that maps the INUM to the full set of control signals.
Those control signals are generated in DR but consumed across EX, MEM, and WB. They cannot be regenerated later (the instruction word is gone), so they ride along inside the pipeline registers, one set per instruction. To make debugging possible, the implementation keeps multiple "versions in flight" — copies of the instruction visible at each stage — so you can see exactly which instruction occupies each stage at any clock.
Flushing (clearing) pipeline registers¶
A pipeline register has a clear (CLR) input just like any register. Clearing a pipeline register turns the instruction in that stage into a no-op (it injects zeros / a bubble). This is exactly how we will cancel instructions that were fetched by mistake — for example, the instructions fetched right after a taken branch before we realized the branch changes the PC. Flushing is the basic mechanism for handling control hazards, covered in detail after the break.
6. Instructions Flowing Through the Pipeline¶
Here is the central picture: a stream of instructions, each one stage behind the previous, so that in steady state all five stages are busy with five different instructions. Consider this sequence:
The pipeline execution diagram (each column is one clock cycle):
cycle: 1 2 3 4 5 6 7
addi a0, zero, 1 [IF] [DR] [EX] [MEM] [WB]
addi a1, zero, 2 [IF] [DR] [EX] [MEM] [WB]
add a2, a0, a1 [IF] [DR] [EX] [MEM] [WB]
... [IF] [DR] [EX] ...
flowchart TD
subgraph C1["Cycle 1"]
A1["addi a0: IF"]
end
subgraph C2["Cycle 2"]
A2["addi a0: DR"]
B2["addi a1: IF"]
end
subgraph C3["Cycle 3"]
A3["addi a0: EX"]
B3["addi a1: DR"]
C3i["add a2: IF"]
end
subgraph C4["Cycle 4"]
A4["addi a0: MEM"]
B4["addi a1: EX"]
C4i["add a2: DR"]
end
subgraph C5["Cycle 5 (pipeline full)"]
A5["addi a0: WB"]
B5["addi a1: MEM"]
C5i["add a2: EX"]
end
C1 --> C2 --> C3 --> C4 --> C5
Reading down a column tells you which instructions are simultaneously in flight in one clock cycle. Reading across a row tells you how a single instruction marches through the five stages, one per cycle. After the pipeline "fills" (cycle 5 here), every cycle completes one instruction — that is the throughput win.
But look closely at the third instruction:
add a2, a0, a1 reaches DR (where it reads a0 and a1 from the register file) in cycle 3. But addi a0 does not write a0 until its WB stage in cycle 5, and addi a1 writes a1 in cycle 6. The add would read stale values for a0 and a1. That is a hazard — the overlap that gives us throughput has created a correctness problem.
7. Pipeline Hazards¶
A hazard is any situation where the next instruction cannot execute in the following clock cycle as drawn, because the overlap violates a dependency. There are two kinds we care about in this course.
flowchart TD
H["Pipeline Hazards"] --> D["Data Hazards"]
H --> C["Control Hazards<br/>(jumps and branches)"]
style D fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#f9f,stroke:#333,stroke-width:2px
Data hazards¶
A data hazard occurs when an instruction needs a result that an earlier, still-in-flight instruction has not yet written back. The add a2, a0, a1 example above is exactly this: the producer is still in the pipeline when the consumer tries to read its operands.
The simplest (but slow) fix is to insert no-op instructions ("bubbles") between the producer and consumer so that by the time the consumer reaches DR, the producer has reached WB:
addi a0, zero, 1
addi a1, zero, 2
nop # buy time so a0/a1 are written before...
nop
add a2, a0, a1 # ...this reads them
The instructor demonstrated this as a temporary correctness crutch — it works but wastes cycles. Project 7 will replace it with better techniques (covered after the break):
- Inverting the clock to the register file so the WB write and the DR read happen in the same cycle (write in the first half, read in the second). This eliminates one no-op.
- Forwarding (bypassing): route a result directly from a later stage (EX/MEM or MEM/WB) back to the ALU input of a waiting instruction, instead of waiting for it to reach the register file. This removes most data-hazard stalls.
- Load stalling (ld-stall): a load's value is not available until after MEM, so an instruction that uses a load result one cycle later genuinely cannot be forwarded in time; a hazard detection unit stalls the pipeline for one cycle. The detection logic gates the PC enable, e.g.
stl_en = ld_stall OR en.
Control hazards¶
A control hazard is caused by jumps and branches. When we fetch a branch, we do not know whether it is taken (and therefore what the next PC should be) until the branch resolves a couple of stages later. Meanwhile, the pipeline has already fetched the next one or two sequential instructions that may be on the wrong path.
The temporary fix shown in lecture is, again, to pad with no-ops after a jump so the jump's new PC takes effect before any real following instruction is fetched. The real fix is to flush (clear) the pipeline registers holding the wrongly-fetched instructions, turning them into bubbles, once we know the branch was taken. Recall from Section 5 that clearing a pipeline register is exactly how we inject a no-op.
Both hazard types are why a real pipeline does not quite reach the ideal n× speedup: occasional bubbles and stalls cost cycles. Minimizing them with forwarding, clock inversion, hazard detection, and flushing is the heart of Project 7.
8. Putting It Together: Project 7¶
Project 7 extends a base processor (in contrast to Project 6, where you built the processor from scratch). You take the single-cycle datapath and:
- Partition it into the five stages IF, DR, EX, MEM, WB.
- Insert pipeline registers (IF/DR, DR/EX, EX/MEM, MEM/WB) that carry both data and control signals (plus
PC + 4). - Refactor the instruction decoder into "instruction word → INUM" and "INUM → control signals (ROM)", keeping per-stage "versions in flight" for visibility.
- Get a basic pipeline running correctly first (using no-ops where needed as a stopgap).
- Then attack hazards: clock inversion, forwarding, and ld-stall hazard detection for data hazards; flushing for control hazards.
flowchart LR
P6["Project 6:<br/>single-cycle<br/>(from scratch)"] --> P7["Project 7:<br/>pipeline + hazard<br/>handling (extend)"]
style P7 fill:#bbf,stroke:#333,stroke-width:2px
The full run of the Project 6 tests is available as extra credit, and remaining work on prior projects can still earn points before the final-exam deadline. Lab 11 will provide practice problems for the upcoming exam.
Key Concepts¶
| Concept | Definition | Example |
|---|---|---|
| Latency | Time to complete one task from start to finish | One laundry load = 120 min; one instruction = 5 stages |
| Throughput | Number of tasks completed per unit time | 1 finished instruction per cycle in steady state |
| Single-cycle processor | One instruction per clock; clock set by slowest instruction | Project 6 datapath |
| Multi-cycle processor | Each instruction takes several short cycles | Reuses hardware across cycles |
| Pipelined processor | Overlaps stages of different instructions | Project 7 datapath |
| Pipeline stage | One step of instruction processing | IF, DR, EX, MEM, WB |
| Pipeline register | Register between stages holding data + control signals | IF/DR, DR/EX, EX/MEM, MEM/WB |
| Flush | Clearing a pipeline register to inject a no-op (bubble) | Cancel instructions after a taken branch |
| Data hazard | Instruction needs a result not yet written back | add a2,a0,a1 after addi a0,... |
| Control hazard | Next PC unknown until a branch/jump resolves | Branches and jumps |
| Forwarding | Routing a result directly to a later stage's input | EX/MEM result → ALU input |
| ld-stall | One-cycle stall when a load result is used immediately | ld t0,(a0) then add t1,t0,... |
Practice Problems¶
Problem 1: Latency vs. throughput¶
You have a 5-stage pipeline where each stage takes 2 ns. (a) What is the latency of a single instruction? (b) In steady state, how often does an instruction complete? (c) Did pipelining make a single instruction faster?
Click to reveal solution
(a) **Latency = 5 stages × 2 ns = 10 ns.** One instruction must still pass through all five stages. (b) In steady state (pipeline full) one instruction completes **every stage time = every 2 ns** — the throughput. (c) **No.** The latency of a single instruction is unchanged (still 10 ns). Pipelining improves *throughput*, not single-instruction latency. The win comes from overlapping many instructions.Problem 2: Laundry timing formula¶
Using the laundry pipeline (4 stages, 30 min each), compute the total time for 6 loads with the formula Total = k·t + (N − 1)·t.
Click to reveal solution
Serial would be `6 × 120 = 720` min (12 hr), so the pipeline is a big improvement. The first load takes the full 120 min; each of the remaining 5 finishes 30 min apart.Problem 3: Speedup for large N¶
For the 4-stage, 0.5-hr-per-stage laundry pipeline, compute the pipelined time for 500 loads and the speedup over serial. What value does the speedup approach as N → ∞?
Click to reveal solution
As `N → ∞`, the constant fill cost `k·t = 2` hr becomes negligible compared to `(N−1)·t`, so the speedup approaches the **number of stages, k = 4**. An *n*-stage pipeline approaches an *n*× throughput improvement.Problem 4: Draw the pipeline diagram¶
Draw a pipeline execution diagram (cycles across the top) for the three instructions below, and identify the cycle in which the pipeline is first "full" (all five stages busy).
Click to reveal solution
With only three instructions the pipeline never reaches all five stages busy at once (you would need at least five overlapping instructions). The diagram does show overlap: in cycle 3, three different instructions occupy IF, DR, and EX simultaneously. (With five or more independent instructions, the fifth cycle would have one instruction in each of IF/DR/EX/MEM/WB.)Problem 5: Spot the data hazard¶
In the sequence from Problem 4, add t2, t0, t1 reads t0 and t1 in its DR stage. In which cycle does add reach DR, and in which cycles do the two addi instructions write their results in WB? Is there a data hazard? Give a temporary fix.
Click to reveal solution
- `add t2, t0, t1` reaches **DR in cycle 4** (it is fetched in cycle 3). - `addi t0` writes `t0` in **WB in cycle 5**. - `addi t1` writes `t1` in **WB in cycle 6**. So `add` tries to read `t0`/`t1` in cycle 4, **before** either value has been written back (cycles 5 and 6). This is a **data hazard** — both operands are stale. Temporary fix — insert no-ops so the producers reach WB before the consumer reaches DR: The real Project 7 fixes are register-file clock inversion, forwarding, and (for loads) ld-stall detection.Problem 6: Why carry PC + 4 down the pipeline?¶
Explain why PC + 4 must be carried through the pipeline registers, using jal ra, label as an example.
Click to reveal solution
`jal ra, label` does two things: it sets `PC = label` (jump target), and it writes the **return address** `ra = PC + 4` into the register file. The write to `ra` happens in the **WB** stage, several cycles after the instruction was fetched in IF. By then the PC has already advanced (and been replaced by the jump target and subsequent fetches), so the original `PC + 4` is no longer available from the PC hardware. Therefore `PC + 4` is computed in IF and **stored in the pipeline registers**, riding along with the `jal` instruction down to WB, where it is written into `ra`. The same carried value supports `jalr`. This is also why each pipeline register must hold not just data operands but the bookkeeping values an instruction will need in a later stage.Further Reading¶
- Source lecture notes (handwritten): /notes/CS315-01 2025-11-20 Processor Pipelining.pdf
- Processor Design Part 3 (branch unit, data memory, debugging): /guides/processor-part-3/
- Course key concepts (pipelining and hazards section): /guides/key-concepts-all/
- Instruction pipelining (Wikipedia)
- Classic RISC pipeline (Wikipedia)
- Hazard (computer architecture) (Wikipedia)
- Patterson & Hennessy, Computer Organization and Design (RISC-V Edition), Chapter 4: The Processor (Pipelining)
Summary¶
-
Three processor organizations: single-cycle (one long cycle per instruction), multi-cycle (several short cycles, hardware reused), and pipelined (short cycles, stages of different instructions overlapped). Pipelining keeps every functional unit busy every cycle.
-
Latency vs. throughput: pipelining does not reduce the time for a single task (latency); it increases how many tasks finish per unit time (throughput). One laundry load still takes 2 hours.
-
The laundry analogy shows the idea: with 4 stages of 30 min each, the first load takes 120 min, but every additional load finishes just 30 min later because resources overlap.
-
Pipeline timing formula:
Total = k·t + (N − 1)·t. For largeNthe speedup of ak-stage pipeline approaches k× (e.g., 100 loads: 200 hr serial → 51.5 hr pipelined). -
Five RISC-V stages: IF (instruction fetch), DR (decode + register read), EX (execute / ALU + branch unit), MEM (memory load/store), WB (write back to register file).
-
Pipeline registers between every pair of stages hold the in-flight instruction's data values and control signals (and
PC + 4forjal/jalr). Clearing a pipeline register flushes that stage into a no-op. -
Hazards keep a real pipeline below the ideal speedup. Data hazards arise when an instruction needs a result not yet written back; control hazards arise from jumps and branches changing the PC. No-ops are a temporary fix; Project 7 uses clock inversion, forwarding, ld-stall detection, and flushing.
-
Project 7 extends the base single-cycle processor into a pipelined one with hazard handling, refactoring the decoder into instruction-word→INUM and INUM→control-signal ROM, with per-stage "versions in flight" for debugging.