Processor Pipelining

# Processor Pipelining

## CS 315 Computer Architecture

---

## Learning Objectives

- Distinguish **single-cycle**, **multi-cycle**, and **pipelined** processors
- Explain **latency** vs. **throughput**
- Use the laundry analogy to reason about overlapping stages
- Derive and apply the pipeline timing formula
- Identify the five RISC-V pipeline stages: **IF, DR, EX, MEM, WB**
- Explain why **pipeline registers** are needed between stages
- Recognize **data hazards** and **control hazards**
- Draw a pipeline execution diagram

---

## Three Processor Organizations

| Organization | Cycles/Instruction | Clock Period | Key Idea |
|---|---|---|---|
| **Single-cycle** | 1 | Long (slowest insn) | Everything in one cycle |
| **Multi-cycle** | Several (varies) | Short | Steps reuse hardware |
| **Pipelined** | ~1 (steady state) | Short | Overlap different instructions |

<div class="highlight-box">
Pipelining does <strong>not</strong> make one instruction faster — it improves <strong>throughput</strong>.
</div>

---

## Latency vs. Throughput

**Latency**
- Time to complete *one* task end-to-end
- Pipelining does **not** reduce latency
- One instruction still passes through all 5 stages

</div>
<div>

**Throughput**
- Tasks completed per unit time
- Pipelining **does** improve throughput
- Steady state: one instruction finishes per clock cycle

</div>
</div>

<div class="info-box">
Key insight: keep every functional unit busy every cycle by overlapping stages of <em>different</em> instructions.
</div>

---

## The Laundry Analogy

Four stages, 30 min each: **Wash, Dry, Fold, Put away**

**Serial:** finish one load completely before starting the next
```text
[W][D][F][P]  [W][D][F][P]    4 loads = 240 min
```

**Pipelined:** start the next load as soon as the washer is free
```text
time -->   30   30   30   30   30   30   30
Load 1    [W]  [D]  [F]  [P]
Load 2         [W]  [D]  [F]  [P]
Load 3               [W]  [D]  [F]  [P]
Load 4                    [W]  [D]  [F]  [P]
```

4 loads pipelined = **210 min** (vs 240 serial) — gap widens with more loads.

---

## Laundry Pipeline Timing Table

| Loads | Serial | Pipelined | Speedup |
|---|---|---|---|
| 1 | 120 min | 120 min | 1.0x |
| 2 | 240 min | 150 min | 1.6x |
| 4 | 480 min | 210 min | 2.3x |
| 100 | 12,000 min | 3,090 min | 3.9x |

<div class="info-box">
First load: full 120 min. Every <em>additional</em> load: just 30 min more (one stage time).
</div>

---

## Pipeline Timing Formula

Let **k** = stages, **t** = time per stage, **N** = tasks:

```text
Total = k*t  +  (N - 1)*t
      = (fill time) + (remaining tasks at one stage-time each)
```

**Example: 100 loads** (k = 4, t = 0.5 hr)

```text
Serial:    100 x 2 hrs = 200 hrs

Pipelined: 4 x 0.5  +  99 x 0.5
         = 2 hrs    +  49.5 hrs
         = 51.5 hrs

Speedup:   200 / 51.5 = 3.88x  (approaching k = 4)
```

<div class="highlight-box">
As N → ∞, speedup → <strong>k</strong> (number of stages).
</div>

---

## Five RISC-V Pipeline Stages

<div class="mermaid">
flowchart LR
    IF["IF\nInstruction Fetch"] --> DR["DR\nDecode +\nRegFile Read"]
    DR --> EX["EX\nExecute\nALU + BU"]
    EX --> MEM["MEM\nMemory\nLoad/Store"]
    MEM --> WB["WB\nWrite Back\nto RegFile"]
    style IF fill:#e8f5e9,stroke:#00543c,stroke-width:2px
    style DR fill:#e8f5e9,stroke:#00543c,stroke-width:2px
    style EX fill:#fff3cd,stroke:#FDBB30,stroke-width:2px
    style MEM fill:#e8f5e9,stroke:#00543c,stroke-width:2px
    style WB fill:#e8f5e9,stroke:#00543c,stroke-width:2px
</div>

---

## What Each Stage Does

| Stage | Name | Hardware Used |
|---|---|---|
| **IF** | Instruction Fetch | PC, Instruction Memory |
| **DR** | Decode + RegFile Read | Inst Decoder, Imm Decoder, Register File |
| **EX** | Execute | ALU, Branch Unit |
| **MEM** | Memory Access | Data Memory (RAM) |
| **WB** | Write Back | Register File (write port) |

The combinational logic is essentially the **same as the single-cycle processor** — we add pipeline registers to chop the long path into five short ones.

---

## Pipeline Registers

Values produced in one stage are consumed in a **later cycle** — a wire is not enough.

```text
  IF/DR       DR/EX      EX/MEM     MEM/WB
    |            |           |          |
[IF]         [DR]        [EX]       [MEM]       [WB]
```

Each pipeline register carries forward:
- **Data**: `RD0`, `RD1`, immediate, ALU result, loaded value
- **Control signals**: `ALUOp`, write-enable, dest reg `RD`, `WDsel`, mem lines
- **`PC + 4`** for `jal`/`jalr` return-address writes in WB

---

## Why Carry PC + 4 Down the Pipeline?

`jal ra, label` must write **return address = PC + 4** into `ra` at WB.

By the time `jal` reaches WB, the PC has already advanced to new addresses.

<div class="mermaid">
flowchart LR
    A["IF\ncompute PC+4"] --> B["IF/DR\nholds PC+4"]
    B --> C["DR/EX\nholds PC+4"]
    C --> D["EX/MEM\nholds PC+4"]
    D --> E["MEM/WB\nholds PC+4"]
    E --> F["WB\nwrite PC+4 to ra"]
    style F fill:#e8f5e9,stroke:#00543c,stroke-width:2px
</div>

`PC + 4` is computed once in IF and **rides the pipeline registers** all the way to WB.

---

## Control Signals Travel with the Instruction

The instruction decoder works in two parts (Project 7 refactor):

1. **Instruction word → INUM** (which instruction is this?)
2. **INUM → control signals** via a ROM (spreadsheet-populated)

<div class="info-box">
Control signals are generated in <strong>DR</strong> but consumed across <strong>EX</strong>, <strong>MEM</strong>, and <strong>WB</strong>. They ride the pipeline registers with the instruction.
</div>

The implementation also keeps per-stage **"versions in flight"** — one copy of the instruction visible at each stage — for debugging.

---

## Flushing Pipeline Registers

Pipeline registers have a **CLR** (clear) input.

<div class="highlight-box">
Clearing a pipeline register converts that in-flight instruction into a <strong>no-op (bubble)</strong>.
</div>

Used to cancel instructions fetched on the wrong path — for example, after a **taken branch**:

```text
Branch taken in EX cycle 5
 -> flush IF/DR and DR/EX registers
 -> inject two bubbles
 -> correct target enters IF next cycle
```

This is the hardware mechanism for handling **control hazards**.

---

## Pipeline Execution Diagram

```text
            cycle:  1     2     3     4     5     6     7
addi a0,z,1        [IF]  [DR]  [EX]  [MEM] [WB]
addi a1,z,2              [IF]  [DR]  [EX]  [MEM] [WB]
add  a2,a0,a1                   [IF]  [DR]  [EX]  [MEM] [WB]
```

- Read **across** a row: one instruction marching through 5 stages
- Read **down** a column: multiple instructions in flight at once
- Pipeline reaches steady-state when all 5 stages are occupied (needs 5+ instructions)

---

## Pipeline Diagram: Cycle-by-Cycle View

<div class="mermaid">
flowchart TD
    subgraph C3["Cycle 3"]
      A3["addi a0: EX"]
      B3["addi a1: DR"]
      C3i["add a2: IF"]
    end
    subgraph C4["Cycle 4"]
      A4["addi a0: MEM"]
      B4["addi a1: EX"]
      C4i["add a2: DR"]
    end
    subgraph C5["Cycle 5"]
      A5["addi a0: WB"]
      B5["addi a1: MEM"]
      C5i["add a2: EX"]
    end
    C3 --> C4 --> C5
</div>

In cycle 5: three instructions simultaneously active. After cycle 5, one instruction completes every cycle.

---

## Two Kinds of Hazards

<div class="mermaid">
flowchart TD
    H["Pipeline Hazards"] --> D["Data Hazards\ninstruction needs a result\nnot yet written back"]
    H --> C["Control Hazards\njumps and branches\nchange the PC"]
    style D fill:#e8f5e9,stroke:#00543c,stroke-width:2px
    style C fill:#fff3cd,stroke:#FDBB30,stroke-width:2px
</div>

Both prevent a real pipeline from reaching the ideal k× speedup.

---

## Data Hazards

```text
addi a0, zero, 1   # writes a0 in WB -- cycle 5
addi a1, zero, 2   # writes a1 in WB -- cycle 6
add  a2, a0, a1    # reads a0, a1 in DR -- cycle 4  <- TOO EARLY
```

<div class="highlight-box">
<strong>Data hazard:</strong> the consumer reads operands before the producer has written them back. Result: stale (wrong) data.
</div>

`add` reaches DR in cycle 4, but `a0` and `a1` are not written until cycles 5 and 6.

---

## Temporary Fix: Insert No-ops

Pad with `nop` instructions to let producers reach WB before the consumer reaches DR:

```text
addi a0, zero, 1
addi a1, zero, 2
nop                # stall: wait for a0
nop                # stall: wait for a1
add  a2, a0, a1    # now a0 and a1 are valid
```

This **works** but **wastes two cycles**. Project 7 replaces nops with smarter hardware.

---

## Better Data Hazard Solutions (Project 7)

| Technique | Idea | Cycles Saved |
|---|---|---|
| **Clock inversion** | WB writes 1st half-cycle, DR reads 2nd half | 1 nop eliminated |
| **Forwarding** | Route EX/MEM result directly to ALU input | Most stalls removed |
| **ld-stall** | Load result unavailable until after MEM; stall 1 cycle then forward | 1 cycle stall for `ld` followed immediately by dependent insn |

<div class="info-box">
Hazard detection unit gates the PC enable: <code>stl_en = ld_stall OR en</code>
</div>

---

## Forwarding (Bypassing)

Without forwarding: wait until WB writes to register file, then DR reads.

With forwarding: route the result **earlier**, directly to the ALU input.

```text
addi a0, zero, 1     EX result available after cycle 3
add  a2, a0, a1      needs a0 in EX -- cycle 5

Forward: EX/MEM pipeline register -> ALU input of add
         (no stall needed for this case)
```

<div class="highlight-box">
Forwarding is the primary technique for eliminating data hazard stalls.
</div>

---

## Control Hazards

When a **branch or jump** is fetched, the pipeline does not yet know the next PC.

```text
Cycle 1: beq a0, a1, target  [IF]
Cycle 2: next insn (maybe wrong path) [IF]  <- fetched speculatively
Cycle 3: next+1 insn (maybe wrong path) [IF] <- fetched speculatively
         beq resolves in EX -> branch taken!
```

The two speculatively-fetched instructions may be on the wrong path.

**Fix:** flush (clear) those pipeline registers once the branch resolves — inject bubbles.

---

## Control Hazard: Temporary Fix

Insert **nop** instructions after a jump so the jump's new PC is in effect before any real instruction follows:

```text
jal ra, target
nop               # pipeline is still fetching from old PC
nop
                  # by now, PC = target, safe to fetch real instructions
```

The real fix (Project 7): detect the taken branch in EX, **flush** IF/DR and DR/EX registers, redirect PC.

---

## Hazard Summary

| Hazard | Cause | Temporary Fix | Project 7 Fix |
|---|---|---|---|
| **Data** | Producer not yet in WB when consumer reaches DR | Insert `nop`s | Clock inversion, forwarding, ld-stall |
| **Control** | Branch/jump destination unknown when pipeline fetches next insns | Insert `nop`s after branch | Flush wrong-path instructions; redirect PC |

---

## Project 7 Overview

Extend the **single-cycle processor** into a pipelined one:

1. Partition datapath into IF, DR, EX, MEM, WB
2. Insert pipeline registers (IF/DR, DR/EX, EX/MEM, MEM/WB) carrying data + control + PC+4
3. Refactor decoder: insn word → INUM → control ROM; keep per-stage "versions in flight"
4. Get basic pipeline working (using nops as needed)
5. Then add hazard handling: clock inversion, forwarding, ld-stall, flushing

<div class="mermaid">
flowchart LR
    P6["Project 6\nsingle-cycle\n(built from scratch)"] --> P7["Project 7\npipelined + hazard\nhandling (extend)"]
    style P7 fill:#e8f5e9,stroke:#00543c,stroke-width:2px
</div>

---

## Key Concepts Reference

| Concept | Definition |
|---|---|
| **Latency** | Time for one task end-to-end (unchanged by pipelining) |
| **Throughput** | Tasks per unit time (improved by pipelining) |
| **Pipeline register** | Stores data + control signals between stages |
| **Flush** | Clear a pipeline register to inject a no-op (bubble) |
| **Data hazard** | Consumer needs a result not yet written back |
| **Control hazard** | Next PC unknown until branch/jump resolves |
| **Forwarding** | Route result directly to later stage input |
| **ld-stall** | 1-cycle stall when load result used immediately |

---

## Summary

1. **Three organizations:** single-cycle (long clock), multi-cycle (reuse hardware), pipelined (overlap stages of different instructions).

2. **Latency vs. throughput:** pipelining leaves single-instruction latency unchanged; it improves throughput toward one instruction per clock.

3. **Formula:** `Total = k*t + (N-1)*t`; speedup approaches **k** as N grows.

4. **Five stages:** IF, DR, EX, MEM, WB — same hardware as single-cycle, chopped by pipeline registers.

5. **Pipeline registers** carry data, control signals, and PC+4 forward each cycle; clearing one injects a bubble.

6. **Data hazards** (stale reads) and **control hazards** (wrong-path fetches) limit real speedup; Project 7 fixes them with forwarding, clock inversion, ld-stall detection, and flushing.