Advanced Architecture and Final Review

# Advanced Architecture and Final Review

## CS 315 Computer Architecture

---

## Topics Today

- **Project 07 Q&A** — Hazard unit stall-enable logic
- **Beyond the simple pipeline** — speculative and out-of-order execution
- **Security** — Spectre and Meltdown
- **Lab 11 exam-style problems** — combinational design and pipeline tracing
- **Final exam overview**

<div class="info-box">
Final exam: 35% of grade — covers all lectures, labs, and assignments. Two pages of notes allowed.
</div>

---

## The Five-Stage Pipeline (Review)

<div class="mermaid">
flowchart LR
    IF["IF\nFetch"] --> DE["DE\nDecode"]
    DE --> EX["EX\nExecute"]
    EX --> MEM["MEM\nMemory"]
    MEM --> WB["WB\nWriteback"]
</div>

Up to five instructions in flight simultaneously — but introduces **hazards**.

---

## Three Hazard Types

| Hazard | Cause | Resolution |
|--------|-------|------------|
| **Data** | Instruction needs a value not yet written back | Forwarding; stall if too early |
| **Load-use** | `ld` produces value in MEM, next needs it in EX | Stall 1 cycle, then forward |
| **Control** | Branch/jump resolved late; wrong path already fetched | Flush wrong-path instructions |

<div class="highlight-box">
The <strong>hazard unit</strong> is combinational logic that watches the pipeline and issues stalls or flushes every cycle.
</div>

---

## Stall Mechanism: Enable Signals

To stall, de-assert enables on *upstream* state elements so they hold:

- **PC register** has `EN` — if `EN = 0`, PC holds (same instruction re-fetched)
- **IF/DE pipeline register** has `EN` — if `EN = 0`, decode stage freezes

<div class="mermaid">
flowchart LR
    EN["en"] --> HU["Hazard Unit"]
    LDS["ld-stall\n(load-use detected)"] --> HU
    HU -->|"stl-en"| PC["PC\n(EN)"]
    HU -->|"stl-en"| IFDE["IF/DE reg\n(EN)"]
</div>

---

## Deriving stl-en: Truth Table

| `ld-stall` | `en` | `stl-en` |
|:----------:|:----:|:--------:|
| 0 | 0 | 0 |
| 0 | 1 | **1** |
| 1 | 0 | 0 |
| 1 | 1 | 0 |

Output is 1 in exactly **one** row: when the pipeline is enabled AND there is no load-use stall.

```text
stl-en = (NOT ld-stall) AND en
```

---

## Gate-Level Implementation

An AND gate with one inverted (bubbled) input:

```text
   ld-stall ----o\
                 ) AND ----- stl-en
   en ----------/

(small "o" = inverter on ld-stall input)
```

```c
// Hazard unit (combinational, 1-bit values)
int stl_en = (!ld_stall) & en;
```

<div class="highlight-box">
Common bug: wiring <code>stl-en = ld-stall OR en</code> — this <em>never</em> stalls, silently reading stale register values.
</div>

---

## How Real CPUs Go Faster

Your Project 06/07 processor executes strictly in order. Every real CPU adds more layers:

<div class="mermaid">
flowchart TD
    A["Single-cycle\n(Project 06)"] --> B["Pipelined, in-order\n(Project 07)"]
    B --> C["+ Branch prediction\n& speculative execution"]
    B --> D["+ Out-of-order\nexecution"]
    C --> E["Modern superscalar\nOOO core"]
    D --> E
</div>

---

## Speculative Execution

A deep pipeline must fetch instructions *before* a branch resolves. **Speculation** = guess the branch outcome and start executing the predicted path.

<div class="mermaid">
flowchart LR
    F["Fetch"] --> P{"Predict\ntaken?"}
    P -->|"guess path"| SPEC["Speculatively\nexecute"]
    SPEC --> CHK{"Branch\nresolved:\ncorrect?"}
    CHK -->|"yes"| COMMIT["Commit results"]
    CHK -->|"no"| FLUSH["Flush pipeline,\nrestart correct path"]
</div>

Modern branch predictors are ~95-97% accurate.

---

## Speculation: The Key Insight

- **Guess right** → speculatively-executed work is committed — free performance
- **Guess wrong** → pipeline flushed, restart down correct path (mispredict penalty = pipeline depth)

<div class="info-box">
A squashed speculative path is <em>architecturally</em> invisible — registers and memory are rolled back. But it can leave <strong>microarchitectural side effects</strong> in the cache. That is what Spectre exploits.
</div>

---

## Out-of-Order (OOO) Execution

**Problem:** even with perfect prediction, in-order stalls on slow instructions (cache-missing loads).

```asm
ld   t0, (a0)      # cache miss: ~200 cycle wait
add  t1, t2, t3    # independent of t0 — why wait?
```

**OOO:** execute whichever instructions have their inputs ready, regardless of program order. Retire **in program order**.

---

## OOO Pipeline Structure

<div class="mermaid">
flowchart LR
    FE["Fetch/Decode\n(in order)"] --> REN["Rename\n(remove false deps)"]
    REN --> SCHED["Scheduler\n(pick ready ops)"]
    SCHED --> FU["Functional Units\n(execute OOO)"]
    FU --> ROB["Reorder Buffer"]
    ROB --> RET["Retire\n(in order)"]
</div>

**Register renaming** removes false WAW/WAR dependencies from the small architectural register file.

---

## OOO vs In-Order Timing

```text
In-order (stalls behind the load):
  ld   t0,(a0)  |EX|--- waiting on memory (~200 cyc) ---|WB|
  add  t1,t2,t3           (cannot start until load clears)  |EX|WB|

Out-of-order (add runs during the miss):
  ld   t0,(a0)  |EX|--- waiting on memory --------------|WB|
  add  t1,t2,t3 |EX|WB|   <-- executed early, latency hidden
```

OOO buys ~20-40% extra IPC over a well-tuned in-order core.

---

## Privileged Mode (Brief)

Real ISAs have two worlds:

| Mode | Who | Can do |
|------|-----|--------|
| **User** | Application code | Normal instructions |
| **Privileged** | OS kernel | Set page tables, handle traps, manage devices |

- Isolates processes from each other
- Mediates all hardware access via system calls / traps

<div class="info-box">
Not implemented in CS 315 — covered in depth in <strong>CS 326 Operating Systems</strong>.
</div>

---

## Spectre and Meltdown

In 2018, researchers showed speculation opens a **timing side channel** to leak data that should be inaccessible.

| Attack | Abuses | Leaks |
|--------|--------|-------|
| **Spectre** | Speculation past a bounds check | Data within same address space across security boundaries |
| **Meltdown** | OOO running after a permission fault before it is delivered | Kernel memory readable from user space |

---

## The Cache Timing Side Channel

<div class="mermaid">
flowchart TD
    A["CPU speculates past\na bounds check/fault"] --> B["Speculatively reads\na secret byte"]
    B --> C["Uses secret to index array:\nloads array[secret * 256]"]
    C --> D["Speculation squashed:\nregisters rolled back"]
    D --> E["BUT cache line for\narray[secret*256] stays warm"]
    E --> F["Attacker times array[i] for all i:\nthe fast one reveals secret"]
</div>

---

## Why Complete Fixes Are Hard

- Speculation and OOO are **fundamental** to how fast CPUs are built
- Cache side channels are intrinsic to having caches at all
- Mitigations exist (KPTI, speculation barriers, retpolines, microcode) — but cost real performance
- Researchers keep finding new variants

<div class="highlight-box">
The very tricks that make modern CPUs fast are the same ones that make them hard to secure.
</div>

---

## Lab 11, Q1: Sum-of-Products (Majority)

**Majority(a, b, c)** = 1 when at least two inputs are 1.

Truth table minterms (rows where output = 1):

| `a` | `b` | `c` | majority |
|:---:|:---:|:---:|:--------:|
| 0 | 1 | 1 | **1** |
| 1 | 0 | 1 | **1** |
| 1 | 1 | 0 | **1** |
| 1 | 1 | 1 | **1** |

```text
majority = (a'·b·c) + (a·b'·c) + (a·b·c') + (a·b·c)
```

Simplified: `majority = ab + ac + bc`

---

## Majority: SOP Circuit

<div class="mermaid">
flowchart LR
    a["a"] --> AND1 & AND2 & AND3 & AND4
    b["b"] --> AND1 & AND2 & AND3 & AND4
    c["c"] --> AND1 & AND2 & AND3 & AND4
    AND1["AND\na'·b·c"] --> OR["OR"]
    AND2["AND\na·b'·c"] --> OR
    AND3["AND\na·b·c'"] --> OR
    AND4["AND\na·b·c"] --> OR
    OR --> M["majority"]
</div>

Four 3-input AND gates (with NOT on complemented inputs) feed one 4-input OR gate.

---

## SOP Recipe (applies to any function)

1. Build the truth table
2. Identify all rows where output = **1** (minterms)
3. Write one product term per minterm:
   - Variable appears **complemented** if it is 0 in that row
   - Variable appears **uncomplemented** if it is 1
4. OR all product terms together
5. Draw AND/OR/NOT circuit

<div class="info-box">
Q2 (xor3): minterms are rows with an <em>odd</em> number of 1s: 001, 010, 100, 111.<br>
xor3 = a'b'c + a'bc' + ab'c' + abc
</div>

---

## Lab 11, Q3: Max3 Circuit

Build **Max3(A, B, C)** from comparators and MUXes (not by calling Max2).

Cascade two comparator + MUX stages:

<div class="mermaid">
flowchart LR
    A1["A"] --> CMP1["cmp\nA > B"]
    B1["B"] --> CMP1
    A1 --> MUX1["MUX"]
    B1 --> MUX1
    CMP1 -->|"sel"| MUX1
    MUX1 -->|"M = max(A,B)"| CMP2["cmp\nM > C"]
    C1["C"] --> CMP2
    MUX1 --> MUX2["MUX"]
    C1 --> MUX2
    CMP2 -->|"sel"| MUX2
    MUX2 --> MAX["Max"]
</div>

---

## Max3 Explanation

- **Stage 1:** `cmp(A > B)` drives MUX1 → output `M = max(A, B)`
- **Stage 2:** `cmp(M > C)` drives MUX2 → output `max(M, C) = Max3`

**Sort2 variation:** compute `gt = (A > B)` once, feed two MUXes with opposite data wiring:

```text
OUT2 (larger)  = gt ? A : B
OUT1 (smaller) = gt ? B : A
```

One comparator, two complementary MUXes — no extra comparison needed.

---

## Lab 11, Q5-6: Single-Cycle Limits

Every answer must be **justified from the hardware**.

| Question | Answer | Justification |
|----------|--------|---------------|
| Max instructions | `2^(ROM addr bits)` | ROM words = addressable entries |
| Registers written/cycle | **1** | Single write port on register file |
| Max data memory | `2^(addr bits) × bytes/word` | RAM entries × entry size |
| Min clock period | Longest combinational path | Load: PC→IMEM→regfile→ALU→DMEM→MUX→regfile |

---

## Adding New Instructions: 5-Step Recipe

1. **Identify** format (R/I/S/B/U/J)
2. **Components** — new hardware needed?
3. **Datapath** — new MUX inputs or wires?
4. **Control** — new control lines?
5. **Decoder** — opcode/funct detection and control bundle

**LWU (load word unsigned):** same as `LW` but zero-extend instead of sign-extend.
- Add zero-extend path; widen load-result MUX; new control bit for extension type.

---

## The SWPR Trap (Two Write Ports)

`swpr a0, a1` must write **both** `a0` and `a1` in the **same** clock cycle.

<div class="highlight-box">
The register file has a <strong>single write port</strong> — only one register can be updated per cycle. Two simultaneous writes require adding a <strong>second write port</strong>.
</div>

- This is the hardware change no other "add an instruction" problem requires
- ADDI2 (write two registers) has the same issue
- The exam asks for this **explicit justification** — answers without it earn 0

---

## Lab 11, Q7: Pipeline Behavior (No Hazard Handling)

```asm
addi a1, zero, 3
addi a2, zero, 4
add  a0, a1, a2    ← reads a1, a2 in DE — before the addi WBs
unimp
```

**Result without hazard handling:** `add` reads the **old** values (both 0), so `a0 = 0` instead of 7.

**Fix (software):** insert NOPs to let producers write back before the consumer reads:

```asm
addi a1, zero, 3
addi a2, zero, 4
nop
nop
add  a0, a1, a2    ← now a1, a2 are written back
```

---

## Pipeline Cycle Counting

For `n` instructions, no stalls, 5-stage pipeline: `n + 4` cycles total.

```text
Cycle:        1    2    3    4    5    6    7
addi a1   :  IF   DE   EX   MEM  WB
addi a2   :       IF   DE   EX   MEM  WB
add  a0   :            IF   DE   EX   MEM  WB
unimp     :                 IF   DE   EX   MEM  WB
```

With **forwarding**, the `add` gets `a1`/`a2` bypassed from EX/MEM — no stall needed here.

---

## Load-Use Stall in the Pipeline

```asm
ld   a2, 0(zero)
add  a0, a2, a2    ← load-use hazard: needs a2 in EX, available end of MEM
```

```text
ld   a2,0(zero)  :  IF  DE  EX  MEM  WB
add  a0,a2,a2    :      IF  DE  **   EX  MEM  WB
                                ^^
                         1-cycle bubble (stall)
```

- **Forwarding alone cannot help** — `a2` is not ready until end of MEM, but `add` needs it in EX
- Hazard unit asserts `ld-stall = 1` → `stl-en = 0` → PC and IF/DE hold for one cycle

---

## Full Load-Use Sequence: Hazard Summary

```asm
addi a1, zero, 3
sd   a1, 0(zero)    ← forward addi result to sd (no stall)
ld   a2, 0(zero)
add  a0, a2, a2     ← load-use stall (1 cycle) + forward
unimp
```

| Hazard type | Needed? | Why |
|-------------|---------|-----|
| Forwarding | Yes | `addi → sd`, `ld → add` after stall |
| Stalling | Yes | `ld → add` is load-use (1 cycle gap) |
| Flushing | No | No taken branches/jumps |

---

## Key Concepts Reference

| Concept | One-liner |
|---------|-----------|
| `stl-en` | `(NOT ld-stall) AND en` — AND gate, inverted ld-stall input |
| Load-use stall | `ld` then immediate consumer → 1-cycle bubble |
| Forwarding | Bypass EX/MEM result to next EX input |
| Speculation | Guess branch, execute ahead; flush on mispredict |
| OOO execution | Run ready instructions early; retire in order |
| Spectre/Meltdown | Cache timing leaks squashed speculative reads |
| SOP | One AND term per 1-row; OR them all |
| Single write port | Why `swpr`/`addi2` need a second port |
| Critical path | Slowest path sets minimum clock period |

---

## Final Exam Overview

- **Worth:** 35% of course grade
- **Covers:** all lectures, labs, and assignments
- **Notes:** two pages allowed
- **Format:** exam-style problems like Lab 11

What to expect:
- Truth tables and SOP expressions
- Digital circuits (comparators, MUXes)
- Single-cycle processor limits (justify from hardware)
- Pipeline tracing with forwarding, stalls, and flushes
- Short answer on speculative/OOO execution and Spectre/Meltdown

---

## Summary

1. **Hazard unit** produces `stl-en = (NOT ld-stall) AND en` — an AND gate with inverted `ld-stall`
2. **Three hazards:** data (forwarding), load-use (stall + forward), control (flush)
3. **Speculative execution** guesses branches; squash rolls back registers but leaves cache residue
4. **OOO execution** runs ready instructions out of order; retires in order; hides load latency
5. **Spectre/Meltdown** exploit cache residue via timing to leak secrets across security boundaries
6. **SOP design:** one product term per truth-table 1-row; realize with AND/OR/NOT gates
7. **Max3/Sort2** cascade comparator+MUX stages; Sort2 uses one comparator, two swapped MUXes
8. **Single-cycle limits** must be justified from hardware: one write port, ROM size, critical path