← Back to Course
# Advanced Architecture and Final Review ## CS 315 Computer Architecture --- ## Topics Today - **Project 07 Q&A** — Hazard unit stall-enable logic - **Beyond the simple pipeline** — speculative and out-of-order execution - **Security** — Spectre and Meltdown - **Lab 11 exam-style problems** — combinational design and pipeline tracing - **Final exam overview**
Final exam: 35% of grade — covers all lectures, labs, and assignments. Two pages of notes allowed.
--- ## The Five-Stage Pipeline (Review)
flowchart LR IF["IF\nFetch"] --> DE["DE\nDecode"] DE --> EX["EX\nExecute"] EX --> MEM["MEM\nMemory"] MEM --> WB["WB\nWriteback"]
Up to five instructions in flight simultaneously — but introduces **hazards**. --- ## Three Hazard Types | Hazard | Cause | Resolution | |--------|-------|------------| | **Data** | Instruction needs a value not yet written back | Forwarding; stall if too early | | **Load-use** | `ld` produces value in MEM, next needs it in EX | Stall 1 cycle, then forward | | **Control** | Branch/jump resolved late; wrong path already fetched | Flush wrong-path instructions |
The
hazard unit
is combinational logic that watches the pipeline and issues stalls or flushes every cycle.
--- ## Stall Mechanism: Enable Signals To stall, de-assert enables on *upstream* state elements so they hold: - **PC register** has `EN` — if `EN = 0`, PC holds (same instruction re-fetched) - **IF/DE pipeline register** has `EN` — if `EN = 0`, decode stage freezes
flowchart LR EN["en"] --> HU["Hazard Unit"] LDS["ld-stall\n(load-use detected)"] --> HU HU -->|"stl-en"| PC["PC\n(EN)"] HU -->|"stl-en"| IFDE["IF/DE reg\n(EN)"]
--- ## Deriving stl-en: Truth Table | `ld-stall` | `en` | `stl-en` | |:----------:|:----:|:--------:| | 0 | 0 | 0 | | 0 | 1 | **1** | | 1 | 0 | 0 | | 1 | 1 | 0 | Output is 1 in exactly **one** row: when the pipeline is enabled AND there is no load-use stall. ```text stl-en = (NOT ld-stall) AND en ``` --- ## Gate-Level Implementation An AND gate with one inverted (bubbled) input: ```text ld-stall ----o\ ) AND ----- stl-en en ----------/ (small "o" = inverter on ld-stall input) ``` ```c // Hazard unit (combinational, 1-bit values) int stl_en = (!ld_stall) & en; ```
Common bug: wiring
stl-en = ld-stall OR en
— this
never
stalls, silently reading stale register values.
--- ## How Real CPUs Go Faster Your Project 06/07 processor executes strictly in order. Every real CPU adds more layers:
flowchart TD A["Single-cycle\n(Project 06)"] --> B["Pipelined, in-order\n(Project 07)"] B --> C["+ Branch prediction\n& speculative execution"] B --> D["+ Out-of-order\nexecution"] C --> E["Modern superscalar\nOOO core"] D --> E
--- ## Speculative Execution A deep pipeline must fetch instructions *before* a branch resolves. **Speculation** = guess the branch outcome and start executing the predicted path.
flowchart LR F["Fetch"] --> P{"Predict\ntaken?"} P -->|"guess path"| SPEC["Speculatively\nexecute"] SPEC --> CHK{"Branch\nresolved:\ncorrect?"} CHK -->|"yes"| COMMIT["Commit results"] CHK -->|"no"| FLUSH["Flush pipeline,\nrestart correct path"]
Modern branch predictors are ~95-97% accurate. --- ## Speculation: The Key Insight - **Guess right** → speculatively-executed work is committed — free performance - **Guess wrong** → pipeline flushed, restart down correct path (mispredict penalty = pipeline depth)
A squashed speculative path is
architecturally
invisible — registers and memory are rolled back. But it can leave
microarchitectural side effects
in the cache. That is what Spectre exploits.
--- ## Out-of-Order (OOO) Execution **Problem:** even with perfect prediction, in-order stalls on slow instructions (cache-missing loads). ```asm ld t0, (a0) # cache miss: ~200 cycle wait add t1, t2, t3 # independent of t0 — why wait? ``` **OOO:** execute whichever instructions have their inputs ready, regardless of program order. Retire **in program order**. --- ## OOO Pipeline Structure
flowchart LR FE["Fetch/Decode\n(in order)"] --> REN["Rename\n(remove false deps)"] REN --> SCHED["Scheduler\n(pick ready ops)"] SCHED --> FU["Functional Units\n(execute OOO)"] FU --> ROB["Reorder Buffer"] ROB --> RET["Retire\n(in order)"]
**Register renaming** removes false WAW/WAR dependencies from the small architectural register file. --- ## OOO vs In-Order Timing ```text In-order (stalls behind the load): ld t0,(a0) |EX|--- waiting on memory (~200 cyc) ---|WB| add t1,t2,t3 (cannot start until load clears) |EX|WB| Out-of-order (add runs during the miss): ld t0,(a0) |EX|--- waiting on memory --------------|WB| add t1,t2,t3 |EX|WB| <-- executed early, latency hidden ``` OOO buys ~20-40% extra IPC over a well-tuned in-order core. --- ## Privileged Mode (Brief) Real ISAs have two worlds: | Mode | Who | Can do | |------|-----|--------| | **User** | Application code | Normal instructions | | **Privileged** | OS kernel | Set page tables, handle traps, manage devices | - Isolates processes from each other - Mediates all hardware access via system calls / traps
Not implemented in CS 315 — covered in depth in
CS 326 Operating Systems
.
--- ## Spectre and Meltdown In 2018, researchers showed speculation opens a **timing side channel** to leak data that should be inaccessible. | Attack | Abuses | Leaks | |--------|--------|-------| | **Spectre** | Speculation past a bounds check | Data within same address space across security boundaries | | **Meltdown** | OOO running after a permission fault before it is delivered | Kernel memory readable from user space | --- ## The Cache Timing Side Channel
flowchart TD A["CPU speculates past\na bounds check/fault"] --> B["Speculatively reads\na secret byte"] B --> C["Uses secret to index array:\nloads array[secret * 256]"] C --> D["Speculation squashed:\nregisters rolled back"] D --> E["BUT cache line for\narray[secret*256] stays warm"] E --> F["Attacker times array[i] for all i:\nthe fast one reveals secret"]
--- ## Why Complete Fixes Are Hard - Speculation and OOO are **fundamental** to how fast CPUs are built - Cache side channels are intrinsic to having caches at all - Mitigations exist (KPTI, speculation barriers, retpolines, microcode) — but cost real performance - Researchers keep finding new variants
The very tricks that make modern CPUs fast are the same ones that make them hard to secure.
--- ## Lab 11, Q1: Sum-of-Products (Majority) **Majority(a, b, c)** = 1 when at least two inputs are 1. Truth table minterms (rows where output = 1): | `a` | `b` | `c` | majority | |:---:|:---:|:---:|:--------:| | 0 | 1 | 1 | **1** | | 1 | 0 | 1 | **1** | | 1 | 1 | 0 | **1** | | 1 | 1 | 1 | **1** | ```text majority = (a'·b·c) + (a·b'·c) + (a·b·c') + (a·b·c) ``` Simplified: `majority = ab + ac + bc` --- ## Majority: SOP Circuit
flowchart LR a["a"] --> AND1 & AND2 & AND3 & AND4 b["b"] --> AND1 & AND2 & AND3 & AND4 c["c"] --> AND1 & AND2 & AND3 & AND4 AND1["AND\na'·b·c"] --> OR["OR"] AND2["AND\na·b'·c"] --> OR AND3["AND\na·b·c'"] --> OR AND4["AND\na·b·c"] --> OR OR --> M["majority"]
Four 3-input AND gates (with NOT on complemented inputs) feed one 4-input OR gate. --- ## SOP Recipe (applies to any function) 1. Build the truth table 2. Identify all rows where output = **1** (minterms) 3. Write one product term per minterm: - Variable appears **complemented** if it is 0 in that row - Variable appears **uncomplemented** if it is 1 4. OR all product terms together 5. Draw AND/OR/NOT circuit
Q2 (xor3): minterms are rows with an
odd
number of 1s: 001, 010, 100, 111.
xor3 = a'b'c + a'bc' + ab'c' + abc
--- ## Lab 11, Q3: Max3 Circuit Build **Max3(A, B, C)** from comparators and MUXes (not by calling Max2). Cascade two comparator + MUX stages:
flowchart LR A1["A"] --> CMP1["cmp\nA > B"] B1["B"] --> CMP1 A1 --> MUX1["MUX"] B1 --> MUX1 CMP1 -->|"sel"| MUX1 MUX1 -->|"M = max(A,B)"| CMP2["cmp\nM > C"] C1["C"] --> CMP2 MUX1 --> MUX2["MUX"] C1 --> MUX2 CMP2 -->|"sel"| MUX2 MUX2 --> MAX["Max"]
--- ## Max3 Explanation - **Stage 1:** `cmp(A > B)` drives MUX1 → output `M = max(A, B)` - **Stage 2:** `cmp(M > C)` drives MUX2 → output `max(M, C) = Max3` **Sort2 variation:** compute `gt = (A > B)` once, feed two MUXes with opposite data wiring: ```text OUT2 (larger) = gt ? A : B OUT1 (smaller) = gt ? B : A ``` One comparator, two complementary MUXes — no extra comparison needed. --- ## Lab 11, Q5-6: Single-Cycle Limits Every answer must be **justified from the hardware**. | Question | Answer | Justification | |----------|--------|---------------| | Max instructions | `2^(ROM addr bits)` | ROM words = addressable entries | | Registers written/cycle | **1** | Single write port on register file | | Max data memory | `2^(addr bits) × bytes/word` | RAM entries × entry size | | Min clock period | Longest combinational path | Load: PC→IMEM→regfile→ALU→DMEM→MUX→regfile | --- ## Adding New Instructions: 5-Step Recipe 1. **Identify** format (R/I/S/B/U/J) 2. **Components** — new hardware needed? 3. **Datapath** — new MUX inputs or wires? 4. **Control** — new control lines? 5. **Decoder** — opcode/funct detection and control bundle **LWU (load word unsigned):** same as `LW` but zero-extend instead of sign-extend. - Add zero-extend path; widen load-result MUX; new control bit for extension type. --- ## The SWPR Trap (Two Write Ports) `swpr a0, a1` must write **both** `a0` and `a1` in the **same** clock cycle.
The register file has a
single write port
— only one register can be updated per cycle. Two simultaneous writes require adding a
second write port
.
- This is the hardware change no other "add an instruction" problem requires - ADDI2 (write two registers) has the same issue - The exam asks for this **explicit justification** — answers without it earn 0 --- ## Lab 11, Q7: Pipeline Behavior (No Hazard Handling) ```asm addi a1, zero, 3 addi a2, zero, 4 add a0, a1, a2 ← reads a1, a2 in DE — before the addi WBs unimp ``` **Result without hazard handling:** `add` reads the **old** values (both 0), so `a0 = 0` instead of 7. **Fix (software):** insert NOPs to let producers write back before the consumer reads: ```asm addi a1, zero, 3 addi a2, zero, 4 nop nop add a0, a1, a2 ← now a1, a2 are written back ``` --- ## Pipeline Cycle Counting For `n` instructions, no stalls, 5-stage pipeline: `n + 4` cycles total. ```text Cycle: 1 2 3 4 5 6 7 addi a1 : IF DE EX MEM WB addi a2 : IF DE EX MEM WB add a0 : IF DE EX MEM WB unimp : IF DE EX MEM WB ``` With **forwarding**, the `add` gets `a1`/`a2` bypassed from EX/MEM — no stall needed here. --- ## Load-Use Stall in the Pipeline ```asm ld a2, 0(zero) add a0, a2, a2 ← load-use hazard: needs a2 in EX, available end of MEM ``` ```text ld a2,0(zero) : IF DE EX MEM WB add a0,a2,a2 : IF DE ** EX MEM WB ^^ 1-cycle bubble (stall) ``` - **Forwarding alone cannot help** — `a2` is not ready until end of MEM, but `add` needs it in EX - Hazard unit asserts `ld-stall = 1` → `stl-en = 0` → PC and IF/DE hold for one cycle --- ## Full Load-Use Sequence: Hazard Summary ```asm addi a1, zero, 3 sd a1, 0(zero) ← forward addi result to sd (no stall) ld a2, 0(zero) add a0, a2, a2 ← load-use stall (1 cycle) + forward unimp ``` | Hazard type | Needed? | Why | |-------------|---------|-----| | Forwarding | Yes | `addi → sd`, `ld → add` after stall | | Stalling | Yes | `ld → add` is load-use (1 cycle gap) | | Flushing | No | No taken branches/jumps | --- ## Key Concepts Reference | Concept | One-liner | |---------|-----------| | `stl-en` | `(NOT ld-stall) AND en` — AND gate, inverted ld-stall input | | Load-use stall | `ld` then immediate consumer → 1-cycle bubble | | Forwarding | Bypass EX/MEM result to next EX input | | Speculation | Guess branch, execute ahead; flush on mispredict | | OOO execution | Run ready instructions early; retire in order | | Spectre/Meltdown | Cache timing leaks squashed speculative reads | | SOP | One AND term per 1-row; OR them all | | Single write port | Why `swpr`/`addi2` need a second port | | Critical path | Slowest path sets minimum clock period | --- ## Final Exam Overview - **Worth:** 35% of course grade - **Covers:** all lectures, labs, and assignments - **Notes:** two pages allowed - **Format:** exam-style problems like Lab 11 What to expect: - Truth tables and SOP expressions - Digital circuits (comparators, MUXes) - Single-cycle processor limits (justify from hardware) - Pipeline tracing with forwarding, stalls, and flushes - Short answer on speculative/OOO execution and Spectre/Meltdown --- ## Summary 1. **Hazard unit** produces `stl-en = (NOT ld-stall) AND en` — an AND gate with inverted `ld-stall` 2. **Three hazards:** data (forwarding), load-use (stall + forward), control (flush) 3. **Speculative execution** guesses branches; squash rolls back registers but leaves cache residue 4. **OOO execution** runs ready instructions out of order; retires in order; hides load latency 5. **Spectre/Meltdown** exploit cache residue via timing to leak secrets across security boundaries 6. **SOP design:** one product term per truth-table 1-row; realize with AND/OR/NOT gates 7. **Max3/Sort2** cascade comparator+MUX stages; Sort2 uses one comparator, two swapped MUXes 8. **Single-cycle limits** must be justified from hardware: one write port, ROM size, critical path