Advanced Architecture and Final Review¶
Overview¶
This is the final lecture of CS 315. It pulls together two threads. First, we close out the processor track: a Project 07 Q&A on the pipelined hazard unit, then a conceptual tour of how real CPUs go far beyond the single-cycle and simple pipelined designs you built this semester — using speculative execution and out-of-order execution, along with the security consequences (Spectre and Meltdown) of those tricks. Second, we work the Lab 11 exam-style problems (sum-of-products, digital design with comparators and multiplexers, and pipeline behavior) so you know exactly what the final exam looks like and how to show your work. The final is worth 35% of the course grade, covers all lectures, labs, and assignments, and allows two pages of notes.
Learning Objectives¶
- Explain the hazard unit's stall-enable logic and complete its truth table and gate-level implementation
- Distinguish data, control, and structural hazards and the techniques that resolve each (forwarding, stalling, flushing)
- Describe speculative execution and out-of-order execution and why modern CPUs use them
- Explain at a high level how Spectre and Meltdown abuse speculation and cache timing to leak data
- Derive a sum-of-products (SOP) boolean expression from a truth table and draw the corresponding AND/OR/NOT circuit
- Build a Max3 circuit from comparators and multiplexers and explain how it works
- Reason about single-cycle processor limits (instruction count, registers updated per cycle, addressable memory)
- Trace a short program through a pipeline and count clock cycles, identifying where stalls, flushes, and forwarding occur
Prerequisites¶
- Single-cycle RISC-V processor datapath and control (Project 06)
- Pipelined processor with pipeline registers and hazard handling (Project 07)
- Combinational logic, truth tables, and boolean algebra (Lectures 9-10)
- Sequential logic: registers, clocks, enable signals
- RISC-V assembly:
add,addi,ld,sd,unimp
1. Project 07 Q&A — The Hazard Unit¶
Project 07 takes the single-cycle processor and splits it into the classic five pipeline stages so that up to five instructions are in flight at once. The cost of that overlap is hazards — situations where the naive pipeline produces the wrong answer or fetches the wrong instruction. The hazard unit is the combinational block that watches the pipeline and decides, every cycle, whether to let the pipeline advance normally or to stall (freeze the front of the pipeline and inject a bubble).
The three hazard types¶
| Hazard | Cause | Resolution in Project 07 |
|---|---|---|
| Data | An instruction needs a register value that an earlier, still-in-flight instruction has not written back yet | Forwarding (bypass the value from a later stage back to EX); stall only when forwarding cannot supply it in time |
| Load-use | A load (ld) produces its value in MEM, but the next instruction wants it in EX — one cycle too early |
Stall one cycle (the load-use stall), then forward |
| Control | A taken branch/jump is resolved late, after the wrong-path instructions are already fetched | Flush the wrongly-fetched instructions (replace with bubbles) |
Today's Q&A focused on the load-use stall path, because that is where the hazard unit must actively freeze part of the pipeline.
Enable signals and the stall mechanism¶
A pipeline stage advances when its enable (EN) input is asserted on the clock edge. To stall, you de-assert the enables on the upstream state elements so they hold their current values while the rest of the pipeline drains:
- The PC register has an enable
EN. IfEN = 0, the PC does not change, so the same instruction is fetched again next cycle. - The IF/DE pipeline register has an enable. Holding it freezes the instruction sitting in decode.
The hazard unit receives the global en signal and produces a stl-en (stall-enable) signal. When a load-use hazard is detected (ld-stall = 1), the hazard unit must suppress the enable so the PC and IF/DE register hold instead of advancing.
flowchart LR
EN["en"] --> HU["Hazard Unit"]
LDS["ld-stall?<br/>(load-use<br/>detected)"] --> HU
HU -->|"stl-en"| PC["PC<br/>(EN)"]
HU -->|"stl-en"| IFDE["IF/DE reg<br/>(EN)"]
Deriving the stall-enable logic¶
The instructor walked through the truth table for stl-en in terms of ld-stall (1 = a load-use hazard is present) and en (1 = pipeline would otherwise advance):
ld-stall |
en |
stl-en |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
Reading the only row where the output is 1: stl-en = 1 exactly when ld-stall = 0 and en = 1. So:
In words: advance the front of the pipeline only when the pipeline is enabled and there is no load-use stall. When a stall is detected, stl-en drops to 0, the PC and IF/DE register hold, and a bubble is injected into the EX stage.
The gate-level implementation¶
The handwritten notes draw this as a single AND gate with one inverted (bubbled) input: ld-stall enters through an inversion bubble, en enters directly, and the output is stl-en.
ld-stall ----o\
) AND ----- stl-en
en ----------/
(the small "o" bubble on the ld-stall input is a built-in inverter)
Equivalently, in C-like pseudocode for the combinational block:
// Hazard unit: produce the stall-enable signal
int stl_en = (!ld_stall) & en; // bitwise on 1-bit values
A common student bug is to wire stl-en = ld-stall OR en (or to forget the inversion), which never stalls — the load-use hazard then silently reads a stale register value. Always confirm against the truth table: the output is 1 in exactly one row.
Project 06 debugging note from class: if your autograder suddenly fails after reorganizing your repo, check that the instruction memory ROM was not deleted when nesting/un-nesting directories. A missing ROM means no instructions are fetched. Inspect forwarding paths and control lines, and verify ROM integrity.
2. Beyond the Simple Pipeline — How Real CPUs Go Faster¶
The processor you built executes instructions strictly in program order, one path at a time. Every chip you actually own — laptop, phone, cloud VM — does far more to extract performance. Two ideas dominate and were the focus of the advanced-architecture review: speculative execution and out-of-order execution. Understanding the simple datapath first is exactly what makes these comprehensible; they are layers stacked on top of what you already built.
flowchart TD
A["Single-cycle<br/>(Project 06)"] --> B["Pipelined, in-order<br/>(Project 07)"]
B --> C["+ Branch prediction<br/>& speculative execution"]
B --> D["+ Out-of-order<br/>execution"]
C --> E["Modern superscalar<br/>OOO core"]
D --> E
3. Speculative Execution¶
A deep pipeline learns whether a branch is taken only after the branch has progressed several stages. But the front-end must fetch something every cycle in the meantime. Speculative execution means the CPU guesses the branch outcome (using a branch predictor), fetches and begins executing down the predicted path, and only later checks whether the guess was right.
- Guess right (predictors are ~95-97% accurate today): the speculatively-executed work is committed — free performance.
- Guess wrong: the pipeline is flushed — every speculatively-fetched instruction is discarded and the CPU restarts down the correct path. The cost (the mispredict penalty) is roughly the number of stages between fetch and branch resolution.
flowchart LR
F["Fetch"] --> P{"Predict<br/>taken?"}
P -->|"guess path"| SPEC["Speculatively<br/>execute"]
SPEC --> CHK{"Branch<br/>resolved:<br/>correct?"}
CHK -->|"yes"| COMMIT["Commit results"]
CHK -->|"no"| FLUSH["Flush pipeline,<br/>restart correct path"]
The key insight that matters for security: a mispredicted speculative path is architecturally discarded — the register file and memory are rolled back as if it never ran — but its execution can leave microarchitectural side effects, most importantly entries it pulled into the cache. That observable residue is what Spectre exploits (Section 6).
4. Out-of-Order Execution¶
Even with perfect prediction, in-order execution stalls whenever the next instruction depends on a slow predecessor (e.g., a cache-missing load). Out-of-order (OOO) execution lets the hardware look at a window of upcoming instructions and execute whichever ones have their inputs ready, regardless of program order. Results are still retired in program order, so software cannot tell the difference (except in timing).
The motivating example from the whiteboard:
ld t0, (a0) # cache miss: t0 not ready for ~200 cycles
add t1, t2, t3 # independent of t0 — why wait?
The add does not depend on t0 at all. An in-order machine would stall the add behind the slow load. An out-of-order machine runs the add while the load is still waiting on memory, hiding the load latency behind useful work.
flowchart LR
FE["Fetch / Decode<br/>(in order)"] --> REN["Rename<br/>(remove false deps)"]
REN --> SCHED["Scheduler<br/>(pick ready ops)"]
SCHED --> FU["Functional units<br/>(execute out of order)"]
FU --> ROB["Reorder Buffer"]
ROB --> RET["Retire<br/>(in order)"]
In-order vs. out-of-order timing for the snippet above (# marks where work happens):
In-order (stalls behind the load):
ld t0,(a0) |EX|--- waiting on memory (~200 cyc) ---|WB|
add t1,t2,t3 (cannot start until load clears) |EX|WB|
Out-of-order (add runs during the miss):
ld t0,(a0) |EX|--- waiting on memory ---------------|WB|
add t1,t2,t3 |EX|WB| <-- executed early, latency hidden
OOO typically buys another ~20-40% IPC over a well-tuned in-order core — a real but expensive win (millions of transistors, significant power). Combined with register renaming, it removes the false (write-after-write, write-after-read) dependencies that a small architectural register set otherwise imposes.
5. Privileged Instructions and OS Isolation (brief)¶
A real ISA has two worlds: user mode for application code and privileged (supervisor/machine) mode for the operating system. Privileged instructions — setting page tables, handling traps, configuring interrupts, talking to devices — are only legal in the privileged mode. This is the hardware foundation that lets an OS:
- Isolate processes from each other (one program cannot read another's memory),
- Manage resources (CPU time, physical memory, devices) fairly and safely,
- Mediate all access to hardware through controlled entry points (system calls / traps).
We do not implement privileged mode in CS 315 — it is covered in depth in CS 326 (Operating Systems). The takeaway is that the architecture foundation you built this semester is exactly what real-world systems software is layered on top of; understanding software complexity starts with understanding the machine underneath it.
6. Security: Spectre and Meltdown¶
Speculative and out-of-order execution were designed purely for speed. In 2018 researchers showed they also open a side channel: a way to leak data the program was never supposed to read, by observing timing rather than reading any value directly.
The cache timing side channel¶
The core trick: speculation can perform a load that is later squashed, but the load still pulls a cache line in. Afterward, the attacker probes memory and times the accesses — a fast access means that line was cached (so the secret-dependent address was touched), a slow access means it was not. The secret leaks through which address was cached.
flowchart TD
A["CPU speculates past<br/>a bounds check / fault"] --> B["Speculatively reads<br/>a secret byte"]
B --> C["Uses secret to index<br/>an array → loads array[secret*256]"]
C --> D["Speculation squashed:<br/>register state rolled back"]
D --> E["BUT cache line for<br/>array[secret*256] stays warm"]
E --> F["Attacker times array[i] for all i:<br/>the fast one reveals secret"]
Spectre vs. Meltdown¶
| Attack | What it abuses | What it leaks |
|---|---|---|
| Spectre | Speculation past a bounds check or branch (the CPU runs the wrong path far enough to touch secret-dependent data) | Data within the same address space across security boundaries (e.g., a sandbox / JIT) |
| Meltdown | Out-of-order execution running after a permission fault but before the fault is delivered | Kernel memory readable from user space (on affected CPUs) |
Why complete fixes are hard¶
These behaviors are not bugs in one product line — speculation and OOO are fundamental to how fast CPUs are built, and the cache side channel is intrinsic to having caches at all. Mitigations exist (kernel page-table isolation, speculation barriers, microcode updates, retpolines), but they often cost real performance, and researchers keep finding new variants. This is the ongoing security vs. performance trade-off: harden the CPU and you slow it down; chase peak speed and you reopen channels.
The practical lesson for this course: the very tricks that make modern CPUs fast are the same ones that make them hard to secure. You can only reason about either once you understand the underlying machine.
7. Lab 11, Question 1 — Sum-of-Products (Majority)¶
The exam-style problems begin with combinational design. The Majority function of three 1-bit inputs is true when at least two of a, b, c are 1.
Step 1: Truth table¶
The four rows where the output is 1 (two or more inputs high) are the minterms:
a |
b |
c |
majority |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 ✓ |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 ✓ |
| 1 | 1 | 0 | 1 ✓ |
| 1 | 1 | 1 | 1 ✓ |
Step 2: Sum-of-products expression¶
Write one product (AND) term per 1-row — a variable appears complemented if it is 0 in that row, uncomplemented if it is 1 — then OR the terms together:
(Here a' means NOT a.) This is the canonical SOP form. It can be simplified — note majority = ab + ac + bc — but the exam asks for the SOP derivation, so show the four minterms first.
Step 3: Circuit (AND / OR / NOT)¶
Four 3-input AND gates (one per product term, with NOT gates feeding the complemented inputs) drive a single 4-input OR gate:
a b c
| | |
+-+-+----[NOT a]--+
| |---[AND]---+ (a'·b·c)
+-----------------+ |
|
... three more AND gates ... +--[ OR ]--- majority
|
(a·b'·c), (a·b·c'), (a·b·c) --+
flowchart LR
a["a"] --> AND1 & AND2 & AND3 & AND4
b["b"] --> AND1 & AND2 & AND3 & AND4
c["c"] --> AND1 & AND2 & AND3 & AND4
AND1["AND<br/>a'·b·c"] --> OR["OR"]
AND2["AND<br/>a·b'·c"] --> OR
AND3["AND<br/>a·b·c'"] --> OR
AND4["AND<br/>a·b·c"] --> OR
OR --> M["majority"]
The same recipe solves Question 2 (xor3): list the four rows with an odd number of 1s as minterms — 001, 010, 100, 111 — and OR their product terms: xor3 = a'b'c + a'bc' + ab'c' + abc.
8. Lab 11, Question 3 — Digital Design: Max3¶
Given a library Max2 (a comparator that outputs 1 when A > B, plus a 2-input multiplexer selecting A when the comparator is 1, else B), build Max3(A, B, C) that outputs the largest of three 64-bit values. The constraint: you must build it from components and gates, not by instantiating Max2.
How it works¶
Cascade two comparator + MUX stages:
- Stage 1: Compare
AandB. The comparator outputs 1 iffA > B; feed that to a MUX that selectsAwhen 1, elseB. Call the resultM = max(A, B). - Stage 2: Compare
MandC. A second comparator outputs 1 iffM > C; feed that to a second MUX that selectsMwhen 1, elseC. The result ismax(M, C) = max(A, B, C) = Max.
cmp(A>B) cmp(M>C)
A ---+----[>]---sel +---[>]---sel
| | | |
A ---+--->[MUX]-+-- M = max(A,B) -->[MUX]--- Max
B ------->[ 0 ] | | M ->[ 1 ]
B ---+----[>] C -+----->[ 0 ]
flowchart LR
A1["A"] --> CMP1["cmp<br/>A > B"]
B1["B"] --> CMP1
A1 --> MUX1["MUX"]
B1 --> MUX1
CMP1 -->|"sel"| MUX1
MUX1 -->|"M = max(A,B)"| CMP2["cmp<br/>M > C"]
C1["C"] --> CMP2
MUX1 --> MUX2["MUX"]
C1 --> MUX2
CMP2 -->|"sel"| MUX2
MUX2 --> MAX["Max"]
This matches the handwritten Question 3 sketch: two comparator boxes (each producing a > result that is buffered/inverted as needed to drive its MUX select), two multiplexers drawn as the tall oval shapes, with A and B feeding the first stage and C joining at the second stage to produce Max.
The Question 4 (Sort2) variation reuses the same comparator: compute cmp = (A > B) once, then drive two MUXes from it — one selecting the smaller (OUT1 = cmp ? B : A) and one selecting the larger (OUT2 = cmp ? A : B). One comparator, two complementary MUXes.
9. Lab 11, Questions 5-6 — Single-Cycle Processor Reasoning¶
These questions test how well you understand the limits of the datapath you built. Each answer must be justified — answers without justification earn 0.
Architectural limits (justify from the hardware)¶
| Question | Answer | Justification |
|---|---|---|
| Max instructions in a program | 2^(ROM address bits) |
Instruction memory is a ROM; the number of addressable words = 2^(address-bit width). With, e.g., 10 address bits → 1024 instructions. |
| Registers updated per clock cycle | 1 | The register file has a single write port (WR, WD, WE). Only one register's write enable is asserted per cycle. |
| Max addressable data memory | 2^(addr bits) × bytes_per_word |
RAM capacity = number of entries × entry size. 7 address bits × 8 bytes = 1024 bytes. |
| What sets the minimum clock period | The longest combinational path (critical path) | A single-cycle clock must be long enough for the slowest instruction to propagate fetch → decode → regfile → ALU → memory → writeback. Usually a load: PC → IMEM → decode → regfile read → ALU (address) → DMEM read → MUX → regfile write. |
| Why both an ALU and a PC+4 adder | They run in parallel every cycle | The ALU computes the instruction's result (and branch/jump targets); the PC+4 adder simultaneously computes the next sequential address. One unit cannot do both in the same cycle, so the datapath needs both. |
Adding new instructions (the 5-step recipe)¶
For any "add this instruction" problem, walk the same five steps and be specific:
- Identify the instruction(s) and format (R/I/S/B/U/J).
- Components: what new hardware is needed (sign-extender variant, extra adder, etc.)?
- Datapath: which MUXes must be added/widened, what new wires?
- Control: which decoder control lines must be added or set?
- Decoder: how is the instruction detected (opcode/funct3/funct7 comparators) and which control bundle does it emit?
LWU (load word unsigned) — like LW but zero-extend the 32-bit word instead of sign-extending:
- Components: add a zero-extend path (upper 32 bits forced to 0) alongside the existing sign-extend.
- Datapath: widen the load-result MUX to select between sign-extended and zero-extended word.
- Control: add/extend the memory-size/extension control to pick zero-extension for LWU.
- Decoder: detect LWU by its opcode + funct3; emit
LD=1,M2R=1, register write, and the zero-extend selector.
SWPR swpr a0, a1 (swap two registers) — the genuinely hard one, because it violates the one-write-port rule:
- A single-cycle register file with one write port can update only one register per cycle. Swapping needs two writes in one cycle.
- Components: add a second write port to the register file (second
WR/WD/WE), or accept that this instruction cannot be single-cycle on the current hardware. - Datapath: route
RD0→ second register andRD1→ first register simultaneously. - Control/Decoder: a new opcode/funct that asserts both write enables and crosses the read-data lines.
- The justification — "the register file has one write port, so two simultaneous writes are impossible without adding a port" — is the whole point of the question.
(The ADDI2 variation has the same trap: writing both a0 and a1 in one cycle needs a second write port.)
10. Lab 11, Question 7 — Pipeline Behavior and Cycle Counting¶
The starter Project 07 pipeline has no hazard handling — no forwarding, no stalling, no flushing. Consider:
Why it produces the wrong result (parts 1-2)¶
add a0, a1, a2 reads a1 and a2 in its DE stage. But on the starter pipeline, those addi instructions have not yet written back when the add reads the register file. The add therefore reads the old values of a1 and a2 (both 0 at program start), so:
The register-file write happens in WB, several cycles after the value is needed in DE — a classic data hazard with nothing to fix it.
Fixing it without changing the hardware (part 3)¶
Insert independent instructions (NOPs) to separate the producers from the consumer so the writes complete before the add reads:
addi a1, zero, 3
addi a2, zero, 4
nop
nop
add a0, a1, a2 # now a1, a2 are already written back
unimp
The number of NOPs depends on the pipeline depth and whether the regfile uses an inverted write clock; the principle is to push the dependent read past the producers' WB stage.
Cycle counting on a complete (forwarding + stall + flush) solution¶
For an n-instruction straight-line program with no stalls, total cycles = n + (k - 1) where k is the pipeline depth (5 here) — fill the pipe, then retire one per cycle.
Cycle: 1 2 3 4 5 6 7
addi a1 : IF DE EX MEM WB
addi a2 : IF DE EX MEM WB
add a0 : IF DE EX MEM WB
unimp : IF DE EX MEM WB
With forwarding, add a0, a1, a2 gets a1 and a2 bypassed from the addis' EX/MEM results into its EX stage — no stall, no flush needed for this sequence. Four instructions through a 5-stage pipe ≈ 4 + 4 = 8 cycles (exact count depends on how unimp/halt is modeled).
The load-use sequence (parts 6-9)¶
- Forwarding needed? Yes —
addi a1feedssd a1(forward the store value), and theldresult feedsadd a0, a2, a2. - Stalling needed? Yes —
ld a2thenadd a0, a2, a2is a load-use hazard: the load's value is ready only at end of MEM, butaddwants it in EX. This is exactly the case the hazard unit from Section 1 handles:ld-stall = 1forcesstl-en = 0for one cycle, inserting a bubble, after which the value is forwarded. - Flushing needed? No — there are no taken branches or jumps in this code, so no wrong-path instructions are ever fetched.
The one-cycle stall is the reason a complete solution needs the hazard unit, and is why getting stl-en = (NOT ld-stall) AND en correct matters.
Key Concepts¶
| Concept | Definition | Example |
|---|---|---|
| Hazard unit | Combinational logic that detects pipeline hazards and controls stalling | Produces stl-en to freeze the PC and IF/DE register |
Stall-enable (stl-en) |
Enable that advances the front of the pipeline only when safe | stl-en = (NOT ld-stall) AND en |
| Load-use hazard | A load's result is needed by the very next instruction, one cycle too early | ld a2,… then add a0,a2,a2 → 1-cycle stall |
| Forwarding | Bypass a result from a later stage back to EX, avoiding a stall | ALU output of addi fed into the next add |
| Flushing | Discard wrongly-fetched (speculative/wrong-path) instructions | After a taken branch resolves |
| Speculative execution | Guess a branch outcome, execute the predicted path, verify later | Mispredict → flush; leaves cache residue |
| Out-of-order execution | Execute ready instructions regardless of program order; retire in order | add runs while a missed ld waits on DRAM |
| Spectre / Meltdown | Side-channel attacks reading squashed speculative state via cache timing | Time array[i] to recover a leaked secret byte |
| Sum-of-products (SOP) | OR of one AND (product) term per truth-table 1-row | maj = a'bc + ab'c + abc' + abc |
| Comparator + MUX | A > test selecting one of two inputs; the building block of Max/Sort |
M = (A>B) ? A : B |
| Critical path | Longest combinational delay; sets the minimum clock period | Load path: PC→IMEM→regfile→ALU→DMEM→MUX→regfile |
| Single write port | Register file can update only one register per clock cycle | Why swpr/addi2 need a second port |
Practice Problems¶
Problem 1: Complete the hazard logic¶
The hazard unit must hold the PC and IF/DE register during a load-use stall. Given inputs ld-stall and en, write the boolean expression for stl-en and state what each input value of stl-en does to the PC.
Click to reveal solution
- `stl-en = 1` → the PC's enable is asserted, so the PC updates (pipeline advances normally). - `stl-en = 0` → the PC's enable is de-asserted, so the PC **holds** its value, re-fetching the same instruction and creating a one-cycle bubble downstream. Truth table check — output is 1 in exactly one row (`ld-stall = 0`, `en = 1`): | ld-stall | en | stl-en | |:--------:|:--:|:------:| | 0 | 0 | 0 | | 0 | 1 | 1 | | 1 | 0 | 0 | | 1 | 1 | 0 | Gate: an AND gate with the `ld-stall` input inverted (bubbled).Problem 2: Speculation leaves a trace¶
A CPU speculatively executes t1 = array[secret] down a mispredicted path, then squashes it. The architectural registers are rolled back so t1's value is gone. Explain how an attacker could still learn secret.
Click to reveal solution
The squash rolls back **architectural** state (registers, memory) but **not** the cache. If the speculative path used `secret` to index memory — e.g., loaded `probe[secret * 256]` — that cache line stays warm. The attacker then **times** accesses to every `probe[i * 256]`: the one that is fast (a cache hit) reveals `i = secret`. The data leaks through *which line was cached*, a microarchitectural side effect of execution that "never happened" architecturally. This is the essence of Spectre/Meltdown.Problem 3: SOP for "exactly two" of three inputs¶
Derive the SOP expression for f(a,b,c) = 1 when exactly two of the three inputs are 1 (not three). Give the truth table rows and the expression.
Click to reveal solution
Exactly-two rows: `011`, `101`, `110` (but **not** `111`). | a | b | c | f | |:-:|:-:|:-:|:-:| | 0 | 1 | 1 | 1 | | 1 | 0 | 1 | 1 | | 1 | 1 | 0 | 1 | | 1 | 1 | 1 | 0 | Note this differs from `majority` only by *excluding* the `abc` term — majority includes the all-ones case, "exactly two" does not.Problem 4: Build Sort2 from one comparator¶
Using a single comparator that outputs gt = (A > B) and two 2-input MUXes, give the expressions for OUT1 (smaller) and OUT2 (larger).
Click to reveal solution
Compute `gt = (A > B)` once and feed it to both MUXes with opposite assignments: - If `A > B` (`gt = 1`): `OUT2 = A` (larger), `OUT1 = B` (smaller). Correct. - If `A <= B` (`gt = 0`): `OUT2 = B`, `OUT1 = A`. Correct (ties give `A = B`, either output works). One comparator drives two MUXes whose data inputs are swapped — no extra comparator needed.Problem 5: Why swpr cannot be single-cycle¶
On the Project 06 single-cycle processor, explain precisely why adding swpr a0, a1 (swap two registers) requires a hardware change that the other "add an instruction" problems do not.
Click to reveal solution
`swpr a0, a1` must write **both** `a0` and `a1` in the **same** clock cycle (`a0 ← old a1`, `a1 ← old a0`). The register file has a **single write port** (`WR`, `WD`, `WE`) — it can assert exactly one register's write enable per cycle. Two simultaneous writes are physically impossible on that hardware. Fix: add a **second write port** (a second `WR`/`WD`/`WE`), then route `RD1 → a0`'s write data and `RD0 → a1`'s write data and assert both write enables. Most other new instructions (e.g., `lwu`) only write one register and need no extra port. This is the justification the exam is looking for.Problem 6: Cycle count and hazards¶
On a complete Project 07 solution (forwarding, stalling, flushing), how many stalls does the following need, and why?
Click to reveal solution
**One stall.** - `ld t0` → `add t1, t0, t0` is a **load-use hazard**: the load's value is ready only at end of MEM, but the `add` wants it in EX. Forwarding alone cannot cover this one-cycle gap, so the hazard unit asserts `ld-stall`, dropping `stl-en` to 0 for one cycle (one bubble). After the bubble, `t0` is forwarded from MEM/WB into the `add`'s EX. - `add t1, …` → `add t2, t1, t1` is an ordinary ALU-to-ALU dependency: **forwarding** handles it with **no stall** (EX result of the first `add` bypassed straight into the second's EX). No branches/jumps → no flushes. Total extra cycles beyond the ideal `3 + 4 = 7` is the single bubble → 8 cycles.Further Reading¶
- Source notes (handwritten): "/notes/CS315-01 2025-12-02 Advanced Architecture Final Review.pdf"
- Lab 11 exam-style problems: /assignments/lab11/
- Modern Microprocessors: A 90-Minute Guide — Jason Patterson's survey of pipelining, superscalar, OOO, and speculation
- Spectre Attacks: Exploiting Speculative Execution — the original Spectre paper
- Meltdown: Reading Kernel Memory from User Space — the original Meltdown paper
- Patterson & Hennessy, Computer Organization and Design (RISC-V Edition) — Ch. 4 (pipelining and hazards), Ch. 5 (memory)
- RISC-V Specifications — instruction formats, load/store, privileged ISA
Summary¶
- The hazard unit decides each cycle whether the front of the pipeline advances; its load-use stall logic is
stl-en = (NOT ld-stall) AND en, drawn as an AND gate with theld-stallinput inverted. - Three hazards — data, control, structural — are resolved by forwarding, stalling, and flushing; the load-use case is the one that forces a true stall.
- Speculative execution guesses branch outcomes and executes ahead, flushing on a mispredict; the squash rolls back registers but leaves cache residue.
- Out-of-order execution runs ready instructions early (hiding load-miss latency) while retiring in program order — worth ~20-40% extra IPC.
- Spectre and Meltdown turn that cache residue into a timing side channel that leaks data across security boundaries; complete fixes are hard because speculation and caching are fundamental to performance.
- Sum-of-products design: one product term per truth-table 1-row, OR'd together, then realized with AND/OR/NOT gates — the recipe behind majority, xor3, and "exactly two."
- Comparator + MUX is the universal building block for Max3 (cascade two stages) and Sort2 (one comparator, two swapped MUXes).
- Single-cycle limits must be justified from the hardware: max instructions =
2^(ROM addr bits), exactly one register updated per cycle (single write port), and the clock period is set by the critical path — which is whyswpr/addi2need a second write port and why the final exam demands justification for every answer.