Lab: Project 07 and Final Review¶

Overview¶

This hands-on lab wrapped up the semester by combining two goals: finishing the Project 07 pipelined-processor Hazard Unit and reviewing for the final exam. We worked through the Lab 11 exam-like problems on paper, drew pipeline timing diagrams cycle-by-cycle for forwarding, stalling, and flushing, and reasoned about how to extend the single-cycle and pipelined datapaths with new instructions (LWU, SWPR). The emphasis throughout was procedural: how to trace a program through the five stages, how to count clock cycles, and how to decide which hazard mechanism (forward, stall, or flush) each dependency requires.

Learning Objectives¶

Draw a cycle-by-cycle pipeline timing diagram (IF, DR, EX, MEM, WB) for a short RISC-V program
Count the total clock cycles needed for a program on a fully hazard-managed pipeline
Decide when a dependency needs forwarding versus a stall versus a flush
Explain why inverting the RegFile clock removes one nop and how forwarding removes more
Trace the Hazard Unit logic (FRD0/FRD1, stall enables, flush clears) against the Project 07 spec
Extend the single-cycle datapath for sign-vs-zero extension (LWU) and for a two-register write (SWPR)
Identify and fix the most common Project 07 submission and wiring mistakes

Prerequisites¶

Project 06 single-cycle RISC-V processor (datapath, control, instruction decoder)
The five-stage pipeline and pipeline registers (IF/DR, DR/EX, EX/MEM, MEM/WB)
RISC-V instruction formats and the meaning of addi, add, ld, sd, jal, beq
Two's-complement sign extension versus zero extension
Digital (the logic simulator): MUXes, comparators, RAM, ROM, registers with EN/CLR

1. Session Roadmap and Final Exam Format¶

This was the last working session before the final. The plan was to use the Lab 11 exam-like problems as the review backbone, then return to Project 07 mechanics so everyone could finish the Hazard Unit.

What the final exam will look like:

Color circuit diagrams. Expect to read and annotate a datapath drawn in color. Lines you add (new datapath wires, MUXes, control lines) should be drawn clearly, the way the lab solutions were drawn.
Cycle counting and pipelining. Given a short program and a fully hazard-managed pipeline, count the clock cycles and identify every forward, stall, and flush.
Processor design / "add an instruction." Given the single-cycle or pipelined datapath, describe the datapath, control-path, component, and instruction-decoder changes needed to support a new instruction.
Sum-of-products and digital design. Truth tables to boolean equations to gate-level circuits (majority, XOR3, Max3, Sort2).

The project itself is cumulative in how it is weighted at the course level: roughly 25% pre-midterm content and 75% post-midterm content. The autograder is run as-is; any grading adjustments are handled afterward.

flowchart LR
    A[Lab 11 problems] --> B[Pipeline timing diagrams]
    B --> C[Hazard reasoning: forward / stall / flush]
    C --> D[Project 07 Hazard Unit]
    A --> E[Add-an-instruction: LWU, SWPR]
    E --> F[Final exam prep]
    C --> F
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#f9f,stroke:#333,stroke-width:2px

2. The Five-Stage Pipeline: Vocabulary We Will Reuse All Session¶

Every timing diagram in this lab uses the same five stages. Memorize the abbreviations because the diagrams are written entirely in them.

Stage	Letter	What happens
Instruction Fetch	I (IF)	Read the instruction word from instruction memory at PC; compute PC+4
Decode / Register read	D (DR)	Decode the instruction; read RR0/RR1 from the RegFile into RD0/RD1; build the immediate
Execute	E (EX)	ALU computes the result; Branch Unit compares; branch/jump target computed
Memory	M (MEM)	Load or store to data RAM (`ld`/`sd`); other instructions pass through
Write Back	W (WB)	Write the result back into the RegFile (WR/WD/WE)

Between every pair of stages sits a pipeline register that carries that instruction's signals forward one stage per clock:

   IF        DR        EX        MEM        WB
  [PC] -->[IF/DR]-->[DR/EX]-->[EX/MEM]-->[MEM/WB]--> RegFile write

A signal name in Project 07 carries a stage suffix that names the pipeline register it comes out of: _2 is the output of DR/EX (the instruction in EX), _3 is the output of EX/MEM (the instruction in MEM), and _4 is the output of MEM/WB (the instruction in WB). So RR0_2 is the read-register-0 selector of the instruction in EX, WR_3 is the destination register of the instruction in MEM, and RFW_4 is the register-file-write flag of the instruction in WB.

flowchart LR
    PC[PC] --> IFDR[IF/DR reg]
    IFDR --> DREX[DR/EX reg]
    DREX --> EXMEM[EX/MEM reg]
    EXMEM --> MEMWB[MEM/WB reg]
    MEMWB --> RF[(RegFile)]
    RF -. read RD0/RD1 .-> DREX

3. Reading a Pipeline Timing Diagram (Worked: Three `addi`/`add`)¶

The first board we drew is Question 7, part (4) from Lab 11. The program runs on a complete Project 07 solution (clock inversion + forwarding + stalling + flushing):

addi a1, zero, 3
addi a2, zero, 4
add  a0, a1, a2     # a0 should be 7
unimp               # marker / halt

A timing diagram places cycles across the top and one instruction per row. Each instruction occupies one stage per cycle, shifted right by one as the next instruction follows it down the pipe.

            cyc1   cyc2   cyc3   cyc4   cyc5   cyc6   cyc7
addi a1   |  I  |  D  |  E  |  M  | (W) |     |     |
addi a2   |     |  I  |  D  | (E) | (M) |  W  |     |
add  a0   |     |     |  I  |  D  | (E) |  M  |  W  |

The add finishes its WB in cycle 7. That circled (7) was the answer on the board.
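The diagram above can be regenerated mechanically. The following is a minimal Python sketch, assuming one instruction is fetched per cycle and nothing stalls or flushes (which holds for this program on the full pipeline). The names `STAGES`, `timing_rows`, and `last_writeback_cycle` are illustrative and not part of Project 07.

```python
STAGES = ["I", "D", "E", "M", "W"]

def timing_rows(names):
    """Return {name: [stage or '' per cycle]} for a hazard-free pipeline.

    Assumption: one instruction enters IF per cycle and nothing stalls.
    """
    total = len(names) + len(STAGES) - 1      # n + 4 cycles overall
    rows = {}
    for i, name in enumerate(names):
        row = [""] * total
        for s, stage in enumerate(STAGES):
            row[i + s] = stage                # instruction i enters IF in cycle i + 1
        rows[name] = row
    return rows

def last_writeback_cycle(names):
    """1-based cycle in which the final instruction is in WB."""
    return len(names) + len(STAGES) - 1

if __name__ == "__main__":
    prog = ["addi a1", "addi a2", "add  a0"]
    for name, row in timing_rows(prog).items():
        print(f"{name:8}|" + "|".join(f"{c:^3}" for c in row) + "|")
    print("last WB in cycle", last_writeback_cycle(prog))
```

Reading a column of the result gives the vertical alignment used in the forwarding argument: the consumer's E lines up with the producers' M and W.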

Why forwarding, not stalling, is enough here¶

The add reads a1 and a2 in its DR stage (cycle 4), but the producers of a1 and a2 (the two addis) have not yet written back. Look at the vertical alignment in cycle 5: the second addi is in MEM (M) and the add is entering EX (E). The first addi is in WB (W).

The add needs a1: produced by addi a1 (now in WB). Forward from the WB/MEM-WB side into EX.
The add needs a2: produced by addi a2 (now in MEM). Forward from the EX/MEM (ALU result) side into EX.

Both values exist somewhere in the pipeline by the time the add is in EX, so forwarding alone resolves the hazard. The circled forwarding arrows on the board ran from the addi results back down to the add's EX inputs. No stall is needed, no bubble is inserted.

Part (5) answer: No flushes and no stalls are needed for this program — only forwarding. The pipeline runs at full rate.

Contrast with the starter pipeline (no Hazard Unit): the add would read the RegFile in its DR stage (cycle 4), before either addi has written back, getting stale a1/a2 (both 0 if those registers started at 0), so a0 would be 0. That is Question 7 parts (1)-(3): the wrong result is 0, and you would fix it by inserting nops between the instructions.

4. The Forwarding Hazard Unit (Project 07, test `04-add-fwd`)¶

Forwarding is the highest-value piece of Project 07 (50 of 100 points). The idea: instead of waiting for a value to travel all the way to WB and back into the RegFile, route it directly from a later stage back into the EX inputs.

Datapath additions¶

Bring two already-computed values back to EX:
ALUR_3 — the ALU result sitting in the EX/MEM register (the instruction one ahead, in MEM).
MR_4 — the MEM/WB MUX output (the instruction two ahead, in WB).
Add two 3-input MUXes in the EX stage, one per operand:
RD0 MUX: input 0 = RD0, input 1 = MR_4, input 2 = ALUR_3; selector FRD0.
RD1 MUX: input 0 = RD1, input 1 = MR_4, input 2 = ALUR_3; selector FRD1.
The MUX output feeds everywhere the original RD0/RD1 used to connect (ALU input, store data, branch unit, etc.).

            +-----------+
   RD0 ---->|0          |
   MR_4 --->|1   MUX    |---> to ALU A (and wherever RD0 went)
  ALUR_3 -->|2          |
            +-----------+
                 ^
                FRD0   (from Hazard Unit)

Hazard Unit logic (for FRD0; FRD1 is symmetric with RR1_2)¶

if ((RR0_2 == WR_3) && (RFW_3)) {
    FRD0 = 2;          // forward ALU result from EX/MEM (closest producer)
} else if ((RR0_2 == WR_4) && (RFW_4)) {
    FRD0 = 1;          // forward from MEM/WB
} else {
    FRD0 = 0;          // no hazard: use the value read from the RegFile
}

Two things to internalize:

Match the read register to a later instruction's write register, and only forward if that later instruction actually writes (RFW). An instruction that does not write the RegFile must never be a forwarding source.
Closest wins. Test EX/MEM (_3) before MEM/WB (_4). The nearer instruction holds the most recent value.

The board example for "closest wins":

foo:
    addi a0, zero, 3
    addi a0, zero, 4
    add  a0, a0, a0     # must see 4, not 3

When the add is in EX, the addi a0, zero, 4 is in MEM (its result is ALUR_3 = 4) and addi a0, zero, 3 is in WB (MR_4 = 3). Both match a0, so the priority order forwards ALUR_3 = 4. Getting the priority backwards is the single most common forwarding bug.
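The priority rule is easy to check in software before wiring it in Digital. Below is a minimal Python sketch of the selector logic and the 3-input EX MUX, assuming the selector encoding used in the Hazard Unit logic above (0 = RegFile, 1 = MR_4, 2 = ALUR_3). The function names `forward_select` and `operand` are hypothetical, and registers are passed as plain integers (a0 is x10).

```python
def forward_select(rr_2, wr_3, rfw_3, wr_4, rfw_4):
    """Return the FRD selector for one EX operand.

    0 = value read from the RegFile, 1 = MR_4 (MEM/WB), 2 = ALUR_3 (EX/MEM).
    """
    if rr_2 == wr_3 and rfw_3:
        return 2          # closest producer (EX/MEM) is tested first
    if rr_2 == wr_4 and rfw_4:
        return 1          # otherwise fall back to MEM/WB
    return 0              # no hazard

def operand(frd, rd, alur_3, mr_4):
    """Model of the 3-input EX MUX driven by the selector."""
    return {0: rd, 1: mr_4, 2: alur_3}[frd]

if __name__ == "__main__":
    A0 = 10  # register number of a0
    # add a0, a0, a0 in EX; addi a0,zero,4 in MEM; addi a0,zero,3 in WB
    frd0 = forward_select(A0, wr_3=A0, rfw_3=1, wr_4=A0, rfw_4=1)
    print(frd0, operand(frd0, rd=0, alur_3=4, mr_4=3))
```

Swapping the two `if` tests reproduces the priority bug: the add would see 3 instead of 4.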

flowchart TD
    A["add a0, a0, a0 in EX wants a0"] --> B{"RR0_2 == WR_3 and RFW_3?"}
    B -- yes --> C["FRD0 = 2 -> forward ALUR_3 (=4)"]
    B -- no --> D{"RR0_2 == WR_4 and RFW_4?"}
    D -- yes --> E["FRD0 = 1 -> forward MR_4 (=3)"]
    D -- no --> F["FRD0 = 0 -> use RD0 from RegFile"]

5. The Load-Use Stall (Worked: `sd`/`ld`/`add`)¶

The second board is Question 7, parts (6)-(9). This is the case forwarding cannot fully solve, because a load's value is not ready until the end of MEM — too late to forward into the very next instruction's EX.

addi a1, zero, 3
sd   a1, 0(zero)
ld   a2, 0(zero)
add  a0, a2, a2     # a0 should be 6 (a2 = 3, loaded back from memory)
unimp

Here is the diagram we drew. B (in red on the board) marks the inserted bubble; the add's row is one cycle longer than the others because the dependent add is held back one cycle:

            c1    c2    c3    c4    c5    c6    c7    c8    c9
addi a1   | I  | D  | E  | M  | W  |    |    |    |    |
sd   a1   |    | I  | D  | E  | M  | W  |    |    |    |
ld   a2   |    |    | I  | D  | E  | M  | W  |    |    |
add  a0   |    |    |    | I  | D  |(B) | E  | M  | W  |   <- stalled 1 cycle

The add reaches WB in cycle 9 — that is the circled (9) answer.

Walking the timeline¶

ld a2 produces a2 at the end of its MEM stage (cycle 6).
The add would normally be in EX in cycle 6, but the load is still in MEM then. Forwarding from EX/MEM does not help because EX/MEM holds the load's address (the ALU result), not the loaded data, which is still being read from RAM.
So we stall one cycle: freeze the add (and everything behind it) in place for cycle 6, inserting a bubble. Now the add is in EX in cycle 7, and by then a2 is available in MEM/WB (MR_4) and can be forwarded.

So this program needs both a stall (one cycle) and forwarding (the load result is then forwarded from WB into the add's EX).

Answers to parts (6)-(9)¶

(6) Cycles: 9.
(7) Flushing: No. There is no taken branch or jump, so nothing in flight must be discarded.
(8) Stalling: Yes — exactly one cycle, caused by the load-use dependency between ld a2 and add a0, a2, a2.
(9) Forwarding: Yes — after the one-cycle stall, the loaded a2 is forwarded from the MEM/WB side into the add's EX inputs. (addi a1 -> sd a1 is also a forward of a1 into the store.)

Hazard Unit logic for the load-use stall (`05-ld-stl`)¶

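Detect "the instruction in MEM is a load that writes a register which the instruction in EX reads" (the _3 and _2 signals), then freeze the PC, IF/DR, and DR/EX so the consumer stays in EX, and clear the EX/MEM register so the consumer's premature result becomes a bubble (a NOP downstream):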

if ((RFW_3 == 1) && (MLD_3 == 1) &&
    ((RR0_2 == WR_3) || (RR1_2 == WR_3))) {
    PC_EN      = 0;        // freeze PC
    IF_DR_EN   = 0;        // freeze IF/DR
    DR_EX_EN   = 0;        // freeze DR/EX
    EX_MEM_CLR = 1;        // flush EX/MEM -> inject a bubble
} else {
    PC_EN      = EN_ORG;   // preserve original control
    IF_DR_EN   = 1;
    DR_EX_EN   = 1;
    EX_MEM_CLR = CLR_ORG;
}

The key discipline from the spec: preserve the original EN and CLR lines in the else branch (EN_ORG, CLR_ORG) so manual single-stepping and debugging still behave.
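The stall equations above can be mirrored in a few lines of Python to test corner cases before building the circuit. This is only a sketch: `stall_controls` is a hypothetical helper, registers are plain integers (a2 is x12), and the defaults `en_org=1`, `clr_org=0` are assumed values for the normal running state.

```python
def stall_controls(rfw_3, mld_3, rr0_2, rr1_2, wr_3, en_org=1, clr_org=0):
    """Return (PC_EN, IF_DR_EN, DR_EX_EN, EX_MEM_CLR) for the load-use check."""
    load_use = bool(rfw_3 and mld_3 and (rr0_2 == wr_3 or rr1_2 == wr_3))
    if load_use:
        # Freeze the front of the pipe and inject a bubble into EX/MEM.
        return (0, 0, 0, 1)
    # No stall: pass the original EN/CLR lines through unchanged.
    return (en_org, 1, 1, clr_org)

if __name__ == "__main__":
    A2 = 12
    # ld a2 followed by add a0, a2, a2
    print(stall_controls(rfw_3=1, mld_3=1, rr0_2=A2, rr1_2=A2, wr_3=A2))
    # same registers, but the producer is not a load
    print(stall_controls(rfw_3=1, mld_3=0, rr0_2=A2, rr1_2=A2, wr_3=A2))
```

The third test case below checks the pass-through behavior: with no hazard, the original EN and CLR values come out unchanged.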

flowchart LR
    L["ld a2 in MEM (MLD_3=1)"] -->|a2 ready only at end of MEM| STALL[Stall 1 cycle]
    A["add in EX reads a2 (RR0_2 == WR_3)"] -->|too early to forward| STALL
    STALL --> FWD["then forward a2 from MEM/WB into EX"]

6. Back-to-Back Loads Need Only a Single-Cycle Stall¶

The third board (bottom of the page) addressed a subtle question: if we have two loads in a row and then a use, how many stalls?

ld  a2, 0(zero)
ld  a3, 8(zero)
add a0, a2, a3     # depends on BOTH loads

Intuition might say "two stalls, one per load." But only one stall is needed. The diagram showed the add stalled exactly one cycle (the red E box and the red empty bubble boxes mark where the stall lands):

            c1    c2    c3    c4    c5    c6    c7    c8    c9
ld   a2   | I  | D  | E  | M  | W  |    |    |    |    |
ld   a3   |    | I  | D  | E  | M  | W  |    |    |    |
add  a0   |    |    | I  | D  |(--)| E  | M  | W  |    |   <- one bubble

Why one stall suffices:

ld a2 writes back in cycle 5; ld a3 writes back in cycle 6.
After a single stall, the add is in EX in cycle 6. By then:
a2 is already in the RegFile / MEM-WB region and can be forwarded.
a3 is in MEM/WB (ld a3 is in WB in cycle 6) and can be forwarded as MR_4.
The clock-inversion assumption (RegFile written on one clock edge, read on the other) means the most recent load's write-back lines up with the dependent read after just one bubble. No second stall is required.

This is exactly the kind of timing reasoning the final expects: don't add stalls mechanically — line up the producer's write-back with the consumer's read and add only as many bubbles as the alignment actually demands.
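The "only as many bubbles as needed" rule can be written as a short counting routine. The sketch below assumes the simple rule used in this lab: a bubble is needed only when an instruction reads the destination of the load immediately before it. The tuple layout `(op, dest, sources)` and the name `load_use_stalls` are illustrative choices, not part of the project.

```python
def load_use_stalls(program):
    """Count one-cycle bubbles for a list of (op, dest, sources) tuples.

    Assumption: only a load directly followed by a reader of its destination
    costs a bubble; every other dependency is covered by forwarding.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "ld" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

if __name__ == "__main__":
    back_to_back = [
        ("ld", "a2", ["zero"]),
        ("ld", "a3", ["zero"]),
        ("add", "a0", ["a2", "a3"]),
    ]
    print(load_use_stalls(back_to_back))
```

Only the pair (ld a3, add) is adjacent, so the count is one even though the add depends on both loads.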

7. Control Hazards: Flushing on Jumps and Branches¶

The remaining hazard class is control hazards: when jal/jalr/beq/bne/blt/bge change the PC, the instructions already fetched behind them must not commit.

In the Project 07 starter, jal needs four nops after it. With the Hazard Unit we instead resolve the jump in EX and flush the two instructions that were fetched behind it.

Test 06-jal-fls:

main:
    li   a0, 3
    jal  foo
    unimp           # marker: must NOT execute
foo:
    addi a0, a0, 4  # a0 should be 7
    ret

Datapath/control changes from the spec:

The second input to the PCBr MUX comes from the ALU result in EX (not from the MR_4 MUX).
The PCBr MUX selector comes from PCbr_2 (the EX-stage flag), not PCbr_4.
When a jump or taken branch is detected in EX, clear the IF/DR and DR/EX pipeline registers so the two wrongly-fetched instructions behind it become bubbles:

if (PCbr_2 == 1) {
    IF_DR_CLR = 1;     // flush the instruction in IF/DR
    DR_EX_CLR = 1;     // flush the instruction in DR/EX
} else {
    IF_DR_CLR = CLR_ORG;
    DR_EX_CLR = CLR_ORG;
}

flowchart TD
    J["jal in EX: PCbr_2 = 1"] --> P["PC <- ALU result (target)"]
    J --> F1["IF_DR_CLR = 1 (flush)"]
    J --> F2["DR_EX_CLR = 1 (flush)"]
    F1 --> N["wrongly-fetched 'unimp' becomes a bubble"]
    F2 --> N

Test 07-branch (a simple conditional branch) passes for free once this flush logic is correct, because a taken branch is the same control-hazard mechanism. Test 08-fibrec (computing fibrec(10) = 55) is the full program that only runs when forwarding, stalling, and flushing all work together.
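The flush equations from this section can be modeled the same way as the other Hazard Unit outputs. The sketch below is a hedged software model: `flush_controls` and `next_pc` are hypothetical names, `clr_org=0` is an assumed default, and `next_pc` assumes the first PCBr MUX input is PC + 4.

```python
def flush_controls(pcbr_2, clr_org=0):
    """Return (IF_DR_CLR, DR_EX_CLR) given the EX-stage PCbr_2 flag."""
    if pcbr_2 == 1:
        return (1, 1)             # discard both instructions fetched behind the jump
    return (clr_org, clr_org)     # otherwise preserve the original CLR line

def next_pc(pc, pcbr_2, alu_result):
    """PCBr MUX model: EX ALU result on a jump or taken branch, else PC + 4."""
    return alu_result if pcbr_2 == 1 else pc + 4

if __name__ == "__main__":
    print(flush_controls(1), hex(next_pc(0x08, 1, 0x0C)))
    print(flush_controls(0), hex(next_pc(0x08, 0, 0x0C)))
```

A conditional branch that is not taken leaves PCbr_2 at 0 in this model, so nothing is flushed and the fall-through instructions continue.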

Historical note: branch delay slots¶

Early RISC processors avoided this hardware entirely by defining the instruction right after a branch to always execute (the branch delay slot); the compiler filled it with useful work or a nop. Modern designs prefer hardware flushing/prediction over exposing the delay slot in the ISA.

8. Deciding Forward vs. Stall vs. Flush (Decision Procedure)¶

This is the heart of the exam. Given any program on the full pipeline, classify each dependency:

flowchart TD
    START["Dependency between instr A (producer) and B (consumer)?"] --> Q1{Control change? jal/jalr/taken branch}
    Q1 -- yes --> FLUSH["FLUSH the instructions fetched behind it"]
    Q1 -- no --> Q2{Is the producer a LOAD and B is the very next instruction needing it?}
    Q2 -- yes --> STALL["STALL one cycle, then FORWARD"]
    Q2 -- no --> Q3{Producer's result available in EX/MEM or MEM/WB when B is in EX?}
    Q3 -- yes --> FWD["FORWARD only"]
    Q3 -- no --> NONE["No action (no hazard)"]

A compact rule set:

Situation	Mechanism	Why
ALU result used by a later non-adjacent instruction	None / already in RegFile	Value written back in time
ALU result used by next 1-2 instructions	Forward (EX/MEM or MEM/WB)	Value exists in pipeline before consumer's EX
Load result used by the immediately following instruction	Stall 1 + forward	Load value not ready until end of MEM
Two loads then a combined use	Stall 1 + forward both	One bubble aligns both producers
`jal`/`jalr`/taken branch	Flush (clear IF/DR and DR/EX)	Fetched-behind instructions must not commit
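The rule set above fits in one small function. This is a sketch under the lab's assumptions (five stages, inverted RegFile clock, forwarding into EX only); `classify` and its `distance` parameter (1 means the consumer is the very next instruction) are illustrative names.

```python
def classify(producer_op, distance, control_change=False):
    """Pick the hazard mechanism for one producer/consumer pair.

    distance: how many instructions after the producer the consumer sits.
    control_change: True for jal, jalr, or a taken branch.
    """
    if control_change:
        return "flush"
    if producer_op == "ld" and distance == 1:
        return "stall+forward"    # load value not ready until end of MEM
    if distance in (1, 2):
        return "forward"          # value is in EX/MEM or MEM/WB
    return "none"                 # already written back to the RegFile

if __name__ == "__main__":
    for case in [("addi", 1), ("ld", 1), ("ld", 2), ("addi", 3)]:
        print(case, classify(*case))
    print(classify("jal", 1, control_change=True))
```

The function classifies one pair at a time; for a whole program, apply it to every dependency you find in the timing diagram.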

9. Counting Clock Cycles¶

Cycle counting is a guaranteed exam question. The procedure:

Base latency. The first instruction takes 5 cycles to reach WB; each subsequent instruction adds 1 cycle if there are no hazards. For n independent instructions: cycles = n + 4.
Add a cycle per stall (bubble). Each load-use stall adds 1.
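Flushes cost fetch slots rather than bubbles in an existing row. When a jal, jalr, or taken branch is resolved in EX, the two instructions fetched behind it are discarded, so the target instruction enters IF two cycles later than a fall-through instruction would have. Shift the target's row right accordingly when you draw the diagram.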

Worked counts from this session:

Program	Instructions (n)	Stalls	Cycles to last WB
`addi a1; addi a2; add a0`	3	0 (forward only)	3 + 4 = 7
`addi a1; sd a1; ld a2; add a0,a2,a2`	4	1 (load-use)	4 + 4 + 1 = 9
`ld a2; ld a3; add a0,a2,a3`	3	1 (single, back-to-back loads)	3 + 4 + 1 = 8

Always anchor on "when does the last instruction reach WB," and read it straight off the diagram. The formula is a check, not a substitute for drawing the diagram.
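As a quick check on hand-drawn diagrams, the counting rule can be written as a one-line function. This sketch covers only straight-line code with forwarding and load-use bubbles; it deliberately ignores flushes, and `cycles_to_last_wb` is a hypothetical name.

```python
def cycles_to_last_wb(n_instructions, stalls=0):
    """n + 4 base latency plus one cycle per bubble on a five-stage pipeline."""
    if n_instructions <= 0:
        return 0
    return n_instructions + 4 + stalls

if __name__ == "__main__":
    print(cycles_to_last_wb(3))        # addi; addi; add
    print(cycles_to_last_wb(4, 1))     # addi; sd; ld; add
    print(cycles_to_last_wb(3, 1))     # ld; ld; add
```

The three calls correspond to the three rows of the table above.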

10. Add-an-Instruction: LWU (Sign vs. Zero Extension)¶

Lab 11 Question 5(5) asks how to add LWU (load word unsigned) to the single-cycle processor. LW and LWU both read a 32-bit word from memory into a 64-bit register; the only difference is the upper 32 bits:

LW: sign-extend the loaded word (copy bit 31 into bits 63..32).
LWU: zero-extend the loaded word (fill bits 63..32 with 0).

LW   t0, (a0):   t0 = signext32->64( mem[a0] )
LWU  t0, (a0):   t0 = zeroext32->64( mem[a0] )

What changes¶

Datapath / components. The load path that extends a 32-bit value to 64 bits currently does sign extension only. Add the ability to select between sign-extend and zero-extend. The clean way: feed both a sign-extended and a zero-extended version into a small MUX and pick with a new control line (call it ZEXT or extend MSZ).

                +-- signext(word) --+
  word[31:0] ---+                   |--MUX--> 64-bit value to RegFile
                +-- zeroext(word) --+
                          ^
                        ZEXT  (1 for LWU, 0 for LW)

Control path. Add the new ZEXT (or extended MSZ) control line, asserted only for LWU.
Instruction decoder. LWU is opcode 0000011 (the load group) with funct3 = 110. Add comparators to detect funct3 == 110, and have that drive ZEXT = 1 while still asserting the normal load controls (M2R/WDsel select RAM, RFW = 1, ALU computes base+offset).
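The difference between the two extensions, and the decoder test for LWU, can be seen in a few lines of Python. This is a behavioral sketch only: the helper names are hypothetical, and `zext` stands in for the new control line described above.

```python
def zero_extend32(word):
    """Keep bits 31..0 and fill bits 63..32 with zeros."""
    return word & 0xFFFFFFFF

def sign_extend32(word):
    """Keep bits 31..0 and copy bit 31 into bits 63..32."""
    word &= 0xFFFFFFFF
    if word & 0x80000000:
        word |= 0xFFFFFFFF00000000
    return word

def load_word(word, zext):
    """Model of the extension MUX: zext=1 for LWU, zext=0 for LW."""
    return zero_extend32(word) if zext else sign_extend32(word)

def is_lwu(instr):
    """Decoder check: opcode 0000011 (bits 6..0) and funct3 110 (bits 14..12)."""
    return (instr & 0x7F) == 0b0000011 and ((instr >> 12) & 0x7) == 0b110

if __name__ == "__main__":
    print(hex(load_word(0x80000000, zext=0)))   # LW of a word with bit 31 set
    print(hex(load_word(0x80000000, zext=1)))   # LWU of the same word
```

For words with bit 31 clear the two loads return the same value, which is why a test program for LWU should load a word with the top bit set.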

Variation: LBU (load byte unsigned)¶

The same recipe applies to LBU (Lab 11 Q6-5): read 8 bits, then zero-fill bits 63..8. Reuse the same ZEXT MUX on the byte-load path; decode funct3 == 100 in the load group.

11. Add-an-Instruction: SWPR (Two Register Writes in One Cycle)¶

Lab 11 Question 5(6) asks for SWPR (swap registers): swpr a0, a1 swaps the two registers in one instruction, replacing the usual three-mv idiom.

# without SWPR:           # with SWPR:
mv t0, a0                 swpr a0, a1
mv a0, a1
mv a1, t0

The hard part: a swap must write two registers in the same clock cycle, but the standard RegFile writes only one (single WR/WD/WE).

RegFile changes¶

Add a second write port: WR2 (which register), WD2 (the data), WE2 (enable).
Internally, a swap routes a1's old value into a0 and a0's old value into a1:
Port 1: WR = a0, WD = RD1 (the old a1), WE = 1.
Port 2: WR2 = a1, WD2 = RD0 (the old a0), WE2 = 1.
Collision policy. If both ports somehow target the same register (e.g. swpr a0, a0), define a priority — give WD (port 1) priority over WD2 — or disallow it. Add a MUX inside the RegFile decode that resolves which write data wins per destination.
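A behavioral model makes the two-port write and the collision policy concrete. The following Python sketch is a toy, not the Digital component: the class `RegFile2W` and the helper `swpr` are hypothetical, registers are integers (a0 is x10, a1 is x11), and x0 is kept hardwired to zero as in RISC-V.

```python
class RegFile2W:
    """Toy 32-entry register file with two write ports."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rr0, rr1):
        return self.regs[rr0], self.regs[rr1]

    def write(self, wr, wd, we, wr2=0, wd2=0, we2=0):
        # Apply port 2 first so port 1 wins when both target the same register.
        if we2 and wr2 != 0:
            self.regs[wr2] = wd2
        if we and wr != 0:        # x0 stays zero
            self.regs[wr] = wd

def swpr(rf, ra, rb):
    """Swap two registers using both write ports in one step."""
    rd0, rd1 = rf.read(ra, rb)    # old values are read before either write
    rf.write(wr=ra, wd=rd1, we=1, wr2=rb, wd2=rd0, we2=1)

if __name__ == "__main__":
    rf = RegFile2W()
    rf.write(10, 3, 1)
    rf.write(11, 4, 1)
    swpr(rf, 10, 11)
    print(rf.read(10, 11))
```

Because both old values are read before either write happens, the swap needs no temporary register, which is the same property the hardware gets from reading in DR and writing in WB.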

Datapath / control / decoder changes¶

Datapath. Add the second write path: wire WR2/WD2/WE2 from the decoder and from the read-data outputs (WD2 = RD0, while WD = RD1).
Control. A new SWPR control line that asserts WE2 and selects the swap routing on the write-data MUXes.
Instruction format / decoder. SWPR is a new instruction, so you must choose an encoding (an unused opcode/funct combination) that carries two register fields. The decoder detects it and asserts the swap controls; RFW/WE and the new WE2 are both 1, and no immediate or ALU operation is needed.

Variation: ADDI2 (add immediate to two registers)¶

Lab 11 Q6-6 (addi2 a0, a0, 8 updates both a0 and a1) uses the same two-write-port machinery: port 1 writes rd = rd + imm, port 2 writes rd+1 = (rd+1) + imm. You additionally need a second ALU (or to reuse the adder) to compute the second sum, plus a read of rd+1.

flowchart LR
    RD0["RD0 = old a0"] --> P2["Port2: WR2=a1, WD2=RD0"]
    RD1["RD1 = old a1"] --> P1["Port1: WR=a0, WD=RD1"]
    P1 --> RF[(RegFile, 2 write ports)]
    P2 --> RF
    SW["SWPR control: WE=1, WE2=1"] --> RF

12. Project 07 Submission Mechanics and Common Mistakes¶

We closed the lab on logistics because several repos were failing the autograder for avoidable reasons.

Submission structure. The top-level processor circuit must be named project07.dig and committed at the expected path. The most common failure was a nested final directory — the files ended up one folder too deep, so the autograder could not find project07.dig. Fix: push a corrected repository immediately (ideally same day) to avoid extra penalties. There is no interactive grading for Project 07, so the autograder result is what counts; get the structure right.

Rubric reminder (where the points are):

Points	Test(s)	What it exercises
10	`00-add-3nop`, `01-add-2nop`, `02-jal`, `03-ld`	Starter + invert RegFile clock
50	`04-add-fwd`	Forwarding (FRD0/FRD1)
20	`05-ld-stl`	Load-use stall + bubble
10	`06-jal-fls`	Jump/branch flush
5	`07-branch`	Conditional branch (free once flush works)
5	`08-fibrec`	Full program, `fibrec(10) = 55`

The single cheapest win. Test 01-add-2nop passes by inverting the CLK input to the RegFile in the DR stage. WB then writes on the first half of the cycle and DR reads on the second half, removing one of the three nops. Tests 02-jal and 03-ld also start passing once you do this.
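The "three nops become two" claim follows from lining up the producer's WB with the consumer's DR. Here is a minimal Python sketch of that alignment, assuming no forwarding, a producer fetched in cycle 1, and a RegFile write that is visible to a same-cycle read only when the clock is inverted. The function name `nops_needed` is illustrative.

```python
def nops_needed(invert_regfile_clock):
    """Minimum nops between a producer and its consumer with no forwarding."""
    producer_wb = 5                        # fetched in cycle 1, so WB is cycle 5
    nops = 0
    while True:
        consumer_if = 1 + nops + 1         # fetched after the producer and the nops
        consumer_dr = consumer_if + 1
        if invert_regfile_clock:
            ok = consumer_dr >= producer_wb    # write first half, read second half
        else:
            ok = consumer_dr > producer_wb     # write only visible next cycle
        if ok:
            return nops
        nops += 1

if __name__ == "__main__":
    print(nops_needed(False), nops_needed(True))
```

Forwarding then removes the remaining nops for ALU-to-ALU dependencies, which is what test 04-add-fwd checks.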

Debugging procedure (from the lab):

Build the Hazard Unit incrementally — get forwarding working before touching stalls or flushes.
Paste the .dig test into your processor so you can run it directly in Digital and see where it diverges.
objdump the test so you can see hex instruction words next to assembly, then single-step to find the first instruction that misbehaves.
Add probes on the datapath (FRD0, FRD1, ALUR_3, MR_4, the EN/CLR lines) to confirm whether the bug is a wrong control line or a wrong value.

Common wiring mistakes:

Forgetting that the EX MUX output must replace RD0/RD1 everywhere, not just at the ALU input.
Getting the forwarding priority backwards (must test _3 before _4).
Not preserving EN_ORG/CLR_ORG in the stall/flush logic, which breaks manual stepping.
Driving the PCBr MUX selector from PCbr_4 instead of PCbr_2 for the flush case.

Key Concepts¶

Concept	Definition	Example
Pipeline register	A register between two stages that carries an instruction's signals forward one stage per clock	`DR/EX`, `EX/MEM`, `MEM/WB`
Data hazard	A consumer reads a register before the producer has written it back	`add a0,a1,a2` right after `addi a1,...`
Forwarding	Routing a result from EX/MEM or MEM/WB directly back into EX inputs	`FRD0 = 2` forwards `ALUR_3`
Load-use hazard	A load's value is needed by the very next instruction's EX, but isn't ready until end of MEM	`ld a2` then `add a0,a2,a2`
Stall (bubble)	Freezing earlier stages one cycle and injecting a NOP downstream	one bubble before a load-dependent `add`
Control hazard	A jump/taken branch makes already-fetched instructions invalid	instruction after `jal`
Flush	Clearing pipeline registers so wrongly-fetched instructions become bubbles	`IF_DR_CLR=1`, `DR_EX_CLR=1` on `jal`
Clock inversion (RegFile)	Writing on one clock edge and reading on the other to remove one NOP	passes `01-add-2nop`
Sign vs. zero extension	Fill upper bits with the sign bit (LW) vs. with zeros (LWU)	`lwu` zero-fills bits 63..32
Second write port	RegFile able to write two registers in one cycle	`swpr a0, a1` uses WR/WD and WR2/WD2

Practice Problems¶

Problem 1: Cycle count with forwarding only¶

On a complete Project 07 pipeline, how many cycles to the last write-back for:

addi t0, zero, 1
add  t1, t0, t0
add  t2, t1, t1

Solution:

          c1   c2   c3   c4   c5   c6   c7
addi t0 | I  | D  | E  | M  | W  |    |    |
add  t1 |    | I  | D  | E  | M  | W  |    |
add  t2 |    |    | I  | D  | E  | M  | W  |

Each `add` depends on the immediately previous instruction's ALU result.
When `add t1` is in EX (c4), `addi t0`'s result is in EX/MEM -> forward `ALUR_3`.
When `add t2` is in EX (c5), `add t1`'s result is in EX/MEM -> forward `ALUR_3`.
All dependencies are ALU-to-ALU: forwarding only, no stalls.
Cycles = 3 + 4 = 7. No flushes, no stalls.

Problem 2: Classify each dependency¶

For the program below on a full pipeline, label each dependency as forward, stall+forward, or flush:

ld   a0, 0(sp)
add  a1, a0, a0
beq  a1, a1, done
addi a2, zero, 9
done:
addi a3, zero, 7

Solution:

`ld a0` -> `add a1, a0, a0`: stall 1 + forward. The load result isn't ready until the end of MEM, and `add` is the next instruction needing it.
`add a1` -> `beq a1, a1, done`: a register dependency on `a1` for the branch comparison; forward `a1` into the branch unit / EX.
`beq` taken (a1 always equals itself) -> `addi a2` was fetched behind it: flush. The branch jumps to `done`, so `addi a2, zero, 9` must be discarded.
So: one stall (load-use), forwarding (load result and `a1` to the branch), and one flush (taken branch).

Problem 3: Forwarding priority¶

When add a0, a0, a0 reaches EX in this snippet, which value is forwarded for the first operand, and via which selector value?

addi a0, zero, 3
addi a0, zero, 4
add  a0, a0, a0

Solution:

When `add` is in EX:
`addi a0, zero, 4` is in MEM, so `ALUR_3 = 4`, `WR_3 = a0`, `RFW_3 = 1`.
`addi a0, zero, 3` is in WB, so `MR_4 = 3`, `WR_4 = a0`, `RFW_4 = 1`.
Both match `RR0_2 == a0`. The Hazard Unit tests `_3` first:

if ((RR0_2 == WR_3) && RFW_3)  ->  FRD0 = 2   // forwards ALUR_3 = 4

`FRD0 = 2`, forwarding the value `4`. Result `a0 = 4 + 4 = 8`. Forwarding the WB value (`3`) instead would be the classic priority bug.

Problem 4: Why one stall, not two, for back-to-back loads¶

Explain, with a timing argument, why ld a2; ld a3; add a0, a2, a3 needs only a single stall.

Solution:

          c1   c2   c3   c4   c5   c6   c7   c8
ld   a2 | I  | D  | E  | M  | W  |    |    |    |
ld   a3 |    | I  | D  | E  | M  | W  |    |    |
add  a0 |    |    | I  | D  |(--)| E  | M  | W  |

The `add` needs both `a2` (from `ld a2`) and `a3` (from `ld a3`).
The binding constraint is the most recent producer, `ld a3`, whose value is ready at the end of its MEM (c5).
After one bubble, the `add` is in EX in c6. At that point `a3` can be forwarded from `ld a3`'s MEM/WB result, and `a2` (ready a cycle earlier) is also available to forward.
A second stall would only delay a value that is already available, so one bubble is sufficient.
One stall. Last WB in c8 -> 3 + 4 + 1 = 8 cycles.

Problem 5: Add LWU to the datapath¶

List the datapath, control, and decoder changes needed to support lwu t0, (a0).

Solution:

Datapath/components: On the 32-bit load path, provide both a sign-extended and a zero-extended version of the loaded word, and add a 2-input MUX to choose between them.
Control: Add a new control line `ZEXT` (or extend `MSZ`) that selects zero-extend; assert it only for LWU.
Decoder: Detect the load opcode `0000011` with `funct3 = 110` using comparators; on a match, set `ZEXT = 1` while keeping the normal load controls (RFW = 1, M2R/WDsel selects RAM output, ALU computes base + offset). LW (`funct3 = 010`) leaves `ZEXT = 0`.
The only behavioral difference from LW is the upper-32-bit fill (zeros vs. sign bit).

Problem 6: Add SWPR to the RegFile¶

Describe how SWPR (swpr a0, a1) writes two registers in one cycle, including the collision policy.

Solution:

RegFile: Add a second write port `WR2`/`WD2`/`WE2`.
Routing for the swap:
Port 1: `WR = a0`, `WD = RD1` (the old value of `a1`), `WE = 1`.
Port 2: `WR2 = a1`, `WD2 = RD0` (the old value of `a0`), `WE2 = 1`.
Datapath: Wire `RD0` to `WD2` and `RD1` to `WD`; route the two destination register numbers to `WR` and `WR2`.
Control: A `SWPR` line asserts both `WE`/`WE2` and selects the swap write-data routing.
Decoder: Choose an unused opcode/funct encoding that carries two register fields; the decoder asserts the swap controls (no immediate, no ALU op).
Collision policy: If both ports target the same register (`swpr a0, a0`), give port 1 (`WD`) priority via a MUX in the RegFile write decode, or disallow the case. Without this, the result on a self-swap is undefined.

Summary¶

The five stages (I, D, E, M, W) and stage suffixes (_2, _3, _4) are the shared language of every timing diagram and every Hazard Unit equation in Project 07.
Draw the timing diagram first. Place cycles across the top, instructions down the side, shift each row right by one, and read the last WB to count cycles.
Forwarding resolves ALU-to-ALU hazards by routing ALUR_3 (EX/MEM) or MR_4 (MEM/WB) back into the EX MUXes; closest producer wins, so test _3 before _4, and only forward from instructions that actually write (RFW).
Load-use hazards need a stall plus a forward: freeze PC/IF-DR/DR-EX one cycle, clear EX/MEM to inject a bubble, then forward the loaded value. The sd/ld/add example takes 9 cycles.
Back-to-back loads need only one stall — align the most recent load's write-back with the dependent read and add only the bubbles the alignment actually requires.
Control hazards need a flush: resolve jal/branch in EX (PCBr MUX from the EX ALU result, selector PCbr_2), then clear IF/DR and DR/EX so wrongly-fetched instructions become bubbles.
Adding an instruction follows a fixed recipe — datapath, control line, component, decoder — whether it is sign-vs-zero extension (LWU/LBU) or a two-write-port operation (SWPR/ADDI2).
Get the submission structure right: top-level project07.dig, no nested final directory, build the Hazard Unit incrementally (forwarding -> stalling -> flushing), and the cheapest points come from inverting the RegFile clock.
