Processor Branches and RAM

Overview

This lecture covers the last two major pieces needed to finish the single-cycle RISC-V processor: conditional branches and data memory (RAM). We start by inventorying every instruction the processor must still support, grouped into data processing, control, and memory categories. We then build the Branch Unit (BU) that compares two register values to decide whether a branch is taken, and we work out how the PCsel logic chooses between PC+4 and the branch/jump target address. Finally we add a Digital RAM component for data memory (the stack), and we compute how to size it and convert byte addresses into doubleword addresses. These pieces complete the data path and control path for Project 6.

Learning Objectives

  • Enumerate the remaining RISC-V instructions (data processing, control, memory) the processor must support
  • Describe the two-step branch mechanism: compute the branch target address (BTA), then decide whether to take the branch
  • Compute BTA = PC + imm-b and explain why the immediate is a sign-extended B-type immediate
  • Design a Branch Unit that selects among =, !=, <, and >= comparisons using a BUOp control line
  • Combine PCsel and the branch outcome (PCbr) to conditionally update the PC
  • Size a Digital RAM data memory and convert a byte address into a doubleword address
  • Connect data memory to the ALU and instruction decoder, including the helper logic for lw/sw and lb/sb
  • Explain why programs need explicit initialization (set up sp, registers, and an unimp end marker)

Prerequisites

  • RISC-V instruction formats (R, I, S, B, J types) and opcodes (Project 4, Lab 03)
  • The Part 1 / Part 2 processor: PC, instruction memory, register file, ALU, and the three decoders (RegDecoder, ImmDecoder, InstDecoder)
  • The instruction decoder spreadsheet methodology and ROM-based control lines (Lecture 10)
  • Multiplexers, comparators, splitters, and the Digital RAM component (Lab 09, Lab 10)
  • Binary/hexadecimal conversion and bit masking/shifting (Project 1, Project 3)
  • JAL/JALR support and the PCsel MUX between PC+4 and the jump target address

1. Where We Are: The Remaining Instructions

By this point the processor can already execute I-type and R-type data processing instructions, plus jal and jalr for calls and returns. Today we finish the instruction set. The instructions still to be handled fall into three groups.

| Data Processing | Control | Memory |
|---|---|---|
| addi, add, sub, mul | jal, jalr | lb, sb |
| sll, srl, slli, srai | beq, bne, blt, bge | lw, sw |
| | | ld, sd |

The data processing and shift instructions are handled by the ALU and are mostly already in place. jal and jalr were added in Part 2 (calls and returns). The two genuinely new capabilities are:

  • Conditional branches (beq, bne, blt, bge) — choose between PC+4 and a target address based on a comparison of two registers.
  • Data memory (lb/sb, lw/sw, ld/sd) — load and store values from a RAM that holds the program's stack.
flowchart LR
    A[Instruction Word IW] --> B[InstDecoder]
    B --> C[Data Processing - ALU]
    B --> D[Control - PC update]
    B --> E[Memory - RAM]
    D --> F[Branch Unit + PCsel]
    E --> G[Load/Store helper logic]

The InstDecoder's job stays the same: decode the instruction word and set control lines. The new behavior is encapsulated in dedicated components (the Branch Unit and the data memory helper circuits) so the decoder does not grow unwieldy.
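As a rough illustration of the three-way grouping above, the sketch below classifies an instruction word by its major opcode (the low 7 bits). The opcode values are the standard RISC-V ones; the names `OPCODE_CLASS` and `classify` are hypothetical and only mirror the categories in the table, not the actual InstDecoder spreadsheet.

```python
# Standard RISC-V major opcodes (bits 6..0 of the instruction word),
# mapped to the three categories used in this lecture.
OPCODE_CLASS = {
    0b0110011: "data processing",  # R-type: add, sub, mul, sll, srl
    0b0010011: "data processing",  # I-type: addi, slli, srai
    0b1101111: "control",          # jal
    0b1100111: "control",          # jalr
    0b1100011: "control",          # beq, bne, blt, bge
    0b0000011: "memory",           # lb, lw, ld
    0b0100011: "memory",           # sb, sw, sd
}


def classify(iw: int) -> str:
    """Return the lecture's category for a 32-bit instruction word."""
    return OPCODE_CLASS.get(iw & 0x7F, "unknown")


print(classify(0x00000013))  # addi x0, x0, 0
print(classify(0x00000463))  # beq x0, x0, 8
```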

Lab 10 versus Project 6

This material is exercised first in Lab 10 and then completed in Project 6:

  • Lab 10 Part 1: addi (li), add, unimp.
  • Lab 10 Part 2: adds jal (call, j) and jalr (ret) for function calls and returns.

Most components are shared between the two parts, but Part 1 and Part 2 need separate instruction decoders because they produce different sets of control-line outputs. The submission file names matter for autograding:

Lab 10 Part1.dig   ->  inst-decode-part1.dig
Lab 10 Part2.dig   ->  inst-decode-part2.dig

You also submit a PDF of the control spreadsheet: delete most unused rows, choose Workbook and All sheets, then Download as PDF. In Project 6 the final circuit lives in a final/ subdirectory, with part1, part2, part3 showing incremental development. Note that Digital recursively searches subdirectories for components, so keep components isolated to avoid name conflicts and duplicates.


2. The Branch Mechanism

A conditional branch looks like:

beq t0, t1, label      # if (t0 == t1) goto label

The RISC-V assembler computes a PC-relative offset to label and encodes it as the B-type immediate. The branch is a two-step process:

  1. Compute the branch target address (BTA).

    BTA = PC + imm-b
    

    imm-b is the 64-bit sign-extended immediate from the B-type instruction format. It comes from your ImmDecoder. Because it is sign-extended, branches can jump both forward (positive offset) and backward (negative offset) — backward branches are how loops work.

  2. Determine whether to take the branch. Compare rs1 and rs2 (the register file's RD0 and RD1). Then update the PC to BTA conditionally: if the comparison is true, PC = BTA; otherwise PC = PC + 4.

This is the crucial difference from jal/jalr. Jumps always redirect the PC to the target; conditional branches redirect only if the comparison succeeds.

| Instruction class | PC update |
|---|---|
| Sequential (e.g., add, addi) | always PC + 4 |
| jal / jalr (jumps) | always the target address |
| beq / bne / blt / bge (branches) | target if comparison true, else PC + 4 |
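
Step 1 of the mechanism above (BTA = PC + imm-b) can be sketched in software. The bit positions follow the standard RISC-V B-type layout; the function names `imm_b` and `bta` are hypothetical, and this is only a model of what the ImmDecoder and ALU produce, not the circuit itself.

```python
MASK64 = (1 << 64) - 1


def imm_b(iw: int) -> int:
    """Decode the sign-extended B-type immediate from a 32-bit instruction word."""
    imm = (
        (((iw >> 31) & 0x1) << 12)     # imm[12]   <- bit 31
        | (((iw >> 7) & 0x1) << 11)    # imm[11]   <- bit 7
        | (((iw >> 25) & 0x3F) << 5)   # imm[10:5] <- bits 30..25
        | (((iw >> 8) & 0xF) << 1)     # imm[4:1]  <- bits 11..8
    )                                  # imm[0] is always 0
    if imm & (1 << 12):                # sign bit set: negative offset
        imm -= 1 << 13
    return imm


def bta(pc: int, iw: int) -> int:
    """Branch target address, wrapped to 64 bits like the hardware adder."""
    return (pc + imm_b(iw)) & MASK64


# 0xFE0288E3 is a hand-encoded "beq t0, zero, -16" (an illustrative word).
print(imm_b(0xFE0288E3), hex(bta(0x40, 0xFE0288E3)))
```

The negative case is the interesting one: without the sign-extension step the offset would decode as 8176 and the branch would land far past the loop instead of 16 bytes before it.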

The Four Branch Comparisons

| Instruction | Meaning | Comparison of rs1, rs2 |
|---|---|---|
| beq | branch if equal | rs1 == rs2 |
| bne | branch if not equal | rs1 != rs2 |
| blt | branch if less than | rs1 < rs2 (signed) |
| bge | branch if greater or equal | rs1 >= rs2 (signed) |

Note that bgt rs1, rs2, label is a pseudo-instruction implemented as blt rs2, rs1, label — swapping the operands turns "greater than" into "less than," so the hardware only needs the four comparisons above.

flowchart TD
    A[Fetch branch instruction] --> B["Compute BTA = PC + imm-b (ALU)"]
    A --> C["Read rs1, rs2 from RegFile"]
    C --> D{"Branch Unit: comparison true?"}
    D -- yes --> E[PC = BTA]
    D -- no --> F[PC = PC + 4]

We must do three things in hardware: compute the BTA, do the comparison, and update the PC based on the result. The BTA computation reuses the ALU (we already use the ALU to compute the jump target for jal/jalr), so the comparison logic must live somewhere else — in a dedicated Branch Unit.


3. The Branch Unit (BU)

Because the ALU is already busy computing the BTA, the comparisons go into a separate Branch Unit. The BU takes the two register values and a control line that selects which comparison to perform; it outputs a single bit, take_branch (also called PCbr), that says whether the branch should be taken.

Inputs

  • A (64 bits): the value of rs1 (the register file's RD0).
  • B (64 bits): the value of rs2 (the register file's RD1).
  • BUOp (2 bits): selects which comparison to apply.

Output

  • take_branch / PCbr (1 bit): high when the branch should be taken.

Internally, the BU feeds A and B into four parallel comparators — =, !=, <, and >= — and uses a 4-input MUX driven by BUOp to select the result of the comparison that matches the current branch instruction.

                                   BUOp (2)
                                      |
        A (64) ----+----[ = ]----0 \  |
                   |               |  \
                   +----[ != ]---1 |   MUX --> take_branch (PCbr)
                   |               |  /
        B (64) ----+----[ < ]----2 | /
                   |               |/
                   +----[ >= ]---3

Mapping the BUOp selector to the comparison (one natural ordering):

| BUOp | Comparison | Branch |
|---|---|---|
| 00 | A == B | beq |
| 01 | A != B | bne |
| 10 | A < B | blt |
| 11 | A >= B | bge |

flowchart LR
    A["A = rs1 (64)"] --> EQ["="]
    B["B = rs2 (64)"] --> EQ
    A --> NE["!="]
    B --> NE
    A --> LT["<"]
    B --> LT
    A --> GE[">="]
    B --> GE
    EQ --> M["MUX (BUOp)"]
    NE --> M
    LT --> M
    GE --> M
    M --> O["take_branch (PCbr)"]

Design Notes from Class

  • Don't decode the funct3 directly inside the BU. You could drive the comparison selection straight from the instruction's funct3 field, but RISC-V branch funct3 codes are not contiguous (beq=000, bne=001, blt=100, bge=101), which would force awkward MUX wiring or dummy inputs. It is cleaner to define a tidy 2-bit BUOp control line in the InstDecoder spreadsheet and map each branch instruction to it.
  • Add a "BU off" state. Non-branch instructions should not accidentally signal a taken branch. The control spreadsheet should set things up so that for non-branch instructions the BU output is forced to 0 (or PCsel is 0, which has the same effect — see the next section). One option is to encode a "branch unit disabled" mode in the control lines.
  • One comparator can do it all. A single subtractor/comparator can in principle produce equal, not-equal, less-than, and greater-or-equal flags simultaneously, which avoids four separate comparator components. Either approach is acceptable; the four-comparator version is easiest to read.

The BU is a combinational component — given A, B, and BUOp, it produces take_branch immediately, with no clock needed.
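
A behavioral sketch of the Branch Unit may help when checking the circuit against expected outputs. It uses the BUOp ordering from the table above and treats blt/bge as signed comparisons, as RISC-V defines them. The names `FUNCT3_TO_BUOP`, `to_signed64`, and `branch_unit` are hypothetical, and the "BU off" state is not modeled here.

```python
MASK64 = (1 << 64) - 1

# Non-contiguous branch funct3 codes mapped onto the tidy 2-bit BUOp.
FUNCT3_TO_BUOP = {0b000: 0b00,  # beq
                  0b001: 0b01,  # bne
                  0b100: 0b10,  # blt
                  0b101: 0b11}  # bge


def to_signed64(x: int) -> int:
    """Interpret a 64-bit pattern as a two's-complement signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def branch_unit(a: int, b: int, buop: int) -> int:
    """Return take_branch (PCbr): 1 if the selected comparison holds."""
    sa, sb = to_signed64(a), to_signed64(b)
    comparators = (sa == sb, sa != sb, sa < sb, sa >= sb)  # four parallel comparators
    return int(comparators[buop & 0b11])                   # 4-input MUX driven by BUOp


print(branch_unit(3, 0, FUNCT3_TO_BUOP[0b000]))       # beq with 3 vs 0
print(branch_unit(MASK64, 0, FUNCT3_TO_BUOP[0b100]))  # blt with -1 vs 0
```

Like the hardware, the function is purely combinational: the output depends only on the current inputs.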


4. PC Selection: Combining PCsel and the Branch Outcome

The processor already has a PCsel MUX that chooses between PC+4 and the target address (used for jal/jalr). For branches we need that choice to depend on both the decoder (PCsel) and the runtime comparison result (PCbr). The agreed policy:

  • PCsel = 1 for branch instructions (and jumps), PCsel = 0 for non-branch instructions such as add.
  • When PCsel = 0, the PC always advances to PC+4, regardless of any branch/jump signals.
  • When PCsel = 1, whether we use the target depends on the branch outcome.

Two implementations were discussed in class.

Option 1: A Pre-MUX Selected by PCbr

Use a small inner MUX driven by PCbr to choose between PC+4 and BTA, then feed that into the main PCsel MUX (which also selects the jump target, JTA). This keeps PCsel as the top-level "are we redirecting the PC?" signal and lets the branch outcome decide the branch target separately.

                 PCsel (selects PC+4 / JTA / branch-result)
                    |
   PC+4 ----------0 |
   JTA  ----------1  MUX ----> PC
   (branch result) 2 |
                    |
        PCbr        |
          |         |
   PC+4 -0 \         |
           MUX ------+   (inner MUX: PCbr picks PC+4 or BTA)
   BTA  -1 /

Option 2 (rejected): Gate PCsel with PCbr

Kevin's simpler-looking idea was to AND PCsel with PCbr and use that to drive a MUX choosing among PC+4, JTA, and BTA. In class this option was crossed out — the gating interacts badly with the always-redirect jumps, so the cleaner pre-MUX approach (Option 1) is preferred.

flowchart TD
    PCbr["PCbr (from Branch Unit)"] --> IMUX["inner MUX"]
    PC4a["PC+4"] --> IMUX
    BTA["BTA (from ALU)"] --> IMUX
    IMUX --> PMUX["PCsel MUX"]
    PC4b["PC+4"] --> PMUX
    JTA["JTA"] --> PMUX
    PCsel["PCsel (from InstDecoder)"] --> PMUX
    PMUX --> PC["PC register"]

The key behavioral guarantee: if PCsel = 0, the next PC is PC+4 no matter what the Branch Unit or jump logic says. This is what makes ordinary sequential instructions correct. The InstDecoder spreadsheet must therefore add the new control bits (PCsel, BUOp) and set PCsel = 0 for every existing non-branch instruction.
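
The Option 1 wiring can be modeled in a few lines. This sketch assumes the three-input PCsel encoding drawn in the ASCII diagram (0 = PC+4, 1 = JTA, 2 = branch result); that encoding is an assumption for illustration, and a design that keeps PCsel at 1 bit would wire the outer MUX differently. The function name `next_pc` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def next_pc(pc: int, pcsel: int, pcbr: int, jta: int, bta: int) -> int:
    """Model of the inner (PCbr) MUX feeding the outer (PCsel) MUX."""
    pc4 = (pc + 4) & MASK64
    branch_result = bta if pcbr else pc4       # inner MUX: PCbr picks PC+4 or BTA
    return (pc4, jta, branch_result)[pcsel]    # outer MUX: assumed 0/1/2 encoding


# PCsel = 0 ignores both the jump target and the Branch Unit output.
print(hex(next_pc(0x100, 0, 1, 0x200, 0x300)))
# A branch redirects only when PCbr = 1.
print(hex(next_pc(0x100, 2, 1, 0x200, 0x300)))
print(hex(next_pc(0x100, 2, 0, 0x200, 0x300)))
```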

New Control Lines

| Signal | Width | Meaning |
|---|---|---|
| PCsel | 1 (or wider) | Selects whether PC is redirected; chooses PC+4 vs. target |
| BUOp | 2 | Selects the Branch Unit comparison (beq/bne/blt/bge) |
| PCbr / take_branch | 1 | Branch Unit output: 1 if the comparison succeeded |


5. Worked Example: A Loop Using a Branch

To see the branch path exercised end-to-end, consider a countdown loop. RISC-V assembly:

main:
    li   t0, 3          # counter = 3
    li   t1, 0          # accumulator = 0
loop:
    beq  t0, zero, done # if counter == 0, exit loop
    add  t1, t1, t0     # acc += counter
    addi t0, t0, -1     # counter -= 1
    jal  loop           # unconditional jump back (like j loop, but plain jal also writes ra)
done:
    add  a0, t1, zero   # a0 = acc (result)
    unimp               # end marker

This computes 3 + 2 + 1 = 6 into a0. Trace the relevant control decisions:

| Instruction | PCsel | BUOp | PCbr | Next PC |
|---|---|---|---|---|
| beq t0, zero, done (t0 = 3) | 1 | 00 (==) | 0 (3 != 0) | PC + 4 |
| add t1, t1, t0 | 0 | x | x | PC + 4 |
| addi t0, t0, -1 | 0 | x | x | PC + 4 |
| jal loop | 1 | x (jump) | n/a | JTA (loop) |
| ... (after t0 reaches 0) | | | | |
| beq t0, zero, done (t0 = 0) | 1 | 00 (==) | 1 (0 == 0) | BTA (done) |

The backward jal loop works because the J-type immediate (and the B-type immediate for branches) is sign-extended, so the offset can be negative. The loop exits exactly when beq finally sees t0 == 0 and PCbr goes high, redirecting the PC to done.
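
To double-check the trace, the loop can be stepped through in a small model. The sketch assumes each li assembles to a single instruction, so instruction i sits at byte address 4*i; the function name `run_countdown` and the dictionary of registers are hypothetical scaffolding, not part of the processor.

```python
def run_countdown(start: int = 3):
    """Step through the countdown loop; return (a0, list of PCbr values seen at beq)."""
    regs = {"zero": 0, "t0": 0, "t1": 0, "a0": 0}
    LOOP, DONE = 2 * 4, 6 * 4          # byte addresses of the two labels
    pc, pcbr_trace = 0, []
    while True:
        idx, nxt = pc // 4, pc + 4     # default next PC is PC + 4
        if idx == 0:                   # li t0, start
            regs["t0"] = start
        elif idx == 1:                 # li t1, 0
            regs["t1"] = 0
        elif idx == 2:                 # beq t0, zero, done
            pcbr = int(regs["t0"] == regs["zero"])
            pcbr_trace.append(pcbr)
            if pcbr:
                nxt = DONE             # taken: PC = BTA
        elif idx == 3:                 # add t1, t1, t0
            regs["t1"] += regs["t0"]
        elif idx == 4:                 # addi t0, t0, -1
            regs["t0"] -= 1
        elif idx == 5:                 # jal loop: always redirect
            nxt = LOOP
        elif idx == 6:                 # add a0, t1, zero
            regs["a0"] = regs["t1"] + regs["zero"]
        else:                          # unimp: stop
            break
        pc = nxt
    return regs["a0"], pcbr_trace


print(run_countdown(3))
```

The beq executes four times for a starting count of 3: three times not taken, then once taken when t0 reaches 0.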


6. Data Memory: Adding RAM

Programs need somewhere to store and load data. For our processor this is the stack (arrays, strings, saved registers, and the calling convention), though a heap could be added the same way. The processor must support ld/sd (doubleword), lw/sw (word), and lb/sb (byte).

We use Digital's RAM (Separated Ports) component. You configure two things:

  • Data bits — the width of each stored element. We use 64 data bits so each cell holds a doubleword. This makes ld/sd trivial.
  • Address bits — the number of elements (cells).

The completed processor now has all the major sub-circuits side by side:

 +----+   +--------+   +--------+   +-----+   +-----------+
 | PC |   | Inst   |   | Reg    |   | ALU |   | Data Mem  |
 |    |   | Mem    |   | File   |   | BU  |   | RAM       |
 +----+   +--------+   +--------+   +-----+   +-----------+
                                                  ^
                                                  |
                                                stack

Sizing the RAM

Worked example from class. We want a 1024-byte data memory built from 64-bit cells.

64 = 2^6 bits per cell
2^3 bytes per cell = 8 bytes   (64 bits / 8 = 8)

How many cells (n) for 1024 bytes?
    2^3 (bytes/cell) * n = 1024
    n = 1024 / 8 = 128 = 2^7

So:  2^3 * 2^7 = 2^10 = 1024 bytes

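That means the RAM needs 7 address bits (2^7 = 128 cells) and 64 data bits per cell, for a total of 1024 bytes, so valid byte addresses run from 0 through 1023. In the test programs the stack pointer is initialized to the top of this region, for example li sp, 1024. That value is one past the last valid byte, so a function typically lowers sp before it stores anything on the stack.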

| Quantity | Value |
|---|---|
| Bytes per cell | 8 (2^3) |
| Bits per cell | 64 (2^6) |
| Number of cells | 128 (2^7) |
| Address bits | 7 |
| Total size | 1024 bytes (2^10) |
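
The sizing arithmetic generalizes to any power-of-two memory. The helper below is a minimal sketch of that calculation; the name `ram_config` is hypothetical, and it simply reproduces the table's numbers for a given total size and cell width.

```python
def ram_config(total_bytes: int, data_bits: int = 64):
    """Return (number of cells, address bits) for a RAM of the given size."""
    bytes_per_cell = data_bits // 8            # 64 bits -> 8 bytes per cell
    cells = total_bytes // bytes_per_cell
    if cells < 1 or cells & (cells - 1):       # address bits only work out for powers of two
        raise ValueError("cell count must be a power of two")
    return cells, cells.bit_length() - 1       # log2 of a power of two


print(ram_config(1024))  # the worked example from class
```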

Byte Address vs. Doubleword Address

The ALU computes the target memory address as a byte address, because all addresses live in registers as byte addresses. But the RAM's A (ADDR) input expects a doubleword (DW) address — an index into 8-byte cells. We must convert:

DW_addr = byte_addr / 8 = byte_addr >> 3

In hardware you do this with a splitter: drop the low 3 bits of the byte address and feed the remaining high bits into the RAM's address input. (The low 3 bits are the byte offset within a doubleword.)

byte_addr (from ALU):
   bit:  ... 9 8 7 6 5 4 3 | 2 1 0
              \-----------/   \---/
              DW address      byte-in-DW offset
              (to RAM ADDR)   (discarded for ld/sd)
flowchart LR
    ALU["ALU result (byte address, 64b)"] --> SP["splitter: drop low 3 bits"]
    SP --> RAM["RAM ADDR (DW address)"]
    RAM -->|D out 64b| LD["load logic"]
    SI["store logic"] -->|D in 64b| RAM
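
The splitter's job can be expressed as shifts and masks. This sketch breaks a byte address into the three fields used in this lecture; the function name `split_byte_address` is hypothetical.

```python
def split_byte_address(byte_addr: int):
    """Return (DW address, word index within the DW, byte index within the DW)."""
    dw_addr = byte_addr >> 3             # drop the low 3 bits: index of the 8-byte cell
    word_index = (byte_addr >> 2) & 1    # bit 2: lower or upper 32-bit word
    byte_index = byte_addr & 0b111       # bits 0..2: byte within the cell
    return dw_addr, word_index, byte_index


print(split_byte_address(1008))  # a doubleword-aligned address inside a 1024-byte RAM
print(split_byte_address(0x2C))  # a word-aligned address in the upper half of its cell
```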

7. Connecting RAM and Supporting Sub-Word Access

The RAM connects to the ALU (which computes the address) and to the InstDecoder (which provides control lines). Loads route the RAM's D output back to the register file; stores route a register value into the RAM's Din.

New control lines from the InstDecoder for memory operations:

| Signal | Meaning |
|---|---|
| LD (ld) | RAM read enable (load) |
| ST (str) | RAM write enable (store) |
| MSZ | Memory size: byte / word / doubleword |
| M2R (or expanded WDsel) | Selects RAM output to write back to RegFile |

For loads, you either expand the existing WDsel MUX or add a new two-input M2R MUX that selects between the ALU result and the RAM output and feeds the register file's write-data input.

ld/sd (doubleword) — the easy case

Because each cell is 64 bits, ld reads a full cell at the DW address and sd writes a full cell. No sub-word logic is needed.

lw/sw (word) — read-modify for stores

The RAM stores 64-bit cells but a word is only 32 bits, so we add helper logic. Crucially, keep the RAM component at the top level of the processor so you can open it during simulation; add load logic after the RAM and store logic before the RAM.

Load word. The ALU computes a 4-byte-aligned byte address; the splitter converts it to a DW address. Read the 64-bit cell, split it into the lower 32 bits (0-31) and upper 32 bits (32-63), and use a MUX to pick which half. The selector is bit 2 of the byte address (the word index within a doubleword — bits 0-1 are the byte index). Sign-extend the chosen 32-bit value to 64 bits.

byte_addr bit 2 = word index inside the doubleword
   bit 2 = 0  -> lower word (bits 0..31)
   bit 2 = 1  -> upper word (bits 32..63)
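
The load-word path described above (split, MUX on bit 2, sign-extend) can be modeled as follows. This is a behavioral sketch with a hypothetical function name, `load_word`; values are kept as unsigned 64-bit patterns, the way they appear on the wires.

```python
def load_word(cell: int, byte_addr: int) -> int:
    """Pick the 32-bit half of a 64-bit cell selected by bit 2, sign-extended to 64 bits."""
    upper = (byte_addr >> 2) & 1                   # bit 2 = word index inside the doubleword
    word = (cell >> 32 if upper else cell) & 0xFFFFFFFF
    if word & 0x80000000:                          # copy bit 31 into bits 32..63
        word |= 0xFFFFFFFF00000000
    return word


cell = 0x1122334480000001
print(hex(load_word(cell, 0x28)))  # lower word, negative, so sign-extended
print(hex(load_word(cell, 0x2C)))  # upper word, positive
```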

Store word. Stores are harder: we must preserve the other 32 bits of the cell. Set both ld and str high so we read the current 64-bit value (D64cur) and write back in the same clock cycle. Take the update value from RD1 (D64in) and extract its lower 32 bits (Wnew). Using splitters, extract W0 (bits 0-31) and W1 (bits 32-63) from D64cur. With mergers, build the two candidate cells: Wnew:W1, where Wnew replaces the lower word and W1 is kept, and W0:Wnew, where W0 is kept and Wnew replaces the upper word. A MUX selected by bit 2 of the byte address picks the right one: 0 selects Wnew:W1 and 1 selects W0:Wnew. That result feeds the MSZ MUX, then the RAM Din.
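
The read-modify-write for sw can be sketched in the same style. The variable names follow the text (D64cur, D64in, Wnew, W0, W1); the function name `store_word` is hypothetical, and the return value is the 64-bit pattern that would be presented to the RAM's Din.

```python
def store_word(d64cur: int, d64in: int, byte_addr: int) -> int:
    """Merge the low 32 bits of d64in into the half of d64cur chosen by bit 2."""
    wnew = d64in & 0xFFFFFFFF                 # word to store (low half of RD1)
    w0 = d64cur & 0xFFFFFFFF                  # current lower word, bits 0..31
    w1 = (d64cur >> 32) & 0xFFFFFFFF          # current upper word, bits 32..63
    replace_lower = (w1 << 32) | wnew         # keep W1, overwrite the lower word
    replace_upper = (wnew << 32) | w0         # keep W0, overwrite the upper word
    return replace_upper if (byte_addr >> 2) & 1 else replace_lower


cur = 0x1111111122222222
print(hex(store_word(cur, 0xAAAAAAAABBBBBBBB, 0x0)))
print(hex(store_word(cur, 0xAAAAAAAABBBBBBBB, 0x4)))
```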

lb/sb (byte)

Derive byte support the same way as word support, but operate on 8-bit slices using bits 0-2 of the byte address to pick which byte of the cell.
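
One way to picture the byte case is as a shift by 8 times the byte index. The sketch below assumes byte 0 is bits 0-7 of the cell, matching the little-endian numbering used for words above; the function names `load_byte` and `store_byte` are hypothetical, and a circuit built from splitters and an 8-input MUX would compute the same values.

```python
MASK64 = (1 << 64) - 1


def load_byte(cell: int, byte_addr: int) -> int:
    """Select one byte of the cell with bits 0..2 and sign-extend it to 64 bits."""
    shift = (byte_addr & 0b111) * 8
    byte = (cell >> shift) & 0xFF
    if byte & 0x80:                          # copy bit 7 into bits 8..63
        byte |= 0xFFFFFFFFFFFFFF00
    return byte


def store_byte(d64cur: int, d64in: int, byte_addr: int) -> int:
    """Replace one byte of d64cur with the low byte of d64in; keep the other seven."""
    shift = (byte_addr & 0b111) * 8
    mask = 0xFF << shift
    return (d64cur & ~mask & MASK64) | ((d64in & 0xFF) << shift)


cell = 0x8877665544332211
print(hex(load_byte(cell, 7)))
print(hex(store_byte(cell, 0xAB, 2)))
```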

MSZ Encoding

Following the RISC-V funct3 low bits, the data size control values are:

| Operation | MSZ | Width |
|---|---|---|
| lb / sb | 00 | 8 bits, sign-extended to 64 |
| lw / sw | 10 | 32 bits, sign-extended to 64 |
| ld / sd | 11 | 64 bits |

The final load-side MUX selects among the full 64-bit cell (ld), the sign-extended 32-bit word (lw), and the sign-extended 8-bit byte (lb), driven by MSZ.

flowchart TD
    RAMOUT["RAM D out (64b)"] --> LDD["ld: full 64b"]
    RAMOUT --> SPW["split words, MUX by bit 2"]
    SPW --> SXW["sign-extend 32->64"]
    RAMOUT --> SPB["split bytes, MUX by bits 0-2"]
    SPB --> SXB["sign-extend 8->64"]
    LDD --> MSZMUX["MSZ MUX"]
    SXW --> MSZMUX
    SXB --> MSZMUX
    MSZMUX --> WB["to WDsel / M2R MUX -> RegFile"]
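
Because MSZ follows the low two bits of funct3, the access width is 2^MSZ bytes, which allows the whole load side to be modeled with one function. This is a sketch under that observation; the names `msz_from_funct3` and `load_mux` are hypothetical, and the code would also accept MSZ = 01 (halfword), which this processor does not implement.

```python
MASK64 = (1 << 64) - 1


def msz_from_funct3(funct3: int) -> int:
    """MSZ is the low two bits of the load/store funct3 field."""
    return funct3 & 0b11


def load_mux(cell: int, byte_addr: int, msz: int) -> int:
    """Value written back to the register file for a load of size MSZ."""
    nbytes = 1 << msz                                # 00 -> 1, 10 -> 4, 11 -> 8 bytes
    bits = 8 * nbytes
    offset = byte_addr & 0b111 & ~(nbytes - 1)       # aligned byte offset inside the cell
    value = (cell >> (8 * offset)) & ((1 << bits) - 1)
    if bits < 64 and value >> (bits - 1):            # sign-extend sub-doubleword loads
        value |= MASK64 & ~((1 << bits) - 1)
    return value


cell = 0x1122334480000001
print(hex(load_mux(cell, 0x28, msz_from_funct3(0b010))))  # lw from the lower word
print(hex(load_mux(cell, 0x28, msz_from_funct3(0b011))))  # ld of the full cell
```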

8. Program and Code Initialization

The processor starts in a blank state: all registers are 0 and memory is uninitialized. So every test program needs explicit setup before it can run. The conventions for making a program runnable on the processor:

  • Add an assembly main that sets up parameters (Project 6 expects at least five parameters for your functions).
  • Initialize the stack pointer, e.g. li sp, 1024, so loads/stores hit valid RAM.
  • Remove any .global directives.
  • Use jal instead of call for function calls.
  • End the program with unimp, the marker that tells the processor to stop fetching.
main:
    li   sp, 1024       # set up the stack pointer
    li   a0, 5          # parameter 1
    li   a1, 10         # parameter 2
    jal  myfunc         # call (use jal, not call)
    unimp               # end marker -> processor halts

When you simulate the circuit, you press play, select the program (PROG) value, then toggle EN to 1 so execution begins — EN defaults to disabling writes to the PC and register file so you have time to choose the program.

ROM Programming Recap

The instruction decoder's control bits are stored in a ROM keyed by INUM. Earlier approaches (a Python script, or a hand-derived binary-to-hex equation) work for small data, but direct pasting becomes unreliable for large datasets because of formatting issues. The recommended, scalable approach is to generate a .hex file with the required prefix and load it directly into the ROM — this works at any size and is the same idea used for instruction memory (via makerom3.py).
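
A minimal sketch of generating such a file is shown below. It assumes the required prefix is the Logisim-style `v2.0 raw` header line followed by one hex value per line; that header, the function name `rom_hex`, and the sample control words are assumptions for illustration, so check them against the course tooling (for example makerom3.py) before relying on the output.

```python
def rom_hex(values, header: str = "v2.0 raw") -> str:
    """Return ROM contents as text: a header line, then one hex word per line."""
    # The default header is an assumption; pass the prefix your tooling requires.
    lines = [header] + [format(v, "x") for v in values]
    return "\n".join(lines) + "\n"


# Hypothetical control words, indexed by INUM (row order of the spreadsheet).
control_words = [0x13, 0xFF, 0x0]
print(rom_hex(control_words), end="")
```

Writing the returned string to a file with a .hex extension gives something that can be loaded into the ROM in one step instead of pasting values by hand.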


Key Concepts

| Concept | Definition | Example |
|---|---|---|
| BTA | Branch Target Address, where a taken branch goes | BTA = PC + imm-b |
| imm-b | Sign-extended B-type immediate (PC-relative branch offset) | negative for backward loop branches |
| Branch Unit (BU) | Component comparing rs1, rs2 to decide if a branch is taken | outputs take_branch / PCbr |
| BUOp | Control line selecting the BU comparison | 00: ==, 01: !=, 10: <, 11: >= |
| PCsel | Control bit deciding whether the PC is redirected | 1 for branch/jump, 0 for sequential |
| PCbr | Runtime signal: 1 when the branch comparison succeeds | gates use of BTA |
| Data memory | RAM holding the program's stack | 64-bit cells, used by ld/sd, etc. |
| Byte vs. DW address | Registers hold byte addresses; RAM is indexed by 8-byte cells | DW = byte >> 3 |
| MSZ | Memory-size control for load/store width | 00: byte, 10: word, 11: doubleword |
| M2R / WDsel | Selects RAM output (vs. ALU result) for register write-back | needed for loads |

Practice Problems

Problem 1: Branch Target Address

A beq instruction is at PC address 0x40 and the assembler computed a B-type immediate of -16 (decimal). What is the branch target address (BTA)?

Solution
BTA = PC + imm-b
    = 0x40 + (-16)
    = 64 + (-16)
    = 48
    = 0x30
The negative immediate makes this a **backward** branch (target is before the branch), which is exactly how a loop's "branch back to the top" works. This is why `imm-b` must be **sign-extended** to 64 bits before adding it to the PC.

Problem 2: BUOp Selection

For each branch instruction, give the BUOp value (using the lecture's ordering 00: ==, 01: !=, 10: <, 11: >=) and state when the branch is taken.

Solution

| Instruction | `BUOp` | Branch taken when |
|-------------|--------|-------------------|
| `beq` | `00` | `rs1 == rs2` |
| `bne` | `01` | `rs1 != rs2` |
| `blt` | `10` | `rs1 < rs2` |
| `bge` | `11` | `rs1 >= rs2` |

`bgt rs1, rs2, label` is not a separate hardware case: the assembler turns it into `blt rs2, rs1, label`, so only the four comparisons above are needed in the Branch Unit.

Problem 3: PC Update Logic

Fill in the next PC for each row, given the policy that PCsel = 0 forces PC+4. Assume each instruction is 4 bytes and PC = 0x100.

| Instruction | PCsel | PCbr | Next PC |
|---|---|---|---|
| add t0,t1,t2 | 0 | x | ? |
| beq taken | 1 | 1 | ? |
| beq not taken | 1 | 0 | ? |
| jal label | 1 | n/a | ? |

Solution

| Instruction | `PCsel` | `PCbr` | Next PC |
|-------------|---------|--------|---------|
| `add t0,t1,t2` | 0 | x | `PC+4 = 0x104` |
| `beq` taken | 1 | 1 | `BTA` |
| `beq` not taken | 1 | 0 | `PC+4 = 0x104` |
| `jal label` | 1 | n/a | `JTA` |

The rule: when `PCsel = 0` the PC always advances to `PC+4` regardless of the Branch Unit. When `PCsel = 1`, a branch uses `BTA` only if `PCbr = 1`; a jump always uses its target. This is why the inner MUX (selected by `PCbr`) feeds the outer `PCsel` MUX (Option 1).

Problem 4: Sizing the RAM

You need a data memory of 2048 bytes built from 64-bit cells. How many cells are there, and how many address bits does the RAM need?

Solution
Bytes per cell = 64 bits / 8 = 8 bytes = 2^3
Number of cells = 2048 / 8 = 256 = 2^8

Address bits = 8   (2^8 = 256 cells)
Check: 2^3 bytes/cell * 2^8 cells = 2^11 = 2048 bytes  ✓
So the RAM is configured with **8 address bits** and **64 data bits**.

Problem 5: Byte Address to Doubleword Address

A program executes ld t0, 16(sp) with sp = 1024. What byte address does the ALU compute, and what DW address goes to the RAM's A input?

Solution
byte_addr = sp + offset = 1024 + 16 = 1040

DW_addr = byte_addr >> 3 = 1040 / 8 = 130
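In hardware, a splitter drops the low 3 bits of `1040` (`0b10000010000`): the low 3 bits (`000`) are the byte-within-doubleword offset and are discarded for `ld`; the remaining high bits form `130`, the DW index sent to the RAM. Because `1040` is a multiple of 8, this is a valid doubleword-aligned `ld`. Note that DW index `130` needs 8 address bits, so this access assumes a RAM larger than the 1024-byte example (for instance the 2048-byte memory of Problem 4). In the 1024-byte RAM, valid byte addresses stop at 1023, which is why programs normally lower `sp` and access the stack below its initial value.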

Problem 6: Load Word Half Selection

A lw reads from byte address 0x2C. After converting to a DW address and reading the 64-bit cell, which 32-bit half (lower bits 0-31 or upper bits 32-63) does the load logic select, and why?

Solution
0x2C = 0b101100
   bits 1..0 = 00  (byte index within the word)
   bit 2     = 1   (word index within the doubleword)
Bit 2 is the word-index selector. Since **bit 2 = 1**, the load logic selects the **upper word (bits 32-63)**. That 32-bit value is then sign-extended to 64 bits before going to the `MSZ` MUX. (If bit 2 were 0, it would select the lower word, bits 0-31.) Note `0x2C` is 4-byte aligned, as required for `lw`.

Summary

  1. The remaining instructions split into data processing (ALU), control (jal/jalr plus the new branches), and memory (lb/sb, lw/sw, ld/sd); branches and data memory are the genuinely new work.

  2. A conditional branch is a two-step process: compute BTA = PC + imm-b (reusing the ALU), then compare rs1 and rs2 to decide whether to redirect the PC to the BTA or fall through to PC+4.

  3. The Branch Unit encapsulates the four comparisons (=, !=, <, >=), selected by a clean 2-bit BUOp control line, and outputs take_branch/PCbr — keeping non-contiguous funct3 decoding out of the data path.

  4. PC selection combines PCsel and PCbr: an inner MUX driven by PCbr chooses PC+4 vs. BTA, feeding the outer PCsel MUX (Option 1). When PCsel = 0, the PC always advances to PC+4.

  5. Data memory uses a Digital RAM with 64-bit cells for the stack; a 1024-byte memory needs 128 cells (7 address bits). Registers hold byte addresses, so a splitter drops the low 3 bits (byte_addr >> 3) to form the doubleword address for the RAM.

  6. Sub-word access needs helper logic: keep the RAM at the top level, add load logic after it and store logic before it; use byte-address bit 2 to pick the word half, sign-extend results, and use MSZ to select byte/word/doubleword. Stores do a read-modify-write so the rest of the cell is preserved.

  7. Programs require explicit initialization: set up sp (e.g., li sp, 1024), provide a main, drop .global, use jal instead of call, and end with unimp. Decode-ROM contents are best generated as a .hex file rather than pasted by hand.