# Programs and Data Memory

## CS 315 Computer Architecture

---

## Overview

- Wrap a RISC-V function to run **standalone** on the Digital processor
- **Size the RAM** and initialize the stack pointer
- Apply the **five-step recipe** for adding any new instruction
- Design the **data memory subsystem** for `ld` / `sd`
- Extend to sub-doubleword: `lw` / `sw` and `lb` / `sb`

---

## Running Functions on the Bare Processor

The Digital processor runs bare, with **no** OS underneath: there is no loader, `_start`, or C runtime.

Four rules to wrap any function:

| Rule | Action |
|------|--------|
| 1. Assembly `main` | Provide entry point; set up arguments |
| 2. No `.global` | Drop linker directives |
| 3. `jal` not `call` | `call` may expand to `auipc`+`jalr` |
| 4. `unimp` end marker | Halts the processor |

---

## Standalone Skeleton

```asm
main:
        li    sp, 1024        # top of the RAM
        li    a0, ...         # set up arguments
        jal   swap_s          # link ra, jump
        unimp                 # processor halts here

swap_s:
        # ... function body ...
        ret
```

<div class="highlight-box">
<code>unimp</code> is the end marker — execution stops when the processor fetches it.
</div>

---

## Execution Flow

<div class="mermaid">
flowchart TD
    A["main:"] --> B["li sp, 1024"]
    B --> C["jal swap_s  (ra = PC+4)"]
    C --> D["swap_s: body runs"]
    D --> E["ret  (jalr ra → back to main)"]
    E --> F["unimp  (processor halts)"]
</div>

---

## Why `jal`, Not `call`?

- `call` is a **pseudo-instruction**: assembler may expand to `auipc` + `jalr`
- Our processor does **not yet support `auipc`**
- `jal` is a single real instruction — fits our datapath directly

<div class="info-box">
For tiny programs that fit entirely in Instruction Memory, every label is within reach of a single <code>jal</code> (±1 MiB).
</div>
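
The reach of `jal` can be checked numerically. This is a minimal sketch, assuming the standard RISC-V J-type format (a 21-bit signed byte offset whose bit 0 is always 0); the helper name `fits_jal` is hypothetical and not part of any course tool.

```python
def fits_jal(offset):
    """Return True if a PC-relative byte offset is encodable by one jal.

    J-type immediates are 21-bit signed values with bit 0 implied to be 0,
    so the reachable range is -2**20 .. 2**20 - 2 in steps of 2.
    """
    in_range = -(1 << 20) <= offset <= (1 << 20) - 2
    return in_range and offset % 2 == 0


print(fits_jal(16))        # True: a nearby label
print(fits_jal(1 << 20))   # False: exactly 1 MiB forward is just out of range
```

Any program small enough for this project stays far inside that range, which is why `jal` alone is sufficient here.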

---

## Sizing the RAM

We want **1024 bytes** of stack. The RAM has **64-bit cells**:

```text
1024 bytes  =  2^10 bytes
each cell   =  8 bytes  =  2^3 bytes
cells       =  2^10 / 2^3  =  2^7  =  128 cells
address bits  =  7
```

| RAM parameter | Value | Meaning |
|---------------|-------|---------|
| data bits | 64 | one cell = one doubleword |
| address bits | 7 | 2^7 = 128 cells |
| total bytes | 1024 | 128 × 8 |

---

## Stack Pointer Initialization

Stack grows **downward** in RISC-V — `sp` starts at the **top** of RAM.

```asm
li sp, 1024     # top-of-RAM byte address
```

First push: `addi sp, sp, -16` → lands at byte 1008 (valid).
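
The arithmetic behind that first push can be traced in a few lines. This is only a sketch: it assumes the 1024-byte RAM from the sizing slide and a 16-byte frame, and the variable names are illustrative.

```python
RAM_BYTES = 1024            # RAM size from the sizing slide
sp = RAM_BYTES              # li sp, 1024: one past the last valid byte (1023)

sp -= 16                    # addi sp, sp, -16
first_slot_cell = sp >> 3           # cell used by 0(sp)
second_slot_cell = (sp + 8) >> 3    # cell used by 8(sp)

print(sp, first_slot_cell, second_slot_cell)  # 1008 126 127
```

Byte 1024 itself does not exist in the RAM, but it is never accessed: the frame is allocated by decrementing `sp` before any store, so the two slots land in the last two cells.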

---

## Capacity Formula

$$\text{total bytes} = 2^{\text{addr bits}} \times \frac{\text{data bits}}{8}$$

Examples:

| data bits | addr bits | cells | total bytes |
|-----------|-----------|-------|-------------|
| 64 | 7 | 128 | 1024 |
| 64 | 8 | 256 | 2048 |
| 64 | 10 | 1024 | 8192 |
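
The formula and the table rows can be reproduced with a short calculation. This is a sketch with hypothetical helper names (`ram_capacity`, `addr_bits_for`); it assumes the data width is a multiple of 8 bits.

```python
def ram_capacity(addr_bits, data_bits):
    """Return (cells, total_bytes) for a RAM with the given parameters."""
    cells = 1 << addr_bits
    total_bytes = cells * data_bits // 8
    return cells, total_bytes


def addr_bits_for(total_bytes, data_bits):
    """Return the smallest address width that covers total_bytes."""
    cells = total_bytes // (data_bits // 8)
    return (cells - 1).bit_length()


for addr_bits in (7, 8, 10):
    print(addr_bits, ram_capacity(addr_bits, 64))

print(addr_bits_for(1024, 64))  # 7
```

The second helper runs the formula in reverse, which is the direction used when sizing the RAM for a target stack size.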

---

## The Five-Step Recipe

Apply this **every time** you add a new instruction:

<div class="mermaid">
flowchart TD
    S1["1. Pick an instruction (or group)"] --> S2
    S2["2. Add / modify components"] --> S3
    S3["3. Extend the datapath (add MUX inputs)"] --> S4
    S4["4. Update decoder spreadsheet + ROM"] --> S5
    S5["5. Test"]
    S5 -.->|"next instruction"| S1
</div>

---

## Recipe: Key Principles

- **Step 3 — MUX inputs**: new data sources go on *new* MUX inputs; existing inputs are unchanged
- **Step 4 — Spreadsheet first**: add the new instruction row AND set new control columns to `0` for all existing instructions
- **Step 5 — Incremental**: run the autograder after *each* instruction group before moving on

<div class="info-box">
Keeping the <strong>RAM at the top level</strong> of the circuit lets you inspect its contents during simulation.
</div>

---

## Project 6 Layout

```text
project06-<github_userid>/
├── part1/    # *.dig + *.hex  (snapshot 1)
├── part2/    # *.dig + *.hex  (snapshot 2)
├── part3/    # *.dig + *.hex  (snapshot 3)
└── final/    # *.dig + *.hex  (finished)
```

- Run autograder **per directory**: `grade test -p project06`
- Decoder spreadsheet: **one workbook, four sheets** (`part1`, `part2`, `part3`, `final`)
- Submit `.xlsx` and PDF export

---

## `ld` and `sd` Instructions

```asm
ld   t0, 8(sp)   # t0 = memory[sp + 8]   (64-bit load)
sd   t0, 8(sp)   # memory[sp + 8] = t0   (64-bit store)
```

**Address computation**: `target_addr = base + offset`

- Reuses the **ALU** (add operation)
- Input A = base register (`RD0`)
- Input B = sign-extended immediate (`imm-I` for loads, `imm-S` for stores)
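
The two immediate formats differ only in where their 12 bits sit in the instruction word. The sketch below models that in Python, assuming the standard RISC-V I-type and S-type layouts; the function names are hypothetical, and the two instruction words are hand-assembled encodings of the `ld`/`sd` examples above (an assumption worth confirming against `objdump`).

```python
MASK64 = (1 << 64) - 1


def sext12(value):
    """Sign-extend a 12-bit field to a Python int."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value


def imm_i(inst):
    """I-type immediate: inst[31:20]."""
    return sext12(inst >> 20)


def imm_s(inst):
    """S-type immediate: inst[31:25] joined with inst[11:7]."""
    return sext12(((inst >> 25) << 5) | ((inst >> 7) & 0x1F))


def target_addr(base, imm):
    """ALU add: base register plus sign-extended offset, wrapped to 64 bits."""
    return (base + imm) & MASK64


LD_T0_8_SP = 0x00813283   # assumed encoding of: ld t0, 8(sp)
SD_T0_8_SP = 0x00513423   # assumed encoding of: sd t0, 8(sp)
sp = 1008

print(target_addr(sp, imm_i(LD_T0_8_SP)))  # 1016
print(target_addr(sp, imm_s(SD_T0_8_SP)))  # 1016
```

Both instructions reach the same byte address because the ALU operation is identical; only the immediate source feeding input B changes.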

---

## The Byte-Address Problem

<div class="highlight-box">
Registers hold <strong>byte addresses</strong>. The RAM is indexed by <strong>doubleword (cell) number</strong>.
</div>

Fix: drop the low 3 bits (divide by 8):

```text
DW address = byte_address >> 3
           = byte_address bits [9:3]   (7 bits for 128-cell RAM)
```

Implement in Digital with a **splitter**: wire bits `[9:3]` of the 64-bit ALU result to the 7-bit RAM `A` input.

---

## Byte → DW Address Conversion

<div class="mermaid">
flowchart LR
    ALU["ALU result\n(64-bit byte addr)"] --> SPL["splitter:\nbits 9..3"]
    SPL --> RAMA["RAM A\n(7-bit DW addr)"]
</div>

- Bits `0..2` are the byte offset *within* a doubleword — ignored for `ld`/`sd` (8-byte aligned)
- Bits `3..9` are the cell index (0..127)
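
The splitter's behavior can be modeled as a shift and a mask. This is a sketch for the 7-bit, 128-cell configuration; `dw_address` and `byte_offset` are illustrative names, not Digital components.

```python
ADDR_BITS = 7  # RAM address width from the sizing slide


def dw_address(byte_addr):
    """Bits 9..3 of the byte address: the RAM cell index."""
    return (byte_addr >> 3) & ((1 << ADDR_BITS) - 1)


def byte_offset(byte_addr):
    """Bits 2..0 of the byte address: position inside the cell."""
    return byte_addr & 0b111


for addr in (0, 8, 1008, 1016, 1023):
    print(addr, dw_address(addr), byte_offset(addr))
```

Because the splitter discards bits 10 and above, an out-of-range byte address such as 1024 would wrap around to cell 0 in this model rather than fault.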

---

## RAM (Separated Ports): Port List

| Port | Width | Dir | Purpose |
|------|-------|-----|---------|
| `A` (ADDR) | 7 | in | DW address |
| `Din` | 64 | in | data to write |
| `Dout` | 64 | out | data read |
| `str` | 1 | in | store (write) enable |
| `ld` | 1 | in | load (read) enable |
| `clk` | 1 | in | clock |

Configure: **data bits = 64**, **address bits = 7**

---

## Write-Back: M2R / WDsel MUX

For a load, `Dout` must reach the register file — but `WD` already carries the ALU result.

```text
ALU result  --> | 0         |
                |  M2R MUX  | --> | 0         |
RAM Dout    --> | 1         |     | WDsel MUX | --> RegFile WD
                              ^   | 1  PC+4   |
                             M2R  +-----------+
                                        ^
                                      WDsel
```

New sources always go on **new MUX inputs** — existing instructions unaffected.
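
The two-stage selection in the diagram can be written as a pair of conditionals. This is a behavioral sketch only; the function name `write_data` is hypothetical, and the select encodings follow the diagram (0 = ALU result, 1 = RAM `Dout`; 0 = M2R output, 1 = PC+4).

```python
def write_data(alu_result, ram_dout, pc_plus_4, m2r, wdsel):
    """Model the M2R MUX feeding the WDsel MUX, returning the RegFile WD value."""
    m2r_out = ram_dout if m2r else alu_result   # M2R MUX
    return pc_plus_4 if wdsel else m2r_out      # WDsel MUX


# Existing ALU instructions keep both selects at 0 and see no change.
print(write_data(alu_result=7, ram_dout=99, pc_plus_4=12, m2r=0, wdsel=0))  # 7
# A load sets M2R = 1 so RAM Dout reaches the register file.
print(write_data(alu_result=7, ram_dout=99, pc_plus_4=12, m2r=1, wdsel=0))  # 99
```

When `WDsel` is 1, the `M2R` value has no effect on the result, so it can be treated as a don't-care in that case.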

---

## `ld` / `sd` Datapath

<div class="mermaid">
flowchart LR
    RD0["RD0\n(base)"] --> ALU["ALU\n(add)"]
    IMM["imm\n(offset)"] --> ALU
    ALU --> CONV["bits 9..3\n(splitter)"]
    CONV --> RAMA["RAM A\n7-bit"]
    RD1["RD1\n(store value)"] --> DIN["RAM Din"]
    RAMA --> RAM["RAM\n64×128"]
    DIN --> RAM
    RAM --> DOUT["RAM Dout"]
    DOUT --> M2R["M2R/WDsel\nMUX"]
    M2R --> WD["RegFile\nWriteData"]
</div>

---

## Control Lines: `ld` and `sd`

| inst | `ld` (read) | `str` (write) | `M2R` | `RFW` |
|------|-------------|---------------|-------|-------|
| `ld` | 1 | 0 | 1 (RAM Dout) | 1 |
| `sd` | 0 | 1 | don't care | 0 |

- `ld`: read RAM, route `Dout` to register file
- `sd`: write whole cell; no register file write

These become new columns in the decoder spreadsheet (Step 4).

---

## Sub-Doubleword: `lw` / `sw`

A 64-bit cell holds **two 32-bit words**:

```text
 63              32 31               0
 +------------------+------------------+
 |  upper word (W1) |  lower word (W0) |
 +------------------+------------------+
        word index 1       word index 0
              ^--- selected by byte-address bit 2
```

- **Load** (`lw`): read cell, select half, sign-extend 32→64
- **Store** (`sw`): **read-modify-write** the cell

---

## Load Word Path

<div class="mermaid">
flowchart LR
    DOUT["RAM Dout\n(64-bit)"] --> SPL["split:\nW0 bits 31..0\nW1 bits 63..32"]
    SPL --> WMUX["word MUX\n(sel = bit 2)"]
    WMUX --> SX["sign-extend\n32 → 64"]
    SX --> MSZ["MSZ MUX\nlb/lw/ld"]
    DOUT --> MSZ
    MSZ --> M2R["to M2R/WDsel\n→ RegFile"]
</div>

MSZ encodings (the low two bits of `funct3`): `lb = 00`, `lw = 10`, `ld = 11`
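
The load-word path can be expressed as a split, a select, and a sign extension. The sketch below is a behavioral model rather than a circuit; `load_word` and `sign_extend` are hypothetical names, and results are returned as unsigned 64-bit patterns.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a 64-bit pattern."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return ((value ^ sign) - sign) & MASK64


def load_word(cell, byte_addr):
    """Select W0 or W1 from a 64-bit cell using byte-address bit 2."""
    w0 = cell & 0xFFFFFFFF
    w1 = (cell >> 32) & 0xFFFFFFFF
    word = w1 if (byte_addr >> 2) & 1 else w0
    return sign_extend(word, 32)


cell = 0xFFFFFFFE_00000005
print(hex(load_word(cell, 0)))  # 0x5
print(hex(load_word(cell, 4)))  # 0xfffffffffffffffe
```

The upper word here is negative as a 32-bit value, so its sign bit is copied into all 32 upper bits of the result.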

---

## Read-Modify-Write for `sw`

The RAM can only write **full 64-bit cells** — we must preserve the half we are not changing.

```text
1. READ  current cell D64cur  (ld=1)
2. MODIFY:
     option A (write lower): { W1 : Wnew }
     option B (write upper): { Wnew : W0 }
   word MUX selects A or B via byte-addr bit 2
3. WRITE merged value back   (str=1)
```

<div class="highlight-box">
For <code>sw</code>: set <strong>both</strong> <code>ld=1</code> and <code>str=1</code> simultaneously.
</div>
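
The merge step can be modeled directly from the three steps above. This is a sketch under the same layout (W0 in bits 31..0, W1 in bits 63..32); `store_word` is a hypothetical name, and its return value represents what is driven onto RAM `Din`.

```python
def store_word(cur_cell, byte_addr, rd1):
    """Return the merged 64-bit cell for sw, preserving the untouched half."""
    w0 = cur_cell & 0xFFFFFFFF
    w1 = (cur_cell >> 32) & 0xFFFFFFFF
    w_new = rd1 & 0xFFFFFFFF            # only the low 32 bits of RD1 are stored

    write_lower = (w1 << 32) | w_new    # option A: { W1 : Wnew }
    write_upper = (w_new << 32) | w0    # option B: { Wnew : W0 }
    return write_upper if (byte_addr >> 2) & 1 else write_lower


cur = 0x11111111_22222222
print(hex(store_word(cur, 0, 0xAAAAAAAA_BBBBBBBB)))  # 0x11111111bbbbbbbb
print(hex(store_word(cur, 4, 0xAAAAAAAA_BBBBBBBB)))  # 0xbbbbbbbb22222222
```

Both merged candidates are always computed, and address bit 2 only picks between them, which mirrors how a MUX behaves in hardware.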

---

## Store Word Path

<div class="mermaid">
flowchart TD
    CUR["RAM Dout\nD64cur"] --> S1["split: W0, W1"]
    RD1["RD1 = D64in"] --> S2["Wnew = low 32 bits"]
    S1 --> M1["merge\nW1 : Wnew"]
    S2 --> M1
    S1 --> M2["merge\nWnew : W0"]
    S2 --> M2
    M1 --> WMUX["word MUX\n(sel = bit 2)"]
    M2 --> WMUX
    WMUX --> MSZ["MSZ MUX\nsb/sw/sd"]
    MSZ --> DIN["RAM Din\n(str=1, ld=1)"]
</div>

---

## `lb` / `sb` by Analogy

Same pattern, finer granularity:

| op | granularity | selector bits | sign-ext |
|----|-------------|---------------|----------|
| `ld`/`sd` | 64-bit cell | none | no |
| `lw`/`sw` | 32-bit half | bit 2 | 32→64 |
| `lb`/`sb` | 8-bit byte | bits 2..0 | 8→64 |

Once `lw`/`sw` works, `lb`/`sb` follows the same structure with an 8-way byte-select MUX.
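
The byte versions follow the same select-and-merge pattern with a 3-bit selector. The sketch below assumes little-endian byte lanes (byte 0 in bits 7..0), consistent with the word layout shown earlier; `load_byte` and `store_byte` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def load_byte(cell, byte_addr):
    """Select one of 8 bytes using address bits 2..0, then sign-extend 8 to 64."""
    shift = (byte_addr & 0b111) * 8
    byte = (cell >> shift) & 0xFF
    return ((byte ^ 0x80) - 0x80) & MASK64


def store_byte(cur_cell, byte_addr, rd1):
    """Read-modify-write: replace one byte lane, keep the other seven."""
    shift = (byte_addr & 0b111) * 8
    cleared = cur_cell & ~(0xFF << shift) & MASK64
    return cleared | ((rd1 & 0xFF) << shift)


cell = 0x08070605_04030201
print(hex(load_byte(cell, 7)))          # 0x8
print(hex(store_byte(cell, 2, 0xAB)))   # 0x807060504ab0201
```

In hardware the variable shift becomes an 8-way MUX on the load side and eight merged candidates on the store side, just as the word path uses a 2-way MUX.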

---

## Control Lines: Full Table

| inst | `ld` | `str` | `RFW` | notes |
|------|------|-------|-------|-------|
| `ld` | 1 | 0 | 1 | read cell → RF |
| `sd` | 0 | 1 | 0 | write whole cell |
| `lw` | 1 | 0 | 1 | read, extract, sign-ext |
| `sw` | 1 | 1 | 0 | read-modify-write |
| `lb` | 1 | 0 | 1 | read, extract byte, sign-ext |
| `sb` | 1 | 1 | 0 | read-modify-write (byte) |
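
Before typing these rows into the decoder spreadsheet, the table can be captured as data and sanity-checked. This is an illustrative sketch, not part of the autograder; the dictionary and helper names are hypothetical, and the values are copied from the table above.

```python
# Control values per instruction: (ld, str, RFW).
CONTROL = {
    "ld": (1, 0, 1),
    "sd": (0, 1, 0),
    "lw": (1, 0, 1),
    "sw": (1, 1, 0),
    "lb": (1, 0, 1),
    "sb": (1, 1, 0),
}


def is_read_modify_write(inst):
    """True when the instruction asserts both RAM read and RAM write."""
    ld, store, _rfw = CONTROL[inst]
    return ld == 1 and store == 1


def conflicting_rows(control):
    """Return instructions that write both the RAM and the register file."""
    return [name for name, (_ld, store, rfw) in control.items() if store and rfw]


print([i for i in CONTROL if is_read_modify_write(i)])  # ['sw', 'sb']
print(conflicting_rows(CONTROL))                         # []
```

No instruction in this group writes both the RAM and the register file, so a non-empty second list would point to a mistyped row.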

---

## Memory Size Hierarchy

| Mnemonic | Bits | Alignment | Cell selector |
|----------|------|-----------|---------------|
| `lb`/`sb` | 8 | 1 byte | addr bits `2..0` |
| `lw`/`sw` | 32 | 4 bytes | addr bit `2` |
| `ld`/`sd` | 64 | 8 bytes | none (whole cell) |

<div class="info-box">
<strong>Key invariant</strong>: addresses in registers are always <em>byte</em> addresses. The DW cell index = byte address >> 3.
</div>

---

## Sign Extension Recap

Loads sign-extend the loaded value to 64 bits:

```text
lb example: read byte = 0b1111_1110  (= -2 signed)
sign bit = 1
result   = 0xFFFF_FFFF_FFFF_FFFE    (= -2 as 64-bit)
```

Trick: shift left by (64 − width) so the value's sign bit lands at bit 63, then arithmetic-shift right by the same amount.
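
The shift trick can be tried out in software. Python integers are unbounded, so this sketch masks to 64 bits and emulates the arithmetic right shift explicitly; `sra64` and `sign_extend_by_shifts` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1


def sra64(value, amount):
    """Arithmetic shift right on a 64-bit pattern."""
    value &= MASK64
    if value >> 63:            # bit 63 set: reinterpret as a negative int
        value -= 1 << 64
    return (value >> amount) & MASK64


def sign_extend_by_shifts(value, bits):
    """Shift left so the sign bit reaches bit 63, then shift back arithmetically."""
    shift = 64 - bits
    return sra64((value << shift) & MASK64, shift)


print(hex(sign_extend_by_shifts(0b1111_1110, 8)))  # 0xfffffffffffffffe
print(hex(sign_extend_by_shifts(0x7F, 8)))         # 0x7f
```

For `lb` the shift amount is 56 and for `lw` it is 32; the same two-shift structure covers both widths.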

---

## Debugging Tips

- Keep the **RAM at the top level** — only then can you open it and inspect the stack during simulation
- **Single-step** the clock; use **probes** on intermediate wires
- Compare against `objdump` output — match each instruction to its expected behavior
- Add a new partial directory (`part2/`, `part3/`) **before** making the next change

---

## Summary

1. **Standalone packaging**: assembly `main`, no `.global`, `jal` not `call`, `unimp` end marker

2. **Stack pointer**: `li sp, 1024` — top of a 1024-byte RAM (64 data bits, 7 addr bits)

3. **Capacity**: total bytes = 2^(addr bits) × (data bits / 8)

4. **Five-step recipe**: pick → components → datapath (new MUX inputs) → decoder → test

5. **Byte → DW address**: splitter takes bits `[9:3]` of the 64-bit ALU result

6. **`ld`/`sd`**: ALU computes address; M2R MUX routes `Dout` to RF for loads

7. **Sub-doubleword**: `lw`/`sw` use word-index bit; `sw`/`sb` need read-modify-write (`ld=1, str=1`)

8. **Incremental builds**: `part1`–`final` directories; four-sheet decoder spreadsheet