Project 04 and Midterm Review

# Project 04 and Midterm Review

## CS 315 Computer Architecture

---

## Session Overview

<div class="info-box">
<strong>Two halves today:</strong>
<ol>
<li>Implementing load/store instructions in the emulator (Project 04)</li>
<li>Cache memory: direct-mapped, block size, set-associative</li>
</ol>
</div>

- Midterm is **Thursday** — one page of notes, front and back
- Instruction formats may be printed on the exam
- Emphasis on *reasoning*, not memorizing

---

## Midterm Study Checklist

- Extract bit fields from an instruction word
- Reconstruct I-type and S-type immediates
- Apply **sign extension**, **two's complement**, **masking**
- Implement `lw`/`sw`: target address + pointer cast
- **Direct-mapped cache**: slot index, tag, valid bit
- **Block size**: block-aligned base address arithmetic
- **Set-associative cache**: sets, ways, LRU replacement

<div class="highlight-box">
Good one-page sheet: I-type/S-type field layouts, sign-extend idiom, byte/word/block conversions, slot-index formula.
</div>

---

## Load and Store: The Big Picture

Assembly syntax — offset and base register are explicit:

```asm
lw   t0, 8(sp)     # load word:  t0 = mem[sp + 8]
sw   t0, 8(sp)     # store word: mem[sp + 8] = t0
```

**Target address** = base register + sign-extended immediate

| Instruction | Width | Direction |
|-------------|-------|-----------|
| `lw` / `ld` | 32 / 64 bits | memory → register |
| `sw` / `sd` | 32 / 64 bits | register → memory |

<div class="info-box">
Do <strong>not</strong> index the emulated stack array directly — always compute an effective address and dereference through a real pointer.
</div>

---

## I-type Format (Loads)

```text
 31           20 19   15 14  12 11   7 6      0
+---------------+-------+------+------+--------+
|    imm[11:0]  |  rs1  |funct3|  rd  | opcode |
+---------------+-------+------+------+--------+
      12 bits      5       3      5       7
```

- Single contiguous 12-bit immediate in bits [31:20]
- `rs1` = base register, `rd` = destination register
- Immediate is **signed** — must sign-extend bit 11 to 64 bits

```c
int64_t imm = sign_extend(vimm, 11);   // sign bit is bit 11
```

---

## Implementing `lw`

```c
// 1. Sign-extend the 12-bit immediate
int64_t imm = sign_extend(vimm, 11);

// 2. Compute the target address
uint64_t target_address = rsp->regs[rs1] + imm;

// 3. Dereference — pointer type selects access width
rsp->regs[rd] = *((uint32_t *) target_address);  // lw: 4 bytes
// rsp->regs[rd] = *((uint64_t *) target_address); // ld: 8 bytes
```

<div class="highlight-box">
The <strong>pointer cast</strong> is what tells C how wide the access is. Changing <code>uint32_t *</code> to <code>uint64_t *</code> is the only difference between <code>lw</code> and <code>ld</code>.
</div>

---

## S-type Format (Stores)

```text
 31        25 24   20 19   15 14  12 11    7 6      0
+-----------+-------+-------+------+--------+--------+
| imm[11:5] |  rs2  |  rs1  |funct3|imm[4:0]| opcode |
+-----------+-------+-------+------+--------+--------+
   7 bits      5       5      3       5        7
```

- No `rd` — stores write to memory, not a register
- Immediate is **split**: `imm[11:5]` in bits [31:25], `imm[4:0]` in bits [11:7]
- Must **reassemble** before sign-extending

```c
vimm = (imm11_5 << 5) | imm4_0;     // reassemble 12-bit field
int64_t imm = sign_extend(vimm, 11);
```

---

## Implementing `sw`

```c
// Compute target address (same as load)
uint64_t target_address = rsp->regs[rs1] + imm;

// Write register value into memory
*((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2];
```

**Load vs. store symmetry:**

```c
// load word:   register <- memory
rsp->regs[rd] = *((uint32_t *) target_address);

// store word:  memory <- register
*((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2];
```

---

## Sign Extension, Two's Complement, Masking

**Two's complement** — negate by inverting all bits and adding one:

```text
3-bit example:  000=0  001=1  010=2  011=3
                100=-4  101=-3  110=-2  111=-1
```

**Sign extension** idiom — shift sign bit left, then arithmetic right:

```c
int64_t sign_extend(uint64_t v, int sign_bit) {
    int shift = 63 - sign_bit;   // for bit 11: shift = 52
    return ((int64_t)(v << shift)) >> shift;
}
```

**Masking** — shift field to bottom, AND with width mask:

```c
uint32_t field = (instr >> 20) & 0xFFF;   // extract bits [31:20]
```

---

## Cache Memory: Why It Exists

Memory is large and slow; registers are tiny and fast.

A **cache** holds recently used data so most requests avoid slow main memory.

**Locality principles:**

- **Temporal locality** — used now → likely used again soon (loop index)
- **Spatial locality** — touched one address → neighbors likely next (`array[i]` then `array[i+1]`)

<div class="mermaid">
flowchart LR
    CPU["CPU"] --> Cache["Cache\n(fast, small)"]
    Cache --> Mem["Main Memory\n(slow, large)"]
    style Cache fill:#bbf,stroke:#333
    style Mem fill:#fdd,stroke:#333
</div>

---

## Address Layers: Byte, Word, Block

Memory is byte-addressed; we group bytes into words and words into blocks.

```text
byte address bits:
  ...  |  tag  |  slot index  |  byte offset  |
                                  (bits 1:0)
```

**Conversions:**

```c
addr_word = addr / 4;        // byte address -> word address
addr      = addr_word * 4;   // word address -> byte address
```

- Word 0 = bytes 0–3, Word 1 = bytes 4–7, Word 2 = bytes 8–11, ...
- Project 04 I-cache: addresses are **word-aligned** (byte offset always 0)

---

## Direct-Mapped Cache

Every address maps to **exactly one slot**.

For a 4-slot cache:

```c
slot_index = addr_word % 4          // modulus
           = (addr >> 2) & 0b11;    // equivalent bit mask
```

Many addresses share a slot — tell them apart with the **tag**:

```text
word addr   byte addr   slot   tag
   0            0         0     0
   4           16         0     1   <- same slot, different tag
   8           32         0     2   <- same slot, different tag
   1            4         1     0
```

Each slot stores: `valid` bit + `tag` + `data`

---

## Cache Lookup: Direct-Mapped

<div class="mermaid">
flowchart LR
    A["byte address"] --> B["addr >> 2\n(drop byte offset)"]
    B --> C["& 0b11\n(slot index)"]
    C --> D{"valid == 1\nand tag matches?"}
    D -->|yes| E["HIT: return data"]
    D -->|no| F["MISS: fetch from memory,\nfill slot, set valid + tag"]
</div>

```c
uint64_t addr_tag   = addr >> 4;           // drop offset(2) + index(2) bits
uint64_t slot_index = (addr >> 2) & 0b11;
struct cache_slot_st *slot = &cache[slot_index];

if (slot->valid && slot->tag == addr_tag)
    return slot->data;                     // HIT
else { /* fetch, fill, return */ }         // MISS
```

---

## Scaling the Direct-Mapped Cache

| # Slots | Index bits | Tag bits (64-bit addr) |
|---------|------------|------------------------|
| 4       | 2          | 60                     |
| 16      | 4          | 58                     |
| 64      | 6          | 56                     |

For 16 slots:

```c
slot_index = (addr >> 2) & 0b1111;   // 4 index bits
addr_tag   = addr >> 6;              // drop offset(2) + index(4)
```

<div class="info-box">
More slots = more index bits = fewer tag bits. The formula is always: <code>slot_index = addr_word % #slots</code>.
</div>

---

## I-cache vs. D-cache

Two distinct memory address streams in every processor:

<div class="mermaid">
flowchart LR
    PC["PC"] --> IC["I-Cache\n(instructions)"]
    LDST["ld / sd\n(data access)"] --> DC["D-Cache\n(data)"]
    IC --> MEM["Main Memory"]
    DC --> MEM
    style IC fill:#bbf,stroke:#333
    style DC fill:#fbb,stroke:#333
</div>

- L1 is **split** (I-cache + D-cache) so the two streams don't interfere
- L2/L3 are usually **unified**

**Project 04 simulates the I-cache** — read-only and word-aligned (simpler!)

---

## Why I-cache Is Simpler

| Feature | I-cache | D-cache |
|---------|---------|---------|
| Access type | Read-only | Read + Write |
| Alignment | Always word-aligned | Byte, word, doubleword |
| Write policy | None needed | Write-through or write-back |
| Byte offset | Always zero | Variable |

<div class="highlight-box">
No stores = no dirty-block tracking, no write-through/write-back policy to implement.
</div>

---

## Cache Block Size

Each slot can hold a **block** of several words to exploit spatial locality.

```c
struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];   // 4 words per block
    uint32_t valid;
    uint64_t timestamp;
};
```

Address split with 4-word blocks (4-slot cache):

```c
slot_index  = (addr >> 4) & 0b11;   // skip offset(2)+block(2)
block_index = (addr >> 2) & 0b11;   // which word in the block
```

---

## Block Size: Spatial Locality Payoff

<div class="mermaid">
flowchart TD
    A["access w0 (MISS)"] --> B["fetch entire block:\nw0, w1, w2, w3"]
    B --> C["w1, w2, w3 are now HITS"]
    C --> D["3 free hits from 1 miss!"]
</div>

**Example** — `addr = 16` in a 4-word-block, 4-slot cache:

```text
addr_word  = 16 / 4  = 4
addr_block =  4 / 4  = 1
slot_index =  1 % 4  = 1
```

One miss on word 0 of the block prefetches words 1–3.

---

## Finding the Block Base Address on a Miss

Must load the **entire block** — need the first word's address.

**Arithmetic approach** (for `addr = 24`, 4-word block):

```text
addr_word = 6
rem       = 6 % 4 = 2        (block index = which word within block)
block_start_word = 6 - 2 = 4
block_start_byte = 4 * 4 = 16
```

**Bit-mask approach** — clear the low 4 bits (4-word block = 16 bytes):

```c
// (a) shift out and back
addr_block_start = (addr >> 4) << 4;

// (b) AND mask (zeros low 4 bits)
addr_block_start = addr & 0xFFFFFFFFFFFFFFF0;
```

Both give **byte 16** as the block base for byte address 24.

---

## Loading the Full Block

Once you have the block base address, read all 4 words:

```c
uint32_t block[4];
block[0] = *((uint32_t *) addr);          // word at byte 16
block[1] = *((uint32_t *) (addr + 4));    // word at byte 20
block[2] = *((uint32_t *) (addr + 8));    // word at byte 24
block[3] = *((uint32_t *) (addr + 12));   // word at byte 28
```

Store the block in the cache slot, set `valid = 1` and record the `tag`.

---

## Set-Associative Cache

**Direct-mapped**: one home per address — collisions force eviction even when other slots are free.

**Set-associative**: addresses map to a **set**; any **way** within the set may hold the data.

- Map address to a **set** (direct-mapped style)
- Compare tag against **all ways** in that set (small loop)

---

## Set-Associative Lookup

<div class="mermaid">
flowchart TD
    A["address"] --> B["set_index = (addr >> 3) & 0b111"]
    B --> C["set_base = set_index * num_ways"]
    C --> D["loop ways: compare tags"]
    D -->|match| E["HIT, update timestamp"]
    D -->|no match| F["MISS"]
    F --> G{"unused way\nin set?"}
    G -->|yes| H["fill empty way"]
    G -->|no| I["evict LRU way\n(smallest timestamp)"]
</div>

---

## Set-Associative: Code Sketch

8 sets, 4 ways, 1-word blocks:

```c
uint64_t addr_tag  = addr >> 5;            // drop offset(2) + set-idx(3)
uint64_t set_index = (addr >> 2) & 0b111;  // 3 bits for 8 sets
uint64_t set_base  = set_index * num_ways;

for (int i = 0; i < num_ways; i++) {
    struct cache_slot_st *slot = &cache[set_base + i];
    if (slot->valid && slot->tag == addr_tag) {
        slot->timestamp = num_refs;  // update LRU counter
        return slot->data;           // HIT
    }
}
// MISS: find empty or LRU way in this set, fill it
```

**LRU replacement**: evict the way with the smallest timestamp. Prefer invalid ways first.

---

## Key Concepts Reference

| Concept | Formula / Idiom |
|---------|-----------------|
| Target address | `regs[rs1] + sign_extend(imm, 11)` |
| I-type immediate | single field bits [31:20] |
| S-type immediate | `(imm11_5 << 5) \| imm4_0` |
| Sign extension | `((int64_t)(v << shift)) >> shift` |
| Slot index (direct) | `(addr >> 2) & mask` |
| Tag (4-slot) | `addr >> 4` |
| Block base (mask) | `addr & ~0xF` (4-word block) |
| Set index (8 sets) | `(addr >> 2) & 0b111` |
| LRU eviction | smallest `timestamp` in set |

---

## Practice: Direct-Mapped Slot and Tag

**Q:** 4-slot, 1-word-block cache. What slot and tag does byte address **36** map to?

```text
addr_word  = 36 / 4 = 9
slot_index = 9  % 4 = 1       (or (36 >> 2) & 0b11 = 1)
tag        = 36 >> 4 = 2      (drop 2 offset + 2 index bits)
```

**Answer:** slot **1**, tag **2**.

<div class="info-box">
Byte 4 → slot 1 tag 0, byte 20 → slot 1 tag 1, byte 36 → slot 1 tag 2 — all share slot 1.
</div>

---

## Practice: S-type Immediate Reconstruction

**Q:** `imm[11:5] = 0b0000010`, `imm[4:0] = 0b00100`. What offset?

```text
vimm = (0b0000010 << 5) | 0b00100
     = 0b000001000000 | 0b000000000100
     = 0b000001000100
     = 0x44 = 68
```

Sign bit (bit 11) = 0, so the offset is **+68**.

The store accesses `mem[base + 68]`.

---

## Practice: Block Base Address

**Q:** 4-word-block cache misses on byte address **24**. Block base?

**Arithmetic:**
```text
addr_word = 6,  rem = 6 % 4 = 2,  block_start = 4 * 4 = 16
```

**Bit mask:**
```text
24 = 0b...0001 1000
24 & 0xFFFFFFFFFFFFFFF0 = 0b...0001 0000 = 16
```

Block base = **byte 16**. Block covers bytes 16, 20, 24, 28.

---

## Summary

1. **Load/store target address** = `regs[rs1] + sign_extend(imm, 11)` — always go through a real pointer dereference.

2. **I-type** has a single 12-bit immediate; **S-type** splits it across two fields — reassemble before sign-extending.

3. **Pointer cast selects width**: `uint32_t *` for `lw`/`sw`, `uint64_t *` for `ld`/`sd`.

4. **Sign extension, two's complement, masking** are the core bit-level tools for both immediates and cache indexing.

5. **Direct-mapped cache**: `slot = addr_word % #slots`; store tag + valid bit.

6. **Block size** amortizes miss cost via spatial locality; compute base by clearing low block-offset bits.

7. **Set-associative cache**: map to a set, search all ways, use LRU to evict. Project 04's most advanced cache.

8. **Project 04 simulates the I-cache** — read-only and word-aligned, the simplest case.