← Back to Course
# Project 04 and Midterm Review ## CS 315 Computer Architecture --- ## Session Overview
Two halves today:
Implementing load/store instructions in the emulator (Project 04)
Cache memory: direct-mapped, block size, set-associative
- Midterm is **Thursday** — one page of notes, front and back - Instruction formats may be printed on the exam - Emphasis on *reasoning*, not memorizing --- ## Midterm Study Checklist - Extract bit fields from an instruction word - Reconstruct I-type and S-type immediates - Apply **sign extension**, **two's complement**, **masking** - Implement `lw`/`sw`: target address + pointer cast - **Direct-mapped cache**: slot index, tag, valid bit - **Block size**: block-aligned base address arithmetic - **Set-associative cache**: sets, ways, LRU replacement
Good one-page sheet: I-type/S-type field layouts, sign-extend idiom, byte/word/block conversions, slot-index formula.
--- ## Load and Store: The Big Picture Assembly syntax — offset and base register are explicit: ```asm lw t0, 8(sp) # load word: t0 = mem[sp + 8] sw t0, 8(sp) # store word: mem[sp + 8] = t0 ``` **Target address** = base register + sign-extended immediate | Instruction | Width | Direction | |-------------|-------|-----------| | `lw` / `ld` | 32 / 64 bits | memory → register | | `sw` / `sd` | 32 / 64 bits | register → memory |
Do
not
index the emulated stack array directly — always compute an effective address and dereference through a real pointer.
--- ## I-type Format (Loads) ```text 31 20 19 15 14 12 11 7 6 0 +---------------+-------+------+------+--------+ | imm[11:0] | rs1 |funct3| rd | opcode | +---------------+-------+------+------+--------+ 12 bits 5 3 5 7 ``` - Single contiguous 12-bit immediate in bits [31:20] - `rs1` = base register, `rd` = destination register - Immediate is **signed** — must sign-extend bit 11 to 64 bits ```c int64_t imm = sign_extend(vimm, 11); // sign bit is bit 11 ``` --- ## Implementing `lw` ```c // 1. Sign-extend the 12-bit immediate int64_t imm = sign_extend(vimm, 11); // 2. Compute the target address uint64_t target_address = rsp->regs[rs1] + imm; // 3. Dereference — pointer type selects access width rsp->regs[rd] = *((uint32_t *) target_address); // lw: 4 bytes // rsp->regs[rd] = *((uint64_t *) target_address); // ld: 8 bytes ```
The
pointer cast
is what tells C how wide the access is. Changing
uint32_t *
to
uint64_t *
is the only difference between
lw
and
ld
.
--- ## S-type Format (Stores) ```text 31 25 24 20 19 15 14 12 11 7 6 0 +-----------+-------+-------+------+--------+--------+ | imm[11:5] | rs2 | rs1 |funct3|imm[4:0]| opcode | +-----------+-------+-------+------+--------+--------+ 7 bits 5 5 3 5 7 ``` - No `rd` — stores write to memory, not a register - Immediate is **split**: `imm[11:5]` in bits [31:25], `imm[4:0]` in bits [11:7] - Must **reassemble** before sign-extending ```c vimm = (imm11_5 << 5) | imm4_0; // reassemble 12-bit field int64_t imm = sign_extend(vimm, 11); ``` --- ## Implementing `sw` ```c // Compute target address (same as load) uint64_t target_address = rsp->regs[rs1] + imm; // Write register value into memory *((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2]; ``` **Load vs. store symmetry:** ```c // load word: register <- memory rsp->regs[rd] = *((uint32_t *) target_address); // store word: memory <- register *((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2]; ``` --- ## Sign Extension, Two's Complement, Masking **Two's complement** — negate by inverting all bits and adding one: ```text 3-bit example: 000=0 001=1 010=2 011=3 100=-4 101=-3 110=-2 111=-1 ``` **Sign extension** idiom — shift sign bit left, then arithmetic right: ```c int64_t sign_extend(uint64_t v, int sign_bit) { int shift = 63 - sign_bit; // for bit 11: shift = 52 return ((int64_t)(v << shift)) >> shift; } ``` **Masking** — shift field to bottom, AND with width mask: ```c uint32_t field = (instr >> 20) & 0xFFF; // extract bits [31:20] ``` --- ## Cache Memory: Why It Exists Memory is large and slow; registers are tiny and fast. A **cache** holds recently used data so most requests avoid slow main memory. **Locality principles:** - **Temporal locality** — used now → likely used again soon (loop index) - **Spatial locality** — touched one address → neighbors likely next (`array[i]` then `array[i+1]`)
flowchart LR CPU["CPU"] --> Cache["Cache\n(fast, small)"] Cache --> Mem["Main Memory\n(slow, large)"] style Cache fill:#bbf,stroke:#333 style Mem fill:#fdd,stroke:#333
--- ## Address Layers: Byte, Word, Block Memory is byte-addressed; we group bytes into words and words into blocks. ```text byte address bits: ... | tag | slot index | byte offset | (bits 1:0) ``` **Conversions:** ```c addr_word = addr / 4; // byte address -> word address addr = addr_word * 4; // word address -> byte address ``` - Word 0 = bytes 0–3, Word 1 = bytes 4–7, Word 2 = bytes 8–11, ... - Project 04 I-cache: addresses are **word-aligned** (byte offset always 0) --- ## Direct-Mapped Cache Every address maps to **exactly one slot**. For a 4-slot cache: ```c slot_index = addr_word % 4 // modulus = (addr >> 2) & 0b11; // equivalent bit mask ``` Many addresses share a slot — tell them apart with the **tag**: ```text word addr byte addr slot tag 0 0 0 0 4 16 0 1 <- same slot, different tag 8 32 0 2 <- same slot, different tag 1 4 1 0 ``` Each slot stores: `valid` bit + `tag` + `data` --- ## Cache Lookup: Direct-Mapped
flowchart LR A["byte address"] --> B["addr >> 2\n(drop byte offset)"] B --> C["& 0b11\n(slot index)"] C --> D{"valid == 1\nand tag matches?"} D -->|yes| E["HIT: return data"] D -->|no| F["MISS: fetch from memory,\nfill slot, set valid + tag"]
```c uint64_t addr_tag = addr >> 4; // drop offset(2) + index(2) bits uint64_t slot_index = (addr >> 2) & 0b11; struct cache_slot_st *slot = &cache[slot_index]; if (slot->valid && slot->tag == addr_tag) return slot->data; // HIT else { /* fetch, fill, return */ } // MISS ``` --- ## Scaling the Direct-Mapped Cache | # Slots | Index bits | Tag bits (64-bit addr) | |---------|------------|------------------------| | 4 | 2 | 60 | | 16 | 4 | 58 | | 64 | 6 | 56 | For 16 slots: ```c slot_index = (addr >> 2) & 0b1111; // 4 index bits addr_tag = addr >> 6; // drop offset(2) + index(4) ```
More slots = more index bits = fewer tag bits. The formula is always:
slot_index = addr_word % #slots
.
--- ## I-cache vs. D-cache Two distinct memory address streams in every processor:
flowchart LR PC["PC"] --> IC["I-Cache\n(instructions)"] LDST["ld / sd\n(data access)"] --> DC["D-Cache\n(data)"] IC --> MEM["Main Memory"] DC --> MEM style IC fill:#bbf,stroke:#333 style DC fill:#fbb,stroke:#333
- L1 is **split** (I-cache + D-cache) so the two streams don't interfere - L2/L3 are usually **unified** **Project 04 simulates the I-cache** — read-only and word-aligned (simpler!) --- ## Why I-cache Is Simpler | Feature | I-cache | D-cache | |---------|---------|---------| | Access type | Read-only | Read + Write | | Alignment | Always word-aligned | Byte, word, doubleword | | Write policy | None needed | Write-through or write-back | | Byte offset | Always zero | Variable |
No stores = no dirty-block tracking, no write-through/write-back policy to implement.
--- ## Cache Block Size Each slot can hold a **block** of several words to exploit spatial locality. ```c struct cache_slot_st { uint64_t tag; uint32_t block[4]; // 4 words per block uint32_t valid; uint64_t timestamp; }; ``` Address split with 4-word blocks (4-slot cache): ```text bits: ... | tag | slot(2) | block-word(2) | byte(2) | ``` ```c slot_index = (addr >> 4) & 0b11; // skip offset(2)+block(2) block_index = (addr >> 2) & 0b11; // which word in the block ``` --- ## Block Size: Spatial Locality Payoff
flowchart TD A["access w0 (MISS)"] --> B["fetch entire block:\nw0, w1, w2, w3"] B --> C["w1, w2, w3 are now HITS"] C --> D["3 free hits from 1 miss!"]
**Example** — `addr = 16` in a 4-word-block, 4-slot cache: ```text addr_word = 16 / 4 = 4 addr_block = 4 / 4 = 1 slot_index = 1 % 4 = 1 ``` One miss on word 0 of the block prefetches words 1–3. --- ## Finding the Block Base Address on a Miss Must load the **entire block** — need the first word's address. **Arithmetic approach** (for `addr = 24`, 4-word block): ```text addr_word = 6 rem = 6 % 4 = 2 (block index = which word within block) block_start_word = 6 - 2 = 4 block_start_byte = 4 * 4 = 16 ``` **Bit-mask approach** — clear the low 4 bits (4-word block = 16 bytes): ```c // (a) shift out and back addr_block_start = (addr >> 4) << 4; // (b) AND mask (zeros low 4 bits) addr_block_start = addr & 0xFFFFFFFFFFFFFFF0; ``` Both give **byte 16** as the block base for byte address 24. --- ## Loading the Full Block Once you have the block base address, read all 4 words: ```c uint32_t block[4]; block[0] = *((uint32_t *) addr); // word at byte 16 block[1] = *((uint32_t *) (addr + 4)); // word at byte 20 block[2] = *((uint32_t *) (addr + 8)); // word at byte 24 block[3] = *((uint32_t *) (addr + 12)); // word at byte 28 ``` Store the block in the cache slot, set `valid = 1` and record the `tag`. --- ## Set-Associative Cache **Direct-mapped**: one home per address — collisions force eviction even when other slots are free. **Set-associative**: addresses map to a **set**; any **way** within the set may hold the data. ```text way 1 way 0 +-------------+ +-------------+ set 1 | v tag data | | v tag data | +-------------+ +-------------+ set 0 | v tag data | | v tag data | +-------------+ +-------------+ ``` - Map address to a **set** (direct-mapped style) - Compare tag against **all ways** in that set (small loop) --- ## Set-Associative Lookup
flowchart TD A["address"] --> B["set_index = (addr >> 3) & 0b111"] B --> C["set_base = set_index * num_ways"] C --> D["loop ways: compare tags"] D -->|match| E["HIT, update timestamp"] D -->|no match| F["MISS"] F --> G{"unused way\nin set?"} G -->|yes| H["fill empty way"] G -->|no| I["evict LRU way\n(smallest timestamp)"]
--- ## Set-Associative: Code Sketch 8 sets, 4 ways, 1-word blocks: ```c uint64_t addr_tag = addr >> 5; // drop offset(2) + set-idx(3) uint64_t set_index = (addr >> 2) & 0b111; // 3 bits for 8 sets uint64_t set_base = set_index * num_ways; for (int i = 0; i < num_ways; i++) { struct cache_slot_st *slot = &cache[set_base + i]; if (slot->valid && slot->tag == addr_tag) { slot->timestamp = num_refs; // update LRU counter return slot->data; // HIT } } // MISS: find empty or LRU way in this set, fill it ``` **LRU replacement**: evict the way with the smallest timestamp. Prefer invalid ways first. --- ## Key Concepts Reference | Concept | Formula / Idiom | |---------|-----------------| | Target address | `regs[rs1] + sign_extend(imm, 11)` | | I-type immediate | single field bits [31:20] | | S-type immediate | `(imm11_5 << 5) \| imm4_0` | | Sign extension | `((int64_t)(v << shift)) >> shift` | | Slot index (direct) | `(addr >> 2) & mask` | | Tag (4-slot) | `addr >> 4` | | Block base (mask) | `addr & ~0xF` (4-word block) | | Set index (8 sets) | `(addr >> 2) & 0b111` | | LRU eviction | smallest `timestamp` in set | --- ## Practice: Direct-Mapped Slot and Tag **Q:** 4-slot, 1-word-block cache. What slot and tag does byte address **36** map to? ```text addr_word = 36 / 4 = 9 slot_index = 9 % 4 = 1 (or (36 >> 2) & 0b11 = 1) tag = 36 >> 4 = 2 (drop 2 offset + 2 index bits) ``` **Answer:** slot **1**, tag **2**.
Byte 4 → slot 1 tag 0, byte 20 → slot 1 tag 1, byte 36 → slot 1 tag 2 — all share slot 1.
--- ## Practice: S-type Immediate Reconstruction **Q:** `imm[11:5] = 0b0000010`, `imm[4:0] = 0b00100`. What offset? ```text vimm = (0b0000010 << 5) | 0b00100 = 0b000001000000 | 0b000000000100 = 0b000001000100 = 0x44 = 68 ``` Sign bit (bit 11) = 0, so the offset is **+68**. The store accesses `mem[base + 68]`. --- ## Practice: Block Base Address **Q:** 4-word-block cache misses on byte address **24**. Block base? **Arithmetic:** ```text addr_word = 6, rem = 6 % 4 = 2, block_start = 4 * 4 = 16 ``` **Bit mask:** ```text 24 = 0b...0001 1000 24 & 0xFFFFFFFFFFFFFFF0 = 0b...0001 0000 = 16 ``` Block base = **byte 16**. Block covers bytes 16, 20, 24, 28. --- ## Summary 1. **Load/store target address** = `regs[rs1] + sign_extend(imm, 11)` — always go through a real pointer dereference. 2. **I-type** has a single 12-bit immediate; **S-type** splits it across two fields — reassemble before sign-extending. 3. **Pointer cast selects width**: `uint32_t *` for `lw`/`sw`, `uint64_t *` for `ld`/`sd`. 4. **Sign extension, two's complement, masking** are the core bit-level tools for both immediates and cache indexing. 5. **Direct-mapped cache**: `slot = addr_word % #slots`; store tag + valid bit. 6. **Block size** amortizes miss cost via spatial locality; compute base by clearing low block-offset bits. 7. **Set-associative cache**: map to a set, search all ways, use LRU to evict. Project 04's most advanced cache. 8. **Project 04 simulates the I-cache** — read-only and word-aligned, the simplest case.