← Back to Course
# Lab 06: Cache Analysis ## CS 315 Computer Architecture --- ## Agenda - Lab 06 Q&A: `sign_extend()` fix - Project 04: `call` vs `jal` - Dynamic analysis: counting instructions - Why caches exist - SRAM vs DRAM - Hits, misses, and rates - Locality of reference - The three cache questions - Direct-mapped, fully-associative, set-associative caches - Address breakdown and worked examples --- ## Lab 06: `sign_extend()` Fix The shipped code and lecture derivation disagreed by **one bit**: ```c // Lecture-aligned version (change 64 -> 63) int64_t sign_extend(uint64_t value, int bits) { int shift = 63 - bits; return ((int64_t)(value << shift)) >> shift; } ``` **How it works:** shift sign bit up to bit 63, then arithmetic `>>` replicates it down.
Convention:
if
bits
is the index of the MSB (0-based), use
63 - bits
. If it is the count of meaningful bits, use
64 - bits
. Pick one and use it everywhere.
--- ## Project 04: `call` vs `jal` `call` is a **pseudo-instruction** — the assembler expands it into two real instructions: ```text call foo ==> auipc ra, %pcrel_hi(foo) jalr ra, ra, %pcrel_lo(foo) ``` For small programs (target within ~1 MiB), use `jal` directly: ```text jal ra, foo # ra = pc+4; pc = pc+offset (one instruction) ret # jalr x0, ra, 0 ```
flowchart TD A["Need a function call?"] --> B{"Target within jal range?"} B -- "Yes" --> C["jal ra, label\n(1 instruction)"] B -- "No" --> D["call label\nauipc + jalr"]
--- ## Dynamic Analysis: Counting Instructions **Dynamic** = measured *during* execution (loop body running 1000× contributes 1000). ```c typedef struct { uint64_t total; uint64_t itype; // I-type (addi, lw, jalr, ...) uint64_t rtype; // R-type (add, sub, ...) uint64_t load; // ld, lw, lb, ... uint64_t store; // sd, sw, sb, ... uint64_t branch; // beq, bne, blt, ... } analysis_st; ```
Keep instrumentation in one place — bump counters right after decode, not scattered across every opcode case.
--- ## Dynamic Analysis: Emulator Loop
flowchart LR A["Decode instruction"] --> B["Execute instruction"] B --> C["analyze: bump counters"] C --> D{"halt?"} D -- "no" --> A D -- "yes" --> E["Print analysis report"]
`fibrec 10` executes **1,675 total instructions** — heavy stack traffic (`sd`/`ld`) from recursion, a very different profile from the sort routine. --- ## Why Caches Exist: The CPU–Memory Gap **Moore's Law** doubled transistor count every ~2 years → CPUs got fast. DRAM latency improved far more slowly → widening gap. ```text speed ^ | CPU . ' . | . ' . } gap | . ' . | . ' . _________________________ DRAM +-------------------------------------------------> time ``` Without a cache, every load/store stalls waiting on slow DRAM. A **cache** keeps recently used data in fast SRAM close to the CPU. --- ## SRAM vs DRAM | Property | SRAM | DRAM | |----------|------|------| | Speed | Fast | Slow | | Cost per bit | Expensive | Cheap | | Density | Low (~6 transistors/bit) | High (1T + 1C) | | Refresh needed? | No | Yes | | Typical use | Cache (on-chip) | Main memory |
Why not build all memory from SRAM? Cost and chip area — gigabytes of SRAM would be enormous and prohibitively expensive.
--- ## The Cache in the Memory Hierarchy ```text processor CACHE ($$) Main Memory (DRAM) +---------+ addr +----------+ miss +------------------+ | | -----> | | -----> | CODE / DATA | | | <----- | fast $$ | <----- | | +---------+ data +----------+ +------------------+ ^ | hit: served fast from SRAM ``` - **Hit** — data already in cache; return immediately (fast) - **Miss** — fetch from DRAM, install in cache, return (slow — but cached for next time) --- ## Hits, Misses, and Rates Every memory reference is classified as a **hit** or a **miss**: ```text hit_rate = num_hits / num_refs miss_rate = num_misses / num_refs hit_rate + miss_rate = 1 ```
Real caches achieve hit rates above
97%
because programs obey
locality of reference
.
**Example:** 1,000 references, 30 misses → miss rate = 3%, hit rate = 97%, hits = 970 --- ## Locality of Reference **Temporal locality** — recently used data will be used again soon ```c int sum = 0; for (int i = 0; i < n; i++) sum += a[i]; // sum and i reused every iteration ``` **Spatial locality** — neighbors of used data will be used soon ```c for (int i = 0; i < n; i++) a[i] = i; // a[0], a[1], a[2], ... are adjacent ``` Sequential instructions also show spatial locality: after `PC`, the CPU usually fetches `PC+4`. --- ## The Three Cache Questions
flowchart TD A["Memory reference: addr"] --> Q1["Q1: Where could it be?\n(compute index / placement)"] Q1 --> Q2["Q2: Is the data really there?\n(check valid bit + compare tag)"] Q2 -- "valid && tag matches" --> H["HIT: return data"] Q2 -- "otherwise" --> M["MISS: fetch from memory"] M --> Q3["Q3: Which entry to evict?\n(replacement policy)"] Q3 --> F["Install data, set tag + valid, return"]
--- ## Address Breakdown: Byte vs Word Memory is byte-addressed; for 1-word blocks, drop the low 2 bits: ```text 31 ............. 2 | 1 0 [ word address | bo ] ^^ byte offset (00 for word-aligned) ``` ```c addr_word = addr >> 2; // byte addr -> word addr addr_byte = addr_word << 2; // word addr -> byte addr ``` The **tag** holds the high bits that identify *which* address is in a slot. The **valid bit** distinguishes "empty slot" from "slot containing zero." --- ## Direct-Mapped Cache Every address maps to **exactly one slot** — the slot index is the word address mod the number of slots. ```c addr_word = addr >> 2; slot_index = addr_word & 0b11; // for 4 slots: low 2 bits addr_tag = addr >> 4; // bits above offset + index ``` Address breakdown (4 slots, 1-word block): ```text tag[63:4] slot_index[3:2] byte_offset[1:0] ``` **Strength:** one comparison, no replacement logic **Weakness:** conflict misses — two hot addresses sharing a slot evict each other --- ## Direct-Mapped: Lookup Code ```c struct cache_slot_st { uint64_t tag; uint32_t data; uint32_t valid; }; struct cache_slot_st cache[4]; // lookup int slot_index = (addr >> 2) & 0b11; uint64_t addr_tag = addr >> 4; struct cache_slot_st *slot = &cache[slot_index]; if (slot->valid == 1 && slot->tag == addr_tag) { return slot->data; // HIT } else { slot->data = *(uint32_t *)addr; // MISS: fetch slot->tag = addr_tag; slot->valid = 1; return slot->data; } ```
Common mistake: comparing the full byte address to the tag. The tag is only the
high
bits after removing offset and index.
--- ## Worked Example: Direct-Mapped Trace 4-slot direct-mapped cache, word address stream: `0, 1, 2, 3, 4, 0` | Step | Word addr | Slot | Tag | Hit/Miss | |------|-----------|------|-----|----------| | 1 | 0 | 0 | 0 | MISS (empty) | | 2 | 1 | 1 | 0 | MISS (empty) | | 3 | 2 | 2 | 0 | MISS (empty) | | 4 | 3 | 3 | 0 | MISS (empty) | | 5 | 4 | 0 | 1 | MISS (evicts word 0) | | 6 | 0 | 0 | 0 | MISS (evicts word 4) | 6 refs, 0 hits → **hit rate 0%** Steps 5 & 6 are **conflict misses**: words 0 and 4 share slot 0 and keep evicting each other. --- ## Fully-Associative Cache An address may live in **any** slot — no index, just tag + byte offset: ```text tag[63:2] byte_offset[1:0] ``` ```c addr_tag = addr >> 2; for (int i = 0; i < 4; i++) { // search ALL slots if (cache[i].valid && cache[i].tag == addr_tag) { cache[i].timestamp = num_refs; return cache[i].data; // HIT } } // MISS: pick victim via LRU, install ``` **Strength:** no conflict misses **Weakness:** must compare against every slot (expensive in hardware) --- ## LRU Replacement Track recency with a **logical timestamp** — increment on every reference, stamp each slot on access. ```c struct cache_slot_st *find_lru_slot(struct cache_slot_st *cache) { for (int i = 0; i < 4; i++) if (cache[i].valid == 0) return &cache[i]; // empty first int lru = 0; for (int i = 1; i < 4; i++) if (cache[i].timestamp < cache[lru].timestamp) lru = i; return &cache[lru]; // smallest timestamp = least recently used } ```
Hardware uses approximate LRU (too expensive for exact). In our software simulator we can afford exact LRU.
--- ## Set-Associative Cache The practical compromise: map an address to a **set** (like direct-mapped), but allow it in any **way** within that set. ```text num_sets = total_slots / num_ways ``` For 32 slots, 4 ways: 8 sets, 3 set-index bits: ```text tag[63:5] set_index[4:2] byte_offset[1:0] ``` ```c uint64_t addr_tag = addr >> 5; int set_index = (addr >> 2) & 0b111; int set_base = set_index * num_ways; // first slot of the set // search only cache[set_base .. set_base + num_ways - 1] ``` --- ## Set-Associative: Lookup Flow
flowchart TD A["addr -> set_index, addr_tag"] --> B["Load set: cache[set_base .. set_base+ways-1]"] B --> C{"Any slot: valid && tag == addr_tag?"} C -- "yes" --> D["HIT: update timestamp, return data"] C -- "no" --> E["MISS: find LRU in set, install, return"]
LRU replacement is scoped to the set — only compare timestamps within the set's ways. --- ## Three Organizations Compared
flowchart LR A["Direct-mapped\n1 slot per address"] --> B["Set-associative\nN ways per set"] B --> C["Fully associative\nany slot (1 set)"] A -. "more flexibility / cost" .-> C
| | Direct-mapped | Set-associative | Fully-associative | |--|--|--|--| | Q1 (placement) | 1 slot | 1 set, N ways | anywhere | | Q2 (tag check) | 1 comparison | N comparisons | all-slot scan | | Q3 (eviction) | no choice | LRU in set | LRU global | | Conflict misses | yes | reduced | none | --- ## Block Size and Spatial Locality Real caches fetch a **block** of adjacent words per miss: ```c struct cache_slot_st { uint64_t tag; uint32_t block[4]; // 4 words per block uint32_t valid; uint64_t timestamp; }; ``` Address with 4-word block, 4 slots (direct-mapped): ```text tag[63:6] slot_index[5:4] block_index[3:2] byte_offset[1:0] ```
Larger blocks exploit spatial locality (array streaming). Blocks that are too large waste bandwidth on data never used — block size is a tuning knob.
--- ## Address Breakdown Summary For a **direct-mapped** cache with `S` slots and `B`-word blocks (word-aligned): ```text | tag (high bits) | slot_index (log2 S bits) | block_index (log2 B bits) | byte_offset (2 bits) | ``` For **set-associative** with `S` sets: ```text | tag (high bits) | set_index (log2 S bits) | block_index (log2 B bits) | byte_offset (2 bits) | ``` **Example:** 8 slots, 1-word block, direct-mapped (64-bit addr) ```text tag[63:5] slot_index[4:2] byte_offset[1:0] slot_index = (addr >> 2) & 0b111 addr_tag = addr >> 5 ``` --- ## The Valid Bit: Why It Matters After power-up the cache is all zeros. Without a valid bit: - Slot has `tag = 0`, `data = 0` - Request with `addr_tag = 0` matches — **false hit!** The valid bit is initialized to `0` and only set to `1` after a real fill: ```c if (slot->valid == 1 && slot->tag == addr_tag) { // safe hit — data was actually loaded } ```
A valid bit distinguishes "empty slot initialized to zero" from "slot containing the actual value zero."
--- ## Conflict Miss Demo Stream `0, 4, 0, 4` through a 4-slot direct-mapped cache: Both word 0 and word 4 map to **slot 0** (`0 % 4 == 0`, `4 % 4 == 0`). | Ref | Word | Slot | Tag | Result | |-----|------|------|-----|--------| | 1 | 0 | 0 | 0 | MISS | | 2 | 4 | 0 | 1 | MISS (evicts 0) | | 3 | 0 | 0 | 0 | MISS (evicts 4) | | 4 | 4 | 0 | 1 | MISS (evicts 0) | **Hit rate: 0%** — "ping-pong" conflict miss. A 2-way set-associative cache keeps both in different ways → refs 3 & 4 are hits (50%). --- ## LRU Example: Eviction Decision Fully-associative cache, 4 slots, all valid: | Slot | Timestamp | |------|-----------| | 0 | 12 | | 1 | 7 | | 2 | 20 | | 3 | 15 | Miss occurs at `num_refs = 21`. LRU evicts **slot 1** (smallest timestamp = 7). After install: slot 1 gets `timestamp = 21`. --- ## Key Concepts Recap | Concept | Key Idea | |---------|----------| | Cache (`$$`) | Small fast SRAM between CPU and DRAM | | Hit / Miss | Data present / absent in cache | | Hit rate | `hits / refs`; real programs: >97% | | Temporal locality | Reuse recently accessed data | | Spatial locality | Access neighbors; use blocks | | Tag | High address bits identifying slot content | | Valid bit | Distinguishes empty from zero-data | | Direct-mapped | `slot = word_addr % num_slots` | | Set-associative | Set by index; LRU within set | | Fully associative | Any slot; LRU global | | LRU | Evict slot with smallest timestamp | --- ## Summary 1. **`sign_extend()` fix** — align the shift constant with your bit-numbering convention (`63 - bits` or `64 - bits`); pick one and be consistent. 2. **`jal` vs `call`** — use `jal` in Project 04 to avoid implementing `auipc`; all targets are within ±1 MiB. 3. **Dynamic analysis** — count instruction classes (I-type, R-type, load, store, branch) in one place during the execute loop. 4. **CPU–memory gap** — Moore's Law sped up CPUs; DRAM lagged; SRAM caches hide the latency. 5. **SRAM vs DRAM** — fast/expensive/sparse vs slow/cheap/dense; cache gives the best of both. 6. **Three cache questions** — placement (index), identification (valid + tag), replacement (LRU). 7. **Three organizations** — direct-mapped (simple, conflict misses), fully associative (flexible, expensive), set-associative (practical compromise).