← Back to Course
# Cache Simulation ## CS 315 Computer Architecture --- ## Why Caches Exist - CPUs execute far faster than DRAM can supply data - A **cache** is small, fast SRAM between the CPU and main memory - It holds recently used data so repeat accesses are fast
flowchart LR P["Processor"] -->|"addr (request)"| C["Cache"] C -->|"data"| P C -->|"addr (on miss)"| M["Main Memory"] M -->|"data"| C
--- ## Hits and Misses - **Hit** — data is in the cache; returned immediately - **Miss** — data not in cache; must fetch from main memory (slow) - **Memory reference** = one request from processor to cache ```text num_refs = num_hits + num_misses ```
Caches appear everywhere: OS page caches, database buffer pools, browser caches, CDN edge nodes — all exploit the same principle.
--- ## Hit Rate and Miss Rate ```text hit_rate = num_hits / num_refs miss_rate = num_misses / num_refs hit_rate + miss_rate = 1 ``` **Worked example**: 1000 refs, 850 hits ```text num_misses = 150 hit_rate = 850 / 1000 = 0.85 (85%) miss_rate = 150 / 1000 = 0.15 (15%) ```
Real caches achieve hit rates above 95% — programs are not random!
--- ## Principles of Locality | Principle | Statement | Example | |-----------|-----------|---------| | **Temporal** | Recently used data will be used again soon | Loop counter `i`; loop body instructions | | **Spatial** | Nearby addresses will be accessed soon | `a[i]` then `a[i+1]`; `PC` then `PC+4` | - Temporal locality: keep recently accessed data in the cache - Spatial locality: on a miss, fetch a whole **block** of neighboring data
Project 4 simulates the
instruction fetch
reference stream — a nearly perfect example of spatial locality.
--- ## Addresses, Words, and Blocks - Memory is byte-addressed: each byte has a unique address - A RISC-V **word** = 32 bits = 4 bytes (word-aligned) ```c addr_word = addr >> 2; // byte address -> word index addr = addr_word << 2; // word index -> byte address ``` - The low 2 bits (lost by `>> 2`) are the **byte offset** - A **block** = several consecutive words fetched together on a miss ```text block 0 -> words 0..3 (bytes 0..15) block 1 -> words 4..7 (bytes 16..31) block 2 -> words 8..11 (bytes 32..47) ``` --- ## Direct-Mapped Cache Every address maps to **exactly one** slot: ```c slot_index = addr_word % 4; // 4-slot cache ``` Because 4 is a power of two, use a bitmask instead: ```c slot_index = (addr >> 2) & 0b11; // drop byte bits, keep low 2 ``` | Word addr | Slot (`% 4`) | |-----------|-------------| | 0, 4, 8, 12 | 0 | | 1, 5, 9, 13 | 1 | | 2, 6, 10, 14 | 2 | | 3, 7, 11, 15 | 3 | --- ## Slot Struct: Tag and Valid Bit Many addresses map to the same slot — need extra bookkeeping: - **Tag** — upper address bits identifying *which* block is stored - **Valid bit** — 0 at power-up ("empty"), 1 after a fill ```c struct cache_slot_st { uint64_t tag; uint32_t data; // one word (block size = 1) uint32_t valid; }; ```
Without the valid bit, a zeroed cache would falsely "hit" on address 0 before anything was loaded.
--- ## Address Breakdown: 4 Slots, Block Size 1 ```text 63 ............. 4 | 3 2 | 1 0 tag | index | byte offset ``` - `bits [1:0]` — byte offset (which byte in the word) - `bits [3:2]` — slot index (which of the 4 slots) - `bits [63:4]` — tag ```c addr_tag = addr >> 4; slot_index = (addr >> 2) & 0b11; ``` As cache size grows, **more bits go to index, fewer remain for tag**. --- ## Direct-Mapped Lookup ```c uint32_t dm_lookup(uint64_t addr) { uint64_t addr_tag = addr >> 4; uint32_t slot_index = (addr >> 2) & 0b11; struct cache_slot_st *slot = &cache[slot_index]; if (slot->valid == 1 && slot->tag == addr_tag) { return slot->data; // HIT } else { slot->data = *((uint32_t *) addr); // MISS: fetch from memory slot->tag = addr_tag; slot->valid = 1; return slot->data; } } ``` --- ## Lookup Flow
flowchart TD A["addr"] --> B["index = (addr >> 2) & mask"] A --> C["tag = addr >> 4"] B --> D["slot = cache[index]"] C --> E{"slot.valid == 1 AND slot.tag == tag?"} D --> E E -->|yes| F["HIT: return slot.data"] E -->|no| G["MISS: load from memory, update slot"]
--- ## Direct-Mapped: Strength and Weakness
**Strengths** - Simple and fast - One slot to check - No search required
**Weakness** - **Conflict misses**: two hot addresses sharing a slot thrash each other - Even if the rest of the cache is empty
Example: word 0 and word 4 both map to slot 0. Alternating accesses cause a miss on
every
reference.
--- ## Fully Associative Cache A block can go in **any** slot — eliminates conflict misses. Lookup must check **every** slot's tag (fan-out search):
flowchart TD A["addr tag"] --> C0["= slot0.tag?"] A --> C1["= slot1.tag?"] A --> C2["= slot2.tag?"] A --> C3["= slot3.tag?"] C0 --> H["hit?"] C1 --> H C2 --> H C3 --> H
In software: a loop. In hardware: one comparator per slot — expensive. --- ## Fully Associative: Address and Slot Address breakdown (no index field): ```text 63 ................. 2 | 1 0 tag | byte offset ``` Slot struct adds a **timestamp** for LRU replacement: ```c struct cache_slot_st { uint64_t tag; uint32_t data; uint32_t valid; uint64_t timestamp; // logical time of last use }; ``` --- ## LRU Replacement Policy On a miss, `find_lru_slot()`: 1. **Prefer an unused slot** — scan for `valid == 0` 2. **Otherwise evict LRU** — slot with the smallest `timestamp` The **timestamp** is a logical clock: it equals `num_refs` at the time the slot was last accessed. ```c // On every hit or fill: slot->timestamp = num_refs; // stamp with current ref count ```
Smallest timestamp = longest time since last use = least recently used.
--- ## Fully Associative Lookup ```c uint32_t fa_lookup(uint64_t addr) { num_refs += 1; uint64_t addr_tag = addr >> 2; for (int i = 0; i < N; i++) { struct cache_slot_st *slot = &cache[i]; if (slot->valid == 1 && slot->tag == addr_tag) { slot->timestamp = num_refs; // HIT: restamp return slot->data; } } // MISS: find LRU slot and fill struct cache_slot_st *slot = find_lru_slot(cache); slot->data = *((uint32_t *) addr); slot->tag = addr_tag; slot->valid = 1; slot->timestamp = num_refs; return slot->data; } ``` --- ## Set-Associative Cache The practical middle ground: - Cache is divided into **sets** - An address maps to exactly **one set** (like direct-mapped) - Within that set, a block can go in **any way** (like fully associative) ```text num_sets = total_slots / num_ways ``` A **4-way** cache: 4 slots per set. Direct-mapped = 1-way. Fully associative = N-way (one set). --- ## Set-Associative Lookup
flowchart TD A["addr"] --> B["set_index = (addr >> 2) & set_mask"] A --> T["tag = upper bits"] B --> S["select set (ways 0..3)"] T --> S S --> H{"any way: valid AND tag match?"} H -->|yes| HIT["HIT: update timestamp"] H -->|no| MISS["MISS: fill LRU way in this set"]
--- ## Set-Associative: Address and Layout Address breakdown (8 sets, block size 1): ```text 63 ........ 5 | 4 3 2 | 1 0 tag | set index | byte offset ``` Flat array layout with 4 ways per set: ```c set_base = set_index * num_ways; // first slot of this set // slots set_base .. set_base+3 are the 4 ways ``` Search only `num_ways` slots — not all N! --- ## Set-Associative Lookup Code ```c uint32_t sa_lookup(uint64_t addr) { num_refs += 1; int num_ways = 4; uint64_t addr_tag = addr >> 5; // drop byte + set-index bits uint32_t set_index = (addr >> 2) & 0b111; // 3 bits for 8 sets uint32_t set_base = set_index * num_ways; for (int i = 0; i < num_ways; i++) { struct cache_slot_st *slot = &cache[set_base + i]; if (slot->valid == 1 && slot->tag == addr_tag) { slot->timestamp = num_refs; return slot->data; } } struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base); slot->data = *((uint32_t *) addr); slot->tag = addr_tag; slot->valid = 1; slot->timestamp = num_refs; return slot->data; } ``` --- ## Comparing the Three Organizations | Organization | Block placement | Lookup cost | Conflict misses | |--------------|----------------|-------------|-----------------| | Direct-mapped | 1 fixed slot | 1 compare | High | | Set-associative (n-way) | any way in 1 set | n compares | Low | | Fully associative | any slot | N compares | None |
Direct-mapped = 1-way set-associative.
Fully associative = N-way with 1 set.
Most real L1/L2 caches are 4- or 8-way set-associative.
--- ## Block Size Greater Than One Word Add a `block[]` array to the slot struct: ```c struct cache_slot_st { uint64_t tag; uint32_t block[4]; // 4 words per slot uint32_t valid; uint64_t timestamp; }; ``` On a **hit**, use the **word offset** to pick the right word: ```c uint32_t word_offset = (addr >> 2) & 0b11; return slot->block[word_offset]; ``` --- ## Address Breakdown with 4-Word Block **Direct-mapped, 4 slots, 4-word block:** ```text 63 ...... 6 | 5 4 | 3 2 | 1 0 tag | slot | word | byte | index | offset| offset ``` **Fully associative, 4-word block:** ```text 63 ...... 4 | 3 2 | 1 0 tag | word | byte | offset| offset ``` The 2 word-offset bits select which of the 4 words in the block to return. --- ## Loading a Block on a Miss ```c uint64_t addr_word = addr / 4; uint64_t block_index = addr_word % 4; // position within block uint64_t block_base = addr_word - block_index; // word addr of first word for (int i = 0; i < 4; i++) { uint64_t byte_addr = (block_base + i) * 4; slot->block[i] = *((uint32_t *) byte_addr); } slot->tag = /* tag bits of addr */; slot->valid = 1; ``` One miss fetches **all 4 words** — the next 3 accesses in the block will be hits. --- ## Spatial Locality Payoff For sequential instruction execution (`PC`, `PC+4`, `PC+8`, `PC+12`...): - Block size 1: every instruction is a miss (16 refs = 16 misses) - Block size 4: only the first word in each block misses ```text block [0-3]: miss, hit, hit, hit block [4-7]: miss, hit, hit, hit ... ```
Live demo: increasing block size from 1 to 4 words raised the direct-mapped hit rate from ~62% to ~85% on the same program.
--- ## Project 4 Cache Configurations | # | Type | Block size | Replacement | Status | |---|------|-----------|-------------|--------| | 1 | Direct-mapped | 1 word | overwrite | given | | 2 | Direct-mapped | 4 words | overwrite | you write | | 3 | 4-way set-associative | 1 word | LRU | you write | | 4 | 4-way set-associative | 4 words | LRU | you write | The emulator calls the cache on every instruction fetch (one ref per `PC`). --- ## Implementation Road Map 1. **Understand the given** direct-mapped, block-size-1 starter code fully 2. **Extend to block size 4**: add `block[4]`, word-offset extraction, block fill loop 3. **Add set-associative**: group slots into sets, search `num_ways` ways, add `find_lru_slot_in_set()` 4. **Track recency**: increment `num_refs` every lookup; stamp accessed slots with current count
The cache type is selected via an
enum
in
RVEMU.h
— one code base supports all four configurations.
--- ## Debugging Strategies Build and test **incrementally** — one configuration at a time. | Technique | What to do | |-----------|-----------| | **Custom test programs** | Small assembly loops that stress one behavior | | **GDB** | Breakpoints, single-step, inspect registers | | **printf instrumentation** | Print slot index, tag, valid, hit/miss on each ref | Verify hit/miss counts against expected values before moving to the next configuration. --- ## Key Address Field Summary | Cache type | Byte offset | Word/Block offset | Index | Tag | |------------|------------|-------------------|-------|-----| | DM, block=1, 4 slots | `[1:0]` | — | `[3:2]` | `[63:4]` | | DM, block=4, 4 slots | `[1:0]` | `[3:2]` | `[5:4]` | `[63:6]` | | FA, block=1 | `[1:0]` | — | — | `[63:2]` | | FA, block=4 | `[1:0]` | `[3:2]` | — | `[63:4]` | | SA 4-way, 8 sets, block=1 | `[1:0]` | — | `[4:2]` | `[63:5]` | --- ## LRU Eviction Example 2-way set, references (as tags): A, B, A, C | Ref | Tag | Action | Way 0 | Way 1 | |-----|-----|--------|-------|-------| | 1 | A | miss, fill way 0 | A (ts=1) | empty | | 2 | B | miss, fill way 1 | A (ts=1) | B (ts=2) | | 3 | A | **HIT**, restamp | A (ts=3) | B (ts=2) | | 4 | C | miss, evict LRU=B | A (ts=3) | C (ts=4) | B evicted because its timestamp (2) was smaller than A's (3). --- ## Summary 1. **Cache** = small fast SRAM between CPU and DRAM; exploits temporal and spatial locality 2. **Performance**: `hit_rate = hits / refs`; real caches achieve > 95% 3. **Address fields**: byte offset | word offset | index/set-index | tag 4. **Direct-mapped**: one slot per address — simple, fast, conflict-prone 5. **Fully associative**: any slot — flexible, expensive (search all) 6. **Set-associative**: any way in one set — practical middle ground; LRU eviction via logical timestamp 7. **Larger blocks** amortize miss cost over spatial neighbors; 4-word block raised hit rate from 62% to 85% 8. **Project 4**: implement configs 2–4 starting from the given DM/block-1 starter