Cache Simulation¶
Overview¶
This lecture introduces cache memory: a small, fast memory that sits between the processor and main memory and holds recently accessed data so that future requests can be served quickly. We study why caches work (the principles of temporal and spatial locality), how a memory address is decomposed into a tag, an index, and an offset, and how three classic cache organizations — direct-mapped, fully associative, and set-associative — locate data and decide what to evict. These ideas are the foundation for the instruction-cache simulator you build in Project 4, where you extend a given direct-mapped cache to a 4-word block size and implement a 4-way set-associative cache with LRU replacement.
Learning Objectives¶
- Explain what a cache is, why main memory is slow relative to the CPU, and how caches exploit temporal and spatial locality.
- Compute hit rate and miss rate from raw hit, miss, and reference counts.
- Decompose a 64-bit byte address into tag, index/set-index, block offset (word offset), and byte offset fields.
- Trace a direct-mapped cache lookup by hand, including valid-bit and tag checks, hits, and misses.
- Compare direct-mapped, fully associative, and set-associative caches in terms of flexibility and cost.
- Describe the role of the valid bit, the tag, and recency metadata (timestamps) in a cache slot.
- Implement an LRU replacement policy using a logical timestamp driven by the reference count.
- Explain how increasing block size improves the hit rate by exploiting spatial locality.
Prerequisites¶
- RISC-V machine code emulation and the
struct rv_statemodel (Lab 3). - Bitwise operators: shift (
<<,>>),AND(&),OR(|), and masking (Project 3 / key concepts). - Binary and hexadecimal number systems, and the relationship between bytes and words.
- C structs, arrays, and pointer dereferencing (
*((uint32_t *) addr)). - The fixed-size integer types
uint8_t,uint32_t, anduint64_t.
1. Why Caches Exist¶
A modern processor can execute instructions far faster than main memory (DRAM) can supply data. If every load and store had to wait for main memory, the CPU would spend most of its time stalled. A cache is a small block of fast memory — typically built from static RAM (SRAM) — placed between the processor and main memory. It keeps copies of recently used data so that repeat accesses are fast.
flowchart LR
P[Processor] -->|"addr (request)"| C[Cache]
C -->|data| P
C -->|"addr (on miss)"| M[Main Memory]
M -->|data| C
style C fill:#f9f,stroke:#333,stroke-width:2px
The processor sends the cache an address and expects a value back. There are two outcomes:
- Hit — the data for that address is already in the cache. The cache returns it immediately, no main-memory access required.
- Miss — the data is not in the cache. The cache must fetch it from main memory, store a copy, and then return it. This is slow.
The instructor drew this as a numbered data path: on a hit (path 1) data flows straight back from the cache to the processor. On a miss (paths 2 and 3) the cache first goes out to memory for the data (2), then forwards it back to the processor (3).
A memory reference (or "ref") is a single request from the processor to the cache. Each reference is either a hit or a miss, so:
Caches Are Everywhere¶
Although this course focuses on the cache memory inside a processor, the same idea appears throughout computer systems: operating-system page caches, database buffer pools, web browser caches, and CDN edge caches all keep recently or frequently used data in faster storage to avoid going back to a slow source.
2. Hit Rate and Miss Rate¶
The effectiveness of a cache is measured by its hit rate and miss rate. During a run we count three things: the number of references, the number of hits, and the number of misses. From those counts:
Because every reference is either a hit or a miss, the two rates are complements:
Our goal is to design caches with high hit rates. In practice, real caches are remarkably successful — hit rates above 95% (often above 97%) are common — because programs do not access memory randomly. They follow the principles of locality (Section 3).
Worked Example¶
Suppose a program makes 1000 memory references, of which 850 are hits.
num_refs = 1000
num_hits = 850
num_misses = 1000 - 850 = 150
hit_rate = 850 / 1000 = 0.85 (85%)
miss_rate = 150 / 1000 = 0.15 (15%)
Note that hit_rate + miss_rate = 0.85 + 0.15 = 1.00, as expected.
3. The Principles of Locality¶
Caches work because real program execution exhibits locality — accesses cluster in time and in space rather than being spread uniformly across memory.
| Principle | Statement | Code example |
|---|---|---|
| Temporal locality | If a location is accessed, it is likely to be accessed again soon. | A loop index i, a loop body, or a function called repeatedly. |
| Spatial locality | If a location is accessed, nearby locations are likely to be accessed soon. | Walking an array a[i], then a[i+1]; executing instruction at PC, then PC+4. |
Temporal locality explains why keeping a recently used value pays off: a loop counter or the instructions inside a loop body are touched again on the very next iteration.
Spatial locality explains why we should fetch a block of neighboring data on a miss rather than a single word. Instruction streams are an excellent example: unless the current instruction is a branch or jump, the next instruction executed is the one at PC + 4. The Project 4 simulator focuses on this instruction-fetch reference stream.
The simulator in Project 4 simulates the cache for instruction memory references only — there is no separate data cache. Instruction memory is effectively read-only, so we never have to write modified data back to memory.
4. Addresses, Words, and Blocks¶
A memory request is just an address. On modern machines addresses are byte addresses: each unique address names one byte (8 bits). You can picture all of memory as a giant array of bytes:
A RISC-V instruction or an int is a word — 32 bits, or 4 bytes. We assume the processor requests word values at word-aligned addresses (multiples of 4). With that assumption, memory becomes a sequence of words:
word 0 -> mem[0], mem[1], mem[2], mem[3]
word 1 -> mem[4], mem[5], mem[6], mem[7]
word 2 -> mem[8], mem[9], mem[10], mem[11]
...
Convert between byte addresses and word addresses by dividing or multiplying by 4:
addr_word = addr / 4; // byte address -> word index
addr = addr_word * 4; // word index -> byte address
Because 4 is a power of two, the same operations can be done with shifts:
The bottom two bits that are lost in the shift are the byte offset — they select a byte within a word. We only need them if we want to extract a specific byte from a word.
Memory as an Array of Blocks¶
To exploit spatial locality, a cache moves data in chunks called blocks. The block size is the number of bytes (or words) transferred between memory and the cache on a miss. If the block size is 4 words, then memory is logically an array of 4-word blocks:
block 0 -> words 0..3 (bytes 0..15)
block 1 -> words 4..7 (bytes 16..31)
block 2 -> words 8..11 (bytes 32..47)
...
The lecture diagram showed memory drawn as a column of bytes on the left, grouped into words in the middle, and grouped into 4-word blocks on the right. A single byte address therefore picks out a byte, a word, and a block all at once — which fields of the address you look at decides which level you mean.
5. Direct-Mapped Cache¶
A direct-mapped cache is the simplest organization: every memory address maps to exactly one slot in the cache. This makes the replacement policy trivial — if the data you want is not in the slot it maps to, you simply overwrite that slot.
The Mapping Function¶
Consider a cache that holds 4 words:
A position in the cache is a slot. To map a word address to a slot index, take the remainder modulo the number of slots:
This is a tiny hash function. With 4 slots, these word addresses all collide on slot 0: 0, 4, 8, 12, 16, .... These all map to slot 1: 1, 5, 9, 13, 17, .... The instructor worked two cases:
Modulo via Bit Masking¶
When the number of slots is a power of two, the modulo is just the low bits of the word address, so we can use a mask instead of a division:
Why does this work? Look at the low bits of consecutive word addresses:
dec bin slot (= low 2 bits)
0 00000 0
1 00001 1
2 00010 2
3 00011 3
4 00100 0
5 00101 1
6 00110 2
7 00111 3
8 01000 0
9 01001 1
10 01010 2
The low two bits are exactly addr_word % 4. The number of index bits grows with the cache size:
| Slots | Index bits | Mask |
|---|---|---|
| 4 | 2 | 0b11 |
| 8 | 3 | 0b111 |
| 16 | 4 | 0b1111 |
| 128 | 7 | 0b1111111 |
As the number of slots increases, more bits go to the index and fewer remain for the tag.
Going Straight from a Byte Address¶
We usually start with the byte address, not the word address. To get the slot index from a byte address, shift right by 2 (to drop the byte offset and form the word address) and then mask:
What Goes in a Slot: Tag and Valid Bit¶
Many addresses map to the same slot, so the slot alone cannot tell us which address it currently holds. We store extra bookkeeping with the data:
- The tag — the upper bits of the address that uniquely identify which of the colliding blocks is in the slot.
- The valid bit — whether the slot holds usable data at all.
Look again at the word addresses that map to slot 0 (0, 4, 8, 12). Their upper bits (000, 001, 010, 011) are all distinct — those upper bits are the tag.
A direct-mapped slot is a struct:
struct cache_slot_st {
uint64_t tag;
uint32_t data; // one word, for block size = 1
uint32_t valid;
};
struct cache_slot_st cache[N];
Why We Need the Valid Bit¶
On power-up, all of memory and the cache are zeroed. But a tag of 0 and data of 0 are perfectly legitimate values, so without extra information the cache would falsely "hit" on address 0 before anything was ever loaded. The valid bit solves this: it starts at 0, meaning "this slot is empty." Only after a real fill do we set it to 1.
Address Breakdown (4 slots, block size 1)¶
For a 64-bit byte address with 4 slots and a 1-word block:
- bits
[1:0]— byte offset (which byte within the word) - bits
[3:2]— slot index (which of the 4 slots) - bits
[63:4]— tag (everything above)
Direct-Mapped Lookup Pseudocode¶
This is the lookup the instructor wrote for a 4-slot, 1-word-block cache (it matches the given Project 4 starter code):
uint32_t dm_lookup(uint64_t addr) {
uint64_t addr_tag = addr >> 4; // drop 2 byte bits + 2 index bits
uint32_t index_mask = 0b11;
uint32_t slot_index = (addr >> 2) & index_mask;
struct cache_slot_st *slot = &cache[slot_index];
if (slot->valid == 1 && slot->tag == addr_tag) {
// hit
return slot->data;
} else {
// miss
slot->data = *((uint32_t *) addr); // fetch the word from memory
slot->tag = addr_tag;
slot->valid = 1;
return slot->data;
}
}
The flow of a single lookup:
flowchart TD
A["addr"] --> B["index = (addr >> 2) & mask"]
A --> C["tag = addr >> (2 + index_bits)"]
B --> D["slot = cache[index]"]
C --> E{"slot.valid == 1<br/>AND slot.tag == tag?"}
D --> E
E -->|yes| F["HIT: return slot.data"]
E -->|no| G["MISS: load from memory,<br/>set tag, valid = 1, return data"]
style F fill:#bfb,stroke:#333
style G fill:#fbb,stroke:#333
Strength and weakness. Direct mapping is simple and fast — one slot to check, no search. But it is rigid: two hot addresses that map to the same slot will repeatedly evict each other (a conflict miss), even if the rest of the cache is empty.
6. Fully Associative Cache¶
A fully associative cache removes the rigidity of direct mapping: a block can be placed in any slot. This eliminates conflict misses, because an incoming block can always go to an empty (or least-recently-used) slot regardless of its address.
The price is search cost. Because there is no fixed slot for an address, a lookup must compare the request's tag against the tag in every slot. The lecture sketch showed the address tag fanning out to a comparator (=) on each slot in parallel, with all comparator outputs feeding into a single "hit?" check.
flowchart TD
A["addr tag"] --> C0["= slot0.tag?"]
A --> C1["= slot1.tag?"]
A --> C2["= slot2.tag?"]
A --> C3["= slot3.tag?"]
C0 --> H["hit?"]
C1 --> H
C2 --> H
C3 --> H
In software this fan-out becomes a loop over the slots; in hardware the comparisons happen in parallel but require a lot of logic (one comparator per slot), which is why fully associative caches are expensive and reserved for small, special structures such as TLBs (hardware support for virtual memory).
Address Breakdown¶
With no index, the address is just a tag and a byte offset:
Slot Structure and Lookup¶
We reuse the slot struct but add a timestamp field to track recency for replacement:
struct cache_slot_st {
uint64_t tag;
uint32_t data;
uint32_t valid;
uint64_t timestamp; // logical time of last use, for LRU
};
struct cache_slot_st cache[N];
uint32_t fa_lookup(uint64_t addr) {
num_refs += 1;
uint64_t addr_tag = addr >> 2; // drop byte offset bits
for (int i = 0; i < N; i++) {
struct cache_slot_st *slot = &cache[i];
if (slot->valid == 1 && slot->tag == addr_tag) {
// hit
slot->timestamp = num_refs; // mark as just used
return slot->data;
}
}
// miss
struct cache_slot_st *slot = find_lru_slot(cache);
slot->data = *((uint32_t *) addr);
slot->tag = addr_tag;
slot->valid = 1;
slot->timestamp = num_refs;
return slot->data;
}
Choosing a Victim: LRU¶
On a miss we need a slot for the incoming block. find_lru_slot() does two things, in order:
- Prefer an unused slot — scan for the first slot with
valid == 0and use it. - Otherwise evict the Least Recently Used (LRU) slot — the slot whose
timestampis smallest, meaning it has gone the longest without a reference.
The timestamp is a logical clock: it is simply the current num_refs value, which monotonically increases with every reference. Each time a slot is read or filled, we stamp it with the current reference count. The slot with the lowest stamp has not been touched in the longest time.
In hardware, a full per-slot timestamp and an exhaustive minimum search are too costly, so real designs use approximate LRU. For the simulator, the exact timestamp approach is fine.
7. Set-Associative Cache¶
A set-associative cache is the practical middle ground between rigid (direct-mapped) and expensive (fully associative). Think of the cache as an array of sets. An address maps to exactly one set using a direct-mapped-style index, but within that set the block may occupy any slot. The number of slots per set is the number of ways.
We name a cache by its associativity: a 4-way set-associative cache has 4 ways per set; an 8-way has 8. A direct-mapped cache is 1-way; a fully associative cache is N-way (one set).
flowchart TD
A["addr"] --> B["set_index = (addr >> 2) & set_mask"]
A --> T["tag = upper bits"]
B --> S["select set"]
S --> W0["way 0: valid? tag==?"]
S --> W1["way 1: valid? tag==?"]
S --> W2["way 2: valid? tag==?"]
S --> W3["way 3: valid? tag==?"]
T --> W0
T --> W1
T --> W2
T --> W3
W0 --> H{"any match?"}
W1 --> H
W2 --> H
W3 --> H
H -->|yes| HIT["HIT: update timestamp"]
H -->|no| MISS["MISS: fill LRU way in this set"]
style HIT fill:#bfb,stroke:#333
style MISS fill:#fbb,stroke:#333
Counting Sets¶
If the cache has N total slots and num_ways per set, then:
For example, 32 slots with 4 ways gives 32 / 4 = 8 sets, which needs 3 index bits (2^3 = 8).
Address Breakdown (block size 1)¶
Here the set index is 3 bits because there are 8 sets.
Laying Out Sets in a Flat Array¶
We can reuse one flat cache[] array and treat consecutive groups of num_ways slots as one set. With 4 ways, slots 0..3 are set 0, slots 4..7 are set 1, and so on. Once we have the set index, the first slot of that set is:
The instructor's diagram laid this out vertically: slots 0–1 were set 0 (way 0, way 1), slots 2–3 were set 1, and set_base pointed at the first way of the selected set.
Set-Associative Lookup Pseudocode¶
This is the instructor's pseudocode for a lookup. (Note: the shift/mask constants here are written for a small example with 2 ways and an 8-slot layout, illustrating the idea — adjust the masks and num_ways to match your actual configuration.)
uint32_t sa_lookup(uint64_t addr) {
num_refs += 1;
int num_ways = 4;
uint64_t addr_tag = addr >> 5; // drop byte offset + set index bits
uint32_t set_index = (addr >> 2) & 0b111; // low set-index bits (8 sets)
uint32_t set_base = set_index * num_ways; // first slot of this set
for (int i = 0; i < num_ways; i++) {
struct cache_slot_st *slot = &cache[set_base + i];
if (slot->valid == 1 && slot->tag == addr_tag) {
// hit
slot->timestamp = num_refs;
return slot->data;
}
}
// miss
struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base);
slot->data = *((uint32_t *) addr);
slot->tag = addr_tag;
slot->valid = 1;
slot->timestamp = num_refs;
return slot->data;
}
The key efficiency win: we search only the num_ways slots in one set, not all N slots. find_lru_slot_in_set() mirrors the fully associative version but is scoped to a single set: prefer an unused (valid == 0) way first, then evict the LRU way by smallest timestamp.
Comparing the Three Organizations¶
| Organization | Where a block can go | Lookup cost | Conflict misses | Used for |
|---|---|---|---|---|
| Direct-mapped | exactly 1 slot | 1 compare | high | simple caches |
| Set-associative (n-way) | any way in 1 set | n compares | low | most real L1/L2 caches |
| Fully associative | any slot | N compares | none | small structures (TLB) |
Direct-mapped and fully associative are the two extremes of set-associative: 1-way = direct-mapped, N-way (one set) = fully associative.
8. Block Size Greater Than One Word¶
So far each slot has held a single word. To exploit spatial locality we make each slot hold a block of several words. For a 4-word block:
struct cache_slot_st {
uint64_t tag;
uint32_t block[4]; // 4 words per slot
uint32_t valid;
uint64_t timestamp;
};
struct cache_slot_st cache[N];
A single tag now covers an entire 4-word block. On a hit we still must pick out the right word within the block, which needs a new address field: the block offset (the instructor also called it the word offset).
Address Breakdown with a 4-Word Block¶
The address now has four fields. The block-offset bits sit just above the byte offset:
Direct-mapped, 4 slots, 4-word block:
Fully associative, 4-word block:
Set-associative (2-way), 4-word block:
The 2 word-offset bits select which of the 4 words in the block to return (00=word 0, 01=word 1, 10=word 2, 11=word 3).
Hit and Miss Behavior with Blocks¶
The instructor summarized the two cases for block size greater than 1:
1. On a hit — get the word from the slot using the word offset:
uint32_t word_offset = (addr >> 2) & 0b11; // which word in the block
return slot->block[word_offset];
2. On a miss — find the block's starting address and load all N words:
uint64_t addr_word = addr / 4; // word index
uint64_t block_index = addr_word % 4; // 4 = block size in words
uint64_t block_base = addr_word - block_index; // word address of first word
for (int i = 0; i < 4; i++) {
uint64_t byte_addr = (block_base + i) * 4; // convert word -> byte address
slot->block[i] = *((uint32_t *) byte_addr);
}
slot->tag = /* tag bits of addr */;
slot->valid = 1;
block_base is a word address, so remember to multiply by 4 before dereferencing. (You can also do all of this with bit operations on the byte address directly — block_base_byte = addr & ~0xF for a 16-byte block.)
Why Bigger Blocks Help¶
Because instructions execute mostly in sequence (PC, PC+4, PC+8, ...), fetching a 4-word block on the first miss means the next three instruction fetches are hits. The first access still pays the miss cost, but it is amortized over the whole block.
The instructor demonstrated this live: increasing the direct-mapped block size from 1 word to 4 words raised the hit ratio from about 62% to 85% on the same program — a direct payoff from spatial locality.
flowchart LR
A["block size 1<br/>hit rate ~62%"] -->|"exploit spatial locality"| B["block size 4<br/>hit rate ~85%"]
style B fill:#bfb,stroke:#333
There is a limit, though: if blocks get too large relative to the cache, you fetch words you never use and you hold fewer distinct blocks, which can hurt. For Project 4, a 4-word block is the sweet spot we explore.
9. Project 4: The Cache Simulator¶
Project 4 extends your RISC-V emulator with dynamic analysis (instruction counts) and an instruction-cache simulator. The cache simulates only instruction-fetch references — every time the emulator fetches an instruction at the PC, it makes one reference to the cache.
You must implement four cache configurations:
| # | Type | Block size | Replacement | Status |
|---|---|---|---|---|
| 1 | Direct-mapped | 1 word | trivial (overwrite) | given |
| 2 | Direct-mapped | 4 words | trivial (overwrite) | you write |
| 3 | 4-way set-associative | 1 word | LRU | you write |
| 4 | 4-way set-associative | 4 words | LRU | you write |
Implementation Notes from Lecture¶
- The simulator selects the cache type via an enum in
RVEMU.h, so one code base supports both direct-mapped and set-associative modes. - Start from the given direct-mapped, block-size-1 cache and make sure you fully understand how its slot struct, tag, valid bit, and mapping function fit together before extending it.
- For block size 4, add a
block[4]array and a word-offset extraction; on a miss, computeblock_baseand load all four words. - For set-associative, group the flat slot array into sets of 4, search the ways in a set, and add
find_lru_slot_in_set()driven by atimestampupdated on every reference. - Track recency with a logical timestamp (the running reference count) — increment
num_refson every lookup and stamp each accessed slot with it.
Debugging Strategies (from the same session)¶
The session opened with debugging advice for the project and lab. Build and test incrementally rather than writing everything at once, and use these three techniques:
- Custom test programs — write small, targeted assembly/test inputs that exercise exactly one behavior (e.g., a tight loop to stress spatial locality).
- GDB — set breakpoints, single-step, inspect registers, and trace instruction execution to see exactly what the emulator does.
- Code instrumentation — add
printfstatements to print control flow and cache state (hits/misses, slot indexes, tags) so you can watch the cache fill and evict.
Key Concepts¶
| Concept | Definition | Example |
|---|---|---|
| Cache | Small fast memory holding recently used data from main memory | SRAM cache between CPU and DRAM |
| Hit / Miss | Requested data is / is not present in the cache | Hit returns data immediately; miss fetches from memory |
| Hit rate | num_hits / num_refs |
850 hits / 1000 refs = 0.85 |
| Temporal locality | Recently used data is likely reused soon | A loop counter i |
| Spatial locality | Neighbors of accessed data are likely accessed soon | PC then PC+4; a[i] then a[i+1] |
| Tag | Upper address bits identifying which block is in a slot | addr >> 4 for 4-slot, 1-word DM |
| Valid bit | Whether a slot holds usable data | 0 at power-up, 1 after a fill |
| Slot index | (addr >> 2) & mask selecting a direct-mapped slot |
(addr >> 2) & 0b11 for 4 slots |
| Block size | Number of words moved per miss | 4-word block = 16 bytes |
| Word offset | Address bits selecting a word within a block | bits [3:2] for a 4-word block |
| Associativity / way | Number of slots a block may occupy within a set | 4-way = 4 candidate slots per set |
| Set | Group of ways an address maps to | 32 slots / 4 ways = 8 sets |
| LRU | Evict the least recently used slot | smallest timestamp in the set |
| Eviction | Removing data to make room for new data | overwrite victim slot on a miss |
Practice Problems¶
Problem 1: Hit and Miss Rates¶
A program issues 2,000 memory references. The cache records 1,840 hits. What are the number of misses, the hit rate, and the miss rate?
Click to reveal solution
Problem 2: Direct-Mapped Slot and Tag¶
A direct-mapped cache has 4 slots and a block size of 1 word. For the byte address 0x44 (decimal 68), compute the byte offset, the slot index, and the tag.
Click to reveal solution
Problem 3: Trace a Direct-Mapped Cache¶
A 4-slot, 1-word-block direct-mapped cache starts empty (all valid bits 0). The processor references this sequence of word indexes: 0, 1, 2, 0, 4, 0. Mark each as a hit or miss and give the final cache contents.
Click to reveal solution
Recall: slot = word_index % 4, tag = word_index / 4.
ref word slot tag result reason
1 0 0 0 MISS slot 0 empty -> fill (tag 0)
2 1 1 0 MISS slot 1 empty -> fill (tag 0)
3 2 2 0 MISS slot 2 empty -> fill (tag 0)
4 0 0 0 HIT slot 0 valid, tag 0 matches
5 4 0 1 MISS slot 0 has tag 0, need tag 1 -> evict, fill
6 0 0 0 MISS slot 0 now has tag 1, need tag 0 -> evict, fill
Hits = 1, Misses = 5, refs = 6
hit_rate = 1/6 ≈ 0.167 (16.7%)
Final cache:
slot 0: valid=1, tag=0 (word 0) <- last fill
slot 1: valid=1, tag=0 (word 1)
slot 2: valid=1, tag=0 (word 2)
slot 3: valid=0 (never used)
Note refs 5 and 6 thrash: words 0 and 4 both map to slot 0 and
keep evicting each other (a conflict miss). A 2-way set-associative
cache would keep both.
Problem 4: Set-Associative Geometry¶
A 4-way set-associative cache has 64 total slots and a block size of 1 word. How many sets are there, how many bits does the set index need, and how is the byte address 0xA0 decomposed?
Click to reveal solution
num_sets = total_slots / num_ways = 64 / 4 = 16 sets
set index bits = log2(16) = 4 bits
Address fields (1-word block):
byte offset = bits [1:0] (2 bits)
set index = bits [5:2] (4 bits)
tag = bits [63:6]
For addr = 0xA0 = 160 = 0b10100000:
byte offset = 160 & 0b11 = 0
set index = (160 >> 2) & 0b1111 = 40 & 0b1111 = 8
tag = 160 >> 6 = 2
So 0xA0 maps to set 8 and may occupy any of its 4 ways; tag = 2.
Problem 5: Block Size and Spatial Locality¶
A program executes 16 instructions in straight-line order at word indexes 0,1,2,...,15 (no branches), starting from a cold cache. Compare the number of misses for (a) block size 1 word and (b) block size 4 words. Assume the cache is large enough that no eviction occurs.
Click to reveal solution
(a) Block size = 1 word:
Each new word is a fresh miss because nothing was prefetched.
Misses = 16, hits = 0
hit_rate = 0 / 16 = 0%
(b) Block size = 4 words:
The 16 words form 4 blocks: [0-3], [4-7], [8-11], [12-15].
The FIRST access to each block misses and loads all 4 words;
the next 3 accesses in that block are hits.
block [0-3]: miss, hit, hit, hit
block [4-7]: miss, hit, hit, hit
block [8-11]: miss, hit, hit, hit
block [12-15]: miss, hit, hit, hit
Misses = 4, hits = 12
hit_rate = 12 / 16 = 0.75 (75%)
Spatial locality turns 16 misses into 4. This is exactly why the
instructor's demo jumped from ~62% to ~85% by going from a 1-word
to a 4-word block.
Problem 6: LRU Eviction in a Set¶
A single set has 2 ways. References arrive (as tags) in this order, with the logical timestamp = reference number: A, B, A, C. Both ways start invalid. After all four references, which tag was evicted (if any), and what does the set hold?
Click to reveal solution
ref tag action way0 way1
1 A miss, fill invalid way0 A (ts=1) -
2 B miss, fill invalid way1 A (ts=1) B (ts=2)
3 A HIT, restamp A A (ts=3) B (ts=2)
4 C miss, set full -> evict LRU A (ts=3) C (ts=4)
LRU = smallest timestamp = B (ts=2), so B is evicted
Evicted: B
Final set: { A (ts=3), C (ts=4) }
Because A was re-referenced at ref 3, its timestamp (3) was higher
than B's (2), so B became the least recently used and was evicted.
A direct-mapped cache where A and B collided would have thrashed;
the 2-way set kept A available.
Further Reading¶
- Guide to Cache Memory — the course guide covering direct-mapped, fully associative, and set-associative caches in depth.
- Project 4: RISC-V Emulation, Analysis, and Cache Simulation — the assignment spec for the cache simulator.
- Handwritten lecture notes: /notes/CS315-01 2025-10-02 Cache Simulation.pdf
- Cache (computing) — Wikipedia
- CPU cache — Wikipedia
- Locality of reference — Wikipedia
- Patterson and Hennessy, Computer Organization and Design (RISC-V Edition), Chapter 5: Large and Fast — Exploiting Memory Hierarchy.
Summary¶
-
A cache is a small, fast memory between the CPU and slow main memory; it serves repeat and nearby accesses quickly, returning data on a hit and fetching from memory on a miss.
-
Performance is measured by hit and miss rates:
hit_rate = num_hits / num_refs,miss_rate = num_misses / num_refs, and the two always sum to 1. -
Caches work because of locality — temporal (reuse soon) and spatial (neighbors soon) — which is why instruction streams and array traversals cache so well.
-
An address decomposes into fields: byte offset (low 2 bits), then word/block offset, then an index or set index, then the tag. More slots or more sets shift bits from the tag into the index.
-
A direct-mapped cache maps each address to exactly one slot (
(addr >> 2) & mask), checks the valid bit and tag for a hit, and overwrites the slot on a miss — simple but prone to conflict misses. -
Fully associative and set-associative caches add flexibility: fully associative allows any slot (search all, expensive); set-associative maps to a set and allows any way within it (search a few), using LRU to choose a victim.
-
LRU replacement uses a logical timestamp equal to the running reference count; the slot with the smallest timestamp is the least recently used and is evicted, after first preferring any invalid slot.
-
Larger block sizes exploit spatial locality by loading neighboring words on a miss; in the live demo a 4-word block raised the direct-mapped hit rate from about 62% to 85%, and Project 4 asks you to build exactly these direct-mapped and 4-way set-associative variants at 1- and 4-word block sizes.