Lab 06: Cache Analysis¶
Overview¶
This hands-on lab session wraps up the Lab 06 RISC-V emulator (with a Q&A on the
sign_extend() function and the call vs jal decision for Project 04), then
pivots to the next big idea: cache memory. We motivate why caches exist (the
growing CPU–memory performance gap), contrast the two kinds of RAM (SRAM and
DRAM), and walk through the core machinery of a cache: how it answers a memory
reference with a hit or a miss. We then frame the three questions every cache
design must answer and introduce the three cache organizations — direct-mapped,
fully associative, and set associative — that we will build in software in the
upcoming project. The goal is to leave this session able to trace an address
through a cache by hand and predict whether it hits or misses.
Learning Objectives¶
- Resolve the Lab 06
sign_extend()discrepancy and choose betweencallandjalin Project 04 hand-written assembly. - Explain dynamic analysis: instrumenting the emulator to count instruction classes (I-type, R-type, load, store, branch).
- Describe the CPU–memory performance gap and why it motivates a cache hierarchy.
- Distinguish SRAM from DRAM and explain why we use a small fast cache instead of all-SRAM main memory.
- Define hits, misses, references, hit rate, and miss rate, and compute them.
- State the principles of temporal and spatial locality and recognize them in real code.
- Answer the three core cache questions (placement, identification, replacement).
- Compare direct-mapped, fully-associative, and set-associative cache organizations.
Prerequisites¶
- A working Lab 06 RISC-V emulator (decode + execute for the test programs).
- Comfort with RISC-V load/store instructions (
ld,sd,lw,sw) and the program counter. - C fundamentals: structs, arrays, pointers, and casting (
uint32_t *). - Bitwise operators in C: shift (
>>,<<), mask (&), and the modulo operator (%). - Powers of two and binary/hex number representation.
1. Lab 06 Q&A: the sign_extend() Fix¶
We opened with a question that came up while finishing Lab 06: the lecture
explanation of sign extension and the helper code shipped in bits.c disagreed
by one bit. The function takes a value that is narrower than 64 bits and
copies its top (sign) bit into all of the higher bit positions so the value keeps
the same meaning as a 64-bit signed integer.
The classic two-shift trick is:
// Sign-extend the low `bits` bits of `value` to a full 64-bit signed value.
int64_t sign_extend(uint64_t value, int bits) {
int shift = 64 - bits; // <-- the discrepancy was here
return ((int64_t) (value << shift)) >> shift;
}
The idea: shift the value all the way left so its sign bit lands in bit 63, then
do an arithmetic right shift (signed >>) back, which replicates bit 63
downward.
The one-bit issue¶
The shipped code used 64, but the lecture derivation used 63. The right value
depends on exactly how bits is being counted. The fix the class adopted was to
change 64 to 63 so the code matches the lecture's bit-numbering
convention. Both versions can be made correct; the point is that the code and the
explanation must use the same convention so the math is consistent.
// Lecture-aligned version
int64_t sign_extend(uint64_t value, int bits) {
int shift = 63 - bits;
return ((int64_t) (value << shift)) >> shift;
}
Common mistake: mixing the count of bits with the index of the top bit. If
bitsis the number of meaningful bits, the left shift is64 - bits. Ifbitsis the index of the most significant bit (0-based), the left shift is63 - bits. Pick one convention and use it everywhere.
You may keep your original working code if it passes the autograder, but the lecture-aligned change is preferred for logical consistency with how we derive immediates in the slides.
2. Project 04: call vs jal¶
Project 04 reuses the emulator but adds hand-written assembly (a sort routine and
a recursive Fibonacci, fibrec). A pseudo-instruction tripped people up.
In RISC-V, call label is a pseudo-instruction. The assembler expands it into
two real instructions because the target may be far away (a full 32-bit PC-relative
range):
call foo ==> auipc ra, %pcrel_hi(foo) # upper 20 bits of the offset
jalr ra, ra, %pcrel_lo(foo) # lower 12 bits, jump and link
To emulate call faithfully you would have to implement auipc (add upper
immediate to PC). For the small programs in Project 04 the target is always within
the jal range (±1 MiB), so you can use jal directly and skip auipc entirely:
Guidance for Project 04:
- Replace
callwithjalin the provided sort code. - Use
jal(notcall) in yourfibrecimplementation so you do not need to emulateauipc.
flowchart TD
A["Need a function call?"] --> B{"Is target within<br/>jal range (~1 MiB)?"}
B -- "Yes" --> C["Use jal ra, label<br/>(1 instruction)"]
B -- "No" --> D["Use call label<br/>(auipc + jalr)"]
C --> E["Emulator only needs jal + jalr"]
D --> F["Emulator must also implement auipc"]
3. Dynamic Analysis: Counting Instructions at Run Time¶
Project 04 also asks for dynamic analysis: while the emulator executes a program, it counts how many instructions of each class actually run. This is dynamic (measured during execution) rather than static (counted in the binary), so a loop body that runs 1000 times contributes 1000 to the dynamic counts.
The natural place to do this is an analysis struct that the execute loop updates
once per instruction, right after it decodes the instruction format:
typedef struct {
uint64_t total; // every instruction executed
uint64_t itype; // I-type (addi, andi, lw, jalr, ...)
uint64_t rtype; // R-type (add, sub, and, ...)
uint64_t load; // ld, lw, lb, ...
uint64_t store; // sd, sw, sb, ...
uint64_t branch; // beq, bne, blt, ...
} analysis_st;
void analyze(analysis_st *a, uint32_t insn) {
a->total += 1;
switch (opcode_of(insn)) {
case OPC_LOAD: a->load += 1; a->itype += 1; break; // loads are I-type
case OPC_STORE: a->store += 1; break;
case OPC_BRANCH: a->branch += 1; break;
case OPC_OP: a->rtype += 1; break;
case OPC_OP_IMM: a->itype += 1; break;
// ... jal, jalr, lui, auipc, etc.
}
}
Common mistake: double-counting. A load is both a memory access and an I-type instruction. Decide up front whether your categories are disjoint or overlapping and document it, so your counts add up the way the grader expects. Keep the instrumentation in one place — do not sprinkle
count++lines throughout every case.
In the demo, running fibrec 10 produced 1,675 total instructions. We then
compared the memory-access profile of fibrec (heavy stack traffic from
recursion: lots of sd/ld around each call) against the sort routine (heavy
array traffic). Different programs exhibit very different memory behavior, which is
exactly the behavior a cache has to cope with — a perfect segue into cache memory.
flowchart LR
A["Decode instruction"] --> B["Execute instruction"]
B --> C["analyze: bump counters"]
C --> D{"halt?"}
D -- "no" --> A
D -- "yes" --> E["Print analysis report"]
4. Why Caches Exist: the CPU–Memory Gap¶
For decades, Moore's Law — the observation that transistor count per chip roughly doubles every ~1.5 to 2 years — let processor speed climb quickly. Main memory (DRAM) latency, however, improved far more slowly. The result is a widening performance gap: the CPU can issue requests much faster than DRAM can satisfy them.
speed
^
| CPU
| . ' '
| . ' ' } gap (grows over time)
| . ' '
| . ' ' _______________________ memory (DRAM)
| . ' '______/
+---------------------------------------------> time
If every load and store had to wait on DRAM, the fast CPU would spend most of its time stalled. A cache is a small, fast memory that sits between the processor and main memory and keeps recently used data close by, so most accesses are served quickly and the CPU rarely has to wait on slow DRAM.
The memory the processor sees is RAM (DRAM), and it holds the running program:
MEMORY / RAM (DRAM)
processor (CPU) +---------------------+
+-----------+ | STACK | <-- sd rs, (sp) writes here
| [ REGS ] | ---> | (grows downward) | <-- ld rd, (sp) reads here
| [ PC ] | <--- | |
+-----------+ | ... |
| CODE |
| [ IW ] | <-- PC fetches instruction words
+---------------------+
A sd rs, off(sp) stores a register to the stack in memory; a ld rd, off(sp)
reads it back; and the PC continually fetches instruction words (IW) from the
code region. Every one of those touches is a candidate to be served by a cache.
5. Two Types of RAM: SRAM vs DRAM¶
There are two main technologies for building RAM, and the trade-off between them is the reason caches exist.
| Property | SRAM (Static RAM) | DRAM (Dynamic RAM) |
|---|---|---|
| Speed | Fast | Slow |
| Cost per bit | Expensive | Cheap |
| Density | Low (≈6 transistors/bit) | High (1 transistor + 1 cap) |
| Refresh | Not needed | Must be refreshed periodically |
| Typical use | Caches (on-chip) | Main memory |
- SRAM = Static Random Access Memory. Fast, but expensive and not dense, so we can only afford a small amount of it.
- DRAM = Dynamic Random Access Memory. Slow, but cheap and dense, so we can afford a lot of it.
The natural question: why not build all of main memory out of SRAM so everything is fast? Answer: cost and chip area. A machine with gigabytes of SRAM would be enormous and prohibitively expensive. Instead we get the best of both worlds: a small, fast SRAM cache in front of a large, cheap DRAM main memory.
processor CACHE Memory (RAM / DRAM)
+---------+ addr +--------+ miss +-------------------+
| | -----> | | ------> | |
| | <----- | $$ | <------ | CODE |
+---------+ a +--------+ | DATA |
^ +-------------------+
| hit (served fast from $$)
The cache ($$ is the standard shorthand for "cache") takes an addr from the
processor. On a hit, the data a comes straight back from SRAM — fast. On a
miss, the cache goes out to DRAM, brings the data in, fills a slot, and returns
it — slow, but it now sits in the cache for next time.
6. Hits, Misses, and Rates¶
Every time the processor asks the cache for data, that request is a memory
reference (ref). The cache classifies each reference:
- Hit — the data is already in the cache; return it immediately (fast).
- Miss — the data is not in the cache; fetch from main memory, install it, then return it (slow).
Counting these gives us the two headline metrics:
Because every reference is either a hit or a miss:
The goal is a high hit rate. Real caches do astonishingly well — hit rates above 97% are common — because real programs are not random. They obey the principles of locality, covered next.
7. The Three Cache Questions¶
Every cache design boils down to answering three questions about an incoming address:
- Where do we look? Given an address, which cache location(s) could hold its data? (Placement / mapping.)
- Is it actually there? Once we look, how do we know this slot holds the data for this address and not some other address that maps to the same place? (Identification — the tag and valid bit.)
- What do we evict? When the cache is full and we need room for new data, which existing entry do we remove? (Replacement policy.)
flowchart TD
A["Memory reference: addr"] --> Q1["Q1: Where could it be?<br/>(compute index)"]
Q1 --> Q2["Q2: Is the data really there?<br/>(check valid + compare tag)"]
Q2 -- "valid && tag matches" --> H["HIT: return data"]
Q2 -- "otherwise" --> M["MISS: fetch from memory"]
M --> Q3["Q3: Which entry to evict?<br/>(replacement policy)"]
Q3 --> F["Install data, set tag + valid, return"]
Each of the three cache organizations answers these three questions differently. Direct-mapped makes Q1 and Q3 trivial; fully associative makes Q1 trivial but Q3 expensive; set-associative balances them.
8. Principles of Locality¶
Caches work because real programs exhibit locality of reference. There are two flavors.
Temporal locality¶
If you access something, you are likely to access it again soon.
i and sum, and the loop body's instructions, are reused on every iteration —
classic temporal locality. The first access misses; the reuses hit.
Spatial locality¶
If you access something, you are likely to access its neighbors soon.
Accessing a[i] makes a[i+1] very likely next. Code has spatial locality too:
after executing the instruction at PC, the next instruction at PC + 4 is
usually executed next (unless the instruction is a taken branch or jump).
Spatial locality is the reason a cache moves data in blocks (multiple adjacent
words) rather than one word at a time — when you fetch a[i] you grab its
neighbors in the same transfer so they are already present when you ask for them.
| Locality | Statement | Code example |
|---|---|---|
| Temporal | Recently used data is reused soon | loop counter i, repeated function calls |
| Spatial | Neighbors of used data are used soon | array traversal a[i], sequential instructions PC+4 |
9. Addresses, Words, and Blocks¶
A memory reference is just an address. On modern machines an address is a byte address: each distinct address names one 8-bit byte. You can picture memory as a giant array of bytes:
For this lab we start with a cache whose block size is one word (32 bits = 4 bytes) and assume all requests are word-aligned (the address is a multiple of 4). That lets us view memory as a sequence of words:
word 0 -> mem[0], mem[1], mem[2], mem[3]
word 1 -> mem[4], mem[5], mem[6], mem[7]
word 2 -> mem[8], mem[9], mem[10], mem[11]
...
Converting between byte and word addresses:
addr_word = addr / 4; // byte address -> word address (== addr >> 2)
addr = addr_word * 4; // word address -> byte address (== addr_word << 2)
Because the low 2 bits of a word-aligned byte address are always 00, they form the
byte offset within a word. They are exactly the bits we drop when we shift right
by 2 to get a word address.
10. Direct-Mapped Cache¶
A direct-mapped cache is the simplest: every address maps to exactly one slot. That makes Q1 (where to look) and Q3 (what to evict) trivial — there is only one place it can go, so if something is already there, it gets evicted.
Consider a 4-word direct-mapped cache:
The mapping function¶
First convert the byte address to a word address, then take it modulo the number of slots:
Since the cache size is a power of two, the modulo is just the low bits of the word
address, which we can isolate with a mask instead of %:
Why the mask works — look at the low bits of consecutive word addresses:
dec bin slot
0 00000 0
1 00001 1
2 00010 2
3 00011 3
4 00100 0 <- wraps around
5 00101 1
6 00110 2
7 00111 3
8 01000 0
Word addresses 0, 4, 8, 12, ... all collide in slot 0; 1, 5, 9, 13, ... all map
to slot 1; and so on.
The tag and valid bit (Q2)¶
Many word addresses map to the same slot, so we must record which one is actually stored. The remaining high bits — the bits above the slot index — uniquely identify which address is present. We call those the tag.
For word addresses landing in slot 0 (0, 4, 8, 12), the upper bits are 000,
001, 010, 011 — all distinct. Storing the tag alongside the data lets us
confirm a hit.
We also need a valid bit. At power-up the cache is zeroed; a tag of 0 and data of 0 are legitimate values, so without a separate flag we could not tell "real zero data" from "empty slot." The valid bit solves this.
struct cache_slot_st {
uint64_t tag;
uint32_t data;
uint32_t valid;
};
struct cache_slot_st cache[4];
A 64-bit byte address for a 4-slot, 1-word-block direct-mapped cache decomposes as:
Lookup pseudocode¶
addr_tag = addr >> 4; // drop byte offset (2) + index bits (2)
slot_index = (addr >> 2) & 0b11; // the 2 index bits
slot = &cache[slot_index];
if (slot->valid == 1 && slot->tag == addr_tag) {
// HIT
return slot->data;
} else {
// MISS: fetch, install (overwriting whatever was there), return
slot->data = *((uint32_t *) addr);
slot->tag = addr_tag;
slot->valid = 1;
return slot->data;
}
Common mistake: comparing the full byte address to the tag. The tag is only the high bits left after removing the offset and index bits. Compare
addr >> 4, notaddr.
Strength: dead simple, one comparison, no replacement logic. Weakness: conflict misses. Two hot addresses that map to the same slot kick each other out repeatedly, even when the rest of the cache is empty.
11. Fully-Associative Cache¶
A fully-associative cache flips the trade-off: an address may live in any slot. That eliminates conflict misses (nothing forces two addresses to collide), but makes Q2 expensive — to find data we must compare the tag against every slot.
struct cache_slot_st {
uint64_t tag;
uint32_t data;
uint32_t valid;
uint64_t timestamp; // for LRU replacement
};
struct cache_slot_st cache[4];
With no index, the address is just a tag and a byte offset:
Lookup pseudocode¶
num_refs += 1;
addr_tag = addr >> 2; // remove byte offset only
for (int i = 0; i < 4; i++) { // search ALL slots
struct cache_slot_st *slot = &cache[i];
if (slot->valid == 1 && slot->tag == addr_tag) {
// HIT
slot->timestamp = num_refs; // mark recently used
return slot->data;
}
}
// MISS: choose a victim, install
struct cache_slot_st *slot = find_lru_slot(cache);
slot->data = *((uint32_t *) addr);
slot->tag = addr_tag;
slot->valid = 1;
slot->timestamp = num_refs;
return slot->data;
Replacement (Q3): LRU¶
On a miss we need a victim. find_lru_slot():
- First, look for any slot with
valid == 0(an empty slot) — use it. - Otherwise, evict the least recently used (LRU) slot.
A clean software way to track recency is a logical timestamp: a counter that
increments on every reference. Each time a slot is touched, stamp it with the
current num_refs. The slot with the smallest timestamp is the one untouched the
longest — the LRU victim.
struct cache_slot_st *find_lru_slot(struct cache_slot_st *cache) {
// prefer an unused slot
for (int i = 0; i < 4; i++)
if (cache[i].valid == 0)
return &cache[i];
// else the slot with the smallest timestamp
int lru = 0;
for (int i = 1; i < 4; i++)
if (cache[i].timestamp < cache[lru].timestamp)
lru = i;
return &cache[lru];
}
In hardware, perfect LRU and full timestamps are too expensive, so processors use approximate LRU. In our software simulator we can afford the exact version.
Strength: no conflict misses; maximally flexible. Weakness: every lookup scans all slots (lots of comparators in hardware). Only practical for small structures like TLBs in virtual-memory hardware.
12. Set-Associative Cache¶
A set-associative cache is the compromise everyone actually ships. View the cache as an array of sets; map the address to one set using the simple direct-mapped index, but allow it to land in any slot within that set. The number of slots per set is the number of ways: a 4-way set-associative cache has 4 slots per set.
struct cache_slot_st {
uint64_t tag;
uint32_t data;
uint32_t valid;
uint64_t timestamp;
};
struct cache_slot_st cache[32]; // 32 slots total
The number of sets is:
With 32 slots and 4 ways: 32 / 4 = 8 sets. Eight sets need 3 index bits
(2^3 = 8), so the address decomposes as:
Lookup pseudocode (4-way, 32 slots = 8 sets)¶
We lay the 32-slot array out as 8 groups of 4: slots 0..3 are set 0, slots 4..7
are set 1, etc. Once we have set_index, the first slot of the set is
set_index * num_ways.
num_refs += 1;
int num_ways = 4;
uint64_t addr_tag = addr >> 5; // drop byte offset(2) + set_index(3)
int set_index = (addr >> 2) & 0b111; // the 3 set-index bits
int set_base = set_index * num_ways; // first slot in this set
for (int i = 0; i < num_ways; i++) { // search ONLY this set
struct cache_slot_st *slot = &cache[set_base + i];
if (slot->valid == 1 && slot->tag == addr_tag) {
slot->timestamp = num_refs; // HIT
return slot->data;
}
}
// MISS: LRU within this set only
struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base, num_ways);
slot->data = *((uint32_t *) addr);
slot->tag = addr_tag;
slot->valid = 1;
slot->timestamp = num_refs;
return slot->data;
The key win: we only search and replace within one small set (4 or 8 slots), not the
whole cache, yet we still have flexibility about where in the set the data goes.
find_lru_slot_in_set() first looks for an unused slot in the set, then applies LRU
among the set's slots.
flowchart LR
A["Direct-mapped<br/>1 slot per address"] --> B["Set-associative<br/>N slots per set"]
B --> C["Fully associative<br/>any slot (1 set)"]
A -. "increasing flexibility / cost" .-> C
Note the two extremes are just special cases: a 1-way set-associative cache is a direct-mapped cache, and a set-associative cache with one set (ways == total slots) is fully associative.
13. Block Size and Spatial Locality¶
So far each slot holds one word. To cash in on spatial locality and amortize the cost of a DRAM transfer, real caches store a block of several adjacent words per slot. With a 4-word block:
struct cache_slot_st {
uint64_t tag;
uint32_t block[4]; // 4 words per slot
uint32_t valid;
uint64_t timestamp;
};
Now we add a block offset to the address breakdown (which word within the block) between the byte offset and the index. For a 4-slot cache with a 4-word block:
Direct mapped: tag[63:6] slot_index[5:4] block_index[3:2] byte_offset[1:0]
Fully associative: tag[63:4] block_index[3:2] byte_offset[1:0]
Set assoc (2-way): tag[63:5] set_index[4] block_index[3:2] byte_offset[1:0]
To fill a block on a miss, find the block's base word and load the whole block from memory:
addr_word = addr / 4;
block_index = addr_word % 4; // 4 == block size in words
block_base = addr_word - block_index; // word address of the block start
// load 4 words from memory starting at byte address (block_base * 4)
Larger blocks improve hit rate when programs stream through memory (great spatial locality), but blocks that are too large waste space and bandwidth on data that is never used. Block size is a tuning knob.
14. Worked Example: Tracing a Direct-Mapped Cache¶
Let us trace the 4-word, 1-word-block direct-mapped cache from Section 10 by hand.
Stream of word addresses (so multiply by 4 for byte addresses):
0, 1, 2, 3, 4, 0.
For each, slot = word_addr % 4 and tag = word_addr / 4:
| step | word addr | slot | tag | valid before? | tag match? | result |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | no | — | MISS |
| 2 | 1 | 1 | 0 | no | — | MISS |
| 3 | 2 | 2 | 0 | no | — | MISS |
| 4 | 3 | 3 | 0 | no | — | MISS |
| 5 | 4 | 0 | 1 | yes (tag 0) | no | MISS (evicts word 0) |
| 6 | 0 | 0 | 0 | yes (tag 1) | no | MISS (evicts word 4) |
Result: 6 references, 0 hits, 6 misses → hit rate 0%. Steps 5 and 6 are conflict misses: words 0 and 4 both map to slot 0 and keep evicting each other, even though slots 1, 2, 3 sit idle holding stale data. A 2-way set-associative cache of the same size would have kept both word 0 and word 4 (different ways of the same set) and turned step 6 into a hit.
Key Concepts¶
| Concept | Definition | Example |
|---|---|---|
| Cache | Small, fast SRAM memory holding recently used data, sitting between CPU and DRAM | On-chip L1 cache $$ |
| Hit / Miss | Data is / is not already in the cache for a reference | Hit returns fast; miss goes to DRAM |
| Hit rate | num_hits / num_refs |
0.97 means 97% served from cache |
| Temporal locality | Recently used data is likely reused soon | loop counter i, repeated calls |
| Spatial locality | Neighbors of used data are used soon | a[i] then a[i+1]; PC then PC+4 |
| Tag | High address bits stored in a slot to confirm identity | addr >> 4 in a 4-slot, 1-word cache |
| Valid bit | Flag marking whether a slot holds real data | distinguishes "empty" from "zero data" |
| Direct-mapped | Each address maps to exactly one slot | slot = word_addr % num_slots |
| Fully associative | An address may occupy any slot | search all slots, LRU replacement |
| Set associative | Map to a set, free placement within the set | 4-way: 4 slots per set |
| Block size | Number of adjacent words moved per miss | 4-word block exploits spatial locality |
| LRU | Evict the least-recently-used entry | smallest logical timestamp |
| SRAM vs DRAM | Fast/expensive/sparse vs slow/cheap/dense | cache vs main memory |
Practice Problems¶
Problem 1: Hit/Miss/Rate basics¶
A cache services 1,000 memory references and records 30 misses. What are the miss rate and hit rate? How many hits occurred?
Click to reveal solution
A 97% hit rate is typical for real programs thanks to locality.Problem 2: Address breakdown¶
For a direct-mapped cache with 8 slots and a 1-word block, split a 64-bit
byte address into tag, slot_index, and byte_offset. How many bits is each
field, and how do you compute the slot index from addr?
Click to reveal solution
- `byte_offset` = 2 bits (word-aligned, low bits `[1:0]`). - `slot_index` = 3 bits, because `2^3 = 8` slots (`[4:2]`). - `tag` = the remaining high bits `[63:5]`.Problem 3: Trace a direct-mapped cache¶
Using a 4-slot, 1-word-block direct-mapped cache (initially all invalid), trace the
word address stream 0, 4, 0, 4. Mark each as hit or miss and give the final hit
rate.
Click to reveal solution
Both 0 and 4 map to slot 0 (`0 % 4 == 0`, `4 % 4 == 0`) with tags 0 and 1 respectively. | ref | word addr | slot | tag | result | reason | |-----|-----------|------|-----|--------|--------| | 1 | 0 | 0 | 0 | MISS | slot empty | | 2 | 4 | 0 | 1 | MISS | tag mismatch, evicts 0 | | 3 | 0 | 0 | 0 | MISS | tag mismatch, evicts 4 | | 4 | 4 | 0 | 1 | MISS | tag mismatch, evicts 0 | 4 refs, 0 hits → **hit rate 0%**. This is a pathological conflict-miss pattern ("ping-pong"). A 2-way set-associative cache would hold both 0 and 4 and turn refs 3 and 4 into hits (50% hit rate).Problem 4: Why a valid bit?¶
After power-up the cache is all zeros. A request arrives whose computed tag is 0.
Explain precisely why, without a valid bit, the cache would wrongly report a hit, and
how the valid bit prevents it.
Click to reveal solution
On reset, every slot has `tag = 0` and `data = 0`. A request whose tag is also 0 would satisfy `slot.tag == addr_tag` even though nothing real was ever loaded — the cache returns stale 0 data and counts a false hit. The valid bit is initialized to 0 and only set to 1 after a real fill. The hit test becomes `slot.valid == 1 && slot.tag == addr_tag`, so an empty (just-reset) slot can never produce a hit, regardless of tag/data values.Problem 5: Set-associative geometry¶
A 4-way set-associative cache has 64 total slots and a 1-word block. How many sets
are there, how many set-index bits, and how do you compute set_base from
set_index?
Click to reveal solution
Lookup compares the tag against the 4 slots `cache[set_base .. set_base+3]` only.Problem 6: LRU eviction¶
A fully-associative cache has 4 slots, all valid. Their logical timestamps (set on
each access) are: slot0=12, slot1=7, slot2=20, slot3=15. A miss occurs. Which slot is
evicted, and what is its new timestamp if the current num_refs is 21?
Click to reveal solution
LRU evicts the slot with the **smallest** timestamp — that is the one untouched the longest. The smallest is slot1 (7), so **slot1 is evicted**. After installing the new block, slot1's timestamp is updated to the current reference count: `num_refs = 21`. (`num_refs` is typically incremented at the start of handling this reference.)Further Reading¶
- Cache Memory Guide — full treatment of direct-mapped, fully-associative, set-associative caches, and block size
- Key Concepts — bit manipulation, masks, and number representation used throughout this lab
- RISC-V Reference Guide — load/store instructions,
jal/jalr, and the calling convention - Lab 06: RISC-V Emulation — the emulator this session builds on
- Source notes: CS315-01 2025-10-01 Lab Lab06 Analysis Cache.pdf
- Cache (computing) — Wikipedia
- CPU cache — Wikipedia
- Locality of reference — Wikipedia
- Moore's law — Wikipedia
Summary¶
-
Lab 06 wrap-up: fix
sign_extend()by aligning the shift constant with the lecture's bit-numbering convention (change64to63); pick one convention and apply it consistently. -
Project 04: use
jalinstead of thecallpseudo-instruction in the sort andfibreccode so you do not have to emulateauipc; targets are withinjalrange. -
Dynamic analysis instruments the emulator to count instruction classes (I-type, R-type, load, store, branch) as they execute;
fibrec 10runs 1,675 instructions, and recursion vs sorting show very different memory profiles. -
Caches exist because of the CPU–memory gap: CPU speed (Moore's Law) outpaced DRAM latency, so a small fast SRAM cache keeps hot data close and hides slow DRAM.
-
SRAM vs DRAM is the core trade-off: SRAM is fast/expensive/sparse (caches), DRAM is slow/cheap/dense (main memory); we cannot afford all-SRAM memory.
-
Hits and misses drive
hit_rate = hits/refsandmiss_rate = 1 - hit_rate; real programs hit >97% because of temporal and spatial locality. -
Every cache answers three questions: where to look (placement/index), whether the data is really there (valid bit + tag), and what to evict (replacement policy).
-
Three organizations: direct-mapped (one slot per address, simple but conflict misses), fully-associative (any slot, no conflicts but expensive search), and set-associative (map to a set, free placement within it — the practical compromise); larger block sizes exploit spatial locality.