Lab 06: Cache Analysis¶

Overview¶

This hands-on lab session wraps up the Lab 06 RISC-V emulator (with a Q&A on the sign_extend() function and the call vs jal decision for Project 04), then pivots to the next big idea: cache memory. We motivate why caches exist (the growing CPU–memory performance gap), contrast the two kinds of RAM (SRAM and DRAM), and walk through the core machinery of a cache: how it answers a memory reference with a hit or a miss. We then frame the three questions every cache design must answer and introduce the three cache organizations — direct-mapped, fully associative, and set associative — that we will build in software in the upcoming project. The goal is to leave this session able to trace an address through a cache by hand and predict whether it hits or misses.

Learning Objectives¶

Resolve the Lab 06 sign_extend() discrepancy and choose between call and jal in Project 04 hand-written assembly.
Explain dynamic analysis: instrumenting the emulator to count instruction classes (I-type, R-type, load, store, branch).
Describe the CPU–memory performance gap and why it motivates a cache hierarchy.
Distinguish SRAM from DRAM and explain why we use a small fast cache instead of all-SRAM main memory.
Define hits, misses, references, hit rate, and miss rate, and compute them.
State the principles of temporal and spatial locality and recognize them in real code.
Answer the three core cache questions (placement, identification, replacement).
Compare direct-mapped, fully-associative, and set-associative cache organizations.

Prerequisites¶

A working Lab 06 RISC-V emulator (decode + execute for the test programs).
Comfort with RISC-V load/store instructions (ld, sd, lw, sw) and the program counter.
C fundamentals: structs, arrays, pointers, and casting (uint32_t *).
Bitwise operators in C: shift (>>, <<), mask (&), and the modulo operator (%).
Powers of two and binary/hex number representation.

1. Lab 06 Q&A: the `sign_extend()` Fix¶

We opened with a question that came up while finishing Lab 06: the lecture explanation of sign extension and the helper code shipped in bits.c disagreed by one bit. The function takes a value that is narrower than 64 bits and copies its top (sign) bit into all of the higher bit positions so the value keeps the same meaning as a 64-bit signed integer.

The classic two-shift trick is:

// Sign-extend the low `bits` bits of `value` to a full 64-bit signed value.
int64_t sign_extend(uint64_t value, int bits) {
    int shift = 64 - bits;            // <-- the discrepancy was here
    return ((int64_t) (value << shift)) >> shift;
}

The idea: shift the value all the way left so its sign bit lands in bit 63, then do an arithmetic right shift (signed >>) back, which replicates bit 63 downward.

The one-bit issue¶

The shipped code used 64, but the lecture derivation used 63. The right value depends on exactly how bits is being counted. The fix the class adopted was to change 64 to 63 so the code matches the lecture's bit-numbering convention. Both versions can be made correct; the point is that the code and the explanation must use the same convention so the math is consistent.

// Lecture-aligned version
int64_t sign_extend(uint64_t value, int bits) {
    int shift = 63 - bits;
    return ((int64_t) (value << shift)) >> shift;
}

Common mistake: mixing the count of bits with the index of the top bit. If bits is the number of meaningful bits, the left shift is 64 - bits. If bits is the index of the most significant bit (0-based), the left shift is 63 - bits. Pick one convention and use it everywhere.

You may keep your original working code if it passes the autograder, but the lecture-aligned change is preferred for logical consistency with how we derive immediates in the slides.

2. Project 04: `call` vs `jal`¶

Project 04 reuses the emulator but adds hand-written assembly (a sort routine and a recursive Fibonacci, fibrec). A pseudo-instruction tripped people up.

In RISC-V, call label is a pseudo-instruction. The assembler expands it into two real instructions because the target may be far away (a full 32-bit PC-relative range):

call foo   ==>   auipc ra, %pcrel_hi(foo)     # upper 20 bits of the offset
                 jalr  ra, ra, %pcrel_lo(foo) # lower 12 bits, jump and link

To emulate call faithfully you would have to implement auipc (add upper immediate to PC). For the small programs in Project 04 the target is always within the jal range (±1 MiB), so you can use jal directly and skip auipc entirely:

jal ra, foo    # one instruction: ra = pc + 4; pc = pc + offset
ret            # jalr x0, ra, 0  (return)

Guidance for Project 04:

Replace call with jal in the provided sort code.
Use jal (not call) in your fibrec implementation so you do not need to emulate auipc.

flowchart TD
    A["Need a function call?"] --> B{"Is target within<br/>jal range (~1 MiB)?"}
    B -- "Yes" --> C["Use jal ra, label<br/>(1 instruction)"]
    B -- "No" --> D["Use call label<br/>(auipc + jalr)"]
    C --> E["Emulator only needs jal + jalr"]
    D --> F["Emulator must also implement auipc"]

3. Dynamic Analysis: Counting Instructions at Run Time¶

Project 04 also asks for dynamic analysis: while the emulator executes a program, it counts how many instructions of each class actually run. This is dynamic (measured during execution) rather than static (counted in the binary), so a loop body that runs 1000 times contributes 1000 to the dynamic counts.

The natural place to do this is an analysis struct that the execute loop updates once per instruction, right after it decodes the instruction format:

typedef struct {
    uint64_t total;     // every instruction executed
    uint64_t itype;     // I-type (addi, andi, lw, jalr, ...)
    uint64_t rtype;     // R-type (add, sub, and, ...)
    uint64_t load;      // ld, lw, lb, ...
    uint64_t store;     // sd, sw, sb, ...
    uint64_t branch;    // beq, bne, blt, ...
} analysis_st;

void analyze(analysis_st *a, uint32_t insn) {
    a->total += 1;
    switch (opcode_of(insn)) {
        case OPC_LOAD:   a->load   += 1; a->itype += 1; break; // loads are I-type
        case OPC_STORE:  a->store  += 1;                break;
        case OPC_BRANCH: a->branch += 1;                break;
        case OPC_OP:     a->rtype  += 1;                break;
        case OPC_OP_IMM: a->itype  += 1;                break;
        // ... jal, jalr, lui, auipc, etc.
    }
}

Common mistake: double-counting. A load is both a memory access and an I-type instruction. Decide up front whether your categories are disjoint or overlapping and document it, so your counts add up the way the grader expects. Keep the instrumentation in one place — do not sprinkle count++ lines throughout every case.

In the demo, running fibrec 10 produced 1,675 total instructions. We then compared the memory-access profile of fibrec (heavy stack traffic from recursion: lots of sd/ld around each call) against the sort routine (heavy array traffic). Different programs exhibit very different memory behavior, which is exactly the behavior a cache has to cope with — a perfect segue into cache memory.

flowchart LR
    A["Decode instruction"] --> B["Execute instruction"]
    B --> C["analyze: bump counters"]
    C --> D{"halt?"}
    D -- "no" --> A
    D -- "yes" --> E["Print analysis report"]

4. Why Caches Exist: the CPU–Memory Gap¶

For decades, Moore's Law — the observation that transistor count per chip roughly doubles every ~1.5 to 2 years — let processor speed climb quickly. Main memory (DRAM) latency, however, improved far more slowly. The result is a widening performance gap: the CPU can issue requests much faster than DRAM can satisfy them.

   speed
     ^
     |                                   CPU
     |                            .  '  '
     |                     .  '  '         } gap (grows over time)
     |              .  '  '
     |        .  '  '  _______________________  memory (DRAM)
     |   .  '  '______/
     +---------------------------------------------> time

If every load and store had to wait on DRAM, the fast CPU would spend most of its time stalled. A cache is a small, fast memory that sits between the processor and main memory and keeps recently used data close by, so most accesses are served quickly and the CPU rarely has to wait on slow DRAM.

The memory the processor sees is RAM (DRAM), and it holds the running program:

                       MEMORY / RAM (DRAM)
   processor (CPU)    +---------------------+
   +-----------+      |       STACK         | <-- sd rs, (sp) writes here
   | [ REGS  ] | ---> |  (grows downward)   | <-- ld rd, (sp) reads here
   | [  PC   ] | <--- |                     |
   +-----------+      |        ...          |
                      |       CODE          |
                      |   [   IW   ]        | <-- PC fetches instruction words
                      +---------------------+

A sd rs, off(sp) stores a register to the stack in memory; a ld rd, off(sp) reads it back; and the PC continually fetches instruction words (IW) from the code region. Every one of those touches is a candidate to be served by a cache.

5. Two Types of RAM: SRAM vs DRAM¶

There are two main technologies for building RAM, and the trade-off between them is the reason caches exist.

Property	SRAM (Static RAM)	DRAM (Dynamic RAM)
Speed	Fast	Slow
Cost per bit	Expensive	Cheap
Density	Low (≈6 transistors/bit)	High (1 transistor + 1 cap)
Refresh	Not needed	Must be refreshed periodically
Typical use	Caches (on-chip)	Main memory

SRAM = Static Random Access Memory. Fast, but expensive and not dense, so we can only afford a small amount of it.
DRAM = Dynamic Random Access Memory. Slow, but cheap and dense, so we can afford a lot of it.

The natural question: why not build all of main memory out of SRAM so everything is fast? Answer: cost and chip area. A machine with gigabytes of SRAM would be enormous and prohibitively expensive. Instead we get the best of both worlds: a small, fast SRAM cache in front of a large, cheap DRAM main memory.

   processor          CACHE              Memory (RAM / DRAM)
   +---------+   addr  +--------+  miss   +-------------------+
   |         |  -----> |        | ------> |                   |
   |         |  <----- |  $$    | <------ |       CODE        |
   +---------+    a    +--------+         |       DATA        |
        ^                                 +-------------------+
        | hit (served fast from $$)

The cache ($$ is the standard shorthand for "cache") takes an addr from the processor. On a hit, the data a comes straight back from SRAM — fast. On a miss, the cache goes out to DRAM, brings the data in, fills a slot, and returns it — slow, but it now sits in the cache for next time.

6. Hits, Misses, and Rates¶

Every time the processor asks the cache for data, that request is a memory reference (ref). The cache classifies each reference:

Hit — the data is already in the cache; return it immediately (fast).
Miss — the data is not in the cache; fetch from main memory, install it, then return it (slow).

Counting these gives us the two headline metrics:

hit_rate  = num_hits   / num_refs
miss_rate = num_misses / num_refs

Because every reference is either a hit or a miss:

hit_rate  = 1 - miss_rate
miss_rate = 1 - hit_rate

The goal is a high hit rate. Real caches do astonishingly well — hit rates above 97% are common — because real programs are not random. They obey the principles of locality, covered next.

7. The Three Cache Questions¶

Every cache design boils down to answering three questions about an incoming address:

Where do we look? Given an address, which cache location(s) could hold its data? (Placement / mapping.)
Is it actually there? Once we look, how do we know this slot holds the data for this address and not some other address that maps to the same place? (Identification — the tag and valid bit.)
What do we evict? When the cache is full and we need room for new data, which existing entry do we remove? (Replacement policy.)

flowchart TD
    A["Memory reference: addr"] --> Q1["Q1: Where could it be?<br/>(compute index)"]
    Q1 --> Q2["Q2: Is the data really there?<br/>(check valid + compare tag)"]
    Q2 -- "valid && tag matches" --> H["HIT: return data"]
    Q2 -- "otherwise" --> M["MISS: fetch from memory"]
    M --> Q3["Q3: Which entry to evict?<br/>(replacement policy)"]
    Q3 --> F["Install data, set tag + valid, return"]

Each of the three cache organizations answers these three questions differently. Direct-mapped makes Q1 and Q3 trivial; fully associative makes Q1 trivial but Q3 expensive; set-associative balances them.

8. Principles of Locality¶

Caches work because real programs exhibit locality of reference. There are two flavors.

Temporal locality¶

If you access something, you are likely to access it again soon.

int sum = 0;
for (int i = 0; i < n; i++) {   // i and sum are touched every iteration
    sum += a[i];
}

i and sum, and the loop body's instructions, are reused on every iteration — classic temporal locality. The first access misses; the reuses hit.

Spatial locality¶

If you access something, you are likely to access its neighbors soon.

for (int i = 0; i < n; i++) {
    a[i] = i;        // a[0], a[1], a[2], ... are adjacent in memory
}

Accessing a[i] makes a[i+1] very likely next. Code has spatial locality too: after executing the instruction at PC, the next instruction at PC + 4 is usually executed next (unless the instruction is a taken branch or jump).

Spatial locality is the reason a cache moves data in blocks (multiple adjacent words) rather than one word at a time — when you fetch a[i] you grab its neighbors in the same transfer so they are already present when you ask for them.

Locality	Statement	Code example
Temporal	Recently used data is reused soon	loop counter `i`, repeated function calls
Spatial	Neighbors of used data are used soon	array traversal `a[i]`, sequential instructions `PC+4`

9. Addresses, Words, and Blocks¶

A memory reference is just an address. On modern machines an address is a byte address: each distinct address names one 8-bit byte. You can picture memory as a giant array of bytes:

uint8_t mem[N];   // index == byte address

For this lab we start with a cache whose block size is one word (32 bits = 4 bytes) and assume all requests are word-aligned (the address is a multiple of 4). That lets us view memory as a sequence of words:

word 0 -> mem[0],  mem[1],  mem[2],  mem[3]
word 1 -> mem[4],  mem[5],  mem[6],  mem[7]
word 2 -> mem[8],  mem[9],  mem[10], mem[11]
...

Converting between byte and word addresses:

addr_word = addr / 4;          // byte address  -> word address  (== addr >> 2)
addr      = addr_word * 4;      // word address  -> byte address  (== addr_word << 2)

Because the low 2 bits of a word-aligned byte address are always 00, they form the byte offset within a word. They are exactly the bits we drop when we shift right by 2 to get a word address.

   31 ............ 2 | 1 0
   [   word address  | bo ]
                       ^^ byte offset (00 for word-aligned access)

10. Direct-Mapped Cache¶

A direct-mapped cache is the simplest: every address maps to exactly one slot. That makes Q1 (where to look) and Q3 (what to evict) trivial — there is only one place it can go, so if something is already there, it gets evicted.

Consider a 4-word direct-mapped cache:

uint32_t cache[4];

The mapping function¶

First convert the byte address to a word address, then take it modulo the number of slots:

addr_word  = addr / 4;          // or addr >> 2
slot_index = addr_word % 4;     // which of the 4 slots

Since the cache size is a power of two, the modulo is just the low bits of the word address, which we can isolate with a mask instead of %:

slot_index = addr_word & 0b11;  // low 2 bits == addr_word % 4

Why the mask works — look at the low bits of consecutive word addresses:

dec  bin     slot
 0   00000    0
 1   00001    1
 2   00010    2
 3   00011    3
 4   00100    0   <- wraps around
 5   00101    1
 6   00110    2
 7   00111    3
 8   01000    0

Word addresses 0, 4, 8, 12, ... all collide in slot 0; 1, 5, 9, 13, ... all map to slot 1; and so on.

The tag and valid bit (Q2)¶

Many word addresses map to the same slot, so we must record which one is actually stored. The remaining high bits — the bits above the slot index — uniquely identify which address is present. We call those the tag.

For word addresses landing in slot 0 (0, 4, 8, 12), the upper bits are 000, 001, 010, 011 — all distinct. Storing the tag alongside the data lets us confirm a hit.

We also need a valid bit. At power-up the cache is zeroed; a tag of 0 and data of 0 are legitimate values, so without a separate flag we could not tell "real zero data" from "empty slot." The valid bit solves this.

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
};

struct cache_slot_st cache[4];

A 64-bit byte address for a 4-slot, 1-word-block direct-mapped cache decomposes as:

tag[63:4]   slot_index[3:2]   byte_offset[1:0]

Lookup pseudocode¶

addr_tag    = addr >> 4;             // drop byte offset (2) + index bits (2)
slot_index  = (addr >> 2) & 0b11;    // the 2 index bits
slot        = &cache[slot_index];

if (slot->valid == 1 && slot->tag == addr_tag) {
    // HIT
    return slot->data;
} else {
    // MISS: fetch, install (overwriting whatever was there), return
    slot->data  = *((uint32_t *) addr);
    slot->tag   = addr_tag;
    slot->valid = 1;
    return slot->data;
}

Common mistake: comparing the full byte address to the tag. The tag is only the high bits left after removing the offset and index bits. Compare addr >> 4, not addr.

Strength: dead simple, one comparison, no replacement logic. Weakness: conflict misses. Two hot addresses that map to the same slot kick each other out repeatedly, even when the rest of the cache is empty.

11. Fully-Associative Cache¶

A fully-associative cache flips the trade-off: an address may live in any slot. That eliminates conflict misses (nothing forces two addresses to collide), but makes Q2 expensive — to find data we must compare the tag against every slot.

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
    uint64_t timestamp;   // for LRU replacement
};

struct cache_slot_st cache[4];

With no index, the address is just a tag and a byte offset:

tag[63:2]   byte_offset[1:0]

Lookup pseudocode¶

num_refs += 1;
addr_tag = addr >> 2;                 // remove byte offset only

for (int i = 0; i < 4; i++) {         // search ALL slots
    struct cache_slot_st *slot = &cache[i];
    if (slot->valid == 1 && slot->tag == addr_tag) {
        // HIT
        slot->timestamp = num_refs;   // mark recently used
        return slot->data;
    }
}

// MISS: choose a victim, install
struct cache_slot_st *slot = find_lru_slot(cache);
slot->data      = *((uint32_t *) addr);
slot->tag       = addr_tag;
slot->valid     = 1;
slot->timestamp = num_refs;
return slot->data;

Replacement (Q3): LRU¶

On a miss we need a victim. find_lru_slot():

First, look for any slot with valid == 0 (an empty slot) — use it.
Otherwise, evict the least recently used (LRU) slot.

A clean software way to track recency is a logical timestamp: a counter that increments on every reference. Each time a slot is touched, stamp it with the current num_refs. The slot with the smallest timestamp is the one untouched the longest — the LRU victim.

struct cache_slot_st *find_lru_slot(struct cache_slot_st *cache) {
    // prefer an unused slot
    for (int i = 0; i < 4; i++)
        if (cache[i].valid == 0)
            return &cache[i];
    // else the slot with the smallest timestamp
    int lru = 0;
    for (int i = 1; i < 4; i++)
        if (cache[i].timestamp < cache[lru].timestamp)
            lru = i;
    return &cache[lru];
}

In hardware, perfect LRU and full timestamps are too expensive, so processors use approximate LRU. In our software simulator we can afford the exact version.

Strength: no conflict misses; maximally flexible. Weakness: every lookup scans all slots (lots of comparators in hardware). Only practical for small structures like TLBs in virtual-memory hardware.

12. Set-Associative Cache¶

A set-associative cache is the compromise everyone actually ships. View the cache as an array of sets; map the address to one set using the simple direct-mapped index, but allow it to land in any slot within that set. The number of slots per set is the number of ways: a 4-way set-associative cache has 4 slots per set.

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
    uint64_t timestamp;
};

struct cache_slot_st cache[32];     // 32 slots total

The number of sets is:

num_sets = total_slots / num_ways

With 32 slots and 4 ways: 32 / 4 = 8 sets. Eight sets need 3 index bits (2^3 = 8), so the address decomposes as:

tag[63:5]   set_index[4:2]   byte_offset[1:0]

Lookup pseudocode (4-way, 32 slots = 8 sets)¶

We lay the 32-slot array out as 8 groups of 4: slots 0..3 are set 0, slots 4..7 are set 1, etc. Once we have set_index, the first slot of the set is set_index * num_ways.

num_refs += 1;
int num_ways  = 4;
uint64_t addr_tag   = addr >> 5;              // drop byte offset(2) + set_index(3)
int      set_index  = (addr >> 2) & 0b111;    // the 3 set-index bits
int      set_base   = set_index * num_ways;   // first slot in this set

for (int i = 0; i < num_ways; i++) {          // search ONLY this set
    struct cache_slot_st *slot = &cache[set_base + i];
    if (slot->valid == 1 && slot->tag == addr_tag) {
        slot->timestamp = num_refs;           // HIT
        return slot->data;
    }
}

// MISS: LRU within this set only
struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base, num_ways);
slot->data      = *((uint32_t *) addr);
slot->tag       = addr_tag;
slot->valid     = 1;
slot->timestamp = num_refs;
return slot->data;

The key win: we only search and replace within one small set (4 or 8 slots), not the whole cache, yet we still have flexibility about where in the set the data goes. find_lru_slot_in_set() first looks for an unused slot in the set, then applies LRU among the set's slots.

flowchart LR
    A["Direct-mapped<br/>1 slot per address"] --> B["Set-associative<br/>N slots per set"]
    B --> C["Fully associative<br/>any slot (1 set)"]
    A -. "increasing flexibility / cost" .-> C

Note the two extremes are just special cases: a 1-way set-associative cache is a direct-mapped cache, and a set-associative cache with one set (ways == total slots) is fully associative.

13. Block Size and Spatial Locality¶

So far each slot holds one word. To cash in on spatial locality and amortize the cost of a DRAM transfer, real caches store a block of several adjacent words per slot. With a 4-word block:

struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];     // 4 words per slot
    uint32_t valid;
    uint64_t timestamp;
};

Now we add a block offset to the address breakdown (which word within the block) between the byte offset and the index. For a 4-slot cache with a 4-word block:

Direct mapped:       tag[63:6]  slot_index[5:4]  block_index[3:2]  byte_offset[1:0]
Fully associative:   tag[63:4]                    block_index[3:2]  byte_offset[1:0]
Set assoc (2-way):   tag[63:5]  set_index[4]      block_index[3:2]  byte_offset[1:0]

To fill a block on a miss, find the block's base word and load the whole block from memory:

addr_word   = addr / 4;
block_index = addr_word % 4;            // 4 == block size in words
block_base  = addr_word - block_index;  // word address of the block start
// load 4 words from memory starting at byte address (block_base * 4)

Larger blocks improve hit rate when programs stream through memory (great spatial locality), but blocks that are too large waste space and bandwidth on data that is never used. Block size is a tuning knob.

14. Worked Example: Tracing a Direct-Mapped Cache¶

Let us trace the 4-word, 1-word-block direct-mapped cache from Section 10 by hand. Stream of word addresses (so multiply by 4 for byte addresses): 0, 1, 2, 3, 4, 0.

For each, slot = word_addr % 4 and tag = word_addr / 4:

step	word addr	slot	tag	valid before?	tag match?	result
1	0	0	0	no	—	MISS
2	1	1	0	no	—	MISS
3	2	2	0	no	—	MISS
4	3	3	0	no	—	MISS
5	4	0	1	yes (tag 0)	no	MISS (evicts word 0)
6	0	0	0	yes (tag 1)	no	MISS (evicts word 4)

Result: 6 references, 0 hits, 6 misses → hit rate 0%. Steps 5 and 6 are conflict misses: words 0 and 4 both map to slot 0 and keep evicting each other, even though slots 1, 2, 3 sit idle holding stale data. A 2-way set-associative cache of the same size would have kept both word 0 and word 4 (different ways of the same set) and turned step 6 into a hit.

Key Concepts¶

Concept	Definition	Example
Cache	Small, fast SRAM memory holding recently used data, sitting between CPU and DRAM	On-chip L1 cache `$$`
Hit / Miss	Data is / is not already in the cache for a reference	Hit returns fast; miss goes to DRAM
Hit rate	`num_hits / num_refs`	0.97 means 97% served from cache
Temporal locality	Recently used data is likely reused soon	loop counter `i`, repeated calls
Spatial locality	Neighbors of used data are used soon	`a[i]` then `a[i+1]`; `PC` then `PC+4`
Tag	High address bits stored in a slot to confirm identity	`addr >> 4` in a 4-slot, 1-word cache
Valid bit	Flag marking whether a slot holds real data	distinguishes "empty" from "zero data"
Direct-mapped	Each address maps to exactly one slot	`slot = word_addr % num_slots`
Fully associative	An address may occupy any slot	search all slots, LRU replacement
Set associative	Map to a set, free placement within the set	4-way: 4 slots per set
Block size	Number of adjacent words moved per miss	4-word block exploits spatial locality
LRU	Evict the least-recently-used entry	smallest logical timestamp
SRAM vs DRAM	Fast/expensive/sparse vs slow/cheap/dense	cache vs main memory

Practice Problems¶

Problem 1: Hit/Miss/Rate basics¶

A cache services 1,000 memory references and records 30 misses. What are the miss rate and hit rate? How many hits occurred?

Click to reveal solution

miss_rate = num_misses / num_refs = 30 / 1000 = 0.03  (3%)
hit_rate  = 1 - miss_rate         = 0.97          (97%)
num_hits  = num_refs - num_misses = 1000 - 30 = 970

A 97% hit rate is typical for real programs thanks to locality.

Problem 2: Address breakdown¶

For a direct-mapped cache with 8 slots and a 1-word block, split a 64-bit byte address into tag, slot_index, and byte_offset. How many bits is each field, and how do you compute the slot index from addr?

Click to reveal solution

- `byte_offset` = 2 bits (word-aligned, low bits `[1:0]`). - `slot_index` = 3 bits, because `2^3 = 8` slots (`[4:2]`). - `tag` = the remaining high bits `[63:5]`.

tag[63:5]   slot_index[4:2]   byte_offset[1:0]

slot_index = (addr >> 2) & 0b111;   // == (addr / 4) % 8
addr_tag   = addr >> 5;             // drop offset(2) + index(3)

Problem 3: Trace a direct-mapped cache¶

Using a 4-slot, 1-word-block direct-mapped cache (initially all invalid), trace the word address stream 0, 4, 0, 4. Mark each as hit or miss and give the final hit rate.

Click to reveal solution

Both 0 and 4 map to slot 0 (`0 % 4 == 0`, `4 % 4 == 0`) with tags 0 and 1 respectively. | ref | word addr | slot | tag | result | reason | |-----|-----------|------|-----|--------|--------| | 1 | 0 | 0 | 0 | MISS | slot empty | | 2 | 4 | 0 | 1 | MISS | tag mismatch, evicts 0 | | 3 | 0 | 0 | 0 | MISS | tag mismatch, evicts 4 | | 4 | 4 | 0 | 1 | MISS | tag mismatch, evicts 0 | 4 refs, 0 hits → **hit rate 0%**. This is a pathological conflict-miss pattern ("ping-pong"). A 2-way set-associative cache would hold both 0 and 4 and turn refs 3 and 4 into hits (50% hit rate).

Problem 4: Why a valid bit?¶

After power-up the cache is all zeros. A request arrives whose computed tag is 0. Explain precisely why, without a valid bit, the cache would wrongly report a hit, and how the valid bit prevents it.

Click to reveal solution

On reset, every slot has `tag = 0` and `data = 0`. A request whose tag is also 0 would satisfy `slot.tag == addr_tag` even though nothing real was ever loaded — the cache returns stale 0 data and counts a false hit. The valid bit is initialized to 0 and only set to 1 after a real fill. The hit test becomes `slot.valid == 1 && slot.tag == addr_tag`, so an empty (just-reset) slot can never produce a hit, regardless of tag/data values.

Problem 5: Set-associative geometry¶

A 4-way set-associative cache has 64 total slots and a 1-word block. How many sets are there, how many set-index bits, and how do you compute set_base from set_index?

Click to reveal solution

num_sets   = total_slots / num_ways = 64 / 4 = 16 sets
set_index  bits = log2(16) = 4 bits

addr breakdown:  tag[63:6]  set_index[5:2]  byte_offset[1:0]

set_index = (addr >> 2) & 0b1111;   // low 4 bits of the word address
set_base  = set_index * num_ways;   // first slot of the set in the flat array

Lookup compares the tag against the 4 slots `cache[set_base .. set_base+3]` only.

Problem 6: LRU eviction¶

A fully-associative cache has 4 slots, all valid. Their logical timestamps (set on each access) are: slot0=12, slot1=7, slot2=20, slot3=15. A miss occurs. Which slot is evicted, and what is its new timestamp if the current num_refs is 21?

Click to reveal solution

LRU evicts the slot with the **smallest** timestamp — that is the one untouched the longest. The smallest is slot1 (7), so **slot1 is evicted**. After installing the new block, slot1's timestamp is updated to the current reference count: `num_refs = 21`. (`num_refs` is typically incremented at the start of handling this reference.)

Summary¶

Lab 06 wrap-up: fix sign_extend() by aligning the shift constant with the lecture's bit-numbering convention (change 64 to 63); pick one convention and apply it consistently.
Project 04: use jal instead of the call pseudo-instruction in the sort and fibrec code so you do not have to emulate auipc; targets are within jal range.
Dynamic analysis instruments the emulator to count instruction classes (I-type, R-type, load, store, branch) as they execute; fibrec 10 runs 1,675 instructions, and recursion vs sorting show very different memory profiles.
Caches exist because of the CPU–memory gap: CPU speed (Moore's Law) outpaced DRAM latency, so a small fast SRAM cache keeps hot data close and hides slow DRAM.
SRAM vs DRAM is the core trade-off: SRAM is fast/expensive/sparse (caches), DRAM is slow/cheap/dense (main memory); we cannot afford all-SRAM memory.
Hits and misses drive hit_rate = hits/refs and miss_rate = 1 - hit_rate; real programs hit >97% because of temporal and spatial locality.
Every cache answers three questions: where to look (placement/index), whether the data is really there (valid bit + tag), and what to evict (replacement policy).
Three organizations: direct-mapped (one slot per address, simple but conflict misses), fully-associative (any slot, no conflicts but expensive search), and set-associative (map to a set, free placement within it — the practical compromise); larger block sizes exploit spatial locality.

Lab 06: Cache Analysis¶

Overview¶

Learning Objectives¶

Prerequisites¶

1. Lab 06 Q&A: the sign_extend() Fix¶

The one-bit issue¶

2. Project 04: call vs jal¶

3. Dynamic Analysis: Counting Instructions at Run Time¶

4. Why Caches Exist: the CPU–Memory Gap¶

5. Two Types of RAM: SRAM vs DRAM¶

6. Hits, Misses, and Rates¶

7. The Three Cache Questions¶

8. Principles of Locality¶

Temporal locality¶

Spatial locality¶

9. Addresses, Words, and Blocks¶

10. Direct-Mapped Cache¶

The mapping function¶

The tag and valid bit (Q2)¶

Lookup pseudocode¶

11. Fully-Associative Cache¶

Lookup pseudocode¶

Replacement (Q3): LRU¶

12. Set-Associative Cache¶

Lookup pseudocode (4-way, 32 slots = 8 sets)¶

13. Block Size and Spatial Locality¶

14. Worked Example: Tracing a Direct-Mapped Cache¶

Key Concepts¶

Practice Problems¶

Problem 1: Hit/Miss/Rate basics¶

Problem 2: Address breakdown¶

Problem 3: Trace a direct-mapped cache¶

Problem 4: Why a valid bit?¶

Problem 5: Set-associative geometry¶

Problem 6: LRU eviction¶

Further Reading¶

Summary¶

1. Lab 06 Q&A: the `sign_extend()` Fix¶

2. Project 04: `call` vs `jal`¶