Skip to content

Cache Simulation

Overview

This lecture introduces cache memory: a small, fast memory that sits between the processor and main memory and holds recently accessed data so that future requests can be served quickly. We study why caches work (the principles of temporal and spatial locality), how a memory address is decomposed into a tag, an index, and an offset, and how three classic cache organizations — direct-mapped, fully associative, and set-associative — locate data and decide what to evict. These ideas are the foundation for the instruction-cache simulator you build in Project 4, where you extend a given direct-mapped cache to a 4-word block size and implement a 4-way set-associative cache with LRU replacement.

Learning Objectives

  • Explain what a cache is, why main memory is slow relative to the CPU, and how caches exploit temporal and spatial locality.
  • Compute hit rate and miss rate from raw hit, miss, and reference counts.
  • Decompose a 64-bit byte address into tag, index/set-index, block offset (word offset), and byte offset fields.
  • Trace a direct-mapped cache lookup by hand, including valid-bit and tag checks, hits, and misses.
  • Compare direct-mapped, fully associative, and set-associative caches in terms of flexibility and cost.
  • Describe the role of the valid bit, the tag, and recency metadata (timestamps) in a cache slot.
  • Implement an LRU replacement policy using a logical timestamp driven by the reference count.
  • Explain how increasing block size improves the hit rate by exploiting spatial locality.

Prerequisites

  • RISC-V machine code emulation and the struct rv_state model (Lab 3).
  • Bitwise operators: shift (<<, >>), AND (&), OR (|), and masking (Project 3 / key concepts).
  • Binary and hexadecimal number systems, and the relationship between bytes and words.
  • C structs, arrays, and pointer dereferencing (*((uint32_t *) addr)).
  • The fixed-size integer types uint8_t, uint32_t, and uint64_t.

1. Why Caches Exist

A modern processor can execute instructions far faster than main memory (DRAM) can supply data. If every load and store had to wait for main memory, the CPU would spend most of its time stalled. A cache is a small block of fast memory — typically built from static RAM (SRAM) — placed between the processor and main memory. It keeps copies of recently used data so that repeat accesses are fast.

flowchart LR
    P[Processor] -->|"addr (request)"| C[Cache]
    C -->|data| P
    C -->|"addr (on miss)"| M[Main Memory]
    M -->|data| C

    style C fill:#f9f,stroke:#333,stroke-width:2px

The processor sends the cache an address and expects a value back. There are two outcomes:

  1. Hit — the data for that address is already in the cache. The cache returns it immediately, no main-memory access required.
  2. Miss — the data is not in the cache. The cache must fetch it from main memory, store a copy, and then return it. This is slow.

The instructor drew this as a numbered data path: on a hit (path 1) data flows straight back from the cache to the processor. On a miss (paths 2 and 3) the cache first goes out to memory for the data (2), then forwards it back to the processor (3).

A memory reference (or "ref") is a single request from the processor to the cache. Each reference is either a hit or a miss, so:

num_refs = num_hits + num_misses

Caches Are Everywhere

Although this course focuses on the cache memory inside a processor, the same idea appears throughout computer systems: operating-system page caches, database buffer pools, web browser caches, and CDN edge caches all keep recently or frequently used data in faster storage to avoid going back to a slow source.


2. Hit Rate and Miss Rate

The effectiveness of a cache is measured by its hit rate and miss rate. During a run we count three things: the number of references, the number of hits, and the number of misses. From those counts:

hit_rate  = num_hits   / num_refs
miss_rate = num_misses / num_refs

Because every reference is either a hit or a miss, the two rates are complements:

hit_rate  = 1 - miss_rate
miss_rate = 1 - hit_rate

Our goal is to design caches with high hit rates. In practice, real caches are remarkably successful — hit rates above 95% (often above 97%) are common — because programs do not access memory randomly. They follow the principles of locality (Section 3).

Worked Example

Suppose a program makes 1000 memory references, of which 850 are hits.

num_refs  = 1000
num_hits  = 850
num_misses = 1000 - 850 = 150

hit_rate  = 850 / 1000  = 0.85  (85%)
miss_rate = 150 / 1000  = 0.15  (15%)

Note that hit_rate + miss_rate = 0.85 + 0.15 = 1.00, as expected.


3. The Principles of Locality

Caches work because real program execution exhibits locality — accesses cluster in time and in space rather than being spread uniformly across memory.

Principle Statement Code example
Temporal locality If a location is accessed, it is likely to be accessed again soon. A loop index i, a loop body, or a function called repeatedly.
Spatial locality If a location is accessed, nearby locations are likely to be accessed soon. Walking an array a[i], then a[i+1]; executing instruction at PC, then PC+4.

Temporal locality explains why keeping a recently used value pays off: a loop counter or the instructions inside a loop body are touched again on the very next iteration.

Spatial locality explains why we should fetch a block of neighboring data on a miss rather than a single word. Instruction streams are an excellent example: unless the current instruction is a branch or jump, the next instruction executed is the one at PC + 4. The Project 4 simulator focuses on this instruction-fetch reference stream.

The simulator in Project 4 simulates the cache for instruction memory references only — there is no separate data cache. Instruction memory is effectively read-only, so we never have to write modified data back to memory.


4. Addresses, Words, and Blocks

A memory request is just an address. On modern machines addresses are byte addresses: each unique address names one byte (8 bits). You can picture all of memory as a giant array of bytes:

uint8_t mem[N];   // N bytes; the index is the byte address

A RISC-V instruction or an int is a word — 32 bits, or 4 bytes. We assume the processor requests word values at word-aligned addresses (multiples of 4). With that assumption, memory becomes a sequence of words:

word 0  -> mem[0], mem[1], mem[2], mem[3]
word 1  -> mem[4], mem[5], mem[6], mem[7]
word 2  -> mem[8], mem[9], mem[10], mem[11]
...

Convert between byte addresses and word addresses by dividing or multiplying by 4:

addr_word = addr / 4;        // byte address -> word index
addr      = addr_word * 4;   // word index   -> byte address

Because 4 is a power of two, the same operations can be done with shifts:

addr_word = addr >> 2;       // divide by 4
addr      = addr_word << 2;  // multiply by 4

The bottom two bits that are lost in the shift are the byte offset — they select a byte within a word. We only need them if we want to extract a specific byte from a word.

Memory as an Array of Blocks

To exploit spatial locality, a cache moves data in chunks called blocks. The block size is the number of bytes (or words) transferred between memory and the cache on a miss. If the block size is 4 words, then memory is logically an array of 4-word blocks:

block 0 -> words 0..3   (bytes 0..15)
block 1 -> words 4..7   (bytes 16..31)
block 2 -> words 8..11  (bytes 32..47)
...

The lecture diagram showed memory drawn as a column of bytes on the left, grouped into words in the middle, and grouped into 4-word blocks on the right. A single byte address therefore picks out a byte, a word, and a block all at once — which fields of the address you look at decides which level you mean.


5. Direct-Mapped Cache

A direct-mapped cache is the simplest organization: every memory address maps to exactly one slot in the cache. This makes the replacement policy trivial — if the data you want is not in the slot it maps to, you simply overwrite that slot.

The Mapping Function

Consider a cache that holds 4 words:

uint32_t cache[4];

A position in the cache is a slot. To map a word address to a slot index, take the remainder modulo the number of slots:

slot_index = addr_word % 4;

This is a tiny hash function. With 4 slots, these word addresses all collide on slot 0: 0, 4, 8, 12, 16, .... These all map to slot 1: 1, 5, 9, 13, 17, .... The instructor worked two cases:

addr_word   slot_index = addr_word % 4
    4            4 % 4 = 0
   17           17 % 4 = 1

Modulo via Bit Masking

When the number of slots is a power of two, the modulo is just the low bits of the word address, so we can use a mask instead of a division:

slot_index = addr_word & 0b11;   // for 4 slots, keep the low 2 bits

Why does this work? Look at the low bits of consecutive word addresses:

dec   bin     slot (= low 2 bits)
 0   00000     0
 1   00001     1
 2   00010     2
 3   00011     3
 4   00100     0
 5   00101     1
 6   00110     2
 7   00111     3
 8   01000     0
 9   01001     1
10   01010     2

The low two bits are exactly addr_word % 4. The number of index bits grows with the cache size:

Slots Index bits Mask
4 2 0b11
8 3 0b111
16 4 0b1111
128 7 0b1111111

As the number of slots increases, more bits go to the index and fewer remain for the tag.

Going Straight from a Byte Address

We usually start with the byte address, not the word address. To get the slot index from a byte address, shift right by 2 (to drop the byte offset and form the word address) and then mask:

slot_index = (addr >> 2) & 0b11;

What Goes in a Slot: Tag and Valid Bit

Many addresses map to the same slot, so the slot alone cannot tell us which address it currently holds. We store extra bookkeeping with the data:

  • The tag — the upper bits of the address that uniquely identify which of the colliding blocks is in the slot.
  • The valid bit — whether the slot holds usable data at all.

Look again at the word addresses that map to slot 0 (0, 4, 8, 12). Their upper bits (000, 001, 010, 011) are all distinct — those upper bits are the tag.

A direct-mapped slot is a struct:

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;     // one word, for block size = 1
    uint32_t valid;
};

struct cache_slot_st cache[N];

Why We Need the Valid Bit

On power-up, all of memory and the cache are zeroed. But a tag of 0 and data of 0 are perfectly legitimate values, so without extra information the cache would falsely "hit" on address 0 before anything was ever loaded. The valid bit solves this: it starts at 0, meaning "this slot is empty." Only after a real fill do we set it to 1.

Address Breakdown (4 slots, block size 1)

For a 64-bit byte address with 4 slots and a 1-word block:

 63 ............. 4 | 3   2 | 1   0
        tag         | index | byte offset
  • bits [1:0] — byte offset (which byte within the word)
  • bits [3:2] — slot index (which of the 4 slots)
  • bits [63:4] — tag (everything above)

Direct-Mapped Lookup Pseudocode

This is the lookup the instructor wrote for a 4-slot, 1-word-block cache (it matches the given Project 4 starter code):

uint32_t dm_lookup(uint64_t addr) {
    uint64_t addr_tag   = addr >> 4;            // drop 2 byte bits + 2 index bits
    uint32_t index_mask = 0b11;
    uint32_t slot_index = (addr >> 2) & index_mask;
    struct cache_slot_st *slot = &cache[slot_index];

    if (slot->valid == 1 && slot->tag == addr_tag) {
        // hit
        return slot->data;
    } else {
        // miss
        slot->data  = *((uint32_t *) addr);     // fetch the word from memory
        slot->tag   = addr_tag;
        slot->valid = 1;
        return slot->data;
    }
}

The flow of a single lookup:

flowchart TD
    A["addr"] --> B["index = (addr >> 2) & mask"]
    A --> C["tag = addr >> (2 + index_bits)"]
    B --> D["slot = cache[index]"]
    C --> E{"slot.valid == 1<br/>AND slot.tag == tag?"}
    D --> E
    E -->|yes| F["HIT: return slot.data"]
    E -->|no| G["MISS: load from memory,<br/>set tag, valid = 1, return data"]

    style F fill:#bfb,stroke:#333
    style G fill:#fbb,stroke:#333

Strength and weakness. Direct mapping is simple and fast — one slot to check, no search. But it is rigid: two hot addresses that map to the same slot will repeatedly evict each other (a conflict miss), even if the rest of the cache is empty.


6. Fully Associative Cache

A fully associative cache removes the rigidity of direct mapping: a block can be placed in any slot. This eliminates conflict misses, because an incoming block can always go to an empty (or least-recently-used) slot regardless of its address.

The price is search cost. Because there is no fixed slot for an address, a lookup must compare the request's tag against the tag in every slot. The lecture sketch showed the address tag fanning out to a comparator (=) on each slot in parallel, with all comparator outputs feeding into a single "hit?" check.

flowchart TD
    A["addr tag"] --> C0["= slot0.tag?"]
    A --> C1["= slot1.tag?"]
    A --> C2["= slot2.tag?"]
    A --> C3["= slot3.tag?"]
    C0 --> H["hit?"]
    C1 --> H
    C2 --> H
    C3 --> H

In software this fan-out becomes a loop over the slots; in hardware the comparisons happen in parallel but require a lot of logic (one comparator per slot), which is why fully associative caches are expensive and reserved for small, special structures such as TLBs (hardware support for virtual memory).

Address Breakdown

With no index, the address is just a tag and a byte offset:

 63 ................. 2 | 1   0
          tag           | byte offset

Slot Structure and Lookup

We reuse the slot struct but add a timestamp field to track recency for replacement:

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
    uint64_t timestamp;   // logical time of last use, for LRU
};

struct cache_slot_st cache[N];
uint32_t fa_lookup(uint64_t addr) {
    num_refs += 1;
    uint64_t addr_tag = addr >> 2;              // drop byte offset bits

    for (int i = 0; i < N; i++) {
        struct cache_slot_st *slot = &cache[i];
        if (slot->valid == 1 && slot->tag == addr_tag) {
            // hit
            slot->timestamp = num_refs;          // mark as just used
            return slot->data;
        }
    }

    // miss
    struct cache_slot_st *slot = find_lru_slot(cache);
    slot->data      = *((uint32_t *) addr);
    slot->tag       = addr_tag;
    slot->valid     = 1;
    slot->timestamp = num_refs;
    return slot->data;
}

Choosing a Victim: LRU

On a miss we need a slot for the incoming block. find_lru_slot() does two things, in order:

  1. Prefer an unused slot — scan for the first slot with valid == 0 and use it.
  2. Otherwise evict the Least Recently Used (LRU) slot — the slot whose timestamp is smallest, meaning it has gone the longest without a reference.

The timestamp is a logical clock: it is simply the current num_refs value, which monotonically increases with every reference. Each time a slot is read or filled, we stamp it with the current reference count. The slot with the lowest stamp has not been touched in the longest time.

In hardware, a full per-slot timestamp and an exhaustive minimum search are too costly, so real designs use approximate LRU. For the simulator, the exact timestamp approach is fine.


7. Set-Associative Cache

A set-associative cache is the practical middle ground between rigid (direct-mapped) and expensive (fully associative). Think of the cache as an array of sets. An address maps to exactly one set using a direct-mapped-style index, but within that set the block may occupy any slot. The number of slots per set is the number of ways.

We name a cache by its associativity: a 4-way set-associative cache has 4 ways per set; an 8-way has 8. A direct-mapped cache is 1-way; a fully associative cache is N-way (one set).

flowchart TD
    A["addr"] --> B["set_index = (addr >> 2) & set_mask"]
    A --> T["tag = upper bits"]
    B --> S["select set"]
    S --> W0["way 0: valid? tag==?"]
    S --> W1["way 1: valid? tag==?"]
    S --> W2["way 2: valid? tag==?"]
    S --> W3["way 3: valid? tag==?"]
    T --> W0
    T --> W1
    T --> W2
    T --> W3
    W0 --> H{"any match?"}
    W1 --> H
    W2 --> H
    W3 --> H
    H -->|yes| HIT["HIT: update timestamp"]
    H -->|no| MISS["MISS: fill LRU way in this set"]

    style HIT fill:#bfb,stroke:#333
    style MISS fill:#fbb,stroke:#333

Counting Sets

If the cache has N total slots and num_ways per set, then:

num_sets = N / num_ways

For example, 32 slots with 4 ways gives 32 / 4 = 8 sets, which needs 3 index bits (2^3 = 8).

Address Breakdown (block size 1)

 63 ........ 5 | 4   3   2 | 1   0
      tag      | set index | byte offset

Here the set index is 3 bits because there are 8 sets.

Laying Out Sets in a Flat Array

We can reuse one flat cache[] array and treat consecutive groups of num_ways slots as one set. With 4 ways, slots 0..3 are set 0, slots 4..7 are set 1, and so on. Once we have the set index, the first slot of that set is:

set_base = set_index * num_ways;

The instructor's diagram laid this out vertically: slots 0–1 were set 0 (way 0, way 1), slots 2–3 were set 1, and set_base pointed at the first way of the selected set.

Set-Associative Lookup Pseudocode

This is the instructor's pseudocode for a lookup. (Note: the shift/mask constants here are written for a small example with 2 ways and an 8-slot layout, illustrating the idea — adjust the masks and num_ways to match your actual configuration.)

uint32_t sa_lookup(uint64_t addr) {
    num_refs += 1;
    int num_ways = 4;

    uint64_t addr_tag  = addr >> 5;             // drop byte offset + set index bits
    uint32_t set_index = (addr >> 2) & 0b111;   // low set-index bits (8 sets)
    uint32_t set_base  = set_index * num_ways;  // first slot of this set

    for (int i = 0; i < num_ways; i++) {
        struct cache_slot_st *slot = &cache[set_base + i];
        if (slot->valid == 1 && slot->tag == addr_tag) {
            // hit
            slot->timestamp = num_refs;
            return slot->data;
        }
    }

    // miss
    struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base);
    slot->data      = *((uint32_t *) addr);
    slot->tag       = addr_tag;
    slot->valid     = 1;
    slot->timestamp = num_refs;
    return slot->data;
}

The key efficiency win: we search only the num_ways slots in one set, not all N slots. find_lru_slot_in_set() mirrors the fully associative version but is scoped to a single set: prefer an unused (valid == 0) way first, then evict the LRU way by smallest timestamp.

Comparing the Three Organizations

Organization Where a block can go Lookup cost Conflict misses Used for
Direct-mapped exactly 1 slot 1 compare high simple caches
Set-associative (n-way) any way in 1 set n compares low most real L1/L2 caches
Fully associative any slot N compares none small structures (TLB)

Direct-mapped and fully associative are the two extremes of set-associative: 1-way = direct-mapped, N-way (one set) = fully associative.


8. Block Size Greater Than One Word

So far each slot has held a single word. To exploit spatial locality we make each slot hold a block of several words. For a 4-word block:

struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];     // 4 words per slot
    uint32_t valid;
    uint64_t timestamp;
};

struct cache_slot_st cache[N];

A single tag now covers an entire 4-word block. On a hit we still must pick out the right word within the block, which needs a new address field: the block offset (the instructor also called it the word offset).

Address Breakdown with a 4-Word Block

The address now has four fields. The block-offset bits sit just above the byte offset:

Direct-mapped, 4 slots, 4-word block:

 63 ...... 6 | 5   4 | 3   2 | 1   0
     tag     | slot  | word  | byte
             | index | offset| offset

Fully associative, 4-word block:

 63 ...... 4 | 3   2 | 1   0
     tag     | word  | byte
             | offset| offset

Set-associative (2-way), 4-word block:

 63 ...... 5 | 4  | 3   2 | 1   0
     tag     |set | word  | byte
             |idx | offset| offset

The 2 word-offset bits select which of the 4 words in the block to return (00=word 0, 01=word 1, 10=word 2, 11=word 3).

Hit and Miss Behavior with Blocks

The instructor summarized the two cases for block size greater than 1:

1. On a hit — get the word from the slot using the word offset:

uint32_t word_offset = (addr >> 2) & 0b11;   // which word in the block
return slot->block[word_offset];

2. On a miss — find the block's starting address and load all N words:

uint64_t addr_word   = addr / 4;                  // word index
uint64_t block_index = addr_word % 4;             // 4 = block size in words
uint64_t block_base  = addr_word - block_index;   // word address of first word

for (int i = 0; i < 4; i++) {
    uint64_t byte_addr = (block_base + i) * 4;     // convert word -> byte address
    slot->block[i] = *((uint32_t *) byte_addr);
}
slot->tag   = /* tag bits of addr */;
slot->valid = 1;

block_base is a word address, so remember to multiply by 4 before dereferencing. (You can also do all of this with bit operations on the byte address directly — block_base_byte = addr & ~0xF for a 16-byte block.)

Why Bigger Blocks Help

Because instructions execute mostly in sequence (PC, PC+4, PC+8, ...), fetching a 4-word block on the first miss means the next three instruction fetches are hits. The first access still pays the miss cost, but it is amortized over the whole block.

The instructor demonstrated this live: increasing the direct-mapped block size from 1 word to 4 words raised the hit ratio from about 62% to 85% on the same program — a direct payoff from spatial locality.

flowchart LR
    A["block size 1<br/>hit rate ~62%"] -->|"exploit spatial locality"| B["block size 4<br/>hit rate ~85%"]
    style B fill:#bfb,stroke:#333

There is a limit, though: if blocks get too large relative to the cache, you fetch words you never use and you hold fewer distinct blocks, which can hurt. For Project 4, a 4-word block is the sweet spot we explore.


9. Project 4: The Cache Simulator

Project 4 extends your RISC-V emulator with dynamic analysis (instruction counts) and an instruction-cache simulator. The cache simulates only instruction-fetch references — every time the emulator fetches an instruction at the PC, it makes one reference to the cache.

You must implement four cache configurations:

# Type Block size Replacement Status
1 Direct-mapped 1 word trivial (overwrite) given
2 Direct-mapped 4 words trivial (overwrite) you write
3 4-way set-associative 1 word LRU you write
4 4-way set-associative 4 words LRU you write

Implementation Notes from Lecture

  • The simulator selects the cache type via an enum in RVEMU.h, so one code base supports both direct-mapped and set-associative modes.
  • Start from the given direct-mapped, block-size-1 cache and make sure you fully understand how its slot struct, tag, valid bit, and mapping function fit together before extending it.
  • For block size 4, add a block[4] array and a word-offset extraction; on a miss, compute block_base and load all four words.
  • For set-associative, group the flat slot array into sets of 4, search the ways in a set, and add find_lru_slot_in_set() driven by a timestamp updated on every reference.
  • Track recency with a logical timestamp (the running reference count) — increment num_refs on every lookup and stamp each accessed slot with it.

Debugging Strategies (from the same session)

The session opened with debugging advice for the project and lab. Build and test incrementally rather than writing everything at once, and use these three techniques:

  • Custom test programs — write small, targeted assembly/test inputs that exercise exactly one behavior (e.g., a tight loop to stress spatial locality).
  • GDB — set breakpoints, single-step, inspect registers, and trace instruction execution to see exactly what the emulator does.
  • Code instrumentation — add printf statements to print control flow and cache state (hits/misses, slot indexes, tags) so you can watch the cache fill and evict.

Key Concepts

Concept Definition Example
Cache Small fast memory holding recently used data from main memory SRAM cache between CPU and DRAM
Hit / Miss Requested data is / is not present in the cache Hit returns data immediately; miss fetches from memory
Hit rate num_hits / num_refs 850 hits / 1000 refs = 0.85
Temporal locality Recently used data is likely reused soon A loop counter i
Spatial locality Neighbors of accessed data are likely accessed soon PC then PC+4; a[i] then a[i+1]
Tag Upper address bits identifying which block is in a slot addr >> 4 for 4-slot, 1-word DM
Valid bit Whether a slot holds usable data 0 at power-up, 1 after a fill
Slot index (addr >> 2) & mask selecting a direct-mapped slot (addr >> 2) & 0b11 for 4 slots
Block size Number of words moved per miss 4-word block = 16 bytes
Word offset Address bits selecting a word within a block bits [3:2] for a 4-word block
Associativity / way Number of slots a block may occupy within a set 4-way = 4 candidate slots per set
Set Group of ways an address maps to 32 slots / 4 ways = 8 sets
LRU Evict the least recently used slot smallest timestamp in the set
Eviction Removing data to make room for new data overwrite victim slot on a miss

Practice Problems

Problem 1: Hit and Miss Rates

A program issues 2,000 memory references. The cache records 1,840 hits. What are the number of misses, the hit rate, and the miss rate?

Click to reveal solution
num_refs   = 2000
num_hits   = 1840
num_misses = 2000 - 1840 = 160

hit_rate   = 1840 / 2000 = 0.92  (92%)
miss_rate  = 160  / 2000 = 0.08  (8%)

Check: hit_rate + miss_rate = 0.92 + 0.08 = 1.00  ✓
       miss_rate = 1 - hit_rate = 1 - 0.92 = 0.08  ✓

Problem 2: Direct-Mapped Slot and Tag

A direct-mapped cache has 4 slots and a block size of 1 word. For the byte address 0x44 (decimal 68), compute the byte offset, the slot index, and the tag.

Click to reveal solution
addr = 0x44 = 68 = 0b1000100

Byte offset = addr & 0b11        = 0b00          = 0
Word index  = addr >> 2          = 68 / 4        = 17
Slot index  = (addr >> 2) & 0b11 = 17 & 0b11     = 1   (since 17 % 4 = 1)
Tag         = addr >> 4          = 68 >> 4       = 4

So 0x44 maps to slot 1 with tag 4.
This matches the lecture's worked example: word 17, 17 % 4 = 1.

Problem 3: Trace a Direct-Mapped Cache

A 4-slot, 1-word-block direct-mapped cache starts empty (all valid bits 0). The processor references this sequence of word indexes: 0, 1, 2, 0, 4, 0. Mark each as a hit or miss and give the final cache contents.

Click to reveal solution
Recall: slot = word_index % 4, tag = word_index / 4.

ref  word  slot  tag  result   reason
 1    0     0     0   MISS     slot 0 empty -> fill (tag 0)
 2    1     1     0   MISS     slot 1 empty -> fill (tag 0)
 3    2     2     0   MISS     slot 2 empty -> fill (tag 0)
 4    0     0     0   HIT      slot 0 valid, tag 0 matches
 5    4     0     1   MISS     slot 0 has tag 0, need tag 1 -> evict, fill
 6    0     0     0   MISS     slot 0 now has tag 1, need tag 0 -> evict, fill

Hits = 1, Misses = 5, refs = 6
hit_rate = 1/6 ≈ 0.167 (16.7%)

Final cache:
  slot 0: valid=1, tag=0  (word 0)   <- last fill
  slot 1: valid=1, tag=0  (word 1)
  slot 2: valid=1, tag=0  (word 2)
  slot 3: valid=0        (never used)

Note refs 5 and 6 thrash: words 0 and 4 both map to slot 0 and
keep evicting each other (a conflict miss). A 2-way set-associative
cache would keep both.

Problem 4: Set-Associative Geometry

A 4-way set-associative cache has 64 total slots and a block size of 1 word. How many sets are there, how many bits does the set index need, and how is the byte address 0xA0 decomposed?

Click to reveal solution
num_sets = total_slots / num_ways = 64 / 4 = 16 sets
set index bits = log2(16) = 4 bits

Address fields (1-word block):
  byte offset = bits [1:0]   (2 bits)
  set index   = bits [5:2]   (4 bits)
  tag         = bits [63:6]

For addr = 0xA0 = 160 = 0b10100000:
  byte offset = 160 & 0b11        = 0
  set index   = (160 >> 2) & 0b1111 = 40 & 0b1111 = 8
  tag         = 160 >> 6          = 2

So 0xA0 maps to set 8 and may occupy any of its 4 ways; tag = 2.

Problem 5: Block Size and Spatial Locality

A program executes 16 instructions in straight-line order at word indexes 0,1,2,...,15 (no branches), starting from a cold cache. Compare the number of misses for (a) block size 1 word and (b) block size 4 words. Assume the cache is large enough that no eviction occurs.

Click to reveal solution
(a) Block size = 1 word:
    Each new word is a fresh miss because nothing was prefetched.
    Misses = 16, hits = 0
    hit_rate = 0 / 16 = 0%

(b) Block size = 4 words:
    The 16 words form 4 blocks: [0-3], [4-7], [8-11], [12-15].
    The FIRST access to each block misses and loads all 4 words;
    the next 3 accesses in that block are hits.

    block [0-3]:   miss, hit, hit, hit
    block [4-7]:   miss, hit, hit, hit
    block [8-11]:  miss, hit, hit, hit
    block [12-15]: miss, hit, hit, hit

    Misses = 4, hits = 12
    hit_rate = 12 / 16 = 0.75 (75%)

Spatial locality turns 16 misses into 4. This is exactly why the
instructor's demo jumped from ~62% to ~85% by going from a 1-word
to a 4-word block.

Problem 6: LRU Eviction in a Set

A single set has 2 ways. References arrive (as tags) in this order, with the logical timestamp = reference number: A, B, A, C. Both ways start invalid. After all four references, which tag was evicted (if any), and what does the set hold?

Click to reveal solution
ref  tag  action                          way0        way1
 1    A   miss, fill invalid way0          A (ts=1)    -
 2    B   miss, fill invalid way1          A (ts=1)    B (ts=2)
 3    A   HIT, restamp A                   A (ts=3)    B (ts=2)
 4    C   miss, set full -> evict LRU      A (ts=3)    C (ts=4)
         LRU = smallest timestamp = B (ts=2), so B is evicted

Evicted: B
Final set: { A (ts=3), C (ts=4) }

Because A was re-referenced at ref 3, its timestamp (3) was higher
than B's (2), so B became the least recently used and was evicted.
A direct-mapped cache where A and B collided would have thrashed;
the 2-way set kept A available.

Further Reading


Summary

  1. A cache is a small, fast memory between the CPU and slow main memory; it serves repeat and nearby accesses quickly, returning data on a hit and fetching from memory on a miss.

  2. Performance is measured by hit and miss rates: hit_rate = num_hits / num_refs, miss_rate = num_misses / num_refs, and the two always sum to 1.

  3. Caches work because of locality — temporal (reuse soon) and spatial (neighbors soon) — which is why instruction streams and array traversals cache so well.

  4. An address decomposes into fields: byte offset (low 2 bits), then word/block offset, then an index or set index, then the tag. More slots or more sets shift bits from the tag into the index.

  5. A direct-mapped cache maps each address to exactly one slot ((addr >> 2) & mask), checks the valid bit and tag for a hit, and overwrites the slot on a miss — simple but prone to conflict misses.

  6. Fully associative and set-associative caches add flexibility: fully associative allows any slot (search all, expensive); set-associative maps to a set and allows any way within it (search a few), using LRU to choose a victim.

  7. LRU replacement uses a logical timestamp equal to the running reference count; the slot with the smallest timestamp is the least recently used and is evicted, after first preferring any invalid slot.

  8. Larger block sizes exploit spatial locality by loading neighboring words on a miss; in the live demo a 4-word block raised the direct-mapped hit rate from about 62% to 85%, and Project 4 asks you to build exactly these direct-mapped and 4-way set-associative variants at 1- and 4-word block sizes.