Cache Simulation

# Cache Simulation

## CS 315 Computer Architecture

---

## Why Caches Exist

- CPUs execute far faster than DRAM can supply data
- A **cache** is small, fast SRAM between the CPU and main memory
- It holds recently used data so repeat accesses are fast

---

## Hits and Misses

- **Hit** — data is in the cache; returned immediately
- **Miss** — data not in cache; must fetch from main memory (slow)
- **Memory reference** = one request from processor to cache

```text
num_refs = num_hits + num_misses
```

<div class="info-box">
Caches appear everywhere: OS page caches, database buffer pools,
browser caches, CDN edge nodes — all exploit the same principle.
</div>

---

## Hit Rate and Miss Rate

```text
hit_rate  = num_hits   / num_refs
miss_rate = num_misses / num_refs

hit_rate + miss_rate = 1
```

**Worked example**: 1000 refs, 850 hits

```text
num_misses = 150
hit_rate   = 850 / 1000 = 0.85  (85%)
miss_rate  = 150 / 1000 = 0.15  (15%)
```

<div class="highlight-box">
Real caches achieve hit rates above 95% — programs are not random!
</div>

---

## Principles of Locality

| Principle | Statement | Example |
|-----------|-----------|---------|
| **Temporal** | Recently used data will be used again soon | Loop counter `i`; loop body instructions |
| **Spatial** | Nearby addresses will be accessed soon | `a[i]` then `a[i+1]`; `PC` then `PC+4` |

- Temporal locality: keep recently accessed data in the cache
- Spatial locality: on a miss, fetch a whole **block** of neighboring data

<div class="info-box">
Project 4 simulates the <strong>instruction fetch</strong> reference stream — a nearly perfect example of spatial locality.
</div>

---

## Addresses, Words, and Blocks

- Memory is byte-addressed: each byte has a unique address
- A RISC-V **word** = 32 bits = 4 bytes (word-aligned)

```c
addr_word = addr >> 2;   // byte address -> word index
addr      = addr_word << 2;  // word index -> byte address
```

- The low 2 bits (lost by `>> 2`) are the **byte offset**
- A **block** = several consecutive words fetched together on a miss

```text
block 0 -> words 0..3   (bytes  0..15)
block 1 -> words 4..7   (bytes 16..31)
block 2 -> words 8..11  (bytes 32..47)
```

---

## Direct-Mapped Cache

Every address maps to **exactly one** slot:

```c
slot_index = addr_word % 4;   // 4-slot cache
```

Because 4 is a power of two, use a bitmask instead:

```c
slot_index = (addr >> 2) & 0b11;  // drop byte bits, keep low 2
```

| Word addr | Slot (`% 4`) |
|-----------|-------------|
| 0, 4, 8, 12 | 0 |
| 1, 5, 9, 13 | 1 |
| 2, 6, 10, 14 | 2 |
| 3, 7, 11, 15 | 3 |

---

## Slot Struct: Tag and Valid Bit

Many addresses map to the same slot — need extra bookkeeping:

- **Tag** — upper address bits identifying *which* block is stored
- **Valid bit** — 0 at power-up ("empty"), 1 after a fill

```c
struct cache_slot_st {
    uint64_t tag;
    uint32_t data;    // one word (block size = 1)
    uint32_t valid;
};
```

<div class="highlight-box">
Without the valid bit, a zeroed cache would falsely "hit" on address 0
before anything was loaded.
</div>

---

## Address Breakdown: 4 Slots, Block Size 1

```text
 63 ............. 4 | 3   2 | 1   0
        tag         | index | byte offset
```

- `bits [1:0]` — byte offset (which byte in the word)
- `bits [3:2]` — slot index (which of the 4 slots)
- `bits [63:4]` — tag

```c
addr_tag   = addr >> 4;
slot_index = (addr >> 2) & 0b11;
```

As cache size grows, **more bits go to index, fewer remain for tag**.

---

## Direct-Mapped Lookup

```c
uint32_t dm_lookup(uint64_t addr) {
    uint64_t addr_tag   = addr >> 4;
    uint32_t slot_index = (addr >> 2) & 0b11;
    struct cache_slot_st *slot = &cache[slot_index];

if (slot->valid == 1 && slot->tag == addr_tag) {
        return slot->data;          // HIT
    } else {
        slot->data  = *((uint32_t *) addr);  // MISS: fetch from memory
        slot->tag   = addr_tag;
        slot->valid = 1;
        return slot->data;
    }
}
```

---

## Lookup Flow

<div class="mermaid">
flowchart TD
    A["addr"] --> B["index = (addr >> 2) & mask"]
    A --> C["tag = addr >> 4"]
    B --> D["slot = cache[index]"]
    C --> E{"slot.valid == 1 AND slot.tag == tag?"}
    D --> E
    E -->|yes| F["HIT: return slot.data"]
    E -->|no| G["MISS: load from memory, update slot"]
</div>

---

## Direct-Mapped: Strength and Weakness

**Strengths**
- Simple and fast
- One slot to check
- No search required

</div>
<div>

**Weakness**
- **Conflict misses**: two hot addresses sharing a slot thrash each other
- Even if the rest of the cache is empty

</div>
</div>

<div class="highlight-box">
Example: word 0 and word 4 both map to slot 0. Alternating accesses
cause a miss on <em>every</em> reference.
</div>

---

## Fully Associative Cache

A block can go in **any** slot — eliminates conflict misses.

Lookup must check **every** slot's tag (fan-out search):

<div class="mermaid">
flowchart TD
    A["addr tag"] --> C0["= slot0.tag?"]
    A --> C1["= slot1.tag?"]
    A --> C2["= slot2.tag?"]
    A --> C3["= slot3.tag?"]
    C0 --> H["hit?"]
    C1 --> H
    C2 --> H
    C3 --> H
</div>

In software: a loop. In hardware: one comparator per slot — expensive.

---

## Fully Associative: Address and Slot

Address breakdown (no index field):

```text
 63 ................. 2 | 1   0
          tag           | byte offset
```

Slot struct adds a **timestamp** for LRU replacement:

```c
struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
    uint64_t timestamp;   // logical time of last use
};
```

---

## LRU Replacement Policy

On a miss, `find_lru_slot()`:

1. **Prefer an unused slot** — scan for `valid == 0`
2. **Otherwise evict LRU** — slot with the smallest `timestamp`

The **timestamp** is a logical clock: it equals `num_refs` at the time the slot was last accessed.

```c
// On every hit or fill:
slot->timestamp = num_refs;   // stamp with current ref count
```

<div class="info-box">
Smallest timestamp = longest time since last use = least recently used.
</div>

---

## Fully Associative Lookup

```c
uint32_t fa_lookup(uint64_t addr) {
    num_refs += 1;
    uint64_t addr_tag = addr >> 2;

for (int i = 0; i < N; i++) {
        struct cache_slot_st *slot = &cache[i];
        if (slot->valid == 1 && slot->tag == addr_tag) {
            slot->timestamp = num_refs;   // HIT: restamp
            return slot->data;
        }
    }
    // MISS: find LRU slot and fill
    struct cache_slot_st *slot = find_lru_slot(cache);
    slot->data = *((uint32_t *) addr);
    slot->tag = addr_tag;  slot->valid = 1;
    slot->timestamp = num_refs;
    return slot->data;
}
```

---

## Set-Associative Cache

The practical middle ground:

- Cache is divided into **sets**
- An address maps to exactly **one set** (like direct-mapped)
- Within that set, a block can go in **any way** (like fully associative)

```text
num_sets = total_slots / num_ways
```

A **4-way** cache: 4 slots per set. Direct-mapped = 1-way. Fully associative = N-way (one set).

---

## Set-Associative Lookup

<div class="mermaid">
flowchart TD
    A["addr"] --> B["set_index = (addr >> 2) & set_mask"]
    A --> T["tag = upper bits"]
    B --> S["select set (ways 0..3)"]
    T --> S
    S --> H{"any way: valid AND tag match?"}
    H -->|yes| HIT["HIT: update timestamp"]
    H -->|no| MISS["MISS: fill LRU way in this set"]
</div>

---

## Set-Associative: Address and Layout

Address breakdown (8 sets, block size 1):

```text
 63 ........ 5 | 4   3   2 | 1   0
      tag      | set index | byte offset
```

Flat array layout with 4 ways per set:

```c
set_base = set_index * num_ways;  // first slot of this set
// slots set_base .. set_base+3 are the 4 ways
```

Search only `num_ways` slots — not all N!

---

## Set-Associative Lookup Code

```c
uint32_t sa_lookup(uint64_t addr) {
    num_refs += 1;
    int num_ways    = 4;
    uint64_t addr_tag  = addr >> 5;           // drop byte + set-index bits
    uint32_t set_index = (addr >> 2) & 0b111; // 3 bits for 8 sets
    uint32_t set_base  = set_index * num_ways;

for (int i = 0; i < num_ways; i++) {
        struct cache_slot_st *slot = &cache[set_base + i];
        if (slot->valid == 1 && slot->tag == addr_tag) {
            slot->timestamp = num_refs;
            return slot->data;
        }
    }
    struct cache_slot_st *slot = find_lru_slot_in_set(cache, set_base);
    slot->data = *((uint32_t *) addr);
    slot->tag = addr_tag;  slot->valid = 1;
    slot->timestamp = num_refs;
    return slot->data;
}
```

---

## Comparing the Three Organizations

| Organization | Block placement | Lookup cost | Conflict misses |
|--------------|----------------|-------------|-----------------|
| Direct-mapped | 1 fixed slot | 1 compare | High |
| Set-associative (n-way) | any way in 1 set | n compares | Low |
| Fully associative | any slot | N compares | None |

<div class="info-box">
Direct-mapped = 1-way set-associative.<br>
Fully associative = N-way with 1 set.<br>
Most real L1/L2 caches are 4- or 8-way set-associative.
</div>

---

## Block Size Greater Than One Word

Add a `block[]` array to the slot struct:

```c
struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];    // 4 words per slot
    uint32_t valid;
    uint64_t timestamp;
};
```

On a **hit**, use the **word offset** to pick the right word:

```c
uint32_t word_offset = (addr >> 2) & 0b11;
return slot->block[word_offset];
```

---

## Address Breakdown with 4-Word Block

**Direct-mapped, 4 slots, 4-word block:**

**Fully associative, 4-word block:**

The 2 word-offset bits select which of the 4 words in the block to return.

---

## Loading a Block on a Miss

```c
uint64_t addr_word   = addr / 4;
uint64_t block_index = addr_word % 4;         // position within block
uint64_t block_base  = addr_word - block_index; // word addr of first word

for (int i = 0; i < 4; i++) {
    uint64_t byte_addr = (block_base + i) * 4;
    slot->block[i] = *((uint32_t *) byte_addr);
}
slot->tag   = /* tag bits of addr */;
slot->valid = 1;
```

One miss fetches **all 4 words** — the next 3 accesses in the block will be hits.

---

## Spatial Locality Payoff

For sequential instruction execution (`PC`, `PC+4`, `PC+8`, `PC+12`...):

- Block size 1: every instruction is a miss (16 refs = 16 misses)
- Block size 4: only the first word in each block misses

```text
block [0-3]:  miss, hit, hit, hit
block [4-7]:  miss, hit, hit, hit
...
```

<div class="highlight-box">
Live demo: increasing block size from 1 to 4 words raised the
direct-mapped hit rate from ~62% to ~85% on the same program.
</div>

---

## Project 4 Cache Configurations

| # | Type | Block size | Replacement | Status |
|---|------|-----------|-------------|--------|
| 1 | Direct-mapped | 1 word | overwrite | given |
| 2 | Direct-mapped | 4 words | overwrite | you write |
| 3 | 4-way set-associative | 1 word | LRU | you write |
| 4 | 4-way set-associative | 4 words | LRU | you write |

The emulator calls the cache on every instruction fetch (one ref per `PC`).

---

## Implementation Road Map

1. **Understand the given** direct-mapped, block-size-1 starter code fully
2. **Extend to block size 4**: add `block[4]`, word-offset extraction, block fill loop
3. **Add set-associative**: group slots into sets, search `num_ways` ways, add `find_lru_slot_in_set()`
4. **Track recency**: increment `num_refs` every lookup; stamp accessed slots with current count

<div class="info-box">
The cache type is selected via an <code>enum</code> in <code>RVEMU.h</code> —
one code base supports all four configurations.
</div>

---

## Debugging Strategies

Build and test **incrementally** — one configuration at a time.

| Technique | What to do |
|-----------|-----------|
| **Custom test programs** | Small assembly loops that stress one behavior |
| **GDB** | Breakpoints, single-step, inspect registers |
| **printf instrumentation** | Print slot index, tag, valid, hit/miss on each ref |

Verify hit/miss counts against expected values before moving to the next configuration.

---

## Key Address Field Summary

| Cache type | Byte offset | Word/Block offset | Index | Tag |
|------------|------------|-------------------|-------|-----|
| DM, block=1, 4 slots | `[1:0]` | — | `[3:2]` | `[63:4]` |
| DM, block=4, 4 slots | `[1:0]` | `[3:2]` | `[5:4]` | `[63:6]` |
| FA, block=1 | `[1:0]` | — | — | `[63:2]` |
| FA, block=4 | `[1:0]` | `[3:2]` | — | `[63:4]` |
| SA 4-way, 8 sets, block=1 | `[1:0]` | — | `[4:2]` | `[63:5]` |

---

## LRU Eviction Example

2-way set, references (as tags): A, B, A, C

| Ref | Tag | Action | Way 0 | Way 1 |
|-----|-----|--------|-------|-------|
| 1 | A | miss, fill way 0 | A (ts=1) | empty |
| 2 | B | miss, fill way 1 | A (ts=1) | B (ts=2) |
| 3 | A | **HIT**, restamp | A (ts=3) | B (ts=2) |
| 4 | C | miss, evict LRU=B | A (ts=3) | C (ts=4) |

B evicted because its timestamp (2) was smaller than A's (3).

---

## Summary

1. **Cache** = small fast SRAM between CPU and DRAM; exploits temporal and spatial locality

2. **Performance**: `hit_rate = hits / refs`; real caches achieve > 95%

3. **Address fields**: byte offset | word offset | index/set-index | tag

4. **Direct-mapped**: one slot per address — simple, fast, conflict-prone

5. **Fully associative**: any slot — flexible, expensive (search all)

6. **Set-associative**: any way in one set — practical middle ground; LRU eviction via logical timestamp

7. **Larger blocks** amortize miss cost over spatial neighbors; 4-word block raised hit rate from 62% to 85%

8. **Project 4**: implement configs 2–4 starting from the given DM/block-1 starter