Lab 06: Cache Analysis

# Lab 06: Cache Analysis

## CS 315 Computer Architecture

---

## Agenda

- Lab 06 Q&A: `sign_extend()` fix
- Project 04: `call` vs `jal`
- Dynamic analysis: counting instructions
- Why caches exist
- SRAM vs DRAM
- Hits, misses, and rates
- Locality of reference
- The three cache questions
- Direct-mapped, fully-associative, set-associative caches
- Address breakdown and worked examples

---

## Lab 06: `sign_extend()` Fix

The shipped code and lecture derivation disagreed by **one bit**:

```c
// Lecture-aligned version (change 64 -> 63)
int64_t sign_extend(uint64_t value, int bits) {
    int shift = 63 - bits;
    return ((int64_t)(value << shift)) >> shift;
}
```

**How it works:** shift sign bit up to bit 63, then arithmetic `>>` replicates it down.

<div class="highlight-box">
<strong>Convention:</strong> if <code>bits</code> is the index of the MSB (0-based), use <code>63 - bits</code>. If it is the count of meaningful bits, use <code>64 - bits</code>. Pick one and use it everywhere.
</div>

---

## Project 04: `call` vs `jal`

`call` is a **pseudo-instruction** — the assembler expands it into two real instructions:

```text
call foo  ==>  auipc ra, %pcrel_hi(foo)
               jalr  ra, ra, %pcrel_lo(foo)
```

For small programs (target within ~1 MiB), use `jal` directly:

```text
jal ra, foo    # ra = pc+4; pc = pc+offset  (one instruction)
ret            # jalr x0, ra, 0
```

<div class="mermaid">
flowchart TD
    A["Need a function call?"] --> B{"Target within jal range?"}
    B -- "Yes" --> C["jal ra, label\n(1 instruction)"]
    B -- "No" --> D["call label\nauipc + jalr"]
</div>

---

## Dynamic Analysis: Counting Instructions

**Dynamic** = measured *during* execution (loop body running 1000× contributes 1000).

```c
typedef struct {
    uint64_t total;
    uint64_t itype;   // I-type (addi, lw, jalr, ...)
    uint64_t rtype;   // R-type (add, sub, ...)
    uint64_t load;    // ld, lw, lb, ...
    uint64_t store;   // sd, sw, sb, ...
    uint64_t branch;  // beq, bne, blt, ...
} analysis_st;
```

<div class="info-box">
Keep instrumentation in one place — bump counters right after decode, not scattered across every opcode case.
</div>

---

## Dynamic Analysis: Emulator Loop

<div class="mermaid">
flowchart LR
    A["Decode instruction"] --> B["Execute instruction"]
    B --> C["analyze: bump counters"]
    C --> D{"halt?"}
    D -- "no" --> A
    D -- "yes" --> E["Print analysis report"]
</div>

`fibrec 10` executes **1,675 total instructions** — heavy stack traffic (`sd`/`ld`) from recursion, a very different profile from the sort routine.

---

## Why Caches Exist: The CPU–Memory Gap

**Moore's Law** doubled transistor count every ~2 years → CPUs got fast.
DRAM latency improved far more slowly → widening gap.

```text
speed
  ^
  |                              CPU  . ' .
  |                       . ' .          } gap
  |                 . ' .
  |  . ' .  _________________________  DRAM
  +-------------------------------------------------> time
```

Without a cache, every load/store stalls waiting on slow DRAM.
A **cache** keeps recently used data in fast SRAM close to the CPU.

---

## SRAM vs DRAM

| Property | SRAM | DRAM |
|----------|------|------|
| Speed | Fast | Slow |
| Cost per bit | Expensive | Cheap |
| Density | Low (~6 transistors/bit) | High (1T + 1C) |
| Refresh needed? | No | Yes |
| Typical use | Cache (on-chip) | Main memory |

<div class="highlight-box">
Why not build all memory from SRAM? Cost and chip area — gigabytes of SRAM would be enormous and prohibitively expensive.
</div>

---

## The Cache in the Memory Hierarchy

```text
processor          CACHE ($$)         Main Memory (DRAM)
+---------+  addr  +----------+  miss  +------------------+
|         | -----> |          | -----> |   CODE / DATA    |
|         | <----- |  fast $$ | <----- |                  |
+---------+  data  +----------+        +------------------+
     ^
     | hit: served fast from SRAM
```

- **Hit** — data already in cache; return immediately (fast)
- **Miss** — fetch from DRAM, install in cache, return (slow — but cached for next time)

---

## Hits, Misses, and Rates

Every memory reference is classified as a **hit** or a **miss**:

```text
hit_rate  = num_hits   / num_refs
miss_rate = num_misses / num_refs

hit_rate + miss_rate = 1
```

<div class="info-box">
Real caches achieve hit rates above <strong>97%</strong> because programs obey <em>locality of reference</em>.
</div>

**Example:** 1,000 references, 30 misses
→ miss rate = 3%, hit rate = 97%, hits = 970

---

## Locality of Reference

**Temporal locality** — recently used data will be used again soon

```c
int sum = 0;
for (int i = 0; i < n; i++)
    sum += a[i];   // sum and i reused every iteration
```

**Spatial locality** — neighbors of used data will be used soon

```c
for (int i = 0; i < n; i++)
    a[i] = i;      // a[0], a[1], a[2], ... are adjacent
```

Sequential instructions also show spatial locality: after `PC`, the CPU usually fetches `PC+4`.

---

## The Three Cache Questions

<div class="mermaid">
flowchart TD
    A["Memory reference: addr"] --> Q1["Q1: Where could it be?\n(compute index / placement)"]
    Q1 --> Q2["Q2: Is the data really there?\n(check valid bit + compare tag)"]
    Q2 -- "valid && tag matches" --> H["HIT: return data"]
    Q2 -- "otherwise" --> M["MISS: fetch from memory"]
    M --> Q3["Q3: Which entry to evict?\n(replacement policy)"]
    Q3 --> F["Install data, set tag + valid, return"]
</div>

---

## Address Breakdown: Byte vs Word

Memory is byte-addressed; for 1-word blocks, drop the low 2 bits:

```text
31 ............. 2 | 1 0
[   word address   | bo ]
                    ^^ byte offset (00 for word-aligned)
```

```c
addr_word  = addr >> 2;     // byte addr -> word addr
addr_byte  = addr_word << 2; // word addr -> byte addr
```

The **tag** holds the high bits that identify *which* address is in a slot.
The **valid bit** distinguishes "empty slot" from "slot containing zero."

---

## Direct-Mapped Cache

Every address maps to **exactly one slot** — the slot index is the word address mod the number of slots.

```c
addr_word  = addr >> 2;
slot_index = addr_word & 0b11;   // for 4 slots: low 2 bits
addr_tag   = addr >> 4;          // bits above offset + index
```

Address breakdown (4 slots, 1-word block):
```text
tag[63:4]   slot_index[3:2]   byte_offset[1:0]
```

**Strength:** one comparison, no replacement logic
**Weakness:** conflict misses — two hot addresses sharing a slot evict each other

---

## Direct-Mapped: Lookup Code

```c
struct cache_slot_st { uint64_t tag; uint32_t data; uint32_t valid; };
struct cache_slot_st cache[4];

// lookup
int slot_index = (addr >> 2) & 0b11;
uint64_t addr_tag = addr >> 4;
struct cache_slot_st *slot = &cache[slot_index];

if (slot->valid == 1 && slot->tag == addr_tag) {
    return slot->data;                // HIT
} else {
    slot->data  = *(uint32_t *)addr;  // MISS: fetch
    slot->tag   = addr_tag;
    slot->valid = 1;
    return slot->data;
}
```

<div class="highlight-box">
Common mistake: comparing the full byte address to the tag. The tag is only the <em>high</em> bits after removing offset and index.
</div>

---

## Worked Example: Direct-Mapped Trace

4-slot direct-mapped cache, word address stream: `0, 1, 2, 3, 4, 0`

| Step | Word addr | Slot | Tag | Hit/Miss |
|------|-----------|------|-----|----------|
| 1 | 0 | 0 | 0 | MISS (empty) |
| 2 | 1 | 1 | 0 | MISS (empty) |
| 3 | 2 | 2 | 0 | MISS (empty) |
| 4 | 3 | 3 | 0 | MISS (empty) |
| 5 | 4 | 0 | 1 | MISS (evicts word 0) |
| 6 | 0 | 0 | 0 | MISS (evicts word 4) |

6 refs, 0 hits → **hit rate 0%**
Steps 5 & 6 are **conflict misses**: words 0 and 4 share slot 0 and keep evicting each other.

---

## Fully-Associative Cache

An address may live in **any** slot — no index, just tag + byte offset:

```text
tag[63:2]   byte_offset[1:0]
```

```c
addr_tag = addr >> 2;
for (int i = 0; i < 4; i++) {      // search ALL slots
    if (cache[i].valid && cache[i].tag == addr_tag) {
        cache[i].timestamp = num_refs;
        return cache[i].data;       // HIT
    }
}
// MISS: pick victim via LRU, install
```

**Strength:** no conflict misses
**Weakness:** must compare against every slot (expensive in hardware)

---

## LRU Replacement

Track recency with a **logical timestamp** — increment on every reference, stamp each slot on access.

```c
struct cache_slot_st *find_lru_slot(struct cache_slot_st *cache) {
    for (int i = 0; i < 4; i++)
        if (cache[i].valid == 0) return &cache[i]; // empty first
    int lru = 0;
    for (int i = 1; i < 4; i++)
        if (cache[i].timestamp < cache[lru].timestamp) lru = i;
    return &cache[lru];  // smallest timestamp = least recently used
}
```

<div class="info-box">
Hardware uses approximate LRU (too expensive for exact). In our software simulator we can afford exact LRU.
</div>

---

## Set-Associative Cache

The practical compromise: map an address to a **set** (like direct-mapped), but allow it in any **way** within that set.

```text
num_sets = total_slots / num_ways
```

For 32 slots, 4 ways: 8 sets, 3 set-index bits:
```text
tag[63:5]   set_index[4:2]   byte_offset[1:0]
```

```c
uint64_t addr_tag  = addr >> 5;
int set_index      = (addr >> 2) & 0b111;
int set_base       = set_index * num_ways;   // first slot of the set
// search only cache[set_base .. set_base + num_ways - 1]
```

---

## Set-Associative: Lookup Flow

<div class="mermaid">
flowchart TD
    A["addr -> set_index, addr_tag"] --> B["Load set: cache[set_base .. set_base+ways-1]"]
    B --> C{"Any slot: valid && tag == addr_tag?"}
    C -- "yes" --> D["HIT: update timestamp, return data"]
    C -- "no" --> E["MISS: find LRU in set, install, return"]
</div>

LRU replacement is scoped to the set — only compare timestamps within the set's ways.

---

## Three Organizations Compared

<div class="mermaid">
flowchart LR
    A["Direct-mapped\n1 slot per address"] --> B["Set-associative\nN ways per set"]
    B --> C["Fully associative\nany slot (1 set)"]
    A -. "more flexibility / cost" .-> C
</div>

---

## Block Size and Spatial Locality

Real caches fetch a **block** of adjacent words per miss:

```c
struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];   // 4 words per block
    uint32_t valid;
    uint64_t timestamp;
};
```

Address with 4-word block, 4 slots (direct-mapped):
```text
tag[63:6]  slot_index[5:4]  block_index[3:2]  byte_offset[1:0]
```

<div class="info-box">
Larger blocks exploit spatial locality (array streaming). Blocks that are too large waste bandwidth on data never used — block size is a tuning knob.
</div>

---

## Address Breakdown Summary

For a **direct-mapped** cache with `S` slots and `B`-word blocks (word-aligned):

For **set-associative** with `S` sets:

**Example:** 8 slots, 1-word block, direct-mapped (64-bit addr)
```text
tag[63:5]   slot_index[4:2]   byte_offset[1:0]
slot_index = (addr >> 2) & 0b111
addr_tag   = addr >> 5
```

---

## The Valid Bit: Why It Matters

After power-up the cache is all zeros. Without a valid bit:

- Slot has `tag = 0`, `data = 0`
- Request with `addr_tag = 0` matches — **false hit!**

The valid bit is initialized to `0` and only set to `1` after a real fill:

```c
if (slot->valid == 1 && slot->tag == addr_tag) {
    // safe hit — data was actually loaded
}
```

<div class="highlight-box">
A valid bit distinguishes "empty slot initialized to zero" from "slot containing the actual value zero."
</div>

---

## Conflict Miss Demo

Stream `0, 4, 0, 4` through a 4-slot direct-mapped cache:

Both word 0 and word 4 map to **slot 0** (`0 % 4 == 0`, `4 % 4 == 0`).

| Ref | Word | Slot | Tag | Result |
|-----|------|------|-----|--------|
| 1 | 0 | 0 | 0 | MISS |
| 2 | 4 | 0 | 1 | MISS (evicts 0) |
| 3 | 0 | 0 | 0 | MISS (evicts 4) |
| 4 | 4 | 0 | 1 | MISS (evicts 0) |

**Hit rate: 0%** — "ping-pong" conflict miss.
A 2-way set-associative cache keeps both in different ways → refs 3 & 4 are hits (50%).

---

## LRU Example: Eviction Decision

Fully-associative cache, 4 slots, all valid:

| Slot | Timestamp |
|------|-----------|
| 0 | 12 |
| 1 | 7 |
| 2 | 20 |
| 3 | 15 |

Miss occurs at `num_refs = 21`. LRU evicts **slot 1** (smallest timestamp = 7).
After install: slot 1 gets `timestamp = 21`.

---

## Key Concepts Recap

| Concept | Key Idea |
|---------|----------|
| Cache (`$$`) | Small fast SRAM between CPU and DRAM |
| Hit / Miss | Data present / absent in cache |
| Hit rate | `hits / refs`; real programs: >97% |
| Temporal locality | Reuse recently accessed data |
| Spatial locality | Access neighbors; use blocks |
| Tag | High address bits identifying slot content |
| Valid bit | Distinguishes empty from zero-data |
| Direct-mapped | `slot = word_addr % num_slots` |
| Set-associative | Set by index; LRU within set |
| Fully associative | Any slot; LRU global |
| LRU | Evict slot with smallest timestamp |

---

## Summary

1. **`sign_extend()` fix** — align the shift constant with your bit-numbering convention (`63 - bits` or `64 - bits`); pick one and be consistent.

2. **`jal` vs `call`** — use `jal` in Project 04 to avoid implementing `auipc`; all targets are within ±1 MiB.

3. **Dynamic analysis** — count instruction classes (I-type, R-type, load, store, branch) in one place during the execute loop.

4. **CPU–memory gap** — Moore's Law sped up CPUs; DRAM lagged; SRAM caches hide the latency.

5. **SRAM vs DRAM** — fast/expensive/sparse vs slow/cheap/dense; cache gives the best of both.

6. **Three cache questions** — placement (index), identification (valid + tag), replacement (LRU).

7. **Three organizations** — direct-mapped (simple, conflict misses), fully associative (flexible, expensive), set-associative (practical compromise).