Project 04 and Midterm Review¶

Overview¶

This session prepares you for the Thursday midterm and for Project 04 (RISC-V emulation analysis and cache simulation). The first half walks through implementing load and store instructions in the emulator, with careful attention to the I-type and S-type instruction formats, immediate construction, sign extension, target-address calculation, and pointer type casting. The second half introduces processor cache memory: direct-mapped caches, the byte/word/block address arithmetic that drives slot indexing, instruction vs. data caches, the effect of block size, and set-associative caches. Throughout, the emphasis is on reasoning about instruction formats and address bits rather than memorizing them.

Learning Objectives¶

Implement lw/sw (and by extension ld/sd) in the emulator by computing a target address and casting through the correct pointer type.
Decode the I-type and S-type instruction formats and reconstruct the immediate from its bit fields.
Apply two's complement, sign extension, and masking to immediates and addresses.
Convert between byte addresses, word addresses, and block addresses.
Map an address to a direct-mapped cache slot using both modulus arithmetic and bit manipulation, and identify the tag.
Explain why instruction caches (I-cache) and data caches (D-cache) are separated, and which one Project 04 simulates.
Compute the block-aligned base address for a multi-word block on a cache miss.
Describe set-associative caches: sets, ways, set indexing, and LRU replacement.

Prerequisites¶

RISC-V machine-code decoding and the struct rv_state / register-file model from Lab 03.
C pointers, pointer casts, and fixed-width integer types (uint32_t, int64_t, uint64_t).
Bitwise operators: &, |, <<, >>, and the difference between logical and arithmetic right shift.
Two's complement representation of signed integers.
The RISC-V guide and the Cache Memory guide.

1. Midterm Logistics and Study Strategy¶

The midterm is on Thursday. You may bring one page of notes, front and back (double-sided). A few points about what to expect:

You do not need to memorize instruction formats. Familiarity is enough — you should be able to reason about a format if it is given to you. Instruction formats may be printed on the exam.
Be ready to extract bit fields from an instruction word and reconstruct immediates, especially for the I-type and S-type formats.
Understand load/store implementation: target-address calculation and sign extension.
Review two's complement (invert and add one), sign extension, and masking.
Review caching: direct-mapped caches, block size, the tag/index/offset split of an address, and set-associative caches.

A good one-page sheet captures the mechanics: the I-type and S-type field layouts, the sign-extend idiom, the byte/word/block conversions, and the slot-index formula. The rest of these notes are organized around exactly those mechanics.

2. Loads and Stores in the Emulator¶

A load reads a value from memory into a register; a store writes a register value out to memory. In RISC-V both use a base register plus a signed immediate offset to form the address. The assembly syntax makes the offset and base explicit:

lw   t0, 8(sp)     # load word: t0 = mem[sp + 8]
sw   t0, 8(sp)     # store word: mem[sp + 8] = t0

Reading 8(sp):

lw  t0, 8(sp)
        |   |
        |   +-- base register (often sp)
        +------ offset (the immediate)

The address the instruction actually touches is the target address:

target_address = base + offset
               = base + imm

A critical project rule: do not index the emulated stack array directly. Even though sp usually points into the emulator's stack, you must compute the effective address and access memory through the emulator's memory interface (a real pointer dereference). This keeps loads and stores uniform regardless of which region of memory they touch.

The two operations differ only in direction and access width:

Instruction	Meaning	Width	Direction
`lw`	load word	32 bits	memory → register
`lwu`	load word unsigned	32 bits	memory → register (zero-extend)
`ld`	load doubleword	64 bits	memory → register
`sw`	store word	32 bits	register → memory
`sd`	store doubleword	64 bits	register → memory

The core logic is identical for word vs. doubleword loads — only the access size changes, which is selected by the funct3 field. Once you can implement lw, implementing ld is just a change of the pointer type from uint32_t * to uint64_t *.

3. The I-type Format and Load Word (`lw`)¶

Loads use the I-type format. The 12-bit immediate sits in the top bits as a single contiguous field, which makes it the easiest immediate to reconstruct.

 31           20 19   15 14  12 11   7 6      0
+---------------+-------+------+------+--------+
|    imm[11:0]  |  rs1  |funct3|  rd  | opcode |
+---------------+-------+------+------+--------+
      12 bits      5      3      5        7
        |          |             |
        |          |             +-- rd  = t0 (destination)
        |          +---------------- rs1 = sp (base register)
        +------------------------- the signed offset (int64_t imm)

For lw t0, 8(sp): rs1 selects sp (the base), rd selects t0 (the destination), and imm[11:0] holds the offset 8. The immediate is a signed two's-complement value, so before using it we must sign-extend its top bit (bit 11) out to 64 bits:

int64_t imm = sign_extend(vimm, 11);   // sign bit is bit 11

Here vimm is the raw 12-bit field pulled out of the instruction word, and 11 is the index of its sign bit. Now compute the target address and perform the load. The address is unsigned, the immediate is signed; the addition still works because two's complement addition is the same operation either way:

uint64_t target_address;
target_address = rsp->regs[rs1] + imm;

// Load a 32-bit word from memory through a real pointer.
rsp->regs[rd] = *((uint32_t *) target_address);

The cast (uint32_t *) is what makes this a word load: dereferencing a uint32_t * reads exactly 4 bytes. For ld, the only change is the pointer type:

rsp->regs[rd] = *((uint64_t *) target_address);   // load doubleword

Why a cast at all? target_address is just a number. To read memory we must tell C how wide the access is. The pointer type encodes the width (uint32_t * → 4 bytes, uint64_t * → 8 bytes), and the compiler emits the correct lw/ld under the hood.

4. The S-type Format and Store Word (`sw`)¶

Stores use the S-type format. There is no rd field — a store has no destination register, it writes to memory. To make room, the 12-bit immediate is split into two pieces placed in different parts of the instruction word. This split is the main thing that makes S-type harder than I-type.

 31        25 24   20 19   15 14  12 11    7 6      0
+-----------+-------+-------+------+--------+--------+
| imm[11:5] |  rs2  |  rs1  |funct3|imm[4:0]| opcode |
+-----------+-------+-------+------+--------+--------+
   7 bits      5       5      3       5        7
       |        |       |              |
       |        |       |              +-- imm[4:0]  (low 5 bits of offset)
       |        |       +----------------- rs1 = sp  (base register)
       |        +------------------------- rs2 = t0  (value to store)
       +---------------------------------- imm[11:5] (high 7 bits of offset)

To rebuild the immediate, extract both halves and shift the high half into place before OR-ing them together, then sign-extend:

uint64_t imm11_5;   // 7 bits from [31:25]
uint64_t imm4_0;    // 5 bits from [11:7]
uint64_t vimm;

vimm = (imm11_5 << 5) | imm4_0;     // reassemble the 12-bit field
int64_t imm = sign_extend(vimm, 11);

The address calculation is the same as for a load. The difference is the direction of the assignment — we write the register value into memory:

uint64_t target_address;
target_address = rsp->regs[rs1] + imm;

// Store a 32-bit word from rs2 into memory.
*((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2];

The (uint32_t) cast on the right truncates the 64-bit register to the low 32 bits, and the (uint32_t *) cast on the left makes the write 4 bytes wide. For sd, drop both uint32_t casts in favor of uint64_t.

Comparing the two assignment forms side by side makes the symmetry obvious:

// load word:   register <- memory
rsp->regs[rd] = *((uint32_t *) target_address);

// store word:  memory <- register
*((uint32_t *) target_address) = (uint32_t) rsp->regs[rs2];

5. Sign Extension, Two's Complement, and Masking¶

These three bit operations underpin everything above, and they are prime midterm material.

Two's complement represents signed integers. To negate a value, invert all bits and add one (twos = ~val + 1). The most significant bit is the sign bit (0 = positive, 1 = negative). A 3-bit example:

bin  dec
000   0
001   1
010   2
011   3
100  -4     <- sign bit set
101  -3
110  -2
111  -1

Sign extension widens a small signed value to a larger width while preserving its value. The idiom is shift the sign bit all the way left, then arithmetic shift right so the sign bit is replicated:

// Sign-extend a 12-bit immediate (sign bit at index 11) to 64 bits.
int64_t sign_extend(uint64_t v, int sign_bit) {
    int shift = 63 - sign_bit;     // for bit 11: shift = 52
    return ((int64_t) (v << shift)) >> shift;
}

Casting to int64_t before the right shift is essential — a right shift on a signed type is an arithmetic shift (fills with the sign bit), while a right shift on an unsigned type is a logical shift (fills with zeros).

Masking isolates a bit field: shift the field to the bottom, then AND with a mask of the field width:

uint8_t x = 0b11011010;
uint32_t v = x >> 2;        // bring bits [5:2] down to [3:0]
v = v & 0b1111;             // clear everything above bit 3

Masking shows up constantly in cache indexing (Section 7): addr_word & 0b11 keeps only the low two bits, which is exactly addr_word % 4 for a power-of-two slot count.

6. Cache Memory: Motivation and the Address Layers¶

Memory is large and slow; registers are tiny and fast. A cache is a small, fast memory that holds recently used data so most requests never have to go to slow main memory. Caches work because real programs exhibit locality:

Temporal locality — a value used now is likely used again soon (a loop index, a function called repeatedly).
Spatial locality — if you touch one address, you will likely touch a neighbor soon (array[i] then array[i+1]; instruction at PC then PC+4).

Project 04 simulates a cache, so you must be fluent in three views of the same address. Memory is byte-addressed, but we group bytes into words and words into blocks:

addr (byte)
63                                  3   2   1   0
+----------------------------- ... -+---+---+---+---+
|              ...                  | s1| s0| b1| b0|
+----------------------------- ... -+---+---+---+---+
                                    \_______/ \_____/
                                    slot index byte offset

The bottom two bits (b1 b0) are the byte offset within a word — they are lost when we go from a byte address to a word address. The conversions:

addr_word = addr / 4;        // byte address -> word address  (or addr >> 2)
addr      = addr_word * 4;   // word address -> byte address  (or addr_word << 2)

Word 0 lives in bytes 0–3, word 1 in bytes 4–7, word 2 in bytes 8–11, and so on. Project 04's instruction cache assumes addresses are word-aligned (multiples of 4), so the byte offset is always zero and can be ignored.

7. Direct-Mapped Cache¶

In a direct-mapped cache, every address maps to exactly one slot. The mapping is the cache's whole personality. Consider a cache with 4 slots:

uint32_t cache[4];

To find the slot for an address, first convert to a word address, then take it modulo the number of slots:

4 slots
addr_word  = addr / 4
slot_index = addr_word % (# slots)
           = addr_word % 4

Because 4 is a power of two, the modulus is identical to masking the low two bits of the word address — and masking the low two bits of the word address is the same as shifting the byte address right by 2 first:

slot_index = (addr >> 2) & 0b11;   // shift off byte offset, keep 2 index bits

Many byte addresses map to the same slot. We tell them apart with the tag — the high bits left over after removing the index and offset bits. Here are word addresses 0,1, 4,5, and 8,9 with the low bits of each byte address shown (index bits boxed, tag is everything above):

word addr   byte addr   bits (low end)              slot
   0            0        ...0000 0000 [00] 00         0
   1            4        ...0000 0000 [01] 00         1
   4           16        ...0000 0001 [00] 00         0
   5           20        ...0000 0001 [01] 00         1
   8           32        ...0000 0010 [00] 00         0
   9           36        ...0000 0010 [01] 00         1

Word addresses 0, 4, and 8 all land in slot 0; their differing high bits (the tag) are what distinguish them once they are in the cache. Each slot must therefore store a tag and a valid bit (so a freshly powered-up cache full of zeros is not mistaken for real data):

struct cache_slot_st {
    uint64_t tag;
    uint32_t data;
    uint32_t valid;
};
struct cache_slot_st cache[4];

flowchart LR
    A["byte address"] --> B["addr &gt;&gt; 2<br/>(drop byte offset)"]
    B --> C["&amp; 0b11<br/>(slot index)"]
    C --> D{"valid == 1<br/>and<br/>tag matches?"}
    D -->|yes| E["HIT: return data"]
    D -->|no| F["MISS: fetch from memory,<br/>fill slot, set valid + tag"]

Lookup pseudocode for a 4-slot direct-mapped cache:

uint64_t addr_tag   = addr >> 4;            // drop 2 offset + 2 index bits
uint64_t slot_index = (addr >> 2) & 0b11;
struct cache_slot_st *slot = &cache[slot_index];

if (slot->valid == 1 && slot->tag == addr_tag) {
    // hit
    return slot->data;
} else {
    // miss: bring the word in from memory and record it
    slot->data  = *((uint32_t *) addr);
    slot->tag   = addr_tag;
    slot->valid = 1;
    return slot->data;
}

Scaling up: a 16-slot direct-mapped cache uses 4 index bits instead of 2. The formulas become slot_index = addr_word % 16 or, with bits, slot_index = (addr >> 2) & 0b1111. More slots means more index bits and correspondingly fewer tag bits.

8. Instruction Cache vs. Data Cache¶

A processor generates two distinct streams of memory addresses:

The PC drives an address stream of instruction fetches (every cycle, in program order with PC+4 steps).
Load/store instructions (ld, lw, lb, sd, sw, sb) drive an address stream of data accesses.

        Processor
   +-------------------+
   | ld/lw/lb          |  data addr  --->  Data path to memory
   | sd/sw/sb          |
   |                   |
   | [ PC ]            |  instr addr --->  Instruction path to memory
   +-------------------+

Modern designs split the L1 cache into a D-cache (data) and an I-cache (instructions) so the two streams do not interfere and can be accessed in parallel. Higher levels (L2, L3) are usually unified (hold both).

flowchart LR
    PC["PC"] --> IC["I-Cache"]
    LDST["ld / sd<br/>(data access)"] --> DC["D-Cache"]
    IC --> MEM["Main Memory"]
    DC --> MEM
    style IC fill:#bbf,stroke:#333
    style DC fill:#fbb,stroke:#333

Project 04 simulates the I-cache, which is much simpler because instruction memory is:

Read-only — code is not modified while it runs, so there is no store policy to design.
Word-aligned — every instruction is a 32-bit word at a multiple-of-4 address, so the byte offset is always zero.

A full data cache is harder: it must handle byte, word, and doubleword accesses, and on a store it must decide when memory is updated:

Write-through — update memory on every store (simple, more memory traffic).
Write-back — update memory only when the dirty block is evicted (less traffic, the common choice; requires coherency between cores).

Project 04 sidesteps all of that by caching read-only instructions.

9. Cache Block Size and Spatial Locality¶

So far each slot held a single word. To exploit spatial locality, a slot can hold a block of several words. With a 4-word block, fetching one word also brings in its three neighbors, amortizing the cost of the slow memory access. The slot now stores an array:

struct cache_slot_st {
    uint64_t tag;
    uint32_t block[4];     // 4 words per block
    uint32_t valid;
    uint64_t timestamp;
};
struct cache_slot_st cache[N];

With a block, the address splits into one more field — the block index (which word within the block):

addr
63                       5   4   3   2   1   0
+----------------- ... ---+---+---+---+---+---+
|        tag              | s1| s0| w1| w0| ...byte offset (b1 b0)
+----------------- ... ---+---+---+---+---+---+
                          \_____/ \_____/ \_____/
                          slot     block   byte
                          index    index   offset

The arithmetic stacks one division on top of the other — group bytes into words, then words into blocks:

addr_word  = addr / 4              # bytes -> words
addr_block = addr_word / 4         # words -> blocks
addr_block = addr / 16             # bytes -> blocks (combined; 4 words = 16 bytes)
slot_index = addr_block % 4        # 4 slots in this example

Worked example for addr = 16:

addr       = 16
addr_word  = 16 / 4 = 4
addr_block =  4 / 4 = 1
slot_index =  1 % 4 = 1

With bit manipulation, for a 4-slot cache with 4-word blocks (block index = 2 bits, slot index = 2 bits, byte offset = 2 bits):

slot_index  = (addr >> 4) & 0b11;   // skip offset(2) + block index(2)
block_index = (addr >> 2) & 0b11;   // skip offset(2), keep block-word selector

flowchart TD
    A["block of 4 words<br/>w3 w2 w1 w0"] --> B["one cache slot"]
    C["access w0 (miss)"] --> D["fetch entire block:<br/>w0, w1, w2, w3"]
    D --> E["future accesses to<br/>w1, w2, w3 are HITS"]

The payoff: a single miss on w0 pulls in w1, w2, and w3, so accessing those neighbors afterward hits in the cache — spatial locality turned into free hits.

10. Finding the Block Base on a Miss¶

On a miss we must load the whole block from memory, which means computing the address of the block's first word. The notes show two equivalent techniques.

Arithmetic approach (modulus to find the offset, then subtract it):

Miss
addr_word = 6
addr      = 24

rem = addr_word % 4 = 6 % 4 = 2          # block index (which word in the block)
addr_word_block_start = addr_word - rem  # = 4
addr = addr_word_block_start * 4         # back to a byte address
addr = 16

So word 6 (byte 24) lives in the block that starts at word 4 (byte 16). Now read all four words of the block from memory, advancing by 4 bytes each time:

uint32_t block[4];
block[0] = *((uint32_t *) addr);        // word at byte 16
block[1] = *((uint32_t *) (addr + 4));  // word at byte 20
block[2] = *((uint32_t *) (addr + 8));  // word at byte 24
block[3] = *((uint32_t *) (addr + 12)); // word at byte 28

Bit-manipulation approach (clear the low bits that select within the block). A 4-word block spans 16 bytes, so the low 4 bits of the byte address index within the block; clearing them snaps the address down to the block base:

word addr   byte addr   bits (low end, block-index bits boxed)
   6           24        ...0000 0001 [1000]
   4           16        ...0000 0001 [0000]   <- block base

Two ways to clear those 4 bits:

// (a) shift the low bits out and back in
addr_block_start = (addr >> 4) << 4;

// (b) AND with a mask that zeros the low 4 bits
uint64_t addr_block_mask = 0xFFFFFFFFFFFFFFF0;  // ...FFFF FFF0 = low 4 bits cleared
addr_block_start = addr & addr_block_mask;

Both yield byte address 16, the start of the block containing byte 24. The mask 0xFFFFFFFFFFFFFFF0 is simply all ones with the bottom nibble zeroed — exactly the bits that select a word inside a 16-byte block.

11. Set-Associative Cache¶

A direct-mapped cache is simple but rigid (one home per address; collisions force eviction even when other slots are empty). A fully-associative cache is flexible but expensive (compare the tag against every slot). A set-associative cache is the practical middle ground.

The cache is divided into sets; each set contains several ways (slots). An address is mapped to a set using the direct-mapped technique, then may occupy any way within that set. An n-way set-associative cache has n ways per set.

          way 1            way 0
        +-------------+  +-------------+
 set 1  | v tag data  |  | v tag data  |
        +-------------+  +-------------+
 set 0  | v tag data  |  | v tag data  |
        +-------------+  +-------------+
          (a slot = one way within one set)

Lookup has two steps:

Map the address to a set index (direct-mapped style, using the set-index bits of the address).
Compare the tag against all ways in that set (a small loop, not the whole cache).

addr
63                          ...   b1  b0
+-------------- ... ----+--------+---+---+
|        tag            |set idx | byte  |
+-------------- ... ----+--------+---+---+
                            ^
                        set index

If we treat one flat slot array as logical groups of ways, the first slot of a set is set_index * num_ways:

num_ways  = 4;
addr_tag  = addr >> 5;            // drop offset(2) + set index(3)
set_index = (addr >> 3) & 0b111;  // 8 sets => 3 index bits
set_base  = set_index * num_ways; // first slot of this set

for (int i = 0; i < num_ways; i++) {
    struct cache_slot_st *slot = &cache[set_base + i];
    if (slot->valid == 1 && slot->tag == addr_tag) {
        slot->timestamp = num_refs;   // record use for LRU
        return slot->data;            // hit
    }
}
// miss: choose a way to fill/evict within this set

Replacement. When all ways in the target set are valid, we must evict one. The usual policy is LRU (Least Recently Used): keep a timestamp (e.g., the running reference count) on each slot, update it on every access, and on eviction pick the way with the smallest timestamp. find_lru_slot_in_set() should first prefer any unused (invalid) way, then fall back to LRU among the valid ways.

flowchart TD
    A["address"] --> B["set_index = (addr &gt;&gt; 3) &amp; 0b111"]
    B --> C["set_base = set_index * num_ways"]
    C --> D["loop ways: compare tags"]
    D -->|match| E["HIT, update timestamp"]
    D -->|no match| F["MISS"]
    F --> G{"unused way<br/>in set?"}
    G -->|yes| H["fill empty way"]
    G -->|no| I["evict LRU way<br/>(smallest timestamp)"]

This is the heart of Project 04's most advanced cache: a 4-way set-associative cache with LRU replacement, available with both 1-word and 4-word block sizes.

Key Concepts¶

Concept	Definition	Example
Target address	`base register + sign-extended immediate`; the effective memory address of a load/store	`target = regs[rs1] + imm`
I-type immediate	A single 12-bit field in bits [31:20]; sign-extended before use	`lw t0, 8(sp)`
S-type immediate	A 12-bit offset split into `imm[11:5]` and `imm[4:0]`	`sw t0, 8(sp)`
Sign extension	Widen a signed value preserving its value, by replicating the sign bit	`(int64_t)(v<<52) >> 52`
Masking	Isolate a bit field by shifting down and AND-ing with a mask	`(addr >> 2) & 0b11`
Word address	`addr / 4`; index into memory viewed as an array of 4-byte words	byte 16 → word 4
Slot index	Direct-mapped slot an address maps to	`addr_word % (#slots)`
Tag	High address bits stored in a slot to disambiguate addresses that share a slot	`addr >> 4` (4-slot cache)
Valid bit	Marks whether a slot holds real data (vs. power-up zeros)	`slot.valid == 1`
Block size	Number of words moved between memory and cache per fill	4-word block = 16 bytes
I-cache / D-cache	Separate L1 caches for instructions vs. data	Project 04 simulates the I-cache
Set / Way	A set-associative cache is an array of sets; each set has n ways	4-way set associative
LRU	Evict the least recently used way in a full set	smallest timestamp

Practice Problems¶

Problem 1: Implement `lw`¶

Given lw t0, 12(sp) with sp = 0x4000, write the C statements (using rsp->regs[...]) that compute the target address and perform the load. Assume the raw 12-bit immediate field vimm = 12.

Click to reveal solution

int64_t  imm = sign_extend(vimm, 11);          // 12 -> 12 (positive, no change)
uint64_t target_address = rsp->regs[rs1] + imm; // 0x4000 + 12 = 0x400C

rsp->regs[rd] = *((uint32_t *) target_address); // 4-byte load into t0

`rs1` selects `sp` and `rd` selects `t0`. The `(uint32_t *)` cast makes the dereference a 4-byte (word) read. For `ld`, change the cast to `(uint64_t *)`.

Problem 2: Reconstruct an S-type Immediate¶

A store instruction has imm[11:5] = 0b0000010 and imm[4:0] = 0b00100. What is the byte offset it encodes?

Click to reveal solution

vimm = (imm11_5 << 5) | imm4_0
     = (0b0000010 << 5) | 0b00100
     = 0b0000010_00000 | 0b00100
     = 0b0000010_00100
     = 0b000001000100
     = 0x44
     = 68

The sign bit (bit 11) is 0, so sign extension leaves it as `+68`. The store writes to `mem[base + 68]`.

Problem 3: Direct-Mapped Slot and Tag¶

For a 4-slot, 1-word-block direct-mapped cache, which slot does byte address 36 map to, and what is its tag?

Click to reveal solution

addr       = 36
addr_word  = 36 / 4 = 9
slot_index = 9 % 4  = 1            (or (36 >> 2) & 0b11 = 9 & 0b11 = 1)

tag = addr >> 4 = 36 >> 4 = 2      (drop 2 offset bits + 2 index bits)

Byte address 36 maps to **slot 1** with **tag 2**. (Byte 4 → slot 1 tag 0, byte 20 → slot 1 tag 1, byte 36 → slot 1 tag 2, all sharing slot 1.)

Problem 4: Block Base Address¶

A 4-word-block cache misses on byte address addr = 24. Compute the byte address of the first word of its block, both arithmetically and with a mask.

Click to reveal solution

Arithmetic:

addr_word = 24 / 4 = 6
rem       = 6 % 4  = 2
block_start_word = 6 - 2 = 4
block_start_byte = 4 * 4 = 16

Bit manipulation (16-byte block → clear low 4 bits):

24 = 0b...0001 1000
mask = 0xFFFFFFFFFFFFFFF0
24 & mask = 0b...0001 0000 = 16

Both give **byte address 16**. The block holds words at bytes 16, 20, 24, 28.

Problem 5: Set-Associative Lookup¶

A 4-way set-associative cache has 8 sets and 1-word blocks. For byte address addr = 88, compute the set index and tag.

Click to reveal solution

With 8 sets, the set index is 3 bits; byte offset is 2 bits.

set_index = (addr >> 2) & 0b111   # word address, low 3 bits
          = (88 >> 2) & 0b111
          = 22 & 0b111
          = 6

tag = addr >> 5                   # drop 2 offset + 3 set-index bits
    = 88 >> 5
    = 2

(Note: layouts that count the set-index field directly above the byte offset use `addr >> 3` for the set index when offset+index occupy the low bits as `set_index[4:2]`; the guide's pseudocode uses `(addr >> 3) & 0b111`. Either is acceptable as long as you are consistent about which bits are offset vs. index — here, treating the cache as word-granular, `(addr >> 2) & 0b111` selects among 8 sets.) The address maps to **set 6**; compare its tag against all 4 ways in set 6.

Problem 6: Why Separate I-cache and D-cache?¶

Project 04 simulates only the instruction cache. Give two reasons the instruction cache is simpler to simulate than a data cache.

Click to reveal solution

1. **Read-only.** Instruction memory is not modified while the program runs, so there are no stores — no write-through/write-back policy and no dirty-block tracking. 2. **Word-aligned, fixed width.** Every instruction is a 32-bit word at a multiple-of-4 address, so the byte offset is always zero. A data cache must handle byte, halfword, word, and doubleword accesses at arbitrary alignments. A bonus reason: splitting I- and D-caches lets the processor fetch an instruction and access data in the same cycle without contention.

Summary¶

Loads and stores form a target address as base register + sign-extended immediate, then access memory through a real pointer dereference — never by indexing the emulated stack array directly.
I-type loads carry a single contiguous 12-bit immediate in bits [31:20]; S-type stores split the immediate into imm[11:5] and imm[4:0], which must be reassembled before use.
The pointer cast selects the access width: (uint32_t *) for lw/sw, (uint64_t *) for ld/sd. Load and store differ only in direction and width; the rest of the logic is shared.
Sign extension, two's complement, and masking are the bit-level tools behind both immediates and cache indexing — and they are core midterm material.
Caches exploit locality: temporal (reuse soon) and spatial (neighbors soon). An address splits into tag, index, and offset; converting byte → word → block is the key arithmetic.
Direct-mapped caches map each address to one slot via addr_word % #slots (equivalently a mask), store a tag and valid bit, and miss when the tag mismatches.
Block size brings in several words per fill to capture spatial locality; on a miss you compute the block-aligned base by clearing the low block-index bits and load the whole block.
Set-associative caches map an address to a set, then search all ways in that set, using LRU to choose evictions — the practical compromise between direct-mapped and fully-associative, and Project 04's most advanced cache. Project 04 simulates the read-only, word-aligned instruction cache, which is the simplest case.

Project 04 and Midterm Review¶

Overview¶

Learning Objectives¶

Prerequisites¶

1. Midterm Logistics and Study Strategy¶

2. Loads and Stores in the Emulator¶

3. The I-type Format and Load Word (lw)¶

4. The S-type Format and Store Word (sw)¶

5. Sign Extension, Two's Complement, and Masking¶

6. Cache Memory: Motivation and the Address Layers¶

7. Direct-Mapped Cache¶

8. Instruction Cache vs. Data Cache¶

9. Cache Block Size and Spatial Locality¶

10. Finding the Block Base on a Miss¶

11. Set-Associative Cache¶

Key Concepts¶

Practice Problems¶

Problem 1: Implement lw¶

Problem 2: Reconstruct an S-type Immediate¶

Problem 3: Direct-Mapped Slot and Tag¶

Problem 4: Block Base Address¶

Problem 5: Set-Associative Lookup¶

Problem 6: Why Separate I-cache and D-cache?¶

Further Reading¶

Summary¶

3. The I-type Format and Load Word (`lw`)¶

4. The S-type Format and Store Word (`sw`)¶

Problem 1: Implement `lw`¶