# Lab: Processor ALU

## CS 315 Computer Architecture

---

## What We Are Building

<div class="mermaid">
flowchart LR
    PC["PC Register"] --> IM["Instruction Memory"]
    IM --> ID["Instruction Decode"]
    ID --> RF["Register File"]
    RF -->|"RD0"| ALU["ALU"]
    RF -->|"RD1"| MUX["ALUSrcB MUX"]
    IMM["Immediate"] --> MUX
    MUX --> ALU
    ALU -->|"R"| RF
    style ALU fill:#f9f,stroke:#333,stroke-width:2px
    style RF fill:#bbf,stroke:#333,stroke-width:2px
</div>

**Today's focus**: Register File + ALU

---

## Learning Objectives

- Map a RISC-V instruction onto Register File signals (`WR`, `RR0`, `RR1`)
- Distinguish **combinational** vs **sequential** logic
- Build a **decoder with enable** for the write path
- Design the ALU as parallel functional units + output MUX
- Implement subtraction as `A + (~B) + 1`
- Handle 64-bit vs 32-bit widths with splitters
- Build an N-bit equality comparator from XNOR gates

---

## Combinational vs Sequential

| Property | Combinational | Sequential |
|----------|---------------|------------|
| Holds state? | No | Yes |
| Needs `CLK`? | No | Yes |
| Output depends on | Current inputs only | Inputs **and** state |
| Examples | ALU, decoder, MUX | PC, register file |

<div class="highlight-box">
The <strong>ALU has no CLK</strong>: it is pure combinational logic. Apply A, B, and ALUOp, and R settles after the gate delays.
</div>

---

## From Instruction to Register Signals

```text
        rd    rs1   rs2
 add    t0,   t1,   t2    ->   R = RD0 + RD1
         |     |     |
         v     v     v
        WR    RR0   RR1
```

For `add t0, t1, t2` the processor:
1. **Reads** `t1` and `t2` via `RR0`/`RR1` → `RD0`/`RD1`
2. **Computes** `RD0 + RD1` in the ALU
3. **Writes** result into `t0` via `WR`/`WD` when `WE=1`
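The mapping above is fixed by the RISC-V instruction format. In every R-type instruction, `rd` occupies bits 11:7, `rs1` bits 19:15, and `rs2` bits 24:20. The following minimal C sketch extracts those fields from a 32-bit instruction word. The `rf_signals` struct and `decode_rtype` function are illustrative names, not part of the lab files.

```c
#include <stdint.h>

/* Register-file signals taken from a 32-bit RISC-V R-type instruction.
 * Field positions follow the base encoding:
 *   rd = bits 11:7, rs1 = bits 19:15, rs2 = bits 24:20 */
typedef struct {
    uint8_t wr;   /* rd  -> WR  (destination register) */
    uint8_t rr0;  /* rs1 -> RR0 (read port 0)          */
    uint8_t rr1;  /* rs2 -> RR1 (read port 1)          */
} rf_signals;

static rf_signals decode_rtype(uint32_t insn) {
    rf_signals s;
    s.wr  = (uint8_t)((insn >> 7)  & 0x1Fu);  /* each field is 5 bits wide */
    s.rr0 = (uint8_t)((insn >> 15) & 0x1Fu);
    s.rr1 = (uint8_t)((insn >> 20) & 0x1Fu);
    return s;
}
```

For example, `add t0, t1, t2` encodes as `0x007302B3`. `decode_rtype(0x007302B3)` yields `wr = 5`, `rr0 = 6`, and `rr1 = 7`, which are x5, x6, and x7 (t0, t1, t2).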

---

## Register File Interface

| Signal | Width | Meaning |
|--------|-------|---------|
| `RR0`, `RR1` | 5 bits | Select registers to read |
| `RD0`, `RD1` | 64 bits | Read data outputs |
| `WR` | 5 bits | Destination register |
| `WE` | 1 bit | Write enable |
| `WD` | 64 bits | Data to write |
| `CLK` / `CLR` | 1 bit | Clock / clear |

<div class="info-box">
5-bit selectors because 2<sup>5</sup> = 32 registers. Two read ports and one write port let an R-type instruction read both sources and write its result in a single cycle.
</div>

---

## Write Path: Decoder with Enable

A plain 5-to-32 decoder drives exactly one of its 32 outputs high, and it does so on *every* cycle, even for instructions that write no register. We must gate each output line with `WE`:

```text
 WR (5) ──►┌──────────┐ 0 ──►[ AND ]──► r0 (unused: x0 has no register)
           │  5-to-32 │ 1 ──►[ AND ]──► r1
           │  decoder │ . . .
           │          │31 ──►[ AND ]──► r31
           └──────────┘          ▲
 WE ───────────────────────────────┘
```

Result: **gated one-hot** — at most one line high, only when `WE=1`
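The gating can be checked in software. The minimal sketch below assumes the 32 enable lines are packed into one `uint32_t`, with bit *i* driving register x*i*. `decoder_with_enable` is a hypothetical name. Each loop iteration corresponds to one decoder output and its AND gate. Bit 0 still comes out of the decoder, but the circuit leaves it unconnected.

```c
#include <stdint.h>

/* 5-to-32 decoder with enable. Bit i of the result is the enable line
 * for register xi. Each loop iteration models one decoder output
 * followed by its AND gate with WE. */
static uint32_t decoder_with_enable(uint8_t wr, int we) {
    uint32_t lines = 0;
    for (unsigned i = 0; i < 32; i++) {
        unsigned decoded = ((wr & 0x1Fu) == i);  /* plain decoder output i */
        unsigned gated   = decoded && we;        /* AND with WE            */
        lines |= (uint32_t)gated << i;
    }
    return lines;
}
```

`decoder_with_enable(5, 1)` returns `0x00000020` (only bit 5 set). `decoder_with_enable(5, 0)` returns 0: with `WE=0`, no register is enabled.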

---

## Decoder with Enable (Diagram)

*[Diagram: 5-to-32 decoder with each output ANDed with `WE`]*

---

## Common Write-Path Mistakes

<div class="highlight-box">
<strong>Forgetting the WE gate</strong>: one register is overwritten on every clock cycle, even for branches and stores, which write no register.
</div>

- **No WE gate** → selected register updates on every edge
- **Writing to x0** → x0 must always be zero; never enable its register
- **Wrong selector width** → 4 bits only addresses 16 registers; need 5

---

## Read Path: MUX Selection

To produce `RD0`, use a 32-input, 64-bit-wide MUX with `RR0` as the select. A second, identical MUX selected by `RR1` produces `RD1`.

<div class="mermaid">
flowchart LR
    Z["Constant 0 (64-bit)"] --> M["32-input 64-bit MUX"]
    X1["x1 register"] --> M
    X2["x2 register"] --> M
    X31["x31 register"] --> M
    RR0["RR0 (5 bits)"] -->|"select"| M
    M --> RD0["RD0 (64 bits)"]
</div>

**x0 is hardwired to constant 0** — tied to ground, not a register
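In C, a read port is simply an indexed selection. The sketch below assumes the registers are stored in an array whose slot 0 is ignored. It builds the 32 MUX inputs explicitly, so input 0 is the constant 0 rather than a stored value. `read_port` is an illustrative name.

```c
#include <stdint.h>

/* One read port modeled as a 32-input, 64-bit MUX.
 * Input 0 is the constant 0 (x0); inputs 1..31 come from the registers.
 * Whatever is stored in regs[0] is never routed to the MUX. */
static uint64_t read_port(const uint64_t regs[32], uint8_t rr) {
    uint64_t inputs[32];
    inputs[0] = 0;                 /* tied to ground, not a register */
    for (int i = 1; i < 32; i++)
        inputs[i] = regs[i];
    return inputs[rr & 0x1Fu];     /* RR is the 5-bit select */
}
```

Even if something were stored in `regs[0]`, `read_port(regs, 0)` would still return 0. This is exactly why input 0 is wired to a constant.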

---

## Hardwired Constants in Verilog

```verilog
// x0 hardwired to zero — not a register, just ground
wire [63:0] x0 = 64'd0;

// A 4-bit constant 5 is fixed at construction:
wire [3:0]  five = 4'b0101;
```

Constants are **not programmable** — fixed when the circuit is built. That is why `x0` can never be anything but zero.

---

## ALU Interface

```text
            ALUOp (3)
              |
        +-----+-----+
  A --->|           |
 (64)   |    ALU    |---> R (64)
  B --->|           |
 (64)   +-----------+
```

| `ALUOp` | Operation | RISC-V instructions |
|---------|-----------|---------------------|
| `000` | `A + B` (add) | `add`, `addi`, address calc |
| `001` | `A - B` (sub) | `sub`, comparisons |
| `010` | `A * B` (mul) | `mul` |
| `011` | `A << B` (sll) | `sll`, `slli` |
| `100` | `A >> B` (srl) | `srl`, `srli` |

---

## ALU Internal Structure

<div class="mermaid">
flowchart LR
    A["A (64)"] --> ADD & SUB & MUL & SLL & SRL
    B["B (64)"] --> ADD & SUB & MUL & SLL & SRL
    ADD["ADD"] --> MUX["Result MUX"]
    SUB["SUB"] --> MUX
    MUL["MUL"] --> MUX
    SLL["SLL"] --> MUX
    SRL["SRL"] --> MUX
    OP["ALUOp (3)"] -->|"select"| MUX
    MUX --> R["R (64)"]
</div>

All units compute in parallel; MUX selects the result we want.
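A C `switch` evaluates only the chosen case, so it hides the parallelism. The sketch below is closer to the hardware: it computes every unit's result first, then uses `ALUOp` as an index, like the output MUX. The name `alu_parallel` is hypothetical. MUL follows this lab's `A[31:0] * B[31:0]` convention, and the unused codes `101` to `111` are assumed to output 0.

```c
#include <stdint.h>

/* Every functional unit produces a result; ALUOp then picks one,
 * mirroring the Result MUX. */
static uint64_t alu_parallel(uint64_t a, uint64_t b, uint8_t aluop) {
    uint64_t units[8] = {0};                          /* MUX inputs 5..7 stay 0 */
    units[0] = a + b;                                 /* ADD */
    units[1] = a - b;                                 /* SUB */
    units[2] = (uint64_t)(uint32_t)a * (uint32_t)b;   /* MUL: low 32 bits each */
    units[3] = a << (b & 63);                         /* SLL */
    units[4] = a >> (b & 63);                         /* SRL */
    return units[aluop & 7u];                         /* 3-bit select */
}
```

With `A = 7` and `B = 3`, ALUOp `000` through `100` give 10, 4, 21, 56, and 0.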

---

## ALU in C (the Software Analog)

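```c
#include <stdint.h>

#define ALU_ADD 0  /* 000 */
#define ALU_SUB 1  /* 001 */
#define ALU_MUL 2  /* 010 */
#define ALU_SLL 3  /* 011 */
#define ALU_SRL 4  /* 100 */

/* Software analog of the ALU: ALUOp selects one result. */
uint64_t alu(uint64_t A, uint64_t B, uint8_t ALUOp) {
    switch (ALUOp) {
        case ALU_ADD: return A + B;
        case ALU_SUB: return A - B;
        case ALU_MUL: return (uint64_t)(uint32_t)A * (uint32_t)B;  // A[31:0] * B[31:0]
        case ALU_SLL: return A << (B & 63);  // RV64 uses only the low 6 bits of the shift amount
        case ALU_SRL: return A >> (B & 63);
        default:      return 0;              // unused ALUOp codes
    }
}
```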

The `switch` is the software analog of the output MUX.

---

## Subtraction = A + (~B) + 1

No separate subtractor needed — reuse the adder via two's complement:

```text
  A - B  =  A + (~B) + 1
             invert B    carry-in = 1
```

<div class="highlight-box">
Classic bug: forgetting the +1 carry-in gives A + ~B = A - B - 1 (off by one).
</div>

In Digital: simply use the built-in **Sub** component, which handles this internally.
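The following bit-by-bit ripple-carry adder in C shows why the carry-in matters. It has an explicit carry-in, and subtraction is built on top of it. This is a sketch of the two's-complement idea, not the internals of Digital's **Sub** component. The function names are illustrative.

```c
#include <stdint.h>

/* 64-bit ripple-carry adder: one full adder per bit, carry-in at bit 0. */
static uint64_t ripple_add(uint64_t a, uint64_t b, unsigned cin) {
    uint64_t sum = 0;
    unsigned carry = cin & 1u;
    for (int i = 0; i < 64; i++) {
        unsigned ai = (unsigned)(a >> i) & 1u;
        unsigned bi = (unsigned)(b >> i) & 1u;
        sum |= (uint64_t)(ai ^ bi ^ carry) << i;          /* sum bit   */
        carry = (ai & bi) | (ai & carry) | (bi & carry);  /* carry out */
    }
    return sum;  /* final carry-out is dropped, as in a 64-bit R */
}

/* A - B = A + (~B) + 1: invert B and force carry-in to 1. */
static uint64_t sub_via_add(uint64_t a, uint64_t b) {
    return ripple_add(a, ~b, 1);
}

/* The classic bug: B inverted but carry-in left at 0, giving A - B - 1. */
static uint64_t sub_missing_carry(uint64_t a, uint64_t b) {
    return ripple_add(a, ~b, 0);
}
```

`sub_via_add(7, 3)` gives 4, while `sub_missing_carry(7, 3)` gives 3. This reproduces the off-by-one from the forgotten carry-in.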

---

## Multiplication: The 64-bit Trap

**Wrong approach**: feed full 64-bit operands into a multiplier.

- Two 64-bit operands can produce a product up to **128 bits** wide, which does not fit in the 64-bit `R`
- Building a custom 64x64 multiplier is wasteful and error-prone

**Correct approach**: multiply only the lower 32 bits of each operand.
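Here is a quick check of why 64x64 is a trap while 32x32 is safe. The exact product fits in 64 bits only if `a * b <= UINT64_MAX`. The hypothetical helper below tests that condition without computing the 128-bit value. The largest 32-bit operands give `(2^32 - 1)^2 = 2^64 - 2^33 + 1`, which is below `2^64`, so a 32x32 product never overflows.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the exact product a*b needs more than 64 bits, i.e. the
 * upper half of the full 128-bit result would be nonzero. */
static bool product_overflows_64(uint64_t a, uint64_t b) {
    return a != 0 && b > UINT64_MAX / a;
}
```

`product_overflows_64(1ull << 32, 1ull << 32)` is true because the product is exactly `2^64`. No pair of values below `2^32` ever makes it true.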

---

## Multiplication in C

```c
#include <stdint.h>

// Multiply low 32 bits; 32x32 -> 64 always fits in R
uint64_t mul_low32(uint64_t A, uint64_t B) {
    uint32_t a = (uint32_t)A;   // low 32 bits of A
    uint32_t b = (uint32_t)B;   // low 32 bits of B
    return (uint64_t)a * (uint64_t)b;
}
```

Use **splitters** in Digital to extract `A[31:0]` and `B[31:0]` from the 64-bit wires.

---

## Width Management with Splitters

Mixing 32-bit and 64-bit wires is a very common source of bit-width errors in Digital.

| Situation | Tool |
|-----------|------|
| Extract bit range (64→32) | splitter |
| Combine narrow into wide (32→64) | splitter/merger |
| Component width mismatch | set the component's *Data Bits* attribute |

<div class="info-box">
A 32-bit MUL output cannot drive a 64-bit wire directly without widening — always match widths.
</div>
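In C terms, a splitter is a shift plus a mask, and a merger is a shift plus an OR. The sketch below models both for one 64-bit wire. The function names are illustrative. Widening is shown as zero extension: the upper half is filled with a constant 0.

```c
#include <stdint.h>

/* Splitter: bits [31:0] of a 64-bit wire. */
static uint32_t split_low32(uint64_t w)  { return (uint32_t)w; }

/* Splitter: bits [63:32] of a 64-bit wire. */
static uint32_t split_high32(uint64_t w) { return (uint32_t)(w >> 32); }

/* Merger: hi becomes bits [63:32], lo becomes bits [31:0]. */
static uint64_t merge64(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

/* Widen 32 -> 64 by merging with a constant-0 upper half. */
static uint64_t zero_extend32(uint32_t v) { return merge64(0, v); }
```

Splitting a wire and merging the two halves back in the same order reproduces the original 64-bit value.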

---

## Equality Comparator: XNOR

Two values are equal when **every** bit position matches. XNOR tests bit equality:

| a | b | XOR (a≠b) | XNOR (a=b) |
|---|---|-----------|------------|
| 0 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |

`XNOR(a, b) = 1` iff `a == b`
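In C, XNOR is XOR followed by NOT. Because `~` flips every bit of an `unsigned`, the result must be masked back to bit 0. The sketch below reproduces the truth table; the names are illustrative.

```c
/* One-bit XOR and XNOR; inputs are expected to be 0 or 1. */
static unsigned xor1(unsigned a, unsigned b)  { return (a ^ b) & 1u; }
static unsigned xnor1(unsigned a, unsigned b) { return ~(a ^ b) & 1u; }  /* NOT XOR, bit 0 only */
```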

---

## 4-bit Equality Comparator

<div class="mermaid">
flowchart LR
    A["A (4)"] --> SA["splitter"]
    B["B (4)"] --> SB["splitter"]
    SA --> X0["XNOR bit0"]
    SB --> X0
    SA --> X1["XNOR bit1"]
    SB --> X1
    SA --> X2["XNOR bit2"]
    SB --> X2
    SA --> X3["XNOR bit3"]
    SB --> X3
    X0 & X1 & X2 & X3 --> AND["AND (4-input)"]
    AND --> EQ["eq"]
</div>

```text
eq = XNOR(a0,b0) AND XNOR(a1,b1) AND XNOR(a2,b2) AND XNOR(a3,b3)
```
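The same circuit can be written gate by gate in C, with shifts standing in for the splitters. `eq4` is a hypothetical name, and it assumes only the low 4 bits of each argument are meaningful.

```c
#include <stdint.h>

/* One-bit XNOR: 1 when the two bits match. */
static unsigned xnor1(unsigned a, unsigned b) { return ~(a ^ b) & 1u; }

/* 4-bit equality comparator: shifts act as the splitters,
 * four XNOR gates compare bit pairs, one 4-input AND combines them. */
static unsigned eq4(uint8_t a, uint8_t b) {
    unsigned x0 = xnor1((a >> 0) & 1u, (b >> 0) & 1u);
    unsigned x1 = xnor1((a >> 1) & 1u, (b >> 1) & 1u);
    unsigned x2 = xnor1((a >> 2) & 1u, (b >> 2) & 1u);
    unsigned x3 = xnor1((a >> 3) & 1u, (b >> 3) & 1u);
    return x0 & x1 & x2 & x3;  /* eq */
}
```

An exhaustive check over all 256 input pairs confirms that `eq4(a, b)` is 1 exactly when `a == b`.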

---

## Comparator Scales to N Bits

An N-bit comparator is N XNOR gates feeding one N-input AND.

```text
eq = AND of XNOR(aᵢ, bᵢ) for all i in 0..N-1
```

<div class="info-box">
You will reuse this comparator (and sub-based cousins for <, >=) in the Branch Unit for beq, bne, blt, bge.
</div>
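The same structure can be written as a loop for any width from 1 to 64. This is a sketch: `eq_n` is an illustrative name. Bits at position `n` and above are ignored, as if they were not wired to the comparator.

```c
#include <stdint.h>

/* N-bit equality comparator (1 <= n <= 64):
 * XNOR each bit pair, then AND all the XNOR outputs together. */
static unsigned eq_n(uint64_t a, uint64_t b, unsigned n) {
    unsigned eq = 1;                          /* identity for AND */
    for (unsigned i = 0; i < n; i++) {
        unsigned ai = (unsigned)(a >> i) & 1u;
        unsigned bi = (unsigned)(b >> i) & 1u;
        eq &= ~(ai ^ bi) & 1u;                /* XNOR(ai, bi) */
    }
    return eq;
}
```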

---

## Top-Level Circuit: lab09.dig

Inputs exposed: `CLK`, `CLR`, `RR0`, `RR1`, `WR`, `WE`, `ALUSrcB`, `ALUOp`, `Imm`

Outputs: `T0` (= x5), `T1` (= x6)

---

## Running a Program by Hand

No control unit yet — you play the role of control logic. Example: `addi t0, t0, 1`

```text
1. CLR=1, pulse CLK  -> all registers = 0; then set CLR=0
2. RR0  = 5 (t0)    -> RD0 = 0
3. ALUSrcB = 1       -> B comes from Imm
4. Imm = 1
5. ALUOp = 000       -> R = 0 + 1 = 1
6. WR = 5, WE = 1    -> destination is t0
7. Pulse CLK         -> t0 = 1
8. Observe T0 = 1
```
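The same sequence can be replayed against a small C model of the top level, with you acting as the control unit. This sketch makes several assumptions. The `signals` struct and `clock_edge` function are hypothetical names that mirror the lab09 inputs. CLR is modeled as clearing on a clock pulse, as in step 1. The ALU is cut down to ADD and SUB for brevity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the inputs exposed in lab09.dig. */
typedef struct {
    int clr, we, alusrcb;
    uint8_t rr0, rr1, wr, aluop;
    uint64_t imm;
} signals;

static uint64_t regs[32];  /* regs[0] is never written */

/* Read port: x0 reads as constant 0. */
static uint64_t read_reg(uint8_t r) {
    r &= 0x1F;
    return r == 0 ? 0 : regs[r];
}

/* Combinational part: register reads, ALUSrcB MUX, ALU.
 * ALUOp 001 subtracts; any other code falls back to ADD in this sketch. */
static uint64_t alu_result(const signals *s) {
    uint64_t a = read_reg(s->rr0);
    uint64_t b = s->alusrcb ? s->imm : read_reg(s->rr1);
    return s->aluop == 1 ? a - b : a + b;
}

/* Sequential part: one CLK pulse. */
static void clock_edge(const signals *s) {
    if (s->clr) {                      /* CLR wins: every register back to 0 */
        memset(regs, 0, sizeof regs);
        return;
    }
    uint64_t r = alu_result(s);
    uint8_t wr = s->wr & 0x1F;
    if (s->we && wr != 0)              /* write only when WE=1, never to x0 */
        regs[wr] = r;
}
```

After CLR and one `addi` pulse with `RR0 = WR = 5`, `ALUSrcB = 1`, `Imm = 1`, and `ALUOp = 000`, `read_reg(5)` (T0) is 1, matching step 8.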

---

## Autograder Test Programs

| Program | Key Signals | Expected Result |
|---------|-------------|-----------------|
| `addi t0, t0, 1` | `ALUSrcB=1, Imm=1, ALUOp=000` | `T0 = 1` |
| `li t1, 2` | (`addi t1, x0, 2`) `Imm=2` | `T1 = 2` |
| `addi t0, t0, -1` | `Imm=-1, ALUOp=000` | `T0 = 0xFFF...F` |
| `li t0,1; li t1,1; sub t0,t0,t1` | two `li`, then `ALUOp=001, ALUSrcB=0` | `T0 = 0` |

<div class="highlight-box">
<code>li t1, 2</code> is a pseudo-instruction: really <code>addi t1, x0, 2</code> — add immediate to the always-zero x0.
</div>
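`addi t0, t0, -1` produces all ones because of sign extension. In the RISC-V encoding, the I-type immediate is a 12-bit field that is sign-extended to 64 bits before it reaches the ALU. The sketch below models this, assuming the immediate is given as its raw 12-bit field. In the lab, you may instead be typing the already-extended value straight into `Imm`. `sext12` and `addi` are illustrative names.

```c
#include <stdint.h>

/* Sign-extend a 12-bit immediate field: copy bit 11 into bits 63..12. */
static uint64_t sext12(uint32_t imm12) {
    imm12 &= 0xFFFu;
    if (imm12 & 0x800u)                          /* negative: bit 11 set */
        return (uint64_t)imm12 | ~(uint64_t)0xFFF;
    return imm12;
}

/* addi as the lab datapath performs it: ALUSrcB = 1, ALUOp = 000. */
static uint64_t addi(uint64_t rs1_val, uint32_t imm12) {
    return rs1_val + sext12(imm12);
}
```

`0xFFF` is -1 as a 12-bit two's-complement field. Starting from `t0 = 0` after CLR, `addi(0, 0xFFF)` gives `0xFFFFFFFFFFFFFFFF`, the all-ones `T0` expected in the table.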

---

## Key Concept Reference

| Concept | Key Point |
|---------|-----------|
| Combinational vs Sequential | ALU has no CLK; register file does |
| Register file signals | `rd→WR`, `rs1→RR0`, `rs2→RR1` |
| Decoder with enable | One-hot outputs ANDed with `WE` |
| x0 hardwired | Constant 0, never written |
| ALU structure | Parallel units, output MUX by `ALUOp` |
| Subtraction | `A + (~B) + 1`, carry-in = 1 |
| Multiply | Use `A[31:0] * B[31:0]` → 64-bit result |
| Equality | N XNOR gates + N-input AND |

---

## Practice: Map the Instruction

For `sub a2, a0, a1` — identify `WR`, `RR0`, `RR1`, `ALUOp`:

```text
sub  a2,  a0,  a1
      rd   rs1  rs2

WR    = x12  (a2, destination)
RR0   = x10  (a0, rs1 -> RD0)
RR1   = x11  (a1, rs2 -> RD1)
ALUOp = 001  (subtract: R = RD0 - RD1)
ALUSrcB = 0  (B from RD1, not Imm)
WE    = 1
```

---

## Practice: Trace the ALU

Given `A = 7`, `B = 3`:

```text
ALUOp 000 (add): R = 7 + 3 = 10  (0x0A)
ALUOp 001 (sub): R = 7 - 3 = 4   (0x04)
ALUOp 010 (mul): R = 7 * 3 = 21  (0x15)
ALUOp 011 (sll): R = 7 << 3 = 56 (0x38)
ALUOp 100 (srl): R = 7 >> 3 = 0  (0x00)
```

Check `sll`: `7 = 0b111`, shift left 3 = `0b111000 = 56`

Check `srl`: all three set bits shift off the bottom → 0

---

## Summary

1. **ALU is combinational** (no CLK); register file is sequential (has CLK)

2. **R-type signals**: `rd→WR`, `rs1→RR0`, `rs2→RR1`; write when `WE=1`

3. **Decoder with enable**: 5-to-32 decoder, each output ANDed with `WE`

4. **Read path**: 32-input MUX per port; `x0` hardwired to constant 0

5. **ALU**: parallel functional units (ADD/SUB/MUL/SLL/SRL) + output MUX by `ALUOp`

6. **Subtraction**: `A + (~B) + 1`; forgetting `+1` causes off-by-one

7. **Multiply**: use `A[31:0] * B[31:0]` — 64-bit result fits `R`

8. **Equality comparator**: per-bit XNOR, AND all bits — reused in branch unit