Key Concepts from Labs and Projects¶

Lab01 - Hello World in the RISC-V Dev Env¶

Console based editing: micro or vim
git and GitHub usage
Makefiles
Hello World in C
The Autograder

Project01 - RISC-V VM Access and Number Conversion in C¶

Dev Environment Setup
Local RISC-V VM using qemu-system-riscv64
Remote RISC-V VM on euryale
ssh keys, ssh config, ssh authorized_keys
Shell configuration
- bash: ~/.profile, ~/.bash_profile, ~/.bashrc, ~/.bash_aliases
- zsh: ~/.zprofile, ~/.zshrc
C Basics
Functions
Data (global, stack, heap)
Statements and expressions
Data types
- Primitives - int, char, float, uint8_t, uint32_t, int32_t, uint64_t, int64_t
- Composite - arrays and structs
- Pointers - int *, char *, uint32_t *, uint64_t *
Pointer operations
- & address of, e.g., int x; int *p; p = &x;
- * dereference, e.g., x = *p;
Strings in C
- Array of chars (bytes)
- Null \0 (0) terminated
- ASCII character codes
The C stack
- Local variables
- Allocated on function entry, deallocated on function exit (return)
Signed (int, int32_t) vs unsigned (unsigned int, uint32_t) values
Command line arguments with int argc and char *argv
Layout of argv array in memory
Array of pointers first, argv[0], argv[1], argv[2], NULL (0)
String arrays second, ./foo + \0, -p + \0, etc.
Parsing command line arguments
Number systems
Decimal (base 10, digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Binary (base 2, digits: 0, 1, C literal prefix: 0b)
Hexadecimal, "hex" (base 16: digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, C literal prefix: 0x)
- A = 10, B = 11, C = 12, D = 13, E = 14, F = 15
base conversions
C string as decimal, binay, hexadecimal to int
- Place value: 2 * (10^2) + 5 * (10^1) + 4 * (10^0)
int to C string as decimal, binary, hexadecimal
- Modulus to find digit
- Divide to get value for next digit
- Repeat until divide gives 0

Lab02 - Introduction to RISC-V Assembly Programming¶

Instructions, registers, labels, directives
See RISC-V Assembly Guides
Assembling vs compiling
Assembling (as) assembly source (.s) to object code (.o)
Compiling (gcc) c source (.c) to object code (.o)
The object files must be linked, we can use gcc for this
- as -o foo.o foo.s
- gcc -o main main.c foo.o
The C compiler can also assemble
- gcc -o main main.c main.s
Instructions have a name (mnemonic), like add or mul
Assembly programming model
A RISC-V processor has 32 registers plus the PC registers
- x0, x1, ... x31
- ABI names: sp, ra, a0, a1, ..., t0, t1, ..., s0, s1, ...
The PC points to next instruction in memory to execute
Load an instruction from memory into processor
Execute the instruction, usually updates one or more register values, but can also update memory
Update PC, often just the next instruction PC + 4, but can also be a diffenent address in the case of a branch or jump (control instruction)
Most instructions have three operands
One target (destination)
Two source
add t0, t1, t1 means t0 = t1 + t2
On RISC-V 64 bit registers are 64 bits (8 bytes)
Data processing instructions
Immediate values
Control instructions
conditional branchs: beq, bne, blt, bge
jumps: j
return: ret
Assembly arguments passed in a0, a1, a2, a3, etc.
Return value is put into a0

Basic assembly function

.global func_name
func_name:
    add a0, a0, a1
    ret

if/then/else in assembly

C:

int x, y, r;
r = 0;
if (x > y) {
   r = r + 1;
} else {
   r = r + 99;
}

Asm: Assume a0 - int x, a1 - int y, t0 - int r

    li t0, 0
    bgt a0, a1, else
    addi t0, t0, 1
    j endif
else:
    addi t0, t0, 99
endif:

for loops in assembly

C:

int r, i, n;
r = 0;
n = 10
for (i = 0; i < n; i++) {
    r = r + i;
}

Assume t0 - int r, t1 - int i, t2 - int n

    li r, 0
    li n, 10
    li i, 0
loop:
    bge t1, t2, loopend
    addi t0, t0, t1
    addi t1, t1, 1
    j loop
loopend:

Array access in assembly
lw (load word) lw t0, (a0) means t0 = *a0
Load word at address a0 into register t0
Pointer based array access
- To get to next element in array of words (ints): addi a0, a0, 4

Project02 - RISC-V Assembly Language¶

RISC-V ABI Function Calling Conventions
Only one set of registers (32), so we need a way to coordinate usage between functions
Caller-saved registers (a0, a1, ..., a7, t0, t, ... t6, ra)
- These must be preserved on the stack by the caller of a function
- You only need to preserve the registers you will be using after the function call
Callee-saved registers (sp, s0, s1, ... s11)
- These must be preserved on the stack by the callee, on entry to the function.
- They should be restored on exit.
- The stack pointer is preserved by subtracting (stack allocation) on function entry and adding (stack deallocation) on function exit the name number of bytes.
- The sp should be a multiple of 16 (for performance compatibility with some instructions)
Functions that don't call other functions do not need to allocate stack space.
Only need stack space if the function uses caller-saved registers

Function call template when using stack:

.global func_name
func_name:
    addi sp, sp, -N    # Allocate N bytes on stack, N must be a multiple of 16
    sd ra, (sp)         # Save ra onto stack
                        # Save any callee-saved registers if needed
    ...
    ld ra, (sp)
    addi sp, sp, +N
    ret

Indexed-based array access

Assume t0 is int i, a0 is base address of array int arr[]

li t1, 4
mul t1, t1, t0
add t1, a0, t1
lw t2, (t1)

Altneratively use a shift instead of multiply

slli t1, t0, 2
add t1, a0, t1
lw t2, (t1)

Recursive functions
Nothing special, just that the func calls itself
Same rules apply
Presere any caller-saved registers on stack before the recursive call
Restore any caller-saved registers from stack after the recursive call
If a recursive function simply returns after the recursive call, then it is called tail recursive and the recursive call can be turned into a jump.

Project03 - RISC-V Assembly Language Part 2¶

Working with C strings in assembly
Use lb (load byte) and sb (store byte) to access characters in a string
Calling C functions from Assembly
Just add a .global func_name for the function you want to use
E.g., .global strlen
Then set arg registers and use call instruction as normal
Make sure to preserve any caller-saved registers before call
Memory
Byte addressable
Smallest addressable unit is a byte (8 bits)
A int (word) is 4 bytes, need to decide how to layout a word in memory

Two byte orderings

Big endian (put the most significant byte first)

Little endian (put the least significant byte first)

int x = 0xFFAA1122
4 |    |   4 |    |
3 | 22 |   3 | FF |
2 | 11 |   2 | AA |
1 | AA |   1 | 11 |
0 | FF |   0 | 22 |
Big        Little
Endian     Endian

Two's Complement (2's Complement)
Need a way to represent negative integer values
To get the two's complement negative representation of a positive value:
- twos = invert_bits(val) + 1

3-bit two's complement values

Two's complement properties
- The msb (most significant bit) is the sign bit (0 is pos, 1 is neg)
- Only one representation of 0
- Because of this, there is one more neg value than pos values
- Simple grade school arithmetic work on two's complement numbers
Bit manipulation and bitwise operators
Zero (0) / One (1)
- Other names: unset/set, low/higg, off/on, false/true
A byte is 8 bits
```
msb | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | lsb
```
- msb - most significant bit
- lsb - least significant bit

A word is 4 bytes or 32 bits

msb | 31      24|23      16|15       8|7        0| lsb
    |   byte 3  |  byte 2  |  byte 1  |  byte 0  |

Bitwise Operaters (op, C, ASM)
- AND , & , and/andi
- OR , | , or/ori
- NOT , ~
- XOR , ^ ,
- Shift Left Logical
- Shift x to the left by n bits
- Fill lower bits with 0
- uint32_t x
- x << n
- sll/slli
- Shift Right Logical
- Shift x to the right by n bits
- Fill upper bits with 0
- uint32_t x
- x >> n
- srl/srli
- Shift Right Arithmetic
- Shift x to the right by n bits
- Fill upper bits with sign bit value
- int32_t x
- x >> n
- sra/srai
Masking
- It is often need to pull out a sequence of bits from a larger value
- To get a sequence of bits:
- Shift the bits to the far right
- Mask the bits to zero out any upper bits we don't want
- Example: assume we want the middle 4 bits of an 8 bit value (bits 2-5)
```
uint8_t x = 0b11011010;
uint32_t v = x >> 2     // Shift the 4 bits to the beginning
uint32_t v = v & 0b1111 // Mask off upper bits remaining in v
```
Combine individual bit sequences with right shift and OR (|)
Sign extension
- Assume you have a 4-bit two's complement number you want to sign extend to be a 32-bit two's complement number.
```
int32_t v = 0b1110 // -2 in 4 bits
v = (v << 28) >> 28;
```
- Shift all the way to the left and then do shift right arithmetic all the way back

Lab03 - RISC-V Machine Code Emulation¶

RISC-V Machine Code and Emulation
Instructions are 32 bits (1 word), there is also a 16 bit compressed format
Each instruction encodes source register(s), a destination registers, and target address or offset
We decode the iw (instruction word) in steps
- Look at opcode to determine instruction format type
- Decode the iw based on opcode
- Further decode iw by looking at other fields such as funct3 and funct7
The base RISC-V emulator includes the emulated state as a struct:
```
struct rv_state {
    uint64_t regs[NREGS];
    uint64_t pc;
    uint8_t stack[STACK_SIZE];
};
```
- NREGS is 32 (because RISC-V has 32 registers)
- The pc is the program counter
- STACK_SIZE is 8192, but for prorgrams with lots of function calls and/or local vars this may need to be bigger.
Basic emulation
- We will emulate assembly programs that have been linked with the emulator code
- Initialize the rv_state with
- A pointer to the function we want to emulate
- The first 4 arguments as uint64_t values to send to the function
- Get iw: iw = *pc
- Get opcode from iw
- Based on opcode call emulation function, like:
- emu_r_type(rsp, iw)
- rsp is struct rv_state *rsp
- The emu functions will further decode the iw
- Use shift and mask (get_bits()) to get each field
- Update the state (registers and memory based on opcode, funct3, and funct7)
- Update the pc, usually pc += 4 to go to the next instruction
  - For control instructions, we will compute either an offset and add to pc
  - Or, for jalr use a register value to update the pc
RISC-V Instruction Formats
R-type |funct7[31:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|rd[11:7]|opcode[6:0]|
- For register instructions like add, sub, sll
I-type |imm_11_0[31:20]|rs1[19:15]|funct3[14:12]|rd[11:7]|opcode[6:0]|
- For immediate instruction like addi, slli, lw, lb
- Immediate must be constructed from parts and sign extended
- int64_t imm = sign_extend(imm_11_0, 11)
B-type |imm_12[31]|imm_10_5[30:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|imm_4_1[11:8]|imm_11[7]|opcode[6:0]|
- For branches, like beq, bne, blt, bge
- int64_t imm = sign_extend(imm_12 << 12 | imm_11 << 11 | imm_10_5 << 5 | imm_4_1 << 1)
- If branch taken pc += imm, if not taken pc += 4
J-type |imm_20[31]|imm_10_1[30:21]|imm_11[20]|imm_19_12[19:12]|rd[11:7]|opcode|
- For 'jal' jump and link, also 'j'
- int64_t imm = sign_extend(imm_20 << 20 | imm_19_12 << 12 | imm_11 < 11 | imm_10_1 << 1)
- pc += imm
RISC-V Pseudo Instructions
The assembler provide many convenience instructions that do not exist as real instructions
Examples
- li a0, 99 is addi a0, zero, 99
- j offset is jal zero, offset
- call offset is jal ra, offset
- bgt r1, r2, offset is blt r2, r1, offset

Project04 - RISC-V Emulation - Analysis - Cache Simulation¶

Additional RISC-V Instruction Formats
S-type |imm_11_5[31:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|imm_4_0[11:7]|opcode[6:0]|
- For store instuctions, like sb, sw, sd
- Immediate must be constructed from parts and sign extended
- int64_t imm = sign_extend(imm_11_5 << 5 | imm_4_0, 11)
U-type |imm_31_12[31:12]|rd[11:7]|opcode[6:0]|
- For lui (load upper immediate) and auipc add upper immediate to pc
- Our code dosn't use these
Note that load instructions are an I-type form
Dynamic Analysis
Because we emulate each instruction we can do the following
- Count the total number of instructions executed
- Count the different types of instructions executed
- I/R count, Loads, Stores, Jumps, Branches Taken, Branches Not Taken
- Counts added to struct rv_state as a new struct struct rv_analysis
- Need to update counts accurately in the emulation functions
Cache Memory and Simulation
A Cache is a small and fast memory that holds recently accessed data from main memory
Main memory is slow relative to CPU speed
A cache can work because program exeuction adheres to two principles of locality
- Principle of spatial locality - If code accesses a value, there is good chance it will access nearby values
- Principle of temporal locality - If code accesses a value now there is a good chance it will do so again soon
We send an address to a cache an request a value back
- Hit - value is in the cache for the requested address
- Miss - value is not in the cache for the requested address
- Hit Rate = #hits / #refs
- Miss Rate = #misses / #refs
- In modern processes caches can achieve very high hit rates, > 95%
Main cache ideas
- For a given memory address, how to locate value in the cache?
- How to replace a value?
- How many bytes do we load at at time into the cache (block size)?
Three main cache designs
- Direct Mapped
- Fully Associative
- Set Associative
Direct Mapped
- Each address maps to one slot in the cache
- On miss, just repace slot value with new value from memory
Fully Associative
- Each address can map to any slot in the cache
- Use a form of LRU (least recently used) to choose which slot to evict on miss
Set Associative
- An address maps to a set of slots, and can map to any slot in the set
- Use LRU within the slots in a set to choose a slot to evict
Block Size
- To support spatial locatlity a cache will read in more than the word requested
- We can think of memory as an array of blocks
- We need to find the block for a give address
- Find the base of the block
- The load the entire block into the cache slot

Lab05 - Introduction to Digital Design¶

Digital Design Fundamentals
Analog circuits vs. Digital circuits
- Analog: continuous voltage signals (0V to 5V)
- Digital: discrete voltage levels (high/1 and low/0)
Digital logic levels
- High: represented as 1 (approximately 5V)
- Low: represented as 0 (approximately 0V)
- Binary state system
Hardware Abstraction Hierarchy
Wires and devices abstracted into gates
Gates combined into larger components
Hierarchical design approach (building complex circuits from simpler components)
Basic Logic Gates
AND gate
- Symbol and circuit representation
- Truth table: output 1 only when all inputs are 1
OR gate
- Symbol and circuit representation
- Truth table: output 1 when any input is 1
NOT gate (inverter)
- Symbol and circuit representation
- Truth table: inverts the input
XOR (Exclusive OR)
XNOR (Exclusive NOR)
Boolean Algebra and Logic Notation
C code notation: r = a & b (AND), r = a | b (OR), r = ~a (NOT)
Boolean algebra notation: r = a · b (AND), r = a + b (OR), r = ā (NOT)
Logic notation: r = a ∧ b (AND), r = a ∨ b (OR), r = ¬a (NOT)
Circuit Composition
Building complex expressions from basic gates
Example: r = (a · b) + (c · d) - combining AND and OR operations
Circuits are directed acyclic graphs (DAGs) - no feedback loops in combinational logic

Lab08 - Combinational Logic¶

Sum-of-Products Method
Systematic approach to designing combinational logic circuits
Three-step process:
1. Create truth table
2. Identify rows with output '1'
3. Build equation from those rows
Product terms: AND operations combining inputs (with inversions as needed)
Sum: OR operation combining all product terms
Example: Even/odd parity detector
- Input: 3-bit number (n₂, n₁, n₀)
- Output: Determines if number of '1' bits is even or odd
Arithmetic Circuits - Adders
1-bit Half Adder
- Two inputs (a, b)
- Two outputs: sum and carry out (cout)
- Basic building block for addition
1-bit Full Adder
- Three inputs: a, b, and carry in (cin)
- Two outputs: sum and carry out (cout)
- Used for multi-bit addition
8-bit Ripple-Carry Adder
- Combines eight 1-bit full adders
- Carry propagates through the chain
- Cin initially set to 0
- Produces 8-bit sum output plus final cout
Subtraction using Adders
A - B = A + (-B)
Using two's complement: invert B then add 1
Invert B and set initial carry-in to 1

Lab09 - Digital Logic Components¶

Comparators
Equality comparator: checks if a == b
- Equation: eq = (ā·b̄) + (a·b)
1-bit, 2-bit, 8-bit, and N-bit comparators
Built hierarchically using smaller comparators
Multiplexers (MUX)
1-bit 2-input MUX: selects between two 1-bit inputs based on select signal
32-bit 2-input MUX: selects between two 32-bit inputs
1-bit 4-input MUX: selects one of four inputs using 2 select signals (S₁, S₀)
- Select encoding: 00→a, 01→b, 10→c, 11→d
Implementation options:
- Direct implementation using AND-OR logic
- Tree-based implementation using smaller MUXes
ALU (Arithmetic Logic Unit)
Combines multiple operations (addition, subtraction, multiplication, etc.)
Uses MUXes to select between different operation results
Inputs: A and B (operands), ALU-op (select signal for operation)
Outputs: Result of selected operation
Decoders
2-to-4 Decoder
- Takes 2 input bits (S1, S0)
- Produces 4 output lines (r0, r1, r2, r3)
- Only one output is active (high) at a time
- Truth table relationships:
- r0 = (S̄1·S̄0), r1 = (S̄1·S0), r2 = (S1·S̄0), r3 = (S1·S0)
Encoders
Basic Encoder: Converts one-hot input to binary output
Priority Encoder: Handles cases where multiple inputs may be active
- Assigns priority to inputs (higher numbered inputs have priority)
- Outputs binary encoding of highest priority active input

Lab09 - Sequential Logic¶

Sequential Logic Fundamentals
Circuits that maintain state over time
Combine state storage with combinational logic
Clock Signals
Clock period: time duration of one complete cycle
Frequency measured in Hz (cycles per second)
Rising edge: transition from low (0) to high (1)
Falling edge: transition from high (1) to low (0)
Storage Elements Hierarchy
SR Latch → D Latch → D Flip-Flop → 1-bit Register → N-bit Register → Counter
SR Latch (Set/Reset Latch)
Stores a 1-bit value
Two inputs: S (Set) and R (Reset)
Two outputs: Q and Q̄ (complement)
Built using cross-coupled NOR gates
Truth table:
- R=0, S=0: Hold current state
- R=0, S=1: Set Q to 1
- R=1, S=0: Reset Q to 0
- R=1, S=1: Undefined/invalid state
D Latch
Level-sensitive storage
Single data input (D) and clock input (CLK)
Passes D to Q when CLK is high (transparent)
D Flip-Flop
Edge-triggered storage
Constructed from two D Latches in master-slave configuration
Captures input only on clock edge (rising or falling)
More reliable than latches for synchronous circuits
N-Bit Register
Multiple 1-bit registers operating in parallel
Shared control signals: CLK, CLR, EN
All bits update synchronously on same clock edge
Counter Design
8-bit register stores current count value
8-bit adder increments the count
Feedback loop: adder output connects to register input
Increments on each clock cycle when enabled

Project05 - Instruction Memory and Decoding¶

ROM (Read-Only Memory)
Memory that can only be read from, not written to
Uses address bits (A) to select memory locations
Outputs instruction words (IW) of 32 bits
Example: 8-bit address can access 2^8 = 256 memory locations
Instruction Memory Architecture
Program Counter (PC) contains 64-bit address pointing to current instruction
PC feeds into instruction memory
Instruction word (IW) output is 32 bits
Address Calculation
Memory addressed by bytes (byte addressing)
Words consist of 4 bytes
Word address = byte address / 4
addr_word = addr_byte >> 2 (divide by 4 using right shift by 2)
Instruction Decoder (InstDecoder)
Analyzes 32-bit instruction word and determines instruction type
Input: 32-bit instruction word (IW)
Output: Instruction number (inum) indicating instruction type
Instruction Type Mapping:
- 0: i-type, 1: r-type, 2: load, 3: s-type
- 4: b-type, 5: jal, 6: j, 7: jalr
Instruction Decoding Process
Bit Splitter: Extracts specific bit ranges from 32-bit instruction
Opcode Recognition using equality comparators
- Example opcodes: 0010011 (i-type), 0110011 (r-type)
Priority encoder for instruction type selection

Lab10 - Processor Components¶

Computer Architecture Layers
Instruction Set Architecture (ISA)
- Software-hardware interface
- Specification of machine code, registers, and memory
Microarchitecture
- Processor implementation of the ISA
- Digital logic implementation
Digital Design
- Schematic entry (visual) or HDL (Hardware Description Language)
- Languages: Verilog, VHDL (IEEE standards)
Moore's Law
The number of transistors doubles every 1.5 years for processors
Processor Types
Single-Cycle Processor: Each instruction completes in one clock cycle
Multi-Cycle Processor: Instructions broken into multiple cycles
Pipelined Processor: Multiple instructions execute simultaneously
Single-Cycle Processor Components
Program Counter (PC): 64-bit register, adder for PC+4
Instruction Memory (ROM): Stores program instructions
Instruction Decode: Decodes fetched instructions
Register File: Connected to ALU
Immediate value (Imm): Extracted from instruction
Data Memory (RAM): For load/store operations
Register File Architecture
32 registers, each 64-bit wide
Register names: X0, X1, ..., X31
X0 (zero register) always contains zero
Operations:
- Read up to two register values in a single clock cycle
- Write to a single register in a clock cycle
Interface Signals:
- RR0, RR1: Read register selectors (5 bits each)
- RD0, RD1: Read data outputs (64 bits each)
- WR: Write register selector (5 bits)
- WD: Write data input (64 bits)
- WE: Write enable (control signal)
- CLK: Clock signal, CLR: Clear signal
Decoder with Enable
Select signal controls which output is activated
Enable signal (WE) gates the decoder operation
Converts binary select input to one-hot encoded output
ALU Structure
Takes two 64-bit inputs: A and B
ALU-op control signal (3 bits) selects operation
Produces 64-bit result
Internal functional units: ADD, SUB, MUL
Final multiplexer selects output based on ALU-op

Project06 - Processor Instruction Decoding¶

Register Decoder (RegDecoder)
Takes instruction word (IW) as input
Outputs register addresses: rs1, rs2, rd
Bit fields:
- rd: bits 11-7
- rs1: bits 19-15
- rs2: bits 24-20
Immediate Decoder (ImmDecoder)
Takes instruction word (IW) as input
Outputs all immediate types (I-type, S-type, B-type, U-type, J-type)
S-Type Immediate construction:
- Bits 11-7 for imm[4:0] and bits 31-25 for imm[11:5]
- Sign extension required
Sign Extension Hardware
Converts 12-bit immediate to 64-bit signed value
Uses multiplexer to select between 0 and -1 for upper bits
Sign bit determines fill value for upper bits
Instruction Detection Logic
ADDI: opcode 0010011 (7-bit comparator)
ADD: opcode 0110011, funct3 000, funct7 0000000
AND gates combine comparator outputs to identify specific instructions
Control Signals Generated
RFW (Register File Write): Controls whether to write to register file
ALUOp (ALU Operation): Specifies which ALU operation to perform
ALUSrcB (ALU Source B): Selects second operand for ALU
- 0: use register value, 1: use immediate value
ROM for Control Signals
Input: 3-bit encoded instruction type (inum)
Output: Control signal bundle (ALUSrcB, ALUOp, RFW)
Adding New Instructions (5-Step Process)
Pick an instruction or group of instructions
Modify or add components to processor
Modify or add to the datapath (usually need to extend or add muxes)
Extend instruction decoder control lines
Test the implementation

Project06 - Control Flow Instructions (JAL, JALR)¶

JAL (Jump and Link) Instruction
J-type format
Register Update: rd = PC + 4 (stores return address)
PC Update: PC = PC + imm (PC-relative addressing)
JALR (Jump and Link Register) Instruction
I-type format
Register Update: rd = PC + 4 (stores return address)
PC Update: PC = rs1 + imm (register-based addressing)
Special Case - RET: imm is 0, rd is x0, jumps to address in rs1
PCsel Signal (PC Select)
Multiplexer selects between PC+4 (sequential) and JTA (jump target address)

Project06 - Branch Instructions¶

Branch Mechanics
Syntax: beq rs0, rs1, label
Two-step process:
1. Compute branch target address (BTA = PC + imm-b)
2. Determine if branch should be taken based on comparison
Branch Unit Architecture
Comparison logic: equality (=), inequality (!=), less than (<), greater than or equal (>=)
Outputs: take_branch signal, PCbr (branch target address)
Branch Instructions Supported
beq: Branch if equal
bne: Branch if not equal
blt: Branch if less than
bge: Branch if greater than or equal

Project06 - Data Memory¶

Memory Architecture
RAM for data storage
Memory size calculation: 2^address_bits × bytes_per_word
Example: 7 address bits, 8 bytes per word = 1024 bytes
Load/Store Instructions
ld (load doubleword): loads 64-bit data from memory
sd (store doubleword): stores 64-bit data to memory
lw (load word): loads 32-bit data from memory
sw (store word): stores 32-bit data to memory
lb (load byte), sb (store byte)
Memory Address Calculation
Target address = base register + offset
Example: ld t0, 8(sp) - load from stack pointer + 8 byte offset
RAM Interface Signals
Din (Data In): 64-bit input for write operations
Dout (Data Out): 64-bit output for read operations
str (store signal): control for write enable
ld (load signal): control for read enable
clk (clock): synchronous operation
Control Signals for Memory
WDsel or Mem2Reg: selects memory output to write back to register file
MSZ (Memory Size): controls data size for load/store operations
Preparing Programs for Processor Execution
Remove ".global" directives
Use jal instead of call for function calls
Must include a main function
Initialize SP register (e.g., li sp, 1024)
Use unimp instruction to halt execution

Project07 - Processor Pipelining¶

Pipelining Fundamentals
Multiple instructions execute simultaneously in different stages
Laundry analogy: overlapping wash, dry, fold, put away operations
Theoretical speedup: n-stage pipeline can speed up by factor of n
Five-Stage RISC-V Pipeline
IF (Instruction Fetch): Fetch instruction from memory using PC
DR/DE (Decode and Register File Read): Decode instruction, read registers
EX (Execute): Execute operation using ALU, Branch Unit
MEM (Memory): Memory store/load operations
WB (Write Back): Write results back to Register File
Pipeline Registers
Placed between stages (IF/DE, DE/EX, EX/MEM, MEM/WB)
Hold intermediate values
Synchronize pipeline flow
Pipeline Performance
Total time = (stages × stage_time) + ((tasks - 1) × stage_time)
First item latency + (n-1) additional items at stage intervals

Project07 - Pipeline Hazards¶

Types of Pipeline Hazards
Data Hazards: Instructions depend on results from previous instructions
Control Hazards: Caused by jumps and branches
Data Hazard Solutions
Inverting the Clock to the RegFile
- DR (Data Read) and WB (Write Back) timing optimization
- Eliminates one NOP instruction
- Reduces cycles (e.g., 10 cycles to 9 cycles)
Forwarding (Data Forwarding)
- Forward data directly between pipeline stages
- Bypass register file for data dependencies
- Further cycle reduction (e.g., 10 cycles to 6 cycles)
Load Stalling (LD Stall)
- Specific handling for load-use dependencies
- Pipeline stall insertion when necessary
- Hazard detection unit implementation
- Enable signal control for stalling
Control Hazard Solutions
Branch and jump handling
Managing control flow changes in the pipeline
Hazard Detection Unit
ld-stall detection
PC enable control during stalls
Combinational logic: (ld-stall OR en) = stl-en
Pipeline Stage Tracking
Visualizing concurrent instruction execution
Identifying which stage each instruction occupies per cycle
Counting total cycles for instruction sequences

Advanced Architecture Topics¶

Speculative Execution
Executing instructions before confirming they're needed
Related to branch prediction
Out-of-Order Execution
Executing instructions in non-sequential order for performance
Allows independent instructions to execute while waiting for dependencies
Example: ld t0, (a0) followed by independent add t1, t2, t3
Majority Function Circuit
3-input majority function: output is 1 when majority of inputs are 1
SOP expression: majority = (a·b·c̄) + (a·b̄·c) + (ā·b·c) + (a·b·c)