Abstraction and Tradeoffs in Computer Architecture: A Lab Guide for CDA 4205L
Explore how abstraction layers in computer architecture impact power, performance, and area. This tutorial covers ISA, microarchitecture, and tradeoffs using multiplication examples, with timely analogies to AI accelerators and gaming consoles.
Understanding Abstraction in Computer Architecture
In computer architecture, abstraction allows us to hide implementation details while providing a clean interface. Just as a mobile app developer doesn't need to know the transistor-level design of a smartphone's CPU, a programmer working with an ISA can write code without worrying about the underlying microarchitecture. This lab focuses on the layers between the operating system and gate-level circuits: the ISA, microarchitecture, and register transfer level.
The ISA: The Contract Between Hardware and Software
The Instruction Set Architecture defines what operations a processor can execute. For example, a 'multiply' instruction tells the hardware to multiply two numbers, but it doesn't specify how the multiplication is carried out. This abstraction enables software compatibility across different microarchitectures. In this lab, you'll work with an ISA that includes a multiply operation, but you'll implement it in two different ways.
Two Microarchitectures: Dedicated Multiplier vs. Repeated Addition
uArch_1 uses a dedicated hardware multiplier, which is fast but larger and more power-hungry. uArch_2 performs multiplication via repeated addition (e.g., computing 4*5 as 4 + 4 + 4 + 4 + 4), which takes multiple cycles but uses less area. This classic tradeoff mirrors real-world decisions: modern GPUs include dedicated tensor cores for AI workloads, while low-power IoT devices might use software multiplication to save silicon area.
Implementing the Multiply Function in RARS
You'll write assembly code for both microarchitectures using the RARS simulator. The goal is to compute 550 * 21 = 11550. For uArch_1, use the mul instruction. For uArch_2, implement a loop that adds the multiplicand repeatedly. Below is a skeleton for uArch_1:
_multiply:
    # a0 = multiplicand, a1 = multiplier
    mul  a0, a0, a1     # a0 = a0 * a1 in a single instruction
    ret
For uArch_2, you'll need a loop. Remember that repeated addition requires a1 iterations:
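_multiply:
    # a0 = multiplicand, a1 = multiplier (assumed non-negative)
    li   t0, 0          # product = 0
    li   t1, 0          # counter = 0
loop:
    beq  t1, a1, done   # exit once counter == multiplier
    add  t0, t0, a0     # product += multiplicand
    addi t1, t1, 1      # counter++
    j    loop
done:
    mv   a0, t0         # return the product in a0
    ret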
Analyzing Performance and Energy
After running both programs, use the Instruction Statistics tool to count total instructions and their types. For uArch_1, the multiply itself is a single mul instruction. For uArch_2, the loop executes many add, addi, branch, and jump instructions. This directly impacts energy: every executed instruction consumes energy, so more instructions mean more total energy, even if each individual instruction is cheap.
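As a cross-check on the Instruction Statistics output, the uArch_2 loop shown earlier can be modeled in Python. This is a minimal sketch, not part of the lab handout. The function name count_uarch2_instructions is hypothetical, and the model assumes a non-negative multiplier and that each li and mv counts as one instruction. It counts only the instructions inside _multiply, so the totals RARS reports for a whole program (caller, printing, exit) will be higher.

```python
from collections import Counter


def count_uarch2_instructions(multiplicand: int, multiplier: int):
    """Mirror the uArch_2 loop and tally executed instructions by mnemonic."""
    if multiplier < 0:
        # The assembly loop counts up from 0, so it only makes sense
        # for non-negative multipliers.
        raise ValueError("multiplier must be non-negative")

    counts = Counter()
    product = 0                        # li t0, 0
    counter = 0                        # li t1, 0
    counts["li"] += 2

    while True:
        counts["beq"] += 1             # beq t1, a1, done
        if counter == multiplier:
            break
        product += multiplicand        # add t0, t0, a0
        counts["add"] += 1
        counter += 1                   # addi t1, t1, 1
        counts["addi"] += 1
        counts["j"] += 1               # j loop

    counts["mv"] += 1                  # mv a0, t0
    counts["ret"] += 1                 # ret
    return product, counts


product, counts = count_uarch2_instructions(550, 21)
print(product)                 # 11550
print(dict(counts))
print(sum(counts.values()))    # 89
```

For a multiplier of 21, the model executes 21 add, 21 addi, 21 j, and 22 beq instructions (the last beq is the one that exits the loop), which shows how the instruction count grows linearly with the multiplier.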
Area Overhead Calculation
Given uArch_1 area = 8 μm² and uArch_2 area = 7 μm², the area overhead for the multiplier is 8 - 7 = 1 μm². Relative to uArch_2, the percent overhead = (1/7)*100 ≈ 14.3%. This is a small price for the speed gain in many applications.
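The same arithmetic can be written as a small Python helper. This is only a sketch: the function name percent_overhead and the constant names are illustrative, and the overhead is taken relative to the smaller design (uArch_2), as in the calculation above.

```python
def percent_overhead(area_with: float, area_without: float) -> float:
    """Extra area of the larger design as a percentage of the smaller one."""
    return (area_with - area_without) / area_without * 100.0


UARCH1_AREA_UM2 = 8.0   # with dedicated multiplier (from the text)
UARCH2_AREA_UM2 = 7.0   # without dedicated multiplier (from the text)

overhead = percent_overhead(UARCH1_AREA_UM2, UARCH2_AREA_UM2)
print(f"{overhead:.1f}%")   # 14.3%
```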
Energy Tradeoff: When is uArch_1 Better?
Assume each multiply in uArch_1 consumes 500 pJ, while every other ALU instruction (add, addi) consumes 4 pJ. Branches and jumps are not ALU instructions, so they are left out of this estimate. In uArch_2, the product starts at 0 and the loop body runs once per unit of the multiplier, so computing 550 * 21 takes 21 additions, not 20. Counting only those accumulating adds (and, for simplicity, ignoring the addi that updates the counter and the setup instructions), uArch_2 spends 21 * 4 pJ = 84 pJ, while uArch_1 spends 500 pJ on its single mul. uArch_2 is therefore the more energy-efficient choice for this small multiplier. Its energy grows linearly with the multiplier, however, while the dedicated multiplier's energy is constant. The breakeven point is where 4 pJ * multiplier = 500 pJ, i.e., multiplier = 125: below 125 uArch_2 uses less energy, at 125 the two are equal, and above 125 uArch_1 is better. Including the ignored instructions would raise uArch_2's energy and move the breakeven point lower.
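The energy comparison can be sketched in Python under the same simplified model: one 4 pJ add per loop iteration for uArch_2 and a single 500 pJ mul for uArch_1. The function and constant names are illustrative, not part of the lab handout.

```python
MUL_ENERGY_PJ = 500.0   # energy of one mul on uArch_1 (from the text)
ADD_ENERGY_PJ = 4.0     # energy of one add on uArch_2 (from the text)


def uarch2_energy_pj(multiplier: int) -> float:
    """Energy of repeated addition, counting one add per loop iteration."""
    return multiplier * ADD_ENERGY_PJ


def breakeven_multiplier() -> float:
    """Multiplier at which both designs spend the same energy."""
    return MUL_ENERGY_PJ / ADD_ENERGY_PJ


print(uarch2_energy_pj(21))     # 84.0
print(breakeven_multiplier())   # 125.0

# Compare the two designs below, at, and above the breakeven point.
for m in (21, 125, 200):
    e2 = uarch2_energy_pj(m)
    if e2 < MUL_ENERGY_PJ:
        winner = "uArch_2"
    elif e2 > MUL_ENERGY_PJ:
        winner = "uArch_1"
    else:
        winner = "tie"
    print(m, e2, winner)
```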
Energy-Delay Product (EDP)
EDP = total energy * total delay, with a cycle time of 500 ps. For uArch_1, assume the multiply completes in 1 cycle, so delay = 1 * 500 ps = 500 ps. With energy = 500 pJ, EDP = 500 pJ * 500 ps = 250,000 pJ·ps. For uArch_2, the loop runs 21 iterations. Each iteration contains several instructions (add, addi, branch, jump), but for simplicity assume one cycle per iteration, giving 21 cycles and a delay of 21 * 500 ps = 10,500 ps. With energy = 84 pJ, EDP = 84 pJ * 10,500 ps = 882,000 pJ·ps. uArch_1 has the lower EDP, roughly 3.5 times lower, so although it uses more energy for this multiply, it offers the better balance of energy and performance. This is one reason modern processors include dedicated multipliers despite the area cost.
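The EDP figures can be reproduced with a short Python sketch. It assumes the same simplified model as the text: a 500 ps cycle, a 1-cycle 500 pJ multiply on uArch_1, and one cycle plus one 4 pJ add per loop iteration on uArch_2. The function and constant names are illustrative.

```python
CYCLE_TIME_PS = 500.0   # clock period used in the text
MUL_ENERGY_PJ = 500.0   # energy of one mul on uArch_1
ADD_ENERGY_PJ = 4.0     # energy of one add on uArch_2


def edp_uarch1() -> float:
    """EDP of a single-cycle hardware multiply, in pJ*ps."""
    delay_ps = 1 * CYCLE_TIME_PS
    return MUL_ENERGY_PJ * delay_ps


def edp_uarch2(multiplier: int) -> float:
    """EDP of repeated addition: one add and one cycle per iteration."""
    energy_pj = multiplier * ADD_ENERGY_PJ
    delay_ps = multiplier * CYCLE_TIME_PS
    return energy_pj * delay_ps


print(edp_uarch1())     # 250000.0
print(edp_uarch2(21))   # 882000.0

# Largest multiplier for which repeated addition still has the lower EDP.
largest = max(m for m in range(1, 1000) if edp_uarch2(m) < edp_uarch1())
print(largest)          # 11
```

Because both energy and delay grow with the multiplier, uArch_2's EDP grows quadratically in this model. It stays below uArch_1's EDP only for multipliers up to 11, far below the energy-only breakeven of 125.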
Real-World Connections: AI and Gaming
These tradeoffs are central to today's tech. For example, Apple's M1 chip includes a dedicated Neural Engine for AI tasks, similar to uArch_1's multiplier, while a low-power microcontroller might implement AI inference in software. In gaming, consoles like the PS5 use custom silicon with dedicated ray tracing hardware (the uArch_1 approach), whereas older GPUs fall back on software ray tracing (the uArch_2 approach). The choice depends on the workload: for real-time rendering, performance is critical, so dedicated hardware wins. For battery-powered devices, area and energy may dominate.
Lab Deliverables
Complete T1-T4 during lab: take screenshots of console output and instruction statistics for both uArch_1 and uArch_2. For the final report, answer T5-T7 with calculations and explanations. Submit your assembly files and report as a .zip file.