From Assembly to Machine Code: Building Your ISA Assembler in Python (Lab 4 Part 2)

Learn how to convert preprocessed RISC‑V assembly into machine code using Python in JupyterLab. Step‑by‑step guide for CDA 4205L Lab #4, covering instruction encoding, the rv32im_isa.csv reference, and binary output.

Introduction: Why Assemblers Matter in 2026

In 2026, the growth of custom silicon for AI accelerators, RISC‑V based IoT devices, and edge computing makes the assembler toolchain well worth understanding. In this lab, you will complete the second half of the ISA assembler design by converting preprocessed assembly code into machine code. This mirrors the final stage of every compiler toolchain, which lowers a high‑level language to the bits that control the processor.

What You'll Build: A RISC‑V Assembler in Python

In Part 1, you preprocessed assembly source files. Now, your assembler must read a formatted .txt file and output binary machine code. The core task is to parse each instruction, identify its type (R, I, S, B, U, J), and encode it according to the RISC‑V specification. You will use the provided rv32im_isa.csv file as a lookup table for opcodes, funct3, funct7, and instruction formats.

Understanding the rv32im_isa.csv File

Think of this CSV as the “instruction set dictionary.” Each row defines an instruction mnemonic (e.g., lw, add, beq) along with its opcode, funct3, funct7, and format. Your assembler must read this file and map each mnemonic to its encoding pattern. For example, lw is an I‑type instruction with opcode 0x03, funct3 0x2. Without this mapping, you cannot generate the correct binary.
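A minimal sketch of loading such a table into a dictionary keyed by mnemonic follows. The column names (mnemonic, format, opcode, funct3, funct7), the binary-string cell values, and the load_isa helper are all assumptions made for illustration; check the header row of your actual rv32im_isa.csv and adjust accordingly.

```python
import csv
import io

# Hypothetical excerpt: the real rv32im_isa.csv may use different
# column names or store the fields in hex instead of binary.
SAMPLE_CSV = """mnemonic,format,opcode,funct3,funct7
add,R,0110011,000,0000000
sub,R,0110011,000,0100000
lw,I,0000011,010,
"""

def load_isa(csv_file):
    """Return {mnemonic: {column: value}} from an open CSV file object."""
    table = {}
    for row in csv.DictReader(csv_file):
        mnemonic = (row.get("mnemonic") or "").strip().lower()
        if not mnemonic or mnemonic.startswith("#"):
            continue  # skip blank or comment rows
        table[mnemonic] = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus cells beyond the header
        }
    return table

isa = load_isa(io.StringIO(SAMPLE_CSV))
print(isa["lw"])
```

With a real file, you would pass `open("rv32im_isa.csv", newline="")` instead of the in-memory sample.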

Step‑by‑Step: Converting Assembly to Machine Code

  1. Load the preprocessed assembly file (e.g., example1_out1.txt). Each line contains an instruction in the format: mnemonic rd, rs1, rs2 or mnemonic rd, offset(rs1) etc.
  2. Parse each line to extract the mnemonic and operands. Handle different instruction formats: R‑type (add, sub), I‑type (addi, lw), S‑type (sw), B‑type (beq, bne), U‑type (lui, auipc), J‑type (jal).
  3. Look up the instruction in the CSV to get the opcode, funct3, funct7, and format.
  4. Encode each operand into its bit field: register numbers (5 bits each), immediate values (sign‑extended as needed). For immediate fields, you may need the get_2c_binary function to convert negative numbers to two’s complement binary.
  5. Assemble the 32‑bit instruction by shifting and OR‑ing the fields together. Output the binary as a 32‑bit string or as bytes to a .bin file.
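Steps 1 and 2 above can be sketched as a small parser. This is one possible approach, not the lab's reference solution: parse_line and reg_num are hypothetical helper names, and the code assumes the preprocessed file writes registers as x0 through x31 and memory operands as offset(base).

```python
import re

def parse_line(line):
    """Split 'lw x6, 8(x5)' into ('lw', ['x6', '8', 'x5'])."""
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # blank line
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    # Rewrite offset(base) as "offset, base" so all operands are comma-separated.
    rest = re.sub(r"(-?\w+)\((\w+)\)", r"\1, \2", rest)
    operands = [tok.strip() for tok in rest.split(",") if tok.strip()]
    return mnemonic, operands

def reg_num(name):
    """Map 'x0'..'x31' to the integers 0..31."""
    name = name.strip().lower()
    if not name.startswith("x") or not name[1:].isdigit():
        raise ValueError(f"not a register: {name!r}")
    number = int(name[1:])
    if number > 31:
        raise ValueError(f"register out of range: {name!r}")
    return number

print(parse_line("lw x6, 8(x5)"))
print(parse_line("add x1, x2, x3"))
```

Normalizing the offset(base) form first means the rest of the assembler only ever sees a flat operand list, whose meaning is then decided by the instruction format from the CSV.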

Key Function: get_2c_binary

When an immediate value is negative (e.g., addi x5, x6, -4), you must represent it using two’s complement. The get_2c_binary function converts a signed integer (positive or negative) into a binary string of a specified width. For example, -4 in 12‑bit two’s complement is 111111111100. This function is essential for I‑type, S‑type, B‑type, and J‑type instructions, where immediates can be negative.
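If you need to write this function yourself, one possible sketch is shown below. The signature get_2c_binary(value, width) and the range check are assumptions; the version supplied with the lab may take different arguments or handle out-of-range values differently.

```python
def get_2c_binary(value, width):
    """Return `value` as a two's complement bit string of `width` bits."""
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {width} signed bits")
    # Masking with 2**width - 1 maps a negative value to its
    # two's complement bit pattern; non-negative values are unchanged.
    return format(value & ((1 << width) - 1), f"0{width}b")

print(get_2c_binary(-4, 12))  # 111111111100
print(get_2c_binary(8, 12))   # 000000001000
```

Rejecting out-of-range values early is a design choice: silently truncating an immediate that does not fit would produce a valid-looking but wrong instruction.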

Instruction Formats You'll Encounter

In this lab, you must handle at least six formats: R, I, S, B, U, J. All formats share the same 32‑bit length and place the opcode in bits [6:0]. The differences lie in how the remaining bits are allocated to source/destination registers and immediates. In code, this means writing either a separate encoding function per format or a single function driven by format‑specific bit fields. A clean approach is to use a dictionary that maps each format to a list of bit field definitions.
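The dictionary approach might look like the sketch below for the three formats whose immediate is one contiguous field (R, I, U). FORMATS and pack are hypothetical names, and the opcode and funct values in the examples are the standard RV32I encodings rather than values read from the lab's CSV. S, B, and J split their immediates across several fields and need extra handling.

```python
# Each field is (name, lowest bit position, width in bits).
FORMATS = {
    "R": [("funct7", 25, 7), ("rs2", 20, 5), ("rs1", 15, 5),
          ("funct3", 12, 3), ("rd", 7, 5), ("opcode", 0, 7)],
    "I": [("imm", 20, 12), ("rs1", 15, 5),
          ("funct3", 12, 3), ("rd", 7, 5), ("opcode", 0, 7)],
    "U": [("imm", 12, 20), ("rd", 7, 5), ("opcode", 0, 7)],
}

def pack(fmt, **fields):
    """Shift each field into place and OR the results into one 32-bit word."""
    word = 0
    for name, shift, width in FORMATS[fmt]:
        mask = (1 << width) - 1
        # Masking also turns a negative immediate into two's complement bits.
        word |= (fields[name] & mask) << shift
    return word

# add x1, x2, x3
print(hex(pack("R", funct7=0, rs2=3, rs1=2, funct3=0, rd=1, opcode=0x33)))
# addi x5, x6, -4
print(hex(pack("I", imm=-4, rs1=6, funct3=0, rd=5, opcode=0x13)))
```

Keeping the layout in data rather than in code means a new contiguous-field format only needs a new dictionary entry.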

Why Preprocessing Matters

Preprocessing (Part 1) handled pseudo‑instructions, labels, and comments, producing a clean list of real RISC‑V instructions. Without it, your assembler would need to resolve labels and expand pseudo‑ops, complicating the encoding step. Preprocessing separates concerns, making the assembler modular and easier to debug—just like modern compilers use multiple passes.

Putting It All Together: Example Walkthrough

Let’s encode lw x6, 8(x5). This is an I‑type load. From the CSV: opcode=0x03, funct3=0x2. The immediate is 8 (12‑bit, sign‑extended by the processor), rs1=x5 (00101), rd=x6 (00110). The binary becomes: 000000001000 00101 010 00110 0000011 = 0x0082A303. Your assembler should produce exactly this bit pattern.
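The same arithmetic can be checked in a few lines of Python. This is only a verification sketch with the field values hard-coded; your assembler would take them from the parsed line and the CSV.

```python
# Field values for lw x6, 8(x5).
imm, rs1, funct3, rd, opcode = 8, 5, 0x2, 6, 0x03

word = (
    ((imm & 0xFFF) << 20)  # imm[11:0] -> bits 31:20
    | (rs1 << 15)          # rs1       -> bits 19:15
    | (funct3 << 12)       # funct3    -> bits 14:12
    | (rd << 7)            # rd        -> bits 11:7
    | opcode               # opcode    -> bits 6:0
)

bits = format(word, "032b")
fields = [bits[0:12], bits[12:17], bits[17:20], bits[20:25], bits[25:32]]
print(" ".join(fields))   # 000000001000 00101 010 00110 0000011
print(f"0x{word:08X}")    # 0x0082A303
```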

Testing with Provided Examples

You will test your assembler on example1_out1.txt and example2_out1.txt. The expected output is a .bin file containing the machine code bytes. Compare your output with the sample shown in the lab description to verify correctness. If your binary matches, you’re on the right track.

Common Pitfalls and Tips

  • Register numbering: Remember that x0 is 00000, x1 is 00001, … x31 is 11111. Off‑by‑one errors are common.
  • Immediate encoding: B‑type and J‑type immediates are split across multiple bit fields. Use bit masking and shifting carefully.
  • Endianness: The lab expects little‑endian byte order in the .bin file, so write each 32‑bit word least significant byte first. For example, 0x0082A303 is stored as the bytes 03 A3 82 00.
  • CSV parsing: Use Python’s csv module or manual split. Ensure you handle comments or empty lines.
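The immediate-encoding and endianness tips above can be sketched as follows. encode_b, encode_j, and to_le_bytes are hypothetical helper names, and the code assumes the preprocessed file already gives branch and jump targets as signed byte offsets; adapt it if your Part 1 output represents them differently.

```python
import struct

def encode_b(opcode, funct3, rs1, rs2, offset):
    """B-type: 13-bit signed byte offset whose bit 0 is implicit (always 0)."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError(f"bad branch offset: {offset}")
    imm = offset & 0x1FFF
    return (
        (((imm >> 12) & 0x1) << 31)    # imm[12]
        | (((imm >> 5) & 0x3F) << 25)  # imm[10:5]
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (((imm >> 1) & 0xF) << 8)    # imm[4:1]
        | (((imm >> 11) & 0x1) << 7)   # imm[11]
        | opcode
    )

def encode_j(opcode, rd, offset):
    """J-type: 21-bit signed byte offset whose bit 0 is implicit (always 0)."""
    if offset % 2 or not -1048576 <= offset <= 1048574:
        raise ValueError(f"bad jump offset: {offset}")
    imm = offset & 0x1FFFFF
    return (
        (((imm >> 20) & 0x1) << 31)     # imm[20]
        | (((imm >> 1) & 0x3FF) << 21)  # imm[10:1]
        | (((imm >> 11) & 0x1) << 20)   # imm[11]
        | (((imm >> 12) & 0xFF) << 12)  # imm[19:12]
        | (rd << 7)
        | opcode
    )

def to_le_bytes(word):
    """Pack one 32-bit instruction as 4 bytes, least significant byte first."""
    return struct.pack("<I", word)

print(hex(encode_b(0x63, 0x0, rs1=1, rs2=2, offset=8)))  # beq x1, x2, 8
print(hex(encode_j(0x6F, rd=1, offset=8)))               # jal x1, 8
print(to_le_bytes(0x0082A303).hex(" "))
```

To produce the .bin file, you could open it in "wb" mode and write `to_le_bytes(word)` for each encoded instruction in order.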

Conclusion: From Lab to Real‑World Applications

Assemblers are the bridge between human‑readable code and machine execution. In 2026, with RISC‑V gaining traction in everything from cloud servers to smartwatches, understanding this process gives you a deeper appreciation of how software controls hardware. Completing this lab will prepare you for more advanced topics like linker scripts, relocation, and even writing your own compiler backend.