DNA Pattern Matching in C and RISC-V

Introduction to DNA Pattern Matching

Pattern matching in DNA sequences is a fundamental problem in bioinformatics, with applications in genomic analysis and personalized medicine. In this tutorial, you'll build a program that finds all occurrences of a short DNA pattern (e.g., "GCTTTT") in a long sequence of nucleotides (A, C, G, T). Genome-analysis tools perform the same kind of scan when they look for specific markers. By the end, you'll have a working C implementation and understand how to translate it to RISC-V assembly for ECE2035 Project 1.

Understanding the Problem

Your task is to write a function Match(pattern, patLen, seq, seqLen) that stores the starting index of every occurrence of the pattern in a global array MatchIndices, followed by a terminating -1. Overlapping matches count separately. For example, in the sequence "AAAAAA" with pattern "AAA", the matches start at indices 0, 1, 2, and 3. This is similar to searching for a trending hashtag in a tweet stream: each occurrence, even an overlapping one, is a hit.

C Implementation Strategy

Brute-Force Approach

The simplest method is to slide the pattern over the sequence one character at a time and check for a match at each position. This takes O(n*m) character comparisons in the worst case, where n is the sequence length and m is the pattern length, but it works well for short patterns (3-10 characters) and sequences up to 10,240 characters. Start with this to ensure correctness.

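/* MatchIndices is the global result array described above; it must be declared before this function. */
void Match(char *pattern, int patLen, char *seq, int seqLen) {
    int idx = 0;                                  /* next free slot in MatchIndices */
    /* Try every start position where the whole pattern still fits. */
    for (int i = 0; i <= seqLen - patLen; i++) {
        int match = 1;
        for (int j = 0; j < patLen; j++) {
            if (seq[i+j] != pattern[j]) {
                match = 0;                        /* mismatch: stop comparing at this position */
                break;
            }
        }
        if (match) {
            MatchIndices[idx++] = i;              /* record the start index of this occurrence */
        }
    }
    MatchIndices[idx] = -1;                       /* terminate the list */
}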
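
To try the search outside the project shell, a small harness along the following lines may help. This is only a sketch under stated assumptions: the capacity MAX_MATCHES, the variant name MatchMemcmp, and the helper PrintMatches are hypothetical names chosen for illustration, and the inner loop is replaced by the standard memcmp so that the block compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MATCHES 128              /* assumed capacity, not taken from the project shell */
int MatchIndices[MAX_MATCHES + 1];   /* +1 leaves room for the -1 terminator */

/* Same contract as Match, with memcmp standing in for the inner loop. */
void MatchMemcmp(const char *pattern, int patLen, const char *seq, int seqLen) {
    int idx = 0;
    assert(patLen > 0);
    /* i + patLen <= seqLen keeps the whole window inside the sequence. */
    for (int i = 0; i + patLen <= seqLen && idx < MAX_MATCHES; i++) {
        if (memcmp(seq + i, pattern, (size_t)patLen) == 0) {
            MatchIndices[idx++] = i;
        }
    }
    MatchIndices[idx] = -1;
}

/* Print the stored indices up to the -1 terminator and return how many there were. */
int PrintMatches(void) {
    int count = 0;
    while (MatchIndices[count] != -1) {
        printf("%d ", MatchIndices[count]);
        count++;
    }
    printf("\n");
    return count;
}
```

Calling MatchMemcmp("AAA", 3, "AAAAAA", 6) followed by PrintMatches() should list the overlapping matches 0 1 2 3, and a pattern longer than the sequence should leave only the -1 terminator in the array.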

Optimization Tips

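To improve performance ahead of the assembly translation, check the first character before entering the inner loop. The brute-force loop already stops at the first mismatch, so this mainly saves the cost of setting up the inner loop. Pointer arithmetic in place of indexing maps more directly onto the load instructions you will write in assembly, although an optimizing C compiler usually generates similar code for both forms. Use the DEBUG flag so that debug output is printed only during development.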
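
The tips above could be combined roughly as follows. Treat this as a sketch rather than the project's reference solution: the name MatchFirstChar, the capacity MAX_MATCHES, and the convention that DEBUG is a 0/1 macro are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0                      /* assumed convention: build with -DDEBUG=1 to enable tracing */
#endif

#define MAX_MATCHES 128              /* assumed capacity, not taken from the project shell */
int MatchIndices[MAX_MATCHES + 1];   /* +1 leaves room for the -1 terminator */

/* Pointer-based search that rejects a position as soon as the first character differs. */
void MatchFirstChar(const char *pattern, int patLen, const char *seq, int seqLen) {
    int idx = 0;
    if (patLen > 0 && patLen <= seqLen) {
        /* One past the last start position where the pattern still fits. */
        const char *end = seq + (seqLen - patLen) + 1;
        for (const char *s = seq; s < end && idx < MAX_MATCHES; s++) {
            if (*s != pattern[0]) {
                continue;            /* cheap reject: skip the inner loop entirely */
            }
            int j = 1;
            while (j < patLen && s[j] == pattern[j]) {
                j++;
            }
            if (j == patLen) {
                if (DEBUG) {
                    printf("match at %d\n", (int)(s - seq));
                }
                MatchIndices[idx++] = (int)(s - seq);
            }
        }
    }
    assert(idx <= MAX_MATCHES);
    MatchIndices[idx] = -1;
}
```

With DEBUG left at 0 the printf call is dead code that a compiler can remove, so the tracing costs nothing in a release build.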

From C to RISC-V Assembly

Once your C code works, you'll translate it to RISC-V. The provided shell generates a packed DNA sequence (2 bits per nucleotide) and a pattern. You must unpack the bits, compare them against the pattern, and store the matching indices. The baseline metrics are 44 static instructions, 45,740 dynamic instructions, and 743 words of storage. Your goal is to beat these for credit.

Key Assembly Concepts

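Packed Data: Nucleotides are packed 2 bits each into a word, right to left (the first nucleotide sits in the least significant bits). Note that 16 nucleotides at 2 bits each occupy all 32 bits of a word, while the lower 16 bits alone hold 8, so confirm the per-word count in the shell before choosing shift amounts. Use shifts and masks to extract each nucleotide.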
ECALL 512 generates the sequence and pattern.
ECALL 513 verifies your solution, so call it at the end.
Memory: Use .alloc for MatchIndices (128 words) and keep stack usage minimal.
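
Before writing the shifts and masks in assembly, it can help to prototype the extraction in C. The sketch below assumes 2 bits per nucleotide stored lowest bits first. The nucleotides-per-word count is passed in as a parameter because it should be confirmed against the shell, and the A/C/G/T code mapping in NucleotideChar is purely hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_NUC 2               /* from the tutorial: 2 bits per nucleotide */

/* Return the 2-bit code of nucleotide i, assuming nucsPerWord codes per word, lowest bits first. */
unsigned GetNucleotide(const uint32_t *packed, int i, int nucsPerWord) {
    assert(nucsPerWord > 0 && nucsPerWord <= 16);    /* at most 16 two-bit codes fit in 32 bits */
    uint32_t word = packed[i / nucsPerWord];         /* which word holds nucleotide i */
    int shift = (i % nucsPerWord) * BITS_PER_NUC;    /* bit offset of the code inside that word */
    return (unsigned)((word >> shift) & 0x3u);       /* mask off the two bits of interest */
}

/* Hypothetical code-to-letter mapping, for display only; the shell's encoding may differ. */
char NucleotideChar(unsigned code) {
    static const char letters[4] = { 'A', 'C', 'G', 'T' };
    return letters[code & 0x3u];
}
```

In RISC-V the same steps map onto a load, a shift right (srl or srli), and an andi with the mask 3.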

Optimization Strategies

To reduce the dynamic instruction count, unroll loops for the pattern length (3-7) and use early termination. To reduce static size, reuse code blocks; note that unrolling works against this goal, since it trades a larger static count for fewer dynamic instructions. To reduce storage, minimize global variables and keep values in registers where possible. Every instruction counts, much as it does when optimizing an AI model's inference speed.
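
One way to picture unrolling with early termination at the C level is a comparison routine that jumps straight to the right number of checks for a given pattern length. This is an illustrative sketch only: EqualUnrolled is a hypothetical helper, and it compares unpacked characters rather than the packed words used in the assembly version.

```c
#include <assert.h>

/* Return 1 if the patLen characters at s equal those at p, else 0.
   Lengths 1-7 run as straight-line checks; longer patterns use a plain loop. */
int EqualUnrolled(const char *s, const char *p, int patLen) {
    assert(patLen > 0);
    switch (patLen) {
    case 7: if (s[6] != p[6]) return 0; /* fall through */
    case 6: if (s[5] != p[5]) return 0; /* fall through */
    case 5: if (s[4] != p[4]) return 0; /* fall through */
    case 4: if (s[3] != p[3]) return 0; /* fall through */
    case 3: if (s[2] != p[2]) return 0; /* fall through */
    case 2: if (s[1] != p[1]) return 0; /* fall through */
    case 1: return s[0] == p[0];
    default:
        for (int j = 0; j < patLen; j++) {
            if (s[j] != p[j]) return 0;
        }
        return 1;
    }
}
```

Each case returns at the first mismatch, so no loop counter or branch back to a loop head is needed for the common lengths. The price is extra code, which counts against the static instruction total.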

Trend-Inspired Example: Gaming Leaderboard Search

Imagine you're searching for a player's tag (e.g., "Ninja") across thousands of entries in a gaming leaderboard. Overlapping matches can occur whenever the searched text repeats characters. The sliding comparison used in your DNA search applies to any substring search of this kind, although production databases typically rely on indexes rather than brute-force scans.

Common Pitfalls

Off-by-one errors in loop bounds (there are seqLen - patLen + 1 candidate start positions, so the last valid start index is seqLen - patLen).
Forgetting to terminate MatchIndices with -1.
In assembly, misaligning packed data or using incorrect shift amounts.

Testing Your Code

Use the provided shell's random tests. For C, compile with gcc -Wall -o p1-1 P1-1.c and run the resulting ./p1-1 executable. For assembly, use the RISC-V simulator. Debug with ECALL 552, which highlights a character at a given offset.

Conclusion

By completing this DNA search project, you'll gain skills in algorithm design, C programming, and low-level assembly optimization, all of which are useful in AI, genomics, and embedded systems. Good luck, and remember to keep your code original and efficient!