OpenMP Image Processing Tutorial: Cosc407 A3 Parallelization Without parallel for

Introduction: Why Parallel Processing Matters in 2026

In an era of real-time AI image generation and high-resolution video streaming, processing speed matters. Whether you're enhancing photos on your phone or running computer vision algorithms for autonomous drones, parallel computing is often the key to performance. This tutorial focuses on OpenMP, a widely used API for shared-memory parallel programming in C, C++, and Fortran. You'll learn how to divide image processing work among multiple threads without using the convenient #pragma omp parallel for directive, a common constraint in assignments like Cosc407 A3.

Understanding the Assignment: Cosc407 A3 (20 Marks)

Your task is to take an existing sequential image processing program and parallelize it using OpenMP. The program reads an input image, applies a filter, and writes the output. The goal is to reduce processing time by distributing the workload across 2, 4, 8, and 16 threads. You must report the execution times using omp_get_wtime() and verify correctness by comparing output images.

Key constraint: You cannot use #pragma omp parallel for or #pragma omp for. Instead, partition the work yourself inside a #pragma omp parallel region, either with #pragma omp sections or by computing each thread's share of the loop iterations from its thread ID.
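
As a rough illustration of the sections route, the sketch below splits an image into a top and a bottom half and hands each half to one section. The inversion filter, the invert_rows helper, and the flat row-major 8-bit grayscale layout are assumptions made for this example; they are not taken from the assignment code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter: invert rows [first, last) of an 8-bit grayscale
   image stored row by row in one flat array. */
static void invert_rows(unsigned char *img, int cols, int first, int last)
{
    for (int i = first; i < last; i++) {
        for (int j = 0; j < cols; j++) {
            size_t idx = (size_t)i * (size_t)cols + (size_t)j;
            img[idx] = (unsigned char)(255 - img[idx]);
        }
    }
}

/* Each section is executed exactly once, by whichever thread picks it up. */
void invert_with_sections(unsigned char *img, int rows, int cols)
{
    assert(img != NULL && rows >= 0 && cols >= 0);

    int mid = rows / 2;  /* boundary between the two halves */

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            invert_rows(img, cols, 0, mid);

            #pragma omp section
            invert_rows(img, cols, mid, rows);
        }
    }
}
```

Because the number of sections is fixed in the source, this version keeps at most two threads busy no matter how many are available. That is the main reason the rest of this tutorial computes row ranges from the thread ID instead.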

Setting Up Your Project

Download the provided zip file from Canvas, unzip it, and import the contents into a new C project (Eclipse or your preferred IDE). Make sure OpenMP is enabled in the build settings; for GCC, add -fopenmp to both the compiler and linker flags. Read the readme.txt for copyright info and library background. The first three statements in main() define the input and output images and the processing parameters. Experiment with these constants to understand their effect.

Why Not Use `parallel for`?

While #pragma omp parallel for automatically distributes loop iterations among threads, using manual parallelization gives you finer control over workload distribution and helps you understand thread management. This is a common educational exercise to deepen your grasp of OpenMP.

Step-by-Step Parallelization Without `parallel for`

1. Identify the Computationally Intensive Part

Look for the loops that visit every pixel of the image, typically a nested loop over rows and columns. This is where most of the time is spent.

2. Partition the Work Manually

Instead of letting OpenMP distribute iterations, you will assign each thread a contiguous block of rows (or columns). Use omp_get_thread_num() and omp_get_num_threads() to compute the start and end indices for each thread.

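#pragma omp parallel
{
    // Variables declared inside the region are private to each thread.
    int id = omp_get_thread_num();            // this thread's index, 0..num_threads-1
    int num_threads = omp_get_num_threads();  // actual size of the team
    int rows_per_thread = total_rows / num_threads;
    int start_row = id * rows_per_thread;
    // The last thread also takes any rows left over by the integer division.
    int end_row = (id == num_threads - 1) ? total_rows : start_row + rows_per_thread;

    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < total_cols; j++) {
            // process pixel (i,j)
        }
    }
}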

3. Handle Edge Cases

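The integer division in rows_per_thread drops the remainder when total_rows is not evenly divisible by num_threads. The conditional expression for end_row in the code above hands those leftover rows (at most num_threads - 1 of them) to the last thread, so every row is processed exactly once.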
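
Giving every leftover row to the last thread is correct, but it can leave that thread with more work than the others. A common alternative, sketched here under the assumption that all rows cost roughly the same to process, spreads the remainder so that no thread gets more than one extra row. The helper name row_range is made up for this example.

```c
#include <assert.h>

/* Compute the half-open row range [*start, *end) for thread `id` out of
   `num_threads`, giving the first (total_rows % num_threads) threads one
   extra row each. */
void row_range(int id, int num_threads, int total_rows, int *start, int *end)
{
    assert(num_threads > 0 && id >= 0 && id < num_threads && total_rows >= 0);

    int base  = total_rows / num_threads;  /* rows every thread gets */
    int extra = total_rows % num_threads;  /* leftover rows to spread out */

    /* Threads before this one hold id * base rows, plus one extra row for
       each of them that is among the first `extra` threads. */
    *start = id * base + (id < extra ? id : extra);
    *end   = *start + base + (id < extra ? 1 : 0);
}
```

Inside the parallel region, a call such as row_range(omp_get_thread_num(), omp_get_num_threads(), total_rows, &start_row, &end_row) would replace the start_row and end_row calculations. With 10 rows and 4 threads, the ranges come out as [0,3), [3,6), [6,8), and [8,10).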

4. Timing the Execution

Use omp_get_wtime() to measure elapsed time. Place double start = omp_get_wtime(); before the parallel region and double end = omp_get_wtime(); after. Print the difference.

double start = omp_get_wtime();
#pragma omp parallel
{
    // parallel work
}
double end = omp_get_wtime();
printf("Time: %f seconds\n", end - start);

Comparing Performance Across Thread Counts

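Run your program with 2, 4, 8, and 16 threads. The thread count can be set with the OMP_NUM_THREADS environment variable, a call to omp_set_num_threads(), or a num_threads clause on the parallel directive. Record the times and include them as comments in your code. For example: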

// Performance results:
// 2 threads: 1.234 sec
// 4 threads: 0.678 sec
// 8 threads: 0.456 sec
// 16 threads: 0.389 sec

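You will likely see diminishing returns as the thread count increases. In the sample figures above, going from 2 to 4 threads cuts the time by about 45%, while going from 8 to 16 threads saves only about 15%. Typical causes are thread startup and synchronization overhead, limited memory bandwidth, and requesting more threads than the machine has cores. This is a classic observation in parallel computing.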
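
One way to collect these timings is to request each thread count with a num_threads clause and time the region with omp_get_wtime(). The following is a minimal sketch, assuming an 8-bit grayscale image in a flat row-major array and an inversion filter standing in for the real one; time_invert is a hypothetical name. The #else branch is only there so the file also builds without OpenMP enabled, which gives a single-threaded baseline.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also builds without OpenMP (one thread only). */
#include <time.h>
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Invert the image with the requested number of threads and return the
   elapsed time in seconds. */
double time_invert(unsigned char *img, int rows, int cols, int requested)
{
    assert(img != NULL && rows >= 0 && cols >= 0 && requested > 0);

    double start = omp_get_wtime();

    #pragma omp parallel num_threads(requested)
    {
        /* Query the team size inside the region: the runtime may provide
           fewer threads than were requested. */
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        int per_thread = rows / n;
        int first = id * per_thread;
        int last  = (id == n - 1) ? rows : first + per_thread;

        for (int i = first; i < last; i++) {
            for (int j = 0; j < cols; j++) {
                size_t idx = (size_t)i * (size_t)cols + (size_t)j;
                img[idx] = (unsigned char)(255 - img[idx]);
            }
        }
    }

    return omp_get_wtime() - start;
}
```

Calling time_invert once for each of 2, 4, 8, and 16 and printing the results gives the four numbers for the comment block. Note that this stand-in filter modifies the image in place, so each call should start from a fresh copy of the input. Timing a few repetitions and keeping the fastest can help reduce noise from other processes.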

Verifying Correctness

Compare the output image produced by your parallel program with the sequential version. They should be identical (or nearly identical, up to floating point rounding). Use image comparison tools or check pixel values programmatically.
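
A programmatic check can be as simple as counting the samples whose values differ by more than a tolerance. This sketch assumes both images are available as flat arrays of 8-bit samples of the same length; count_mismatches is a name chosen for this example rather than something from the provided library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Count samples whose absolute difference exceeds `tolerance`.
   A tolerance of 0 demands identical images. */
size_t count_mismatches(const unsigned char *a, const unsigned char *b,
                        size_t n, int tolerance)
{
    assert(a != NULL && b != NULL && tolerance >= 0);

    size_t mismatches = 0;
    for (size_t k = 0; k < n; k++) {
        if (abs((int)a[k] - (int)b[k]) > tolerance) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A return value of 0 for the sequential and parallel outputs means the two agree within the chosen tolerance.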

Trend Connection: Real-Time AI Filters

Think of this assignment as building a simplified version of the real-time filters used in apps like TikTok or Instagram. Those apps process millions of pixels per frame, often using parallel techniques similar to OpenMP but on GPUs. Understanding manual thread distribution prepares you for more advanced parallel programming in AI and computer vision.

Common Pitfalls

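Data races: Ensure threads do not write to the same pixel simultaneously. Writing to distinct rows of a separate output image avoids this. Be careful with filters that read neighboring rows and write back into the same image, and with loop counters or temporaries declared outside the parallel region, since those are shared by default.
False sharing: When threads write to adjacent memory locations, cache lines may bounce between cores. With contiguous row blocks this mostly affects small shared arrays, such as per-thread counters indexed by thread ID, rather than the pixel data itself. Pad data structures or use local variables to minimize this.
Overhead of thread creation: For small images, the cost of starting and synchronizing threads may outweigh the benefits. Experiment with image size.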
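
The advice to use local variables can be illustrated with a computation that needs a shared result, for example the sum of all pixel values. In this sketch (the function name sum_pixels and the flat 8-bit image layout are assumptions), each thread accumulates into its own local variable and touches the shared total only once, inside a critical section. That avoids both a data race and repeated writes to a shared cache line.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also builds without OpenMP (one thread only). */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

unsigned long long sum_pixels(const unsigned char *img, int rows, int cols)
{
    assert(img != NULL && rows >= 0 && cols >= 0);

    unsigned long long total = 0;  /* declared outside the region: shared */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        int per_thread = rows / n;
        int first = id * per_thread;
        int last  = (id == n - 1) ? rows : first + per_thread;

        unsigned long long local = 0;  /* declared inside the region: private */
        for (int i = first; i < last; i++) {
            for (int j = 0; j < cols; j++) {
                local += img[(size_t)i * (size_t)cols + (size_t)j];
            }
        }

        /* Only one thread at a time adds its partial sum to the total. */
        #pragma omp critical
        total += local;
    }

    return total;
}
```

Updating total directly inside the inner loop, without the critical section, would be a data race; putting the critical section inside the loop would be correct but would serialize the threads.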

Conclusion

By manually partitioning work among OpenMP threads, you gain a deeper understanding of parallel computing. This skill is invaluable for optimizing real-world applications, from video processing to scientific simulations. Keep experimenting with different thread counts and image sizes to see how performance scales.

Additional Tips for Cosc407 Students

Read the textbook section 2.6.4 for details on omp_get_wtime().
Use the program arguments in Eclipse's run configuration to pass the thread count or image filename on the command line.
Document your code clearly, especially the timing results.

Good luck with your assignment! Remember, resubmissions overwrite old ones, so test thoroughly before final submission.