Manual Parallelization Tutorial for CAB401: High Performance Computing

Introduction to Manual Parallelization for CAB401

In the CAB401 assignment, you are tasked with selecting a real-world software application and manually parallelizing it. This tutorial walks you through the entire process, from analyzing the sequential code to achieving optimal speedup. Whether you are targeting a multi-core CPU, a GPU, or a cluster, the principles remain the same. By the end, you will understand how to identify parallelism, use profiling tools, and overcome performance bottlenecks.

Step 1: Selecting and Analyzing the Sequential Application

Choose an application that is computationally intensive. For example, a video processing tool, a physics simulation, or a financial Monte Carlo simulator. Avoid I/O-bound applications like word processors. Analyze the application's structure using call graphs and class diagrams. Identify hot spots using profilers like gprof or Valgrind. Look for loops and control flow constructs that can be parallelized.

Consider the analogy of a popular AI image generator like DALL-E: generating a single image requires heavy matrix operations, which are perfect for parallelization. Similarly, in your chosen app, focus on loops that perform independent calculations.

Data and Control Dependencies

Before parallelizing, map out data dependencies. Use tools like Intel Advisor or manual code inspection. For instance, a loop that updates a shared variable in each iteration has a data dependency that must be handled with synchronization or algorithm replacement. Control dependencies occur when the execution path depends on a condition; these can often be parallelized by speculative execution or restructuring.

Determine which parallelism is scalable—meaning it can utilize many cores efficiently. Amdahl's Law tells us that the sequential portion limits speedup. Aim to parallelize at least 90% of the execution time.

Step 2: Choosing Parallel Hardware and Frameworks

You can use any parallel hardware: multi-core CPU (OpenMP), GPU (CUDA or OpenCL), or distributed memory (MPI). For this tutorial, we'll focus on multi-core CPU using OpenMP, which is widely available and easy to start with. If you have access to an NVIDIA GPU, CUDA can yield massive speedups for data-parallel tasks.

Consider the current trend of cloud gaming like NVIDIA GeForce Now: games are rendered on remote GPUs. Similarly, your parallel code can offload heavy computation to GPUs for real-time performance.

Step 3: Profiling and Identifying Hotspots

Use profiling tools to measure execution time of functions. For example, compile with -pg and run gprof. Generate a flat profile and call graph. Look for functions that consume the most CPU time. These are your candidates for parallelization.

Example profiling output might show that a function compute_pixels() takes 80% of the time. That's where you focus.

Step 4: Mapping Computation to Processors

Decide how to distribute work. For OpenMP, use #pragma omp parallel for to parallelize loops. Ensure that loop iterations are independent. If there are data dependencies, use reduction clauses or critical sections. For example, a sum reduction can be done with #pragma omp parallel for reduction(+:sum).

For GPU, you'd map threads to data elements. For instance, each thread computes one pixel value. This is called data parallelism.

Step 5: Synchronization and Avoiding Race Conditions

When multiple threads access shared data, use synchronization mechanisms. In OpenMP, you can use #pragma omp critical or atomic operations. However, excessive synchronization kills performance. Try to restructure code to avoid sharing, e.g., use private variables or local copies.

Consider the example of a viral social media app like TikTok: recommendations are computed by parallel servers that avoid conflicts by partitioning user data. Similarly, partition your data so each thread works on its own chunk.

Step 6: Testing for Correctness

After parallelizing, verify that the output matches the sequential version exactly. Use a test suite with multiple input sizes. For floating-point operations, small differences may arise due to non-associative arithmetic; define a tolerance for comparison. Use assertions or diff tools.

Step 7: Measuring Speedup and Scaling

Run your parallel program with 1, 2, 4, 8, etc., threads. Measure wall-clock time. Plot a speedup graph (speedup = T_sequential / T_parallel). Compare to ideal linear speedup. Use tools like time or built-in OpenMP timers.

If your speedup plateaus, investigate barriers like load imbalance or memory bandwidth. For example, if one thread finishes early, redistribute work dynamically using schedule(dynamic).

Step 8: Overcoming Performance Barriers

Common barriers include:

Load Imbalance: Use dynamic scheduling or chunk sizes.
Memory Contention: Use private arrays or false sharing avoidance.
Granularity: Merge small parallel sections to reduce overhead.
Data Dependencies: Replace algorithms (e.g., use parallel prefix sum).

Document your journey: what barriers you hit and how you solved them. This is crucial for the report.

Step 9: Documenting Your Work

Your final submission includes a report (10-15 pages) and source code. The report must cover:

Description of the original application and its architecture.
Analysis of potential parallelism and dependencies.
Mapping strategy and synchronization details.
Profiling results and speedup graph.
Correctness testing approach.
Tools used (compilers, profilers, libraries).
Story of overcoming barriers.
Code changes with line counts.
Reflection on lessons learned.

Include before and after code snippets to highlight changes.

Example: Parallelizing a Mandelbrot Set Generator

Let's walk through a concrete example. The sequential Mandelbrot generator computes each pixel independently—perfect for parallelization. Using OpenMP, you can parallelize the outer loop:

#pragma omp parallel for schedule(dynamic) private(x, y, iter) for (int i = 0; i < height; i++) { for (int j = 0; j < width; j++) { // compute pixel } }

Profiling shows 95% of time in the pixel loop. With 4 cores, speedup is ~3.8x. Memory bandwidth is not a bottleneck because each pixel is independent. This achieves excellent speedup.

Conclusion

Manual parallelization is a skill that combines analysis, tool use, and creativity. By following these steps, you can succeed in CAB401. Remember to reflect on your process—what worked, what didn't, and what you'd do differently. Good luck!