Assignment Chef icon Assignment Chef
All English tutorials

Programming lesson

Mastering OpenMP: Find the Maximum Neighborhood Average in C++

Learn how to use OpenMP to parallelize a C++ program that finds the cell with the highest neighborhood average in a large 2D array. Step-by-step tutorial with code examples and performance tips.

OpenMP tutorial C++ parallel computing neighborhood average maximum average cell CS286 assignment OpenMP reduction multithreaded C++ dynamic memory allocation high resolution timer C++11 OpenMP speedup parallel programming examples GCC OpenMP cache optimization critical section OpenMP matrix averaging C++ competitive programming

Introduction to OpenMP and Neighborhood Averaging

In this tutorial, you will learn how to write a multithreaded C++ program using OpenMP to find the cell with the highest neighborhood average in a large two-dimensional array. This is a common assignment in parallel computing courses (like CS286) and a great way to understand shared-memory parallelism. The problem involves reading a matrix from a file, computing the average of each cell's 3x3 neighborhood (including itself), and reporting the cell(s) with the maximum average. We'll use OpenMP directives to parallelize the computation and speed it up, aiming for the fastest execution time—just like competing in a speedrun challenge.

Imagine you're analyzing a heatmap of a popular multiplayer game like Fortnite to find the hottest drop zone. The matrix represents player density, and you need the cell with the highest average density in its 9-cell neighborhood. That's exactly what we'll code, but with random numbers. Let's dive in!

Setting Up the Project

You'll need a C++11 compiler with OpenMP support (GCC 4.3+). Create a file named matAverager.cpp and a Makefile as shown below. The program should accept command-line arguments: either a filename or the keyword rand followed by thread count, rows, columns, and seed.

// Makefile
all:
    g++ -fopenmp -ggdb -std=c++11 matAverager.cpp -o matavg

Compile with make. The -fopenmp flag enables OpenMP, and -std=c++11 ensures we have the high-resolution timer.

Reading the Input Matrix

The input file's first line contains two integers: rows and columns. Subsequent lines contain the matrix elements. Use dynamic memory allocation (new) because the array can be huge (e.g., 10000x20000). We'll read the data into a 1D array for better cache performance, but treat it as 2D with index [i * cols + j].

unsigned int *M = new unsigned int[rows * cols];
// Read numbers from file or generate random

When using the rand option, generate random numbers using srand(seed) and rand(). The sample output shows seed 0 producing reproducible results.

Computing the Neighborhood Average

For each cell (i,j), the neighborhood includes all cells from i-1 to i+1 and j-1 to j+1 that are within bounds. The average is the sum of these 9 (or fewer at edges) values divided by the count. Here's the serial version:

double computeAverage(unsigned int *M, int rows, int cols, int i, int j) {
    double sum = 0.0;
    int count = 0;
    for (int di = -1; di <= 1; ++di) {
        for (int dj = -1; dj <= 1; ++dj) {
            int ni = i + di;
            int nj = j + dj;
            if (ni >= 0 && ni < rows && nj >= 0 && nj < cols) {
                sum += M[ni * cols + nj];
                ++count;
            }
        }
    }
    return sum / count;
}

We'll compute this for every cell and track the maximum average and its position.

Parallelizing with OpenMP

OpenMP makes parallelization straightforward. We'll use #pragma omp parallel for to distribute rows among threads. Since each cell's average is independent, we can compute them in parallel. However, we need to update the global maximum safely. Use a critical section or a reduction with a custom struct. The simplest approach: each thread finds its local max, then combine in a critical section.

double globalMax = -1.0;
int bestRow = -1, bestCol = -1;
#pragma omp parallel for num_threads(numThreads) schedule(static)
for (int i = 0; i < rows; ++i) {
    double localMax = -1.0;
    int localRow = -1, localCol = -1;
    for (int j = 0; j < cols; ++j) {
        double avg = computeAverage(M, rows, cols, i, j);
        if (avg > localMax) {
            localMax = avg;
            localRow = i;
            localCol = j;
        }
    }
    #pragma omp critical
    {
        if (localMax > globalMax) {
            globalMax = localMax;
            bestRow = localRow;
            bestCol = localCol;
        }
    }
}

This works, but the critical section can become a bottleneck. For better performance, use an array of per-thread maxima and combine after the loop.

Handling Ties and Output

If multiple cells have the same maximum average, the assignment requires reporting only one. Our code naturally picks the first encountered (lowest row, then column) because we update only when avg > localMax (strictly greater). To match the sample output, we must output exactly one cell. The format is:

largest average: 7.66667 found at cells: (0,1)

Note: the sample shows two cells for the tie case, but your program should output only one. The assignment says: "If there is a tie, your program must report only a single cell that ties the maximum value." So we output one cell.

Include timing using the stopwatch class from the provided skeleton. Print elapsed time (optional for grading but required for the competition).

Performance Tips for the Speed Competition

To win the prize, you need the fastest program. Here are some optimization strategies:

  • Use dynamic scheduling: If rows have varying workloads (e.g., edges have fewer neighbors), schedule(dynamic) may help.
  • Minimize critical sections: Use an array of double for per-thread maxima and combine after the parallel loop.
  • Cache optimization: Access memory sequentially. Our row-major loop already does that.
  • Compiler optimizations: Use -O2 or -O3 in addition to -fopenmp.
  • Reduce function call overhead: Inline computeAverage or hand-code the loops.

Test with large matrices like 10000x20000. The sample shows a serial time of 2.05s vs parallel 0.336s with 10 threads—a 6x speedup. Aim for near-linear speedup!

Complete Example Code

Below is a simplified but functional version. You'll need to add the stopwatch class and command-line parsing. The full assignment skeleton is provided; integrate this logic.

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <omp.h>
using namespace std;
int main(int argc, char *argv[]) {
    if (argc < 3) {
        cerr << "usage: exe [input data file] [num threads]\n";
        return 1;
    }
    int numThreads = atoi(argv[2]);
    int rows, cols;
    unsigned int *M;
    // Read file or generate random...
    // Assume M is allocated and filled.
    double globalMax = -1.0;
    int bestRow = -1, bestCol = -1;
    #pragma omp parallel num_threads(numThreads)
    {
        double localMax = -1.0;
        int localRow = -1, localCol = -1;
        #pragma omp for schedule(static)
        for (int i = 0; i < rows; ++i) {
            for (int j = 0; j < cols; ++j) {
                double avg = 0.0;
                int count = 0;
                for (int di = -1; di <= 1; ++di) {
                    for (int dj = -1; dj <= 1; ++dj) {
                        int ni = i + di, nj = j + dj;
                        if (ni >= 0 && ni < rows && nj >= 0 && nj < cols) {
                            avg += M[ni * cols + nj];
                            ++count;
                        }
                    }
                }
                avg /= count;
                if (avg > localMax) {
                    localMax = avg;
                    localRow = i;
                    localCol = j;
                }
            }
        }
        #pragma omp critical
        {
            if (localMax > globalMax) {
                globalMax = localMax;
                bestRow = localRow;
                bestCol = localCol;
            }
        }
    }
    cout << "largest average: " << globalMax << " found at cells: (" << bestRow << "," << bestCol << ")\n";
    delete[] M;
    return 0;
}

Compile and test with the provided sample files. Ensure output matches exactly.

Conclusion

You've learned how to parallelize a neighborhood averaging algorithm using OpenMP. This technique is widely applicable in image processing, scientific simulations, and data analysis. Just like optimizing a machine learning model for faster inference, parallel computing lets you handle larger datasets efficiently. Now go win that competition!