Parallel Matrix Multiplication in C++: 2025 Tutorial with Threads & Mutex

Introduction: Why Parallel Matrix Multiplication Matters in 2025

Matrix multiplication is a cornerstone of modern computing, powering everything from AI neural networks to 3D graphics in gaming. In 2025, with the rise of generative AI apps and real-time data processing, understanding how to parallelize this operation is more relevant than ever. This tutorial guides you through converting a sequential C++ matrix multiplication program into a parallel one using threads, mutexes, and shared_mutex, based on the classic assignment from CSCN73000.

Understanding Matrix Multiplication

Before diving into parallelism, ensure you understand the sequential algorithm. Given two matrices A (M x K) and B (K x N), the product C (M x N) is computed as:

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = 0;
        for (int k = 0; k < K; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

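This triple-nested loop performs M x N x K multiply-add operations (about one billion for 1000 x 1000 matrices). Each C[i][j] depends only on row i of A and column j of B, so the output elements can be computed independently, which makes the loop an ideal candidate for parallelization.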
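To make the loop concrete, here is a minimal sketch that uses flat row-major storage, the same layout the parallel code in this tutorial indexes with i * N + j. The function name sequential_matrix_multiply and the use of std::vector are assumptions made for illustration, not part of the original assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns C = A * B for row-major matrices.
// A is M x K, B is K x N, and the result is M x N.
std::vector<double> sequential_matrix_multiply(int M, int N, int K,
                                               const std::vector<double>& A,
                                               const std::vector<double>& B) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<double> C(static_cast<std::size_t>(M) * N, 0.0);
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;  // dot product of row i of A and column j of B
            for (int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    return C;
}
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], the function returns [[19, 22], [43, 50]], which is easy to check by hand before moving on to larger inputs.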

Designing the Parallel Version

We will create two functions: parallel_matrix_multiply() and parallel_helper(). The first function divides the work among threads; the second contains the code each thread executes.

Step 1: Decide on Data Decomposition

The most common approach is to split the output matrix rows among threads. Each thread computes a contiguous block of rows. For example, with 4 threads and 1000 rows, thread 0 handles rows 0-249, thread 1 rows 250-499, etc.
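The split described above can be sketched as a small function that maps a thread index to its row range. The name row_range is hypothetical. Unlike the helper in Step 2, which gives all leftover rows to the last thread, this variant spreads the remainder over the first few threads; that is one possible design choice, not the assignment's required one.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: returns the half-open row range [first, last)
// assigned to thread_id when M rows are split among num_threads threads.
// The first (M % num_threads) threads each take one extra row.
std::pair<int, int> row_range(int thread_id, int num_threads, int M) {
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    int base = M / num_threads;
    int extra = M % num_threads;
    int first = thread_id * base + (thread_id < extra ? thread_id : extra);
    int last = first + base + (thread_id < extra ? 1 : 0);
    return {first, last};
}
```

With 4 threads and 1000 rows this yields [0, 250), [250, 500), [500, 750), and [750, 1000), matching the example above. With 1002 rows the first two threads get 251 rows each, so no thread carries more than one extra row.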

Step 2: Implement parallel_helper()

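// Computes the block of rows of C assigned to thread_id.
// A is M x K, B is K x N, C is M x N, all stored row-major in flat arrays.
void parallel_helper(int thread_id, int num_threads, int M, int N, int K,
                     const double* A, const double* B, double* C) {
    int rows_per_thread = M / num_threads;
    int start_row = thread_id * rows_per_thread;
    // The last thread also takes the leftover rows when M is not divisible by num_threads.
    int end_row = (thread_id == num_threads - 1) ? M : start_row + rows_per_thread;

    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < N; j++) {
            C[i * N + j] = 0;
            for (int k = 0; k < K; k++) {
                // Element (i, k) of A times element (k, j) of B.
                C[i * N + j] += A[i * K + k] * B[k * N + j];
            }
        }
    }
}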

Step 3: Implement parallel_matrix_multiply()

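#include <thread>  // std::thread (put both includes at the top of the file)
#include <vector>  // std::vector

// Launches num_threads workers, one block of rows each, and waits for all of them.
void parallel_matrix_multiply(int M, int N, int K, double* A, double* B, double* C, int num_threads) {
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back(parallel_helper, t, num_threads, M, N, K, A, B, C);
    }
    // join() blocks until the thread finishes; C is complete once every join returns.
    for (auto& t : threads) {
        t.join();
    }
}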
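Before timing anything, it is worth confirming that the parallel result matches the sequential one. The two helpers below are a hedged sketch of such a check; fill_pattern and max_abs_diff are hypothetical names, and the fill pattern is an arbitrary choice that simply makes runs repeatable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a buffer with a deterministic pattern
// (scale * (index mod 7)) so that repeated runs use identical inputs.
void fill_pattern(std::vector<double>& m, double scale) {
    for (std::size_t i = 0; i < m.size(); i++) {
        m[i] = scale * static_cast<double>(i % 7);
    }
}

// Hypothetical helper: largest absolute element-wise difference
// between two buffers of the same length.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); i++) {
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

A typical use is to fill A and B once, compute one result with the sequential loop and one with parallel_matrix_multiply (passing the vectors' data() pointers), and then require max_abs_diff of the two results to stay below a small tolerance such as 1e-9. With the row-wise split each element is summed in the same order in both versions, so the difference is usually exactly zero, but a tolerance is the safer assumption.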

Measuring Performance and Efficiency

Time both the sequential and parallel versions using std::chrono. Compute efficiency as:

Efficiency = (Sequential_Time / Parallel_Time) / Number_of_Threads

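For example, if sequential time = 12.5 s and parallel time with 4 threads = 3.8 s, the speedup is 12.5 / 3.8 ≈ 3.29 and the efficiency is 3.29 / 4 ≈ 0.82 (82%). Ideal efficiency is 1.0; thread start-up cost, uneven work distribution, and contention for memory bandwidth keep real programs below it.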
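The measurement itself can be sketched with std::chrono::steady_clock, which is a monotonic clock and therefore suited to interval timing. The helper names time_seconds and efficiency are assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs work() once and returns the elapsed
// wall-clock time in seconds.
template <typename Work>
double time_seconds(Work&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup is sequential time divided by parallel time;
// efficiency is speedup divided by the number of threads.
double efficiency(double sequential_seconds, double parallel_seconds, int num_threads) {
    assert(parallel_seconds > 0.0 && num_threads > 0);
    return (sequential_seconds / parallel_seconds) / num_threads;
}
```

A call such as time_seconds([&] { parallel_matrix_multiply(M, N, K, A, B, C, 4); }) times one parallel run. Averaging several runs, and building with optimization enabled, gives steadier numbers than a single measurement.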

Experimenting with Thread Count

Run your program with different numbers of threads (e.g., 1, 2, 4, 8) and observe the effect on efficiency. On a 4-core CPU, speedup is capped near 4, so using more than 4 threads usually lowers efficiency: the extra threads add context switching without adding compute capacity. The same oversubscription effect appears in real-world workloads such as server load balancing during major esports tournaments or AI model training.
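One way to choose the thread counts for this experiment is to derive them from the core count reported by std::thread::hardware_concurrency(). The sketch below is an assumption about how you might structure the sweep; the helper names are hypothetical, and going up to twice the core count is an arbitrary choice that deliberately includes oversubscribed settings.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: powers of two from 1 up to twice the core count.
std::vector<int> thread_counts_to_try(unsigned cores) {
    if (cores == 0) {
        cores = 1;  // hardware_concurrency() may return 0 when the count is unknown
    }
    std::vector<int> counts;
    for (unsigned n = 1; n <= 2 * cores; n *= 2) {
        counts.push_back(static_cast<int>(n));
    }
    return counts;
}

// Convenience wrapper that queries the machine the program runs on.
std::vector<int> thread_counts_for_this_machine() {
    return thread_counts_to_try(std::thread::hardware_concurrency());
}
```

On a machine that reports 4 cores, the sweep is 1, 2, 4, 8. Looping over the returned counts and recording efficiency for each one gives the data for an efficiency-versus-threads table.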

Adding Synchronization: std::mutex

In some parallel designs, threads may write to shared data (e.g., the result matrix). To protect against race conditions, use a mutex:

std::mutex mtx;  // one mutex shared by all threads (requires <mutex>)

// Inside parallel_helper, when writing to C:
{
    std::lock_guard<std::mutex> lock(mtx);  // released when this scope ends
    C[i * N + j] = ...;
}

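However, locking for every write can severely degrade performance, because all threads queue for the same lock and the writes run one at a time. For a 1000x1000 matrix, measure the average efficiency over 10 runs. You might see efficiency drop to 10-20%, depending on your hardware.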
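To reproduce this cost, here is a minimal, self-contained sketch of the experiment. It assumes the lock is taken around every update of C, including each accumulation in the innermost loop; the names locked_helper and locked_matrix_multiply are hypothetical and not part of the original assignment.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;  // one lock shared by every thread (deliberately coarse)

// Thread body: computes rows [start_row, end_row) of C and takes the
// lock for every single write to C.
void locked_helper(int start_row, int end_row, int N, int K,
                   const double* A, const double* B, double* C) {
    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < N; j++) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                C[i * N + j] = 0;
            }
            for (int k = 0; k < K; k++) {
                std::lock_guard<std::mutex> lock(mtx);  // held for one update only
                C[i * N + j] += A[i * K + k] * B[k * N + j];
            }
        }
    }
}

// Splits the rows among num_threads threads; the last thread takes any leftover rows.
void locked_matrix_multiply(int M, int N, int K, const double* A, const double* B,
                            double* C, int num_threads) {
    assert(num_threads > 0);
    std::vector<std::thread> threads;
    int rows_per_thread = M / num_threads;
    for (int t = 0; t < num_threads; t++) {
        int start = t * rows_per_thread;
        int end = (t == num_threads - 1) ? M : start + rows_per_thread;
        threads.emplace_back(locked_helper, start, end, N, K, A, B, C);
    }
    for (auto& t : threads) {
        t.join();
    }
}
```

The result is still correct; only the speed suffers. Accumulating into a local variable and locking once per element would cut the number of lock operations by a factor of K, which is a useful second data point for the experiment.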

Switching to std::shared_mutex

A std::shared_mutex allows either many concurrent readers (shared lock) or a single writer (exclusive lock). It pays off only when reads greatly outnumber writes. In our kernel every access to C is a write, so each thread must take the exclusive lock, and the shared_mutex then behaves like a plain mutex with extra bookkeeping. Giving each row its own mutex would remove the contention, but since each row is written by exactly one thread, those locks would protect nothing. A better approach is to avoid locking altogether by ensuring threads write to disjoint memory (which we already do). This experiment highlights the cost of unnecessary synchronization.

Example using a shared_mutex for the entire matrix (not recommended):

std::shared_mutex sh_mtx;

// For write:
std::unique_lock<std::shared_mutex> lock(sh_mtx);
C[i * N + j] = ...;

// For read (if needed):
std::shared_lock<std::shared_mutex> lock(sh_mtx);
... = C[i * N + j];

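With every write taking the exclusive lock, average efficiency with shared_mutex is unlikely to beat std::mutex and may be slightly worse, since a shared_mutex typically does more work per lock operation. Either way it stays far from optimal because the lock is coarse-grained. Measure both on your machine rather than assuming.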
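For completeness, the fragments above can be assembled into a small compilable sketch (C++17). The class name GuardedMatrix and its interface are hypothetical; the point is only to show which lock type goes with which kind of access.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical wrapper: one shared_mutex guarding a whole row-major matrix.
class GuardedMatrix {
public:
    GuardedMatrix(int rows, int cols)
        : cols_(cols), data_(static_cast<std::size_t>(rows) * cols, 0.0) {
        assert(rows > 0 && cols > 0);
    }

    // Writers take the exclusive lock, which blocks readers and other writers.
    void write(int i, int j, double value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        data_[static_cast<std::size_t>(i) * cols_ + j] = value;
    }

    // Readers take the shared lock, so any number of them can proceed at once.
    double read(int i, int j) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return data_[static_cast<std::size_t>(i) * cols_ + j];
    }

private:
    int cols_;
    std::vector<double> data_;
    mutable std::shared_mutex mutex_;  // mutable so read() can lock in a const method
};
```

A structure like this suits a workload where a finished matrix is read by many threads and updated rarely. In the multiplication kernel itself, only write() would ever be called, which is why the shared lock brings no benefit there.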

Optimizing the Design

To achieve near-linear speedup, ensure each thread works on independent data with no synchronization. Our initial row-wise decomposition already does that. The best efficiency is obtained by not using any locks when writing to disjoint rows. If you must share data, consider using atomic operations or lock-free techniques.

For instance, in a gaming leaderboard update system, you might use atomic integers to avoid locks. Similarly, in matrix multiplication, if you need to accumulate partial sums from different threads, use a local buffer per thread and merge at the end.
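The local-buffer idea can be sketched as follows. This variant splits the inner dimension K among threads instead of the rows, so every thread contributes a partial sum to every element of C. The function name k_split_matrix_multiply is hypothetical, and this decomposition is shown only to illustrate the merge step; for this problem the row-wise split remains the simpler choice.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: each thread handles a slice of k and accumulates
// into its own private M x N buffer, so no locks are needed. The buffers
// are summed into C after all threads have joined.
void k_split_matrix_multiply(int M, int N, int K, const double* A, const double* B,
                             double* C, int num_threads) {
    assert(num_threads > 0);
    const std::size_t size = static_cast<std::size_t>(M) * N;
    std::vector<std::vector<double>> partial(num_threads, std::vector<double>(size, 0.0));

    std::vector<std::thread> threads;
    int chunk = K / num_threads;
    for (int t = 0; t < num_threads; t++) {
        int k_begin = t * chunk;
        int k_end = (t == num_threads - 1) ? K : k_begin + chunk;
        threads.emplace_back([=, &partial] {
            std::vector<double>& local = partial[t];  // touched by this thread only
            for (int i = 0; i < M; i++) {
                for (int k = k_begin; k < k_end; k++) {
                    for (int j = 0; j < N; j++) {
                        local[i * N + j] += A[i * K + k] * B[k * N + j];
                    }
                }
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }

    // Merge step: runs on one thread after the joins, so it needs no synchronization.
    for (std::size_t idx = 0; idx < size; idx++) {
        double sum = 0.0;
        for (int t = 0; t < num_threads; t++) {
            sum += partial[t][idx];
        }
        C[idx] = sum;
    }
}
```

The trade-off is memory: each thread needs a full M x N buffer. Because the partial sums are added in a different order than in the sequential loop, results can differ from it in the last few bits for general floating-point inputs.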

Conclusion

Parallel matrix multiplication is a classic exercise that teaches you thread management, data decomposition, and synchronization trade-offs. By following this tutorial, you've built a parallel solution, measured efficiency, and observed the impact of mutex vs. shared_mutex. These skills are directly applicable to high-performance computing in AI, finance, gaming, and beyond.

Remember to test with different matrix sizes and thread counts to fully understand scalability. Happy coding!