Parallel Matrix-Vector Multiplication and Power Method Tutorial

Introduction: Why Matrix-Vector Multiplication Matters in 2026

Matrix-vector multiplication is a fundamental operation in scientific computing, machine learning, and graphics. In 2026, with AI models scaling to billions of parameters, efficient parallel computation of matrix-vector products is more critical than ever. This tutorial walks you through a CS140-style assignment: writing a parallel program to multiply a matrix by a vector and using it in the power method to find the largest eigenvalue. We'll use MPI for parallelism and draw analogies from trending topics like AI training, game physics, and social media algorithms.

Understanding the Power Method

The power method is an iterative algorithm to find the dominant eigenvalue (largest in absolute value) of a matrix. It repeatedly multiplies a vector by the matrix and normalizes. Think of it like a trending topic on social media: the most popular post (eigenvector) gets amplified each time it's shared (matrix multiplication), and its popularity growth rate (eigenvalue) emerges.

In AI, eigenvalues help in dimensionality reduction (PCA) and understanding the stability of neural network training. In gaming, they're used in physics engines to detect system stability. The power method is simple but requires many matrix-vector multiplications, making parallelization essential for large matrices.

Parallel Matrix-Vector Multiplication with MPI

We'll use MPI (Message Passing Interface) to distribute the matrix rows among processes. Each process computes its portion of the result vector, then we gather the results. This is a classic 'data parallel' approach.

Algorithm Outline

Each process gets a chunk of rows (say, local_rows).
Broadcast the input vector to all processes.
Each process computes its local product: local_result[i] = sum over j (A[i][j] * x[j]).
Use MPI_Gather to collect all local results into the full result vector on the root process.

This pattern is similar to how a distributed AI training job splits a batch of data across GPUs, computes gradients locally, then aggregates them.

Implementation Steps

1. Matrix Generation

Write a function to generate a test matrix. For the power method, use a symmetric positive-definite matrix to guarantee convergence. A simple choice: A[i][j] = (i+j) % N + 1 or a random matrix. In production, you might use a sparse matrix from a real-world problem (e.g., PageRank, neural network weight matrix).

2. Parallel Multiply Function

void mat_vec_mult(double *local_A, double *x, double *local_result, int local_rows, int N, MPI_Comm comm) {
    MPI_Bcast(x, N, MPI_DOUBLE, 0, comm);
    for (int i = 0; i < local_rows; i++) {
        local_result[i] = 0.0;
        for (int j = 0; j < N; j++) {
            local_result[i] += local_A[i*N + j] * x[j];
        }
    }
}

3. Power Method Routine

double power_method(double *local_A, int local_rows, int N, double *eigenvector, int max_iter, double tol, MPI_Comm comm) {
    double *x = malloc(N * sizeof(double));
    double *local_y = malloc(local_rows * sizeof(double));
    double *y = NULL;
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        y = malloc(N * sizeof(double));
        // Initialize x with random values
        for (int i = 0; i < N; i++) x[i] = 1.0;
    }
    double lambda = 0.0;
    for (int iter = 0; iter < max_iter; iter++) {
        mat_vec_mult(local_A, x, local_y, local_rows, N, comm);
        MPI_Gather(local_y, local_rows, MPI_DOUBLE, y, local_rows, MPI_DOUBLE, 0, comm);
        if (rank == 0) {
            // Compute norm and normalize
            double norm = 0.0;
            for (int i = 0; i < N; i++) norm += y[i]*y[i];
            norm = sqrt(norm);
            for (int i = 0; i < N; i++) x[i] = y[i] / norm;
            // Estimate eigenvalue (Rayleigh quotient)
            lambda = 0.0;
            for (int i = 0; i < N; i++) lambda += y[i] * x[i];
            // Check convergence
            // ... (compare lambda with previous)
        }
        MPI_Bcast(x, N, MPI_DOUBLE, 0, comm);
    }
    if (rank == 0) {
        for (int i = 0; i < N; i++) eigenvector[i] = x[i];
    }
    free(x); free(local_y); if (rank == 0) free(y);
    return lambda;
}

Timing Experiments

Measure the execution time for different matrix sizes (e.g., N=1000, 2000, 4000) and number of processes (1, 2, 4, 8). Use MPI_Wtime() for accurate timing. Plot speedup and efficiency. In practice, you'll see near-linear speedup for large matrices because communication overhead is small relative to computation.

This is analogous to how AI training scales with more GPUs: the computation (matrix multiply) dominates, but communication (all-reduce) becomes a bottleneck at scale. Optimizing the broadcast/gather pattern (e.g., using MPI_Allreduce for the norm) can improve performance.

Real-World Connections

The power method is used in Google's PageRank algorithm (eigenvector centrality), in facial recognition (eigenfaces), and in modern AI for spectral clustering and graph neural networks. In 2026, as AI models grow, efficient eigenvalue computation is key for model compression and understanding training dynamics.

For example, when training a large language model, the Hessian matrix's eigenvalues indicate the curvature of the loss landscape. The power method can quickly estimate the largest eigenvalue to adjust learning rates (like in the Adam optimizer).

Conclusion

Parallelizing matrix-vector multiplication is a foundational skill in high-performance computing. By implementing the power method, you not only learn MPI but also gain insight into eigenvalue computation used across many fields. Experiment with different matrix sizes, process counts, and even try sparse matrices for extra challenge.