Penalized Spline Regression and Kernel Methods Tutorial

Introduction to Computational Statistics with Splines and Kernels

In this tutorial, we explore key techniques from MAST90083 Assignment 2: penalized spline regression, k-nearest neighbor averaging, kernel regression, and expectation-maximization for Gaussian mixtures. These methods are fundamental in data science, used in applications from smoothing financial time series to clustering user behavior in apps. Using the WarsawApts dataset from the HRW library, we'll implement each method step by step. The dataset contains apartment prices in Warsaw, which we can relate to current trends in real estate analytics and AI-driven pricing models.

Setting Up the WarsawApts Dataset

Load the required libraries and data:

library(HRW)
data(WarsawApts)
x <- WarsawApts$construction.date
y <- WarsawApts$areaPerMzloty
n <- length(x)

The variable x represents construction dates, and y is area per million zloty. We'll use these to model non-linear relationships.

Question 1: Penalized Spline Regression

1.1 Generating Knots for Linear Spline Basis

We generate 20 knots using quantiles, excluding extremes to match the data range:

probs <- seq(0, 1, length = 22)[-c(1, 22)]
k <- quantile(x, probs = probs)

This ensures knots are spread across the data distribution, similar to placing anchors in a game level design.

1.2 Constructing the Z Matrix

Create a matrix of linear spline basis functions (x - k_i)+:

Z <- outer(x, k, function(x, k) pmax(x - k, 0))
matplot(x, Z, type = "l", ylim = c(-1, 2), main = "Linear Spline Basis (20 knots)")

Each column is a basis function that activates after its knot, like level thresholds in a game.

1.3 Building C and D Matrices

Construct the design matrix C with a column of ones, x, and Z:

C <- cbind(1, x, Z)

The penalty matrix D is diagonal with ones except first two entries zero:

D <- diag(22)
D[1:2, 1:2] <- 0

This penalizes only the spline coefficients, not the intercept and linear term.

1.4 Tuning Parameter and Fitted Curves

Define 100 lambda values and compute fitted curves:

lambda_seq <- seq(0, 50, length = 100)
beta_hat <- solve(t(C) %*% C + lambda * D) %*% t(C) %*% y
y_hat <- C %*% beta_hat

For each lambda, compute RSS, degrees of freedom, and GCV:

RSS <- sum((y - y_hat)^2)
df <- sum(diag(C %*% solve(t(C) %*% C + lambda * D) %*% t(C)))
GCV <- RSS / (n - df)^2

GCV balances fit and complexity, like tuning a recommendation algorithm to avoid overfitting.

1.5 Optimal Lambda and Fit

Find lambda minimizing GCV and plot the fit:

lambda_opt <- lambda_seq[which.min(GCV)]
y_hat_opt <- C %*% solve(t(C) %*% C + lambda_opt * D) %*% t(C) %*% y
plot(x, y, main = "Penalized Spline Fit (20 knots)")
lines(x[order(x)], y_hat_opt[order(x)], col = "red", lwd = 2)

Increasing knots to 60 improves flexibility. Compute MSE:

MSE_20 <- mean((y - y_hat_opt)^2)
MSE_60 <- mean((y - y_hat_opt_60)^2)

Typically, MSE decreases with more knots but risks overfitting, similar to increasing model parameters in AI.

Question 2: Non-parametric Regression

2.1 k-Nearest Neighbor Averaging

Implement k-NN for k = 3, 7, 11, 15, 19, 23:

k_values <- seq(3, 23, length = 6)
y_hat_knn <- matrix(NA, n, length(k_values))
for (j in seq_along(k_values)) {
  for (i in 1:n) {
    d <- abs(x[i] - x)
    indices <- order(d)[1:k_values[j]]
    y_hat_knn[i, j] <- mean(y[indices])
  }
}

Plot results in a 3x2 grid and compute MSE:

par(mfrow = c(3, 2))
for (j in seq_along(k_values)) {
  plot(x, y, main = paste("k-NN k =", k_values[j]))
  lines(x[order(x)], y_hat_knn[order(x), j], col = "blue", lwd = 2)
  MSE_knn[j] <- mean((y - y_hat_knn[, j])^2)
}

Small k captures noise, large k oversmooths. Compare MSE with spline: k-NN may perform better for local patterns.

2.2 Kernel Regression with Six Kernels

Write a function that computes six kernel values (e.g., Gaussian, Epanechnikov, etc.):

kernel_fun <- function(u, type) {
  switch(type,
    "gaussian" = dnorm(u),
    "epanechnikov" = 0.75 * (1 - u^2) * (abs(u) <= 1),
    "uniform" = 0.5 * (abs(u) <= 1),
    "triangular" = (1 - abs(u)) * (abs(u) <= 1),
    "biweight" = (15/16) * (1 - u^2)^2 * (abs(u) <= 1),
    "triweight" = (35/32) * (1 - u^2)^3 * (abs(u) <= 1)
  )
}

With bandwidth h = 2, compute Nadaraya-Watson estimator:

h <- 2
f_hat <- matrix(NA, n, 6)
for (i in 1:n) {
  u <- (x - x[i]) / h
  K <- sapply(u, kernel_fun, type = "gaussian")  # repeat for each kernel
  f_hat[i, ] <- colSums(K * y) / colSums(K)
}

Plot fits and compare MSE; Gaussian kernel often performs best.

Question 3: Expectation-Maximization for Gaussian Mixtures

3.1 EM Algorithm for Three Components

We extend the two-component EM to three components with different standard deviations. Use the image data from the assignment (e.g., cameraman). Load and vectorize:

library(png)
img <- readPNG("cameraman.png")
y <- as.vector(img)

Initialize parameters: means, standard deviations, and mixing proportions.

K <- 3
mu <- sample(y, K)
sigma <- rep(sd(y), K)
pi <- rep(1/K, K)

E-step: compute responsibilities

resp <- matrix(0, length(y), K)
for (k in 1:K) {
  resp[, k] <- pi[k] * dnorm(y, mu[k], sigma[k])
}
resp <- resp / rowSums(resp)

M-step: update parameters

for (k in 1:K) {
  Nk <- sum(resp[, k])
  mu[k] <- sum(resp[, k] * y) / Nk
  sigma[k] <- sqrt(sum(resp[, k] * (y - mu[k])^2) / Nk)
  pi[k] <- Nk / length(y)
}

Iterate until convergence. The algorithm is used in clustering image pixels or user segments in apps.

Conclusion

This tutorial covered penalized spline regression, k-NN, kernel regression, and EM for Gaussian mixtures. These techniques are widely applied in data science, from smoothing trends in finance to clustering in AI. Practice with the WarsawApts dataset to master these methods for your assignment.