CAP4613 Homework 1: Activation Functions & Backpropagation Tutorial

Understanding Activation Functions: The Engines of Neural Networks

In deep learning, activation functions determine how neurons fire. For your CAP4613 homework, you'll need to analyze six functions: ReLU, logistic sigmoid, piece-wise linear unit, Swish, ELU, and GELU. Each has distinct gradient regions — fast learning (|gradient| > 0.99), active (0.01–0.99), slow (0–0.01), and inactive (0). Let's break them down with timely examples.
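The four gradient regions above can be sketched as a small helper. This is only an illustration: the function name `classify_region` is hypothetical, and how the exact boundary values 0.01 and 0.99 are assigned is an assumption, since the text does not say which region they belong to.

```python
def classify_region(gradient: float) -> str:
    """Map a gradient value to one of the four regions used in this tutorial."""
    magnitude = abs(gradient)
    if magnitude == 0.0:
        return "inactive"
    if magnitude <= 0.01:   # assumption: 0.01 itself counts as slow
        return "slow"
    if magnitude <= 0.99:   # assumption: 0.99 itself counts as active
        return "active"
    return "fast"


for g in (1.0, 0.25, 0.005, 0.0, -0.03):
    print(f"gradient {g:+.3f} -> {classify_region(g)}")
```

The helper works on the magnitude, so a small negative gradient is classified the same way as a small positive one.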

1. ReLU (Rectified Linear Unit)

ReLU is the default choice for hidden layers, defined as f(x) = max(0, x). Its gradient is 1 for x > 0, 0 for x < 0, and undefined at 0 (commonly set to 0). This creates a large fast-learning region for positive inputs but an inactive region for negative ones. A neuron whose inputs stay in the inactive region receives no gradient and stops updating, which is known as the "dying ReLU" problem. Think of it like a viral app's user engagement: active users (x>0) drive fast learning, while inactive users (x<0) contribute nothing.
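A minimal sketch of ReLU and its gradient in plain Python may help when checking values by hand. The names `relu` and `relu_grad` are illustrative, and the gradient at exactly 0 follows the "set to 0" convention mentioned above.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU, using the convention that the gradient at x == 0 is 0."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 0.1, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```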

2. Logistic Sigmoid

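Sigmoid maps any input to (0,1), and its gradient is f(x)(1-f(x)). The gradient peaks at x=0 with a maximum of 0.25 and falls below 0.01 for |x| greater than about 4.6. Because the gradient never exceeds 0.99, sigmoid has no fast-learning region at all: inputs with |x| below about 4.6 are in the active region and everything beyond is in the slow region. The gradient is never exactly 0 in exact arithmetic, so there is no inactive region either. This is one reason sigmoid is rarely used in hidden layers, like a crowded concert where only the front row hears clearly.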
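To see where the sigmoid gradient crosses the 0.01 threshold, one option is to solve s(1-s) = 0.01 for s and invert the sigmoid. The sketch below does that with the standard library; the names `sigmoid_grad`, `s_star`, and `x_star` are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


# Solve s * (1 - s) = 0.01 for the root with s > 0.5, then invert the sigmoid.
threshold = 0.01
s_star = (1.0 + math.sqrt(1.0 - 4.0 * threshold)) / 2.0
x_star = math.log(s_star / (1.0 - s_star))

print(f"peak gradient at x=0: {sigmoid_grad(0.0):.2f}")
print(f"gradient drops to {threshold} at |x| = {x_star:.2f}")  # about 4.58
```

Since the peak value is 0.25, the same calculation with a threshold of 0.99 has no real solution, which is another way to see that sigmoid never reaches the fast-learning region.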

3. Piece-wise Linear Unit

This function is linear with slope 1 between -1 and 1 and has a shallower slope of 0.2 outside that interval, so it is flatter there but not completely flat. The gradient is 1 for |x|<=1 and 0.2 for |x|>1. So it has a fast region inside [-1,1] and an active region outside. It's like a gaming difficulty curve: easy mode (fast learning) for moderate inputs, harder (active) for extremes.
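The sketch below uses the gradient values given in the text (1 inside [-1,1], 0.2 outside). It additionally assumes the function is continuous at x = -1 and x = 1, which the text does not state, so check the exact definition in your assignment before relying on it. The names `pwl` and `pwl_grad` are hypothetical.

```python
def pwl(x: float, outer_slope: float = 0.2) -> float:
    """Piece-wise linear unit, assumed continuous at x = -1 and x = 1."""
    if x > 1.0:
        return 1.0 + outer_slope * (x - 1.0)
    if x < -1.0:
        return -1.0 + outer_slope * (x + 1.0)
    return x


def pwl_grad(x: float, outer_slope: float = 0.2) -> float:
    """Gradient: 1 inside [-1, 1], outer_slope outside."""
    return 1.0 if abs(x) <= 1.0 else outer_slope


for x in (-3.0, -1.0, 0.5, 2.0):
    print(f"x={x:+.1f}  f={pwl(x):+.2f}  grad={pwl_grad(x):.1f}")
```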

4. Swish

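Swish, f(x) = x * sigmoid(x), is smooth and non-monotonic, and it is often reported to outperform ReLU. Its gradient is f'(x) = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)). At x=1.5 the gradient is about 1.04, which puts it in the fast-learning region, while near x=-5 it is small and slightly negative (about -0.03). This resembles tuning a learning rate in AI model training, where some settings accelerate learning and others barely move it.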
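The Swish gradient follows from the product rule, and a short sketch makes it easy to check individual points. The function names here are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    return x * sigmoid(x)


def swish_grad(x: float) -> float:
    # Product rule: d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


for x in (-5.0, 0.0, 1.5):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  grad={swish_grad(x):+.4f}")
```

Note that the gradient can be negative for negative inputs, so compare its magnitude, not its signed value, against the region thresholds.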

5. ELU (Exponential Linear Unit)

ELU uses f(x) = x for x>=0 and a*(exp(x)-1) for x<0. With a=0.1, its gradient is 1 for x>0 and 0.1*exp(x) for x<0. The negative side has a small but non-zero gradient, which avoids dead neurons: it is in the active region for x above ln(0.1) ≈ -2.30 and in the slow region below that. It's like a school grading system where even below-zero effort gets partial credit.
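A small sketch of ELU with a=0.1 shows where the negative-side gradient falls under the 0.01 threshold. The names are illustrative, and using the right-hand derivative (1) at exactly x=0 is an assumption, since the text only gives the gradient for x>0 and x<0.

```python
import math


def elu(x: float, a: float = 0.1) -> float:
    return x if x >= 0 else a * (math.exp(x) - 1.0)


def elu_grad(x: float, a: float = 0.1) -> float:
    # Assumption: at x == 0 we use the right-hand derivative, 1.
    return 1.0 if x >= 0 else a * math.exp(x)


# The negative-side gradient a * exp(x) equals 0.01 at x = ln(0.01 / a).
slow_boundary = math.log(0.01 / 0.1)

for x in (-4.0, -1.0, 2.0):
    print(f"x={x:+.1f}  elu={elu(x):+.4f}  grad={elu_grad(x):.4f}")
print(f"gradient drops below 0.01 for x < {slow_boundary:.2f}")
```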

6. GELU (Gaussian Error Linear Unit)

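GELU is x * Phi(x), where Phi is the CDF of the standard normal distribution. Its gradient, Phi(x) + x * phi(x) with phi the standard normal density, is near 0 for large negative x, equals 0.5 at x=0, and peaks at about 1.13 around x = sqrt(2) ≈ 1.41 before settling toward 1. GELU is used in modern transformers (like GPT). It's the smooth, probabilistic cousin of ReLU, like a finance model that weights outcomes by probability.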
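A quick numerical scan is a useful check on where the GELU gradient peaks. The sketch below uses `math.erf` for the normal CDF and scans a grid from -5 to 5 in steps of 0.01; the helper names and the grid are illustrative choices.

```python
import math


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gelu(x: float) -> float:
    return x * std_normal_cdf(x)


def gelu_grad(x: float) -> float:
    # Product rule: d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    return std_normal_cdf(x) + x * std_normal_pdf(x)


# Scan a grid from -5 to 5 and report where the gradient is largest.
xs = [i / 100 for i in range(-500, 501)]
x_peak = max(xs, key=gelu_grad)
print(f"gradient peaks near x={x_peak:.2f} with value {gelu_grad(x_peak):.3f}")
print(f"gradient at x=0.5: {gelu_grad(0.5):.3f}")
```

The scan lands next to sqrt(2), which matches setting the derivative of the gradient, phi(x) * (2 - x^2), to zero.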

Backpropagation Through the XOR Network

Your homework requires computing the forward pass and gradients for a 2-layer network solving XOR. The network is f(x) = sigmoid(w · max(0, Wx + c) + b), with initial weights W = [[1,1],[1,1]], c = [0,-1], w = [1,-2], b = -0.5. The four samples are (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0.

Forward Pass Example for (0,0)

Hidden layer: h = max(0, Wx + c) = max(0, [0, -1]) = [0, 0]. Output: z = w·h + b = 0 + (-0.5) = -0.5. Final: y_hat = sigmoid(-0.5) ≈ 0.3775. Loss (cross-entropy) for true label 0: -log(1-0.3775) ≈ 0.474. Repeat for other samples.
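The same forward pass can be repeated for all four samples with a few lines of plain Python, which is handy for checking hand calculations. This is a sketch using the initial weights given above; the helper names `forward` and `bce` are illustrative.

```python
import math

# Initial parameters from the problem statement
W = [[1.0, 1.0], [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = -0.5
samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def forward(x):
    """Return pre-activations, hidden activations, logit z, and y_hat."""
    pre = [W[j][0] * x[0] + W[j][1] * x[1] + c[j] for j in range(2)]
    h = [max(0.0, p) for p in pre]
    z = w[0] * h[0] + w[1] * h[1] + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return pre, h, z, y_hat


def bce(y_hat, y):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


for x, y in samples:
    pre, h, z, y_hat = forward(x)
    print(f"x={x}  h={h}  z={z:+.2f}  y_hat={y_hat:.4f}  loss={bce(y_hat, y):.4f}")
```

With these initial weights every sample ends up with z = -0.5 or z = +0.5, so all four losses come out equal.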

Gradient Computation

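Using the chain rule, compute the partial derivatives for each parameter. With a sigmoid output and cross-entropy loss, dL/dz = y_hat - y. For sample (0,0): dL/dw = (y_hat - y) * h = (0.3775 - 0) * [0,0] = [0,0], and dL/db = (y_hat - y) = 0.3775. For W and c, backpropagate through ReLU: the pre-activations are [0, -1], neither is positive, so the ReLU derivative is 0 for both units (using the convention of 0 at x=0) and the gradients for W and c are zero. This shows how inactive ReLU units can stall learning for a sample, a key insight for robustness.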
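The chain-rule steps can be written out explicitly for any single sample. The sketch below assumes the cross-entropy loss and the ReLU-gradient-is-0-at-0 convention used in this tutorial; the function name `gradients` and the dictionary layout are illustrative.

```python
import math

# Initial parameters from the problem statement
W = [[1.0, 1.0], [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = -0.5


def gradients(x, y):
    """Per-sample gradients of the cross-entropy loss for every parameter."""
    # Forward pass
    pre = [W[j][0] * x[0] + W[j][1] * x[1] + c[j] for j in range(2)]
    h = [max(0.0, p) for p in pre]
    z = w[0] * h[0] + w[1] * h[1] + b
    y_hat = 1.0 / (1.0 + math.exp(-z))

    # Backward pass
    dz = y_hat - y                      # sigmoid + cross-entropy
    grad_w = [dz * hj for hj in h]      # dL/dw
    grad_b = dz                         # dL/db
    # Through ReLU: gradient flows only where the pre-activation is positive
    dpre = [dz * w[j] if pre[j] > 0 else 0.0 for j in range(2)]
    grad_W = [[dpre[j] * x[i] for i in range(2)] for j in range(2)]
    grad_c = dpre
    return {"W": grad_W, "c": grad_c, "w": grad_w, "b": grad_b}


for x, y in [((0.0, 0.0), 0.0), ((1.0, 1.0), 0.0)]:
    print(x, gradients(x, y))
```

For (1,1) both hidden units are active, so W and c receive non-zero gradients, in contrast to (0,0).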

Implementing and Training in Python

Use PyTorch or TensorFlow to verify hand calculations. Here's a minimal PyTorch implementation:

import torch
import torch.nn as nn
import torch.optim as optim

class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 2, bias=True)
        self.output = nn.Linear(2, 1, bias=True)
        # Initialize weights manually
        self.hidden.weight.data = torch.tensor([[1.,1.],[1.,1.]])
        self.hidden.bias.data = torch.tensor([0., -1.])
        self.output.weight.data = torch.tensor([[1., -2.]])
        self.output.bias.data = torch.tensor([-0.5])

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        y = torch.sigmoid(self.output(h))
        return y

net = XORNet()
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

# Training data
X = torch.tensor([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
y = torch.tensor([[0.],[1.],[1.],[0.]])

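# Train for 100 epochs, taking one SGD step per sample
losses = []
for epoch in range(100):
    epoch_loss = 0.0
    for i in range(4):
        optimizer.zero_grad()
        output = net(X[i])
        loss = criterion(output, y[i])
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Record the mean loss over all four samples, not just the last one
    losses.append(epoch_loss / 4)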

# Plot losses (use matplotlib)
import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss over Epochs')
plt.show()

Adversarial Examples: The Security Angle

An optimal adversarial example (OAE) is the closest input that flips the classification. For the initial network, test points near (0,0) with small perturbations. Adding epsilon to both inputs gives z = 2*epsilon - 0.5 (for epsilon < 0.5), so any epsilon above 0.25 pushes the output above 0.5 and misclassifies the point as class 1. After training, the decision boundary typically becomes tighter, so the OAE distance can shrink, but an adversarial example still exists. This motivates activation functions designed for robustness, such as JumpReLU.
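One simple way to approximate the OAE for the initial network is a brute-force grid search over perturbations, keeping the closest point whose predicted class differs. This is only a sketch: the function name `closest_flip`, the L2 distance, the search radius, and the 0.01 grid step are all assumptions, and the result is accurate only to the grid resolution.

```python
import math

# Initial parameters from the problem statement
W = [[1.0, 1.0], [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = -0.5


def predict(x):
    """Network output y_hat for a 2-D input."""
    h = [max(0.0, W[j][0] * x[0] + W[j][1] * x[1] + c[j]) for j in range(2)]
    z = w[0] * h[0] + w[1] * h[1] + b
    return 1.0 / (1.0 + math.exp(-z))


def closest_flip(x0, radius=1.0, step=0.01):
    """Grid-search the perturbation with the smallest L2 norm that flips the class."""
    base_label = predict(x0) > 0.5
    n = round(radius / step)
    best = None
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            delta = (i * step, j * step)
            x = (x0[0] + delta[0], x0[1] + delta[1])
            if (predict(x) > 0.5) != base_label:
                dist = math.hypot(delta[0], delta[1])
                if best is None or dist < best[0]:
                    best = (dist, x)
    return best


dist, x_adv = closest_flip((0.0, 0.0))
print(f"closest flip found at {x_adv}, L2 distance {dist:.3f}")
```

For (0,0) the search should land just past the line x1 + x2 = 0.5, close to (0.25, 0.25), at a distance of roughly 0.35 to 0.36.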

JumpReLU: Boosting Robustness

JumpReLU introduces a threshold q: f(x) = x if x >= q, else 0. With q>0, small positive activations are zeroed out, which reduces sensitivity to noise. In your trained XOR network, replace ReLU with JumpReLU and tune q (e.g., q=0.1). The OAE distance should increase, meaning the network is more robust; confirm this by recomputing the OAE for each q you try. This is like adding a spam filter that blocks borderline emails.
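The definition above translates directly into code. The sketch below is a scalar version with a hypothetical name `jump_relu`; in a PyTorch model you would apply the same rule elementwise to the hidden layer.

```python
def jump_relu(x: float, q: float = 0.1) -> float:
    """JumpReLU: pass x through only when it reaches the threshold q."""
    return x if x >= q else 0.0


# Small positive activations below q are suppressed; larger ones pass unchanged.
for x in (-1.0, 0.05, 0.1, 0.5):
    print(f"x={x:+.2f}  jump_relu={jump_relu(x):.2f}")
```

With q=0 the function reduces to ordinary ReLU, which makes it easy to compare the two settings in one experiment.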

Why This Matters for Your Homework

Understanding gradient regions helps you choose activation functions for faster convergence. Backpropagation skills are essential for debugging neural nets. And adversarial robustness is a hot topic in AI safety — think of self-driving cars or facial recognition systems that can be fooled by tiny perturbations. By completing this assignment, you're building foundations for cutting-edge research.

Final Tips

Double-check your gradient regions: plot each function's gradient from -5 to 5.
For the XOR network, manually compute all four samples and compare with code.
Use numpy to compute the function values and matplotlib to plot them.
Comment on training loss: does it converge? Why or why not?
For extra credit, experiment with JumpReLU's q parameter.

Good luck with your CAP4613 homework! Remember to submit PDF and code separately.