Activation Functions and Backpropagation Tutorial

Introduction to Activation Functions and Backpropagation

This tutorial is written for students working on CAP4613/5619 homework #1, which covers activation functions and backpropagation. It explains how the gradient of an activation function affects learning speed, walks through a manual forward and backward pass on a small XOR network, and then reproduces that network in a deep learning framework. By the end, you should be able to plot gradients, compute losses, and train simple networks.

Why Activation Functions Matter

Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns. As noted in your assignment, the learning speed depends on the gradient magnitude. Let's break down the four regions:

Fast learning region: |gradient| > 0.99
Active learning region: 0.01 ≤ |gradient| ≤ 0.99
Slow learning region: 0 < |gradient| < 0.01
Inactive learning region: gradient = 0

These regions help explain why some activation functions train faster than others. For example, ReLU's gradient is 1 for positive inputs (fast region) and 0 for negative inputs (inactive region), which can cause dead neurons. The logistic sigmoid, by contrast, never reaches the fast region: its gradient peaks at 0.25 at x = 0 and falls below 0.01 once |x| exceeds roughly 4.6, so saturated units learn slowly.
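
The region boundaries above can be turned into a small labeling helper. The following is a minimal sketch, not part of the assignment; the function name learning_region and the sample points are assumptions chosen for illustration.

```python
import numpy as np

def learning_region(grad):
    """Label each gradient value with one of the four regions listed above."""
    g = np.abs(np.asarray(grad, dtype=float))
    labels = np.full(g.shape, "inactive", dtype=object)  # gradient == 0
    labels[(g > 0) & (g < 0.01)] = "slow"
    labels[(g >= 0.01) & (g <= 0.99)] = "active"
    labels[g > 0.99] = "fast"
    return labels

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
relu_regions = learning_region(np.where(x > 0, 1.0, 0.0))
sigmoid_regions = learning_region(sigmoid_grad(x))
print(list(relu_regions))
print(list(sigmoid_regions))
```

Applying the same helper to a dense grid of x values gives the intervals of each region for any of the activation functions in Problem 1.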

Plotting Gradients: A Step-by-Step Approach

To complete Problem 1, you'll need to plot the gradient of each activation function from -5 to 5. Here's a quick guide using Python with NumPy and Matplotlib:

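import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 1000)

# ReLU gradient: 1 for x > 0, 0 otherwise (the value at x = 0 is a convention)
def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

plt.plot(x, relu_grad(x))
plt.title('ReLU Gradient')
plt.xlabel('x')
plt.ylabel('gradient')
plt.show()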

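For the logistic sigmoid, the gradient is σ(x)(1-σ(x)). The piecewise linear unit has a gradient of 0.2 for x < -1, 1 for |x| ≤ 1, and 0 for x > 1. Swish (x * sigmoid(x)) has the gradient σ(x) + x·σ(x)(1-σ(x)), which is not always positive: it is negative for x below about -1.28, approaches 0 for large negative values, and rises slightly above 1 for larger positive inputs. ELU with α=0.1 has gradient 1 for x > 0 and 0.1*exp(x) for x < 0; at x = 0 the two one-sided slopes differ (1 vs. 0.1), so state which convention you plot. GELU is x·Φ(x), where Φ is the standard normal CDF, so its gradient is Φ(x) + x·φ(x), with φ the standard normal density. Φ involves the error function; you can also approximate GELU using the tanh approximation.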
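
The closed-form gradients for sigmoid, Swish, ELU, and GELU can be sketched in NumPy as follows. This is an illustrative sketch, not the assignment's reference solution: the function names are assumptions, ELU uses gradient 1 at x = 0 as a convention, and GELU uses the exact error-function form through math.erf rather than the tanh approximation.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def swish_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def elu_grad(x, alpha=0.1):
    # Assumed convention: gradient 1 at x = 0
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

_erf = np.vectorize(math.erf)  # element-wise erf without needing SciPy

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

x = np.linspace(-5, 5, 1001)
grads = {
    "sigmoid": sigmoid_grad(x),
    "swish": swish_grad(x),
    "elu": elu_grad(x),
    "gelu": gelu_grad(x),
}
for name, g in grads.items():
    print(name, float(g.min()), float(g.max()))
```

Each array in grads can be passed to plt.plot(x, ...) in the same way as the ReLU gradient.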

Backpropagation Through a Simple XOR Network

Problem 2 involves a neural network for XOR with a sigmoid output. The network is defined as f(x) = σ(w·max(0, Wx + c) + b). Initial parameters: W = [[1,1],[1,1]], c = [0,-1], w = [1,-2], b = -0.5. The four training samples are (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0.

Let's compute the forward pass for sample (0,0):

z = Wx + c = [0, -1]
h = max(0, z) = [0, 0]
a = w·h + b = -0.5
output = σ(a) = 1/(1+exp(0.5)) ≈ 0.3775
Loss = -[y*log(output) + (1-y)*log(1-output)]; for y=0, loss ≈ -log(0.6225) ≈ 0.474
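
To check the hand calculation and repeat it for the remaining samples, the forward pass can be sketched in NumPy. This is a minimal sketch using the initial parameters given above; the helper name forward is an assumption.

```python
import numpy as np

# Initial parameters from the problem statement
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(x):
    z = W @ x + c               # hidden pre-activation
    h = np.maximum(0.0, z)      # ReLU
    a = w @ h + b               # output pre-activation
    return 1.0 / (1.0 + np.exp(-a))

outputs = np.array([forward(x) for x in X])
losses = -(y * np.log(outputs) + (1 - y) * np.log(1 - outputs))
for x, out, loss in zip(X, outputs, losses):
    print(x, round(float(out), 4), round(float(loss), 4))
```

With these parameters the outputs are approximately 0.3775, 0.6225, 0.6225, and 0.3775, and every sample has a loss of about 0.474.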

Repeat for the other samples. The parameter gradients follow from the chain rule. For binary cross-entropy with a sigmoid output, ∂L/∂a = (output - y), so ∂L/∂w1 = (output - y) * h1 and ∂L/∂b = (output - y). For (0,0), h1 = 0, so ∂L/∂w1 = 0, while ∂L/∂b ≈ 0.3775.
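
The chain-rule steps can be written out for all parameters and checked against a finite-difference estimate. The sketch below assumes the ReLU subgradient is 0 at z = 0; the helper name loss_and_grads and the choice of sample (1,1) are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_and_grads(W, c, w, b, x, y):
    z = W @ x + c
    h = np.maximum(0.0, z)
    out = sigmoid(w @ h + b)
    loss = -(y * np.log(out) + (1 - y) * np.log(1 - out))
    da = out - y               # dL/da for sigmoid + binary cross-entropy
    dw = da * h                # dL/dw
    db = da                    # dL/db
    dz = da * w * (z > 0)      # back through ReLU (subgradient 0 at z = 0)
    dW = np.outer(dz, x)       # dL/dW
    dc = dz                    # dL/dc
    return loss, dW, dc, dw, db

W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

x, y = np.array([1.0, 1.0]), 0.0
loss, dW, dc, dw, db = loss_and_grads(W, c, w, b, x, y)

# Central finite difference on w[0] as a sanity check
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric = (loss_and_grads(W, c, w_plus, b, x, y)[0]
           - loss_and_grads(W, c, w_minus, b, x, y)[0]) / (2 * eps)
print(dw[0], numeric)
```

The same check can be repeated for any entry of W, c, or b by perturbing that entry instead.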

Implementing the Network in a Deep Learning Framework

For Problem 3, use PyTorch or TensorFlow to implement the same network. Here's a PyTorch example:

import torch
import torch.nn as nn
import torch.optim as optim

class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 2, bias=True)
        self.fc2 = nn.Linear(2, 1, bias=True)
        # Set the initial parameters from Problem 2 without tracking gradients
        with torch.no_grad():
            self.fc1.weight.copy_(torch.tensor([[1., 1.], [1., 1.]]))
            self.fc1.bias.copy_(torch.tensor([0., -1.]))
            self.fc2.weight.copy_(torch.tensor([[1., -2.]]))
            self.fc2.bias.copy_(torch.tensor([-0.5]))

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        out = torch.sigmoid(self.fc2(h))
        return out

model = XORNet()
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training data
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

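# Train for 100 epochs, taking one SGD step per sample
epoch_losses = []
for epoch in range(100):
    running_loss = 0.0
    for i in range(4):
        optimizer.zero_grad()
        output = model(X[i])
        loss = criterion(output, y[i])
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Record the mean per-sample loss at the end of each epoch
    epoch_losses.append(running_loss / 4)

# Plot loss vs. epoch
import matplotlib.pyplot as plt
plt.plot(epoch_losses)
plt.xlabel('epoch')
plt.ylabel('mean training loss')
plt.show()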

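Before training, run the model on the four samples and verify that the initial outputs match your manual calculations; once the optimizer has taken a step, the parameters no longer match Problem 2. Then plot the training loss against the epoch number. It should generally decrease as gradient descent updates the parameters.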

Finding Optimal Adversarial Examples

An optimal adversarial example (OAE) for a set T is the input that is classified differently from some sample in T and lies at the smallest Euclidean distance from that sample. For the initial network, classify a point as class 1 if the output is greater than 0.5 and as class 0 otherwise. You can search over the input space to find an OAE. After training, the decision boundary changes, so the OAE may differ.
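
One simple way to search the input space is a brute-force grid search, sketched below for the initial network. The search window [-1, 2] x [-1, 2] and the 0.01 step are assumptions, and the result is only accurate to about the grid resolution.

```python
import numpy as np

# Initial parameters from Problem 2
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

def predict(points):
    """Class labels (0 or 1) for an (n, 2) array of inputs."""
    h = np.maximum(0.0, points @ W.T + c)
    a = h @ w + b
    return (a > 0).astype(int)  # sigmoid(a) > 0.5 exactly when a > 0

T = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hypothetical search window and resolution
axis = np.linspace(-1.0, 2.0, 301)
gx, gy = np.meshgrid(axis, axis)
grid = np.column_stack([gx.ravel(), gy.ravel()])
grid_labels = predict(grid)

best_dist, best_point, best_sample = np.inf, None, None
for sample, label in zip(T, predict(T)):
    candidates = grid[grid_labels != label]          # points with a different class
    dists = np.linalg.norm(candidates - sample, axis=1)
    k = int(np.argmin(dists))
    if dists[k] < best_dist:
        best_dist, best_point, best_sample = dists[k], candidates[k], sample

print(best_sample, best_point, best_dist)
```

For these initial parameters the output depends only on s = x1 + x2, and the network predicts class 1 exactly when 0.5 < s < 1.5, so every training sample is 0.5/√2 ≈ 0.354 from the decision boundary. The grid search should return a distance close to that value.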

JumpReLU and Robustness

JumpReLU is defined as JumpReLU(x) = x if x ≥ θ, else 0, where θ is a learnable parameter. By setting θ > 0, it can filter out small noise, potentially improving robustness. For the trained XOR network, replace ReLU with JumpReLU and optimize θ to minimize loss; then recompute the OAE. You may find that the OAE distance increases, indicating improved robustness.
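
A minimal sketch of JumpReLU and a sweep over θ follows. Two assumptions are made here: the initial parameters from Problem 2 stand in for the trained ones, which are not given in this tutorial, and θ is chosen by a grid sweep rather than gradient descent, because the hard threshold makes the loss piecewise constant in θ and its gradient with respect to θ is zero almost everywhere.

```python
import numpy as np

# Stand-in parameters (initial values from Problem 2, not trained values)
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def jump_relu(z, theta):
    # JumpReLU(z) = z if z >= theta, else 0
    return np.where(z >= theta, z, 0.0)

def mean_loss(theta):
    h = jump_relu(X @ W.T + c, theta)
    out = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    return float(np.mean(-(y * np.log(out) + (1 - y) * np.log(1 - out))))

# Hypothetical sweep range for theta
thetas = np.linspace(0.0, 2.0, 21)
losses = np.array([mean_loss(t) for t in thetas])
for t, l in zip(thetas, losses):
    print(round(float(t), 2), round(l, 4))
```

With these stand-in parameters the training loss stays flat for θ between 0 and 1 and grows beyond that, so the loss alone does not single out one θ. Recomputing the OAE for several candidate values shows how each choice moves the decision boundary.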

Conclusion

This tutorial covered key concepts for your CAP4613 homework: activation function gradients, backpropagation, and implementation. Connecting these to current practice, such as the use of GELU in transformers or ELU in deep networks, shows their practical importance. Remember to plot gradients carefully, double-check your manual calculations, and experiment with frameworks. Good luck!