Programming lesson
Mastering Proximal Gradient and KKT Conditions for Convex Optimization: A Homework Guide for ECE490
A comprehensive tutorial covering the proximal-gradient algorithm for strongly convex functions and KKT conditions for constrained optimization, with timely examples from AI model training and portfolio optimization.
Introduction: Why Proximal Gradient and KKT Matter in 2026
In modern machine learning and AI, optimization is the engine that powers everything from training large language models to fine-tuning recommendation systems. As of May 2026, the demand for efficient optimization algorithms has never been higher, especially with the rise of real-time AI applications like autonomous agents and personalized finance apps. In this tutorial, we break down the key concepts from ECE490 Homework 6: proximal-gradient methods for strongly convex functions and KKT conditions for polyhedral constraints. Whether you're preparing for a career in AI research or quantitative finance, mastering these tools is essential.
Part 1: Proximal-Gradient Algorithm for Strongly Convex Functions
1.1 Setting the Stage: Strong Convexity and Smoothness
Let f be m-strongly convex and L-smooth. This means f satisfies:
- Strong convexity: f(y) ≥ f(x) + ∇f(x)ᵀ(y−x) + (m/2)‖y−x‖²
- Smoothness: ‖∇f(x)−∇f(y)‖ ≤ L‖x−y‖
Define fm(x) = f(x) − (m/2)‖x‖². This function appears often when we want to isolate the smooth part.
1.2 Proving Convexity of fm with (L−m)-Lipschitz Gradients
Since f is m-strongly convex, the function f(x) − (m/2)‖x‖² is convex. Moreover, the gradient of fm is ∇fm(x) = ∇f(x) − mx. Using the smoothness of f, we get:
‖∇fm(x) − ∇fm(y)‖ = ‖∇f(x) − ∇f(y) − m(x−y)‖ ≤ L‖x−y‖ + m‖x−y‖ = (L−m)‖x−y‖Thus, fm is convex with (L−m)-Lipschitz gradients.
1.3 Proximal-Gradient Algorithm for Composite Functions
Consider the composite objective: minx fm(x) + g(x), where fm is smooth and g is convex but possibly nonsmooth. The proximal-gradient update is:
xk+1 = proxαg(xk − α∇fm(xk))where proxαg(z) = argminx g(x) + (1/(2α))‖x−z‖².
1.4 Relationship with Gradient Descent on f
If we choose α = 1/L, then the proximal-gradient step on fm+g may not coincide with gradient descent on f unless g is zero. However, when g is an indicator function of a convex set, the proximal operator becomes projection, and the algorithm reduces to projected gradient descent. In general, there is no single steplength that makes the iterates identical for all g.
1.5 Convergence Rate for Proximal Gradient
For m-strongly convex f, the proximal-gradient method with steplength α = 1/L achieves linear convergence:
‖xk − x*‖² ≤ (1 − m/L)k ‖x0 − x*‖²This mirrors the rate of gradient descent on f itself, showing that the proximal operator does not degrade performance.
Part 2: KKT Conditions for Polyhedral Constraints
2.1 Problem Formulation
Consider minimizing a smooth function f over a polyhedral set:
min f(x) subject to Ex = g, Cx ≥ dwhere E ∈ ℝm×n, g ∈ ℝm, C ∈ ℝp×n, d ∈ ℝp. This is a linearly constrained optimization problem with both equality and inequality constraints.
2.2 Reformulation with Slack Variables
Introduce slack variables s ∈ ℝp to convert inequalities to equalities:
min f(x) subject to Ex = g, Cx − s = d, s ≥ 0Now define X = (x, s), A = [E 0; C −I], b = (g, d). The problem becomes:
min f(x) subject to A X = b, s ≥ 02.3 Applying Theorem 10.5 (Lagrange Multiplier Rule)
Under constraint qualifications (e.g., linear independence of active constraints), there exist Lagrange multipliers λ ∈ ℝm (for equalities) and ν ∈ ℝp (for the equality part of slacks), and μ ∈ ℝp (for the inequality s ≥ 0), such that:
- Stationarity: ∇f(x*) − Eᵀλ − Cᵀν = 0 and −ν − μ = 0
- Primal feasibility: Ex* = g, Cx* − s* = d, s* ≥ 0
- Complementary slackness: μi si* = 0, μi ≥ 0
Eliminating s and ν (note ν = −μ), we obtain the classical KKT conditions:
∇f(x*) − Eᵀλ − Cᵀμ = 0, Ex* = g, 0 ≤ μ ⟂ Cx* − d ≥ 0where ⟂ denotes orthogonality (componentwise).
Real-World Applications and Timely Examples
AI Model Training
In 2026, training large-scale neural networks often uses proximal-gradient methods for regularization (e.g., L1 regularization for sparsity). The KKT conditions are used to verify optimality in support vector machines and portfolio optimization with linear constraints.
Portfolio Optimization
Imagine you're building a robo-advisor app that allocates assets subject to budget and risk constraints. The KKT conditions help identify the optimal portfolio where marginal returns balance constraint penalties.
Conclusion
This tutorial covered the proximal-gradient algorithm for strongly convex composite functions and derived KKT conditions for polyhedral constraints. These concepts are not just academic—they are used daily in AI, finance, and engineering. By understanding them, you're equipped to tackle real-world optimization problems.