Rate Distortion Theory Tutorial: Shannon Lower Bound & Uniform Source with Hamming Distortion

Understanding Rate Distortion Functions

Rate distortion theory is a cornerstone of lossy compression, answering how many bits are needed to describe a source with a given distortion. This tutorial covers three key problems: the Shannon lower bound for continuous sources, rate distortion for a uniform source with Hamming distortion, and the behavior of independent sources. We'll use timely examples from AI model compression (like reducing neural network size for edge devices) and streaming video (balancing quality and bandwidth) to make concepts relatable.

1. Shannon Lower Bound for Continuous Random Variables

Let X be a continuous random variable with mean zero and variance σ². The rate-distortion function R(D) for mean-squared error (MSE) distortion satisfies the Shannon lower bound. This bound states that for any source, R(D) ≥ (1/2) log₂(σ²/D) for 0 ≤ D ≤ σ². This is derived from the fact that the differential entropy of X is at most (1/2) log₂(2πeσ²), and the conditional entropy h(X|X̂) is at most (1/2) log₂(2πeD).

Proof sketch for part (a): Using the chain rule, I(X;X̂) = h(X) – h(X|X̂). Since h(X) is bounded above by the Gaussian entropy, and h(X|X̂) is bounded below by the entropy of the error (which has variance at least D), we get the lower bound. For the upper bound (part b), consider the joint distribution in Figure 1: let Z ~ N(0, σ² – D) independent of X, and set X̂ = (σ² – D)/σ² * X + Z? Actually, the figure suggests a multiplication by a constant. The exact construction yields a joint Gaussian that achieves the lower bound, proving that for Gaussian sources, R(D) = (1/2) log₂(σ²/D) for 0 ≤ D ≤ σ².

Are Gaussian sources harder or easier to describe? Among all sources with the same variance, Gaussian sources require the highest rate for a given distortion because they have the largest entropy. This is analogous to compressing a high-detail 4K video versus a simple cartoon: the 4K video (more random, like Gaussian) needs more bits to achieve the same MSE.

2. Rate Distortion for Uniform Source with Hamming Distortion

Consider X uniformly distributed on {1,2,…,m}. Hamming distortion is d(x,x̂) = 0 if x = x̂, else 1. The rate-distortion function R(D) for this source is:

For D ≥ (m-1)/m, R(D) = 0 because we can always guess a fixed value (e.g., 1) and achieve distortion (m-1)/m.
For 0 ≤ D ≤ (m-1)/m, R(D) = H(D) + (1-D) log₂(m-1) – H(D)?? Wait, let's derive.

Step (a): When D ≥ (m-1)/m, we can set X̂ = 1 always. Then E[d] = P(X ≠ 1) = (m-1)/m ≤ D, so R(D)=0.

Step (b): For D < (m-1)/m, use Fano's inequality. For any joint distribution with E[d] ≤ D, we have H(X|X̂) ≤ H(D) + D log₂(m-1). Since H(X) = log₂ m, we get I(X;X̂) ≥ log₂ m – H(D) – D log₂(m-1). This is the lower bound.

Step (c): Achievability: consider a test-channel where with probability 1-D, X̂ = X, and with probability D, X̂ is uniform on the remaining m-1 symbols. This yields the lower bound exactly.

3. Rate Distortion for Two Independent Sources

Question: Can we compress two independent sources together better than separately? For independent processes {Xi} and {Yi} with individual rate-distortion functions R_X(D1) and R_Y(D2), the joint rate-distortion function R_{X,Y}(D1,D2) satisfies R_{X,Y}(D1,D2) ≥ R_X(D1) + R_Y(D2). This is because mutual information is additive for independent sources, but the distortion constraints are separate, so we cannot do better than the sum. Equality holds if we can code them independently, which is possible because they are independent. So the joint rate is exactly the sum.

4. Distortion-Rate Function D(R)

D(R) is the minimum distortion achievable at rate R. It is decreasing in R (more bits allow lower distortion) and convex in R (diminishing returns). The convexity follows from the fact that the set of achievable (R,D) pairs is convex.

For i.i.d. sources, the distortion-rate function satisfies D(R) ≥ D_{single-letter}(R). The proof uses the data processing inequality and convexity.

Understanding these concepts is crucial for modern applications like training quantized neural networks (where we trade off model size for accuracy) or adaptive bitrate streaming (where video quality adjusts to bandwidth).