Information Theory Tutorial: Minimizing Error, Gaussian Capacity, Joint Typicality

Introduction to Information Theory and Error Minimization

Information theory forms the backbone of modern communication systems, from 5G networks to satellite communications. In this tutorial, we explore fundamental concepts such as minimizing channel probability of error, capacity of Gaussian channels with correlated looks, and joint typicality. These topics are central to EE276 homework #5 and are essential for understanding how data is reliably transmitted over noisy channels.

1. Minimizing Channel Probability of Error

Consider a communication system where a message J is uniformly distributed over {1,2,…,M}. The encoder maps J to a length-n codeword X^n from a codebook c_n. The codeword is sent through a memoryless channel with transition probability P_{Y|X}, yielding output Y^n. The decoder produces an estimate Ĵ of J and may also output an error symbol. The probability of error P_e = P(Ĵ ≠ J) is minimized by the maximum a posteriori (MAP) decoding rule:

Ĵ(y^n) = argmax_{1≤j≤M} P(J = j | Y^n = y^n)

This rule selects the message that maximizes the posterior probability given the received sequence. Because J is uniform, the prior P(J = j) = 1/M is the same for every message, so MAP decoding reduces to maximum likelihood (ML) decoding: Ĵ(y^n) = argmax_j P(Y^n = y^n | J = j). This result is fundamental in channel coding theory, and practical decoders for LDPC codes and turbo codes approximate it with iterative algorithms.
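The decoding rule can be tried out on a toy example. The sketch below assumes a binary symmetric channel with crossover probability p and a length-3 repetition code; the codebook, the prior, and the value of p are all hypothetical choices made for illustration, and messages are indexed from 0 rather than 1.

```python
from itertools import product

def likelihood(y, x, p):
    """P(Y^n = y | X^n = x) for a memoryless BSC with crossover probability p."""
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= p if yi != xi else 1.0 - p
    return prob

def map_decode(y, codebook, prior, p):
    """Return the message index j maximizing P(J = j) * P(y | x(j))."""
    # The posterior is proportional to prior * likelihood; P(y) does not depend on j.
    scores = [prior[j] * likelihood(y, codebook[j], p) for j in range(len(codebook))]
    return max(range(len(codebook)), key=lambda j: scores[j])

def error_probability(codebook, prior, p):
    """Exact P(J_hat != J) of the MAP decoder, summing over all channel outputs."""
    n = len(codebook[0])
    p_correct = 0.0
    for y in product((0, 1), repeat=n):
        j_hat = map_decode(y, codebook, prior, p)
        p_correct += prior[j_hat] * likelihood(y, codebook[j_hat], p)
    return 1.0 - p_correct

# Hypothetical example: length-3 repetition code, M = 2 equiprobable messages.
codebook = [(0, 0, 0), (1, 1, 1)]
prior = [0.5, 0.5]
p = 0.1

decoded = map_decode((0, 1, 0), codebook, prior, p)  # majority vote
p_e = error_probability(codebook, prior, p)          # equals 3 p^2 (1 - p) + p^3
print(decoded, p_e)
```

With a uniform prior the factor prior[j] is constant, so the same function acts as an ML decoder; for this code and p < 1/2 it amounts to a majority vote.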

Why MAP is Optimal

The proof follows from the law of total probability. For any decoder g(y^n), the probability of a correct decision is the average over y^n of P(J = g(y^n) | Y^n = y^n). Each term of that average is largest when g(y^n) is the message with the highest posterior probability, so choosing that message for every y^n minimizes the error probability. This is analogous to how a machine learning classifier chooses the class with the highest predicted probability. In AI-driven communication systems, neural networks are sometimes trained to approximate the MAP decoder for channels that are too complex to model exactly.
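Written out, the argument is a one-line bound. The block below is a sketch for a deterministic decoder g, a symbol introduced here for illustration; if g outputs the error symbol for some y^n, that term simply contributes zero.

```latex
\begin{align*}
P(\hat{J} = J)
  &= \sum_{y^n} P(Y^n = y^n)\, P\bigl(J = g(y^n) \mid Y^n = y^n\bigr) \\
  &\le \sum_{y^n} P(Y^n = y^n)\, \max_{1 \le j \le M} P\bigl(J = j \mid Y^n = y^n\bigr),
\end{align*}
% Equality holds when g(y^n) attains the maximum for every y^n, i.e. for the MAP rule.
```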

2. The Two-Look Gaussian Channel

Consider a Gaussian channel with two correlated observations of X: Y_1 = X + Z_1 and Y_2 = X + Z_2, where (Z_1, Z_2) are jointly Gaussian with zero mean, common variance σ^2, and correlation coefficient ρ, so the covariance matrix K has diagonal entries σ^2 and off-diagonal entries ρσ^2. The input X has power constraint P. The capacity C depends on ρ.

(a) ρ = 1 (Perfectly Correlated)

When ρ = 1, Z_1 = Z_2 almost surely, so the two observations are identical. The channel effectively becomes a single Gaussian channel with noise variance σ^2. Capacity is C = ½ log(1 + P/σ^2) bits per channel use.

(b) ρ = 0 (Independent)

When ρ = 0, the noises are independent, so the receiver sees the same input through two independent Gaussian noises, each with variance σ^2. Averaging the observations gives (Y_1 + Y_2)/2 = X + (Z_1 + Z_2)/2, whose noise variance is σ^2/2. The effective SNR is therefore 2P/σ^2, and capacity is C = ½ log(1 + 2P/σ^2). This is like the receive-diversity gain obtained from multiple antennas in MIMO systems.

(c) ρ = -1 (Perfectly Negatively Correlated)

When ρ = -1, Z_2 = -Z_1 almost surely. By averaging the two observations, the noise cancels: (Y_1 + Y_2)/2 = X. Thus, the channel becomes noiseless, and capacity is infinite, because a real-valued input observed without noise can carry arbitrarily many bits even under the power constraint P. This extreme case illustrates how much a combiner can gain by exploiting correlation between the noises.
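The three cases are instances of one expression. Evaluating ½ log(|K_Y| / |K|) for the covariance above, with a Gaussian input of power P, gives C = ½ log(1 + 2P / (σ^2 (1 + ρ))). The sketch below assumes both noises share the variance σ^2 and uses hypothetical values of P and σ^2; the function name is illustrative.

```python
import math

def two_look_capacity(P, sigma2, rho):
    """Capacity in bits per channel use of Y1 = X + Z1, Y2 = X + Z2.

    Assumes both noises have variance sigma2 and correlation rho, and uses
    C = 1/2 * log2(1 + 2P / (sigma2 * (1 + rho))).
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    if rho == -1.0:
        return math.inf  # the noise cancels in Y1 + Y2
    return 0.5 * math.log2(1.0 + 2.0 * P / (sigma2 * (1.0 + rho)))

P, sigma2 = 1.0, 1.0  # hypothetical values
for rho in (1.0, 0.0, -1.0):
    print(rho, two_look_capacity(P, sigma2, rho))
```

Setting ρ = 1 recovers ½ log(1 + P/σ^2), ρ = 0 recovers ½ log(1 + 2P/σ^2), and the capacity grows without bound as ρ approaches -1.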

3. Output Power Constraint

Consider an AWGN channel Y = X + Z with Z ~ N(0, σ^2), but now the constraint is on the expected output power: E[Y^2] ≤ P. This is different from the usual input power constraint. The capacity is found by maximizing I(X; Y) subject to E[Y^2] ≤ P and X independent of Z.

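Since Y = X + Z with X independent of Z and E[Z] = 0, the cross term vanishes and E[Y^2] = E[X^2] + σ^2 ≤ P, so the input power is at most P - σ^2. The problem is feasible only when P ≥ σ^2. A Gaussian input with that power achieves the capacity C = ½ log(1 + (P - σ^2)/σ^2) = ½ log(P/σ^2). This result is used in power-constrained sensor networks where the transmitter must limit the received signal strength.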
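As a quick numerical check of this reduction, the sketch below converts the output budget into an input budget and applies the usual AWGN formula. The function name and the sample values are illustrative assumptions.

```python
import math

def output_constrained_capacity(P, sigma2):
    """Capacity in bits per channel use of Y = X + Z under E[Y^2] <= P."""
    if P < sigma2:
        raise ValueError("infeasible: the noise alone exceeds the output power budget")
    # Input power left once the noise has used sigma2 of the output budget.
    input_power = P - sigma2
    return 0.5 * math.log2(1.0 + input_power / sigma2)

print(output_constrained_capacity(4.0, 1.0))  # same as 0.5 * log2(P / sigma2)
```

When P equals σ^2 the only admissible input is X = 0 and the capacity is zero.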

4. Gaussian Mutual Information and Markov Chains

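Suppose (X, Y, Z) are jointly Gaussian with X → Y → Z forming a Markov chain. Let ρ_XY = ρ_1 and ρ_YZ = ρ_2. Then I(X; Z) = -½ log(1 - ρ_1^2 ρ_2^2). This follows from two facts. First, for jointly Gaussian variables, mutual information depends only on the correlation coefficient: I(X; Z) = -½ log(1 - ρ_XZ^2). Second, the Markov property makes E[Z | X, Y] = E[Z | Y], which is linear in Y, so the end-to-end correlation is ρ_XZ = ρ_1 ρ_2. The result is consistent with the data processing inequality, since |ρ_1 ρ_2| ≤ min(|ρ_1|, |ρ_2|). This is key in deep learning where information bottlenecks are used to learn compressed representations.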
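The formula is easy to evaluate and to compare against the data processing inequality. The sketch below assumes |ρ| < 1 so that the logarithm is finite; the function names and the sample correlations are hypothetical.

```python
import math

def gaussian_mi(rho):
    """I(U; V) in bits for jointly Gaussian U, V with correlation coefficient rho."""
    return -0.5 * math.log2(1.0 - rho ** 2)

def markov_chain_mi(rho1, rho2):
    """I(X; Z) in bits for a jointly Gaussian Markov chain X -> Y -> Z."""
    # Under the Markov property the end-to-end correlation is rho1 * rho2.
    return gaussian_mi(rho1 * rho2)

rho1, rho2 = 0.9, 0.8  # hypothetical correlations
i_xy = gaussian_mi(rho1)
i_yz = gaussian_mi(rho2)
i_xz = markov_chain_mi(rho1, rho2)
print(i_xy, i_yz, i_xz)
```

For any pair of correlations strictly between -1 and 1, i_xz is no larger than either i_xy or i_yz.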

5. Bottleneck Channel Capacity

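Consider a channel X → V → Y where V takes values in {1,…,k}. The overall transition probability is p(y|x) = Σ_v p(v|x) p(y|v). The claim is that C ≤ log k. This is a direct consequence of the data processing inequality: I(X; Y) ≤ I(X; V) ≤ H(V) ≤ log k for every input distribution, and therefore also for the one that achieves capacity. The bound holds with equality when, for example, both stages are noiseless: the input can select each of the k values of V deterministically, and Y determines V. This concept appears in quantization and feature extraction for machine learning.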
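The chain of inequalities can be checked numerically for a particular channel. The sketch below uses hypothetical transition matrices with k = 2 and a uniform input; it evaluates I(X; V) and I(X; Y) for that one input distribution rather than computing the capacity itself.

```python
import math

def mutual_information(px, channel):
    """I(X; Y) in bits for input pmf px and row-stochastic matrix channel[x][y]."""
    ny = len(channel[0])
    py = [sum(px[x] * channel[x][y] for x in range(len(px))) for y in range(ny)]
    total = 0.0
    for x, p_x in enumerate(px):
        for y in range(ny):
            joint = p_x * channel[x][y]
            if joint > 0.0:
                total += joint * math.log2(channel[x][y] / py[y])
    return total

def compose(p_v_given_x, p_y_given_v):
    """Overall channel p(y|x) = sum_v p(v|x) p(y|v)."""
    k = len(p_y_given_v)
    ny = len(p_y_given_v[0])
    return [[sum(row[v] * p_y_given_v[v][y] for v in range(k)) for y in range(ny)]
            for row in p_v_given_x]

# Hypothetical channel: 4 inputs, a bottleneck with k = 2 states, 3 outputs.
p_v_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
p_y_given_v = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
px = [0.25, 0.25, 0.25, 0.25]
k = 2

i_xv = mutual_information(px, p_v_given_x)
i_xy = mutual_information(px, compose(p_v_given_x, p_y_given_v))
print(i_xy, i_xv, math.log2(k))
```

Because the inequalities hold for every choice of px, maximizing I(X; Y) over px cannot exceed log k either.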

6. Joint Typicality with Independent Sequences

Let (X_i, Y_i, Z_i) be i.i.d. according to p(x,y,z). A triple (x^n, y^n, z^n) is jointly typical if its empirical distributions are close to the true distributions. Now suppose (x^n, y^n, z^n) is drawn from the product of marginals p(x^n)p(y^n)p(z^n), so the sequences are independent but have the same marginals. The probability that they are jointly typical is approximately 2^{-n(H(X)+H(Y)+H(Z)-H(X,Y,Z))}. The exponent can equivalently be written as I(X;Y)+I(X;Z)+I(Y;Z)-I(X;Y;Z), where I(X;Y;Z) = I(X;Y) - I(X;Y|Z) is the multivariate mutual information, which can be negative. This result is used in network information theory for analyzing multi-user channels.
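The exponent can be computed directly from a joint pmf. The sketch below takes a hypothetical example, X and Y fair independent bits with Z = X XOR Y, and evaluates the exponent both as H(X)+H(Y)+H(Z)-H(X,Y,Z) and through pairwise mutual informations, using the sign convention I(X;Y;Z) = I(X;Y) - I(X;Y|Z) for the multivariate term. All function names are illustrative.

```python
import math
from itertools import product

def entropy(pmf):
    """Entropy in bits of a pmf given as a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)

def marginal(joint, axes):
    """Marginal pmf of the coordinates listed in axes."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def h(joint, *axes):
    """Joint entropy of the selected coordinates."""
    return entropy(marginal(joint, axes))

def typicality_exponent(joint):
    """H(X) + H(Y) + H(Z) - H(X, Y, Z) in bits."""
    return h(joint, 0) + h(joint, 1) + h(joint, 2) - h(joint, 0, 1, 2)

def pairwise_mi(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return h(joint, a) + h(joint, b) - h(joint, a, b)

def co_information(joint):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z), expanded into entropies."""
    return (h(joint, 0) + h(joint, 1) + h(joint, 2)
            - h(joint, 0, 1) - h(joint, 0, 2) - h(joint, 1, 2)
            + h(joint, 0, 1, 2))

# Hypothetical example: X, Y fair independent bits, Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

exponent = typicality_exponent(joint)
pairwise = sum(pairwise_mi(joint, a, b) for a, b in ((0, 1), (0, 2), (1, 2)))
print(exponent, pairwise, co_information(joint))
```

In this example every pair of variables is independent, yet the exponent is one bit per symbol, which shows that the multivariate term cannot be dropped.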

Conclusion

Understanding these fundamental concepts (MAP decoding, Gaussian channel capacity, output power constraints, and joint typicality) is crucial for any student of information theory. Whether you are working on EE276 homework or designing the next generation of wireless communication systems, these principles will guide your analysis. For further practice, try applying these ideas to problems involving 5G channel models or satellite links.