Programming lesson
Mastering Entropy in Information Theory: From Joint Entropy to Hamming Codes
A comprehensive tutorial on entropy concepts from joint entropy to Hamming codes and functions of random variables, with step-by-step explanations and trend-inspired examples.
Introduction to Entropy and Information Theory
Entropy is a fundamental concept in information theory, measuring the uncertainty or randomness of a random variable. Whether you're studying for an exam like Ee276 homework #1 p0 or diving into data science, understanding entropy helps you grasp how information is quantified. In this tutorial, we'll explore joint entropy, conditional entropy, mutual information, and applications like Hamming codes—all with timely examples from AI, gaming, and school life.
1. Joint Entropy Example
Consider a joint probability distribution p(x,y) given by a table (not shown here). We want to compute various entropies. Let's break it down.
a) H(X) and H(Y)
The marginal entropies H(X) and H(Y) are computed from the marginal distributions. For a binary random variable, H(X) = -p(0)log p(0) - p(1)log p(1). For example, if p(X=0)=0.5 and p(X=1)=0.5, then H(X)=1 bit. Similarly for Y.
b) H(X|Y) and H(Y|X)
Conditional entropy H(X|Y) measures the uncertainty in X given Y. It is computed as the weighted sum over y of H(X|Y=y). For instance, if Y perfectly determines X, H(X|Y)=0.
c) H(X,Y)
Joint entropy H(X,Y) = -Σ p(x,y) log p(x,y). By the chain rule, H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
d) H(Y) - H(Y|X)
This equals I(X;Y), the mutual information.
e) I(X;Y)
Mutual information measures how much knowing one variable reduces uncertainty about the other. I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).
f) Venn Diagram
A Venn diagram visually shows the relationships: H(X) and H(Y) are two overlapping circles. The intersection is I(X;Y). The left part is H(X|Y), the right part is H(Y|X), and the union is H(X,Y).
This example is typical in homework assignments and exam prep for information theory courses.
2. Entropy of Hamming Code
Hamming codes are error-correcting codes used in AI data transmission and gaming networks. Consider 4 information bits X1-X4 chosen uniformly at random, plus 3 check bits X5-X7 to make parity even. The check bits satisfy equations like X1+X2+X4+X7=0 mod 2.
a) H(X1,...,X7)
Since the 4 information bits are independent and uniform, H(X1,...,X4)=4 bits. The check bits are deterministic functions of the information bits, so H(X1,...,X7)=H(X1,...,X4)=4 bits.
b) Perfect Error Correction
If an error occurs in at most one bit, we can recover the original X from Y = X ⊕ e. The parity checks uniquely identify the error location (syndrome decoding). This is why Hamming codes are used in computer memory and satellite communications.
c) H(X|Y)
Given Y, we can perfectly recover X, so H(X|Y)=0.
d) I(X;Y)
I(X;Y) = H(X) - H(X|Y) = 4 - 0 = 4 bits.
e) H(Y)
Since Y = X ⊕ e, and e is independent with 8 equally likely possibilities (including no error), H(Y) = H(X) + H(e) = 4 + log2(8) = 4 + 3 = 7 bits.
This problem is a classic in error-correcting codes and appears in many university problem sets.
3. Entropy of Functions of a Random Variable
Let X be a discrete random variable. Show that H(g(X)) ≤ H(X). This is intuitive: a function cannot create new information; it can only reduce or maintain uncertainty.
Proof steps:
- H(X, g(X)) = H(X) + H(g(X) | X) = H(X) because g(X) is determined by X.
- Also H(X, g(X)) = H(g(X)) + H(X | g(X)) ≥ H(g(X)) because H(X | g(X)) ≥ 0.
- Thus H(X) ≥ H(g(X)).
This result is used in machine learning feature engineering and data compression.
4. Coin Flips: Geometric Distribution
A fair coin is flipped until first head. Let X be the number of flips. Then P(X=n) = (1/2)^n for n=1,2,... This is the geometric distribution with parameter 1/2. Its entropy H(X) = Σ (1/2)^n log2(2^n) = 2 bits. (You can compute using sum of n/2^n = 2.)
Efficient Questions
To determine X, we can ask yes-no questions like "Is X ≤ 2?" The optimal strategy is binary search, but since the distribution is geometric, a Huffman code gives expected length ≈ H(X) = 2 bits. This illustrates source coding and data compression used in streaming services.
5. Minimum Entropy
What is the minimum entropy over all probability distributions on n outcomes? Since entropy is nonnegative and concave, the minimum is 0, achieved when all probability mass is on one outcome (i.e., p_i=1 for some i, and 0 elsewhere). This corresponds to a deterministic variable.
6. Mixing Increases Entropy
Let p_i > 0. Show that entropy increases when you redistribute probability mass more evenly. For example, the distribution (p1, p2, ..., pm) has lower entropy than (p1, ..., pi-ε, ..., pj+ε, ...) if the latter is more uniform. This is a consequence of Jensen's inequality and explains why diversity increases entropy in ecosystems and investment portfolios.
7. Infinite Entropy (Bonus)
Can entropy be infinite? Yes, for a distribution that decays slowly. Consider P(X=n) = 1/(A n log^2 n) for n≥2. The constant A ensures sum to 1. Then H(X) = Σ P(n) log(1/P(n)) diverges because the sum of 1/(n log n) diverges. This is a theoretical curiosity but important for understanding heavy-tailed distributions in network traffic and social media virality.
Conclusion
Entropy is a powerful tool in information theory, with applications from error-correcting codes to machine learning. By mastering these concepts, you'll be well-prepared for homework solutions and exam success in courses like Ee276. Keep practicing with problem sets and tutorials to deepen your understanding.