Huffman, Golomb, and Arithmetic Coding Tutorial

Introduction: Why Source Coding Matters in 2026

In the age of AI-generated content and streaming services, efficient data compression is more critical than ever. Whether you're optimizing a video game's network traffic or reducing storage for a mobile app, understanding source coding techniques like Huffman, Golomb, and arithmetic coding is essential. This tutorial breaks down the key concepts from EE276 homework #3, using timely examples from gaming, finance, and pop culture to make the material engaging and memorable.

1. Bad Codes: Which Can't Be Huffman Codes?

A Huffman code must satisfy the prefix condition (no codeword is a prefix of another) and be optimal for some probability assignment. Let's analyze each candidate:

{0,10,11}: This is a valid prefix code. It can be a Huffman code (e.g., for probabilities 0.5, 0.25, 0.25).
{00,01,10,110}: This is also a prefix code. It can be a Huffman code (e.g., for probabilities 0.4, 0.2, 0.2, 0.2).
{01,10}: This is a prefix code, but it cannot be a Huffman code because any Huffman code for two symbols must assign codewords of length 1 (e.g., {0,1}) to be optimal. The given code has lengths 2 and 2, which is suboptimal for any probability assignment (the expected length would be 2, while Huffman would give 1).

Analogy: Think of Huffman codes like a gaming leaderboard ranking: the most frequent player (symbol) gets the shortest name (codeword). If you give everyone a long name, you waste screen space.

2. Coin Toss Experiment and Golomb Codes

2a. Huffman Coding for Single Tosses

With p = 15/16, the probability of tails (T) is 1/16. For a single toss, Huffman assigns codewords: H = 0 (length 1), T = 1 (length 1). Expected length = 1 bit per symbol. Compression rate = 1 (no compression).

2b. Block Huffman for r=2

Block of 2 tosses has 4 outcomes with probabilities: HH (225/256), HT (15/256), TH (15/256), TT (1/256). Huffman code: HH = 0, HT = 10, TH = 110, TT = 111. Expected length = 1*225/256 + 2*15/256 + 3*15/256 + 3*1/256 = 1.1836 bits per block → 0.5918 bits per toss. Better than 1!

2c. Optimality as r increases

As r → ∞, the per-symbol expected length approaches the entropy H(p) ≈ 0.348 bits (since H(p) = -p log p - (1-p) log(1-p) ≈ 0.348). So Mikel's scheme approaches the optimum. However, the codebook size grows exponentially (2^r codewords), requiring huge storage.

2d. Naroa's Run-Length Coding

The variable Z_k = number of tosses until next T (including that toss) follows a geometric distribution: P(Z=j) = (1-p)^{j-1} p, with p=1/16. Expectation E[Z] = 1/p = 16. Entropy H(Z) = ( -p log p - (1-p) log(1-p) ) / p = H(p)/p ≈ 0.348/0.0625 = 5.57 bits. Ratio H(Z)/E[Z] = H(p) = 0.348 bits, which equals the entropy of X. Intuitively, run-length coding captures the same information as the original sequence.

2e. Golomb Coding Details

The given code uses a parameter m = 4 (since remainder is coded in 2 bits). General scheme: for Z, compute quotient q = floor((Z-1)/4) and remainder r = (Z-1) % 4. Codeword: q ones followed by a zero, then remainder in binary (2 bits). For example, Z=1: q=0, r=0 → code = 0 00? Wait, table shows Z=1 code = 01? Let's re-check: The table shows Z=1 code = 1 01? Actually, the table in the problem shows Z=1: code = 1 01? No, it's ambiguous. Typically, Golomb code for m=4: codeword = unary(q) + binary(r, 2 bits). Unary: q zeros followed by a 1? Or ones followed by zero? Standard: q ones then 0. So Z=1: q=0 → unary = 0? Actually, unary for 0 is '0', then remainder 0 → '00'? But table shows Z=1 code = '01'? Let's assume the table uses a variant. The expected length: E[L] = q̄ + 1 + 2 = ... This code is efficient because it matches the geometric distribution.

3. Arithmetic Coding

Arithmetic coding works by representing a sequence as a subinterval of [0,1). The decoder knows the distribution and can sequentially narrow down the interval.

3a. Decoding Process

The decoder receives a number (e.g., 0.11). Starting with [0,1), it partitions according to q. It finds which subinterval contains the number, outputs the corresponding symbol, and repeats until n symbols are decoded.

3b. Interval Length

After processing sequence x^n, the interval length l_n = q(x^n) = product of q(x_i).

3c. Shortest Output

If interval length is l, then the number of digits needed is ceil(-log10(l)). For intervals: i. l=0.01 → 2 digits; ii. l=0.01 → 2 digits; iii. l=0.0009 → 4 digits. In general, at most ceil(-log10(l)) digits.

3d. Length Bound

l(x^n) ≤ -log10(q(x^n)) + 1. This is similar to Shannon's coding theorem.

3e. Comparison to Huffman

For i.i.d. sources, arithmetic coding achieves expected length close to entropy, like block Huffman, but with lower complexity (no large codebook). It's used in JPEG, PNG, and modern compressors.

3f. Adaptive Arithmetic Coding

If a neural network predicts probabilities q_i(x|x^{i-1}), modify the encoder to use q_i at each step. This yields expected length = H(X^n) + 2, allowing compression of non-i.i.d. data.

4. Entropy Rate of a Markov Process

Consider a Markov chain on {H,T} with transition probabilities: P(H|H)=0.9, P(T|H)=0.1, P(H|T)=0.5, P(T|T)=0.5. The stationary distribution is π(H)=5/6, π(T)=1/6.

4a. Stationarity

P(X2=H) = P(X1=H)*0.9 + P(X1=T)*0.5 = (5/6)*0.9 + (1/6)*0.5 = 0.75 + 0.0833 = 0.8333 = 5/6. So distribution is stationary.

4b. Conditional Entropy

H(X_n|X_{n-1}) = ∑ π(x) H(p_{x→·}) = (5/6)*h2(0.9) + (1/6)*h2(0.5) ≈ (5/6)*0.469 + (1/6)*1 = 0.3908 + 0.1667 = 0.5575 bits. As n→∞, this is the entropy rate.

4c. Joint Entropy Rate

H(X_1,...,X_n)/n → entropy rate = 0.5575 bits. This matches the conditional entropy limit, confirming the chain's regularity.

5. Universal Compressor for Individual Sequences

Given a binary sequence x^n with n0 zeros and n1 ones, the optimal distribution q* is q*(0)=n0/n, q*(1)=n1/n. The compressor uses -log2 q*(x_i) bits per symbol, total = n * h2(n1/n). To transmit q*, we need log2(n+1) bits (by encoding n1). So total rate = h2(n1/n) + log2(n+1)/n. For i.i.d. Ber(p) sources, as n→∞, n1/n → p, and rate → h2(p), making it universal.

Conclusion: From Theory to Practice

Understanding these coding techniques is crucial for modern applications like video streaming (H.264 uses arithmetic coding), DNA sequence compression, and even blockchain data storage. By mastering Huffman, Golomb, and arithmetic coding, you're equipped to tackle real-world compression challenges.