This section introduces Huffman coding, a greedy method for constructing codes used in data compression.

Suppose we want to encode a long string of characters, representing each character as a unique binary string. How can we minimise the total number of bits? Here is an example of how we might encode each character:

        a    b    c    d     e
freq    30   12   24   6     3
code    0    110  10   1110  1111


The total number of bits required would be \(30\cdot 1 + 12\cdot 3 + 24\cdot 2 + 6\cdot 4 + 3\cdot 4 = 150\).

The encoding above is an example of a Huffman code; Huffman codes are prefix codes, i.e. codes in which no codeword is also a prefix of some other codeword. This makes decoding simple: the codeword that begins an encoded file is unambiguous, so we can peel it off and repeat.
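
To see this in action, here is a minimal decoder sketch in Python for the example code above (the DECODE table and decode helper are names I made up for this sketch):

# A minimal prefix-code decoder for the example code above.
# Scan bits left to right; because no codeword is a prefix of
# another, the first codeword the buffer matches is the only
# possible one, so we can emit it and reset the buffer.
DECODE = {"0": "a", "110": "b", "10": "c", "1110": "d", "1111": "e"}

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "input ended in the middle of a codeword"
    return "".join(out)

print(decode("011010"))  # -> "abc"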


Note on prefix codes

The book claims without proof the interesting fact that “a prefix code can always achieve the optimal data compression among any character code”. According to Wikipedia, I think this refers to the fact that “for any uniquely decodable code there is a prefix code that has the same code word lengths”. This is characterised by the Kraft inequality. In an earlier post, an exercise from Appendix B asked us to prove a weak version of it; here is the full statement: given any uniquely decodable code over an alphabet of size \(r\) with codeword lengths \(l_i\), we have \(\sum_i r^{-l_i} \leq 1\). Conversely, given any set of natural numbers \(l_i\) satisfying the inequality, there exists a prefix code over an alphabet of size \(r\) with those codeword lengths. So any uniquely decodable code satisfies the Kraft inequality, and the converse then gives us a prefix code with the same codeword lengths. That is really neat: we can focus on optimising prefix codes to maximise data compression without any loss of generality!
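
As a quick sanity check, here is the inequality evaluated on the example code above (a sketch in Python, with \(r = 2\) since the alphabet is binary):

# Kraft inequality for the example code: codeword lengths 1, 3, 2, 4, 4
# over a binary alphabet (r = 2); the sum of r^(-l_i) must be <= 1.
lengths = [1, 3, 2, 4, 4]
kraft_sum = sum(2 ** -l for l in lengths)
print(kraft_sum)  # 1.0 -- tight, as it is for any full code tree
assert kraft_sum <= 1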


We can represent any binary coding scheme with a binary tree, labelling each leaf with an encoded character (refer to diagrams in this section). Writing \(\text{depth}(c)\) for the depth of the leaf of character \(c\) in the tree \(T\), which equals the length of its codeword, the total number of bits \(B(T)\) required for encoding (i.e. the cost of \(T\)) is

\[B(T) = \sum_{c\in C} c.\text{freq}\cdot \text{depth}(c).\]
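For the example above, we can compute \(B(T)\) directly (a small sketch; the depth of each leaf is just the length of its codeword):

# B(T) = sum of freq(c) * depth(c); depth(c) equals the codeword length.
freqs = {"a": 30, "b": 12, "c": 24, "d": 6, "e": 3}
code = {"a": "0", "b": "110", "c": "10", "d": "1110", "e": "1111"}
print(sum(f * len(code[ch]) for ch, f in freqs.items()))  # 150, as computed above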

Intuitively, we want to give the longer codewords to the lower-frequency characters, and the Huffman code does so optimally. The algorithm greedily merges the two lowest-frequency nodes at each step, and returns the binary tree representing the prefix code:

// Maintain a frequency v.freq for each node v, which is the sum
// of frequencies of all leaves in the subtree rooted at v.
// Let Q be a min-priority queue keyed on frequency, initially
// holding one leaf node per character in C.
for i = 1 to |C| - 1:
    create new node z
    x = EXTRACT-MIN(Q)
    y = EXTRACT-MIN(Q)
    z.left = x
    z.right = y
    z.freq = x.freq + y.freq
    insert z into Q
return EXTRACT-MIN(Q)
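
For something runnable, here is a sketch of the same algorithm in Python using the standard heapq module (Node, huffman, and codes are names I made up for this sketch):

import heapq
import itertools

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right

def huffman(freqs):
    # Heap entries are (freq, tiebreak, node); the tiebreak counter keeps
    # heapq from ever comparing two Node objects with equal frequencies.
    counter = itertools.count()
    q = [(f, next(counter), Node(f, char=c)) for c, f in freqs.items()]
    heapq.heapify(q)
    for _ in range(len(freqs) - 1):  # |C| - 1 merges, as in the pseudocode
        fx, _, x = heapq.heappop(q)
        fy, _, y = heapq.heappop(q)
        heapq.heappush(q, (fx + fy, next(counter), Node(fx + fy, left=x, right=y)))
    return q[0][2]  # the root of the code tree

def codes(node, prefix=""):
    # Read codewords off the tree: left edge contributes a 0, right edge a 1.
    if node.char is not None:
        return {node.char: prefix or "0"}
    table = dict(codes(node.left, prefix + "0"))
    table.update(codes(node.right, prefix + "1"))
    return table

tree = huffman({"a": 30, "b": 12, "c": 24, "d": 6, "e": 3})
print(codes(tree))  # one optimal code; the codeword lengths match the table above

Note that the exact 0/1 labels may differ from the table above depending on tie-breaking, but the codeword lengths, and hence the cost, are the same.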

Problem 16.3-2 Prove that a binary tree that is not full cannot correspond to an optimal prefix code.

Proof:

A full binary tree is a binary tree in which every internal node has exactly two children. Suppose an optimal code tree \(T\) had an internal node \(u\) with only one child \(v\). Replacing \(u\) by the subtree rooted at \(v\) yields another valid prefix-code tree in which every leaf under \(v\) moves up one level, so its depth strictly decreases, while all other depths are unchanged. Since character frequencies are positive, this strictly decreases the cost \(B(T)\), contradicting the optimality of \(T\).