Lecture 5. Greedy algorithm Huffman Codes Huffman codes用于数据压缩一般可以达到 · Huffman code 压缩20%到90%的效果。 Matroid(拟阵) 什么是编码:编码实质上是用给定的一组 符号集的字符串到另外一组符号集的映 射。 例如用{0,1}字符串到所有英文字符{abc…} 的映射的8位码 清华大学們学院宋恒 清华大学软件学院宋 Prefix codes ·我们只考虑任一个编码字不是另一个编码 Fixed-length codeword 000 001 010 011 100 101 的前缀的编码。 ariable-length codeword 0 101 ·这样的编码叫 Prefix Codes Figure 163 A character-coding problem. A data 下页给出几种编码例子 file of 100,000 characters contains only the characters a-f, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300.000 bits. Using the variable-length code shown, the file can be 清华大学学院未恒 编码表示方式 Prefix codes are desirabl because they simplify decoding 利用树来表示,下图表示等长编码 示例:在上图非定长编码下 0001110101唯一代表字符串 aaaeab 清华大学轼字院宋恒 消华大学软件学院宋
1 清华大学软件学院宋斌恒 1 Lecture 5. Greedy Algorithm •Huffman code •Matroid(拟阵) 清华大学软件学院宋斌恒 2 Huffman Codes •Huffman codes用于数据压缩一般可以达到 压缩20%到90%的效果。 •什么是编码:编码实质上是用给定的一组 符号集的字符串到另外一组符号集的映 射。 •例如用{0,1}字符串到所有英文字符{a,b,c,… } 的映射的8位码 清华大学软件学院宋斌恒 3 Prefix codes •我们只考虑任一个编码字不是另一个编码 字的前缀的编码。 •这样的编码叫Prefix Codes。 •下页给出几种编码例子: a b c d e f Frequency (in thousands) 45 13 12 16 9 5 Fixed-length codeword 000 001 010 011 100 101 Variable-length codeword 0 101 100 111 1101 1100 Figure 16.3 A character-coding problem. A data file of 100,000 characters contains only the characters a-f, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300,000 bits. Using the variable-length code shown, the file can be encoded in 224,000 bits. 清华大学软件学院宋斌恒 5 Prefix codes are desirable because they simplify decoding. •示例:在上图非定长编码下,串 0001110101唯一代表字符串aaaeab 清华大学软件学院宋斌恒 6 编码表示方式 •利用树来表示,下图表示等长编码: 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14
Given a tree T corresponding to a prefix code, le matter to compute the number of bits An optimal code for a file is quired to encode a file. Let c denote the character always represented by a full in alphabet C, f()denote the frequency of c in the file, drc)the depth of c in the tree. The binary tree number of bits required to encode a file is thus wwhich we define as the cost of the tree T. 清华大学們学院宋恒 清华大学软件学院宋 Constructing a Huffman code Huffman invented a greedy algorithm that 9 constructs an optimal prefix code called a Huffman code c12D3 下面我们给出 Huffman code算法的伪代码 Figure 6. 4. Archs ear res laden to the ceding sehe ans its 清华大学软件学院宋 1. Huffman (C) 2.Q∈c∥ Q is a priority queue=newQ(C) 3. For ie 1 to n-1 (a)[15[e 3. Right(zjyeQ Extract-ing 4. Return Q Extract- Mino 清华大学轼字院宋恒 消华大学软件学院宋
2 清华大学软件学院宋斌恒 7 An optimal code for a file is always represented by a full binary tree. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 清华大学软件学院宋斌恒 8 Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. Let c denote the character in alphabet C, f(c) denote the frequency of c in the file, dT (c) the depth of c in the tree. The number of bits required to encode a file is thus cost . ( ) ( ) ( ) which we define as the of the tree T B T f c d c c C å T Î = Figure 16.4 Trees corresponding to the coding schemes in Figure 16.3. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its subtree. (a)The tree corresponding to the fixed-length code a=000,…,f=101.(b)The tree corresponding to the optimal prefix code a=0,b=101,…,f=1100. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 10 Constructing a Huffman code •Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman code. •下面我们给出Huffman code算法的伪代码 清华大学软件学院宋斌恒 11 1. Huffman(C) 1. nfl|C| 2. QflC //Q is a priority queue =new Q(C) 3. For ifl1 to n-1 1. Do allocate a new node z 2. Left[z]flxflQ.Extract-Min() 3. Right[z]flyflQ.Extract-Min() 4. f[z]flf[x]+f[y] 5. Q.insert(z) 4. Return Q.Extract-Min() 清华大学软件学院宋斌恒 12 f:5 e:9 c:12 b:13 d:16 a:45
t)[e2][13 a45 清华大学們学院宋恒 清华大学软件学院宋 c1a]Q L] 清华大学学院未恒 清华大学软件学院宋 最小频率者最优编码可等长 Let c be an alphabet in which each character c C has frequency fIc]. Let x and y be tw characters in C having the lowest frequencie Then there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only in the last bit 清华大学轼字院宋恒 消华大学软件学院宋
3 清华大学软件学院宋斌恒 13 14 c:12 b:13 f:5 e:9 d:16 a:45 清华大学软件学院宋斌恒 14 14 d:16 a:45 f:5 e:9 25 c:12 b:13 清华大学软件学院宋斌恒 15 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 16 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 17 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 18 最小频率者最优编码可等长 •Let C be an alphabet in which each character c ∈ C has frequency f [c].Let x and y be two characters in C having the lowest frequencies. • Then there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only in the last bit
证明 ·设丁是任意一个最优编码对应的树,a,b是 深度最深的一对兄弟【为什么存在?】 然后交换x、a;y、b;【见下图】 ·T'是最优编码对应的树 Q“ 清华大学們学院宋恒 清华大学软件学院宋 上图不同树的成本之差 Huffman的贪心选择律 B(T)-B(T)=∑fc4(c)-∑f(c)dr(c) ·设C是给定的字符集,f是C中元素c的频 =fx}(x)+flka)-f[x41(x)-fl}1(a) 率。设x,y是C中频率最小的两个元素。令 C=C-{xy}+{z},其中z是一个新的元素 fd,(x)+falr(a-x], (a)-fad,(x) 即z不属于C。我们在C上更新f,使得 Ga-fd,(a)-d,(x) f2]=fx]y]其它值不变。则如果T是C上 的任意一个最优前缀编码对应的树,则把z > 用一棵由x和y为叶子点的树代替而生成的 新树T是C上的一个最优编码对应的树 清华大学学院未恒 清华大学软件学院宋 证明提示 Huffman编码小结 1.树T和T编码成本关系B(TB(T)=f×]+y] ·前缀码 2.反证法,假设T不是最优编码树→ Exists 最优码的特征 T such that BT)<B), which has x and 最小频率的优化码可最长化 y are the deepest siblings, Construct T" to represents a prefix code for ·贪婪选择律 C+T"is a better coding than t,a 贪婪算法 contradiction 清华大学轼字院宋恒 消华大学软件学院宋
4 清华大学软件学院宋斌恒 19 证明 •设T是任意一个最优编码对应的树,a,b是 深度最深的一对兄弟【为什么存在?】, 然后交换x、a;y、b;【见下图】 •T’’是最优编码对应的树 清华大学软件学院宋斌恒 20 y a b x y x b a a b x y 清华大学软件学院宋斌恒 21 上图不同树的成本之差 å å Î Î - = - c C T c C T B(T) B(T') f (c)d (c) f (c)d (c) ' f[x]d (x) f [a]d (a ) f [x]d (x) f[a]d (a ) = T + T - T ' - T ' ( f[a] f[x])(d (a ) d (x)) = - T - T f[x]d (x) f [a ]d (a ) f[x]d (a) f [a]d (x) = T + T - T - T ³ 0, 清华大学软件学院宋斌恒 22 Huffman的贪心选择律 •设C是给定的字符集,f[c]是C中元素c的频 率。设x,y是C中频率最小的两个元素。令 C’=C-{x,y}+{z},其中z是一个新的元素, 即z不属于C。我们在C’上更新f,使得 f[z]=f[x]+f[y],其它值不变。则如果T’是C’上 的任意一个最优前缀编码对应的树,则把z 用一棵由x和y为叶子点的树代替而生成的 新树T是C上的一个最优编码对应的树。 清华大学软件学院宋斌恒 23 证明提示 1. 树T和T’编码成本关系B(T)-B(T’)=f[x]+f[y] 2. 反证法,假设T不是最优编码树Ë Exists T’’such that B(T’’)<B(T), which has x and y are the deepest siblings, Ë Construct T’’’to represents a prefix code for C’ËT’’’is a better coding than T’, a contradiction 清华大学软件学院宋斌恒 24 Huffman编码小结 •前缀码 •最优码的特征 •最小频率的优化码可最长化 •贪婪选择律 •贪婪算法
Matroid(拟阵) 线性代数中的线性无关概念回顾 Theoretical Foundation for Greedy 组向量:a=10,0},b=1,1,0},c=0,20} orithms d={1,1,1},e={0.0.0}.f={0,1,0} application ·所有线性无关向量组构成的集合 l={{a}.{b},@每o}.t{ab}.{ac},{ad a.吾.鲁.…… 的特征:如果{ab}属于l则{a,{b}都属于 清华大学們学院宋恒 清华大学软件学院宋 Matroids 拟阵定义(续) Definition: A matroid is an ordered pair 1. I is a nonempty family of subsets of s, called the independent subsets of s, such that if B∈ I and AcB, then a∈l.Wesa atisfying the following 3 conditions: that I is hereditary if it satisfies this property. Note that empty set is necessarily a member 1. S is a finite nonempty set 2. If A in l. B in s, then there is some element A such that AU x∈l We say that M es the exchange 清华大学学院未恒 清华大学软件学院宋 Concepts detail Example 1. Linear algebra 1. I is a set whose elements are subset of Let S=(a, a2, a3,...,a be a set of n and I =all subsets of s). 2. Every element in I is a set whose elements are independent 1. Proof M=S, I) is a Matroid! 1. The element inI is independent subset of s 3. A subset of independent elements must 1. LetA E I and b is a subset of A, It is trivial that B is in 4. If one independent set is small than another, then it can be expands to the same size with elements in the other set 清华大学轼字院宋恒 消华大学软件学院宋
5 清华大学软件学院宋斌恒 25 Matroid(拟阵) •Theoretical Foundation for Greedy Algorithms. •Its application 清华大学软件学院宋斌恒 26 线性代数中的线性无关概念回顾 •一组向量: a={1,0,0}, b={1,1,0}, c={0,2,0}, d={1,1,1}, e={0,0,0}, f={0,1,0} •所有线性无关向量组 构成的集合 • l ={f,{a}, {b}, {c}, {d}, {f}, {a,b}, {a,c}, {a,d}, {a,f}, {}, {},… … } • l 的特征:如果{a,b}属于l 则{a}, {b}都属于l 清华大学软件学院宋斌恒 27 Matroids • Definition: A matroid is an ordered pair M = (S, l ) satisfying the following 3 conditions: 1. S is a finite nonempty set 清华大学软件学院宋斌恒 28 拟阵定义(续) 1. l is a nonempty family of subsets of S, called the independent subsets of S, such that if B Î l and A Í B, then A Î l . We say that l is hereditary if it satisfies this property. Note that empty set is necessarily a member of l . 2. If A in l , B in l , and |A|<|B|, then there is some element x Î B-A such that A U {x} Î l . We say that M satisfies the exchange property 清华大学软件学院宋斌恒 29 Concepts detail 1. l is a set whose elements are subset of S 2. Every element in l is a set whose elements are independent. 3. A subset of independent elements must be independent too. 4. If one independent set is small than another, then it can be expands to the same size with elements in the other set 清华大学软件学院宋斌恒 30 Example 1. Linear algebra • Let S={a1 , a2 , a3 , … , an } be a set of n independent vectors, and l ={all subsets of S}. 1. Proof M={S, l } is a Matroid! 1. The element in l is independent subset of S. 1. Let A Î l and B is a subset of A, It is trivial that B is in l . 2. If |A|<|B|, then there exists an element b in B-A such that A U {b} is an element of l. 1. Its trivial too!