内容介绍 解决问题:模式匹配,对应实际问题: Lecture11模式匹配 文本文件字符串查找 格式文件中的字符串及其格式查找问题 清华大学 算法: Naive, Rabin Karp,自动机 KMP,BM算法 宋斌恒 Figure 32. 1 The string-matching problem. The goal is to 本章算法及其复杂度列表 find all occurrences of the pattern P= abaa in the text T abcabaabcabac. The pattern occurs only once in the text, at 3. The shifts=3 is said to be a valid shift. Each character of the Algorithm the text, and all matched characters are shown in green color rin pattern is connected by a vertical line to the matching charact Narve O((n-m+l)m) e(m) O(n-m+1)m) Finite automaton O(m2 text ababa bac Knuth Morris Pratt O(m) pattern P 8=3 H al bl al Terminology Lemma 32.1(Overlapping -sufix lemma) ·∑:字符集 ·∑:由Σ生成的所有字符串,包括空串的 Suppose that x, y, and z are strings such 集合以下叙述x,y,z,w表示其元素 that x ]z and y ]z If xs lyl, then x]y Concatenation of x, y: xy If x> lyl, then ]x If x= lyl, then x 前缀:[:w[x→ there exists y, such tha 后缀:]:wk→ there exists y, such that
1 1 Lecture 11 模式匹配 清华大学 宋斌恒 2 内容介绍 • 解决问题:模式匹配,对应实际问题: – 文本文件字符串查找 – 格式文件中的字符串及其格式查找问题 • 算法:Naïve,Rabin Karp, 自动机, KMP,BM算法 3 Figure 32.1 The string-matching problem. The goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the matching character in the text, and all matched characters are shown in green color. text T pattern P b a b a a a c a b a a b c a b a c 4 本章算法及其复杂度列表 Algorithm preprocessing time matching time Naïve 0 O((n-m+1)m) Rabin-Karp Θ(m) O((n-m+1)m) Finite automaton O(m|Σ|) Θ(n) Knuth Morris Pratt O(m) Θ(n) BM 5 Terminology • Σ: 字符集 • Σ*:由Σ生成的所有字符串,包括空串的 集合,以下叙述x, y, z, w表示其元素 • Concatenation of x,y: xy • 前缀:[ : w [ x Î there exists y,such that x=wy • 后缀: ] : w ]x Î there exists y,such that x=yw 6 Lemma 32.1 (Overlapping-suffix lemma) Suppose that x, y, and z are strings such that x ] z and y ] z. If |x| ≤ |y|, then x] y. If |x| ≥ |y|, then y ] x. If |x| = |y|, then x = y
Figure 32.3 A graphical proof of Lemma 32.I. We suppose that x ]a andy 1z onnect matching regions(shown shaded) of the strings. (a)If k< b, thenx ly. (b)If]d> b then ]x()If k]=b thenx=y Naive String matching algorithm ■■ Naive-String-matcher(T, P 2.m∈ . length[P 3.Fors∈0ton-m 1. IfP[L,-,m==T[+1,.,stm 1. Print"Pattern occurs with shifts Rabin Karp algorithm Key points of Karp-Rabin Hash Value p of a string P with modulo q How to compute the value efficiently? p=hash(P, m, q(P(mEP+ P(m-121+ P( J2P+ Plm-1JIEp+.+ P[=lm-l)mod( Ls+i=Hash(T[+1,m], m, q) 计算方法:多项式求值方法 ts+i =d(ts-TIsIh)T[s+m+l] mod(g) ·算法描述: h=dm-l(mod q) 给定字符串T和要找的子字符串P d是进位,如果是26个字符,则可以是26, If Hash(P/m, q)=Hash(T(L,m,m,q)c 般来讲取d为 ·继续判断(nave方法) Spurious Hit 其中Ti,m]表示T从开始长度为m的子串 lsl9lo2|34|52|6173992山 ·mod13 89|3l084|50u7gl valid (b)
2 7 Figure 32.3 A graphical proof of Lemma 32.1. We suppose that x ] a and y ] z. The three parts of the figure illustrate the three cases of the lemma. Vertical lines connect matching regions (shown shaded) of the strings. (a) If |x| ≤ |y|, then x ] y. (b)If |x| ≥ |y|, then y ] x. (c) If |x| = |y|, then x = y. 8 Naïve String matching algorithm • Naïve-String-matcher(T,P) 1. nÅlength[T] 2. mÅlength[P] 3. For s Å0 to n-m 1. If P[1,…,m]==T[s+1,…,s+m] 1. Print “Pattern occurs with shift” s 9 Rabin Karp algorithm • Hash Value p of a string P with modulo q: – p=hash(P,m,q)=(P[m]|Σ| 0+ P[m-1]|Σ| 1+ P[m-2]|Σ| 2+…+ P[m-i]|Σ| i +…+ P[1]|Σ| m-1) mod(q) – 计算方法:多项式求值方法, • 算法描述: – 给定字符串T和要找的子字符串P, – 从i=1到Length[T]-Length[P] • If Hash(P,m,q)!=Hash(T[i,m],m,q) continue • 继续判断(naïve方法) • 其中T[i,m]表示T从i开始长度为m的子串 10 Key Points of Karp-Rabin • How to compute the value efficiently? – ts+1 = Hash(T[s+1,m],m,q) – ts+1 =(d(ts-T[s+1]h)+T[s+m+1]) mod(q) – h=dm-1 (mod q) – d是进位,如果是26个字符,则可以是26, 一般来讲取d为|Σ| • Spurious Hit 11 2 9 0 2 3 1 4 1 5 2 6 7 7 3 5 3 9 9 2 1 12 2 9 0 2 3 1 4 1 5 2 6 7 7 3 5 3 9 9 2 1 8 9 3 11 0 1 8 4 5 10 11 7 9 11
Rabin-Karp-Matcher(T, P,d, q) 1.n= length们 ld 2. m=length[Pl digit shift 3. h=dm-l(mod q) 4.p=0 6. For i=l to m 8(mod13) 1, p=(dptPo)(mod q) 2. to=(dto+T)(mod q) 7. For s=0 to n-m PI1,…,mF=s-…m+ sh print" pattem occurs I. td(t-Ts+Ih+T[s+m+l] mod(q) String matching with finite 有限自动机示例:同 有限自动机M:5uple(Q40A,x,6) Q is a finite set of states a go in Q is the start state A contains inQ is a distinguished set of accepting states 2 is a finite input alphabet 6 is a function from q×∑ into Q called the abana accepted transition function of m abbbaaa accepte abba rejected Final-state function String matching automata 中Σ*→Q M=(Qq0A,∑。) where P(E=go ∑= the alphabet in P ¢wa=6(φw,a) for w in 2*,ainΣ m: Integer s means the current state has exactly s characters match the patter m=length(PI A={m} 6:5(q1a=G(Pa)2o(x=max{kPk2x,P表示 的前k个字符,称为后缀函数。【研究!】
3 13 3 1 4 1 5 2 7 8 14152≡(31415-3·10000) · 10+2(mod 13) ≡(7-3 · 3) · 10+2(mod 13) ≡8(mod 13) 14 • Rabin-Karp-Matcher(T,P,d,q) 1. n=length[T] 2. m=length[P] 3. h=dm-1 (mod q) 4. p=0 5. T0=0 6. For i =1 to m 1. p=(dp+P[i]) (mod q) 2. t0=(dt0+T[i]) (mod q) 7. For s = 0 to n-m 1. If p=ts 1. If P[1,…,m]=T[s,…,m+s], print “pattern occurs at ” s 2. If s<n-m 1. ts+1=(d(ts-T[s+1]h)+T[s+m+1] mod(q) 15 String matching with finite automata • 有限自动机 M:5-tuple (Q,q0,A,Σ,δ) • Q is a finite set of states • q0 in Q is the start state • A contains in Q is a distinguished set of accepting states • Σ is a finite input alphabet • δ is a function from Q×Σ into Q called the transition function of M 16 有限自动机示例: 0 1 1 0 0 0 1 0 state a b input δ ? Σ ? A 1 q ? 0 Q ? abaaa accepted abbbaaa accepted abbaa rejected 17 Final-state function • φ: Σ∗ ÆQ • φ(ε)=q0 • φ(wa)=δ(φ(w),a) for w in Σ∗, a in Σ 18 String matching automata • Given pattern P, to construct an automaton: • M= (Q,q0,A,Σ,δ) where • Σ = the alphabet in P • Q = {0,1,2,…,m: integer s means the current state has exactly s characters match the pattern}, m=length[P] • q0=0 • A={m} • δ : δ(q,a)=σ(Pqa), σ(x)=max{k:Pk =x}, Pk表示P 的前k个字符,称为后缀函数。【研究!】
Automaton for“ ababa 9ab④⑤⑥ 其它转移如何做? 自动机工作的一个例子 1 3 13 02040402 34567891011 b a b a b a c a b 51 68 (T)01234545672 71 算法描述: 后缀函数σ的属性: 【引定理32.2/3/4】 Finite-Automaton- Matcher(T, 8, m) If q=o(x)then o(xa=o(Poa) 6q,T) 如果ψ是模式P决定的自动机的最终状态 2. If q==m then print"pattern occurs at"i-m 函数,则(T=0(T
4 19 Automaton for “ababaca” 0 1 2 3 4 5 6 7 其它转移如何做? 转移函数? 20 0 1 2 3 4 5 6 7 21 1 0 0 1 2 0 3 0 0 1 4 0 5 0 0 1 4 6 7 0 0 1 2 0 22 自动机工作的一个例子 i — 1 2 3 4 5 6 7 8 9 10 11 T[i] — a b a b a b a c a b a stateφ(Ti ) 0 1 2 3 4 5 4 5 6 7 2 3 23 算法描述: • Finite-Automaton-Matcher(T,δ,m) 1. n=length[T] 2. q=0 3. For i = 1 to n 1. q=δ(q,T[i]) 2. If q==m then print “pattern occurs at” i-m 24 后缀函数σ的属性: 【引/定理32.2/3/4】 • σ(xa)≤ σ(x)+1 • If q= σ(x) then σ(xa)=σ(Pqa) • 如果φ是模式P决定的自动机的最终状态 函数,则φ(Ti )=σ(Ti )
■ ■■口 Figure 32.8 An illustration for the proof of Lemma 32.2. The figure shows thats o(x)+ l, where r= o(xa) Figure 32.9 An illustration for the proof of Lemma 32.3 The figure shows thatr=o(Pa), where q=o(x) and r=o(xa) COMPUTE-TRANSITION-FUNCTION(P, 2) Knuth-Morris-Pratt Algorithm 7 mlength[PI 2forq←0tom ·中心思想:克服nave算法和有限自动机 3 do for each character a∈∑ 算法的重复工作! do k- min(m+1, 9+2) ·主要步骤:计算辅助函数π{1,2,,m它只 repeat k←k 与 Pattern相关,用来说明 Pattern的重复情 until Pk]P4【是后缀! 7 6q,a)←k 8 return 5 在某S处开始有5个字符匹配,第六个不匹配 这时找一个短一些的匹配串:即从第S+2处开 始的长度为3的地方匹配见下页 balc babr bac bababaa babr H blal lacal aa bl a ca
5 25 Figure 32.8 An illustration for the proof of Lemma 32.2. The figure shows that r ≤ σ(x) + 1, where r = σ(xa). pr−1 pr 26 Figure 32.9 An illustration for the proof of Lemma 32.3. The figure shows that r = σ(Pqa), where q = σ(x) and r = σ(xa). pq pr 27 COMPUTE-TRANSITION-FUNCTION(P, ∑) 1 m←length[P] 2 for q ← 0 to m 3 do for each character a ∈ ∑ 4 do k ← min(m + 1, q + 2) 5 repeat k ← k - 1 6 until Pk ] Pqa【是后缀!】 7 δ(q, a) ← k 8 return δ 28 Knuth-Morris-Pratt Algorithm • 中心思想:克服naïve 算法和有限自动机 算法的重复工作! • 主要步骤:计算辅助函数π[1,2,…,m]它只 与Pattern相关,用来说明Pattern的重复情 况。 29 a a b a a b c b a b a b a a b c b a b b c a 在某S处开始有5个字符匹配,第六个不匹配, 这时找一个短一些的匹配串:即从第S+2处开 始的长度为3的地方匹配,见下页 30 a a b a a b c b a b a b a a b c b a b b c a