DNA Sequence Comparison & Alignment
• Target frequencies and mismatch penalties
• Eukaryotic gene structure
• Comparative genomics applications:
  - Pipmaker (2 species comparison)
  - Phylogenetic Shadowing (many species)
• Intro to DNA sequence motifs
See Ch. 7 of Mount

DNA Sequence Alignment V
How is λ related to the score matrix?
λ is the unique positive solution to the equation*:
$$\sum_{i,j} p_i \, p_j \, e^{\lambda s_{ij}} = 1$$
$p_i$ = frequency of nt i, $s_{ij}$ = score for aligning an i,j pair
• What kind of an equation is this?
• What would happen to λ if we doubled all the scores?
• What does this tell us about the nature of λ?
*Karlin & Altschul, 1990

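The equation for λ generally has no closed form, so it is usually solved numerically. The following is a minimal sketch, assuming uniform base frequencies (p = 1/4 for each base) and a +1/-1 match-mismatch matrix; the function name `solve_lambda` and the choice of bisection are illustrative, not taken from the lecture. For this particular matrix the equation reduces to a quadratic in e^λ, whose positive solution is λ = ln 3, which gives a check on the numerical answer.

```python
import math

def solve_lambda(p, s, iterations=200):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection.

    p: dict base -> frequency; s: dict of dicts, s[i][j] = score.
    Assumes the expected score is negative and some score is positive,
    so that a unique positive root exists.
    """
    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s[i][j])
                   for i in p for j in p) - 1.0

    lo = 1e-6            # just right of the trivial root at lam = 0, where f < 0
    hi = 1.0
    while f(hi) <= 0:    # expand until the bracket contains the positive root
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

bases = "ACGT"
p = {b: 0.25 for b in bases}                      # assumed uniform composition
s = {a: {b: (1 if a == b else -1) for b in bases} for a in bases}
s_doubled = {a: {b: 2 * s[a][b] for b in bases} for a in bases}

lam = solve_lambda(p, s)
lam_doubled = solve_lambda(p, s_doubled)
print(lam, lam_doubled, lam / lam_doubled)
```

Running the sketch with the doubled matrix is one way to explore the slide's question about what happens to λ when all scores are doubled.
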
DNA Sequence Alignment VI
What scoring matrix to use for DNA?
Usually use simple match-mismatch matrices:

s_ij:       j:  A  C  G  T
     i:  A      1  m  m  m
         C      m  1  m  m
         G      m  m  1  m
         T      m  m  m  1

m = “mismatch penalty” (must be negative)

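As a small illustration of the match-mismatch matrix, the sketch below builds it as a nested dictionary and scores an ungapped alignment of two equal-length sequences. The helper names `match_mismatch_matrix` and `ungapped_score` are hypothetical, chosen only for this example.

```python
def match_mismatch_matrix(m, alphabet="ACGT"):
    """Return s[i][j] = 1 if i == j, else m (m must be negative)."""
    if m >= 0:
        raise ValueError("mismatch penalty m must be negative")
    return {a: {b: (1 if a == b else m) for b in alphabet} for a in alphabet}

def ungapped_score(x, y, s):
    """Sum the pair scores of two aligned, equal-length sequences (no gaps)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(s[a][b] for a, b in zip(x, y))

s = match_mismatch_matrix(-1)
# Three matches (A, C, T) and one mismatch (G vs C): 3 * 1 + 1 * (-1)
print(ungapped_score("ACGT", "ACCT", s))  # prints 2
```
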
DNA Sequence Alignment VII
How to choose the mismatch penalty?
Use theory of High Scoring Segment composition*
High scoring alignments will have composition:
$$q_{ij} = p_i \, p_j \, e^{\lambda s_{ij}}$$
where $q_{ij}$ = frequency of i,j pairs (“target frequencies”)
$p_i$, $p_j$ = freq of i, j bases in sequences being compared
What would happen to the target frequencies if we doubled all of the scores?
*Karlin & Altschul, 1990

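The target-frequency formula can be checked numerically. The sketch below assumes uniform base frequencies (p = 1/4) and a +1/-1 matrix, for which the defining equation for λ becomes a quadratic in e^λ with positive solution λ = ln 3; for the doubled matrix (+2/-2) the same algebra gives λ = (ln 3)/2. These values and the function names are assumptions of the example, not part of the lecture.

```python
import math

BASES = "ACGT"
P = {b: 0.25 for b in BASES}   # assumed uniform base composition

def target_frequencies(match, mismatch, lam):
    """q_ij = p_i * p_j * exp(lam * s_ij) for a match-mismatch matrix."""
    return {(a, b): P[a] * P[b] * math.exp(lam * (match if a == b else mismatch))
            for a in BASES for b in BASES}

def identity_fraction(q):
    """Total target frequency of identical pairs (the diagonal of q)."""
    return sum(v for (a, b), v in q.items() if a == b)

q = target_frequencies(1, -1, math.log(3))
q_doubled = target_frequencies(2, -2, math.log(3) / 2)

print(sum(q.values()))            # the target frequencies should sum to about 1
print(identity_fraction(q))       # expected fraction of identities
print(identity_fraction(q_doubled))
```

Comparing `q` with `q_doubled` gives a concrete handle on the slide's question about doubling the scores.
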
DNA Sequence Alignment VIII
Still figuring out how to choose the mismatch penalty m
Target frequencies: $q_{ij} = p_i \, p_j \, e^{\lambda s_{ij}}$ ⇒ $s_{ij} = \ln\big(q_{ij} / (p_i p_j)\big) / \lambda$
If you want to find regions with R% identities:
$r = R/100$, $q_{ii} = r/4$, $q_{ij} = (1-r)/12$ $(i \neq j)$
Set $s_{ii} = 1$. Then, for $i \neq j$:
$$m = s_{ij} = \frac{s_{ij}}{s_{ii}} = \frac{\ln\big(q_{ij} / (p_i p_j)\big) / \lambda}{\ln\big(q_{ii} / (p_i p_i)\big) / \lambda}$$
⇒ $m = \ln\big(4(1-r)/3\big) / \ln(4r)$ (taking $p_i = 1/4$ for every base, so that λ cancels and only r remains)

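The closing formula is easy to evaluate directly. The sketch below assumes uniform base frequencies, as the formula itself does, and a match score of 1; the function name `mismatch_penalty` is hypothetical. The formula only yields a negative penalty when r is above 0.25, the identity level expected by chance, so the function rejects values outside (0.25, 1).

```python
import math

def mismatch_penalty(r):
    """m = ln(4(1 - r)/3) / ln(4r) for a target identity fraction r.

    Assumes uniform base frequencies (1/4 each) and match score 1.
    """
    if not 0.25 < r < 1.0:
        raise ValueError("r must lie strictly between 0.25 and 1")
    return math.log(4.0 * (1.0 - r) / 3.0) / math.log(4.0 * r)

# Higher target identity calls for a harsher (more negative) mismatch penalty.
for percent in (50, 75, 90, 99):
    print(percent, round(mismatch_penalty(percent / 100.0), 3))
```

For r = 0.75 the formula gives m = ln(1/3)/ln(3) = -1, so the familiar +1/-1 matrix is tuned to regions of about 75% identity under these assumptions.
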