7.91 / 7.36 / BE.490 Lecture #2 Feb. 26, 2004 DNA Sequence Comparison & Alignment Chris Burge
Review of Lecture 1: “Genome Sequencing & DNA Sequence Analysis”
• The Language of Genomics
  - cDNAs, ESTs, BACs, Alus, etc.
• Dideoxy Method / Shotgun Sequencing
  - The ‘shotgun coverage equation’ (Poisson)
• Flavors of BLAST
  - BLAST[PNX], TBLAST[NX]
• Statistics of High Scoring Segments
Shotgun Sequencing a BAC or a Genome
Starting material: 200 kb (NIH) or 3 Gb (Celera)
1. Sonicate, Subclone: yields Subclones
2. Sequence, Assemble: yields Shotgun Contigs
What would cause problems with assembly?
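The slides name the Poisson 'shotgun coverage equation' without writing it out. A minimal sketch, assuming the standard form in which reads land uniformly at random, so that a base is missed by every read with probability e^(-c), where c = (number of reads × read length) / target length. The 500 bp read length and all function names are illustrative assumptions, not values from the lecture.

```python
import math


def expected_coverage(num_reads, read_length, target_length):
    """Mean number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / target_length


def fraction_uncovered(coverage):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-coverage)


def reads_needed(target_length, read_length, coverage):
    """Smallest number of reads that reaches the requested mean coverage."""
    return math.ceil(coverage * target_length / read_length)


if __name__ == "__main__":
    bac_length = 200_000   # 200 kb BAC, as on the slide
    read_length = 500      # assumed read length (hypothetical)
    for c in (1, 2, 5, 8, 10):
        n = reads_needed(bac_length, read_length, c)
        missed = fraction_uncovered(c) * bac_length
        print(f"coverage {c:>2}x: {n:>5} reads, about {missed:,.0f} bases uncovered")
```

The same functions apply to a 3 Gb genome by changing `bac_length`. The Poisson model ignores repeats and cloning bias, which are among the things that cause problems with assembly.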
DNA Sequence Alignment IV
Which alignments are significant?

Q:   1 ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 45
       |||| |||||||||||| | |||||||||||||||||||||||||
S: 403 ttgatctagatgagatgccattcacttttactgagctacagaaaa 447

Identify high scoring segments whose score S exceeds a cutoff x using dynamic programming.
Scores follow an extreme value distribution:
$P(S > x) = 1 - \exp\left[-Kmn\, e^{-\lambda x}\right]$
for sequences of length m, n, where K, λ depend on the score matrix and the composition of the sequences being compared.
(Same theory as for protein sequence alignments)
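To make the formula concrete, here is a rough sketch that scores the ungapped alignment shown above and evaluates P(S > x). The +1/-3 match/mismatch scores, the values of K and λ, and the database length n are placeholders chosen for illustration. They are not tabulated values, and real values depend on the score matrix and sequence composition.

```python
import math

QUERY = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa"
SUBJECT = "ttgatctagatgagatgccattcacttttactgagctacagaaaa"


def ungapped_score(a, b, match=1, mismatch=-3):
    """Score two equal-length aligned strings column by column (assumed scheme)."""
    if len(a) != len(b):
        raise ValueError("aligned segments must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def p_value(score, m, n, K, lam):
    """P(S > x) = 1 - exp(-K*m*n*exp(-lam*x)), computed stably with expm1."""
    expected_hits = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    s = ungapped_score(QUERY, SUBJECT)   # 42 matches, 3 mismatches
    m = len(QUERY)                       # query length, 45
    n = 1_000_000                        # hypothetical database length
    K, lam = 0.1, 1.0                    # placeholder parameters, not tabulated values
    print("score:", s)
    print("P(S > score):", p_value(s, m, n, K, lam))
```

Using `math.expm1` avoids the loss of precision that `1 - math.exp(-y)` suffers when y is tiny, which is the usual case for significant alignments.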
From M. Yaffe Notes (cont), Lecture #2
• The random sequence alignment scores would give rise to an “extreme value” distribution, like a skewed Gaussian.
• Called Gumbel extreme value distribution
For a normal distribution with a mean m and a variance σ², the height of the curve is described by
$Y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right]$
For an extreme value distribution, the height of the curve is described by
$Y = \exp\left[-x - e^{-x}\right]$
…and
$P(S > x) = 1 - \exp\left[-e^{-\lambda (x-u)}\right]$, where $u = (\ln Kmn)/\lambda$
Can show that mean extreme score is ~ log2(nm), and the probability of getting a score that exceeds some number of “standard deviations” x is:
$P(S > x) \sim Kmn\, e^{-\lambda x}$
*** K and λ are tabulated for different matrices ***
For the less statistically inclined: $E \sim Kmn\, e^{-\lambda S}$
[Figure: Probability values for the extreme value distribution (A, Y_ev) and the normal distribution (B, Y_n), plotted against x. The area under each curve is 1.]
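The relationships on this slide can be checked numerically. The sketch below, under illustrative assumptions, evaluates both curve heights, confirms by trapezoid integration that each area is close to 1, and verifies that the form with u = (ln Kmn)/λ matches the form written with Kmn directly. The values of K, λ, m and n are placeholders, not tabulated parameters.

```python
import math


def normal_pdf(x, m=0.0, sigma=1.0):
    """Height of the normal curve with mean m and standard deviation sigma."""
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gumbel_pdf(x):
    """Height of the standard extreme value curve, Y = exp(-x - e^-x)."""
    return math.exp(-x - math.exp(-x))


def trapezoid_area(f, lo, hi, steps=20000):
    """Plain trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


def gumbel_tail(x, lam, u):
    """P(S > x) = 1 - exp(-exp(-lam * (x - u)))."""
    return -math.expm1(-math.exp(-lam * (x - u)))


def e_value(score, m, n, K, lam):
    """E ~ K*m*n*exp(-lam*S), the expected number of chance hits."""
    return K * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    print("normal area:", trapezoid_area(normal_pdf, -10.0, 10.0))
    print("gumbel area:", trapezoid_area(gumbel_pdf, -10.0, 30.0))
    K, lam, m, n = 0.1, 1.0, 45, 1_000_000   # placeholder values
    u = math.log(K * m * n) / lam
    for s in (15, 20, 25, 30):
        print(s, gumbel_tail(s, lam, u), e_value(s, m, n, K, lam))
```

For large scores the two printed columns agree closely, which is why E is a convenient stand-in for P(S > x) when E is much less than 1.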