Database searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. cV=2-tuple Seq 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE 3-Make a Hash Table Hashing) that has the position of each k-tuple in each sequence 2-tuple pos in Seg1 pos in Seg2 offset (pos1-p0s2 RW 5 2 CV 10 AH
Database Searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. CV = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 RW 5 CV 10 AH 1 i.e. pos in Seq 2 Offset (pos1-pos2) 2 3 7 3 ---- ----
Database searching Seq 1: AHFYRWNKLEV Seg 2: DRNLFCVATYWE 3.Make a Hash Table HAshing)that has the position of each k-tuple heach sequence e 2-tuple pos in Seq1 pos in Seg ofset(pos1-pos2 RW CV 10 3 AH 4-Look for words(k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences 5-Build a local alignment based on these, extend it outwards Seg 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE
Database Searching Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table i.e. (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2) 3 3 RW 5 2 CV 10 7 AH 1 ---- ---- 4- Look for words (k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences. 5- Build a local alignment based on these, extend it outwards Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database searching With hashing, number of comparisons is proportional To the average sequence length i.e. an o(n) problem) Not an o(mn)problem as in dynamic programming Proteins-ktup=1-2 Nucleotides, ktup=4-6 One big problem- low complexity regions Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK
Database Searching With hashing, number of comparisons is proportional To the average sequence length (i.e. an O(n) problem), Not an O(mn) problem as in dynamic programming. Proteins – ktup = 1-2, Nucleotides, ktup=4-6 One big problem – low complexity regions. Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK
Database searching BLAST Same basic idea as fasta but faster and more sensitivel How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant by using the log-odds values in the Blosum62 amino acid substitution matrix . e. look for WHK and might accept WHR but not HFK as a possible match(note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc Use any match to seed an ungapped alignment (old BLasT)
Database Searching BLAST Same basic idea as FASTA, but faster and more sensitive! How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant, by using the log-odds values in the Blosum62 amino acid substitution matrix i.e. look for WHK and might accept WHR but not HFK as a possible match (note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc. Use any match to seed an ungapped alignment (old BLAST)
Database searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions Determine if the alignment is statistically significant calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution Calculates an E-value expectation value This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50!(normal starting E=10
Database Searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions. Determine if the alignment is statistically significant. calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution. Calculates an E-value = expectation value: This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50! (normal starting E=10)