《生物信息学》 (Bioinformatics), 2nd edition, edited by 樊龙江 (Fan Longjiang), 2021: companion slides, PPT 3-1
3. Analysis and alignment of sequences
• 3.1 Compositional bias in biological sequences
• 3.2 Alignment of pairs of sequences
• 3.3 Database searching for similar sequences
• 3.4 Multiple sequence alignment and domain finding
3.1 Compositional bias in biological sequences
[Slide background: a long stretch of raw genomic DNA sequence, including a trinucleotide (TTA)n repeat run, a poly-C stretch, and a GC-rich region.]
An obvious first summary of a DNA sequence is just the distribution of the four base types. Almost all empirical studies show an unequal distribution of the four bases.
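As a minimal sketch of this first summary, the base distribution of a sequence can be tabulated by counting each base and dividing by the total. The function names `base_composition` and `gc_content` and the toy sequence are illustrative assumptions, not part of the slides; characters other than A, C, G, T (for example N) are ignored here by choice.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases of seq.

    Assumption: any character other than A/C/G/T (e.g. N, gaps) is ignored.
    """
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(seq):
    """Return the combined G + C fraction of seq."""
    comp = base_composition(seq)
    return comp["G"] + comp["C"]


if __name__ == "__main__":
    # Hypothetical toy sequence: an AT-rich repeat followed by a GC-rich stretch.
    toy = "TTATTATTATTATTAGGCGGCGGCG"
    for base, frac in base_composition(toy).items():
        print(f"{base}: {frac:.3f}")
    print(f"GC: {gc_content(toy):.3f}")
```

With equal base usage every fraction would be 0.25; departures from that value are the compositional bias the slide refers to.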
Promoter sequences
Base content as a function of cDNA position, relative to the transcription start site (TSS), averaged over all cDNAs with a 10-bp sliding window.
[Figure: Rice. x-axis: cDNA coordinate (100 bp), from -4 to 1, with the TSS at 0; y-axis: base fraction, 0 to 0.7; curves: GC, A, T, G, C.]
[Figure: Arabidopsis. Base content (10-bp window) against cDNA coordinate (100 bp), from -4 to 1; y-axis: base fraction, 0 to 0.5; curves: GC, A, T, G, C.]
[Figure: Human. Base content (10-bp window) against cDNA coordinate (100 bp), from -4 to 1; y-axis: base fraction, 0 to 0.7; curves: GC, A, T, G, C.]
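The profiles in these figures can be approximated with a short sliding-window calculation. The sketch below is one plausible reading of "averaged over all cDNAs with a 10-bp sliding window", not the authors' actual procedure: it assumes all input sequences have equal length and are already aligned so that the same index corresponds to the same position relative to the TSS, and it slides the window one base at a time. The name `windowed_base_content` and the toy input are hypothetical.

```python
def windowed_base_content(seqs, window=10, bases="ACGT"):
    """Mean per-window base fractions across TSS-aligned sequences.

    Assumptions: every sequence has the same length, index i is the same
    TSS-relative position in all of them, and the window moves in 1-bp steps.
    Returns a dict mapping each base (plus "GC") to a list with one mean
    fraction per window start.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    if not 0 < window <= length:
        raise ValueError("window must be between 1 and the sequence length")

    n_windows = length - window + 1
    profile = {b: [0.0] * n_windows for b in bases}
    for s in seqs:
        s = s.upper()
        for start in range(n_windows):
            chunk = s[start:start + window]
            for b in bases:
                profile[b][start] += chunk.count(b) / window

    # Average the per-sequence fractions over all sequences.
    for b in bases:
        profile[b] = [v / len(seqs) for v in profile[b]]
    if "G" in profile and "C" in profile:
        profile["GC"] = [g + c for g, c in zip(profile["G"], profile["C"])]
    return profile


if __name__ == "__main__":
    # Hypothetical toy input: two short "promoter" sequences of equal length.
    toy = ["AAAACCCC", "AAAAGGGG"]
    result = windowed_base_content(toy, window=4)
    print(result["A"])
    print(result["GC"])
```

Plotting each list against the window position (shifted so that the TSS sits at 0) would give curves of the kind shown for rice, Arabidopsis, and human.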