We can now use the PSSM to search a database for other proteins that have the BLOCK (or motif).

    ….GDSFHYFVSHG….
    ….GDAFHYYISFG….
    ….GDSYHYFLSFG….
    ….SDSFHYFMSFG….
    ….GDSFHFFASFG….

Problem 1: We need to think about what kind of information is contained within the PSSM. This leads to the concepts of information content and entropy.

Problem 2: The PSSM must accurately represent the expected BLOCK or motif, and we have only limited amounts of data! Is it a good statistical sampling of the BLOCK/motif? Is it too narrow because of the small dataset? Should we broaden it by adding extra amino acids, chosen using some type of randomization scheme (called adding pseudocounts)? If so, how many should we add?
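To make the information content and entropy idea from Problem 1 concrete, here is a minimal sketch that computes the Shannon entropy of each column of the BLOCK above. It assumes a uniform background over the 20 amino acids, so the maximum information per position is log2(20) bits, and it applies no small-sample correction. The function names `column_entropy` and `information_content` are illustrative and do not come from the slides.

```python
import math
from collections import Counter

# The five-sequence BLOCK shown on this slide (11 positions, no gaps).
BLOCK = [
    "GDSFHYFVSHG",
    "GDAFHYYISFG",
    "GDSYHYFLSFG",
    "SDSFHYFMSFG",
    "GDSFHFFASFG",
]


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_content(column, alphabet_size=20):
    """Bits of information: maximum entropy minus observed entropy.

    Assumption: uniform background over `alphabet_size` residues.
    """
    return math.log2(alphabet_size) - column_entropy(column)


# zip(*BLOCK) walks the alignment column by column.
for position, column in enumerate(zip(*BLOCK), start=1):
    residues = "".join(column)
    print(position, residues, round(information_content(residues), 2))
```

A fully conserved column (position 2, all D) carries the maximum log2(20) bits, while a column where all five residues differ (position 8) carries much less. That is one way to quantify which positions of the motif matter most.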
Finding patterns (i.e. motifs and domains) in Multiple Sequence Analysis

Block Analysis, Position Specific Scoring Matrices (PSSM)

Build an MSA from groups of related proteins.

BLOCKS represent a conserved region in that MSA that is lacking in gaps, i.e. no insertions/deletions.

The BLOCKS are typically anywhere from 3-60 amino acids long and are based on exact amino acid matches, i.e. the alignment will tolerate mismatches but doesn't use any kind of PAM or BLOSUM matrix. In fact, the BLOCKS are used to generate the BLOSUM matrix!

These BLOCKS may be whole domains, short sequence motifs, key parts of enzyme active sites, etc.
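As a rough illustration of the "lacking in gaps" criterion, the sketch below finds maximal runs of gap-free columns in an MSA. It is a simplification and not the actual BLOCKS procedure: it checks only for gaps, not for conservation. The toy alignment, the `-` gap character, and the function name `ungapped_runs` are all assumptions made for the example.

```python
def ungapped_runs(msa, min_len=3, gap="-"):
    """Return (start, end) column ranges (end exclusive) with no gaps.

    Only runs of at least `min_len` columns are kept, echoing the
    3-60 residue range quoted for BLOCKS.
    """
    width = len(msa[0])
    if any(len(seq) != width for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    runs = []
    start = None
    # Iterate one column past the end so a run touching the last column is closed.
    for i in range(width + 1):
        gap_free = i < width and all(seq[i] != gap for seq in msa)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Hypothetical toy alignment, invented for this example.
toy_msa = [
    "MK-GDSFHQFVSHG--A",
    "MKTSDAFHQYISFGLLA",
    "M--GDSYWNFLSFG-KA",
]

for start, end in ungapped_runs(toy_msa):
    print(start, end, [seq[start:end] for seq in toy_msa])
```

A real block finder would also require the columns to be conserved. This sketch isolates only the no-insertions/deletions condition.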
Position Specific Scoring Matrices (PSSM)

    Position  12345....11
          ….GDSFHQFVSHG….
          ….SDAFHQYISFG….
          ….GDSYWNFLSFG….
          ….SDSFHQFMSFG….
          ….GDSYWNYASFG….

This BLOCK might represent some small part of a modular protein domain, or might represent a motif for something like a phosphorylation site on the S in position 9.

Now build a matrix with the 20 amino acids as the columns and 11 rows, one for each position in the BLOCK.
Each matrix entry is the Log(frequency of the amino acid occurrence) at that position in the BLOCK.

    Position  12345....11
          ….GDSFHQFVSHG….
          ….SDAFHQYISFG….
          ….GDSYWNFLSFG….
          ….SDSFHQFMSFG….
          ….GDSYWNYASFG….

    Position   A   C   D        E   F   G        H   I   K   ….   S        T   ….
    1                                   Log(3)                    Log(2)
    2                  Log(5)
    3
    4
    5

The filled entries are counts out of the 5 sequences: at position 1, G occurs 3 times and S occurs 2 times; at position 2, D occurs in all 5 sequences.
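The matrix construction can be sketched in a few lines. This is a minimal illustration that follows the slide in taking the log of the raw count at each position. The natural log is an assumption, since the slide does not state a base. Residues never seen at a position are stored as `None`, because log(0) is undefined. The names `column_counts` and `log_count_matrix` are illustrative only.

```python
import math
from collections import Counter

# The five-sequence BLOCK from the slide (11 positions, no gaps).
BLOCK = [
    "GDSFHQFVSHG",
    "SDAFHQYISFG",
    "GDSYWNFLSFG",
    "SDSFHQFMSFG",
    "GDSYWNYASFG",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 matrix columns


def column_counts(block):
    """Count how often each residue occurs at each position of the block."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("a BLOCK has no gaps, so all rows must be equal length")
    return [Counter(seq[i] for seq in block) for i in range(length)]


def log_count_matrix(block):
    """One row per position, one entry per amino acid: log(count) or None."""
    matrix = []
    for counts in column_counts(block):
        row = {
            aa: math.log(counts[aa]) if counts[aa] > 0 else None
            for aa in AMINO_ACIDS
        }
        matrix.append(row)
    return matrix


matrix = log_count_matrix(BLOCK)
# Position 1 (index 0): G seen 3 times, S seen 2 times, as in the table.
print(matrix[0]["G"], matrix[0]["S"])
# Position 2 (index 1): D seen 5 times.
print(matrix[1]["D"])
```

Most entries come out as `None` with only five sequences. That sparsity is the reason the sampling question matters for such a small dataset.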
We can now use the PSSM to look for the BLOCK (motif) in single proteins, or use the PSSM to search a database for other proteins that have the BLOCK (or motif).

    ….GDSFHQFVSHG….
    ….SDAFHQYISFG….
    ….GDSYWNFLSFG….
    ….SDSFHQFMSFG….
    ….GDSYWNYASFG….

Problem 1: The PSSM must accurately represent the expected BLOCK or motif, and we have only limited amounts of data! Is it a good statistical sampling of the BLOCK/motif? Is it too narrow because of the small dataset? Should we broaden it by adding extra amino acids, chosen using some type of randomization scheme (called adding pseudocounts)? If so, how many should we add?

Problem 2: We need to think about what kind of information is contained within the PSSM. This leads to the concepts of information content and entropy.
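One simple way to explore the pseudocount question is sketched below. It adds the same pseudocount to every amino acid at every position, which is an assumption: the slides do not specify a scheme, and real tools often use background-weighted or substitution-based pseudocounts. Scores are log-odds in bits against a uniform background, also an assumption. The query sequence and the names `build_pssm`, `scan`, and `best_hit` are hypothetical.

```python
import math
from collections import Counter

BLOCK = [
    "GDSFHQFVSHG",
    "SDAFHQYISFG",
    "GDSYWNFLSFG",
    "SDSFHQFMSFG",
    "GDSYWNYASFG",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Log-odds PSSM with a flat pseudocount added to every residue.

    Assumptions: uniform background (1/20) and pseudocount > 0,
    so that no frequency is zero and every log is defined.
    """
    n_seqs = len(block)
    background = 1.0 / len(AMINO_ACIDS)
    denominator = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pssm):
    """Highest-scoring window as a (start, score) pair."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])


pssm = build_pssm(BLOCK, pseudocount=1.0)
query = "MKT" + "GDSFHQFVSHG" + "AAK"  # hypothetical protein containing the motif
start, score = best_hit(query, pssm)
print(start, round(score, 2))
```

Raising `pseudocount` flattens the matrix, so the PSSM tolerates more unseen residues but discriminates less. Lowering it toward zero makes the PSSM match only what the five sequences already contain. Trying a few values on known positives and negatives is one practical way to approach the "how many should we add?" question.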