The genetic code is nearly optimal for allowing additional information within protein-coding sequences
Shalev Itzkovitz and Uri Alon
Genome Res. 2007 17: 405-412; originally published online February 9, 2007
Access the most recent version at doi:10.1101/gr.5987307

Supplemental Material: http://genome.cshlp.org/content/suppl/2007/02/09/gr.5987307.DC1.html
References: This article cites 36 articles, 14 of which can be accessed free at: http://genome.cshlp.org/content/17/4/405.full.html#ref-list-1
Article cited in: http://genome.cshlp.org/content/17/4/405.full.html#related-urls
Open Access: Freely available online through the Genome Research Open Access option.
To subscribe to Genome Research go to: http://genome.cshlp.org/subscriptions

Copyright © 2007, Cold Spring Harbor Laboratory Press
Downloaded from genome.cshlp.org on June 23, 2011 - Published by Cold Spring Harbor Laboratory Press
Letter

The genetic code is nearly optimal for allowing additional information within protein-coding sequences

Shalev Itzkovitz (1,2) and Uri Alon (1,2,3)

(1) Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel; (2) Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

(3) Corresponding author. E-mail uri.alon@weizmann.ac.il; fax 972-8-934125. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5987307. Freely available online through the Genome Research Open Access option. Genome Research 17:405–412 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org

DNA sequences that code for proteins need to convey, in addition to the protein-coding information, several different signals at the same time. These "parallel codes" include binding sequences for regulatory and structural proteins, signals for splicing, and RNA secondary structure. Here, we show that the universal genetic code can carry arbitrary parallel codes much more efficiently than the vast majority of other possible genetic codes. This property is related to the identity of the stop codons. We find that the ability to support parallel codes is strongly tied to another useful property of the genetic code: minimization of the effects of frame-shift translation errors. Whereas many of the known regulatory codes reside in nontranslated regions of the genome, the present findings suggest that protein-coding regions can readily carry abundant additional information.

[Supplemental material is available online at www.genome.org.]

The genetic code is the mapping of 64 three-letter codons to 20 amino acids and a stop signal (Woese 1965; Crick 1968; Knight et al. 2001). The genetic code has been shown to be nonrandom in at least two ways. First, the assignment of amino acids to codons appears to be optimal for minimizing the effect of translational misread errors. This optimality is achieved by mapping close codons (codons that differ by one letter) either to the same amino acid or to chemically related ones (Woese 1965). This feature has been attributed to adaptive selection of a code in which errors that misread a codon by one letter have minimal effects on the translated protein (Freeland and Hurst 1998; Freeland et al. 2000; Gilis et al. 2001; Wagner 2005b). Second, amino acids with simple chemical structure tend to have more codons assigned to them (Hasegawa and Miyata 1980; Dufton 1997; Di Giulio 2005).

There exist a large number of alternative genetic codes that are equivalent to the real code in these two prominent features (Fig. 1). Here we ask whether the real code stands out among these alternative codes as being optimal for other properties.

We consider the ability of the genetic code to support, in addition to the protein-coding sequence, additional information that can carry biologically meaningful signals. These signals can include binding sequences of regulatory proteins that bind within coding regions (Robison et al. 1998; Stormo 2000; Lieb et al. 2001; Kellis et al. 2003). Such binding sites are typically sequences of length 6–20 bp. In addition to regulatory proteins, there are binding sites of structural proteins such as DNA- and mRNA-binding proteins (Draper 1999). Histones, for example, bind with a code that has a periodicity of about 10 bp over a site of about 150 bp (Satchwell et al. 1986; Trifonov 1989; Segal et al. 2006). Other codes include splicing signals (Cartegni et al. 2002), which include specific 6–8 bp sequences within coding regions, and mRNA secondary structure signals (Zuker and Stiegler 1981; Shpaer 1985; Konecny et al. 2000; Katz and Burge 2003). The latter often correspond to sequences of several dozen base pairs or longer. Since we do not know all of these additional codes, and different organisms can use a vast array of different codes, we tested the ability of the genetic code to support arbitrary sequences of any length in parallel to the protein-coding sequence.

We find that the universal genetic code can allow arbitrary sequences of nucleotides within coding regions much better than the vast majority of other possible genetic codes. We further find that the ability to support parallel codes is strongly correlated with an additional property: minimization of the effects of frame-shift translation errors. Selection for either or both of these traits may have helped to shape the universal genetic code.

Figure 1. Alternative genetic codes. (A) The real code. (B) An alternative code obtained by an A↔G permutation in the first position. (C) An alternative code obtained by an A↔C permutation in the second position, and (D) A↔G permutation in the third position. Stop codons are marked in red, start (Met) codons in green. Codons that are changed relative to the real code are in gray. There are 4! × 4! × 2 = 1152 alternative codes obtained by independent permutations of the nucleotides in each of the three codon positions. (E,F) Structural equivalence of real and alternative genetic codes. For example, (E) the nine neighboring codons of the Valine codon marked with a red arrow in the real code (shown in A) are the same as (F) the nine neighboring codons of the Valine codon marked with a red arrow in the alternative code shown in B. Solid lines connect codons differing in the first letter, dotted lines connect codons differing in the second letter, and dashed lines connect codons differing in the third letter. Different amino acids are displayed in different colors. This equivalence applies to all codons.

Results

Ability to include additional sequences

We first considered the ability of the genetic code to support, in addition to the protein-coding sequence, additional sequences that can carry biological signals. For this purpose, we studied the properties of all alternative genetic codes that share the known optimality features of the real code (Fig. 1). Each alternative code has the same number of codons for each amino acid and the same impact of misread errors as the real code.

We tested the ability of the genetic codes to include arbitrary sequences, denoted n-mers, within protein-coding regions. As an example, consider the 5-mer "UGACA." This sequence may be a protein-binding site, which should appear within a protein-coding region. This 5-mer sequence can appear within a coding sequence in one of the three reading frames: UGA|CAN, NNU|GAC|ANN, or NUG|ACA, where N denotes any nucleotide and the vertical lines separate consecutive codons. To assess the probability that this 5-mer appears in a coding region, one needs to sum over the three possible reading frames (Fig. 2A). In one of the frames, this sequence generates a stop codon, UGA. The 5-mer cannot appear in a coding region in this frame, because
coding regions have no in-frame stop codons. The sequence can, however, appear in one of the two other frames. Overall, the probability that this 5-mer appears in coding regions will tend to be lower than that of 5-mers that do not include stop codons.

Figure 2. (A) Calculation of the probability that an n-mer sequence appears within a protein-coding region in the real genetic code. The 5-mer sequence S = UGACA can appear in one of the three reading frames. For each reading frame, the probabilities of all three codon combinations that contain S are summed up. Codon combinations with an in-frame stop (such as UGA) do not contribute to the n-mer probability since they cannot appear in a coding region. Vertical lines separate consecutive codons, stop codons are in red, and P0, P-1, P+1 denote the probabilities of encountering S in the 0/-1/+1 frame. (B,C,D) Three examples of "difficult" n-mers in the real code and in alternative codes. (B) The 5-mer UGACA, which includes the stop codon UGA, can appear in a protein-coding sequence with the real genetic code in only two of the three possible reading frames (+1 and -1 frames). (C) In the alternative code shown in Figure 3D, whose stop codon AAA overlaps with itself, the 5-mer AAAAA cannot appear in a protein-coding sequence in any of the three reading frames. (D) In an alternative code with the overlapping stop codons CCG and CGG, the 5-mer CCGGU can only appear in one reading frame. The 5-mers are in bold text, stop codons are in red, N denotes any DNA letter, a green check mark denotes a frame in which the n-mer can appear, and a red x denotes a frame in which the n-mer cannot appear. (E) Distribution of the probabilities of all 6-mers in the real code (bold black line) and in the alternative codes (light blue lines). The x-axis is the probability of obtaining 6-mers within protein-coding sequences; the y-axis is the number of 6-mers with this probability. In the real code there are significantly fewer "difficult" 6-mers (with low probabilities), relative to the alternative codes. (F) The fraction of n-mers that have a higher probability in the real code than in alternative codes increases with n-mer size. The y-axis shows the fraction of n-mers for which the average probability of appearing in the real genetic code is significantly higher than in the alternative codes.

Each genetic code has n-mer sequences, such as the above-mentioned sequence UGACA in the real genetic code, which are difficult to include in coding regions: these "difficult" sequences contain stop codons, and thus cannot appear in at least one of the three frames, since protein-coding regions do not contain stop codons. We find that the real genetic code is able to include even the most difficult n-mers because it has a special property: its stop codons, when frame shifted, tend to form abundant codons. Hence, n-mers that cannot be included in one frame shift can be included with high probability in other frame shifts.

To understand the relation between the stop codons and the ability of the genetic code to include arbitrary n-mers, consider the 5-mer S = AAAAA (Fig. 2C). This 5-mer can appear within a coding sequence in one of the three reading frames: AAA|AAN, NNA|AAA|ANN, or NAA|AAA. Alternative genetic codes that assign one of their stop codons as AAA (Fig. 3D) can never include S in a protein-coding sequence. The problem is that the stop codon AAA overlaps with itself when frame shifted; hence, strings such as S include a stop codon in each of the three frames, precluding their presence in a coding region.

Another example is the 5-mer S = CCGGU. In an alternative code with stop codons CCA, CCG, and CGG, this n-mer can only appear in one of the three reading frames (Fig. 2D). This is because two of the stop codons, CCG and CGG, overlap each other. In contrast, the real genetic code has the stop codons UAA, UAG, and UGA that do not overlap with themselves or with each other, no matter how they are frame shifted.
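The three worked examples (UGACA, AAAAA, CCGGU) can be checked with a short sketch. It pads an n-mer with N at each of the three codon offsets, discards any frame that contains an in-frame stop codon, and averages the per-frame probabilities. The function names are hypothetical, and the probability model is an assumption made only for illustration: every sense codon is treated as equally likely and codons are independent, whereas the article's Figure 2A uses codon probabilities that are not reproduced here. The numeric probabilities will therefore differ from those in the figure; the count of usable frames does not depend on that assumption.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def framed_templates(nmer):
    """Pad the n-mer with N so that it starts at codon offset 0, 1, or 2."""
    templates = []
    for lead in range(3):
        trail = -(lead + len(nmer)) % 3  # pad the tail up to a codon boundary
        templates.append("N" * lead + nmer + "N" * trail)
    return templates


def frame_probability(template, stops):
    """Probability of one padded template.

    ASSUMPTION: uniform usage of sense codons and independent codons.
    A fully specified in-frame stop codon makes the probability zero.
    """
    n_sense = 64 - len(stops)
    prob = 1.0
    for i in range(0, len(template), 3):
        slots = [NUCLEOTIDES if ch == "N" else ch for ch in template[i:i + 3]]
        matches = ["".join(p) for p in product(*slots)]
        prob *= sum(c not in stops for c in matches) / n_sense
    return prob


def nmer_frame_probabilities(nmer, stops):
    """Per-frame probabilities, ordered by the number of leading N (0, 1, 2)."""
    return [frame_probability(t, stops) for t in framed_templates(nmer)]


def nmer_probability(nmer, stops):
    """Average of the three per-frame probabilities."""
    return sum(nmer_frame_probabilities(nmer, stops)) / 3


def usable_frames(nmer, stops):
    """Number of reading frames in which the n-mer avoids an in-frame stop."""
    return sum(p > 0 for p in nmer_frame_probabilities(nmer, stops))


REAL_STOPS = {"UAA", "UAG", "UGA"}

print(usable_frames("UGACA", REAL_STOPS))             # 2
print(usable_frames("AAAAA", {"AUA", "AUG", "AAA"}))  # 0
print(usable_frames("CCGGU", {"CCA", "CCG", "CGG"}))  # 1
print(nmer_probability("UGACA", REAL_STOPS))
```

Replacing the uniform `1 / n_sense` weight with measured codon frequencies would bring the sketch closer to the calculation described in the Figure 2 caption; that substitution is left out because the frequencies are not given in this section.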
Furthermore, frame-shifted versions of the real stop codons overlap with the codons of the most abundant amino acids. For example, the UGA stop codon in a -1 frame-shift message results in the di-codon NNU|GAN, where N is any nucleotide (Fig. 2B). The GAN codons encode Asp and Glu, which are among the three amino acids

Figure 3. Optimality of the genetic code for minimizing the impact of frame-shift translation errors. (A) Distribution of average number of translated codons until a stop codon is encountered after a frame-shift event for the alternative genetic codes. This number corresponds to the mean length of the nonsense polypeptide translated after a frame-shift event, and is the inverse of the frame-shifted stop probability, averaged over the +1 and -1 frame-shifts. (B) In the real code, frame-shifted stop codons overlap with abundant codons. Codons with two-letter overlap with a stop codon are marked by + for a +1 frame-shift and - for a -1 frame-shift. Abundant codons are shown in heavier font. For example, the stop codon UAA, when frame shifted, results in codons such as AAN (green box), or NUA (blue boxes), which are relatively abundant. (C) The "best code," which achieves the highest frame-shifted stop probability both in a +1 frame-shift and in a -1 frame-shift. Stop codons are CAA, CAG, and CGA. In the "best code," a stop codon has an overlap of two positions with codons of Glycine instead of codons of Serine and Arginine in the real code. (D) The "worst code" with the lowest frame-shifted stop probability. Stop codons are AUA, AUG, and AAA. Note that the stop codons overlap either with themselves (AAA) or with codons for nonabundant amino acids (those with light font), in contrast to B and C.
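The quantity in Figure 3A can be illustrated with a minimal sketch. It estimates the probability that a codon read after a +1 or -1 frame shift is a stop codon, then inverts the average to get a mean nonsense-polypeptide length. Several points are assumptions rather than the authors' procedure: the function names are hypothetical, sense codons are taken as uniformly used and independent of their neighbors, and the two shift probabilities are averaged before inverting. The sign convention follows the text's example, in which a -1 shift reads UGA from the di-codon NNU|GAN.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def shifted_stop_probability(stops, shift):
    """Probability that the codon read after a frame shift is a stop.

    shift=+1 reads the last two letters of one codon plus the first
    letter of the next; shift=-1 reads the last letter of one codon
    plus the first two letters of the next.
    ASSUMPTION: uniform, independent usage of sense codons.
    """
    sense = [c for c in CODONS if c not in stops]
    hits = 0
    for a, b in product(sense, repeat=2):
        read = a[1:] + b[0] if shift == 1 else a[2] + b[:2]
        hits += read in stops
    return hits / len(sense) ** 2


def mean_nonsense_length(stops):
    """Inverse of the stop probability averaged over +1 and -1 shifts."""
    p = (shifted_stop_probability(stops, 1)
         + shifted_stop_probability(stops, -1)) / 2
    return float("inf") if p == 0 else 1 / p


REAL_STOPS = {"UAA", "UAG", "UGA"}
WORST_STOPS = {"AUA", "AUG", "AAA"}  # the "worst code" of Figure 3D

print(mean_nonsense_length(REAL_STOPS))
print(mean_nonsense_length(WORST_STOPS))
```

Under the uniform-usage assumption the real code already reaches a stop sooner than the "worst code," because the stop codon AAA overlaps with itself and removes sense codons that could otherwise produce a shifted stop. The uniform model cannot reproduce effects that depend on amino acid abundance, such as the difference between the real code and the "best code" of Figure 3C, which have the same value in this sketch.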