Presence of a Start Codon

Some expression vectors provide the start codon for translation initiation, while others rely on the start codon of the gene you’re trying to express. Note that in E. coli, 5 to 12 base pairs or fewer separate the ribosome binding site and the start codon, so build this spacing requirement into your cloning strategy when the start codon is provided by the gene you plan to express.

GC Content

Coding sequences with high GC content (>70%) may reduce the level of expression of a protein in E. coli. Check the sequence using a DNA analysis program.

Codon Usage

Codon usage may also affect the level of protein expression. If the gene of interest contains codons not commonly used in E. coli, low expression may result due to the depletion of tRNAs for the rarer codons. When one or more rare codons are encountered, translational pausing may result, slowing the rate of protein synthesis and exposing the mRNA to degradation. This potential problem is of particular concern when the sequence encodes a protein >60 kDa, when rare codons are found at high frequency, or when multiple rare codons are found over a short distance of the coding sequence. For example, rare codons for arginine found in tandem can create a recognition sequence for ribosome binding (e.g., AGGAGG) that closely approximates the Shine-Dalgarno sequence UAAGGAGG. This may bind ribosomes nonproductively and block translation from the bona fide ribosome binding site (RBS) at the initiator codon further upstream.

Nonetheless, the appearance of a rare codon does not necessarily lead to poor expression. It is best to try expression of the native gene first, and then make changes later if they seem warranted. Strategies include mutating the gene of interest to use optimal codons for the host organism, and co-transforming the host with rare tRNA genes. In one example, introduction into the E. coli host of a rare arginine (AGG) tRNA resulted in a several-fold increase in the expression of a protein that uses the AGG codon (Hua et al., 1994). In another case, substitution of the rare arginine codon AGG with the E. coli-preferred CGU improved expression (Robinson et al., 1984). Other work has shown that rare codons account for decreased expression of the gene of interest in E. coli (Zhang, Zubay, and Goldman, 1991; Sorensen, Kurland, and Pederson, 1989). Rare codons may have an even more dramatic
effect on translation when they occur close to the initiator codon (Chen and Inouye, 1990). While codon usage is neither the only nor the most important factor, be aware that it may influence translation efficiency.

Secondary Structure

Secondary structures that occur near the start codon may block translation initiation (Gold et al., 1981; Buell et al., 1985), or serve as translation pause sites, resulting in premature termination and truncated protein. These can be found using DNA or RNA analysis software. Structures with clear stems longer than eight bases may be disrupted by site-specific mutation or by making all or a portion of the coding sequence synthetically.

Depending on the size of the gene and the importance of obtaining high expression levels, it may be worth synthesizing the gene. This has generally been done by synthesizing overlapping oligonucleotides that, once annealed, can be extended using PCR and ligated to form the full-length coding sequence. There are several examples where this approach has been used to optimize codon usage for E. coli (Koshiba et al., 1999; Beck von Bodman et al., 1986). In addition, if one takes on the work and expense of synthesizing a gene, secondary structures in the predicted RNA that might stall translation can be removed, and sites for restriction endonucleases can be introduced.

Size of a Gene or Protein

As a rule, very large (>100 kDa) and very small (<5 kDa) proteins are more difficult to express in E. coli. Small polypeptides with little secondary structure tend to be rapidly degraded in E. coli. Degradation can be minimized by expressing such short oligopeptides as concatemers with proteolytic or chemical cleavage sites between the monomeric units (Hostomsky, Smrt, and Paces, 1985). Short peptides are also successfully expressed as fusion proteins. Fusion with GST, MalE (maltose-binding protein, MBP), or other larger, well-folded partners will tend to stabilize a short peptide, making expression possible and purification relatively simple. One publication has shown MBP to be superior to other large fusion proteins at stabilizing short polypeptides (Kapust and Waugh, 1999). At the other extreme, proteins above 60 kDa are best made using smaller affinity tags, such as FLAG or His6, or on their own, without any fusion. While there is no clear upper limit, the larger the protein, the lower the yield is likely to be.
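The sequence checks described above (GC content, rare codons, tandem rare codons, and rare codons near the initiator codon) can be approximated with a few lines of Python before turning to a full DNA analysis program. This is only a minimal sketch under stated assumptions: the 70% GC limit comes from the text, but the rare-codon set (only AGG is named in the text), the `near_start` window of 10 codons, the function names, and the example sequence are all illustrative choices, not established cutoffs.

```python
# Illustrative set of codons often described as rare in E. coli (DNA alphabet).
# Only AGG is named in the text; the other entries are an assumption for this sketch.
RARE_CODONS = {"AGG", "AGA", "CGA", "CTA", "ATA", "CCC", "GGA"}


def gc_fraction(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def rare_codon_positions(cds, rare=RARE_CODONS):
    """Return 0-based codon indices of rare codons in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons) if codon in rare]


def tandem_runs(positions, min_run=2):
    """Group consecutive codon indices into runs of at least min_run codons."""
    runs, current = [], []
    for pos in positions:
        if current and pos == current[-1] + 1:
            current.append(pos)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = [pos]
    if len(current) >= min_run:
        runs.append(current)
    return runs


def precheck(cds, gc_limit=0.70, near_start=10):
    """Summarize GC content and rare-codon features for one coding sequence.

    gc_limit follows the >70% figure in the text; near_start is a
    hypothetical window (in codons) used to flag rare codons close to
    the initiator codon.
    """
    positions = rare_codon_positions(cds)
    gc = gc_fraction(cds)
    return {
        "gc_fraction": gc,
        "high_gc": gc > gc_limit,
        "rare_codons": positions,
        "tandem_rare": tandem_runs(positions),
        "rare_near_start": [p for p in positions if p < near_start],
    }


if __name__ == "__main__":
    # Hypothetical coding sequence containing a tandem AGG AGG pair.
    example = "ATGAAAAGGAGGCTGGCGCCCTAA"
    for key, value in precheck(example).items():
        print(key, value)
```

A flagged sequence is not necessarily a problem; as noted above, it is usually worth trying the native gene first and treating a report like this as a list of candidates for later changes.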
What Do You Know about Your Protein?

Cysteines

There are many things that E. coli does not do well, or at all. If the protein of interest is naturally multimeric, or requires post-translational modifications for activity, E. coli may be a poor choice of expression host. Disulfide bonds, formed between two cysteines in an expressed protein, are made inefficiently in the reducing environment of the E. coli cytoplasm (Bessette et al., 1999; Derman et al., 1993). If the protein is produced and can be purified from E. coli, in vitro oxidation of the cysteines may be tried (Dodd et al., 1995). Alternatively, the gene of interest can be cloned in a vector that includes a signal sequence (e.g., OmpA, geneIII, or phoA) that will direct the recombinant protein to the relatively oxidizing environment of the E. coli periplasm, where disulfide formation is more efficient. Strains of E. coli that are deficient in thioredoxin reductase (trxB) permit proper disulfide formation in the cytoplasm (Derman et al., 1993; Yasukawa et al., 1995). Subsequent work has produced strains that lack both trxB and glutathione oxidoreductase and give better rates of disulfide formation than those seen in native E. coli periplasm (Bessette et al., 1999).

Membrane Bound

If the protein to be expressed is naturally associated with membrane and/or has at least one transmembrane domain, addition of a secretion signal to the amino terminus may help to maximize expression of functional protein. Signal sequences, about 20 residues long, are derived from proteins that are naturally secreted into the periplasmic space, such as pelB, OmpA, OmpT, MalE, alkaline phosphatase (phoA), or geneIII of filamentous phage (Izard and Kendall, 1994). Protein with an amino-terminal signal will be directed to the inner membrane of E. coli, and the carboxy-terminal portion of the protein will be translocated into the periplasmic space. Depending on the hydrophobicity of the protein of interest, it may not translocate entirely into the periplasm but remain associated with the inner membrane.

Secretion may help protect proteins from proteolytic attack (Pines and Inouye, 1999), or at least can reduce aggregation of hydrophobic proteins in the cytoplasm and minimize inclusion body formation. Because of the oxidizing environment of the periplasmic space, proteins that contain one or more disulfide bonds are best secreted. The presence of an N-terminal signal sequence appears to
be necessary but not sufficient to direct a target protein to the periplasm. Translocation across the outer membrane and into the growth medium is inefficient. In most cases, target proteins found in the growth medium are the result of damage to the cell envelope and do not represent true secretion (Stader and Silhavy, 1990). Translocation across the inner cell membrane of E. coli is incompletely understood (reviewed by Wickner, Driessen, and Hartl, 1991), and the efficiency of export will depend on the individual target protein. Currently, export cannot be predicted from protein sequence, although some generalizations have been made about the sequence immediately following the signal peptide (Boyd and Beckwith, 1990; Yamane and Mizushima, 1988). It is therefore possible to find target proteins in the cytoplasm (with uncleaved signal sequence) or in the periplasm in partially processed form, in place of or in addition to the expected periplasmic processed species. In some cases the proportion of protein that is exported can be increased by lowering the temperature to 15 to 30°C during induction.

Post-translational Modification

E. coli does not glycosylate or phosphorylate proteins or recognize proteolytic processing signals from eukaryotes, so take this into account when designing the cloning strategy. If proteolytic processing is needed, it is best to express only the coding sequence for the fully processed protein. If the protein of interest requires glycosylation for activity, and full activity is important in the final use, consider a eukaryotic host, such as Pichia, insect cells, or mammalian cells.

Is the Protein Potentially Toxic?

Consider whether the protein of interest is likely to have a toxic effect on the host cell. Where the function of the protein is known, this can be guessed at with some accuracy. For example, nonspecific proteases, nucleases, or pore-forming membrane proteins might all be expected to have some toxic effect on E. coli. Expression of toxic proteins may be very low, and there will be strong selective pressure on cells to eliminate the gene of interest by point mutation to change the translation frame, insertion of a stop codon, or change in an amino acid residue critical to the protein’s function. Larger deletions of parts of the plasmid may also be seen. If there is a suggestion that the gene product will be toxic, use an expression vector with a tightly regulated promoter (e.g., T7, pET
vectors). Minimize propagation of the cells to avoid opportunities for mutation and recombination.

Must Your Protein Be Functional?

Each requirement placed on a recombinant protein will affect the choice of expression system. If a protein is to be used only to prepare antibody, it need not be soluble or active, and the production of inclusion bodies (aggregates of improperly folded protein) in E. coli may be all that is needed. Alternatively, if a protein’s biological activity will be assayed, or if it is to be used in structural studies (NMR, crystallography, etc.), a properly folded and soluble form will be required.

Will Structural Changes (Additional or Fewer Amino Acids) Affect Your Application?

Depending on the way that a gene is inserted in an expression vector, additional sequences may be added to the clone, and these may lead to extra amino acid residues at the N- or C-terminus of the final expressed protein. In many cases these will have no deleterious effect, but if structural studies or precise comparisons to a native protein are to be done, it is wise to eliminate amino acids added by cloning steps. PCR amplification is the most commonly used method to generate inserts for expression, and proper design of PCR primers can eliminate most or all additional residues in the protein.

Is the Sequence of Your Protein Recognized by Specific Proteases?

If you plan to express your gene in a fusion vector that provides an internal protease cleavage site for removal of the affinity tag (discussed below), check that your native protein is not recognized by the protease. Most proteases are highly specific, but thrombin has a variety of secondary cleavage sites (Chang, 1985).

Advertisements for Commercial Expression Vectors Are Very Promising. What Levels of Expression Should You Expect?

There are several systems available for protein expression: mammalian, insect, yeast, and E. coli. While it is impossible to predict the yield from these systems for any given protein, some rough guidelines can be given. For any vector it is possible that no expression will be seen! Reported yields in stably transfected mammalian cells are in the range of 1 to 100 µg/10^6