NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd
the fraction of reads matching the draft genome sequence should provide an estimate of genome coverage. In practice, the comparison is complicated by the need to allow for repeat sequences, the imperfect sequence quality of both the raw sequence and the draft genome sequence, and the possibility of polymorphism. Nonetheless, the analysis provides a reasonable view of the extent to which the genome is represented in the draft genome sequence and the public databases.

We compared the raw sequence reads against both the sequences used in the construction of the draft genome sequence and all of GenBank using the BLAST computer program. Of the 5,615 raw sequence reads analysed (each containing at least 100 bp of contiguous non-repetitive sequence), 4,924 had a match of ≥ 97% identity with a sequenced clone, indicating that 88 ± 1.5% of the genome was represented in sequenced clones. The estimate is subject to various uncertainties. Most serious is the proportion of repeat sequence in the remainder of the genome. If the unsequenced portion of the genome is unusually rich in repeated sequence, we would underestimate its size (although the excess would be composed of repeated sequence).

We examined those raw sequences that failed to match by comparing them to the other publicly available sequence resources. Fifty (0.9%) had matches in public databases containing cDNA sequences, STSs and similar data. An additional 276 (or 43% of the remaining raw sequence) had matches to the whole-genome shotgun reads discussed above (consistent with the idea that these reads cover about half of the genome).

We also examined the extent of genome coverage by aligning the cDNA sequences for genes in the RefSeq dataset (ref. 110) to the draft genome sequence. We found that 88% of the bases of these cDNAs could be aligned to the draft genome sequence at high stringency (at least 98% identity).
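As a quick check on the read-matching figures above, the following sketch recomputes the fractions from the reported counts. The variable and function names are illustrative only. The binomial standard error it prints reflects sampling error alone and comes out narrower than the ± 1.5% quoted above; how that interval was derived is not stated in the text, so it presumably also covers the other uncertainties the authors describe.

```python
import math

# Counts reported in the text.
TOTAL_READS = 5615      # raw reads analysed
MATCH_CLONES = 4924     # matched a sequenced clone at >= 97% identity
MATCH_CDNA_STS = 50     # unmatched reads found in cDNA/STS databases
MATCH_WGS = 276         # further reads matching whole-genome shotgun reads


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Fraction of reads represented in sequenced clones.
clone_cov = MATCH_CLONES / TOTAL_READS

# Fraction represented in any public resource.
public_cov = (MATCH_CLONES + MATCH_CDNA_STS + MATCH_WGS) / TOTAL_READS

# Share of the still-unmatched reads that hit the whole-genome shotgun set.
remaining = TOTAL_READS - MATCH_CLONES - MATCH_CDNA_STS
wgs_share = MATCH_WGS / remaining

print(f"sequenced clones:  {100 * clone_cov:.1f}%")
print(f"sampling SE only:  {100 * binomial_se(clone_cov, TOTAL_READS):.2f}%")
print(f"public databases:  {100 * public_cov:.1f}%")
print(f"WGS share of rest: {100 * wgs_share:.0f}%")
```

The combined figure comes out at roughly 93.5%, which the text reports as about 94%.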
(A few of the alignments with either the random raw sequence reads or the cDNAs may be to a highly similar region in the genome, but such matches should affect the estimate of genome coverage by considerably less than 1%, based on the estimated extent of duplication within the genome (see below).)

These results indicate that about 88% of the human genome is represented in the draft genome sequence and about 94% in the combined publicly available sequence databases. The figure of 88% agrees well with our independent estimates above that about 3%, 5% and 4% of the genome reside in the three types of gap in the draft genome sequence.

Finally, a small experimental check was performed by screening a large-insert clone library with probes corresponding to 16 of the whole-genome shotgun reads that failed to match the draft genome sequence. Five hybridized to many clones from different fingerprint clone contigs and were discarded as being repetitive. Of the remaining eleven, two fell within sequenced clones (presumably within sequence gaps of the first type), eight fell in fingerprint clone contigs but between sequenced clones (gaps of the second type) and one failed to identify clones in the fingerprint map (gaps of the third type) but did identify clones in another large-insert library. Although these numbers are small, they are consistent with the view that much of the remaining genome sequence lies within already identified clones in the current map.

Estimates of genome and chromosome sizes. Informed by this analysis of genome coverage, we proceeded to estimate the sizes of the genome and each of the chromosomes (Table 8). Beginning with the current assigned sequence for each chromosome, we corrected for the known gaps on the basis of their estimated sizes (see above). We attempted to account for the sizes of centromeres and heterochromatin, neither of which is well represented in the draft sequence. Finally, we corrected for around 100 Mb of artefactual duplication in the assembly. We arrived at a total human genome size estimate of around 3,200 Mb, which compares favourably with previous estimates based on DNA content.

We also independently estimated the size of the euchromatic portion of the genome by determining the fraction of the 5,615 random raw sequences that matched the finished portion of the human genome (whose total length is known with greater precision). Twenty-nine per cent of these raw sequences found a match among 835 Mb of nonredundant finished sequence. This leads to an estimate of the euchromatic genome size of 2.9 Gb (835 Mb ÷ 0.29 ≈ 2,900 Mb). This agrees reasonably with the prediction above based on the length of the draft genome sequence (Table 8).

Update. The results above reflect the data on 7 October 2000. New data are continually being added, with improvements being made to the physical map, new clones being sequenced to close gaps and draft clones progressing to full shotgun coverage and finishing. The draft genome sequence will be regularly reassembled and publicly released.

Currently, the physical map has been refined such that the number of fingerprint clone contigs has fallen from 1,246 to 965; this reflects the elimination of some artefactual contigs and the closure of some gaps. The sequence coverage has risen such that 90% of the human genome is now represented in the sequenced clones and more than 94% is represented in the combined publicly available sequence databases. The total amount of finished sequence is now around 1 Gb.

Figure 10 Screen shot from UCSC Draft Human Genome Browser. See http://genome.ucsc.edu/.

Figure 11 Screen shot from the Genome Browser of Project Ensembl. See http://www.ensembl.org.

Broad genomic landscape

What biological insights can be gleaned from the draft sequence? In this section, we consider very large-scale features of the draft genome sequence: the distribution of GC content, CpG islands and recombination rates, and the repeat content and gene content of the human genome. The draft genome sequence makes it possible to integrate these features and others at scales ranging from individual
nucleotides to collections of chromosomes. Unless noted, all analyses were conducted on the assembled draft genome sequence described above.

Figure 9 provides a high-level view of the contents of the draft genome sequence, at a scale of about 3.8 Mb per centimetre. Of course, navigating information spanning nearly ten orders of magnitude requires computational tools to extract the full value. We have created and made freely available various 'Genome Browsers'. Browsers were developed and are maintained by the University of California at Santa Cruz (Fig. 10) and the EnsEMBL project of the European Bioinformatics Institute and the Sanger Centre (Fig. 11). Additional browsers have been created; URLs are listed at www.nhgri.nih.gov/genome_hub. These web-based computer tools allow users to view an annotated display of the draft genome sequence, with the ability to scroll along the chromosomes and zoom in or out to different scales. They include: the nucleotide sequence, sequence contigs, clone contigs, sequence coverage and finishing status, local GC content, CpG islands, known STS markers from previous genetic and physical maps, families of repeat sequences, known genes, ESTs and mRNAs, predicted genes, SNPs and sequence similarities with other organisms (currently the pufferfish Tetraodon nigroviridis). These browsers will be updated as the draft genome sequence is refined and corrected and as additional annotations are developed.

In addition to using the Genome Browsers, one can download from these sites the entire draft genome sequence together with the annotations in a computer-readable format. The sequences of the underlying sequenced clones are all available through the public sequence databases. URLs for these and other genome websites are listed in Box 2. A larger list of useful URLs can be found at www.nhgri.nih.gov/genome_hub. An introduction to using the draft genome sequence, as well as associated databases and analytical tools, is provided in an accompanying paper (ref. 111).

In addition, the human cytogenetic map has been integrated with the draft genome sequence as part of a related project. The BAC Resource Consortium (ref. 103) established dense connections between the maps using more than 7,500 sequenced large-insert clones that had been cytogenetically mapped by FISH; the average density of the map is 2.3 clones per Mb. Although the precision of the integration is limited by the resolution of FISH, the links provide a powerful tool for the analysis of cytogenetic aberrations in inherited diseases and cancer. These cytogenetic links can also be accessed through the Genome Browsers.

Long-range variation in GC content

The existence of GC-rich and GC-poor regions in the human genome was first revealed by experimental studies involving density gradient separation, which indicated substantial variation in average GC content among large fragments. Subsequent studies have indicated that these GC-rich and GC-poor regions may have different biological properties, such as gene density, composition of repeat sequences, correspondence with cytogenetic bands and recombination rate (refs 112–117). Many of these studies were indirect, owing to the lack of sufficient sequence data.

The draft genome sequence makes it possible to explore the variation in GC content in a direct and global manner. Visual inspection (Fig. 9) confirms that local GC content undergoes substantial long-range excursions from its genome-wide average of 41%. If the genome were drawn from a uniform distribution of GC content, the local GC content in a window of size n bp should be $41 \pm \sqrt{(41)(59)/n}\,\%$. Fluctuations would be modest, with the standard deviation being halved as the window size is quadrupled: for example, 0.70%, 0.35%, 0.17% and 0.09% for windows of size 5, 20, 80 and 320 kb.
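The expected fluctuations quoted above follow directly from the binomial formula. The sketch below evaluates it for the four window sizes; the function name is illustrative, and the model assumes, as the text does, that every base is independently G or C with the genome-wide probability of 41%.

```python
import math

GENOME_GC_PERCENT = 41.0  # genome-wide average GC content given in the text


def expected_gc_sd_percent(window_bp, gc_percent=GENOME_GC_PERCENT):
    """Standard deviation (in percentage points) of GC content in a window of
    window_bp bases, if each base is independently G/C with probability
    gc_percent / 100."""
    return math.sqrt(gc_percent * (100.0 - gc_percent) / window_bp)


for kb in (5, 20, 80, 320):
    sd = expected_gc_sd_percent(kb * 1000)
    print(f"{kb:>3} kb window: {sd:.2f}%")
```

Because the window size sits under a square root, each fourfold increase in window size halves the expected standard deviation.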
Figure 12 Histogram of GC content of 20-kb windows in the draft genome sequence. (x axis, GC content in per cent; y axis, number of 20-kb windows.)

Box 2 Sources of publicly available sequence data and other relevant genomic information

http://genome.ucsc.edu/
University of California at Santa Cruz
Contains the assembly of the draft genome sequence used in this paper and updates

http://genome.wustl.edu/gsc/human/Mapping/
Washington University
Contains links to clone and accession maps of the human genome

http://www.ensembl.org
EBI/Sanger Centre
Allows access to DNA and protein sequences with automatic baseline annotation

http://www.ncbi.nlm.nih.gov/genome/guide/
NCBI
Views of chromosomes and maps and loci with links to other NCBI resources

http://www.ncbi.nlm.nih.gov/genemap99/
Gene map 99: contains data and viewers for radiation hybrid maps of EST-based STSs

http://compbio.ornl.gov/channel/index.html
Oak Ridge National Laboratory
Java viewers for human genome data

http://hgrep.ims.u-tokyo.ac.jp/
RIKEN and the University of Tokyo
Gives an overview of the entire human genome structure

http://snp.cshl.org/
The SNP Consortium
Includes a variety of ways to query for SNPs in the human genome

http://www.ncbi.nlm.nih.gov/Omim/
Online Mendelian Inheritance in Man
Contains information about human genes and disease

http://www.nhgri.nih.gov/ELSI/ and http://www.ornl.gov/hgmis/elsi/elsi.html
NHGRI and DOE
Contains information, links and articles on a wide range of social, ethical and legal issues

The draft genome sequence, however, contains many regions with much more extreme variation. There are huge regions (> 10 Mb) with GC content far from the average. For example, the most distal 48 Mb of chromosome 1p (from the telomere to about STS marker D1S3279) has an average GC content of 47.1%, and chromosome 13 has a 40-Mb region (roughly between STS marker A005X38 and
stsG30423) with only 36% GC content. There are also examples of large shifts in GC content between adjacent multimegabase regions. For example, the average GC content on chromosome 17q is 50% for the distal 10.3 Mb but drops to 38% for the adjacent 3.9 Mb. There are regions of less than 300 kb with even wider swings in GC content, for example, from 33.1% to 59.3%.

Long-range variation in GC content is evident not just from extreme outliers, but throughout the genome. The distribution of average GC content in 20-kb windows across the draft genome sequence is shown in Fig. 12. The spread is 15-fold larger than predicted by a uniform process (a standard deviation of 5.2% against the 0.35% expected for 20-kb windows). Moreover, the standard deviation barely decreases as window size increases by successive factors of four: 5.9%, 5.2%, 4.9% and 4.6% for windows of size 5, 20, 80 and 320 kb. The distribution is also notably skewed, with 58% below the average and 42% above the average of 41%, with a long tail of GC-rich regions.

Bernardi and colleagues (refs 118, 119) proposed that the long-range variation in GC content may reflect a genome composed of a mosaic of compositionally homogeneous regions that they dubbed 'isochores'. They suggested that the skewed distribution is composed of five normal distributions, corresponding to five distinct types of isochore (L1, L2, H1, H2 and H3, with GC contents of < 38%, 38–42%, 42–47%, 47–52% and > 52%, respectively).

We studied the draft genome sequence to see whether strict isochores could be identified. For example, the sequence was divided into 300-kb windows, and each window was subdivided into 20-kb subwindows. We calculated the average GC content for each window and subwindow, and investigated how much of the variance in the GC content of subwindows across the genome can be statistically 'explained' by the average GC content in each window.
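A minimal sketch of the window and subwindow analysis just described is given below. It is not the authors' code: the function names are hypothetical, and it assumes that 'explained' means the usual one-way decomposition, that is, one minus the ratio of residual variance (subwindows around their window mean) to total variance. It also assumes an upper-case sequence with no gaps or ambiguous bases, and a window size that is an exact multiple of the subwindow size.

```python
from statistics import fmean, pvariance


def gc_percent(seq):
    """GC content of an upper-case DNA string, in per cent."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def variance_explained(seq, window=300_000, subwindow=20_000):
    """Return (fraction of subwindow GC variance explained by window means,
    residual standard deviation in percentage points)."""
    if window % subwindow != 0:
        raise ValueError("window must be a multiple of subwindow")
    sub_gc, fitted = [], []
    # Only complete windows are used; a trailing partial window is ignored.
    for start in range(0, len(seq) - window + 1, window):
        subs = [gc_percent(seq[s:s + subwindow])
                for s in range(start, start + window, subwindow)]
        window_mean = fmean(subs)  # equals the GC content of the whole window
        sub_gc.extend(subs)
        fitted.extend([window_mean] * len(subs))
    total = pvariance(sub_gc)
    residual = fmean([(x - m) ** 2 for x, m in zip(sub_gc, fitted)])
    return 1.0 - residual / total, residual ** 0.5


# Toy input: two windows that are each perfectly homogeneous internally,
# so the window means account for all of the subwindow variance.
toy = "G" * 400 + "A" * 400
explained, residual_sd = variance_explained(toy, window=400, subwindow=200)
print(explained, residual_sd)
```

On real sequence the interesting quantity is the residual standard deviation, which can be compared with the binomial expectation for the subwindow size to judge whether windows are internally homogeneous.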
About three-quarters of the genome-wide variance among 20-kb windows can be statistically explained by the average GC content of 300-kb windows that contain them, but the residual variance among subwindows (standard deviation, 2.4%) is still far too large to be consistent with a homogeneous distribution. In fact, the hypothesis of homogeneity could be rejected for each 300-kb window in the draft genome sequence. Similar results were obtained with other window and subwindow sizes. Some of the local heterogeneity in GC content is attributable to transposable element insertions (see below). Such repeat elements typically have a higher GC content than the surrounding sequence, with the effect being strongest for the most recent insertions. These results rule out a strict notion of isochores as compositionally homogeneous. Instead, there is substantial variation at many different scales, as illustrated in Fig. 13. Although isochores do not appear to merit the pre®x `iso', the genome clearly does contain large regions of distinctive GC content and it is likely to be worth rede®ning the concept so that it becomes possible rigorously to partition the genome into regions. In the absence of a precise de®nition, we will loosely refer to such regions as `GC content domains' in the context of the discussion below. Fickett et al.120 have explored a model in which the underlying preference for a particular GC content drifts continuously throughout the genome, an approach that bears further examination. Churchill121 has proposed that the boundaries between GC content domains can in some cases be predicted by a hidden Markov model, with one state representing a GC-rich region and one representing an AT-rich region. We found that this approach tended to identify only very short domains of less than a kilobase (data not shown), but variants of this approach deserve further attention. 
The correlation between GC content domains and various biological properties is of great interest, and this is likely to be the most fruitful route to understanding the basis of variation in GC content. As described below, we confirm the existence of strong correlations with both repeat content and gene density. Using the integration between the draft genome sequence and the cytogenetic map described above, it is possible to confirm a statistically significant correlation between GC content and Giemsa bands (G-bands). For example, 98% of large-insert clones mapping to the darkest G-bands are in 200-kb regions of low GC content (average 37%), whereas more than 80% of clones mapping to the lightest G-bands are in regions of high GC content (average 45%) (ref. 103). Estimated band locations can be seen in Fig. 9 and viewed in the context of other genome annotation at http://genome.ucsc.edu/goldenPath/mapPlots/ and http://genome.ucsc.edu/goldenPath/hgTracks.html.

CpG islands

A related topic is the distribution of so-called CpG islands across the genome. The dinucleotide CpG is notable because it is greatly under-represented in human DNA, occurring at only about one-fifth of the roughly 4% frequency that would be expected by simply multiplying the typical fraction of Cs and Gs (0.21 × 0.21). The deficit occurs because most CpG dinucleotides are methylated on the cytosine base, and spontaneous deamination of methyl-C residues gives rise to T residues. (Spontaneous deamination of ordinary cytosine residues gives rise to uracil residues that are readily recognized and repaired by the cell.) As a result, methyl-CpG dinucleotides steadily mutate to TpG dinucleotides. However, the genome contains many 'CpG islands' in which CpG dinucleotides are not methylated and occur at a frequency closer to that predicted by the local GC content. CpG islands are of particular interest because many are associated with the 5′ ends of genes (refs 122–127). We searched the draft genome sequence for CpG islands.
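The expected-versus-observed CpG arithmetic above can be made concrete with a short sketch. The function name is hypothetical and the example sequence is a toy; only the figures 0.21 and "one-fifth" come from the text.

```python
def cpg_observed_expected(seq):
    """Return (observed CpG frequency, expected frequency, observed/expected).

    The expectation is the product of the C and G fractions of the
    sequence itself, as in the 0.21 x 0.21 calculation in the text.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    expected = (seq.count("C") / n) * (seq.count("G") / n)
    observed = seq.count("CG") / (n - 1)  # n - 1 dinucleotide positions
    ratio = observed / expected if expected else float("nan")
    return observed, expected, ratio


# Genome-wide figures quoted in the text.
expected_genome = 0.21 * 0.21          # about 0.044, i.e. roughly 4%
observed_genome = expected_genome / 5  # about one-fifth of that, under 1%

# Toy sequence (assumption) in which CpG is over-represented.
print(cpg_observed_expected("ACGT" * 25))
```

A ratio near 1 is what one would expect inside an unmethylated island, whereas bulk human DNA sits near 0.2 by the figures above.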
Ideally, CpG islands should be defined by directly testing for the absence of cytosine methylation, but that was not practical for this report.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
© 2001 Macmillan Magazines Ltd

Figure 13 Variation in GC content at various scales. The GC content in subregions of a 100-Mb region of chromosome 1 is plotted, starting at about 83 Mb from the beginning of the draft genome sequence. This region is AT-rich overall. Top, the GC content of the entire 100-Mb region analysed in non-overlapping 20-kb windows. Middle, GC content of the first 10 Mb, analysed in 2-kb windows. Bottom, GC content of the first 1 Mb, analysed in 200-bp windows. At this scale, gaps in the sequence can be seen. (Axes in each panel: GC content, 20% to 60%, against position in Mb.)

There are
various computer programs that attempt to identify CpG islands on the basis of primary sequence alone. These programs differ in some important respects (such as how aggressively they subdivide long CpG-containing regions), and the precise correspondence with experimentally undermethylated islands has not been validated. Nevertheless, there is a good correlation, and computational analysis thus provides a reasonable picture of the distribution of CpG islands in the genome.

To identify CpG islands, we used the definition proposed by Gardiner-Garden and Frommer (ref. 128) and embodied in a computer program. We searched the draft genome sequence for CpG islands, using both the full sequence and the sequence masked to eliminate repeat sequences. The number of regions satisfying the definition of a CpG island was 50,267 in the full sequence and 28,890 in the repeat-masked sequence. The difference reflects the fact that some repeat elements (notably Alu) are GC-rich. Although some of these repeat elements may function as control regions, it seems unlikely that most of the apparent CpG islands in repeat sequences are functional. Accordingly, we focused on those in the non-repeated sequence. The count of 28,890 CpG islands is reasonably close to the previous estimate of about 35,000 (ref. 129, as modified by ref. 130).

Most of the islands are short, with 60–70% GC content (Table 10). More than 95% of the islands are less than 1,800 bp long, and more than 75% are less than 850 bp. The longest CpG island (on chromosome 10) is 36,619 bp long, and 322 are longer than 3,000 bp. Some of the larger islands contain ribosomal pseudogenes, although RNA genes and pseudogenes account for only a small proportion of all islands (<0.5%). The smaller islands are consistent with their previously hypothesized function, but the role of these larger islands is uncertain.

The density of CpG islands varies substantially among some of the chromosomes.
Most chromosomes have 5–15 islands per Mb, with a mean of 10.5 islands per Mb. However, chromosome Y has an unusually low 2.9 islands per Mb, and chromosomes 16, 17 and 22 have 19–22 islands per Mb. The extreme outlier is chromosome 19, with 43 islands per Mb. Similar trends are seen when considering the percentage of bases contained in CpG islands. The relative density of CpG islands correlates reasonably well with estimates of relative gene density on these chromosomes, based both on previous mapping studies involving ESTs (Fig. 14) and on the distribution of gene predictions discussed below.

Comparison of genetic and physical distance

The draft genome sequence makes it possible to compare genetic and physical distances and thereby to explore variation in the rate of recombination across the human chromosomes. We focus here on large-scale variation. Finer variation is examined in an accompanying paper (ref. 131).

The genetic and physical maps are integrated by 5,282 polymorphic loci from the Marshfield genetic map (ref. 102), whose positions are known in terms of centimorgans (cM) and Mb along the chromosomes. Figure 15 shows the comparison of the draft genome sequence for chromosome 12 with the male, female and sex-averaged maps. One can calculate the approximate ratio of cM per Mb across a chromosome (reflected in the slopes in Fig. 15) and the average recombination rate for each chromosome arm.

Two striking features emerge from analysis of these data. First, the average recombination rate increases as the length of the chromosome arm decreases (Fig. 16). Long chromosome arms have an average recombination rate of about 1 cM per Mb, whereas the shortest arms are in the range of 2 cM per Mb. A similar trend has been seen in the yeast genome (refs 132, 133), despite the fact that the physical scale is nearly 200 times as small. Moreover, experimental studies have shown that lengthening or shortening yeast chromosomes results in a compensatory change in recombination rate (ref. 132).
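The cM-per-Mb calculation described above amounts to taking slopes on a plot of genetic against physical position. A minimal sketch follows; the function names and the four marker positions are invented for illustration and are not taken from the Marshfield map.

```python
def average_rate(markers):
    """Average recombination rate (cM per Mb) between the first and last marker.

    markers is a list of (position_mb, position_cm) pairs.
    """
    markers = sorted(markers)
    (mb0, cm0), (mb1, cm1) = markers[0], markers[-1]
    if mb1 == mb0:
        raise ValueError("markers span no physical distance")
    return (cm1 - cm0) / (mb1 - mb0)


def local_rates(markers):
    """Slope (cM per Mb) of each interval between consecutive markers.

    Returns (interval midpoint in Mb, rate) pairs; intervals with no
    physical length are skipped.
    """
    markers = sorted(markers)
    rates = []
    for (mb_a, cm_a), (mb_b, cm_b) in zip(markers, markers[1:]):
        if mb_b > mb_a:
            rates.append(((mb_a + mb_b) / 2, (cm_b - cm_a) / (mb_b - mb_a)))
    return rates


# Hypothetical arm: steep slopes at both ends, flat in the middle.
toy_markers = [(0, 0), (10, 20), (50, 30), (60, 50)]
print(average_rate(toy_markers))  # 50 cM over 60 Mb
print(local_rates(toy_markers))   # 2.0, 0.25 and 2.0 cM per Mb
```

Sorting by physical position assumes the two maps agree on marker order; discordant markers would show up as negative local slopes.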
Table 10 Number of CpG islands by GC content

GC content of island   Number of islands   Percentage of islands   Nucleotides in islands   Percentage of nucleotides in islands
Total                  28,890              100                     19,818,547               100
>80%                   22                  0.08                    5,916                    0.03
70–80%                 5,884               20                      3,111,965                16
60–70%                 18,779              65                      13,110,924               66
50–60%                 4,205               15                      3,589,742                18

Potential CpG islands were identified by searching the draft genome sequence one base at a time, scoring each dinucleotide (+17 for GC, -1 for others) and identifying maximally scoring segments. Each segment was then evaluated to determine GC content (≥50%), length (>200) and ratio of observed proportion of GC dinucleotides to the expected proportion on the basis of the GC content of the segment (>0.60), using a modification of a program developed by G. Micklem (personal communication).

Figure 14 Number of CpG islands per Mb for each chromosome, plotted against the number of genes per Mb (the number of genes was taken from GeneMap98 (ref. 100)). Chromosomes 16, 17, 22 and particularly 19 are clear outliers, with a density of CpG islands that is even greater than would be expected from the high gene counts for these four chromosomes.

Figure 15 Distance in cM along the genetic map of chromosome 12 plotted against position in Mb in the draft genome sequence. Female, male and sex-averaged maps are shown. Female recombination rates are much higher than male recombination rates. The increased slopes at either end of the chromosome reflect the increased rates of recombination per Mb near the telomeres. Conversely, the flatter slope near the centromere shows decreased recombination there, especially in male meiosis. This is typical of the other chromosomes as well (see http://genome.ucsc.edu/goldenPath/mapPlots). Discordant markers may be map, marker placement or assembly errors. (Axes: distance from centromere in cM against position in Mb; arms 12p and 12q and the centromere are marked.)

The second observation is that the recombination rate tends to be suppressed near the centromeres and higher in the distal portions of most chromosomes, with the increase largely in the terminal
20–35 Mb. The increase is most pronounced in the male meiotic map. The effect can be seen, for example, from the higher slope at both ends of chromosome 12 (Fig. 15). Regional and sex-specific effects have been observed for chromosome 21 (refs 110, 134).

Why is recombination higher on smaller chromosome arms? A higher rate would increase the likelihood of at least one crossover during meiosis on each chromosome arm, as is generally observed in human chiasmata counts (ref. 135). Crossovers are believed to be necessary for normal meiotic disjunction of homologous chromosome pairs in eukaryotes. An extreme example is the pseudoautosomal regions on chromosomes Xp and Yp, which pair during male meiosis; this physical region of only 2.6 Mb has a genetic length of 50 cM (corresponding to 20 cM per Mb), with the result that a crossover is virtually assured.

Mechanistically, the increased rate of recombination on shorter chromosome arms could be explained if, once an initial recombination event occurs, additional nearby events are blocked by positive crossover interference on each arm. Evidence from yeast mutants in which interference is abolished shows that interference plays a key role in distributing a limited number of crossovers among the various chromosome arms in yeast (ref. 136). An alternative possibility is that a checkpoint mechanism scans for and enforces the presence of at least one crossover on each chromosome arm.

Variation in recombination rates along chromosomes and between the sexes is likely to reflect variation in the initiation of meiosis-induced double-strand breaks (DSBs) that initiate recombination. DSBs in yeast have been associated with open chromatin (refs 137, 138), rather than with specific DNA sequence motifs. With the availability of the draft genome sequence, it should be possible to explore in an analogous manner whether variation in human recombination rates reflects systematic differences in chromosome accessibility during meiosis.
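The pseudoautosomal example above can be checked with a few lines of arithmetic. This sketch relies on the standard convention, not stated in the text, that 50 cM corresponds to an average of one crossover per meiosis; the function names are hypothetical.

```python
def cm_per_mb(genetic_cm, physical_mb):
    """Recombination rate of a region in cM per Mb."""
    return genetic_cm / physical_mb


def mean_crossovers_per_meiosis(genetic_cm):
    """Average number of crossovers per meiosis in a region.

    Assumption: each crossover involves two of the four chromatids, so
    one crossover per meiosis yields 50% recombinant products (50 cM).
    """
    return genetic_cm / 50.0


# Pseudoautosomal region in male meiosis: 50 cM over 2.6 Mb.
par_rate = cm_per_mb(50, 2.6)                      # about 19 cM per Mb
par_crossovers = mean_crossovers_per_meiosis(50)   # 1.0

# The same 2.6 Mb at the roughly 1 cM per Mb typical of long arms.
typical_crossovers = mean_crossovers_per_meiosis(1.0 * 2.6)  # 0.052

print(par_rate, par_crossovers, typical_crossovers)
```

The computed rate of about 19 cM per Mb matches the rounded figure of 20 cM per Mb quoted in the text.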
Repeat content of the human genome

A puzzling observation in the early days of molecular biology was that genome size does not correlate well with organismal complexity. For example, Homo sapiens has a genome that is 200 times as large as that of the yeast S. cerevisiae, but 200 times as small as that of Amoeba dubia (refs 139, 140). This mystery (the C-value paradox) was largely resolved with the recognition that genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes (reviewed in refs 140, 141).

In the human, coding sequences comprise less than 5% of the genome (see below), whereas repeat sequences account for at least 50% and probably much more. Broadly, the repeats fall into five classes: (1) transposon-derived repeats, often referred to as interspersed repeats; (2) inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small structural RNAs), usually referred to as processed pseudogenes; (3) simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n or (CGG)n; (4) segmental duplications, consisting of blocks of around 10–300 kb that have been copied from one region of the genome into another region; and (5) blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of acrocentric chromosomes and ribosomal gene clusters. (These regions are intentionally under-represented in the draft genome sequence and are not discussed here.)

Repeats are often described as 'junk' and dismissed as uninteresting. However, they actually represent an extraordinary trove of information about biological processes. The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection.
It is possible to recognize cohorts of repeats 'born' at the same time and to follow their fates in different regions of the genome or in different species. As active agents, repeats have reshaped the genome by causing ectopic rearrangements, creating entirely new genes, modifying and reshuffling existing genes, and modulating overall GC content. They also shed light on chromosome structure and dynamics, and provide tools for medical genetic and population genetic studies.

The human is the first repeat-rich genome to be sequenced, and so we investigated what information could be gleaned from this majority component of the human genome. Although some of the general observations about repeats were suggested by previous studies, the draft genome sequence provides the first comprehensive view, allowing some questions to be resolved and new mysteries to emerge.

Transposon-derived repeats

Most human repeat sequence is derived from transposable elements (refs 142, 143). We can currently recognize about 45% of the genome as belonging to this class. Much of the remaining 'unique' DNA must also be derived from ancient transposable element copies that have diverged too far to be recognized as such. To describe our analyses of interspersed repeats, it is necessary briefly to review the relevant features of human transposable elements.

Classes of transposable elements. In mammals, almost all transposable elements fall into one of four types (Fig. 17), of which three transpose through RNA intermediates and one transposes directly as DNA. These are long interspersed elements (LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons.

LINEs are one of the most ancient and successful inventions in eukaryotic genomes. In humans, these transposons are about 6 kb long, harbour an internal polymerase II promoter and encode two open reading frames (ORFs).
Upon translation, a LINE RNA assembles with its own encoded proteins and moves to the nucleus, where an endonuclease activity makes a single-stranded nick and the reverse transcriptase uses the nicked DNA to prime reverse transcription from the 3′ end of the LINE RNA. Reverse transcription frequently fails to proceed to the 5′ end, resulting in many truncated, nonfunctional insertions. Indeed, most LINE-derived repeats are short, with an average size of 900 bp for all LINE1 copies, and a median size of 1,070 bp for copies of the currently active LINE1 element (L1Hs).

Figure 16 Rate of recombination averaged across the euchromatic portion of each chromosome arm plotted against the length of the chromosome arm in Mb. For large chromosomes, the average recombination rates are very similar, but as chromosome arm length decreases, average recombination rates rise markedly. (Axes: recombination rate in cM per Mb against length of chromosome arm in Mb.)

New insertion sites are flanked by a small