ARTICLES NATURE|Vol 437|27 October 2005
© 2005 Nature Publishing Group
selection at this locus (see below; M. L. Freedman et al., personal communication).

Haplotype sharing across populations. We next examined the extent to which haplotypes are shared across populations. We used a hidden Markov model in which each haplotype is modelled in turn as an imperfect mosaic of other haplotypes (see Supplementary Information and ref. 42). In essence, the method infers probabilistically which other haplotype in the sample is the closest relative (nearest neighbour) at each position along the chromosome.

Unsurprisingly, the nearest neighbour most often is from the same analysis panel, but about 10% of haplotypes were found most closely to match a haplotype in another panel (Supplementary Fig. 5). All individuals have at least some segments over which the nearest neighbour is in a different analysis panel. These results indicate that although analysis panels are characterized both by different haplotype frequencies and, to some extent, different combinations of alleles, both common and rare haplotypes are often shared across populations.

Properties of LD in the human genome

Traditionally, descriptions of LD have focused on measures calculated between pairs of SNPs, averaged as a function of physical distance. Examples of such analyses for the HapMap data are presented in Supplementary Fig. 6. After adjusting for known confounders such as sample size, allele frequency distribution, marker density, and length of sampled regions, these data are highly similar to previously published surveys (ref. 43).

Because LD varies markedly on scales of 1–100 kb, and is often discontinuous rather than declining smoothly with distance, averages obscure important aspects of LD structure. A fuller exploration of the fine-scale structure of LD offers both insight into the causes of LD and understanding of its application to disease research.

LD patterns are simple in the absence of recombination.
The most natural path to understanding LD structure is first to consider the simplest case in which there is no recombination (or gene conversion), and then to add recombination to the model. (For simplicity we ignore genotyping error and recurrent mutation in this discussion, both of which seem to be rare in these data.)

In the absence of recombination, diversity arises solely through mutation. Because each SNP arose on a particular branch of the genealogical tree relating the chromosomes in the current populations, multiple haplotypes are observed. SNPs that arose on the same branch of the genealogy are perfectly correlated in the sample, whereas SNPs that occurred on different branches have imperfect correlations, or no correlation at all.

We illustrate these concepts using empirical genotype data from 36 adjacent SNPs in an ENCODE region (ENr131.2q37), selected because no obligate recombination events were detectable among them in CEU (Fig. 7). (We note that the lack of obligate recombination events in a small sample does not guarantee that no recombinants have occurred, but it provides a good approximation for illustration.)

In principle, 36 such SNPs could give rise to 2^36 different haplotypes. Even with no recombination, gene conversion or recurrent mutation, up to 37 different haplotypes could be formed, because each of the 36 mutations can add at most one new haplotype to the ancestral one. Despite this great potential diversity, only seven haplotypes are observed (five seen more than once) among the 120 parental CEU chromosomes studied, reflecting shared ancestry since their most recent common ancestor among apparently unrelated individuals.

In such a setting, it is easy to interpret the two most common pairwise measures of LD: D′ and r². (See the Supplementary Information for fuller definitions of these measures.) D′ is defined to be 1 in the absence of obligate recombination, declining only due to recombination or recurrent mutation (ref. 27).
In contrast, r² is simply the squared correlation coefficient between the two SNPs. Thus, r² is 1 when two SNPs arose on the same branch of the genealogy and remain undisrupted by recombination, but has a value less than 1 when SNPs arose on different branches, or if an initially strong correlation has been disrupted by crossing over.

In this region, D′ = 1 for all marker pairs, as there is no evidence of historical recombination. In contrast, and despite great simplicity of haplotype structure, r² values display a complex pattern, varying from 0.0003 to 1.0, with no relationship to physical distance. This makes sense, however, because without recombination, correlations among SNPs depend on the historical order in which they arose, not the physical order of SNPs on the chromosome.

Most importantly, the seeming complexity of r² values can be deconvolved in a simple manner: only seven different SNP configurations exist in this region, with all but two chromosomes matching five common haplotypes, which can be distinguished from each other by typing a specific set of four SNPs. That is, only a small minority of sites need be examined to capture fully the information in this region.

Figure 4 | Minor allele frequency distribution of SNPs in the ENCODE data, and their contribution to heterozygosity. This figure shows the polymorphic SNPs from the HapMap ENCODE regions according to minor allele frequency (blue), with the lowest minor allele frequency bin (<0.05) separated into singletons (SNPs heterozygous in one individual only, shown in grey) and SNPs with more than one heterozygous individual. For this analysis, MAF is averaged across the analysis panels. The sum of the contribution of each MAF bin to the overall heterozygosity of the ENCODE regions is also shown (orange).

Figure 5 | Allele frequency distributions for autosomal SNPs. For each analysis panel we plotted (bars) the MAF distribution of all the Phase I SNPs with a frequency greater than zero. The solid line shows the MAF distribution for the ENCODE SNPs, and the dashed line shows the MAF distribution expected for the standard neutral population model with constant population size and random mating without ascertainment bias.

Variation in local recombination rates is a major determinant of LD. Recombination in the ancestors of the current population has typically disrupted the simple picture presented above. In the human genome, as in yeast (ref. 44), mouse (ref. 45) and other genomes, recombination rates typically vary dramatically on a fine scale, with hotspots of recombination explaining much crossing over in each region (ref. 28). The generality of this model has recently been demonstrated through computational methods that allow estimation of recombination rates (including hotspots and coldspots) from genotype data (refs 46, 47).

The availability of nearly complete information about common DNA variation in the ENCODE regions allowed a more precise estimation of recombination rates across large regions than in any previous study.
We estimated recombination rates and identified recombination hotspots in the ENCODE data, using methods previously described (ref. 46; see Supplementary Information for details). Hotspots are short regions (typically spanning about 2 kb) over which recombination rates rise dramatically over local background rates.

Whereas the average recombination rate over 500 kb across the human genome is about 0.5 cM (ref. 48), the estimated recombination rate across the 500-kb ENCODE regions varied nearly tenfold, from a minimum of 0.19 cM (ENm013.7q21.13) to a maximum of 1.25 cM (ENr232.9q34.11). Even this tenfold variation obscures much more dramatic variation over a finer scale: 88 hotspots of recombination were identified (Fig. 8; see also Supplementary Fig. 7), that is, one per 57 kb, with hotspots detected in each of the ten regions (from 4 in 12q12 to 14 in 2q37.1). Across the 5 Mb, we estimate that about 80% of all recombination has taken place in about 15% of the sequence (Fig. 9, see also refs 46, 49).

A block-like structure of human LD. With most human recombination occurring in recombination hotspots, the breakdown of LD is often discontinuous. A ‘block-like’ structure of LD is visually apparent in Fig. 8 and Supplementary Fig. 7: segments of consistently high D′ that break down where high recombination rates, recombination hotspots and obligate recombination events (ref. 50) all cluster.

When haplotype blocks are more formally defined in the ENCODE data (using a method based on a composite of local D′ values (ref. 30), or another based on the four gamete test (ref. 51)), most of the sequence falls into long segments of strong LD that contain many SNPs and yet display limited haplotype diversity (Table 5).

Specifically, addressing concerns that blocks might be an artefact of low marker density (ref. 52), in these nearly complete data most of the sequence falls into blocks of four or more SNPs (67% in YRI to 87% in CEU) and the average sizes of such blocks are similar to initial estimates (ref. 30). Although the average block spans many SNPs (30–70), the average number of common haplotypes in each block ranged only from 4.0 (CHB+JPT) to 5.6 (YRI), with nearly all haplotypes in each block matching one of these few common haplotypes. These results confirm the generality of inferences drawn from disease-mapping studies (ref. 27) and genomic surveys with smaller sample sizes (ref. 29) and less complete data (ref. 30).

Figure 6 | Comparison of allele frequencies in the ENCODE data for all pairs of analysis panels and between the CHB and JPT sample sets. For each polymorphic SNP we identified the minor allele across all panels (a–d) and then calculated the frequency of this allele in each analysis panel/sample set. The colour in each bin represents the number of SNPs that display each given set of allele frequencies. The purple regions show that very few SNPs are common in one panel but rare in another. The red regions show that there are many SNPs that have similar low frequencies in each pair of analysis panels/sample sets.

Long-range haplotypes and local patterns of recombination. Although haplotypes often break at recombination hotspots (and block boundaries), this tendency is not invariant. We identified all unique haplotypes with frequency more than 0.05 across the 269 individuals in the phased data, and compared them to the fine-scale recombination map. Figure 10 shows a region of chromosome 19 over which many such haplotypes break at identified recombination hotspots, but others continue. Thus, the tendency towards colocalization of recombination sites does not imply that all haplotypes break at each recombination site.

Some regions display remarkably extended haplotype structure based on a lack of recombination (Supplementary Fig. 8a, b). Most striking, if unsurprising, are centromeric regions, which lack recombination: haplotypes defined by more than 100 SNPs span several megabases across the centromeres. The X chromosome has multiple regions with very extensive haplotypes, whereas other chromosomes typically have a few such domains.
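The two block definitions mentioned above both rest on simple pairwise statistics. As a minimal sketch (not the block-calling procedures of refs 30 and 51, which are described in those references), the Python below applies the four gamete test to a pair of biallelic SNPs and computes |D′| for the same pair. The function names, the 0/1 allele coding and the use of phased haplotypes as input are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def gamete_counts(snp_a: Sequence[int], snp_b: Sequence[int]) -> Counter:
    """Count the two-SNP haplotypes (gametes) seen across chromosomes.

    snp_a[i] and snp_b[i] are the alleles (coded 0 or 1) carried by
    chromosome i at the two SNPs.
    """
    if len(snp_a) != len(snp_b):
        raise ValueError("both SNPs must be typed on the same chromosomes")
    return Counter(zip(snp_a, snp_b))


def passes_four_gamete_test(snp_a: Sequence[int], snp_b: Sequence[int]) -> bool:
    """True when fewer than four distinct gametes are observed.

    Seeing all four gametes (00, 01, 10, 11) implies an obligate
    recombination event (or recurrent mutation) between the two SNPs.
    """
    return len(gamete_counts(snp_a, snp_b)) < 4


def d_prime(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Absolute value of D' for two polymorphic biallelic SNPs."""
    n = len(snp_a)
    counts = gamete_counts(snp_a, snp_b)
    p_a = sum(snp_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n  # frequency of allele 1 at SNP B
    if p_a in (0, 1) or p_b in (0, 1):
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = counts[(1, 1)] / n - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max
```

For example, with `snp_a = [0, 0, 1, 1, 1, 0]` and `snp_b = [0, 0, 0, 1, 1, 0]` only three gametes occur, so the pair passes the four gamete test and |D′| is 1, which matches the statement that D′ is 1 in the absence of obligate recombination.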
Most global measures of LD become more consistent when measured in genetic rather than physical distance. For example, when plotted against physical distance, the extent of pairwise LD

Table 5 | Haplotype blocks in ENCODE regions, according to two methods

Parameter                                                        YRI     CEU     CHB+JPT
Method based on a composite of local D′ values (ref. 30)
  Average number of SNPs per block                               30.3    70.1    54.4
  Average length per block (kb)                                  7.3     16.3    13.2
  Fraction of genome spanned by blocks (%)                       67      87      81
  Average number of haplotypes (MAF ≥ 0.05) per block            5.57    4.66    4.01
  Fraction of chromosomes due to haplotypes with MAF ≥ 0.05 (%)  94      93      95
Method based on the four gamete test (ref. 51)
  Average number of SNPs per block                               19.9    24.3    24.3
  Average length per block (kb)                                  4.8     5.9     5.9
  Fraction of genome spanned by blocks (%)                       86      84      84
  Average number of haplotypes (MAF ≥ 0.05) per block            5.12    3.63    3.63
  Fraction of chromosomes due to haplotypes with MAF ≥ 0.05 (%)  91      95      95

Figure 7 | Genealogical relationships among haplotypes and r² values in a region without obligate recombination events. The region of chromosome 2 (234,876,004–234,884,481 bp; NCBI build 34) within ENr131.2q37 contains 36 SNPs, with zero obligate recombination events in the CEU samples. The left part of the plot shows the seven different haplotypes observed over this region (alleles are indicated only at SNPs), with their respective counts in the data. Underneath each of these haplotypes is a binary representation of the same data, with coloured circles at SNP positions where a haplotype has the less common allele at that site. Groups of SNPs all captured by a single tag SNP (with r² ≥ 0.8) using a pairwise tagging algorithm (refs 53, 54) have the same colour. Seven tag SNPs corresponding to the seven different colours capture all the SNPs in this region. On the right these SNPs are mapped to the genealogical tree relating the seven haplotypes for the data in this region.
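The caption of Figure 7 describes groups of SNPs that are each captured by one tag SNP at r² ≥ 0.8. The sketch below illustrates that idea under stated assumptions: it computes r² from phased 0/1 haplotypes and then picks tags with a simple greedy rule (repeatedly choose the SNP that captures the most still-untagged SNPs). The greedy rule, the function names and the input layout are illustrative assumptions; the pairwise tagging algorithm of refs 53 and 54 may differ in detail.

```python
from typing import Dict, List, Sequence


def r_squared(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Squared correlation coefficient between two biallelic SNPs.

    snp_a[i] and snp_b[i] are the alleles (coded 0 or 1) on chromosome i.
    """
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b
    return d * d / denom


def greedy_tags(snps: Dict[str, Sequence[int]], threshold: float = 0.8) -> List[str]:
    """Choose tag SNPs so that every SNP has r^2 >= threshold with some tag."""
    names = list(snps)
    # captures[s] holds every SNP that s would tag, including s itself.
    captures = {
        s: {t for t in names if t == s or r_squared(snps[s], snps[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags: List[str] = []
    while untagged:
        # Pick the SNP covering the most untagged SNPs (first one wins ties).
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags
```

With `snps = {"s1": [0, 0, 1, 1, 1, 0], "s2": [0, 0, 1, 1, 1, 0], "s3": [0, 0, 0, 1, 1, 0], "s4": [1, 0, 0, 0, 0, 0]}`, the SNPs `s1` and `s2` behave like mutations on the same branch (r² = 1) and share one tag, whereas `s3` and `s4` each need their own, so `greedy_tags(snps)` returns three tags. Note that `s1` and `s3` show only three gametes, so D′ is 1 for this pair even though r² is 0.5.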