Ann. Hum. Genet. (2001), 65, 43–62. Printed in Great Britain

The phylogeography of Y chromosome binary haplotypes and the origins of modern human populations

P. A. UNDERHILL¹*, G. PASSARINO¹,⁵, A. A. LIN¹, P. SHEN², M. MIRAZÓN LAHR³,⁴, R. A. FOLEY³, P. J. OEFNER² AND L. L. CAVALLI-SFORZA¹

¹ Department of Genetics, Stanford University, 300 Pasteur Dr., Stanford, CA 94305–5120, USA
² Stanford DNA Sequencing and Technology Center, 855 California Ave, Palo Alto, CA 94304, USA
³ Department of Biological Anthropology, University of Cambridge, Downing Street, Cambridge CB2 3DZ, UK
⁴ Departamento de Biologia, Inst. de Biociências, Universidade de São Paulo, Rua do Matão, Travessa 14, No. 321, 05508–900 Cidade Universitária, São Paulo, Brasil
⁵ Department of Cell Biology, Calabria University, Rende, Italy

* Correspondence: P. A. Underhill. E-mail: under@stanford.edu

(Received 24.8.00. Accepted 16.11.00)

SUMMARY

Although molecular genetic evidence continues to accumulate that is consistent with a recent common African ancestry of modern humans, its ability to illuminate regional histories remains incomplete. A set of unique event polymorphisms associated with the non-recombining portion of the Y-chromosome (NRY) addresses this issue by providing evidence concerning successful migrations originating from Africa, which can be interpreted as subsequent colonizations, differentiations and migrations overlaid upon previous population ranges. A total of 205 markers identified by denaturing high performance liquid chromatography (DHPLC), together with 13 taken from the literature, were used to construct a parsimonious genealogy. Ancestral allelic states were deduced from orthologous great ape sequences. A total of 131 unique haplotypes were defined which trace the microevolutionary trajectory of global modern human genetic diversification. The genealogy provides a detailed phylogeographic portrait of contemporary global population structure that is emblematic of human origins, divergence and population history that is consistent with climatic, paleoanthropological and other genetic knowledge.

INTRODUCTION

A model for the origins of human diversity deduced from palaeontological evolutionary geography maintains that while the modern human species originates from a single evolutionary event, diversity is a result of subsequent multiple evolutionary events associated with various geographic range expansions, migrations, colonizations and differential survival of populations (Lahr & Foley, 1994). Overall, current paleoanthropological evidence would suggest an early set of dispersals across Africa and into Western Asia; an early southern dispersal into Asia and Melanesia; and a later one into Northern Eurasia. Overlain on these events are the contractions associated with the Last Glacial Maximum (LGM), and subsequent post-glacial expansions of both hunter-gatherers and agriculturists.

DNA sequences offer an evidentiary alternative to fossil-based pre-historical reconstructions (Jorde et al. 1998, Owens & King 1999). The uniparentally inherited non-recombining haploid mtDNA and the Y chromosome loci are particularly sensitive to the influences of drift, especially founder effect. Consequently these loci are ideal for assessing the origins of contemporary population diversity, and provide context for paleontological hypothesis testing (Foley, 1998). The combination of a recent
molecular age (Shen et al. 2000), and geographical structure, makes the NRY a sensitive genetic index capable of tracing the microevolutionary patterns of novel modern human diversity. Any population-level force, including possible localized natural selection, that reduces the effective male population size relative to that of females will influence the genetic landscape.

We combine 205 PCR compatible binary NRY polymorphisms (Underhill et al. 2000; Shen et al. 2000) together with 13 additional markers from the literature to examine phylogeographical patterns that may record historical population migrations, mergers and divisions that account for the current spectrum of human variability. While extrapolations from the variation associated with a single gene to population history must be interpreted cautiously, the phylogeographic reconstruction presented here offers one such interpretation. It comprehensively integrates the prehistoric and Y-chromosome data, along with inferences from mtDNA and autosomal haplotypes, into a possible hypothesis for the evolution of human diversity. We attempt here to discuss the observed phylogeographic patterns of NRY variation in the context of global population diversification, and to integrate them with paleoclimatological, paleoanthropological and other genetic knowledge. In developing our synthesis we have aimed at producing palaeodemographic hypotheses that are consistent with as many other lines of evidence as possible, and that are amenable to testing by further studies from a number of disciplines.

MATERIALS AND METHODS

Samples

DNA from 1062 men belonging to 21 populations was analysed. Further details on the ethnic affiliations of these samples are given in Underhill et al. (2000).

PCR

Primers designed for SMCY, DFFRY, UTY, and DBY covered all unique sequences and repeat elements other than LINE, yielding overlapping amplicons 300–500 bp in length. PCR conditions are given in Underhill et al. 2000 and Shen et al. 2000. All 218 polymorphisms are given in Appendix I (deposited at http://www.gene.ucl.ac.uk/anhumgen/), which lists primers, the primary reference for each marker, the specific DNA sequence variant and its location in the fragment. Two new markers (M223, M224) found while genotyping other markers are included.

DHPLC analysis

Unpurified PCR products were mixed at an equimolar ratio with a reference Y chromosome and subjected to a 3-min 95 °C denaturing step followed by gradual reannealing from 95 °C to 65 °C over 30 min. Ten µl of each mixture were loaded onto a DNASep column (Transgenomic, San Jose, CA), and the amplicons were eluted in 0.1 M triethylammonium acetate, pH 7, with a linear acetonitrile gradient at a flow rate of 0.9 ml/min (14). Using appropriate temperature conditions, which were optimized by computer simulation (http://insertion.stanford.edu/melt.html), mismatches were recognized by the appearance of two or more peaks in the elution profiles.

DNA sequencing

Polymorphic and reference PCR samples were purified with QIAGEN (Valencia, CA) QIAquick spin columns, cycle sequenced with ABI Dye terminator cycle sequencing reagents and analysed on a PE Biosystems 373A sequencer. Chimpanzee, gorilla and orangutan samples were also sequenced for each human polymorphic locus.

RESULTS AND DISCUSSION

The 218 NRY polymorphisms were used to deduce a phylogenetic tree based on the principle of maximum parsimony, in which a network of branches is drawn that minimizes the number of mutational events required to relate the lineages (Fig. 1). The ancestral alleles were deduced using
Fig. 1. Maximum parsimony phylogeny of human NRY chromosome biallelic variation. The tree is rooted with respect to non-human primate sequences. The 131 numbered compound haplotypes were constructed from 218 mutations that are indicated on segments. Marker numbers are discontinuous (see text). Haplotypes are assorted into 10 groups (I–X).
great ape sequence data to root the phylogeny. All phylogenetically equivalent mutations whose order cannot be determined are indicated with a slash (e.g. M42/M94/M139). Markers with M numbers > 218 reflect the selective removal of polymorphisms associated with recurrent length variations such as tetra- or pentanucleotide repeats and homopolymer tracts. The determination of the ancestral state for these polymorphisms is uncertain, and (with one exception, M91) they were excluded from the analysis (Underhill et al. 2000). The marker panel comprises 125 transitions, 66 transversions, 26 insertions/deletions, plus an Alu element. All polymorphisms except one are biallelic. A double transversion, M116, has three alleles whose derived alleles define quite different haplotypes. Two transitions (M64 and M108) showed evidence of recurrence but cause no ambiguity. No reversions were observed, although one transition, SRY10831 (Whitfield et al. 1995), also referred to as SRY1532 (Kwok et al. 1996), is known to be a reversion (Hammer et al. 1998). It is not included here as we have phylogenetically stable transversion and deletion polymorphisms that unequivocally mimic its patterns.

Haplotypes are partitioned into haplogroups (called Groups I–X) in an attempt to simplify discussion of phylogeography, using the simple criteria of presence or absence of alleles located in the interior of the phylogeny. These discretionary Group designations provide a framework for categorization and discussion of haplotypes. The Y genealogy is composed of 131 haplotypes that delineate the 10 Groups, seven of which are monophyletic. Three groups are polyphyletic, but have related haplotypes defined as follows: the presence of M89/M213 and absence of M9 (Group VI); the presence of M9 and absence of M175/M214 and M45/M74 (Group VIII); or the presence of M45/M74 and absence of M173/M207 (Group X).
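The presence/absence criteria for the three polyphyletic Groups can be sketched in code. The following is only an illustrative sketch, not the authors' genotyping procedure: the function names `has_any` and `polyphyletic_group`, the representation of a chromosome as a set of markers carrying the derived allele, and the order in which the rules are tested are all assumptions introduced here. Slash-joined markers are treated as interchangeable evidence for the same branch, following the convention described above.

```python
from typing import Iterable, Optional, Set


def has_any(derived: Set[str], *markers: str) -> bool:
    """True if at least one of the phylogenetically equivalent markers is derived."""
    return any(marker in derived for marker in markers)


def polyphyletic_group(derived_markers: Iterable[str]) -> Optional[str]:
    """Return 'VI', 'VIII' or 'X' if the chromosome meets one of the three
    presence/absence rules quoted in the text, otherwise None.

    Hypothetical helper: `derived_markers` lists the markers for which the
    chromosome carries the derived allele. The monophyletic Groups are not
    covered, because their defining markers are not listed in this passage.
    """
    derived = set(derived_markers)

    # Group X: M45/M74 present, M173/M207 absent.
    if has_any(derived, "M45", "M74") and not has_any(derived, "M173", "M207"):
        return "X"

    # Group VIII: M9 present, M175/M214 and M45/M74 absent.
    if (has_any(derived, "M9")
            and not has_any(derived, "M175", "M214")
            and not has_any(derived, "M45", "M74")):
        return "VIII"

    # Group VI: M89/M213 present, M9 absent.
    if has_any(derived, "M89", "M213") and not has_any(derived, "M9"):
        return "VI"

    return None
```

For example, a chromosome derived only at M89 and M213 would be placed in Group VI, whereas one that is also derived at M9 (but not at M175/M214 or M45/M74) would fall in Group VIII. A chromosome derived at M173 matches none of the three rules and returns `None`, since it belongs to one of the monophyletic Groups.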
The contemporary global frequency distribution of the 10 Groups, based on > 1000 globally diverse samples genotyped using a hierarchical top-down approach, is illustrated in Figure 2, which is based upon frequency data given in Underhill et al. 2000. Autochthonal variations associated with the NRY, in addition to tracing a common African heritage, resolve numerous population subdivisions, gene flow episodes and colonization events. They show the overall pattern of the progressive succession of Group differentiation and movement across the world reflective of expansions and genetic drift processes.

This composite collection of 218 NRY variants provides improved resolution of extant patrilineages. Additional resolution will occur with the discovery of new delimiting markers. The succession of mutations is unequivocal except in branches defined by two or more markers. While uncertainties related to assessing the effective population size of males make temporal estimates of bifurcation events difficult, age estimates of key nodes have been made assuming a model of population growth (Thomson et al. 2000). These indicate a more recent ancestry of the NRY at 59 000 years (95% CI = 40 000–140 000) than previously estimated at 134 250 ± 44 980 years based on 13 mutational events and constant population size (Karafet et al. 1999). Neither demographic model is likely to be realistic, as the palaeoanthropological evidence shows a more complex population history. It should be noted that the lower estimate is considerably younger than the earliest evidence for dispersals of modern humans.

Phylogeography

Intriguing clues about the history of our species can be derived from the study of the geographic distribution of the lineages on the tree in Figure 1, in the approach known as ‘phylogeographic’ (Avise et al. 1987). Such an approach has been previously used for mtDNA networks (Richards et al. 1998, 2000; Kivisild et al. 1999; Macaulay et al. 1999). Figure 3a–h depicts the hypothesized chronological geographic distribution of Y Groups from the Isotope Stage 5 interglacial to the Holocene. The underlying assumption of phylogeography is that there is a correspondence between the overall distribution of haplotypes and haplogroups and
past human movements. The strong geographical signal seen in the Y chromosome data is consistent with this assumption. The interpretative framework should be compared with alternatives, such as continuous gene flow, selection, or the effects of recent events. However, these alternatives have not been formally developed in ways that can be tested against the data, and are less consistent with other lines of evidence.

Fig. 2. Contemporary worldwide distribution of Y chromosome groups in 22 regions. Each group is represented by a distinguishing colour. Coloured sectors reflect representative group frequencies. Pacific basin not to scale. With respect to Table 1 of Underhill et al. (2000), Hunza and Pakistan+India are combined. In addition, the results for Native Americans have been subdivided into North (N = 14), Central (N = 13) and South (N = 79).

Groups I and II are restricted to Africa, and are distinct from all other African and non-African chromosomes on the basis of the M168 mutation. In an analogous context, the mtDNA haplogroups L1 and L2 are distinguished from other Africans and all non-Africans on the basis of the 3594 mutation (or 3592 HpaI restriction site, Chen et al. 2000 and citations therein). Group I is distinguished by the absence of the derived alleles for M42/M94/M139 and the presence of M91, while all non-African males sampled, as well as the majority of African males, carry the derived alleles. Both Group I and II lineages are diverse and suggest a deeper genealogical heritage than other haplotypes. Representatives of these lineages are distributed across Africa but generally at low frequencies. Populations represented in Groups I and II include some Khoisan and Bantu speakers from South Africa, Pygmies from central Africa, and lineages in Sudan, Ethiopia and Mali. A single Sardinian was in Group I. All members of Group II share the M60 and M181 mutations that are distributed across Africa, with an idiosyncratic occurrence in Pakistan. M182 defines the major sub-clade, although an intermediate haplotype in Mali with the unique M146 mutation still persists. Although not mutually exclusive, some geo-