ResearchGatecation/341478747Phylogenomics of the genus Populus reveals extensive interspecific gene flowandbalancing selectionArticle in New Phytologist-February 2020DOe 10,111/nph,16215CITATIONREADS111814 authors, includingMingchengWangLei ZhangSichuan UniversityDalian University of Technology14 PUBLICATIONS ~ 42 CITATIONS259 PUBLCATIONS2,743 CITATIONSSEEPROFILESEE PROFILEZhiyang ZhangMengmeng LiSichuan UniversityNanjing University9 PUBLICATIONS 2CITATIONS84 PUBLUCATIONS 678 CITATIONSSEEPROFLESEEPROFILESome of the authors of this publication are also working on these related projects:Optimal control problem of SPDEs View projectoPlastome phylogeny of Brassicaceae View projectAll content following this page was uploaded by Mingcheng Wang on 19 May 2020nloaded file
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341478747 Phylogenomics of the genus Populus reveals extensive interspecific gene flow and balancing selection Article in New Phytologist · February 2020 DOI: 10.1111/nph.16215 CITATION 1 READS 118 14 authors, including: Some of the authors of this publication are also working on these related projects: Optimal control problem of SPDEs View project Plastome phylogeny of Brassicaceae View project Mingcheng Wang Sichuan University 14 PUBLICATIONS 42 CITATIONS SEE PROFILE Lei Zhang Dalian University of Technology 259 PUBLICATIONS 2,743 CITATIONS SEE PROFILE Zhiyang Zhang Sichuan University 9 PUBLICATIONS 2 CITATIONS SEE PROFILE Mengmeng Li Nanjing University 84 PUBLICATIONS 678 CITATIONS SEE PROFILE All content following this page was uploaded by Mingcheng Wang on 19 May 2020. The user has requested enhancement of the downloaded file
NewResearchPhytologistCheckforupdatesPhylogenomics of the genus Populus reveals extensiveinterspecificgeneflowandbalancingselectionMingcheng Wang'* , Lei Zhang'*, Zhiyang Zhang , Mengmeng Li', Deyan Wang', Xu Zhang, ZhenxiangXi',KenKeefover-Ring,LawrenceB.Smart,StephenP.DiFazio,Matthew S.Olson,TongmingYinJianquan Liuland TaoMa'Key Laboratory of Bio-Resource and Eco-Environmentof Ministry ofEducation, College ofLife Sciences, Sichuan University,Chengdu 610065, China: “State Key Laboratory of GrasslandAgro-Ecosystem, Instirute of Innovation Ecology & College of Life Sciences, Lanzhou Universiy, Lanzhou 730o0, China: Deparments of Borany and Geography, University of WisconsinMadison, 430 Lincoln Dr., Madison, WI 53706, USA;Horticulrure Section, School ofIntegrative Plan Science, New York State Agriculrural Experiment Sration, Cornll Universiry, Genea.NY14456,USA;Department of Biology,West Virginia Universiry,Morgantown,WV25606, USA:Deparment ofBiological Sciences,Texas Tech Universiry,Box 43131,Lubbock,X79409-3131,USA;Co-Innovation Center for SustainableForestry in Southern China,College of Forestry,Nanjing Forestry University,Nanjing 210037.ChinaSummaryAuthor for correspondence:.Phylogenetic analysis is complicated by interspecific gene flow and the presence of sharedTaoMaancestral polymorphisms,particularly those maintained by balancing selection. In this study,Tel:+8613519669951we aimed to examine the prevalence of these factors during the diversification of Populus, aEmail:matao.yz@gmail.commodel tree genus in the Northern HemisphereReceived:9February2019Weconstructed phylogenetic trees of 29Populus taxa using 80 individuals based on re-seAccepted:16September2019quenced genomes.Our species tree analyses recovered fourmain clades in thegenus basedon consensus nuclear phylogenies, but in conflict with the plastome phylogeny. A few inter-New Phytologist (2020) 225: 13701382specific relationships remained unresolved within the multiple-species clade because of incondoi:10.1111/nph.16215sistent genetrees.Our results indicated that gene flow has been widespread within eachclade and also occurred among the fourclades during their early divergence.We identified 45 candidate genes with ancient polymorphisms maintained by balancingKey words:balancingselection,gene flow,selection. These genes were mainly associated with mating compatibility,growth and stressphylogenomics, Populus, trans-specific poly-morphisms.resistance.·Bothgeneflow and selection-mediated ancientpolymorphisms areprevalent inthegenusPopulus.These arepotentiallyimportant contributorsto adaptive variation.Our resultsprovide a framework for the diversification of model tree genus that willfacilitate future com-parative studies.thelong-term maintenance of polymorphisms,and may alsoIntroductionaffect the extent of ILS, is balancing selection (Guerrero & Hahn,The phylogenetic histories of species are complicated, and it is2018),which maybe more common in plants than has been hisnow well understood that the persistence of ancestral polymor-torically recognised (Delph & Kelly, 2014). In the context ofphisms across multiple speciation events contributes to the pres-ILS, historical balancing selection actively maintains polymor-ence of gene genealogies that conflict with speciation historyphisms, and will result in a higher proportion of genes exhibiting(Mayr,1966;Schluter,2001;Coyne&Orr,2004).IncompleteILS once the lineages fix,likely because of weakened selectionlineage sorting (ILS), which results from the persistence of ances-pressures relative to drift. In such cases, orthologous sequencestral polymorphisms across multiple speciation events and thefrom the same loci will cluster by allele, rather than by species,subsequentrandomfixation of thesepolymorphisms in differentthereby distorting phylogenetictrees(Charlesworth 2006;Fijar-lineages, is one process that generates genealogical histories thatczyk & Babik,2015; Gao et al.,2015).However, although bal-are inconsistentwith the species tree(Tajima,1983;Pamilo &ancing selection should increase the frequency of ILS becauseNei, 1988; Degnan & Rosenberg, 2009). Because ILS requiresancient linked polymorphisms aremaintainedacross multiplethelong-term maintenance of ancestral polymorphisms relativespeciation events,it should not bias thegenealogical topologies ofto speciation events, ILS is expected to be much more prevalentlinked sites towards specific genealogical histories (Fijarczyk &inclades withrapid radiations (Wu,1991;Schluter,2000;Babik,2015; Gao et al, 2015).Arnold, 2006; Feng et al,2019).Another factor that influencesTo date, the influence of balancing selection on the maintenance of polymorphisms has been reported for only a limitednumber of taxonomic groups at a few loci.For example, the*These authors contributed equally to this work.1370:NewPhytologist(2020)225:1370-13822019The AuthorsNeuPhytologist2019 New Phytologist Trustwww.newphytologist.com
Phylogenomics of the genus Populus reveals extensive interspecific gene flow and balancing selection Mingcheng Wang1 * , Lei Zhang1 *, Zhiyang Zhang1 , Mengmeng Li1 , Deyan Wang1 , Xu Zhang2 , Zhenxiang Xi1 , Ken Keefover-Ring3 , Lawrence B. Smart4 , Stephen P. DiFazio5 , Matthew S. Olson6 , Tongming Yin7 , Jianquan Liu1,2 and Tao Ma1 1 Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; 2 State Key Laboratory of Grassland Agro-Ecosystem, Institute of Innovation Ecology & College of Life Sciences, Lanzhou University, Lanzhou 730000, China; 3 Departments of Botany and Geography, University of WisconsinMadison, 430 Lincoln Dr., Madison, WI 53706, USA; 4 Horticulture Section, School of Integrative Plant Science, New York State Agricultural Experiment Station, Cornell University, Geneva, NY 14456, USA; 5 Department of Biology, West Virginia University, Morgantown, WV 25606, USA; 6 Department of Biological Sciences, Texas Tech University, Box 43131, Lubbock, TX 79409-3131, USA; 7 Co-Innovation Center for Sustainable Forestry in Southern China, College of Forestry, Nanjing Forestry University, Nanjing 210037, China Author for correspondence: Tao Ma Tel: +86 13519669951 Email: matao.yz@gmail.com Received: 9 February 2019 Accepted: 16 September 2019 New Phytologist (2020) 225: 1370–1382 doi: 10.1111/nph.16215 Key words: balancing selection, gene flow, phylogenomics, Populus, trans-specific polymorphisms. Summary Phylogenetic analysis is complicated by interspecific gene flow and the presence of shared ancestral polymorphisms, particularly those maintained by balancing selection. In this study, we aimed to examine the prevalence of these factors during the diversification of Populus, a model tree genus in the Northern Hemisphere. We constructed phylogenetic trees of 29 Populus taxa using 80 individuals based on re-sequenced genomes. Our species tree analyses recovered four main clades in the genus based on consensus nuclear phylogenies, but in conflict with the plastome phylogeny. A few interspecific relationships remained unresolved within the multiple-species clade because of inconsistent gene trees. Our results indicated that gene flow has been widespread within each clade and also occurred among the four clades during their early divergence. We identified 45 candidate genes with ancient polymorphisms maintained by balancing selection. These genes were mainly associated with mating compatibility, growth and stress resistance. Both gene flow and selection-mediated ancient polymorphisms are prevalent in the genus Populus. These are potentially important contributors to adaptive variation. Our results provide a framework for the diversification of model tree genus that will facilitate future comparative studies. Introduction The phylogenetic histories of species are complicated, and it is now well understood that the persistence of ancestral polymorphisms across multiple speciation events contributes to the presence of gene genealogies that conflict with speciation history (Mayr, 1966; Schluter, 2001; Coyne & Orr, 2004). Incomplete lineage sorting (ILS), which results from the persistence of ancestral polymorphisms across multiple speciation events and the subsequent random fixation of these polymorphisms in different lineages, is one process that generates genealogical histories that are inconsistent with the species tree (Tajima, 1983; Pamilo & Nei, 1988; Degnan & Rosenberg, 2009). Because ILS requires the long-term maintenance of ancestral polymorphisms relative to speciation events, ILS is expected to be much more prevalent in clades with rapid radiations (Wu, 1991; Schluter, 2000; Arnold, 2006; Feng et al., 2019). Another factor that influences the long-term maintenance of polymorphisms, and may also affect the extent of ILS, is balancing selection (Guerrero & Hahn, 2018), which may be more common in plants than has been historically recognised (Delph & Kelly, 2014). In the context of ILS, historical balancing selection actively maintains polymorphisms, and will result in a higher proportion of genes exhibiting ILS once the lineages fix, likely because of weakened selection pressures relative to drift. In such cases, orthologous sequences from the same loci will cluster by allele, rather than by species, thereby distorting phylogenetic trees (Charlesworth 2006; Fijarczyk & Babik, 2015; Gao et al., 2015). However, although balancing selection should increase the frequency of ILS because ancient linked polymorphisms are maintained across multiple speciation events, it should not bias the genealogical topologies of linked sites towards specific genealogical histories (Fijarczyk & Babik, 2015; Gao et al., 2015). To date, the influence of balancing selection on the maintenance of polymorphisms has been reported for only a limited *These authors contributed equally to this work. number of taxonomic groups at a few loci. For example, the 1370 New Phytologist (2020) 225: 1370–1382 2019 The Authors www.newphytologist.com New Phytologist 2019 New Phytologist Trust Research
NewResearch1371Phytologistdiverged alleles at the major histocompatibility locus (MHC) areunresolved between lineages from sect. Turanga ranging fromoften shared across distantly related vertebrate species (Kleincentral Asia to Africa and P.mexicana of the monotypic sect.et al.,2007), and the divergent ABOblood alleles exist togetherAbaso(Eckenwalder,1996).Moreover,phylogeniesconstructedin humans, gibbons and in Old World monkeys (Segurel et al.,with molecular data also conflict.A phylogenetic analyses of2012).In plants, classic examples ofdivergent alleles maintainedchloroplast genomes identified Turanga as the basal sectionby balancing selection include self-incompatibility (S) and disease(Zhang et al., 2018), whereas a phylogeny based on multiple low-resistance (R)genes (Takebayashi et al., 2003; Roux et al.,2013;copy genes suggested that P.mexicana was basal (X.Liu et al.,Karasoyet al.,2014).Recently trans-specific polymorphisms2017).These inconsistencies,togetherwith a lack of resolutionwere reported for five genes in Arabidopsis and distantly relatedin some prior phylogenetic analyses, may result from infuencesCapsella with divergence around 8 Myr, the function of whichof both interspecific gene fow and ILS of ancient polymor-phisms.involved in adaptation to divergent habitats (Wu et al., 2017).In order to investigate the phylogeny of Populus and factorsWith the ability to now generate large whole-genome data sets tothat have infuenced conficting genealogical histories among loci,explore phylogenetic histories, we also can better identify thepotential for long-term balancing selection to influence the main-we generated a whole-genome data set consisting of 80 individu-tenance of polymorphisms across speciation eventsals of 29taxa for thegenus.Our sampling covers all six sectionsUnlike ILS, historical hybridisation will bias the numbers ofof the genus and most distinct species as well as hybrids (Ecken-genealogies that exhibit histories in conflict with the history ofwalder,1996).We first reconstructed a backbone phylogeny forspeciation (Leache et al., 2014; Solis-Lemus et al.,2016).In fact,the genus based on variant sites within the single-copy genes.Wethe presence of this bias underlies the rationale for the ABBA-then examined the presence of hybridisation between the mainBABA test, which has been used to identify historical patterns oflineages by identity-by-descent (IBD) and ABBA-BABA analy-hybridisation throughout the tree of life (Green et al., 2010;ses. Finally, we selected six species from the three main lineagesDurand et al., 2011). One characteristic of this test is that theof thegenus to identify genes with ancient polymorphisms thatwere likely maintained by balancing selection.These resultsgenealogical effects of ancient hybridisation will persist in thegenomes of extant species (Lamichhaney et al., 2015; Novikovaprovide a detailed examination of the phylogenetic relationshipset al,2016;Feng et al.,2019).Plants arewell known tomaintainand evolutionarydiversification ofthemodel genusPopulus.the ability to hybridise even after clear morphological differentia-tion between species has occurred (Eaton et al, 2015; BauteMaterialsand Methodset al., 2016; Pease et al., 2016; Y. Liu et al, 2017; Crowl et al.,2019),which will complicate the reconstruction of speciationSamplecollection,sequencing andmappinghistories.Leaves representing 63 individuals of 24 species were collectedIn this studyweaimed toexaminegeneow andancient polymorphisms within diversification of the genus Populus.All speciesfrom natural populations and dried on silica gel (Supportingof thegenus are collectivelyknown as poplars and are widely dis-Information Table S1).For each individual, whole genomicDNA was extracted using the CTAB protocol (Doyle & Doyletributed in the Northern Hemisphere from subtropical to borealforests (Eckenwalder,1996),wherethey can act askeystone1987).Paired-end Illumina genomic libraries were prepared andspecies (Whitham et al.,2006).In addition, most poplars exhibitsequenced on the HiSeq 2000 and HiSeq 2500 Illumina plat-ecological fexibility, with diverse adaptations and large popula-forms following the manufacturer's instructions (llumina, Santion sizes.Numerous species havebeen artificially planted aroundDiego, CA, USA). Previously published genome sequences of 17the world, and poplars account for more than half of the plantedindividuals from six species were also included in our analysis(Slavov et al.,2012; Geraldes et al.,2015; Wang J.et al., 2016;trees in China, where they are used for the wood, pulp and paperindustries and for environmental restoration projects (IsebrandsMa et al, 2018) (Table S1). In total, genome resequencing data& Richardson, 2014). Although this genus has long been used asfor 80 individuals from 29 taxa covering all six sections of thea model for diverse studies in trees (Tuskan et al.,2006; Janssongenus were obtained.The raw reads were subjected to quality control. Low-quality&Douglas,2007),phylogenetic relationships withinthegenusremain unclear. Frequent interspecific hybridisation and clonalreads were removed if they met any of the following criteria: (1)expansion have perplexed taxonomists of the genus, with≥5% unidentified nucleotides;(2)a phred quality≤7for>65%acknowledged species, varieties, and hybrids ranging from 22 toof read length; and (3) reads overlapping >10bp with the85 (Eckenwalder, 1996).Six sections are traditionally recognisedadapter sequence,allowing<2bp mismatch.Wethen mappedthese high-quality reads to the P.trichocarpa reference genome(Abaso, Turanga, Populus, Leucoides, Tacamahaca and Aigeiros)based on morphological traits (Eckenwalder,1996).However,v.3.0 (Tuskan et al., 2006) using BwA-MEM v.0.7.12-r1039 withthese sections have not been consistently supported by moleculardefault parameters (Li&Durbin,2009).Duplicated reads wereevidence, and relationships among and within the sections haveremoved usingthermdupfunctionof SAMTooLSv.1.3.1(Libeen the subjects of controversy (Hamzeh & Dayanandan, 2004;et al., 2009).Finally,the Genome Analysis Toolkit (GATK)Cervera et al., 2005; Wang et al, 2014; X. Liu et al, 2017,v.3.6 (McKenna et al, 2010) was used to perform local realign-Zhang et al., 2018).For example, based on morphological traitsment of reads to enhance alignments in regions around putativeInDels.and fossil evidence, the basal lineages of Populus remain2019TheAuthorsNew Phytologist(2020)225:1370-1382NeuPlytologist2019New PhytologistTrustwww.newphytologist.com
diverged alleles at the major histocompatibility locus (MHC) are often shared across distantly related vertebrate species (Klein et al., 2007), and the divergent ABO blood alleles exist together in humans, gibbons and in Old World monkeys (Segurel et al., 2012). In plants, classic examples of divergent alleles maintained by balancing selection include self-incompatibility (S) and disease resistance (R) genes (Takebayashi et al., 2003; Roux et al., 2013; Karasov et al., 2014). Recently trans-specific polymorphisms were reported for five genes in Arabidopsis and distantly related Capsella with divergence around 8 Myr, the function of which involved in adaptation to divergent habitats (Wu et al., 2017). With the ability to now generate large whole-genome data sets to explore phylogenetic histories, we also can better identify the potential for long-term balancing selection to influence the maintenance of polymorphisms across speciation events. Unlike ILS, historical hybridisation will bias the numbers of genealogies that exhibit histories in conflict with the history of speciation (Leache et al., 2014; Solıs-Lemus et al., 2016). In fact, the presence of this bias underlies the rationale for the ABBA– BABA test, which has been used to identify historical patterns of hybridisation throughout the tree of life (Green et al., 2010; Durand et al., 2011). One characteristic of this test is that the genealogical effects of ancient hybridisation will persist in the genomes of extant species (Lamichhaney et al., 2015; Novikova et al., 2016; Feng et al., 2019). Plants are well known to maintain the ability to hybridise even after clear morphological differentiation between species has occurred (Eaton et al., 2015; Baute et al., 2016; Pease et al., 2016; Y. Liu et al., 2017; Crowl et al., 2019), which will complicate the reconstruction of speciation histories. In this study, we aimed to examine gene flow and ancient polymorphisms within diversification of the genus Populus. All species of the genus are collectively known as poplars and are widely distributed in the Northern Hemisphere from subtropical to boreal forests (Eckenwalder, 1996), where they can act as keystone species (Whitham et al., 2006). In addition, most poplars exhibit ecological flexibility, with diverse adaptations and large population sizes. Numerous species have been artificially planted around the world, and poplars account for more than half of the planted trees in China, where they are used for the wood, pulp and paper industries and for environmental restoration projects (Isebrands & Richardson, 2014). Although this genus has long been used as a model for diverse studies in trees (Tuskan et al., 2006; Jansson & Douglas, 2007), phylogenetic relationships within the genus remain unclear. Frequent interspecific hybridisation and clonal expansion have perplexed taxonomists of the genus, with acknowledged species, varieties, and hybrids ranging from 22 to 85 (Eckenwalder, 1996). Six sections are traditionally recognised (Abaso, Turanga, Populus, Leucoides, Tacamahaca and Aigeiros) based on morphological traits (Eckenwalder, 1996). However, these sections have not been consistently supported by molecular evidence, and relationships among and within the sections have been the subjects of controversy (Hamzeh & Dayanandan, 2004; Cervera et al., 2005; Wang et al., 2014; X. Liu et al., 2017, Zhang et al., 2018). For example, based on morphological traits and fossil evidence, the basal lineages of Populus remain unresolved between lineages from sect. Turanga ranging from central Asia to Africa and P. mexicana of the monotypic sect. Abaso (Eckenwalder, 1996). Moreover, phylogenies constructed with molecular data also conflict. A phylogenetic analyses of chloroplast genomes identified Turanga as the basal section (Zhang et al., 2018), whereas a phylogeny based on multiple lowcopy genes suggested that P. mexicana was basal (X. Liu et al., 2017). These inconsistencies, together with a lack of resolution in some prior phylogenetic analyses, may result from influences of both interspecific gene flow and ILS of ancient polymorphisms. In order to investigate the phylogeny of Populus and factors that have influenced conflicting genealogical histories among loci, we generated a whole-genome data set consisting of 80 individuals of 29 taxa for the genus. Our sampling covers all six sections of the genus and most distinct species as well as hybrids (Eckenwalder, 1996). We first reconstructed a backbone phylogeny for the genus based on variant sites within the single-copy genes. We then examined the presence of hybridisation between the main lineages by identity-by-descent (IBD) and ABBA–BABA analyses. Finally, we selected six species from the three main lineages of the genus to identify genes with ancient polymorphisms that were likely maintained by balancing selection. These results provide a detailed examination of the phylogenetic relationships and evolutionary diversification of the model genus Populus. Materials and Methods Sample collection, sequencing and mapping Leaves representing 63 individuals of 24 species were collected from natural populations and dried on silica gel (Supporting Information Table S1). For each individual, whole genomic DNA was extracted using the CTAB protocol (Doyle & Doyle, 1987). Paired-end Illumina genomic libraries were prepared and sequenced on the HiSeq 2000 and HiSeq 2500 Illumina platforms following the manufacturer’s instructions (Illumina, San Diego, CA, USA). Previously published genome sequences of 17 individuals from six species were also included in our analysis (Slavov et al., 2012; Geraldes et al., 2015; Wang J. et al., 2016; Ma et al., 2018) (Table S1). In total, genome resequencing data for 80 individuals from 29 taxa covering all six sections of the genus were obtained. The raw reads were subjected to quality control. Low-quality reads were removed if they met any of the following criteria: (1) ≥5% unidentified nucleotides; (2) a phred quality ≤ 7 for ˃65% of read length; and (3) reads overlapping > 10 bp with the adapter sequence, allowing < 2 bp mismatch. We then mapped these high-quality reads to the P. trichocarpa reference genome v.3.0 (Tuskan et al., 2006) using BWA-MEM v.0.7.12-r1039 with default parameters (Li & Durbin, 2009). Duplicated reads were removed using the ‘rmdup’ function of SAMTOOLS v.1.3.1 (Li et al., 2009). Finally, the Genome Analysis Toolkit (GATK) v.3.6 (McKenna et al., 2010) was used to perform local realignment of reads to enhance alignments in regions around putative InDels. 2019 The Authors New Phytologist 2019 New Phytologist Trust New Phytologist (2020) 225: 1370–1382 www.newphytologist.com New Phytologist Research 1371
New1372 ResearchPhytologistthen extracted across all 29 species and divided into threeSingle nucleotide polymorphisms and genotype callingdatasets: (1) the first and second codon positions (C2); (2) theSingle nucleotide polymorphisms (SNPs) and short InDels werethird codon position (C3); and (3) complete coding sequencescalled with GATK UnifiedGenotyper with default parameters for(CDS). For each dataset, the individual gene trees were con-each species separately.Somefiltering steps were performed tostructed using RAxML v.8.0.17 and a species tree was estimatedreduce false positives: (1) SNPs and InDels with a quality scoreusing recently developed coalescence methods in MP-EST v.1.5<30 wereremoved;(2)SNPswithmorethantwoalleleswere(Liu et al., 2010) and AsTRAL v.4.11 (Mirarab et al., 2014).Generemoved;(3)SNPsatorwithin5bpfromanyInDelsweretrees were superimposed using DENsiTREE (Bouckaert, 2010)removed; (4) genotypes with extremely low (less than one-thirdThe ErE2 package (Huerta-Cepas et al., 2010) was used to exam-average depth)or extremelyhigh (greater than three-fold averageine different topologies ofgene trees.We also estimated a con-depth) coverage were assigned as missing sites; and (5) SNPs withcatenation treeusing RAxML v.8.0.17 for concatenatedmissing genotypes in all individuals were removed. The final sin-sequences of the Ci2, C and CDS datasets, respectively.Finallywe reconstructed a ML chloroplast DNA phylogeny based on 77gle nucleotide variants (SNVs) were generated by merging theresults of each species, and only biallelic variant sites wereconcatenated genes present in all the Salicaceae species usingretained for downstream analysis.The final SNVs were annotatedRAxMLwith500bootstrapreplicates.using SNPEFF v.4.3 (Cingolani et al, 2012). A principal compo-nentanalysis was performed on these SNVs usingthe sMARTPCAIdentification of geneflowprogram in EIGENSOFT (Patterson et al., 2006).To detect shared haplotypes and thus possible gene fow betweenspecies, we performed an IBD blocks analysis based on whole-Ancestral state reconstructiongenome SNVs using BEAGLE v.4.1 (Browning &Browning,To obtaindata froman appropriate outgroup for the phylogenetic2013)with thefollowingparameters:window=50000;over-lap=5000; ibdtrim=100; ibdlod=10. To evaluate the correla-analysis, the Salix suchowensis genome (v.5.2) (Dai et al., 2014)tions between IBD block length and recombination, theand the S.purpureagenome(v.1.0)(Zhou etal.,2018)weredownloaded and aligned to the P.tricbocarpa reference genomepopulation-scaled recombination rates (p)of P.trichocarpa and(v.3.0) using LAST v.869 (Kielbasa et al., 2011). Here, c. 151.04P.tremulawereobtained from a previous study(Wang J.etal.,and 151.68 Mb of the P. trichocarpa genome sequences could be2016).We also performed ABBA-BABA analysis usingS. sucbowensis as an outgroup in all comparisons. In brief, for theaccurately aligned byS.suchowensis and S.purpurea,respectively.The sites that were covered by both Salix genomes were thenordered alignment (S,S2),S3), O),two classes of sharedextracted from the alignment file and directly used as the ancestralderived alleles were identified:the ABBA site refers to a patternstate.In total,we identified the ancestral stateat about 81%ofthein which S, has the outgroup allele and S2 and S3 share thewhole-genome SNVs, which covered about 98% of the codingderived allele, the BABA site corresponds to patterns in which SSNVs from the single-copy genes identified below.and S3 share the derived allele and S2 has the outgroup alleleDstatisticswerethencalculatedas(ABBA-BABA)/(ABBA + BABA) (Green et al., 2010). Under the null hypothesisPhylogenetic analysesof ILS,the number of ABBAandBABA sites is expected tobeA maximum-likelihood (ML)tree of the concatenated wholeequal (D=0). Alternatively, significant deviation of D from 0genome SNVs was constructed using RAxML v.8.0.17 (Sta-suggests other events, in particular S exchanging genes with Smatakis, 2006).One hundred bootstrapped alignment files wereor S2 (Durand et al, 2011).D statistics were estimated usinggenerated with the option -fj.For each fle, a ML tree was builtANGsD 0.9.21 (Korneliussen et al., 2014) with a block size ofwith theoption'-f d-N3'based on theGTRCATmodel, which5Mb,and Z-scoreswerecalculated usingthem-block jackknifehas been designed to accelerate the computations on largemethod (Busing et al, 1999). A Z-score with an absolute valuedatasets with>50 taxa.Finallya majority-rule consensus tree of>3was considered statisticallysignificant.thebootstrapped treeswas generated using the'consensus'func-tion of the R package APE (Paradis et al., 2004) and support val-Identification of trans-specific polymorphisms underues of tree splits were calculated using the SuMTREES programbalancingselectionfrom the DENDRoPY package (Sukumaran & Holder, 2010).TheTo investigate trans-specific polymorphisms within Populus, wesame pipeline was applied to SNVs at four-fold degenerate sites(4D SNVs).analysed 72 additional individuals of six species from previouslyWe also identified single-copy genes using the OrthoMCL (Lipublished data sets representing three of the major lineages(Slavov et al., 2012; Geraldes et al, 2015; Wang J. et al, 2016;etal.,2003)method forall protein-codinggenes from seven Sali-caceae species:S. suchowensis (Dai et al., 2014),S.purpureaMa et al.,2018).We applied the same criteria as used above for(Zhou et al.,2018),P.trichocarpa (Tuskan et al., 2006),reads mapping, SNP and genotype calling and filtering,and onlyP. euphratica (Ma et al, 2013), P. pruinosa (Yang et al, 2017),retained SNVs with missing genotype rates <20% in all sixspecies. Shared biallelic SNVs were counted and the genomicP.deltoides(https://genome.jgi.doe.gov/)and P.albavar.pyramidalis (Ma et al., 2019).The SNVs within these genes weredivergence (Fst) between each pair ofspecies was estimated using2019TheAuthorsNeo Pbytologist (2020)225: 1370-1382NeuwPbytologist2019 New Phytologist Trustwww.newphytologist.com
Single nucleotide polymorphisms and genotype calling Single nucleotide polymorphisms (SNPs) and short InDels were called with GATK UnifiedGenotyper with default parameters for each species separately. Some filtering steps were performed to reduce false positives: (1) SNPs and InDels with a quality score < 30 were removed; (2) SNPs with more than two alleles were removed; (3) SNPs at or within 5 bp from any InDels were removed; (4) genotypes with extremely low (less than one-third average depth) or extremely high (greater than three-fold average depth) coverage were assigned as missing sites; and (5) SNPs with missing genotypes in all individuals were removed. The final single nucleotide variants (SNVs) were generated by merging the results of each species, and only biallelic variant sites were retained for downstream analysis. The final SNVs were annotated using SNPEFF v.4.3 (Cingolani et al., 2012). A principal component analysis was performed on these SNVs using the SMARTPCA program in EIGENSOFT (Patterson et al., 2006). Ancestral state reconstruction To obtain data from an appropriate outgroup for the phylogenetic analysis, the Salix suchowensis genome (v.5.2) (Dai et al., 2014) and the S. purpurea genome (v.1.0) (Zhou et al., 2018) were downloaded and aligned to the P. trichocarpa reference genome (v.3.0) using LAST v.869 (Kiełbasa et al., 2011). Here, c. 151.04 and 151.68 Mb of the P. trichocarpa genome sequences could be accurately aligned by S. suchowensis and S. purpurea, respectively. The sites that were covered by both Salix genomes were then extracted from the alignment file and directly used as the ancestral state. In total, we identified the ancestral state at about 81% of the whole-genome SNVs, which covered about 98% of the coding SNVs from the single-copy genes identified below. Phylogenetic analyses A maximum-likelihood (ML) tree of the concatenated wholegenome SNVs was constructed using RAxML v.8.0.17 (Stamatakis, 2006). One hundred bootstrapped alignment files were generated with the option ‘-f j’. For each file, a ML tree was built with the option ‘-f d -N 3’ based on the GTRCAT model, which has been designed to accelerate the computations on large datasets with > 50 taxa. Finally a majority-rule consensus tree of the bootstrapped trees was generated using the ‘consensus’ function of the R package APE (Paradis et al., 2004) and support values of tree splits were calculated using the SUMTREES program from the DENDROPY package (Sukumaran & Holder, 2010). The same pipeline was applied to SNVs at four-fold degenerate sites (4D SNVs). We also identified single-copy genes using the OrthoMCL (Li et al., 2003) method for all protein-coding genes from seven Salicaceae species: S. suchowensis (Dai et al., 2014), S. purpurea (Zhou et al., 2018), P. trichocarpa (Tuskan et al., 2006), P. euphratica (Ma et al., 2013), P. pruinosa (Yang et al., 2017), P. deltoides (https://genome.jgi.doe.gov/) and P. alba var. pyramidalis (Ma et al., 2019). The SNVs within these genes were then extracted across all 29 species and divided into three datasets: (1) the first and second codon positions (C12); (2) the third codon position (C3); and (3) complete coding sequences (CDS). For each dataset, the individual gene trees were constructed using RAXML v.8.0.17 and a species tree was estimated using recently developed coalescence methods in MP-EST v.1.5 (Liu et al., 2010) and ASTRAL v.4.11 (Mirarab et al., 2014). Gene trees were superimposed using DENSITREE (Bouckaert, 2010). The ETE2 package (Huerta-Cepas et al., 2010) was used to examine different topologies of gene trees. We also estimated a ‘concatenation tree’ using RAXML v.8.0.17 for concatenated sequences of the C12, C3 and CDS datasets, respectively. Finally, we reconstructed a ML chloroplast DNA phylogeny based on 77 concatenated genes present in all the Salicaceae species using RAXML with 500 bootstrap replicates. Identification of gene flow To detect shared haplotypes and thus possible gene flow between species, we performed an IBD blocks analysis based on wholegenome SNVs using BEAGLE v.4.1 (Browning & Browning, 2013) with the following parameters: window = 50 000; overlap = 5000; ibdtrim = 100; ibdlod = 10. To evaluate the correlations between IBD block length and recombination, the population-scaled recombination rates (q) of P. trichocarpa and P. tremula were obtained from a previous study (Wang J. et al., 2016). We also performed ABBA–BABA analysis using S. suchowensis as an outgroup in all comparisons. In brief, for the ordered alignment (((S1, S2), S3), O), two classes of shared derived alleles were identified: the ABBA site refers to a pattern in which S1 has the outgroup allele and S2 and S3 share the derived allele, the BABA site corresponds to patterns in which S1 and S3 share the derived allele and S2 has the outgroup allele. D statistics were then calculated as (ABBA BABA)/ (ABBA + BABA) (Green et al., 2010). Under the null hypothesis of ILS, the number of ABBA and BABA sites is expected to be equal (D = 0). Alternatively, significant deviation of D from 0 suggests other events, in particular S3 exchanging genes with S1 or S2 (Durand et al., 2011). D statistics were estimated using ANGSD 0.9.21 (Korneliussen et al., 2014) with a block size of 5 Mb, and Z-scores were calculated using the m-block jackknife method (Busing et al., 1999). A Z-score with an absolute value > 3 was considered statistically significant. Identification of trans-specific polymorphisms under balancing selection To investigate trans-specific polymorphisms within Populus, we analysed 72 additional individuals of six species from previously published data sets representing three of the major lineages (Slavov et al., 2012; Geraldes et al., 2015; Wang J. et al., 2016; Ma et al., 2018). We applied the same criteria as used above for reads mapping, SNP and genotype calling and filtering, and only retained SNVs with missing genotype rates < 20% in all six species. Shared biallelic SNVs were counted and the genomic divergence (FST) between each pair of species was estimated using New Phytologist (2020) 225: 1370–1382 2019 The Authors www.newphytologist.com New Phytologist 2019 New Phytologist Trust Research New 1372 Phytologist
NewResearch1373PhytologistVCFro0LS v.0.1.14 (Danecek et al., 2011). Joint allele frequencyAigeiros, Tacamahaca and Leucoides, which werefertofrom thisspectra were calculated for biallelic sites berween species,point forward as'ATL'.The divergence of these four clades wereunfolded using S.suchowensis as the outgroup,and plotted usingalso supported by a principal component analysis (Fig.S2).WeDADI v.1.7.0 (Gutenkunst et al., 2009).found that c.6% of the identified genome-wide SNVs wereTo identify genes under balancing selection, we focused onlyshared among these clades (Fig. S3),which may contribute signifon SNVs located in genic regions and shared by all six species.icantly to phylogenetic inconsistencies.To filter out potential duplicated genes,we estimated the copyTo further investigate the phylogenetic conficts of this genus,number for these genes in each individual based on the ratio ofwe identified 5305 single-copy orthologous genes across sevenexon coverage depth divided by genome coverage depth (Hast-Salicaceae genomes and extracted 620531 SNVs within the cod-ings et al., 2009), and retained genes with a ratio between 0.4ing region of these genes. Individual gene trees constructed fromand 1.6 in all individuals. To avoid the potential for samplingthese data generally had low support values and the relationshipssites influenced by convergence resulting from repeated muta-among sections of Populus werehighly variable among them (Figstions among shared SNVs, we considered genes with more thanS4,S5).Wealso constructed concatenationtrees based ondiffer-one shared SNV and at least one shared SNV in coding regions.ent partitions of these orthologues, and the results consistentlyFinally,the genes with at least two shared SNVs in linkage dise-supported the basal position of sect.Abaso and followed by thequilibrium (2>0.3) in all three major lineages were selected assuccessive divergences of the other three clades: sect. Turanga,candidate genes under balancing selection.sect. Populus and ATL (Fig. S6). The Ci2 and CDS concatena-tion trees supported the placement of sect.Turanga as basal tosects.Populus and ATL, whereas in the C3 concatenation tree,Dataaccessibilitysect.Populus diverged earlier than sect.Turanga.These phylogeThe sequencing data have been deposited in the Genomenetic conflicts among different gene partitions of orthologuesSequence Archive in the BIGData Center (BIG Data Centerwere also observed in our species tree analyses. Both AsTRAL andMembers, 2019), Beijing Institute of Genomics (BIG), ChineseMP-EST methods generated species trees nearly identical withthe concatenation tree when applied to different gene partitionsAcademy of Sciences, under accession number CRA001510 thatis publicly accessible at http:/bigd.big.ac.cn/gsa. The whole-(Figs S7,S8): sect.Populus was placed sister to the ATL clade ingenome SNV data and gene trees have been deposited in GitHuball the species trees except the C3 partition.However,this phylo(https://github.com/wangmcyao/Whole-genome-SNPs-and-gegenetic relationship was supported byCspartition after eliminat-ing gene trees with low values of average bootstrap supportne-trees-of-genus-Populus).(Fig, S9). Among these trees, the Ci2 species tree had the highestbootstrap support in all the major nodes, and thus was consideredResultsas the most reliable topologyto resolvethe Populus phylogeny(Fig, la). However, based on the phylogenetic analyses of plas-Phylogenetic analysestome sequences, onlythe monophyly of sect.Turanga was sup-To clarify the phylogenetic relationships within Populus, we col-ported, which occupied the basal position, while P. mexicana oflected 1.12Tb of whole-genome sequencing data of 80 individu-sect. Abaso clustered with species from sects. Aigeiros (P. deltoidesals from 29 species, representing all six sections and almost all ofand P.fremonti),Tacamabaca (P.angustifolia,P.trichocarpa andthe currently recognised species of this genus (Isebrands &P. balsamifera) and Leucoides (P. heterophylla) (Fig. S10).Richardson, 2014). Sequences were first aligned to the referencegenome of P.trichocarpa and about 88% were successfullyExtensivegeneflowmapped, covering 81% of the genome and yielding an averageThe highly variable relationships indicated by the gene trees, anddepth of 29 × per individual (Table S1). We applied stringentvariant calling and quality filters to identify a final set of 12.93the striking discordance between the plastome phylogeny and ourmillion biallelic SNVs (Table S2). Among these, 1.94 millioncoalescence tree,maybe due to ILS and interspecific gene fow.nonsynonymousSNVsand1.8millionsynonymousSNVswereTo gain further insight into the relationships among the species,identified.we searched for IBD haplotypes using BEAGLE. No IBD blocksWe next inferred a concatenated genome tree using a MLwere identified across sects.Abaso, Turanga, Populus or ATL;method for genome-wide and four-fold degenerate (4D) SNVs,however, abundant shared IBD blocks were detected in compar-respectively. Both trees were highly resolved for most clades andisons within the four major sections, both between and withinshowed nearly identical topologies, with the only differences inspecies (Figs 2a, S11). Of the shared blocks between species,the positions of P. lasiocarpa and P.ningshanica (Fig.Si). In both75.9% (62 539) were detected within the ATL clade, while 15.2-phylogenies, P.mexicana of sect.Abaso diverged first with high%(12483)were detected within sect.Populus,and only8.9%support, followed by sect. Populus, which was monophyletic and(7323) within sect. Turanga. The length of IBD haplotypesclearly divided into two subclades, one with only Asian speciesshared between species (Fig. S1la) ranged from 1.9 kb to 1.5 Mband a second dlade with species representing Asian, Europe and(median=23.2 kb). As expected, extensively shared haplotypesNorth America. The next split included the monophyletic sect.were found berween recently diverged species, for example,Turanga and a polyphyletic cladeconsisting of members of sects.between P.trichocarpa and P. balsamifera (median=15.1kb,2019TheAuthorsNewPhytologist(2020)225:1370-1382New Plytologist @2019New Phytologist Trustwww.newphytologist.com
VCFTOOLS v.0.1.14 (Danecek et al., 2011). Joint allele frequency spectra were calculated for biallelic sites between species, unfolded using S. suchowensis as the outgroup, and plotted using DADI v.1.7.0 (Gutenkunst et al., 2009). To identify genes under balancing selection, we focused only on SNVs located in genic regions and shared by all six species. To filter out potential duplicated genes, we estimated the copy number for these genes in each individual based on the ratio of exon coverage depth divided by genome coverage depth (Hastings et al., 2009), and retained genes with a ratio between 0.4 and 1.6 in all individuals. To avoid the potential for sampling sites influenced by convergence resulting from repeated mutations among shared SNVs, we considered genes with more than one shared SNV and at least one shared SNV in coding regions. Finally, the genes with at least two shared SNVs in linkage disequilibrium (r 2 > 0.3) in all three major lineages were selected as candidate genes under balancing selection. Data accessibility The sequencing data have been deposited in the Genome Sequence Archive in the BIG Data Center (BIG Data Center Members, 2019), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession number CRA001510 that is publicly accessible at http://bigd.big.ac.cn/gsa. The wholegenome SNV data and gene trees have been deposited in GitHub (https://github.com/wangmcyao/Whole-genome-SNPs-and-ge ne-trees-of-genus-Populus). Results Phylogenetic analyses To clarify the phylogenetic relationships within Populus, we collected 1.12 Tb of whole-genome sequencing data of 80 individuals from 29 species, representing all six sections and almost all of the currently recognised species of this genus (Isebrands & Richardson, 2014). Sequences were first aligned to the reference genome of P. trichocarpa and about 88% were successfully mapped, covering 81% of the genome and yielding an average depth of 29 9 per individual (Table S1). We applied stringent variant calling and quality filters to identify a final set of 12.93 million biallelic SNVs (Table S2). Among these, 1.94 million nonsynonymous SNVs and 1.8 million synonymous SNVs were identified. We next inferred a concatenated genome tree using a ML method for genome-wide and four-fold degenerate (4D) SNVs, respectively. Both trees were highly resolved for most clades and showed nearly identical topologies, with the only differences in the positions of P. lasiocarpa and P. ningshanica (Fig. S1). In both phylogenies, P. mexicana of sect. Abaso diverged first with high support, followed by sect. Populus, which was monophyletic and clearly divided into two subclades, one with only Asian species and a second clade with species representing Asian, Europe and North America. The next split included the monophyletic sect. Turanga and a polyphyletic clade consisting of members of sects. Aigeiros, Tacamahaca and Leucoides, which we refer to from this point forward as ‘ATL’. The divergence of these four clades were also supported by a principal component analysis (Fig. S2). We found that c. 6% of the identified genome-wide SNVs were shared among these clades (Fig. S3), which may contribute significantly to phylogenetic inconsistencies. To further investigate the phylogenetic conflicts of this genus, we identified 5305 single-copy orthologous genes across seven Salicaceae genomes and extracted 620 531 SNVs within the coding region of these genes. Individual gene trees constructed from these data generally had low support values and the relationships among sections of Populus were highly variable among them (Figs S4, S5). We also constructed concatenation trees based on different partitions of these orthologues, and the results consistently supported the basal position of sect. Abaso and followed by the successive divergences of the other three clades: sect. Turanga, sect. Populus and ATL (Fig. S6). The C12 and CDS concatenation trees supported the placement of sect. Turanga as basal to sects. Populus and ATL, whereas in the C3 concatenation tree, sect. Populus diverged earlier than sect. Turanga. These phylogenetic conflicts among different gene partitions of orthologues were also observed in our species tree analyses. Both ASTRAL and MP-EST methods generated species trees nearly identical with the concatenation tree when applied to different gene partitions (Figs S7, S8): sect. Populus was placed sister to the ATL clade in all the species trees except the C3 partition. However, this phylogenetic relationship was supported by C3 partition after eliminating gene trees with low values of average bootstrap support (Fig. S9). Among these trees, the C12 species tree had the highest bootstrap support in all the major nodes, and thus was considered as the most reliable topology to resolve the Populus phylogeny (Fig. 1a). However, based on the phylogenetic analyses of plastome sequences, only the monophyly of sect. Turanga was supported, which occupied the basal position, while P. mexicana of sect. Abaso clustered with species from sects. Aigeiros (P. deltoides and P. fremontii), Tacamahaca (P. angustifolia, P. trichocarpa and P. balsamifera) and Leucoides (P. heterophylla) (Fig. S10). Extensive gene flow The highly variable relationships indicated by the gene trees, and the striking discordance between the plastome phylogeny and our coalescence tree, may be due to ILS and interspecific gene flow. To gain further insight into the relationships among the species, we searched for IBD haplotypes using BEAGLE. No IBD blocks were identified across sects. Abaso, Turanga, Populus or ATL; however, abundant shared IBD blocks were detected in comparisons within the four major sections, both between and within species (Figs 2a, S11). Of the shared blocks between species, 75.9% (62 539) were detected within the ATL clade, while 15.2- % (12 483) were detected within sect. Populus, and only 8.9% (7323) within sect. Turanga. The length of IBD haplotypes shared between species (Fig. S11a) ranged from 1.9 kb to 1.5 Mb (median = 23.2 kb). As expected, extensively shared haplotypes were found between recently diverged species, for example, between P. trichocarpa and P. balsamifera (median = 15.1 kb, 2019 The Authors New Phytologist 2019 New Phytologist Trust New Phytologist (2020) 225: 1370–1382 www.newphytologist.com New Phytologist Research 1373