Acta Societatis Botanicorum Poloniae
ORIGINAL RESEARCH PAPER
Acta Soc Bot Pol 83(3):191–199
DOI: 10.5586/asbp.2014.023
Received: 2014-02-19 Accepted: 2014-07-28 Published electronically: 2014-09-26

De novo sequencing and comparative transcriptome analysis of white petals and red labella in Phalaenopsis for discovery of genes related to flower color and floral differentiation

Yuxia Yang1, Jingjing Wang1,2, Zhihu Ma3, Guosheng Sun3, Changwei Zhang1*

1 College of Horticulture, Nanjing Agricultural University, 1# Weigang, Nanjing, Jiangsu Province, 210095, P.R. China
2 Dongxin Department, Jiangsu Provincial Agricultural Reclamation and Development Corporation, 26# Dongfang Middle Road, Lianyungang, Jiangsu Province, 222248, P.R. China
3 Zhenjiang Institute of Agricultural Sciences in Hilly Area of Jiangsu Province, 112# Ninghang Road, Jurong, Jiangsu, 212400, P.R. China

* Corresponding author. Email: changweizh@njau.edu.cn
Handling Editor: Przemysław Wojtaszek

This is an Open Access article distributed under the terms of the Creative Commons Attribution 3.0 License (creativecommons.org/licenses/by/3.0/), which permits redistribution, commercial and non-commercial, provided that the article is properly cited. © The Author(s) 2014. Published by Polish Botanical Society.

Abstract

Phalaenopsis is one of the world's most popular and important epiphytic monopodial orchids. The extraordinary floral diversity of Phalaenopsis is a reflection of its evolutionary success. As a consequence of this diversity, and of the complexity of flower color development in Phalaenopsis, this genus is valuable research material for developmental biology studies. Nevertheless, research on the molecular mechanisms underlying flower color and floral organ formation in Phalaenopsis is still in its early phases. In this study, we generated large amounts of data from Phalaenopsis flowers by combining Illumina sequencing with differentially expressed gene (DEG) analysis. We obtained 37 723 and 34 020 unigenes from petals and labella, respectively. A total of 2736 DEGs were identified, and the functions of many DEGs were annotated by BLAST-searching against several public databases. We mapped 837 up-regulated DEGs (432 from petals and 405 from labella) to 102 Kyoto Encyclopedia of Genes and Genomes pathways. Almost all pathways were represented in both petals (102 pathways) and labella (99 pathways). DEGs involved in energy metabolism were significantly differentially distributed between labella and petals, and various DEGs related to flower color and floral differentiation were found in the two organs. Interestingly, we also identified genes encoding several key enzymes involved in carotenoid synthesis. These genes were differentially expressed between petals and labella, suggesting that carotenoids may influence Phalaenopsis flower color. We thus conclude that a combination of anthocyanins and/or carotenoids determines flower color formation in Phalaenopsis. These results broaden our understanding of the mechanisms controlling flower color and floral organ differentiation in Phalaenopsis and other orchids.

Keywords: Phalaenopsis; RNA-seq; transcriptome; flower color; diversity

Introduction

Phalaenopsis amabilis, an epiphytic monopodial orchid, is an important ornamental species with a huge global market, especially in western countries. Ornamental quality in Phalaenopsis is influenced by many factors, including flower color, fragrance and shape, cut-flower longevity, flowering control, and abiotic stress tolerance [1]. Among these, flower color and diversity are the two main factors visually impacting the ornamental and commercial value of Phalaenopsis.

Phalaenopsis flowers are zygomorphic and include three outer tepals (T1–T3, also known as sepals) in the first whorl, two lateral inner tepals (t1 and t2, the petals), and a median inner tepal (t3, the lip or labellum) [2,3]. Floral initiation and development are regulated by complex networks of exogenous and endogenous signals, including light stimuli, hormones, ligand–receptor interactions, signal transduction pathways, and transcription factor cascades [4]. A recent theory known as "the orchid code" proposes a model in which four different class B AP3/DEF-like MADS-box genes have played a vital role in the evolution of the orchid perianth through their combinatorial interaction. Orchid floral diversity is an evolutionary consequence of two duplication events and associated changes in the regulatory regions of the class B AP3/DEF-like MADS-box genes, which were followed by sub- and neo-functionalization. In orchids, class B AP3/DEF-like MADS-box genes are divided into four distinct clades: PeMADS2-like (clade 1), OMADS3-like (clade 2), PeMADS3-like (clade 3), and PeMADS4-like (clade 4), each having its own specific expression pattern. The combined expression of clade 1 and clade 2 genes mediates the development of the three outer tepals, while the combination of clade 1, clade 2, and clade 3 genes leads to the development of the lateral inner tepals. Labella development is determined by a combination of genes
from all four clades. Differential expression of clade 3 genes is responsible for differences between inner and outer tepals, whereas differential expression of clade 4 genes differentiates the lateral inner tepals from the labella [2,3]. Orchid PI/GLO-like genes, found to be present in the AP3/DEF-like gene copies, are also necessary for floral tissue development [4]. Despite this knowledge, however, the regulatory network controlling orchid floral development remains unclear.

Flower color is derived from the three major classes of plant pigments: anthocyanins, betalains, and carotenoids [5]. Of these, anthocyanins are the major contributors to flower color [6]. A class of water-soluble flavonoids, anthocyanins are synthesized in the cytosol and localized in vacuoles. Through the phenylpropanoid pathway, they provide a wide range of colors, ranging from orange-red to violet-blue in dark-colored flowers [5]. Despite their structural variety, anthocyanins are categorized into only six chromophore forms: pelargonidin, cyanidin, peonidin, delphinidin, petunidin, and malvidin [7,8]. The anthocyanin biosynthetic pathway has been well elaborated [9].

In orchids, the primary anthocyanin in red flowers is a cyanidin derivative that is typically modified by glycosylation and acylation [10]. The glycosylation-related gene PeU-FGT3 plays a critical role in red color formation in Phalaenopsis [8]. Several important enzymes, such as chalcone synthase (CHS), chalcone isomerase (CHI), dihydroflavonol 4-reductase (DFR), and anthocyanidin synthase (ANS), are involved in the formation of colored anthocyanidins [5].

Compared with studies in other flowering plants, the molecular basis of floral color development has not been well characterized in orchids. A better understanding of the molecular mechanisms underlying orchid flower color and floral organ formation is thus needed. Furthermore, few transcriptome-based investigations of the functions of genes related to flower color and floral differentiation have been reported for Phalaenopsis. To expand knowledge regarding flower color and diversity for Phalaenopsis breeding, in this study we analyzed differential gene expression between petals and labella using the Illumina RNA-Seq method. Our study generated a large number of Phalaenopsis transcript sequences during floral formation that can be used to discover putative genes related to flower color and floral differentiation. By comparing relative gene expression levels between petals and labella, novel insights can be gleaned into orchid floral development. Our study therefore provides a foundation for future research on mechanisms underlying floral development in Phalaenopsis and other orchids.

Material and methods

Plant material and sample collection

Phalaenopsis plants with white petals and red labella (Fig. 1) were grown in greenhouses at Nanjing Agricultural University under natural light conditions and a controlled temperature of 22–27°C. Petals and labella were collected at the full-bloom stage. The two samples were immersed in liquid nitrogen and stored at −80°C until subjected to total RNA extraction.

Fig. 1 Photographic image of selected flower materials (petal and labella).

Total RNA extraction, cDNA library construction, and Illumina deep sequencing

Total RNA was extracted from petal and labella samples using a Trizol kit (Takara, Japan). Total RNA quality and quantity were analyzed using a Nanodrop 2000 instrument (Thermo Scientific) and a ChipRNA 7500 Series II Bioanalyzer (Agilent). The two total RNA samples were delivered to Beijing Biomarker Biotechnology Co. (Beijing, China) for the construction of cDNA libraries using an mRNA-Seq Sample Preparation kit (Illumina) according to the manufacturer's instructions. The sequencing of the two samples was performed on an Illumina HiSeq 2000 system.

Sequence assembly and annotation

The raw image data produced from sequencing were transformed by base calling into raw reads. Transcriptome de novo assembly was carried out with the Trinity short-read assembly program, which generated in turn contigs, transcripts, and unigenes. To identify putative unigene functions, their sequences were aligned using BLASTX (E-value ≤ 10⁻⁵) against the following public protein databases: National Center for Biotechnology Information non-redundant (Nr) and nucleotide (Nt) databases and SwissProt, TrEMBL, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. The Blast2GO software package was used to compare and determine unigene GO annotations. Finally, WEGO software was used to obtain GO functional classifications for all annotated unigenes.

Identification of differentially expressed genes (DEGs)

To identify DEGs between the two samples, the following formula was used to calculate the significance (P) of differences in transcript accumulation for each gene:

$$P(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y} \frac{(x+y)!}{x!\,y!\,\left(1+\frac{N_2}{N_1}\right)^{(x+y+1)}}$$

where N1 and N2 represent the total number of clean reads from petals and labella, respectively, and x and y represent the number of reads mapping to the given gene in petals and labella, respectively. We then
used false discovery rate (FDR) to determine the threshold of the P-value in multiple tests. To identify DEGs, we used a combination of two criteria: expression fold-change |log2 Ratio| ≥ 1 and FDR-adjusted P-value < 0.001.

We performed GO and KEGG functional enrichment analysis to determine which DEGs were significantly enriched in GO terms and metabolic pathways (P ≤ 0.05) compared with the selected transcriptome background. Significance was calculated according to the formula:

$$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

where N is the number of genes with GO/KEGG annotations, n is the number of DEGs in the set of N genes, M is the number of genes mapped to a certain GO/KEGG term, and m is the number of DEGs in the set of M genes.

Results

Short-read de novo sequencing and assembly

We pooled equal amounts of RNA from the two samples (petals and labella) to construct cDNA libraries for transcriptome sequencing and analysis on a Genome Analyzer IIx platform using Illumina sequencing technology. Using this sequencing approach, raw reads were generated from both ends of the cDNA fragments. After data filtering using stringent quality criteria (i.e., removal of sequences shorter than 65 bp or with CycleQ20 values less than 100%), we obtained 10 734 813 clean reads comprising 2 134 924 797 nucleotides (nt) from the petal library and 16 224 038 clean reads comprising 3 276 861 889 nt from the labella library. The clean reads were assembled de novo into contigs using the Trinity software package. The results of the sequencing assembly are presented in Tab. 1. A total of 1 057 077 and 2 105 819 contigs were assembled from petals and labella, respectively, with corresponding mean lengths of 75 bp and 67 bp and N50s of 99 bp and 65 bp. Of these, 16 999 (1.61%) contigs from petals and 17 334 (0.82%) contigs from labella were longer than 500 bp. The contigs from petals and labella were further assembled into 55 101 and 67 026 scaffolds with mean lengths of 811 bp and 948 bp, respectively. Scaffold N50 lengths were 1280 bp for petals and 1168 bp for labella. In total, 15 207 scaffolds assembled from petals and 24 853 scaffolds from labella coded for transcripts longer than 1 kb. Finally, 37 723 unigenes were obtained from the petals, with a final unigene N50 length of 1125 bp and a total length of 26 656 163 nt; similarly, 34 020 unigenes were generated from the labella data, with a final unigene N50 length of 1398 bp and a total length of 27 636 110 nt. These unigenes were then organized into a transcriptome database for the identification of putative genes related to flower color and floral differentiation.

Tab. 1 Summary of sequencing and assembly results.

Assembled data / Statistics of data production    White petals      Red labella
Reads
  No. of reads                                      10 734 813       16 224 038
  Total nucleotides (nt)                         2 134 924 797    3 276 861 889
  GC percentage (%)                                      46.80            45.08
  Q20 percentage (%)                                    100.00           100.00
Contigs
  No. of contigs                                     1 057 077        2 105 819
  Total nucleotides (nt) in contigs                 79 618 808      141 300 083
  Length of N50 (bp)                                        99               65
  Mean length of contigs (bp)                               75               67
  No. of contigs above 500 bp                           16 999           17 334
Scaffolds
  No. of scaffolds                                      55 101           67 026
  Total nucleotides (nt) in scaffolds               44 692 377       63 535 969
  Length of N50 (bp)                                      1280             1443
  Mean length of scaffolds (bp)                            811              948
  No. of scaffolds above 500 bp                         23 777           40 620
Unigenes
  No. of unigenes                                       37 723           34 020
  Total nucleotides (nt) in unigenes                26 656 163       27 636 110
  Length of N50 (bp)                                      1125             1398
  Average length of unigenes (bp)                          707              812
  No. of unigenes above 500 bp                          15 903           15 984

DEG analysis between white petal and labella libraries

A primary goal of transcriptome sequencing is comparison of gene expression levels in different samples. In this study, a large number of DEGs were estimated by comparing gene expression between the petal and labella libraries. Unigene expression was calculated using the RPKM method [11]. DEGs detected with at least two-fold differences (FDR < 0.001 and |log2 Ratio| ≥ 1) between petal and labella libraries are shown in Fig. 2. Using these criteria, we identified 2736 DEGs between petals and labella. Of these, 1277 were up-regulated and 1459 were down-regulated in petals. A greater number of genes were expressed only in the petals (243) than only in the labella (62). This result suggests that petal and labella development may involve distinct processes.

DEG functional annotation

To validate and annotate the assembled DEGs, the 2736 DEGs were subjected to BLASTX comparisons (E-value ≤ 1 × 10⁻⁵) against several public protein databases to identify putative functions of the unigene sequences. As a result, 2698, 2183, 2319, 2706, 2446, 837, and 1077 DEGs were found to have homologous sequences in the Nr, Nt, SwissProt, TrEMBL, GO, KEGG, and COG databases, respectively.
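The DEG screen described in the Methods (a per-gene P-value from read counts, followed by FDR control and the two thresholds) can be sketched in Python as below. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method because the text does not name one, and the way the point probability P(y|x) was converted into a final test P-value is not specified in the source.

```python
import math


def audic_claverie_p(x, y, n1, n2):
    """Probability P(y | x) from the Methods formula.

    x, y   : reads mapped to one gene in library 1 and library 2
    n1, n2 : total clean reads in library 1 and library 2
    Computed in log space with lgamma so large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)      # log (x + y)!
        - math.lgamma(x + 1)          # log x!
        - math.lgamma(y + 1)          # log y!
        - (x + y + 1) * math.log1p(ratio)
    )
    return math.exp(log_p)


def benjamini_hochberg(p_values):
    """FDR-adjusted P-values (ASSUMED method; the paper only says 'FDR')."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_deg(log2_ratio, fdr, min_abs_log2=1.0, max_fdr=0.001):
    """Apply the two criteria from the text: |log2 Ratio| >= 1 and FDR < 0.001."""
    return abs(log2_ratio) >= min_abs_log2 and fdr < max_fdr
```

For a gene with RPKM values `a` (petals) and `b` (labella), both positive, `is_deg(math.log2(b / a), fdr)` applies the screen; how zero RPKM values were handled is not stated in the text and is left open here.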
Fig. 2 Scatter plot of relative gene expression in petals vs. labella. [Plot title: Amabilis_Yellow_vs_Amabilis_Stem-red fold-change plot; x-axis: Amabilis Yellow vs Amabilis Stem-red (log2 RPKM mean).] Red dots above zero on the y-axis indicate genes having a higher expression level in petals, green dots below zero correspond to genes having a higher expression level in labella, and black dots represent genes that were similar in both libraries. False discovery rate < 0.001 and |log2 Ratio| ≥ 1 were used as thresholds to judge the significance of gene expression differences.

Fig. 3 Clusters of orthologous genes (COG) annotations of putative proteins. [Plot title: COG Function Classification of Consensus Sequence; x-axis: Function Class.] All putative proteins were aligned to the COG database and functionally classified into one or more of 25 molecular families. The capital letters on the x-axis correspond to the COG categories listed to the right of the histogram, and the y-axis indicates the numbers of DEGs assigned to the corresponding COG category.

COG categories shown in the legend of Fig. 3:
J: Translation, ribosomal structure and biogenesis
A: RNA processing and modification
K: Transcription
L: Replication, recombination and repair
B: Chromatin structure and dynamics
D: Cell cycle control, cell division, chromosome partitioning
Y: Nuclear structure
V: Defense mechanisms
T: Signal transduction mechanisms
M: Cell wall/membrane/envelope biogenesis
N: Cell motility
Z: Cytoskeleton
W: Extracellular structures
U: Intracellular trafficking, secretion, and vesicular transport
O: Posttranslational modification, protein turnover, chaperones
C: Energy production and conversion
G: Carbohydrate transport and metabolism
E: Amino acid transport and metabolism
F: Nucleotide transport and metabolism
H: Coenzyme transport and metabolism
I: Lipid transport and metabolism
P: Inorganic ion transport and metabolism
Q: Secondary metabolites biosynthesis, transport and catabolism
R: General function prediction only
S: Function unknown
Fig. 2 Scatter plot of relative gene expression in petals vs. labella. Red dots above zero on the y-axis indicate genes having a higher expression level in petals, green dots below zero correspond to genes having a higher expression level in labella, and black dots represent genes that were similar in both libraries. False discovery rate < 0.001 and |log2 Ratio| ≥ 1 were used as thresholds to judge the significance of gene expression differences.

Fig. 3 Clusters of orthologous genes (COG) annotations of putative proteins. All putative proteins were aligned to the COG database and functionally classified into one or more of 25 molecular families. The capital letters on the x-axis correspond to the COG categories listed to the right of the histogram, and the y-axis indicates the numbers of DEGs assigned to the corresponding COG category.

COG ANALYSIS OF DEG PATTERNS. COG analysis is used to classify orthologous gene products, with all proteins in an orthologous gene cluster considered to have evolved from the same ancestral protein. The COG database, which contains putative protein sequences encoded by all genes from bacteria, archaea, and eukaryotes that have complete genome sequences, facilitates exploration via sequence analysis of evolutionary relationships among these groups [12]. The newly identified sequences were aligned against the known COG database to predict and classify their possible functions. In total, 1077 unigenes were assigned to at least one COG classification. Among the 25 COG categories (Fig. 3), the cluster for “General function prediction only” represented the largest group (262 unigenes; 24.3%), followed by “Posttranslational modification, protein turnover, chaperones” (146; 13.6%), “Transcription” (137; 12.7%) and “Translation, ribosomal structure and biogenesis” (118; 10.9%). The categories of “Cell motility” (3; 0.28%), “Nuclear structure” (1; 0.09%) and “Extracellular structures” (0; 0%) contained the smallest numbers of unigenes. This information should prove to be a valuable resource for studying specific processes and functions in Phalaenopsis.

GO FUNCTIONAL ENRICHMENT. GO is an international standardized gene functional classification system that describes properties of genes and their products in any organism [13]. Based on sequence homology, 2446 DEGs were categorized into three main categories (biological process, cellular component, and molecular function) and 62 subcategories (Fig. 4). Multiple terms were frequently assigned to the same transcript; thus, 15 159 GO term annotations were assigned to at least one GO category under biological
processes, 11 346 under cellular components, and 3552 under molecular function. Within the molecular function category, the majority of DEGs were related to binding (1655; 67.6%). With respect to cellular components, most assignments were to cell (2283; 93.2%) and cell part (2283; 93.2%) subcategories. Among biological processes, the cellular subcategory (2016; 82.3%) was the most highly represented. These results suggest that the flowering stage in Phalaenopsis is predominantly controlled by genes related to cellular structure and molecular interactions. In contrast, almost no DEGs were assigned to virion (2; 0.8%) and virion part (2; 0.8%) under cellular components, cell killing (1; 0.4%) under biological processes, and chemoattractant activity (1; 0.4%) and receptor regulator activity (1; 0.4%) under molecular function.

KEGG PATHWAY ENRICHMENT ANALYSIS. Pathway enrichment analysis identifies metabolic or signal transduction pathways that are significantly enriched in DEGs by comparing them with the whole genome background. On the basis of sequence homology, 837 up-regulated DEGs (432 in petals and 405 in labella) were mapped to 102 KEGG pathways. This result suggests that the development and formation of petals and labella involve two totally different processes. Almost all KEGG pathways were found in the two floral parts: 102 pathways in petals and 99 in labella (Tab. 2). Among DEGs, 441 could be mapped to 10 secondary metabolic sub-pathways. We identified many gene families that encode key enzymes involved in major secondary metabolic pathways, such as those associated with betalain, flavone and flavonol, phenylpropanoids, flavonoids, terpenoids, and carotenoids.
A total of 2822 genes involved in the 10 sub-pathways were detected in our study. The broad coverage obtained for these secondary metabolic genes provides abundant information for examining secondary metabolite biosynthesis in Phalaenopsis.

Expression levels of these genes differed greatly. The 432 DEGs up-regulated in petals mapped to 102 KEGG pathways distributed into the categories of metabolism (292 unigenes), genetic information processing (122 unigenes), organismal systems (12 unigenes), cellular processes (4 unigenes), environmental information processing (9 unigenes), and human diseases (2 unigenes). The distribution of the 405 DEGs up-regulated in labella into 94 KEGG pathways was as follows: genetic information processing (201 unigenes), metabolism (146 unigenes), cellular processes (38 unigenes), environmental information processing (19 unigenes), organismal systems (15 unigenes), and human diseases (50 unigenes). The broad distribution of both petal and labella unigenes in these essential processes suggests that the associated biochemical pathway networks are complicated.

Fig. 4 Gene ontology (GO) classification of all annotated unigenes and DEGs. GO classification according to biological process, cellular component, and molecular function categories of all annotated 23 695 unigenes and 2446 differentially expressed unigenes. Subcategories are indicated on the x-axis. The right y-axis indicates the number of genes in each category, while the left y-axis represents the percentage of genes within each GO category falling into a specific subcategory. Gray bars correspond to all annotated unigenes and red bars indicate differentially expressed unigenes (DEGs).
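
The KEGG pathway enrichment step described above compares the DEGs in a pathway with the whole genome background. The text does not name the statistical test that was used. A common choice for this kind of comparison is the one-sided hypergeometric test, and the following minimal Python sketch assumes that test. The function name `enrichment_p_value` and all counts in the usage example are hypothetical and are not taken from this study.

```python
from math import comb


def enrichment_p_value(degs_in_pathway, degs_total,
                       background_in_pathway, background_total):
    """One-sided hypergeometric p-value, P(X >= degs_in_pathway).

    Assumption: enrichment is assessed with a hypergeometric test;
    the paper itself does not state which test was applied.

    degs_in_pathway       -- DEGs annotated to the pathway (k)
    degs_total            -- all DEGs with a pathway annotation (n)
    background_in_pathway -- background genes in the pathway (M)
    background_total      -- all background genes with an annotation (N)
    """
    k, n = degs_in_pathway, degs_total
    M, N = background_in_pathway, background_total
    # Number of ways to draw n genes from the N background genes.
    all_draws = comb(N, n)
    # Sum the draws that contain at least k pathway genes.
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations drop out automatically.
    tail_draws = sum(comb(M, i) * comb(N - M, n - i)
                     for i in range(k, min(n, M) + 1))
    return tail_draws / all_draws


# Hypothetical counts for illustration only.
p = enrichment_p_value(degs_in_pathway=15, degs_total=100,
                       background_in_pathway=50, background_total=1000)
print(f"enrichment p-value: {p:.3g}")
```

In practice the p-values for all tested pathways would still need a multiple-testing correction before any pathway is called significantly enriched.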