of the human genome
[Figure: a stretch of human DNA sequence from a chromosome 17 p13.3 cosmid contig, labeled "Olfactory"; the nucleotide text is not recoverable from the extraction.]
Protein sequences: 20 letters (amino acids, AA)
Length: 50-6000 AA
Example: a human immunoglobulin

ID   A1BG_HUMAN   STANDARD;   PRT;   495 AA.
... ... ...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
... ... ...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//
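The two claims on this slide (a 20-letter alphabet, and a length stated in the record header) can be checked mechanically. The following is a minimal sketch, assuming the record is laid out as shown on the slide with only the ID, KW and SQ lines kept; the function name `parse_sq` and the constant `STANDARD_AA` are illustrative, not part of any library.

```python
import re

# The 20 standard amino-acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# The record from the slide, reduced to the lines that are shown there.
RECORD = """\
ID   A1BG_HUMAN   STANDARD;   PRT;   495 AA.
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//
"""


def parse_sq(record):
    """Return (declared_length, sequence) from a flat-file record.

    The declared length is read from the 'SQ   SEQUENCE <n> AA;' header;
    the sequence is every line between that header and the '//' terminator,
    with the spaces between 10-residue blocks removed.
    """
    declared = None
    chunks = []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("SQ"):
            match = re.search(r"SEQUENCE\s+(\d+)\s+AA", line)
            if match is None:
                raise ValueError("SQ line does not declare a length")
            declared = int(match.group(1))
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            chunks.append(line.replace(" ", ""))
    return declared, "".join(chunks)


declared_length, sequence = parse_sq(RECORD)
uses_only_standard_letters = set(sequence) <= STANDARD_AA
print(declared_length, len(sequence), uses_only_standard_letters)
```

For this record the declared length and the counted length agree, and every residue is one of the 20 standard letters.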
Gene-Finding by Computer
Starting from the early 1980s:
• “Ab initio” or “de novo” algorithms: GeneMark, GenScan, FgeneSH, Genie, … based on gene-structure models and training data. (Our ongoing project: BGF, the BGI Gene Finder)
• Homology methods, based on sequence alignment with known genes in databases and on comparative genomics of not-too-distant species
• Mixed approach using both strategies: TwinScan
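To make "ab initio" concrete, here is a deliberately naive sketch of the simplest sequence-only signal: an open reading frame (ORF) that runs from ATG to an in-frame stop codon. This is not how GeneMark, GenScan, FgeneSH, Genie or BGF work; those rely on trained gene-structure models, and eukaryotic genes are interrupted by introns. The function name `find_orfs`, the `min_codons` threshold and the toy sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Scan the forward strand in all three frames for ATG...stop ORFs.

    Returns a list of (start, end, frame) tuples with 0-based, end-exclusive
    coordinates; `end` includes the stop codon. ORFs whose coding part
    (start codon up to, but not including, the stop) is shorter than
    `min_codons` codons are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


# Toy sequence (made up): one ORF, ATG AAA TTT GGG TAA, in frame 2.
toy = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(toy)
print(orfs)
print([toy[s:e] for s, e, _ in orfs])
```

A real ab initio gene-finder replaces this fixed rule with a probabilistic model of exons, introns and splice signals whose parameters are estimated from training data, which is why the slide stresses both ingredients.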
Different Stages of Gene-Finding
• Use all possible existing programs and services on the web with a public-domain or home-made genome viewer
• Write your own gene-finder, trained for the specific organism
• A dream for the time being: design a self-training and self-developing program “for any species” which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs
Performance of Gene-Finders in Eukaryote Genomes
• M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):
  Nucleotide level: 80%
  Exon level: 45%
  Whole gene structure: 20%
• FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences):
  Nucleotide level: 90%
  Exon level: 60%
  Whole gene structure: 40%
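The three levels differ in how strict a "hit" is. The slide does not say whether its percentages are sensitivity, specificity, or an average of the two, so the sketch below simply computes both under commonly used definitions: sensitivity is the fraction of the annotation that is recovered, and "specificity" (as the term is used in gene-finding, i.e. precision) is the fraction of the prediction that is correct. The function names and the exon coordinates are invented for illustration.

```python
def nucleotide_set(exons):
    """Expand (start, end) exons, 0-based and end-exclusive, into positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def ratio(hits, total):
    """hits / total, or 0.0 when there is nothing to compare against."""
    return hits / total if total else 0.0


def evaluate(annotated, predicted):
    """Compare one predicted gene structure with its annotation.

    Nucleotide level: overlap of coding positions.
    Exon level: an exon counts only if both boundaries match exactly.
    Gene level: correct only if every exon matches and none is extra.
    """
    ann_nt, pred_nt = nucleotide_set(annotated), nucleotide_set(predicted)
    shared_nt = len(ann_nt & pred_nt)

    ann_exons, pred_exons = set(annotated), set(predicted)
    exact_exons = len(ann_exons & pred_exons)

    return {
        "nucleotide_sn": ratio(shared_nt, len(ann_nt)),
        "nucleotide_sp": ratio(shared_nt, len(pred_nt)),
        "exon_sn": ratio(exact_exons, len(ann_exons)),
        "exon_sp": ratio(exact_exons, len(pred_exons)),
        "gene_correct": ann_exons == pred_exons,
    }


# Made-up example: the second exon's start is predicted 10 bases too late.
annotated = [(100, 200), (300, 400), (500, 650)]
predicted = [(100, 200), (310, 400), (500, 650)]
result = evaluate(annotated, predicted)
for name, value in result.items():
    print(name, value)
```

In this example a single misplaced boundary costs only 10 of 350 coding nucleotides, yet it removes one of three exons at the exon level and makes the whole gene count as wrong. That is the pattern in the figures above: accuracy falls sharply from the nucleotide level to the exon level to the whole gene structure.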