gene boundaries. During this process, multiple hits to the same region were collapsed to a coherent set of data by tracking the coverage of a region. For example, if a group of bases was represented by multiple overlapping ESTs, the union of these regions matched by the set of ESTs on the scaffold was marked as being supported by EST evidence. This resulted in a series of “gene bins,” each of which was believed to contain a single gene. One weakness of this initial implementation of the algorithm was in predicting gene boundaries in regions of tandemly duplicated genes. Gene clusters frequently resulted in homologous neighboring genes being joined together, resulting in an annotation that artificially concatenated these gene models.

Next, known genes (those with exact matches of a full-length cDNA sequence to the genome) were identified, and the region corresponding to the cDNA was annotated as a predicted transcript. A subset of the curated human gene set RefSeq from the National Center for Biotechnology Information (NCBI) was included as a data set searched in the computational pipeline. If a RefSeq transcript matched the genome assembly for at least 50% of its length at >92% identity, then the SIM4 (63) alignment of the RefSeq transcript to the region of the genome under analysis was promoted to the status of an Otto annotation. Because the genome sequence has gaps and sequence errors such as frameshifts, it was not always possible to predict a transcript that agrees precisely with the experimentally determined cDNA sequence. A total of 6538 genes in our inventory were identified and transcripts predicted in this way.

Regions that have a substantial amount of sequence similarity, but do not match known genes, were analyzed by that part of the Otto system that uses the sequence similarity information to predict a transcript. Here, Otto

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb region of chromosome 8 representing a single Celera scaffold. To generate the figure, Celera fragment sequences were mapped onto each assembly. The PFP assembly is indicated in the upper third of each panel; the Celera assembly is indicated in the lower third. In the center of the panel, green lines show Celera sequences that are in the same order and orientation in both assemblies and form the longest consistently ordered run of sequences. Yellow lines indicate sequence blocks that are in the same orientation, but out of order. Red lines indicate sequence blocks that are not in the same orientation. For clarity, in the latter two cases, lines are only drawn between segments of matching sequence that are at least 50 kbp long. The top and bottom thirds of each panel show the extent of Celera mate-pair violations (red, misoriented; yellow, incorrect distance between the mates) for each assembly grouped by library size. (Mate pairs that are within the correct distance, as expected from the mean library insert size, are omitted from the figure for clarity.) Predicted breakpoints, corresponding to stacks of violated mate pairs of the same type, are shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns are shown as cyan bars. Plots of all 24 chromosomes can be seen in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

THE HUMAN GENOME 1318 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
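The caption above marks runs of more than 10,000 Ns as gaps. As an illustration only, and not the authors' plotting code, such runs could be located in an assembled sequence with a short Python sketch. The function name `find_n_runs`, the 0-based half-open coordinates, and the case-insensitive matching are assumptions introduced here; only the "more than 10,000" threshold comes from the caption.

```python
import re
from typing import List, Tuple


def find_n_runs(sequence: str, min_length: int = 10_000) -> List[Tuple[int, int]]:
    """Return (start, end) intervals of N runs strictly longer than min_length.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    runs = []
    # Each regex match is one maximal run of consecutive N characters.
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() > min_length:
            runs.append((match.start(), match.end()))
    return runs


# Toy example with a small threshold so the input stays readable.
toy = "ACGT" + "N" * 5 + "AC" + "N" * 2 + "GT"
print(find_n_runs(toy, min_length=3))
```

The comparison is strict, so a run of exactly `min_length` Ns is not reported, matching the wording "more than 10,000".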
evaluates evidence generated by the computational pipeline, corresponding to conservation between mouse and human genomic DNA, similarity to human transcripts (ESTs and cDNAs), similarity to rodent transcripts (ESTs and cDNAs), and similarity of the translation of human genomic DNA to known proteins to predict potential genes in the human genome. The sequence from the region of genomic DNA contained in a gene bin was extracted, and the subsequences supported by any homology evidence were marked (plus 100

Fig. 7. Schematic view of the distribution of breakpoints and large gaps on all chromosomes. For each chromosome, the upper pair of lines represents the PFP assembly, and the lower pair of lines represents Celera’s assembly. Blue tick marks represent breakpoints, whereas red tick marks represent a gap larger than 10,000 bp. The number of breakpoints per chromosome is indicated in black, and the chromosome numbers in red.
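The coverage-tracking step described in the text, in which overlapping EST matches on a scaffold are collapsed to the union of the regions they cover, can be sketched as a standard interval merge. This is a minimal illustration under stated assumptions, not the Otto implementation: the name `union_of_hits`, the 0-based half-open coordinates, and the choice to merge hits that merely touch are all introduced here.

```python
from typing import Iterable, List, Tuple

# 0-based, half-open [start, end) position on a scaffold (assumed convention).
Interval = Tuple[int, int]


def union_of_hits(hits: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching hits into disjoint covered regions."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Hit overlaps or abuts the previous region: extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Hit begins past the previous region: start a new region.
            merged.append((start, end))
    return merged


# Hypothetical EST match coordinates, for illustration only.
est_hits = [(900, 1000), (100, 250), (200, 400), (950, 1100)]
print(union_of_hits(est_hits))
```

A merge of this kind also shows the weakness noted in the text: if evidence from neighboring, tandemly duplicated genes overlaps, the regions fuse into a single bin.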