The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review. In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. [...] Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.

The number of sequenced and annotated genomes is rapidly increasing. There are currently 925 published genomes and 3185 genome sequencing projects underway. Of those underway, over 900 are eukaryotic, genomes whose large size and intron-containing genes complicate annotation. Even assuming as few as 10,000 genes/genome, these new eukaryotic genomes alone will add more than nine million annotations to GenBank. Tools to manage and analyze these gene annotations are badly needed. Consider too that next-generation sequencing technologies will soon make it possible for individual labs to sequence and annotate genomes; the number of gene annotations could thus well exceed one billion in a few years' time.

Gene annotations are not static entities, and how best to manage them is a complex and challenging problem. Gene annotations must be tracked from release to release, and problematic annotations identified, reviewed and modified. Standardization of formats and database schemas has helped matters greatly. The Sequence Ontology and GMOD projects, for example, provide tools and standards that promote database interoperability. This in turn has made possible common formats for data exchange such as CHADO XML and GFF3. The result has been an ever-proliferating number of groups annotating and redistributing their own annotations, independent of the annotation pipelines used by GenBank. Examples include not only model organism databases such as C. [...] melanogaster but also emerging model organisms such as the planarian S. [...] The growing number of annotation providers – and users – is creating a pressing need for tools and techniques for gene annotation management and analysis.

Today, most annotation management and comparison at the whole-genome scale is restricted to analyses of basic traits – for example, differences between releases are usually evaluated in terms of gene and transcript numbers. Though indisputably useful, these simple statistics only tell part of the story. Comparisons of different genomes' annotations also suffer from a paucity of measures, with most studies restricted to analyses of protein alignments. Here too, new measures of comparison are needed, measures that move beyond the amino acid sequences and take into account other aspects of the annotations such as similarities in intron-exon structures and patterns of alternative splicing.

Some previous work has been done in this area. The Sequence Ontology project, for example, has created a categorization system for alternative splicing that can identify problematic annotations for later manual review. The DEDB and ASTRA projects have also proposed genome-wide categorizations of alternative splicing using graph-based approaches. In principle, these classification systems could be used for whole-genome annotation management, but to our knowledge they have not yet been applied for this purpose. Furthermore, useful as qualitative classification systems are, quantitative metrics are also needed – measures akin to the sensitivity, specificity and accuracy metrics used by the gene-prediction community to evaluate gene-finder performance. However, they also have recognized shortcomings.
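To make concrete the kind of quantitative measure the gene-prediction community uses, the sketch below computes nucleotide-level sensitivity and specificity between two sets of exon coordinates, for instance the same gene in an older and a newer release. It is an illustration only, not the suite of measures developed in this work. The function names, the example coordinates, and the choice to report accuracy as the simple mean of sensitivity and specificity are all assumptions made for the example. Specificity is used here in the gene-prediction sense, TP / (TP + FP).

```python
from typing import Iterable, Set, Tuple

# Exon as (start, end), 1-based and inclusive, as in GFF3 coordinates.
Interval = Tuple[int, int]


def covered_bases(exons: Iterable[Interval]) -> Set[int]:
    """Return the set of nucleotide positions covered by the given exons."""
    bases: Set[int] = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_metrics(
    reference: Iterable[Interval], prediction: Iterable[Interval]
) -> Tuple[float, float, float]:
    """Nucleotide-level (sensitivity, specificity, accuracy).

    Hypothetical helper: accuracy is taken as the mean of the other two,
    which is one simple convention and an assumption of this sketch.
    """
    ref = covered_bases(reference)
    pred = covered_bases(prediction)
    tp = len(ref & pred)   # bases annotated in both
    fn = len(ref - pred)   # reference bases missing from the prediction
    fp = len(pred - ref)   # predicted bases absent from the reference
    sn = tp / (tp + fn) if ref else 0.0
    sp = tp / (tp + fp) if pred else 0.0
    return sn, sp, (sn + sp) / 2


# Made-up coordinates: the second exon was shortened in the newer release.
old_release = [(100, 199), (300, 399)]
new_release = [(100, 199), (300, 349)]

sn, sp, ac = nucleotide_metrics(old_release, new_release)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} accuracy={ac:.3f}")
```

Treating the older release as the reference is itself a choice: swapping the two arguments swaps sensitivity and specificity, so the pair of values describes how two annotations differ, not which one is correct.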