Design of the GUS Schema for the Central Dogma

The major design goal of the GUS schema for the central dogma was to support as complete a representation as possible of the underlying biology itself, the process of discovering that biology, and the requirements of data integration. The rest of this page describes what this means and how this goal was reflected in the final schema.

At the end of this page is a diagram in roughly ER format that shows the main GUS tables of the central dogma. The rest of this page will refer to it. You can open the picture in another browser window. You can also use the GUS Schema Browser to see details of the tables mentioned in the text.

Genes, RNAs, and Proteins

GUS of course has tables for genes, RNAs, and proteins. GUS separates the notions of a gene, RNA, and protein from instances of their sequences. We consider first the notion tables, =DoTS::Gene=, =DoTS::RNA=, and =DoTS::Protein=, which can be found on the left of the figure. Each =DoTS::Gene= can have multiple =DoTS::RNA= children. Ideally, each RNA child would represent a different splice form or other variant of RNA that can be produced from the gene. A similar relationship holds between =DoTS::RNA= and =DoTS::Protein=.
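The parent-child relationship between the notion tables can be illustrated with a minimal in-memory sketch. The classes below are hypothetical stand-ins written for this page, not part of GUS or the GusPerlObjectLayer, and the names such as =geneA-splice1= are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for the notion tables; not the real GUS object layer.
@dataclass
class Protein:
    name: str


@dataclass
class RNA:
    name: str
    proteins: List[Protein] = field(default_factory=list)


@dataclass
class Gene:
    name: str
    rnas: List[RNA] = field(default_factory=list)


# One gene with two RNA children (e.g. two splice forms), each with one protein.
gene = Gene("geneA")
for variant in ("geneA-splice1", "geneA-splice2"):
    rna = RNA(variant)
    rna.proteins.append(Protein(variant + "-protein"))
    gene.rnas.append(rna)

# Walk gene -> RNA -> protein, as a join across the three tables would.
protein_names = [p.name for r in gene.rnas for p in r.proteins]
```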


A database may draw information for a gene from a wide variety of sources. For example, multiple gene prediction algorithms may be run on genomic sequence, gene models for a handful of well-studied genes may be loaded from GenBank, and there may be official curated annotation. Each of these sources may contain instances of the same gene. Relating these versions of the gene and the corresponding RNAs and proteins is the job of the =DoTS::GeneInstance=, =DoTS::RNAInstance=, and =DoTS::ProteinInstance= tables. The type of instance is indicated with the controlled vocabulary tables =DoTS::~=, =DoTS::~=, and =DoTS::~=. The contents of the instance category tables can be adjusted on a site-by-site basis to reflect the details of data processing done at that site.
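As a rough sketch of how instance rows tie several sources to one gene notion, consider the following. The field names and the category labels (=predicted=, =imported=, =curated=) are assumptions made for illustration; actual category values are defined per site.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneInstance:
    gene_name: str   # the gene notion this instance belongs to
    feature_id: int  # hypothetical id of the feature supplied by one source
    category: str    # hypothetical instance-category label, chosen per site


instances = [
    GeneInstance("geneA", 101, "predicted"),
    GeneInstance("geneA", 202, "imported"),
    GeneInstance("geneA", 303, "curated"),
    GeneInstance("geneB", 404, "predicted"),
]


def instances_by_gene(rows):
    """Group instance rows under the gene notion they refer to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.gene_name].append(row)
    return dict(grouped)


grouped = instances_by_gene(instances)
# Pick out the curated version of geneA among its three instances.
curated_for_a = [i.feature_id for i in grouped["geneA"] if i.category == "curated"]
```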

Sequences and Feature-based Annotation

GUS contains a wide variety of sequence- and sequence feature-related tables. Since there are so many different kinds of sequences and features, GUS uses an approximation of subclassing. (See GusSchemaSubclassing for details.) There are four underlying superclasses and their corresponding GUS implementation tables: =DoTS::NASequenceImp=, =DoTS::NAFeatureImp=, =DoTS::AASequenceImp=, and =DoTS::AAFeatureImp=. Since the NA and AA schemas are conceptually similar, we discuss only the NA set.

Views on =DoTS::NASequenceImp= are used to store nucleic acid sequence. In the Oracle implementation of GUS, sequence data is stored in =CLOBs= so very large sequences can be stored in the database. Special care is taken in the GusPerlObjectLayer to deal with large sequences.

Features are linked to sequences via their =na_sequence_id= column. To accommodate non-contiguous features, e.g., exons in a genome, GUS contains a separate table (=DoTS::NALocation=) for specifying the location of a feature on a sequence. If multiple locations are linked to a feature, then the feature is assumed to be located at all of the locations.
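The idea of one feature carrying several location rows can be sketched as follows. This is an illustration only: the =Feature= class is hypothetical, and 1-based inclusive coordinates are an assumption, not something this page specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Feature:
    na_sequence_id: int  # which sequence the feature sits on
    # One (start, end) pair per location row; 1-based inclusive is assumed here.
    locations: List[Tuple[int, int]] = field(default_factory=list)


def covered_sequence(feature: Feature, sequences: Dict[int, str]) -> str:
    """Concatenate the pieces of the sequence covered by all of the feature's locations."""
    seq = sequences[feature.na_sequence_id]
    parts = [seq[start - 1:end] for start, end in sorted(feature.locations)]
    return "".join(parts)


sequences = {1: "AAACCCGGGTTT"}
# A non-contiguous feature: located at both 1..3 and 7..9.
feature = Feature(na_sequence_id=1, locations=[(7, 9), (1, 3)])
covered = covered_sequence(feature, sequences)
```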

Central Dogma Features

Features related to the central dogma will likely be imported or defined without the corresponding central dogma notions, so GUS allows the relationships between these features to be specified in full. In a complete database, there will be gene -> RNA -> protein linkages in both features and notions. The main central dogma feature tables are =DoTS::GeneFeature=, =DoTS::ExonFeature=, =DoTS::RNAFeature=, and =DoTS::TranslatedAAFeature=. These are related as shown in the ER diagram.

Genes and Exons

Exon features belong directly to a single gene feature. If there are multiple splice forms, then some of the exons may overlap, though there probably should not be any exons with identical boundaries associated with the same gene. Since the exon features have their own individual locations, it is probably best to assign the gene a single location that spans all of the exon locations. If more information is known about promoter features, start site, etc., the gene's location can be expanded to span these feature locations as well.
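Computing a single gene location from its exon locations is a simple min/max over the exon bounds. A minimal sketch, with made-up coordinates and a hypothetical helper name:

```python
def enclosing_span(locations):
    """Return the smallest (start, end) span that covers every given location."""
    if not locations:
        raise ValueError("at least one location is required")
    return (min(start for start, _ in locations),
            max(end for _, end in locations))


# Exons of one gene; two of them overlap because they come from different splice forms.
exon_locations = [(100, 250), (400, 520), (400, 610), (900, 1000)]
gene_span = enclosing_span(exon_locations)

# If a promoter feature is known, the gene span can be widened to include it.
promoter_location = (40, 90)
gene_span_with_promoter = enclosing_span(exon_locations + [promoter_location])

# Exons with identical boundaries on the same gene are probably a loading error.
has_duplicate_exons = len(set(exon_locations)) != len(exon_locations)
```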

RNAs and Exons

Each RNA feature can belong to a single gene. The particular set of exons the RNA is made from is linked to it via the =DoTS::RnaExonFeature= table. The location of the RNAFeature could either be a single span (with details left to the exon locations) or a complex location that mirrors each exon's location. Database normalization would suggest the former approach, though biological semantics might suggest the latter approach.

Although the GusPerlObjectLayer offers the ability to construct an RNA (or protein) sequence on the fly from genomic sequence, it is often wise to store a copy of the RNA in GUS. In this case, one should create another =DoTS::RNAFeature= for this sequence that spans the entire sequence. Both this RNA-based RNA feature and the genomic-based RNA feature can be linked to the same =DoTS::RNA= with the appropriate == type.


The final stage, for now, is the translation from RNA to protein. Since many protein sequences will be (trivial) translations from the RNA, GUS supports this process. The =DoTS::TranslatedAAFeature= is linked to a =DoTS::RNAFeature= to indicate which RNA provided the sequence for the translation. The range of the RNA that was translated is indicated in the =DoTS::TranslatedAAFeature=. The =DoTS::TranslatedAAFeature= also points to a =DoTS::AASequence= and has a =DoTS::AALocation=. Typically the location will span the entire protein sequence.
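To make the "range of the RNA that was translated" concrete, here is a small sketch that translates a sub-range of an RNA using the standard genetic code. It is not how GUS or the GusPerlObjectLayer performs translation; the function name, the 1-based inclusive range, and the choice to stop at the first stop codon are all assumptions for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_range(rna: str, start: int, end: int) -> str:
    """Translate rna[start..end] (1-based inclusive, assumed), stopping at a stop codon."""
    coding = rna[start - 1:end].upper().replace("U", "T")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(coding) - len(coding) % 3, 3):
        aa = CODON_TABLE[coding[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Bases 3..14 hold AUG GCC AAA UAG; the flanking bases are untranslated.
protein = translate_range("GGAUGGCCAAAUAGCC", 3, 14)
```

In this toy case the resulting protein sequence is what the translated feature would point to, with a location spanning positions 1 through its full length.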


The =DoTS::NALocation= table contains a number of columns that are used to store information present in GenBank. These will be described later.

The most important columns are =start_min=, =start_max=, =end_min=, and =end_max=. They indicate the position of the feature on the sequence and make allowances for uncertainty in these positions. If a value is left null, it is considered to be unknown. If, say, =start_min= equals =start_max=, then the position is certain. Otherwise the position could be anywhere in the span between =start_min= and =start_max=.
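The min/max rule can be expressed as a small helper. This is a sketch: the =Location= class and =describe_bound= function are hypothetical, with =None= standing in for a database null.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    start_min: Optional[int]
    start_max: Optional[int]
    end_min: Optional[int]
    end_max: Optional[int]


def describe_bound(lowest: Optional[int], highest: Optional[int]) -> str:
    """Interpret one min/max pair: unknown, certain, or somewhere within a span."""
    if lowest is None or highest is None:
        return "unknown"
    if lowest == highest:
        return f"exactly {lowest}"
    return f"between {lowest} and {highest}"


# The start is certain; the end is only known to fall within a 21-base window.
loc = Location(start_min=100, start_max=100, end_min=480, end_max=500)
start_text = describe_bound(loc.start_min, loc.start_max)
end_text = describe_bound(loc.end_min, loc.end_max)
```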

Positions are always specified on the forward strand, and thus the start should be less than the end. The =is_reversed= attribute is used to indicate the strand.
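One way to read a feature's sequence under this convention is to slice on the forward strand and then reverse-complement when =is_reversed= is set. The sketch below assumes 1-based inclusive coordinates and a plain DNA alphabet; the function name is hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(seq: str, start: int, end: int, is_reversed: bool) -> str:
    """Extract a feature given forward-strand coordinates (1-based inclusive, assumed)."""
    if start > end:
        raise ValueError("start must not exceed end; use is_reversed for the strand")
    forward = seq[start - 1:end]
    if is_reversed:
        # Reverse strand: complement each base, then reverse the order.
        return forward.translate(COMPLEMENT)[::-1]
    return forward


genome = "AACCGGTTAC"
on_forward = feature_sequence(genome, 2, 5, is_reversed=False)
on_reverse = feature_sequence(genome, 2, 5, is_reversed=True)
```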

ER Diagram

An entity-relationship diagram of GUS central-dogma tables is available at