GUS :: Schema :: Cluster :: GO Ontologies

A cluster of tables holds information related to the GO ontology (http://www.geneontology.org). The GUS tables that store this data fall into three groups. First, the GO ontology consists of terms describing the function and cellular location of genes and gene products. The terms are connected in a directed acyclic graph (DAG) by typed edges that indicate relations between the terms. Second, the GO site allows the user to download both the GO ontology and associations between GO terms and sequences from a large variety of organisms. In addition, sites may perform their own automated prediction and/or manual curation of GO associations. Third, GUS has tables to store associations between GO terms and protein domains, which can be used to predict GO functions for sequences that contain the domain.

ER Diagram

goOntologies.png

Legend

Solid lines indicate hard foreign key references where the table at the tail of the arrow contains a foreign key to the table at the head of the arrow.

The dashed line represents a soft (table ID/row ID) link.

The (R) icons represent links to SRes::ReviewStatus.

The (X) icons represent links to SRes::ExternalDatabaseRelease.

The (U) icons represent links to SRes::UserInfo.

Groups

The GO ontologies are stored in the SRes tables in the top (orange) area.

The associations are stored in the DoTS::GOAssoc* tables at the bottom left (green).

Finally, the rules associating protein domain motifs with GO terms are stored in the group of tables at the bottom right (purple).

Storing GO Ontologies

GO terms are put in the SRes::GoTerm table. Relations between GO terms are stored in SRes::GoRelationship, which links terms in parent-child relations. Since the terms are in a DAG, a term may have one or more parents and also zero or more children. The edges are typed; the types form a small controlled vocabulary containing values such as part-of or contained-in. This short list of types is stored in SRes::GoRelationshipType. Sometimes a term may have synonyms, i.e., other terms that mean the same thing. Synonyms are stored in SRes::GoSynonym. Finally, when associations are made between GO terms and other entities, such as an AA sequence, the evidence for the association is indicated with a small controlled vocabulary of evidence codes stored in SRes::GoEvidenceCode.
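
As a rough illustration of the storage described above, the following Python sketch builds a tiny in-memory SQLite database and looks up the typed parents of a term. The table names, the column names (go_term_id, parent_term_id, child_term_id, go_relationship_type_id) and the demo terms are assumptions made for this example; they are simplified stand-ins, not the actual GUS column definitions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE go_term (
            go_term_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE go_relationship_type (
            go_relationship_type_id INTEGER PRIMARY KEY,
            name                    TEXT NOT NULL
        );
        CREATE TABLE go_relationship (
            parent_term_id          INTEGER NOT NULL REFERENCES go_term (go_term_id),
            child_term_id           INTEGER NOT NULL REFERENCES go_term (go_term_id),
            go_relationship_type_id INTEGER NOT NULL
                REFERENCES go_relationship_type (go_relationship_type_id)
        );
        """
    )
    # Made-up terms; real GO identifiers are deliberately not used here.
    conn.executemany(
        "INSERT INTO go_term VALUES (?, ?)",
        [(1, "root"), (2, "term A"), (3, "term B"), (4, "term C")],
    )
    conn.executemany(
        "INSERT INTO go_relationship_type VALUES (?, ?)",
        [(1, "part-of"), (2, "contained-in")],
    )
    # "term C" has two parents, which is what makes this a DAG and not a tree.
    conn.executemany(
        "INSERT INTO go_relationship VALUES (?, ?, ?)",
        [(1, 2, 1), (1, 3, 1), (2, 4, 1), (3, 4, 2)],
    )
    return conn


def parents_of(conn, go_term_id):
    """Return (parent name, relationship type name) pairs for one term."""
    return conn.execute(
        """
        SELECT p.name, t.name
        FROM go_relationship r
        JOIN go_term p ON p.go_term_id = r.parent_term_id
        JOIN go_relationship_type t
          ON t.go_relationship_type_id = r.go_relationship_type_id
        WHERE r.child_term_id = ?
        ORDER BY p.name
        """,
        (go_term_id,),
    ).fetchall()
```

Calling parents_of(build_demo_db(), 4) returns both parents of "term C" together with the type of each edge; the root term has no rows as a child and so yields an empty list.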

Since the GO ontology is being constantly revised, the SRes::GoTerm and SRes::GoSynonym tables contain references to SRes::ExternalDatabaseRelease, which can be used to distinguish different releases of the GO ontology. Since there is no external database release attribute for SRes::GoRelationship, there should not be relations that connect terms from different releases and, hence, the release of an edge is determined (uniquely) from the terms it connects. Similarly, there is no external database release attribute for relationship types or evidence codes. These are stored in an accumulating fashion.
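
The rule that an edge inherits its release from the terms it connects can be sketched as a small check. This is only an illustration: the function name edge_release and the dictionary mapping term ids to release ids are hypothetical, standing in for a lookup of the release reference on SRes::GoTerm.

```python
def edge_release(term_release, parent_term_id, child_term_id):
    """Derive the release of a relationship from the two terms it connects.

    term_release maps a term id to its external database release id
    (a hypothetical in-memory stand-in for the reference held on each term).
    """
    parent_release = term_release[parent_term_id]
    child_release = term_release[child_term_id]
    if parent_release != child_release:
        # The schema assumes this never happens; flag it if it does.
        raise ValueError(
            f"edge {parent_term_id}->{child_term_id} spans releases "
            f"{parent_release} and {child_release}"
        )
    return parent_release
```

A check like this could be run over all relationships after a load to confirm that no edge connects terms from two different releases.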

Annotation of DoTS Entities

The purpose of the GO ontology is to regularize the annotation of biological function, not just to organize terms for organization's sake. Since one may want to annotate a variety of DoTS object types, we use a table (DoTS::GoAssociation) that has a soft link to the DoTS table (or, in fact, to tables in any schema). This allows GUS to annotate AA sequences, proteins, genes, etc., with GO terms. The exact semantics of the annotation depend on the object being annotated.
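
The soft (table ID/row ID) link can be pictured as a two-step lookup: first resolve the table id to a table name, then fetch the row from that table. The sketch below uses plain dictionaries in place of database tables. The field names, the table ids, the row contents and the DoTS::Gene entry are assumptions for illustration only.

```python
from typing import NamedTuple


class GoAssociation(NamedTuple):
    """Hypothetical, simplified view of one association row."""
    table_id: int    # which table the annotated object lives in
    row_id: int      # primary key of the annotated object in that table
    go_term_id: int  # the GO term applied to the object


# Hypothetical table registry and row stores standing in for the database.
TABLE_INFO = {1: "DoTS::ExternalNaSequence", 2: "DoTS::Gene"}
ROWS = {
    "DoTS::ExternalNaSequence": {101: "demo NA sequence"},
    "DoTS::Gene": {7: "demo gene"},
}


def resolve_target(assoc, table_info, rows):
    """Follow the soft link and return (table name, annotated object)."""
    table_name = table_info[assoc.table_id]
    return table_name, rows[table_name][assoc.row_id]
```

Because the link is not a hard foreign key, nothing in the database itself guarantees that the referenced row exists; in this sketch a dangling link would surface as a KeyError.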

Annotation can be a complicated process. Data may be loaded from external sources. CBIL and others have methods for automatic prediction of GO terms. In addition, many sites also perform manual curation of annotations. How can these multiple sources be reconciled? GUS provides support for this process in the DoTS::GoAssocInst* tables. Entries in the DoTS::GoAssociation table represent the current conclusion based on multiple lines of evidence. These lines of evidence are stored in DoTS::GoAssocInst. DoTS::GoAssociationInstance indicates ...

Attributes

is_defining

The DoTS::GoAssociation and DoTS::GoAssociationInstance tables have an is_defining attribute. Recall that the terms are arranged in a DAG. The DAG has a root node which conveys no information beyond what type of terms are below it, e.g., molecular function, biological process, or cellular component. As one travels to nodes located deeper in the DAG the terms become more specific, e.g., a more specific description of an enzyme's function. Often an association cannot be made at the very deepest level, but is made at, say, two levels down. To speed queries we also store associations between the target DoTS entry and all terms that are ancestors of the specifically associated term. We mark the defining term by setting the is_defining bit in that association.
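
The expansion to ancestor terms described above can be sketched as follows. This is a minimal illustration that keeps the DAG as an in-memory dictionary from each term to its parents; the function names and the row layout (target, term, is_defining) are hypothetical and do not reflect how any GUS plugin actually performs the load.

```python
def ancestors(parents, term):
    """Collect every ancestor of a term in a DAG given a child -> parents map."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term reachable by two paths is kept once
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_association(parents, target, term):
    """Return one row for the defining term plus one row per ancestor."""
    rows = [{"target": target, "term": term, "is_defining": 1}]
    for ancestor in sorted(ancestors(parents, term)):
        rows.append({"target": target, "term": ancestor, "is_defining": 0})
    return rows
```

With rows stored this way, a query for everything annotated at or below a general term becomes a simple lookup on that term, while filtering on is_defining recovers only the associations that were made directly.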

is_not

Occasionally a quick look at a sequence would suggest that the encoded protein should have a particular association. However, upon closer examination it may turn out that the association is not correct. To help mark this situation, the is_not bit should be set.

Use Cases

Basic Loading of GO Ontology and Associations

A new release of the GO ontology can be loaded into the SRes tables using the plugin ???. This plugin does (not?) create a new SRes::ExternalDatabaseRelease row for the loaded release.

To store the associations, the annotated sequences would first be loaded into the appropriate DoTS:: table, e.g., DoTS::ExternalNaSequence. Then the associations are loaded into the DoTS::GOAssoc* tables. For each association.