GusSchemaGoOntologies

From GUS Wiki

GUS :: Schema :: Cluster :: GO Ontologies

A cluster of tables holds information related to the GO ontology (http://www.geneontology.org). The GUS tables that store this data fall into three groups. First, the GO ontology itself consists of terms describing the function and cellular location of genes and gene products. The terms are connected in a directed acyclic graph (DAG) by typed edges that indicate relations between the terms. Second, the GO site allows users to download both the GO ontology and associations between GO terms and sequences from a large variety of organisms. In addition, sites may perform their own automated prediction and/or manual curation of GO associations. Third, GUS has tables to store associations between GO terms and protein domains, which can be used to predict GO functions for sequences that contain the domain.

ER Diagram

http://www.gusdb.org/wikiresc/Schema/goOntologies.png

Legend

Solid lines indicate hard foreign key references where the table at the tail of the arrow contains a foreign key to the table at the head of the arrow.

The dashed line represents a soft (table ID/row ID) link.

The (R) icons represent links to =SRes::ReviewStatus=.

The (X) icons represent links to =SRes::ExternalDatabaseRelease=.

The (U) icons represent links to =SRes::UserInfo=.

Groups

The GO ontologies are stored in the =SRes= tables in the top (orange) area.

The associations are stored in the =DoTS::GOAssoc*= tables, the group at the bottom left (green).

Finally, the rules associating protein domain motifs with GO terms are stored in the group of tables at the bottom right (purple).

Storing GO Ontologies

GO terms are put in the =SRes::GoTerm= table. Relations between GO terms are stored in =SRes::GoRelationship=, which links terms in parent-child relations. Since the terms are in a DAG, a term may have one or more parents and zero or more children. The edges are typed. The types are essentially a controlled vocabulary containing types such as _part-of_ or _contained-in_. A short list of types is stored in =SRes::GoRelationshipType=. Sometimes a term may have synonyms, i.e., other terms that mean the same thing. Synonyms are stored in =SRes::GoSynonym=. Finally, when associations are made between GO terms and other entities, such as an AA sequence, the evidence for the association is indicated with a small controlled vocabulary of evidence codes stored in =SRes::GoEvidenceCode=.
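The parent-child structure described above can be sketched in a few lines of Python. This is only an in-memory illustration, not the GUS schema itself: the class names, the field names (=term_id=, =parent_id=, =child_id=, =rel_type=) and the toy terms are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoTerm:
    term_id: int   # hypothetical primary key
    name: str      # toy names, not real GO terms


@dataclass(frozen=True)
class GoRelationship:
    parent_id: int  # hypothetical column names
    child_id: int
    rel_type: str   # e.g. "is-a" or "part-of"


def index_edges(relationships):
    """Build parent and child lookup tables from the typed edges."""
    parents = defaultdict(list)
    children = defaultdict(list)
    for rel in relationships:
        parents[rel.child_id].append((rel.parent_id, rel.rel_type))
        children[rel.parent_id].append((rel.child_id, rel.rel_type))
    return parents, children


# Toy DAG: term 4 has two parents, which a tree could not express.
terms = [GoTerm(1, "root"), GoTerm(2, "A"), GoTerm(3, "B"), GoTerm(4, "C")]
relationships = [
    GoRelationship(parent_id=1, child_id=2, rel_type="is-a"),
    GoRelationship(parent_id=1, child_id=3, rel_type="is-a"),
    GoRelationship(parent_id=2, child_id=4, rel_type="is-a"),
    GoRelationship(parent_id=3, child_id=4, rel_type="part-of"),
]
parents, children = index_edges(relationships)
```

Looking up =parents[4]= yields both parent edges together with their types, which is the information a row in the relationship table would carry.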

Since the GO ontology is being constantly revised, the =SRes::GoTerm= and =SRes::GoSynonym= tables contain references to =SRes::ExternalDatabaseRelease=, which can be used to distinguish different releases of the GO ontology. There is no external database release attribute for =SRes::GoRelationship=, so there should not be relations that connect terms from different releases; hence, the release of an edge is determined (uniquely) from the terms it connects. Similarly, there is no external database release attribute for relationship types or evidence codes. These are stored in an accumulating fashion.
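The rule that an edge inherits its release from the terms it connects can be expressed as a small consistency check. The following is a minimal sketch, assuming a plain dictionary from term id to release id; the function name and the ids are hypothetical and do not correspond to any GUS plugin.

```python
def edge_release(term_release, parent_id, child_id):
    """Return the release shared by both endpoints of an edge.

    term_release maps a term id to its external database release id.
    Raises ValueError if the edge would connect two different releases.
    """
    parent_rel = term_release[parent_id]
    child_rel = term_release[child_id]
    if parent_rel != child_rel:
        raise ValueError(
            f"edge {parent_id}->{child_id} spans releases "
            f"{parent_rel} and {child_rel}"
        )
    return parent_rel


# Hypothetical data: terms 1 and 2 belong to release 10, term 3 to release 11.
releases = {1: 10, 2: 10, 3: 11}
release_of_edge = edge_release(releases, 1, 2)
```

A loader could run such a check on every relationship before inserting it, so that a mixed-release edge is rejected rather than stored.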

Annotation of DoTS Entities

The purpose of the GO ontology is to regularize the annotation of biological function, not just to organize terms for organization's sake. Since one may want to annotate a variety of DoTS object types, we use a table (=DoTS::GoAssociation=) that has a soft link to a row in a DoTS table (or, in fact, a table in any schema). This allows GUS to annotate AA sequences, proteins, genes, etc., with GO terms. The exact semantics of the annotation depend on the object being annotated.
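A soft link of this kind can be pictured as a (table ID, row ID) pair that is resolved at lookup time instead of being enforced by a foreign key. The sketch below is an assumption-laden illustration: the registry contents, the numeric ids, the =DoTS::Protein= entry and the field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoAssociation:
    table_id: int    # which table the annotated row lives in (soft link)
    row_id: int      # primary key of the annotated row in that table
    go_term_id: int  # the GO term being attached


# Hypothetical registry of table ids and a stand-in for the stored rows.
TABLE_NAMES = {1: "DoTS::ExternalNaSequence", 2: "DoTS::Protein"}
ROWS = {
    "DoTS::ExternalNaSequence": {101: {"label": "sequence 101"}},
    "DoTS::Protein": {7: {"label": "protein 7"}},
}


def resolve(assoc, table_names, rows):
    """Follow the soft link: look up the table name, then the row."""
    table = table_names[assoc.table_id]
    return table, rows[table][assoc.row_id]


assoc = GoAssociation(table_id=2, row_id=7, go_term_id=4)
table, row = resolve(assoc, TABLE_NAMES, ROWS)
```

Because the target table is chosen by data rather than by the schema, one association table can annotate rows of any type, at the cost of the database not checking that the referenced row exists.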

Annotation can be a complicated process. Data may be loaded from external sources. CBIL and others have methods for automatic prediction of GO terms. In addition, many sites also have manual curation of annotations. How can these multiple sources be reconciled? GUS provides support for this process in the =DoTS::GoAssocInst*= tables. Entries in the =DoTS::GoAssociation= table represent the current conclusion based on multiple lines of evidence. These lines of evidence are stored in =DoTS::GoAssocInst=. =DoTS::GoAssociationInstance= indicates ...


Attributes

is_defining

The =DoTS::GoAssociation= and =DoTS::GoAssociationInstance= tables have an =is_defining= attribute. Recall that the terms are arranged in a DAG. The DAG has a root node which conveys no information beyond what type of terms are below, e.g., molecular function, biological process, or cellular component. As one travels to nodes located deeper in the DAG the terms become more specific, e.g., a more specific description of an enzyme's function. Often an association cannot be made at the very deepest level, but is made at, say, two levels down. To speed queries we also store associations between the target =DoTS= entry and all terms that are ancestors of the specifically associated term. We mark the defining term, i.e., the specifically associated one, by setting the =is_defining= bit in that association.
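The expansion to ancestor terms can be sketched as a walk up the parent edges. This is an illustrative sketch only, assuming a dictionary from term id to parent ids; the function names, the row layout and the toy DAG are hypothetical, and the actual loading code may work differently.

```python
def ancestors(term_id, parents):
    """Collect every ancestor of term_id by walking parent edges upward."""
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_association(target_row, term_id, parents):
    """Return one defining row plus one non-defining row per ancestor."""
    rows = [{"row_id": target_row, "go_term_id": term_id, "is_defining": 1}]
    for anc in sorted(ancestors(term_id, parents)):
        rows.append({"row_id": target_row, "go_term_id": anc, "is_defining": 0})
    return rows


# Toy DAG: term 4 has parents 2 and 3, which both descend from root 1.
toy_parents = {4: [2, 3], 2: [1], 3: [1]}
expanded = expand_association(101, 4, toy_parents)
```

With the ancestor rows stored, a query for everything annotated under a general term becomes a single lookup on that term, and filtering on =is_defining= recovers only the specific assignments.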

is_not

Occasionally a quick look at a sequence would suggest that the encoded protein should have a particular association. However, upon closer examination it may turn out that the association is not correct. To help mark this situation, the =is_not= bit should be set.

Use Cases

Basic Loading of GO Ontology and Associations

A new release of the GO ontology can be loaded into the =SRes= tables using the plugin =???=. This plugin does (not?) create a new =SRes::ExternalDatabaseRelease= row for the loaded release.

To store the associations, the annotated sequences would first be loaded into the appropriate =DoTS::= table, e.g., =DoTS::ExternalNaSequence=. Then the associations are loaded into the =DoTS::GOAssoc*= tables. For each association.