GUS 4.0 schema changes
Experience accumulated in using GUS, together with the advent of new technologies such as high-throughput sequencing (HTS) and the development of improved data exchange formats such as MAGE-TAB (2), has shown that components of the schema need refactoring and has also inspired ideas on how to simplify some of its parts considerably. GUS 3.6 is composed of the Core, DoTS, Prot, RAD, SRes, Study, and TESS (sub)schemas. GUS 4.0 is composed of Core, DoTS, Model, Platform, Results, SRes, and Study. As described below, the information and data stored in Prot, RAD, and TESS are reorganized and captured in Model, Platform, Results, and Study.
Changes by schema
One of the schemas that underwent extensive refactoring is Study. The original schema was aimed at capturing only the “wet bench” part of a study and had been designed based on the MAGE-OM model, which was more complicated than necessary in terms of linking inputs and outputs. In reality, the whole flow of a study from bench to in silico is typically just a series of protocol applications whose inputs and outputs are either biomaterials or datasets (e.g. from quantifications or analyses). In the old schema, assays, acquisitions, quantifications, and analyses for different technologies were kept in separate schemas (RAD for microarray, TESS for HTS, Prot for mass spec), and tables with the same names were repeated across these schemas. This duplication is unnecessary. It is much simpler to think of a study as a graph whose nodes are biomaterials or datasets, connected by directed edges according to the protocol applications of which they are inputs or outputs. Thus, in the new Study schema, Protocol Application Nodes are stored in one table with a type attribute that distinguishes biomaterials from datasets. These nodes are the inputs and/or outputs of Protocol Applications, where a protocol application is a process in which a series of protocols has been applied. With this schema design, importing from and exporting to the MAGE-TAB format should become much easier.
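The graph view of a study described above can be illustrated with a minimal Python sketch. It is not the GUS object layer: the class names, field names, and sample values are assumptions chosen for illustration, and the real tables carry more attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProtocolAppNode:
    # Hypothetical stand-in for a row in the Protocol Application Node table.
    node_id: int
    name: str
    node_type: str  # "biomaterial" or "dataset"


@dataclass
class ProtocolApp:
    # Hypothetical stand-in for a protocol application: a series of
    # protocols applied to some input nodes, producing some output nodes.
    protocol_app_id: int
    protocols: List[str]
    inputs: List[ProtocolAppNode] = field(default_factory=list)
    outputs: List[ProtocolAppNode] = field(default_factory=list)


def edges(apps):
    """Yield one directed edge (input name, output name, protocol_app_id)
    for every input/output pair of every protocol application."""
    for app in apps:
        for src in app.inputs:
            for dst in app.outputs:
                yield (src.name, dst.name, app.protocol_app_id)


# Illustrative study: bench steps and in silico steps use the same structure.
sample = ProtocolAppNode(1, "tissue sample", "biomaterial")
extract = ProtocolAppNode(2, "RNA extract", "biomaterial")
raw = ProtocolAppNode(3, "raw reads", "dataset")
expr = ProtocolAppNode(4, "expression values", "dataset")

apps = [
    ProtocolApp(10, ["RNA extraction"], [sample], [extract]),
    ProtocolApp(11, ["library preparation", "sequencing"], [extract], [raw]),
    ProtocolApp(12, ["alignment", "quantification"], [raw], [expr]),
]

study_edges = list(edges(apps))
for edge in study_edges:
    print(edge)
```

Because biomaterials and datasets share one node type, a single traversal covers the whole flow from bench to analysis, which is the simplification the new Study schema aims for.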
The Study schema also contains information about the study, including experimental design and factors of interest, as well as publications linked to the study. Note that the Study table now has a FK to itself, to allow the possibility of Investigations, which might comprise more than one study.
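As a rough illustration of the self-referencing key, the sketch below models an investigation as a study that other studies point to. The attribute name `investigation_id` and the sample rows are assumptions made for this example, not the actual column name or data in Study.Study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Study:
    study_id: int
    name: str
    # Assumed name for the self-referencing foreign key: it holds the
    # study_id of the investigation this study belongs to, if any.
    investigation_id: Optional[int] = None


studies = [
    Study(1, "Example investigation"),
    Study(2, "Microarray time course", investigation_id=1),
    Study(3, "RNA-Seq follow-up", investigation_id=1),
    Study(4, "Standalone study"),
]


def member_studies(all_studies, investigation_id):
    """Return the names of the studies grouped under one investigation."""
    return [s.name for s in all_studies if s.investigation_id == investigation_id]


print(member_studies(studies, 1))
```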
The way results for quantifications or analyses were stored in the older schema became quite convoluted. For example, RAD had primarily been developed for RNA Abundance Data with mostly microarrays in mind (but generalized to handle SAGE). However, technologies such as microarrays have been used to explore other types of data, such as TF binding, and additional technologies have come along that yield RNA abundance data (such as RNA-Seq). As a consequence, RAD has lost its semantic clarity, and other schemas (such as TESS or application-specific schemas) have also been used to store some of these data (e.g. HTS data). The new schema aims to avoid conflating technology aspects with result-type aspects. Independently of the technology, the results of a functional genomics experiment typically either associate values (e.g. expression, differential expression) with entities (such as genes, features, etc.) or yield a list of genomic locations, possibly with values attached. This is what inspired the Results schema, which stores the results of those Protocol Application Nodes that represent dataset outputs. Entries in the Results schema refer to a particular Protocol Application, which in turn is linked to the detailed information about the protocols and parameter settings used to produce those results.
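The two result shapes described above (values attached to entities, and genomic locations with optional values) can be sketched as follows. The record types, field names, and sample values are hypothetical and do not reproduce actual Results tables. The linking key is assumed here to be a `protocol_app_node_id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityResult:
    # A value associated with an entity (gene, feature, ...).
    protocol_app_node_id: int  # assumed link to the dataset node
    entity_id: str
    value_type: str            # e.g. "expression"
    value: float


@dataclass
class LocationResult:
    # A genomic location, optionally with a value attached.
    protocol_app_node_id: int  # assumed link to the dataset node
    sequence_id: str
    start: int
    end: int
    value: Optional[float] = None


results = [
    EntityResult(4, "geneA", "expression", 12.5),  # e.g. from a microarray dataset
    EntityResult(7, "geneA", "expression", 11.9),  # e.g. from an RNA-Seq dataset
    LocationResult(9, "chr1", 1000, 1200, 35.0),   # e.g. a scored binding region
    LocationResult(9, "chr2", 500, 650),           # a location with no value
]


def expression_for(entity_id, rows):
    """Collect expression values for one entity, regardless of technology."""
    return [
        (r.protocol_app_node_id, r.value)
        for r in rows
        if isinstance(r, EntityResult)
        and r.entity_id == entity_id
        and r.value_type == "expression"
    ]


print(expression_for("geneA", results))
```

The query never mentions the technology. That information stays reachable through the dataset node and its protocol application rather than being encoded in the result table itself.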
Note also that the tables holding quantification or analysis results in the RAD schema had many different views to accommodate results from different algorithms. In our experience, information at this level of detail was rarely queried. The relevant information is the type of result that a value represents (e.g. expression value). URIs can be associated with Protocol Application Nodes representing dataset outputs, thereby pointing to a file where additional detailed information is available.
The RAD schema represented platform information, more specifically microarray information, at a level of detail that is not typically queried. For example, coordinates for microarrays required 6 values for each reporter, allowing 3 levels of grouping. This was done with the possibility in mind that raw data might be normalized (e.g. that spatial biases on the array might be examined) after being loaded into the database. However, in our experience, this was rarely done. Indeed, raw data are often not loaded into the database but archived in the file system (with a pointer to them in the database), and only processed data are loaded and queried. Forcing 6 coordinates also reduced flexibility and required remapping of array information. All of this has been eliminated in the new Platform schema, which has also been designed in a more generic fashion to semantically accommodate not just microarrays, but also SAGE, PCR arrays, etc.
The Model schema currently contains only tables that can be used to represent biological networks (including pathways) and was developed based on the Beta Cell Genomics experience with these types of entities. Eventually, other types of models, such as computational ones (HMMs, linear regression, etc.) previously stored in TESS (which has been eliminated from GUS 4.0), should be placed in this schema.
The main change to the SRes schema is that most of the previously existing tables for specific types of controlled vocabularies and ontologies, such as Anatomy, DevelopmentalStage, GOTerm, etc., are now encompassed within OntologyTerm and its associated tables. Taxon (based on the NCBI Taxonomy tables) has been kept as is in view of its usage in projects such as EuPathDB and some unique properties (e.g., alternate codon usage annotation). However, certain commonly used taxa can also be stored in OntologyTerm in the form of another resource (Organism in OBI) or as a select subset of NCBI Taxon so that these terms can be used as a characteristic for a ProtocolAppNode. The category attribute in OntologyTerm can help distinguish the parts of the schema to which a particular ontology is relevant (e.g. ProtocolAppNodeType, StudyFactorType, OrganismPart, Anatomy, etc.).
Core and DoTS
We have left these largely unchanged. The changes in DoTS are:
1. Modifications to foreign keys pointing to SRes tables. Many SRes tables in GUS 3.6 have been removed because their contents can be placed in the common Ontology tables. The affected foreign keys have therefore been repointed to OntologyTerm in GUS 4.0.
2. Enable named attributes in subclass views. Previously, the policy (and code) required that all named attributes be in the superclass even if only needed by one view. This policy has led to some conflicts in the object code when referring to the same attribute in another table more than once. In particular this was an issue with pfam_entry_id, motif_aa_sequence_id, and repeat_type_id in.
Application specific tables
These are the tables that refer to a particular project utilizing the GUS schema typically in a web-based application. They are not part of the generic GUS release, but they can be defined with the required attributes for GUS tables and Core.TableInfo populated, so that objects can be generated and plugins can be utilized to operate on these tables. Their definition is the responsibility of the developers for their particular project application.
The Core.DatabaseDocumentation table can be used to store documentation about tables and their attributes. An XML file containing the information to populate this table with the plugin LoadGusFromXml should be prepared and released with GUS. This table could serve as infrastructure for a Schema Browser application.
Generalities about GUS which are retained
Some typical GUS features are retained in the new schemas (aside from the central dogma of DoTS and the whole infrastructure of the object layer and plugins in Core), such as soft keys and views, although the new schemas use few views. The GUS schema is also meant to be vendor-neutral, so that it can be implemented in, for example, Oracle and PostgreSQL.
Soft keys are primarily used when one wants a table to point to entries that may be found in more than one GUS table. In this case, it is not possible to use a foreign key, and the pair (table_id, row_id) is used instead. Here table_id identifies the target table (from Core.TableInfo) and row_id is the primary key, in that table, of the row being pointed to. This is a soft key because it is not enforced by a database constraint.
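A minimal sketch of how a soft key could be resolved is given below. It uses in-memory dictionaries as stand-ins for Core.TableInfo and the target tables. The table names, ids, and row contents are illustrative assumptions, and the function `resolve_soft_key` is hypothetical rather than part of the GUS object layer.

```python
# Stand-in for Core.TableInfo: table_id -> table name (illustrative values).
TABLE_INFO = {
    1: "DoTS.NASequence",
    2: "DoTS.AASequence",
}

# Stand-ins for the target tables: primary key -> row (illustrative values).
TABLES = {
    "DoTS.NASequence": {101: {"description": "example nucleic acid sequence"}},
    "DoTS.AASequence": {101: {"description": "example protein sequence"}},
}


def resolve_soft_key(table_id, row_id):
    """Follow a (table_id, row_id) soft key to the row it points to.

    No database constraint protects a soft key, so the lookup must
    check for dangling references itself.
    """
    table_name = TABLE_INFO.get(table_id)
    if table_name is None:
        raise LookupError(f"unknown table_id {table_id}")
    row = TABLES[table_name].get(row_id)
    if row is None:
        raise LookupError(f"no row {row_id} in {table_name}")
    return table_name, row


# The same row_id resolves to different rows depending on table_id.
print(resolve_soft_key(1, 101))
print(resolve_soft_key(2, 101))
```

The explicit checks in the function reflect the trade-off of soft keys: flexibility in what can be referenced, at the cost of integrity checks that the database would otherwise perform.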
A related concept is the pair (external_database_release_id, source_id), which can be used to reference an entry (e.g., a GO term) in a specified version of an external resource or database (e.g., a downloaded GO file).
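The role of this pair can be sketched with a small lookup keyed on both values. The release ids, the source_id, and the stored labels below are placeholders invented for illustration, and `lookup_entry` is a hypothetical helper.

```python
# Entries keyed by (external_database_release_id, source_id).
# All ids and labels are placeholders, not real database contents.
ENTRIES = {
    (1, "TERM:0001"): {"label": "label as of release 1"},
    (2, "TERM:0001"): {"label": "label as of release 2"},
}


def lookup_entry(external_database_release_id, source_id):
    """Return the entry for a source_id in one specific release, or None."""
    return ENTRIES.get((external_database_release_id, source_id))


# The same source_id can resolve differently in different releases,
# which is why the release id is part of the reference.
print(lookup_entry(1, "TERM:0001"))
print(lookup_entry(2, "TERM:0001"))
```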
The GUS object layer and Plug-ins
Perl and Java object layers are generated from the schemas and are used for loading and querying data in the GUS application (ga) framework. The Perl plugins are of two types: supported and community. Supported plugins are the basic plugins that are needed and are meant to be maintained by the GUS developers, whereas community plugins are generally developed for specific projects and are not necessarily kept up to date. With the schema changes, all plugins will need to be tested. Those relying on tables or schemas that have been altered or removed will clearly be affected (see Appendix 1 for the list of deprecated schemas and attributes).
1. Davidson, S.B., Crabtree, J., Brunk, B.P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal: 40(2), p. 512-531.
2. Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E, Irizarry RA, Liu J, Maier DS, Miller M, Petersen K, Quackenbush J, Sherlock G, Stoeckert CJ Jr, White J, Whetzel PL, Wymore F, Parkinson H, Sarkans U, Ball CA, Brazma A. (2006) A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics: 7(1), 489.
Appendix: GUS 3.6 tables dropped from GUS 4.0
the entire PROT, RAD, and TESS schemas
these 34 tables (34 in SRes, 6 in Study): SRes.Abstract SRes.Anatomy
certain columns of these 7 tables:
Study.Study (the columns BIBLIOGRAPHIC_REFERENCE_ID and CONTACT_ID)