DataProvenance

From GUS Wiki
Jump to: navigation, search

Overview

GUS provides extensive support for Data Provenance (tracking) in multiple domains through the use of the ObjectLayer. These domains include row-level permissions, supporting evidence, versioning deleted data, data source tracking, and tracking change metadata, including modification date and change methodology (algorithm). Collectively, these domains allow a GUS maintainer to track what data has been modified, who has modified data, how the data was modified, when the data was modified, and (to some extent) why the data was modified, in addition to allowing the maintainer to restore the previous data and to limit modifications.

When using Oracle as the RDBMS system for GUS, additional Data Provenance functionality is provided outside of GUS for more demanding instances. These include Virtual Private Databases and Workspace Manager.

Permissions

GUS uses UNIX-style row-level permissions for all tables. Every row is assigned a user (ROWUSER_ID) and group (ROW_GROUP_ID) at creation (usually using the values supplied in the GusPropertiesFile). The rows additionally contain read and write switches for the user, group, and other (all). The combination of the permission switches and the assigned user/group allows individual users and individual groups to be granted read and/or write privileges on the individual row, as well as assigning read and/or write privileges to all other users (who are not in the ROW_GROUP_ID and are not the ROW_USERID).

Evidence

GUS supports the use of evidence (support for a specific entity in the database) through the use of the DoTS::Evidence table, which provides a bridge between a target (or entity) table, and the fact (evidence) table. Evidence may be as simple as a plain text comment, or may require deeper biological knowledge. For example, at CBIL, EST and mRNA assemblies containing a ~RefSeq use the ~RefSeq as evidence to support that the assembly is full length.

Versioning

GUS implements simple versioning of deleted data, providing protection against accidental modification, and allowing for queries against historical data. GUS versioning should never take the place of a well-designed, executed, and tested disaster recovery plan which includes full and frequent backups. (When running GUS on Oracle, using the ARCHIVELOG mode is highly recommended).

GUS versioning is implemented by mirroring all tables with a corresponding "version" table, denoted by the "VER" suffix and the presence in a "VER" schema. For example, DoTS::~NASequenceImp is versioned in the ~DoTSVER::~NASequenceImpVER table. The version tables contain all rows of the original table, plus three rows to track the date of versioning, the method (algorithm) of versioning, and version transaction id.

GUS version tables generally have fewer constraints and indexes than their corresponding original table as a performance optimization.

Data Source Tracking

Where applicable, GUS provides support for tracking the source of row-level data. For example, many tables contain an EXTERNALDATABASE_RELEASEID column, which serves as a reference to SRes::~ExternalDatabaseRelease table. This table and it's parent, ~SRes::~ExternalDatabase, provide the tracking of external databases, and (where applicable) individual releases of external databases.

Algorithm

The GUS system refers to the method of data manipulation as an Algorithm, and contains several tables to support the tracking of Algorithms, their runtime parameters, and other details. Every GUS table contains a ROWALG_INVOCATION_ID column, which is a reference to a specific invocation of a specific algorithm. When the algorithm is invoked, an entry is made in the Core::AlgorithmInvocation table, and an entry is made in the Core::~AlgorithmParam table for each runtime parameter value specified to the algorithm (the parameter keys are stored in the Core::~AlgorithmParamKey table). The algorithm then executes, setting the ROW_ALG_INVOCATIONID on rows which it effects. Finally, the Core::~AlgorithmInvocation is updated with execution and result status from the algorithm.

The Core::Algorithm table stores details about the algorithms which may be invoked, including the algorithm checksum and version number from CVS. A change in the algorithm, however minor, requires reregistering the algorithm with GUS to create a new Core::~Algorithm entry.

GUS Algorithm System [1]


Ontologies

GUS uses Ontologies and Controlled Vocabularies extensively, and this deserves its own WikiPage: ControlledVocabularies or GusOntologies