Various tables in GUS are used as dictionaries or restricted vocabularies. Each center can have its own vocabulary, but setting these up from scratch can be daunting. Below are the important tables to populate, along with example XML; the XML should also be available in CVS in the near future.

SRes::ExternalDatabase and SRes::ExternalDatabaseRelease

These tables hold every database that your data may reference, along with the version or release of that database at the time the reference was made. For example, Pfam is currently at release 11, but releases 9 and 10 might also be in your GUS database. The actual link in the database would be to ExternalDatabaseRelease.

If you haven't already, register the plugin to load XML data into GUS;

ga +create GUS::Supported::Plugin::LoadGusXml --commit

Cut and paste the XML from the 2 pages below into files called ExternalDatabase.xml and ExternalDatabaseRelease.xml.

SRes::ExternalDatabaseXML
SRes::ExternalDatabaseReleaseXML

Then run;

ga GUS::Supported::Plugin::LoadGusXml --commit --filename=ExternalDatabase.xml
ga GUS::Supported::Plugin::LoadGusXml --commit --filename=ExternalDatabaseRelease.xml

DoTS::SequenceType

Save the XML below to a file called 'DoTS.SequenceType.xml' and run this command;

ga GUS::Supported::Plugin::LoadGusXml --commit --filename=DoTS.SequenceType.xml >& DoTS.SequenceType.log
<DoTS::SequenceType>
  <sequence_type_id>1</sequence_type_id>
  <nucleotide_type>DNA</nucleotide_type>
  <hierarchy>1</hierarchy>
  <parent_sequence_type_id>1</parent_sequence_type_id>
  <name>DNA</name>
  <description>DNA, unknown strandedness</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>2</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <hierarchy>1</hierarchy>
  <parent_sequence_type_id>2</parent_sequence_type_id>
  <name>RNA</name>
  <description>RNA, unknown strandedness</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>3</sequence_type_id>
  <nucleotide_type>DNA</nucleotide_type>
  <strand>ds</strand>
  <hierarchy>2</hierarchy>
  <parent_sequence_type_id>1</parent_sequence_type_id>
  <name>ds-DNA</name>
  <description>double stranded DNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>4</sequence_type_id>
  <nucleotide_type>DNA</nucleotide_type>
  <strand>ss</strand>
  <hierarchy>2</hierarchy>
  <parent_sequence_type_id>1</parent_sequence_type_id>
  <name>ss-DNA</name>
  <description>single stranded DNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>5</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <strand>ss</strand>
  <hierarchy>2</hierarchy>
  <parent_sequence_type_id>2</parent_sequence_type_id>
  <name>ss-RNA</name>
  <description>single stranded RNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>6</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <strand>ds</strand>
  <hierarchy>2</hierarchy>
  <parent_sequence_type_id>2</parent_sequence_type_id>
  <name>ds-RNA</name>
  <description>double stranded RNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>7</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <sub_type>mRNA</sub_type>
  <strand>ss</strand>
  <hierarchy>3</hierarchy>
  <parent_sequence_type_id>5</parent_sequence_type_id>
  <name>mRNA</name>
  <description>mRNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>8</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <sub_type>EST</sub_type>
  <strand>ss</strand>
  <hierarchy>3</hierarchy>
  <parent_sequence_type_id>5</parent_sequence_type_id>
  <name>EST</name>
  <description>EST - could be mRNA,rRNA...</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>9</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <sub_type>tRNA</sub_type>
  <strand>ss</strand>
  <hierarchy>3</hierarchy>
  <parent_sequence_type_id>5</parent_sequence_type_id>
  <name>tRNA</name>
  <description>tRNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>10</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <sub_type>rRNA</sub_type>
  <strand>ss</strand>
  <hierarchy>3</hierarchy>
  <parent_sequence_type_id>5</parent_sequence_type_id>
  <name>rRNA</name>
  <description>rRNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>11</sequence_type_id>
  <nucleotide_type>unknown</nucleotide_type>
  <hierarchy>1</hierarchy>
  <parent_sequence_type_id>11</parent_sequence_type_id>
  <name>unknown</name>
  <description>unknown</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>13</sequence_type_id>
  <nucleotide_type>RNA</nucleotide_type>
  <sub_type>predicted_mRNA</sub_type>
  <strand>ss</strand>
  <hierarchy>2</hierarchy>
  <parent_sequence_type_id>1</parent_sequence_type_id>
  <name>predicted_mRNA</name>
  <description>mRNA sequence predicted by an algorithm from genomic DNA</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>20</sequence_type_id>
  <nucleotide_type>virtual</nucleotide_type>
  <hierarchy>1</hierarchy>
  <name>virtual</name>
  <description>virtual nucleic acid sequence</description>
</DoTS::SequenceType>
//
<DoTS::SequenceType>
  <sequence_type_id>21</sequence_type_id>
  <nucleotide_type>DNA</nucleotide_type>
  <sub_type>GSS</sub_type>
  <strand>ds</strand>
  <hierarchy>3</hierarchy>
  <parent_sequence_type_id>3</parent_sequence_type_id>
  <name>GSS</name>
  <description>Genome Survey Sequence</description>
</DoTS::SequenceType>

WARNING: the instructions given in the PDF document supplied by Brett Tyler's lab did not work for me. The SubmitRow plugin first checks whether the primary key is present in the attrlist; since it is, the plugin tries to update what it thinks is an existing row. Removing the primary key gave me another problem: the first entry, DNA, fails to load because the plugin checks whether the parent key (parent_sequence_type_id) is already present, and since this record refers to itself the plugin stops!

ga GUS::Supported::Plugin::SubmitRow --tablename DoTS::SequenceType \
   --attrlist sequence_type_id,nucleotide_type,sub_type,strand,hierarchy,parent_sequence_type_id,name,description \
   --valuelist "1^^^DNA^^^null^^^null^^^1^^^1^^^DNA^^^unknown strandedness" \
   --commit
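Because every record points at a parent through parent_sequence_type_id, it may help to check the XML file for parents that are never defined before loading it. The following is only a minimal sketch and not part of GUS: the function name check_parents is hypothetical, and it assumes each ID element sits on its own line, as in the example XML above.

```shell
# Hypothetical helper (not part of GUS): report any parent_sequence_type_id
# in the XML file that has no matching sequence_type_id in the same file.
# Assumes one element per line, as in the example XML.
check_parents() {
    file="$1"
    # Every declared sequence_type_id, one per line.
    ids=$(sed -n 's:.*<sequence_type_id>\([0-9][0-9]*\)</sequence_type_id>.*:\1:p' "$file")
    status=0
    # Every distinct parent ID referenced in the file.
    for parent in $(sed -n 's:.*<parent_sequence_type_id>\([0-9][0-9]*\)</parent_sequence_type_id>.*:\1:p' "$file" | sort -u); do
        if ! printf '%s\n' "$ids" | grep -qx "$parent"; then
            echo "parent_sequence_type_id $parent has no matching record" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_parents DoTS.SequenceType.xml && echo "all parents defined"` prints the message only when every referenced parent is also defined in the file. A self-referencing record such as DNA passes this check, so it does not detect the load-order problem described in the warning.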

Taxonomy

Use a plugin to set up the tables relating to taxonomy;

Download the taxonomy file 'taxdump.tar.gz' from the NCBI site and save it - we use the directory /usr/local/src/gus-related as an example only. Unpack the file into several smaller files by typing;

$ cd /usr/local/src/gus-related
$ mkdir taxonomy
$ cd taxonomy
$ tar xvfzp ../taxdump.tar.gz

Note: NCBI does not use release numbers (versions) for the taxonomy database. Now, register the LoadTaxon plugin;

$ ga +create GUS::Supported::Plugin::LoadTaxon --commit

Run the plugin using the NCBI files by issuing the command;

$ ga GUS::Supported::Plugin::LoadTaxon --gencode=/usr/local/src/gus-related/taxonomy/gencode.dmp\
                                    --names=/usr/local/src/gus-related/taxonomy/names.dmp\
                                    --nodes=/usr/local/src/gus-related/taxonomy/nodes.dmp\
                                    --commit

Please note: This plugin could take many, many hours to complete!
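Since the load can run for hours, it may be worth confirming that the three dump files passed to LoadTaxon are in place before starting. This is a minimal sketch only: check_taxdump is a hypothetical helper name, not part of GUS, and it simply tests that each file exists and is non-empty.

```shell
# Hypothetical pre-flight check (not part of GUS): succeed only if the three
# NCBI dump files given to LoadTaxon exist and are non-empty in directory $1.
check_taxdump() {
    dir="$1"
    missing=0
    for f in gencode.dmp names.dmp nodes.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the example directory used on this page, `check_taxdump /usr/local/src/gus-related/taxonomy && echo ready` would print ready only when all three files are present.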


GO Ontology tables

Download the GO ontology data files from the Gene Ontology website, http://www.geneontology.org/. At the time of writing they could be found here: ftp://ftp.geneontology.org/pub/go/ontology/

Save these files; in our example we put them in /usr/local/src/gus-related/ontology/, but they can go anywhere. Then register the LoadGoOntology plugin;

$ ga +create GUS::GOPredict::Plugin::LoadGoOntology --commit

The table SRes::GORelationshipType needs to have two rows added to it before we can run the GO plugin. Save this XML to a file (called SRes.GORelationshipType.xml here);

<GUS::Model::SRes::GORelationshipType>
  <go_relationship_type_id>1</go_relationship_type_id>
  <name>isa</name>
</GUS::Model::SRes::GORelationshipType>

<GUS::Model::SRes::GORelationshipType>
  <go_relationship_type_id>2</go_relationship_type_id>
  <name>partof</name>
</GUS::Model::SRes::GORelationshipType>

and run this command

ga GUS::Supported::Plugin::LoadGusXml --commit --filename=/my/path/to/SRes.GORelationshipType.xml

If you used the XML from the above section, SRes::ExternalDatabase and SRes::ExternalDatabaseRelease, then the IDs below (process_db_id etc.) are correct. If not, please update these values. The *_ext_db_rel IDs are generated by your database (from a sequence) and are probably the same as the values below, but please check!

ga GUS::GOPredict::Plugin::LoadGoOntology --file_path=/usr/local/src/gus-related/ontology/ \
                                          --process_db_id=3003 \
                                          --process_ext_db_rel=165 \
                                          --function_db_id=3001 \
                                          --function_ext_db_rel=164 \
                                          --component_db_id=3002 \
                                          --component_ext_db_rel=166 \
                                          --commit \
                                          >& /home/apps/GUS/dev/uploads/go-ontology-upload.out

NOTE: If you commit on the first run, it works. If you need to run the plugin a second time, choose a different name for the id file parameter; e.g., if you chose id go the first time, choose id go1 the second time.


GO Evidence Code

This is a single dictionary table for the GO evidence codes such as ISS, IEA, and NR. Save the XML below to a file and run the LoadGusXml plugin, e.g.

ga GUS::Supported::Plugin::LoadGusXml --commit --filename=my_go_evidence_file.xml
<SRes::GOEvidenceCode>
  <NAME>IC</NAME>
  <DESCRIPTION>inferred by curator</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IDA</NAME>
  <DESCRIPTION>inferred from direct assay</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IEA</NAME>
  <DESCRIPTION>inferred from electronic annotation</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IEP</NAME>
  <DESCRIPTION>inferred from expression pattern</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IGI</NAME>
  <DESCRIPTION>inferred from genetic interaction</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IMP</NAME>
  <DESCRIPTION>inferred from mutant phenotype</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>IPI</NAME>
  <DESCRIPTION>inferred from physical interaction</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>ISS</NAME>
  <DESCRIPTION>inferred from sequence or structural similarity</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>NAS</NAME>
  <DESCRIPTION>non-traceable author statement</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>ND</NAME>
  <DESCRIPTION>no biological data available</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>TAS</NAME>
  <DESCRIPTION>traceable author statement</DESCRIPTION>
</SRes::GOEvidenceCode>
<SRes::GOEvidenceCode>
  <NAME>NR</NAME>
  <DESCRIPTION>not recorded</DESCRIPTION>
</SRes::GOEvidenceCode>

Pfam

About Pfam:

Pfam is a collection of protein family alignments which were constructed semi-automatically using HMMs. Sequences that are not covered by Pfam are clustered and aligned automatically, and released as pfamB. pfamA families have permanent accession numbers and contain functional annotation and cross-references to other databases, while pfamB families are re-generated at each release and are un-annotated.

Construction of Pfam:

pfamA is based on a sequence database called pfamseq - pfamseq 11 is based on swissprot 41.25 and SP-TrEMBL 24.14. pfamB is constructed from PRODOM 2002.1.

Tables populated by the LoadPfam module:

The LoadPfam.pm module inserts data into dots.pfamentry, dots.dbrefpfamentry and sres.dbref. The dots.dbrefpfamentry table is a child of dots.pfamentry and sres.dbref: its pfam_entry_id and db_ref_id columns are derived from the dots.pfamentry and sres.dbref tables respectively.

Data for LoadPfam plugin:

The data can be downloaded from http://www.sanger.ac.uk/Software/Pfam/ftp.shtml.

Running LoadPfam

First register the plugin;

ga +create GUS::Community::Plugin::LoadPfam --commit

It is wise to do a test run of LoadPfam first. With the --parse_only option the plugin will not insert any rows into the database. This is different from leaving off --commit, where the plugin inserts the data but the rows are removed at the end; hence --parse_only is much, much faster. To test, run this;

ga GUS::Community::Plugin::LoadPfam --parse_only --flat_file=/my/path/to/Pfam-A.full.gz --release=14 >& LoadPfam.log

The only thing that should go wrong, if anything, is that a database mentioned in Pfam-A.full is not in the tables ExternalDatabase and ExternalDatabaseRelease. You will have to add the database before continuing. Note: the --release option refers to the release of Pfam you are loading; please change the number to match, as this document will go out of date.

If everything went ok, you can commit;

ga GUS::Community::Plugin::LoadPfam --flat_file=/my/path/to/Pfam-A.full.gz --release=14 --commit >& ! LoadPfam.log

Note: The file Pfam-A.full.gz can be left in compressed format to save space. The plugin will automatically uncompress the file if it has a .Z or .gz file extension.
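If the file is kept compressed, it may be worth confirming that the download is intact before starting a load. The sketch below is not part of LoadPfam: pfam_gz_ok is a hypothetical helper name, and it relies only on gzip's own integrity test (gzip -t), so it is intended for the .gz case.

```shell
# Hypothetical helper (not part of GUS or LoadPfam): succeed only if $1
# exists and passes gzip's built-in integrity test.
pfam_gz_ok() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    if ! gzip -t "$file" 2>/dev/null; then
        echo "not a valid gzip file: $file" >&2
        return 1
    fi
    return 0
}
```

For example, `pfam_gz_ok /my/path/to/Pfam-A.full.gz || echo "re-download the file"` prints the hint only when the check fails.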