

Resources and Tools
Installing Software
Populating the Database
A Note on Gene Location Estimates
Building the Site
Project Team

TheileriaDB is a resource on the genome of the veterinary parasite Theileria parva, which was sequenced by The Institute for Genomic Research and the International Livestock Research Institute. It is built entirely with free software.

TheileriaDB is based on the Genomics Unified Schema (GUS). GUS is a relational database schema and associated application framework designed to store, integrate, analyze and present functional genomics data. The GUS Web Development Kit (WDK) enables web sites that query and display data in a GUS instance. TheileriaDB was thus created in three broad stages: installing software, populating the GUS database, and building the site itself. These are described in detail below.

This page describes how TheileriaDB was built. It is intended both to give context for the data presented on the site and to serve as a guide for those who want to publish their own scientific datasets on the web. It is part of the GUS Wiki, which runs PHP Wiki software.

Resources and Tools

The following resources and tools were used in building TheileriaDB:

genome data

public genomics data resource

software tools

hardware

Installing Software

initial state

When this project started, the CentOS operating system, Subversion, Apache, Tomcat, PostgreSQL, InterProScan, and BLAT had already been installed. See their respective web sites for detailed installation procedures.

installing GUS and the WDK using subversion

An installation of GUS is organized around a project_home directory, which contains all source code, and a gus_home directory, into which executable code is installed. We created these under a directory named theileria in our home directory, and installed GUS as described in the GUS Installation Guide.

make directories and download

cd
mkdir theileria
cd theileria
mkdir project_home
cd project_home
svn checkout https://www.cbil.upenn.edu/svn/gus/GusAppFramework/branches/internal/plasmodb_5-0beta-0 GUS
svn checkout https://www.cbil.upenn.edu/svn/gus/CBIL/branches/internal/api-november-2006 CBIL
svn checkout https://www.cbil.upenn.edu/svn/gus/WDK/branches/WDK_version_1-12 WDK
svn checkout https://www.cbil.upenn.edu/svn/gus/WSF/branches/WDK_version_1-12 WSF
svn checkout https://www.cbil.upenn.edu/svn/gus/GusSchema/branches/3-6-Dev GusSchema
svn checkout https://www.cbil.upenn.edu/svn/gus/install/trunk install

add lines to .bash_profile to initialize environment

export GUS_HOME=${HOME}/theileria/gus_home
export PROJECT_HOME=${HOME}/theileria/project_home
export PATH=${GUS_HOME}/bin:${PROJECT_HOME}/install:${PATH}
export PERL5LIB=${GUS_HOME}/lib/perl
export GUS_CONFIG_FILE=${GUS_HOME}/config/gus.config
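
Before building, it can help to confirm that a new login shell actually picked up these settings. The following is a minimal sketch, not part of GUS: the function name check_gus_env is our own, and it only tests that the variables exported above are non-empty.

```shell
#!/bin/sh
# Report any GUS-related environment variable that is unset or empty.
# The variable names are the ones exported above; check_gus_env is a
# hypothetical helper, not a GUS command.
check_gus_env() {
    missing=0
    for name in GUS_HOME PROJECT_HOME PERL5LIB GUS_CONFIG_FILE; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return "$missing"
}

if check_gus_env; then
    echo "GUS environment looks complete"
fi
```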

set up GUS config file

cp ${PROJECT_HOME}/install/gus.config.sample ${GUS_HOME}/config/gus.config

. . . then edit the following lines in it to define database connection

dbVendor=Postgres
dbiDsn=dbi:Pg:dbname=gus
jdbcDsn=jdbc:postgresql://localhost:5432/gus
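
After editing, the connection settings can be read back to catch typos. This is only a sketch: config_value is a hypothetical helper of ours, and it assumes the file uses plain key=value lines like the ones shown above.

```shell
#!/bin/sh
# Print the value of one key from a key=value file such as gus.config.
# config_value is a hypothetical helper; $1 = key, $2 = file.
config_value() {
    sed -n "s/^$1=//p" "$2"
}

# Show the three connection settings if the config file is present.
cfg="${GUS_CONFIG_FILE:-}"
if [ -n "$cfg" ] && [ -f "$cfg" ]; then
    for key in dbVendor dbiDsn jdbcDsn; do
        echo "$key -> $(config_value "$key" "$cfg")"
    done
fi
```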

install executables and create database schema

build GUS install -append -installDBSchema

Populating the Database

A GUS database instance is populated by means of plugins. We used a variety of plugins to build TheileriaDB, including both existing plugins, which we downloaded, and ones which we created or modified to our own purposes.

loading bootstrap data

The NCBI Taxonomy database and the Gene Ontology were loaded as described on the Bootstrap data page.

loading the InterPro domain database

ga GUS::Community::Plugin::InsertInterproDomainDbs --inPath ~/theileria/data/interpro/iprscan/data/  --commit

defining the Theileria parva genome as an external database

The GUS sres.ExternalDatabase and sres.ExternalDatabaseRelease tables define external datasets (and their versions). The InsertExternalDatabase and InsertExternalDatabaseRls plugins populate these tables with records that subsequent data loads can link to.

ga GUS::Supported::Plugin::InsertExternalDatabase --name 'Theileria parva' --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName 'Theileria parva' --databaseVersion 1.0 --commit

loading chromosome sequences

The Theileria parva contigs are published as a FASTA file, which we loaded using the LoadFastaSequences plugin. The arguments identify the input file, the name and version of the external dataset, the GUS table to populate, and the taxon from which the sequences came. They also give a regular expression that parses the sequence ID out of each defline.

ga GUS::Supported::Plugin::LoadFastaSequences --externalDatabaseName 'Theileria parva' --externalDatabaseVersion 1.0 \
                   --regexSourceId  '(c[0-9]m[0-9]*)' --tableName DoTS::ExternalNASequence --sequenceFile ~/TPA1.1con \
                   --writeFile tpa_chromosomes.fasta --ncbiTaxId 5875 --SOTermName contig --commit
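
To see what the --regexSourceId pattern captures, it can be tried against a defline outside the plugin. The sketch below uses grep rather than the plugin's own matching, and the defline is invented for illustration; real deflines in the contig file may look different.

```shell
#!/bin/sh
# Extract a sequence ID from a FASTA defline using the same pattern that is
# passed to --regexSourceId. extract_source_id is a hypothetical helper.
extract_source_id() {
    printf '%s\n' "$1" | grep -oE 'c[0-9]m[0-9]*' | head -n 1
}

# The defline below is made up for this example.
extract_source_id '>c1m531 example contig description'
```

With this pattern the ID is a "c", one digit, an "m", and any run of digits, so the example prints c1m531.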

loading CDSs

Gene coding sequences are loaded with an additional run of LoadFastaSequences.

ga GUS::Supported::Plugin::LoadFastaSequences --externalDatabaseName 'Theileria parva' --externalDatabaseVersion 1.0 \
                   --regexSourceId  '(c[0-9]m[0-9]*)' --tableName DoTS::ExternalNASequence --sequenceFile ~/TPA1.1con \
                   --writeFile tpa_chromosomes.fasta --ncbiTaxId 5875 --SOTermName contig --commit

loading BLAT alignment of genes against chromosome

We estimated gene locations by means of a BLAT alignment of genes against contigs. The alignments were loaded with the LoadBLATAlignments plugin.

ga GUS::Community::Plugin::LoadBLATAlignments --blat_files /genomics/binf/theileria/genome/cds-contigs/master/mainresult/out.psl \
                   --query_file /home/iodice/theileria/data/2blat/blocked.seq --query_table_id 229 --target_table_id 229 \
                   --query_taxon_id 85823   --target_taxon_id 85823 --target_db_rel_id 4 --action load --queryRegex '>(\S+) ' \
                   --max_query_gap 5 --min_pct_id 95 --max_end_mismatch 10 --end_gap_factor 10 --min_gap_pct 90 \
                   --ok_internal_gap 15 --ok_end_gap 50 --min_query_pct 10 --commit
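
Before loading, it can be useful to preview how many alignments in the PSL file would survive an identity cutoff such as the 95 given to --min_pct_id. The sketch below is an approximation under stated assumptions: it computes identity as matches / (matches + mismatches) from the first two PSL columns, which may not be the calculation the plugin performs, and psl_min_identity is a hypothetical helper.

```shell
#!/bin/sh
# Keep PSL alignment rows whose identity is at least a given percentage.
# Identity here is matches / (matches + mismatches), a simplification that
# may differ from the plugin's own filter. $1 = minimum percent, $2 = PSL file.
psl_min_identity() {
    awk -v min="$1" -F '\t' '
        $1 ~ /^[0-9]+$/ {              # skip the optional PSL header lines
            aligned = $1 + $2          # column 1: matches, column 2: misMatches
            if (aligned > 0 && 100 * $1 / aligned >= min) print
        }' "$2"
}

# Example use (the file name is a placeholder):
#   psl_min_identity 95 out.psl | wc -l
```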

creating gene, exon, and location records from stored BLAT alignments

Having loaded the BLAT alignments, we used another plugin to create GeneFeature and ExonFeature records (together with NaLocation records to store their locations). No plugin existed for this task, so we created a new one.

ga GUS::Community::Plugin::MakeGenesFromAlignments --commit

loading InterPro domain alignments

We used InterProScan to find protein domains. The InsertInterproscanResults plugin loaded those domain features and the concomitant Gene Ontology terms.

ga GUS::Community::Plugin::InsertInterproscanResults --resultFileDir ~/theileria/data/iprscan/master/mainresult \
               --confFile ~/theileria/data/interpro/iprscan/data/insertInterpro-config.xml --aaSeqTable ExternalAASequence \
                --extDbName InterPro --extDbRlsVer 12.1 --goVersion 'Gene Ontology|5.71' --commit

A Note on Gene Location Estimates

The genomic data published on TIGR's Theileria parva download site included contig sequences and gene sequences, but did not include genomic locations for the genes. Therefore, we estimated those locations by using BLAT to align the genes to the genome, and used GUS plugins to load the alignments and create exon records based on them. See details above in Populating the Database.

We used two techniques to evaluate the quality of these alignments. First, we compared figure 2 in the supplement to Gardner et al. with the corresponding views in the TheileriaDB genome browser. This covered a total of 22 genes in three different contigs. All had the same orientation and approximate relative size and position in both images. Second, we took advantage of the fact that the genes of T. parva appear to have been numbered consecutively along the chromosomes, and compared a list of genes ordered by estimated location with a list ordered by gene name. The differences between the lists involve 41 of the approximately 4000 genes:

TP01_0329 TP01_1228 TP01_1229 TP01_1230 TP01_1231 TP01_1232 TP01_1233

TP02_0596 TP02_0747 TP02_0961 TP02_0962 TP02_0963 TP02_0964 TP02_0965 TP02_0966

TP03_0012 TP03_0285 TP03_0383 TP03_0420 TP03_0893 TP03_0894 TP03_0895 TP03_0896 TP03_0897 TP03_0898 TP03_0899 TP03_0900 TP03_0901 TP03_0902 TP03_0903 TP03_0904 TP03_0929 TP03_0930

TP04_0177 TP04_0388 TP04_0439 TP04_0923 TP04_0924 TP04_0925 TP04_0926 TP04_0927 TP04_0928
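
A comparison of this kind can be sketched with standard tools. The example below is one possible approach, not necessarily the procedure used for the list above. It assumes a tab-separated file with gene name, contig, and start coordinate on each line (a layout we made up), and it assumes that contigs sort in the same order as the gene name prefixes. The function name misordered_genes is hypothetical.

```shell
#!/bin/sh
# List genes whose position differs between name order and location order.
# $1 = tab-separated file: gene name, contig, start coordinate (assumed layout).
misordered_genes() {
    by_name=$(mktemp) && by_loc=$(mktemp) || return 1
    tab=$(printf '\t')
    # Gene names in plain alphabetical order.
    cut -f 1 "$1" | LC_ALL=C sort > "$by_name"
    # Gene names ordered by contig, then by numeric start coordinate.
    LC_ALL=C sort -t "$tab" -k 2,2 -k 3,3n "$1" | cut -f 1 > "$by_loc"
    # diff exits 1 when the lists differ, which is expected here; its "<" and
    # ">" lines name the genes that must move to turn one list into the other.
    { diff "$by_name" "$by_loc" || true; } | sed -n 's/^[<>] //p' | LC_ALL=C sort -u
    rm -f "$by_name" "$by_loc"
}

# Example use (the file name is a placeholder):
#   misordered_genes gene_locations.tsv
```

Because diff reports a small set of moves, a single displaced gene shows up once rather than shifting every gene after it.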

Building the Site

TheileriaDB is based on the GUS Web Development Kit. The WDK is based on the Java Servlet API and uses the Apache Tomcat servlet container. It is distributed with a simple demo site, which we copied and modified to create TheileriaDB.org.

The following sections describe the components of the site.

model file: theileriaModel.xml
This XML file defines the types of records that the site can display. In TheileriaDB's case there's only one type: the gene record. The model file includes queries that let users retrieve sets of genes that match specified criteria, such as the query for genes by associated keyword. Other queries define record attributes (for instance, the gene record has attributes for product name, amino-acid sequence, and Gene Ontology associations; these are populated by queries in the model file).

sanity test file: theileriaSanity.xml
This companion to the model file defines parameters to be used in testing the queries of the model file, together with upper and lower limits on the number of rows that should be returned. This enables automated testing which can uncover potential problems of many kinds.
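
The idea behind the row limits can be illustrated in a few lines. This is not the WDK sanity tester and does not read theileriaSanity.xml; it is a sketch of the check itself, with a hypothetical function name and made-up numbers.

```shell
#!/bin/sh
# Succeed when a query's row count lies within the expected limits.
# $1 = observed rows, $2 = lower limit, $3 = upper limit.
rows_within_limits() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# The counts below are invented for illustration.
if rows_within_limits 37 1 500; then
    echo "ok"
else
    echo "out of range"
fi
```

A query that suddenly returns zero rows, or far more than expected, fails such a check even though it ran without error.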

custom record file: GeneRecordClasses.GeneRecordClass.jsp
This JSP file overrides the WDK's default behavior in displaying the gene record. This lets the TheileriaDB gene page precede a gene's list of exon locations with a notice about how these data were estimated.

Ajax code for InterPro query: AjaxInterpro.js, prototype.js, scriptaculous.js
These files contain the Ajax code that implements the interactive GenesByInterproDomain query.

Project Team

The TheileriaDB development team, John Iodice and Bryan Cardillo, gratefully acknowledge the assistance of:

Jerric Gao
Chris Stoeckert
Lyle Ungar
Steve Fischer
Praveen Chakravarthula
Deborah Pinney
The Computational Biology and Informatics Lab