
Building TheileriaDB
TheileriaDB is a resource on the genome of the veterinary parasite Theileria parva, which was sequenced by The Institute for Genomic Research and the International Livestock Research Institute. It is built entirely with free software.
TheileriaDB is based on the Genomics Unified Schema (GUS). GUS is a relational database schema and associated application framework designed to store, integrate, analyze and present functional genomics data. The GUS Web Development Kit (WDK) enables web sites that query and display data in a GUS instance.
TheileriaDB was thus created in three broad stages: installing software, populating the GUS
database, and building the site itself. These are described in detail below.
This page is intended both to give context for the data presented on the site and to serve as a guide for those who seek to publish their own scientific datasets on the web. It is part of the GUS Wiki, which runs PHP Wiki software.
Resources and Tools
The following resources and tools were used in building TheileriaDB:
genome data
- Contig and gene sequences downloaded from the Theileria parva genome project.
public genomics data resources
- The NCBI Taxonomy database
- The InterPro database of protein families
- The Gene Ontology controlled vocabulary of gene and gene-product attributes
- The Disease Ontology controlled vocabulary of diseases and conditions
software tools
- The Genomics Unified Schema, a relational schema designed for the storage and manipulation of genomic data
- The Web Development Kit, which is associated with GUS
- GBrowse, the generic genome browser
- The MedlineDB publications datamining project
- InterProScan, which searches protein sequences for InterPro domains
- BLAT, the BLAST-Like Alignment Tool
- The PostgreSQL relational database
- Subversion, a version control system
- The Apache web server
- The Apache Tomcat servlet container
- The BioPerl, BioPython, and BioSQL projects
- CentOS, the Community Enterprise Operating System, a version of the Linux operating system
hardware
- The TheileriaDB server is a Dell PowerEdge 2650 with dual 2.4GHz Xeon processors and 4GB of RAM.
Installing Software
initial state
When this project started, the CentOS operating system, Subversion, Apache, Tomcat, PostgreSQL, InterProScan, and BLAT had already been installed. See their respective web sites for detailed installation procedures.
installing GUS and the WDK using Subversion
An installation of GUS is organized around a project_home directory, which contains all source code, and a gus_home directory, to which executable code is installed. We created these under a directory named theileria in our home directory, and installed GUS as described in the GUS Installation Guide.
make directories and download
cd
mkdir theileria
cd theileria
mkdir project_home
cd project_home
svn checkout https://www.cbil.upenn.edu/svn/gus/GusAppFramework/branches/internal/plasmodb_5-0beta-0 GUS
svn checkout https://www.cbil.upenn.edu/svn/gus/CBIL/branches/internal/api-november-2006 CBIL
svn checkout https://www.cbil.upenn.edu/svn/gus/WDK/branches/WDK_version_1-12 WDK
svn checkout https://www.cbil.upenn.edu/svn/gus/WSF/branches/WDK_version_1-12 WSF
svn checkout https://www.cbil.upenn.edu/svn/gus/GusSchema/branches/3-6-Dev GusSchema
svn checkout https://www.cbil.upenn.edu/svn/gus/install/trunk install
add lines to .bash_profile to initialize environment
export GUS_HOME=${HOME}/theileria/gus_home
export PROJECT_HOME=${HOME}/theileria/project_home
export PATH=${GUS_HOME}/bin:${PROJECT_HOME}/install:${PATH}
export PERL5LIB=${GUS_HOME}/lib/perl
export GUS_CONFIG_FILE=${GUS_HOME}/config/gus.config
set up GUS config file
cp ${PROJECT_HOME}/gus_install/gus.config.sample ${GUS_HOME}/config/gus.config
. . . then edit the following lines in it to define the database connection
dbVendor=Postgres
dbiDsn=dbi:Postgres:gus
jdbcDsn=jdbc:postgres:thin:@localhost:5432:gus
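Before running the build it can be useful to confirm that the connection settings are actually present in the edited file. The following is a minimal sketch, assuming gus.config uses plain key=value lines with # comments; the helper names read_properties and missing_keys are illustrative and are not part of GUS.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

# The three connection settings named in the text above.
REQUIRED = ("dbVendor", "dbiDsn", "jdbcDsn")

def missing_keys(text):
    """Return the required keys that are absent or empty in the config text."""
    props = read_properties(text)
    return [key for key in REQUIRED if not props.get(key)]

# Illustrative config fragment with one setting left out.
sample = "dbVendor=Postgres\ndbiDsn=dbi:Postgres:gus\n"
print(missing_keys(sample))
```

In practice the text would be read from the file named by GUS_CONFIG_FILE rather than from an inline string.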
install executables and create database schema
build GUS install -append -installDBSchema
Populating the Database
A GUS database instance is populated by means of plugins. We used a variety of plugins to build TheileriaDB: existing plugins, which we downloaded, and ones that we created or modified for our own purposes.
loading bootstrap data
The NCBI Taxonomy database and the Gene Ontology were loaded as described on the Bootstrap data page.
loading the InterPro domain database
ga GUS::Community::Plugin::InsertInterproDomainDbs --inPath ~/theileria/data/interpro/iprscan/data/ --commit
defining the Theileria parva genome as an external database
The GUS sres.ExternalDatabase and sres.ExternalDatabaseRelease tables define external datasets (and their versions). The InsertExternalDatabase and InsertExternalDatabaseRls plugins populate these tables with records that subsequent data loads can link to.
ga GUS::Supported::Plugin::InsertExternalDatabase --name 'Theileria parva' --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName 'Theileria parva' --databaseVersion 1.0 --commit
loading chromosome sequences
The Theileria parva contigs are published as a FASTA file, which we loaded using the LoadFastaSequences plugin. The arguments identify the input file, the name and version of the external dataset, the GUS table to populate, and the taxon from which the sequences came. They also supply a regular expression that extracts the sequence ID from each defline.
ga GUS::Supported::Plugin::LoadFastaSequences --externalDatabaseName 'Theileria parva' --externalDatabaseVersion 1.0 \
--regexSourceId '(c[0-9]m[0-9]*)' --tableName DoTS::ExternalNASequence --sequenceFile ~/TPA1.1con \
--writeFile tpa_chromosomes.fasta --ncbiTaxId 5875 --SOTermName contig --commit
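The effect of the --regexSourceId pattern can be previewed outside GUS before committing a load. The sketch below applies the same pattern in Python to a made-up defline; the defline text and the helper name source_id are illustrative assumptions and are not taken from the TIGR file. The plugin itself is written in Perl, but this simple pattern behaves the same way in both languages.

```python
import re

# Same pattern as passed to --regexSourceId.
SOURCE_ID_PATTERN = re.compile(r"(c[0-9]m[0-9]*)")

def source_id(defline):
    """Return the first captured group from a FASTA defline, or None."""
    match = SOURCE_ID_PATTERN.search(defline)
    return match.group(1) if match else None

# Hypothetical defline, for illustration only.
print(source_id(">c1m531 Theileria parva contig"))
```

A defline for which this returns None would have no source ID to load, so running such a check over the whole file first is a cheap way to catch a pattern that does not fit the data.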
loading CDSs
Gene coding sequences are loaded with an additional run of LoadFastaSequences.
ga GUS::Supported::Plugin::LoadFastaSequences --externalDatabaseName 'Theileria parva' --externalDatabaseVersion 1.0 \
--regexSourceId '(c[0-9]m[0-9]*)' --tableName DoTS::ExternalNASequence --sequenceFile ~/TPA1.1con \
--writeFile tpa_chromosomes.fasta --ncbiTaxId 5875 --SOTermName contig --commit
loading BLAT alignment of genes against chromosome
We estimated gene locations by means of a BLAT alignment of genes against contigs. The alignments were loaded with the LoadBLATAlignments plugin.
ga GUS::Community::Plugin::LoadBLATAlignments --blat_files /genomics/binf/theileria/genome/cds-contigs/master/mainresult/out.psl \
--query_file /home/iodice/theileria/data/2blat/blocked.seq --query_table_id 229 --target_table_id 229 \
--query_taxon_id 85823 --target_taxon_id 85823 --target_db_rel_id 4 --action load --queryRegex '>(\S+) ' \
--max_query_gap 5 --min_pct_id 95 --max_end_mismatch 10 --end_gap_factor 10 --min_gap_pct 90 \
--ok_internal_gap 15 --ok_end_gap 50 --min_query_pct 10 --commit
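To get a feel for what thresholds such as --min_pct_id 95 and --min_query_pct 10 mean for a given PSL file, the alignments can be screened independently of GUS. This is a rough sketch only: the identity and coverage formulas below are simplified assumptions and are not the calculations performed by LoadBLATAlignments, and the example alignment line is invented. The column positions follow the standard 21-column PSL layout.

```python
from collections import namedtuple

PslHit = namedtuple(
    "PslHit",
    "matches mismatches rep_matches strand q_name q_size q_start q_end t_name",
)

def parse_psl_line(line):
    """Parse the fields used here from one tab-separated PSL alignment line."""
    f = line.rstrip("\n").split("\t")
    return PslHit(
        matches=int(f[0]), mismatches=int(f[1]), rep_matches=int(f[2]),
        strand=f[8], q_name=f[9], q_size=int(f[10]),
        q_start=int(f[11]), q_end=int(f[12]), t_name=f[13],
    )

def percent_identity(hit):
    """Simplified identity: matching bases over all aligned bases."""
    aligned = hit.matches + hit.mismatches + hit.rep_matches
    return 100.0 * (hit.matches + hit.rep_matches) / aligned if aligned else 0.0

def query_percent(hit):
    """Simplified coverage: aligned query span over query length."""
    return 100.0 * (hit.q_end - hit.q_start) / hit.q_size if hit.q_size else 0.0

def passes(hit, min_pct_id=95.0, min_query_pct=10.0):
    return percent_identity(hit) >= min_pct_id and query_percent(hit) >= min_query_pct

# Invented alignment of a hypothetical 1000-base gene, for illustration only.
example_line = "\t".join([
    "980", "20", "0", "0", "0", "0", "0", "0", "+", "TP01_0001", "1000",
    "0", "1000", "c1m531", "50000", "100", "1100", "1", "1000,", "0,", "100,",
])
hit = parse_psl_line(example_line)
print(hit.q_name, percent_identity(hit), query_percent(hit), passes(hit))
```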
creating gene, exon, and location records from stored BLAT alignments
Having loaded the BLAT alignments, we used another plugin to create GeneFeature and ExonFeature records (together with NaLocation records to store their locations). No plugin existed for this task, so we created a new one.
ga GUS::Community::Plugin::MakeGenesFromAlignments --commit
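The core idea of turning an alignment into exon locations can be sketched in a few lines. This is not the code of MakeGenesFromAlignments; it assumes untranslated nucleotide alignments (where PSL tStarts are zero-based offsets on the forward strand of the target) and treats every aligned block as one exon, whereas a real implementation may merge blocks separated by small gaps. The function name exons_from_psl is hypothetical.

```python
def exons_from_psl(block_sizes, t_starts):
    """Convert PSL blockSizes/tStarts strings to 1-based inclusive (start, end) pairs."""
    sizes = [int(s) for s in block_sizes.split(",") if s]
    starts = [int(s) for s in t_starts.split(",") if s]
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and tStarts differ in length")
    # PSL starts are zero-based; add 1 to get a 1-based inclusive start.
    return [(start + 1, start + size) for start, size in zip(starts, sizes)]

# Invented two-block alignment, for illustration only.
print(exons_from_psl("100,50,", "1000,1200,"))
```

Each resulting pair corresponds to what would be stored as one exon location on the contig, with the gene spanning from the first start to the last end.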
loading InterPro domain alignments
We used InterProScan to find protein domains. The InsertInterproscanResults plugin loaded those domain features and the concomitant Gene Ontology terms.
ga GUS::Community::Plugin::InsertInterproscanResults --resultFileDir ~/theileria/data/iprscan/master/mainresult \
--confFile ~/theileria/data/interpro/iprscan/data/insertInterpro-config.xml --aaSeqTable ExternalAASequence \
--extDbName InterPro --extDbRlsVer 12.1 --goVersion 'Gene Ontology|5.71' --commit
A Note on Gene Location Estimates
The genomic data published on TIGR's Theileria parva download site included contig sequences and gene sequences, but did not include genomic locations for the genes. Therefore, we estimated those locations by using BLAT to align the genes to the genome, and used GUS plugins to load the alignments and create exon records based on them. See details above in Populating the Database.
We used two techniques to evaluate the quality of these alignments. First, we compared Figure 2 in the supplement to Gardner et al. with the corresponding views in the TheileriaDB genome browser. This covered a total of 22 genes in three different contigs. All had the same orientation and approximate relative size and position in both images. Second, we took advantage of the fact that the genes of T. parva appear to have been numbered consecutively along the chromosomes, and compared a list of genes ordered by estimated location with a list ordered by gene name. The differences between the lists involve 41 of the approximately 4000 genes:
- TP01_0329 TP01_1228 TP01_1229 TP01_1230 TP01_1231 TP01_1232 TP01_1233
- TP02_0596 TP02_0747 TP02_0961 TP02_0962 TP02_0963 TP02_0964 TP02_0965 TP02_0966
- TP03_0012 TP03_0285 TP03_0383 TP03_0420 TP03_0893 TP03_0894 TP03_0895 TP03_0896 TP03_0897 TP03_0898 TP03_0899 TP03_0900 TP03_0901 TP03_0902 TP03_0903 TP03_0904 TP03_0929 TP03_0930
- TP04_0177 TP04_0388 TP04_0439 TP04_0923 TP04_0924 TP04_0925 TP04_0926 TP04_0927 TP04_0928
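The second check, comparing gene order by estimated location with gene order by name, can be approximated as follows. This is a sketch under assumptions, not the procedure actually used for the list above: it takes a list of gene names already sorted by chromosome and estimated start, relies on the zero-padded names sorting correctly as plain strings, and uses Python's difflib to find runs on which the two orderings agree. difflib does not guarantee a minimal set of differences, so the result is a reasonable estimate rather than an exact count.

```python
from difflib import SequenceMatcher

def out_of_order(genes_by_location):
    """Return genes whose place in the location order disagrees with name order."""
    by_name = sorted(genes_by_location)
    matcher = SequenceMatcher(a=by_name, b=genes_by_location, autojunk=False)
    in_order = set()
    for block in matcher.get_matching_blocks():
        # Genes inside a matching block appear in the same relative order in both lists.
        in_order.update(by_name[block.a:block.a + block.size])
    return [gene for gene in by_name if gene not in in_order]

# Invented example in which two neighbouring genes are swapped.
by_location = ["TP01_0001", "TP01_0003", "TP01_0002", "TP01_0004"]
print(out_of_order(by_location))
```

A swap of two neighbours is reported as a single displaced gene, since moving one of them is enough to restore the order.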
Building the Site
TheileriaDB is based on the GUS Web Development Kit. The WDK is based on the Java Servlet API, and uses the Apache Tomcat servlet container. The WDK is distributed with a simple demo site. We copied and modified this demo to create TheileriaDB.org.
The following sections describe the components of the site.
model file: theileriaModel.xml
This XML file defines the types of records that the site can display. In TheileriaDB's case there's only one type: the gene record. The model file includes queries that let users retrieve sets of genes that match specified criteria, such as the query for genes by associated keyword. Other queries populate record attributes; for instance, the gene record has attributes for product name, amino-acid sequence, and Gene Ontology associations.
sanity test file: theileriaSanity.xml
This companion to the model file defines parameters to be used in testing the queries of the model file, together with upper and lower limits on the number of rows that should be returned. This enables automated testing which can uncover potential problems of many kinds.
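The idea behind such a sanity test, running a query with fixed parameters and checking that the row count falls within expected bounds, can be illustrated independently of the WDK. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database rather than PostgreSQL, and the table, columns, product names and the helper check_row_count are all invented for illustration and do not reflect the GUS schema or the format of theileriaSanity.xml.

```python
import sqlite3

def check_row_count(conn, sql, params, min_rows, max_rows):
    """Run a query and report whether its row count lies within [min_rows, max_rows]."""
    rows = conn.execute(sql, params).fetchall()
    return min_rows <= len(rows) <= max_rows

# Invented toy data standing in for a gene table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (source_id TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("TP01_0001", "hypothetical protein"), ("TP01_0002", "kinase, putative")],
)

keyword_query = "SELECT source_id FROM gene WHERE product LIKE ?"
print(check_row_count(conn, keyword_query, ("%kinase%",), 1, 10))
```

A lower bound above zero catches queries that silently return nothing after a data reload, while the upper bound catches joins that have started multiplying rows.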
custom record file: GeneRecordClasses.GeneRecordClass.jsp
This JSP file overrides the WDK's default behavior in displaying the gene record. This lets the TheileriaDB gene page precede a gene's list of exon locations with a notice about how these data were estimated.
Ajax code for InterPro query: AjaxInterpro.js, prototype.js, scriptaculous.js
These files contain the Ajax code that implements the interactive GenesByInterproDomain query.
Project Team
The TheileriaDB development team, John Iodice and Bryan Cardillo, gratefully acknowledge the assistance of:
- Jerric Gao
- Chris Stoeckert
- Lyle Ungar
- Steve Fischer
- Praveen Chakravarthula
- Deborah Pinney
- The Computational Biology and Informatics Lab




