Using the LoadPfam module:

The LoadPfam.pm module inserts data into dots.pfamentry, dots.dbrefpfamentry and sres.dbref. The dots.dbrefpfamentry table is a child of both dots.pfamentry and sres.dbref: its pfam_entry_id and db_ref_id columns are taken from the dots.pfamentry and sres.dbref tables respectively.

Data for LoadPfam plugin:

The data can be downloaded from http://www.sanger.ac.uk/Software/Pfam/ftp.shtml.

About Pfam:

Pfam is a collection of protein family alignments constructed semi-automatically using hidden Markov models (HMMs). Sequences that are not covered by Pfam are clustered and aligned automatically, and released as pfamB. pfamA families have permanent accession numbers and contain functional annotation and cross-references to other databases, while pfamB families are regenerated at each release and are unannotated.

Construction of Pfam:

pfamA is based on a sequence database called pfamseq - pfamseq 11 is based on swissprot 41.25 and SP-TrEMBL 24.14. pfamB is constructed from PRODOM 2002.1.

Command-line options for LoadPfam:

It is wise to do a test run of LoadPfam first. With the option --parse_only the plugin does not insert any rows into the database. This differs from leaving off --commit, where the plugin inserts the data but the rows are removed at the end; hence --parse_only is much, much faster. To test, run this:

ga +create GUS::Common::Plugin::LoadPfam --parse_only --flat_file=/my/path/to/Pfam-A.full --release=11

The only thing that should go wrong, if anything, is that a database mentioned in Pfam-A.full is not in the tables ExternalDatabase and ExternalDatabaseRelease. You will have to add the database before continuing. Note: the --release option refers to the release of Pfam you are loading. Change the number to match your download, as the value shown in this document will go out of date.
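
To see in advance which databases a flat file refers to, the names can be pulled out of its cross-reference lines. The following is a minimal sketch and not part of LoadPfam: list_pfam_dbs is a hypothetical helper, and it assumes the cross-references sit on lines of the form "#=GF DR   PROSITE; PDOC00017;", as in Stockholm-format Pfam flat files. Check a few lines of your own file before relying on it.

```shell
# Hypothetical helper (not shipped with GUS): print the distinct database
# names found on "#=GF DR" lines of a Pfam flat file, one per line.
# Assumed line layout: "#=GF DR   <DATABASE>; <identifier>;"
list_pfam_dbs() {
    # $3 is the database name with a trailing ";" which sub() strips.
    awk '$1 == "#=GF" && $2 == "DR" { sub(/;.*$/, "", $3); print $3 }' "$1" |
        LC_ALL=C sort -u
}
```

Calling list_pfam_dbs /my/path/to/Pfam-A.full prints the names, which can then be compared by hand with the contents of ExternalDatabase. The names used in the flat file may not match the names stored in GUS exactly.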

If everything went OK, you can commit:

ga +create GUS::Common::Plugin::LoadPfam --flat_file=/my/path/to/Pfam-A.full --release=11 --commit