A supported plugin must be useful to sites other than the site that developed it. It also must run at other sites without modification.
Plugin names begin with one of four verbs:
insert if the plugin inserts only
delete if the plugin deletes only
update if the plugin updates only
load if the plugin does any two or more of insert, delete or update
Plugin names are concise
for example, a plugin named InsertNewSequences is not concise because Insert and New are redundant
Plugin names are precise
for example, a plugin named InsertData is far too general. The name should reflect the type of data inserted
if a Plugin expects exactly one file type, that file type should be in the name. For example, InsertFastaSequences.
Plugin names are accurate
for example, a plugin named InsertExternalSequences is inaccurate if it can also insert internally generated sequences. A better name would be InsertSequences.
Plugins never directly use (hard-code) GUS primary keys, either in the body of the code or for command line argument values. Instead they use semantically meaningful alternate keys. The reason that plugins cannot use primary keys in their code is that doing so makes the plugin site specific, not portable. The reason they cannot use primary keys as values in their command line arguments is that plugins are often incorporated as steps in a pipeline (using the GUS Pipeline API described elsewhere). The pipelines should be semantically transparent so that people both on site and externally who look at the pipeline will understand it.
Some sites augment GUS with their own application specific tables. These are not permitted in supported plugins.
The name of the argument should be concise and precise
The Plugin API provides a means for you to declare arguments of different types, such as integers, strings and files (see the section called “Declaring the plugin's command line arguments”). Use the most appropriate type. For example, don't use a string for a file argument.
Use camel caps (eg matrixFile) not underscores (eg matrix_file) in the names of the arguments.
The Plugin API provides a means for you to document the plugin and its arguments (see the section called “Declaring the Plugin's Documentation”). Be thorough in your documentation.
The GUS object layer assists in writing clean plugin code. The guidelines for its use are:
When writing data to the database, use GUS objects when possible. Avoid using SQL directly.
When forming a relationship between two objects, use the
setParent() or
setChildren() method. Do not
explicitly set the foreign keys of the objects.
The GUS objects are good at writing data to the database. That is because they allow you to build up a tree structure of objects and then to simply submit the root. However they are not as useful at reading the database. You can only read one object at a time (more on this in the Guide to GUS Objects). For this reason, you will need to use SQL to efficiently read data from the database as needed by your plugin.
This is how a typical database access looks:
Example 1.4. Typical Database Access
my $sql =
"SELECT $self->{primaryKeyColumn}, $self->{termColumn}
FROM $self->{table}";
my $queryHandle = $self->getQueryHandle();
my $statementHandle = $queryHandle->prepareAndExecute($sql);
my %vocabFromDb;
while (my ($primaryKey, $term) = $statementHandle->fetchrow_array()) {
  $vocabFromDb{$term} = $primaryKey;
}

The SQL is formatted on multiple lines for clarity (Perl allows
this), and the SQL keywords are upper case. The Plugin API provides a
method to easily get a query handle, returning a
GUS::ObjRelP::DbiDbHandle. That
object provides an easy-to-use method that prepares and executes the
SQL.
The Plugin API offers a set of logging methods. They print to standard error. Use these and no other means of writing out logging messages.
Do not write to standard output. If your plugin generates data (such as a list of IDs already loaded, for restart) write it to a file.
Less is more with commenting. Comment only the non-obvious. For
example, do not comment a method called
getSize() with a comment
# gets the size. Most methods should
need no commenting, as they should be self-explanatory. In many cases,
if you find that you need to comment because something non-obvious
needs explaining, that is a red flag indicating that your code might
need simplification.
There is only one permissible way to handle errors: call
die(). Never log errors or write them
to standard error or standard out. Doing that masks the error (the
logs are not read reliably) so that what is really happening is the
plugin is failing silently. Causing the plugin to die forces the user
of the plugin or its developer to fix the problem.
When you call die, give it an informative message, including the values of the suspicious variables. Surround the variables in single quotes so that white space errors will be apparent. Provide enough information so that the user can track down the source of the problem in the input files.
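A minimal sketch of such a message follows. The helper checkSequenceLength() and its arguments are hypothetical, invented here for illustration; only the pattern of quoting the suspicious values and naming the input file and line is the point.

```perl
use strict;
use warnings;

# Hypothetical helper: validates one parsed input value and dies with the
# offending values in single quotes, so stray white space is visible.
sub checkSequenceLength {
  my ($inputSourceId, $inputLength, $inputFile, $lineNumber) = @_;

  if ($inputLength !~ /^\d+$/) {
    die "Invalid sequence length '$inputLength' for source id "
      . "'$inputSourceId' in file '$inputFile' at line $lineNumber\n";
  }
  return $inputLength;
}
```

With a trailing space in the input, the message shows the value as '120 ', which makes the white space error easy to spot.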
If you would like your program to continue past errors, then dedicate a file or directory to hold descriptions of the errors. The user will know that he or she must look there for a list of inputs that caused problems. Typically you use this strategy if you expect the input to be huge, and don't want to abort it because of a few errors. You may want to include a command line argument giving the number of errors the user will tolerate before the plugin gives up and aborts.
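One way this strategy might look is sketched below. The subroutine processInputs(), the tab-delimited input format, and the $maxErrors value (which would come from a command line argument) are all assumptions made for illustration; a real plugin would take these from the Plugin API.

```perl
use strict;
use warnings;

# Hypothetical sketch: records bad input lines in a dedicated error file
# and aborts once more than $maxErrors of them have been seen.
sub processInputs {
  my ($inputLines, $errorFile, $maxErrors) = @_;

  open(my $errorHandle, '>', $errorFile)
    || die "Could not open error file '$errorFile' for writing: $!\n";

  my $errorCount = 0;
  my @acceptedLines;
  foreach my $inputLine (@$inputLines) {
    # expected input looks like: AB123<tab>450
    if ($inputLine =~ /^\w+\t\d+$/) {
      push(@acceptedLines, $inputLine);
    } else {
      print $errorHandle "Unparseable line: '$inputLine'\n";
      $errorCount++;
      if ($errorCount > $maxErrors) {
        close($errorHandle);
        die "More than $maxErrors bad input lines; see '$errorFile'\n";
      }
    }
  }
  close($errorHandle) || die "Could not close error file '$errorFile': $!\n";
  return \@acceptedLines;
}
```

Note that exceeding the tolerance still ends in die(), so the plugin never fails silently.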
Plugins abort. They do so for many reasons. When they do, the user must be able to recover from the failure, one way or another.
A few strategies you could adopt are:
If the plugin is inserting data (rather than inserting and
updating) the plugin can check if an object that is about to be
written to the database is already there. If so, it can skip
that object. Because this checking will slow the plugin down,
the plugin should offer a
restart flag on the command
line that turns that check on.
If the plugin is updating, it can include a command line
argument that takes a list of
row_alg_invocation_ids, one for
each run of the plugin with this dataset. (Each table in GUS has
a row_alg_invocation_id column
to store the identifier of the particular run of a plugin that
put data there. This is part of the automatic tracking that
plugins do.) The plugin can take the same approach as the
previous strategy, but must additionally check that the object
has one of the provided
row_alg_invocation_ids.
The plugin can store in a dedicated file the identifiers of the objects it has already loaded. In this case, the plugin should offer a command line argument to ask for the name of the file.
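The last strategy can be sketched as follows. The subroutine names readLoadedIds() and loadNewInputs() and the one-identifier-per-line file format are assumptions for illustration; the step that would build and submit GUS objects is only marked by a comment.

```perl
use strict;
use warnings;

# Reads the identifiers recorded by a previous run (one per line).
sub readLoadedIds {
  my ($restartFile) = @_;

  my %loadedIds;
  if (-e $restartFile) {
    open(my $restartHandle, '<', $restartFile)
      || die "Could not open restart file '$restartFile' for reading: $!\n";
    while (my $line = <$restartHandle>) {
      chomp($line);
      $loadedIds{$line} = 1;
    }
    close($restartHandle);
  }
  return \%loadedIds;
}

# Handles only the inputs not already recorded, appending each new
# identifier to the restart file as soon as it has been handled.
sub loadNewInputs {
  my ($inputSourceIds, $restartFile) = @_;

  my $loadedIds = readLoadedIds($restartFile);
  open(my $restartHandle, '>>', $restartFile)
    || die "Could not open restart file '$restartFile' for appending: $!\n";

  my @newlyLoaded;
  foreach my $inputSourceId (@$inputSourceIds) {
    if (!$loadedIds->{$inputSourceId}) {
      # A real plugin would build and submit GUS objects here.
      push(@newlyLoaded, $inputSourceId);
      print $restartHandle "$inputSourceId\n";
    }
  }
  close($restartHandle)
    || die "Could not close restart file '$restartFile': $!\n";
  return \@newlyLoaded;
}
```

Running loadNewInputs() a second time with the same restart file skips everything the first run recorded.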
A very common error is to open files without dying if the open fails. The proper way to open a file is like this:
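A minimal sketch of that idiom, assuming a hypothetical wrapper openForReading() and a file name held in $inputFile:

```perl
use strict;
use warnings;

# Opens a file for reading and dies, quoting the file name and reporting
# the system error, if the open fails.
sub openForReading {
  my ($inputFile) = @_;

  open(my $inputHandle, '<', $inputFile)
    || die "Could not open file '$inputFile' for reading: $!\n";
  return $inputHandle;
}
```

The parentheses around the arguments to open() matter here: without them, || would bind to the file name rather than to the result of open().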
One of the most time consuming operations in a plugin is accessing the database. The typical flow of a plugin is that it reads the input and as it goes it constructs and submits GUS objects to the database. Some plugins additionally need to read data from the database to do their work. While it is often impossible to avoid writing to the database with each new input value, it is often possible to avoid reading it.
If most of the values of a table (or tables) will be needed then the plugin should read the table (or tables) outside the loop that processes the input. It should store the values in a hash keyed on a primary or alternate key. Storing multiple megabytes of data this way in memory should not be a problem. Gigabytes may well be a problem.
If only a few values from the table will be needed then an
alternative caching strategy may be appropriate. Wrap the access to
the values in a getter method, such as
getGeneType(). This method stores
values it gets in a hash. When the method is called, it first looks in
the hash for the value. If the hash does not have it, then the method
reads the database and stores the value in the hash to optimize future
accesses.
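The caching getter described above might be sketched like this. The class GeneTypeCache and its lookup code reference are hypothetical stand-ins so that the example runs without a database; in a plugin, the lookup would be an SQL query and getGeneType() would be a method of the plugin itself.

```perl
use strict;
use warnings;

package GeneTypeCache;   # hypothetical class, for illustration only

sub new {
  my ($class, $lookup) = @_;

  # $lookup is a code reference standing in for the database query
  my $self = { lookup => $lookup, geneTypes => {}, queryCount => 0 };
  return bless($self, $class);
}

# Returns the gene type, reading the "database" only on the first request
# for a given gene name.
sub getGeneType {
  my ($self, $geneName) = @_;

  if (!exists($self->{geneTypes}->{$geneName})) {
    $self->{queryCount}++;
    $self->{geneTypes}->{$geneName} = $self->{lookup}->($geneName);
  }
  return $self->{geneTypes}->{$geneName};
}

package main;
```

The queryCount field exists only to make the saving visible; repeated calls for the same gene name leave it unchanged.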
Complicated regular expressions should be accompanied by a
comment line that shows what the input string looks like. It is
otherwise often very difficult to figure out what the regular
expression is doing. Long regular expressions should be split into
multiple lines with embedded whitespace and comments using the
/x modifier. See the "Readability"
section of Maintaining
Regular Expressions.
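As an illustration, here is a sketch of a regular expression written this way. The definition line format and the subroutine parseDefinitionLine() are invented for the example.

```perl
use strict;
use warnings;

# The input line is assumed to look like this (hypothetical format):
#   >AB123456.1 length=1520 organism=Plasmodium falciparum
sub parseDefinitionLine {
  my ($inputLine) = @_;

  if ($inputLine =~ /^ >
                      (\S+)              # source id, e.g. AB123456.1
                      \s+ length=(\d+)   # sequence length
                      \s+ organism=(.+)  # organism name, to end of line
                      $/x) {
    return ($1, $2, $3);
  }
  die "Could not parse definition line '$inputLine'\n";
}
```

Under /x, white space in the pattern is ignored, so literal spaces in the input are matched with \s rather than typed directly.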
Choosing good names for your variables and methods makes your code much more understandable. To make your code clear:
Variable and method names should start with a lower case letter.
Use "camel caps"
($sequenceLength) for variable
names and method names, not underscores
($sequence_length).
Variable names should be named after the type of data they
hold (unless there is more than one variable for a given type,
in which case they are qualified). For example a good name for a
sequence would be
$sequence
In plugins, there are typically:
strings parsed from the input
objects created from the input (if you are using an object based parser such as Bioperl)
GUS object layer objects
Input objects or strings should be named with 'input' as a
prefix. For example:
$inputSequence
Object layer objects are named for their type, for example
$NASequence
Method names should be self-explanatory. A bad method name
would be process() (what is
being processed?). Don't "save keystrokes" with short names. If
being self-explanatory requires using a long name, then use a
long name.
Use "structured programming" when you create your methods:
No method should ever be longer than one screen. If it is, refactor part of it into its own method.
Never repeat code. Repeated code must be in a method.
Some methods in the API are marked as deprecated. Do not use them. They are for backward compatibility only.
Use C and Java like syntax. Do not use weird Perl specific syntax.
Indenting must be spaces not tabs. Two or four spaces are acceptable.
Use $self to refer to the
object itself
Declare method arguments using this syntax:
my ($self, $sequence, $length) = @_;
Do not use shift.
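A short sketch that follows these conventions is shown here. The class SequenceTrimmer and its method are hypothetical and serve only to show the argument declaration, the use of $self, and two-space indentation.

```perl
use strict;
use warnings;

package SequenceTrimmer;   # hypothetical class, for illustration only

sub new {
  my ($class) = @_;

  my $self = {};
  return bless($self, $class);
}

# Returns the first $length characters of $sequence.
sub trimSequence {
  my ($self, $sequence, $length) = @_;

  return substr($sequence, 0, $length);
}

package main;
```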
A controlled vocabulary (CV) is a restricted set of terms that are allowed values for a data type. They may be simple lists or they may be complex trees, graphs or ontologies. In GUS the CVs fall into two categories: standard CVs such as the Gene Ontology, and small application specific CVs such as ReviewStatus.
The complete list of application specific CVs in the GUS 3.5 schema is:
DoTS.BlatAlignmentQuality
DoTS.GOAssociationInstanceLOE
DoTS.GeneInstanceCategory
DoTS.InteractionType
DoTS.MotifRejectionReason
DoTS.ProteinCategory
DoTS.ProteinInstanceCategory
DoTS.ProteinProteinCategory
DoTS.ProteinPropertyType
DoTS.RNACategory
DoTS.RNAInstanceCategory
DoTS.RNARNACategory
DoTS.RepeatType
SRes.BibRefType
SRes.ReviewStatus
Acquiring a standard CV typically involves downloading files from the CV provider and running a plugin to load it.
Application specific CVs are handled by the plugin that will use the CV. For example, a plugin that inserts bibliographic references will use the SRes.BibRefType CV. It is these plugins that are responsible for making sure that the CV they want to use is in the database.
Plugins that use CVs fall into two categories:
those that hard code the CV
those that do not hard code the CV, but, rather, get it from the input
In the first case, the plugin hard codes the CV in the Perl code.
In the second case, the plugin hard codes only a default. It also offers an optional command line argument that takes a file that contains the CV. If the user of the plugin determines that the input has a different CV than the default, the user will provide such a file.
In both cases, the plugin reads the table in GUS that contains the CV and compares it to the CV it expects to use. If the expected vocab is not found, the plugin updates the table.
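The comparison step might be sketched as follows. The subroutine findMissingTerms() is hypothetical; it assumes the database vocabulary has already been read into a hash of term to primary key, as in Example 1.4, and it leaves the actual update of the table to the plugin.

```perl
use strict;
use warnings;

# Returns a reference to the list of expected terms that are absent from
# the vocabulary read from the database ($vocabFromDb maps term => key).
sub findMissingTerms {
  my ($expectedTerms, $vocabFromDb) = @_;

  my @missingTerms;
  foreach my $term (@$expectedTerms) {
    if (!exists($vocabFromDb->{$term})) {
      push(@missingTerms, $term);
    }
  }
  return \@missingTerms;
}
```

The plugin would then insert one row for each term in the returned list.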
GUS is a data warehouse so it is very common for plugins to load into GUS data from another source. Whether the source is external or in-house, tracking its origin is often required. The tables in GUS that handle this are SRes.ExternalDatabase and SRes.ExternalDatabaseRelease. The former describes the database, eg, PFam, and the latter describes the particular release of the database that is being loaded, eg, 1.0.0. The data loaded will have a foreign key to the database release, which in turn has a foreign key to the database.
In order to create that relationship, the plugin must know the
primary key of the external database release. To accomplish this, the
plugin takes as command line arguments the name of the database and
its release. It does not take the primary key of the external database
release (that violates the plugin standard). The plugin passes that
information to the API subroutine
getExtDbRlsId($dbName,
$dbVersion).
If the plugin is inserting the dataset as opposed to updating
it, create new entries for the database and the release by using the
plugins
GUS::Supported::Plugin::InsertExternalDatabase
and
GUS::Supported::Plugin::InsertExternalDatabaseRls.