GrailEXP FAQ Section 7: Installation and Customization of GrailEXP
GrailEXP runs on any standard UNIX platform (Alpha OSF1,
i686 Linux, Sun Solaris, SGI IRIX).
A single workstation with 128MB of memory can run
GrailEXP, provided the database is partitioned into
sufficiently small pieces. GrailEXP requires PERL
(Perl 5 at a minimum) in order to run. The parallel
search additionally requires ucspi-tcp.
At the top level is the main grailexp PERL script.
In addition, GrailEXP contains the following files/directories:
- bin: Contains the perceval, galahad, gawain, gxpfetch,
and gxpindex PERL scripts and the platform-specific perceval,
galahad, and gawain binaries.
- blast: Contains the platform-independent blastall
and formatdb scripts, as well as the platform-specific blastall
and formatdb binaries.
- db: An empty directory where you can build your
database.
- doc: Contains the GrailEXP v3.2 FAQ.
- generation: Contains Generation data files for non-human/mouse organisms.
- parallel: Contains the paralleldb, pfetch, pindex, pformatdb,
gxptalk, tcpclient, and tcpserver PERL scripts. (Platform-specific
binaries for tcpclient and tcpserver are not provided).
- parsers: Contains parsing scripts that parse the
raw output into other formats.
- repbase: An empty directory in which you can put Repbase or some other repetitive database.
- tmp: An empty directory in which GrailEXP writes files.
The only third-party software included with the
installation is NCBI's blastall and formatdb executables.
Everything else you have to go out and get (as explained
below).
GrailEXP expects PERL to be in /usr/local/bin/perl.
Since GrailEXP is often used across machines of several platforms,
/usr/local/bin/perl was selected as a standard location at
which to find PERL on any platform.
NOTE: PERL by default is often in /usr/bin/perl
on Linux machines. It is highly recommended that you
add a link on these machines from /usr/local/bin/perl to
/usr/bin/perl.
If you want to use PERL somewhere else, you will have
to modify each of the PERL scripts in the grailexp installation
to point to the location of the PERL you wish to use.
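As a rough illustration of that edit, the sketch below rewrites the interpreter line of a single script. It assumes each script begins with a "#!/usr/local/bin/perl" line, which this FAQ implies but does not state outright; the function name repoint_perl is hypothetical and not part of GrailEXP.

```python
from pathlib import Path

# Assumption: GrailEXP scripts start with this interpreter line.
OLD_SHEBANG = "#!/usr/local/bin/perl"


def repoint_perl(script, new_perl):
    """Point one script at a different perl; return True if it was changed."""
    path = Path(script)
    text = path.read_text()
    first, sep, rest = text.partition("\n")
    if not first.startswith(OLD_SHEBANG):
        return False
    # Whatever follows the old path: empty, or flags such as " -w".
    flags = first[len(OLD_SHEBANG):]
    if flags and not flags[0].isspace():
        # A different interpreter such as /usr/local/bin/perl5; leave it alone.
        return False
    path.write_text("#!" + new_perl + flags + sep + rest)
    return True
```

Calling repoint_perl on each script in the installation would apply the change everywhere; keeping a backup copy of the installation first is prudent.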
Every user who will be running GrailEXP from the
command line should set the GRAILEXP environment variable.
It must be set to tell GrailEXP where to look for its
executables, Generation data, indexing/fetching scripts,
etc. This variable should be set to the directory
containing the main grailexp PERL script and
subdirectories. It is highly recommended that
you set this environment variable in your
.cshrc, .tcshrc, or appropriate file.
Everything works off this environment variable. If
you encounter problems with the program not finding
something, the environment variable
is the first thing to check.
Syntax for setting an environment variable in csh:
setenv GRAILEXP /usr/local/grailexp
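A quick sanity check of the variable might look like the sketch below. The subdirectory names come from the file listing earlier in this section; the function name check_grailexp_root and its messages are hypothetical, and GrailEXP itself provides no such checker.

```python
import os
from pathlib import Path

# Subdirectories listed in this FAQ as part of the installation.
SUBDIRS = ["bin", "blast", "db", "doc", "generation",
           "parallel", "parsers", "repbase", "tmp"]


def check_grailexp_root(env=os.environ):
    """Return a list of problems found with the GRAILEXP setting."""
    root = env.get("GRAILEXP")
    if not root:
        return ["GRAILEXP is not set"]
    base = Path(root)
    problems = []
    # The main grailexp PERL script should sit at the top level.
    if not (base / "grailexp").is_file():
        problems.append("missing script: grailexp")
    for name in SUBDIRS:
        if not (base / name).is_dir():
            problems.append("missing directory: " + name)
    return problems
```

An empty list means the variable points at something that at least looks like a GrailEXP installation.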
The naming convention for GrailEXP executables is:
galahad.MACHINE.OS
tcpclient.MACHINE.OS
blastall.MACHINE.OS
etc.
where MACHINE is the output of a uname -m command
and OS is the output of a uname -s command.
The most common extensions are i686.Linux,
sun4u.SunOS, and alpha.OSF1.
First, check the GRAILEXP environment variable and
make sure it is set properly. Next, issue uname -m
and uname -s commands and make sure that the
extension to your executables matches the output of these
commands. If it does not, you may have to rename these
executables as appropriate, or you may be missing a
compilation for that platform.
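The check described above can be sketched in a few lines. Here os.uname() supplies the same values as uname -m and uname -s; the helper names platform_suffix and missing_binaries are hypothetical, and which program names to look for in which directory is left to the caller.

```python
import os
from pathlib import Path


def platform_suffix():
    """Return MACHINE.OS for this host, e.g. i686.Linux."""
    info = os.uname()  # machine ~ `uname -m`, sysname ~ `uname -s`
    return info.machine + "." + info.sysname


def missing_binaries(directory, names, suffix):
    """List expected executables (name.MACHINE.OS) absent from directory."""
    missing = []
    for name in names:
        filename = name + "." + suffix
        if not (Path(directory) / filename).is_file():
            missing.append(filename)
    return missing
```

For example, missing_binaries(bin_dir, ["perceval", "galahad", "gawain"], platform_suffix()) reports which of the three binaries have no build for the current platform.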
Assuming the GRAILEXP environment variable
is set, no further work need be done. You can modify
the blastall script in $GRAILEXP/blast if you so desire,
either to add a multithreading option (-a numthreads)
or to change the timeout.
As anyone who has run many blastall queries knows,
sometimes BLAST hangs inexplicably. In order to make
GrailEXP a reliable tool for automated batch processing,
it was necessary to add this timeout to deal with
these cases. The default timeout on the blastall
script is ONE HOUR. You can change this by
editing the blastall PERL script.
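The general idea of such a timeout wrapper can be sketched as follows. This is not the blastall PERL script shipped with GrailEXP, only an illustration of the technique: the child process is killed if it runs longer than the limit. The function name run_with_timeout is hypothetical; the 3600-second default mirrors the one-hour default mentioned above.

```python
import subprocess


def run_with_timeout(argv, seconds=3600):
    """Run a command; return its exit status, or None if it timed out."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

A caller doing batch processing can treat a None result as a hung search and move on to the next query.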
The only requirement for a repetitive database
is that it be a valid blastable nucleotide
database. This means you must have run formatdb -p F -i my.repdb
on that database. You can do this using the formatdb
executable provided in the $GRAILEXP/blast subdirectory.
Repbase is the most comprehensive up-to-date database of
repetitive elements currently available. It is the
database developed by Jerzy Jurka and Arian Smit
and used with the extremely popular RepeatMasker
program. It is available from the GIRI home page.
Once you have the FASTA version of Repbase,
just cat the files together, move the database
into $GRAILEXP/repbase, and format the database
for use with BLAST.
An example, assuming you have all the Repbase
files in the working directory:
cat *.ref > $GRAILEXP/repbase/repbase
cd $GRAILEXP/repbase
../blast/formatdb -p F -i repbase
Only the .ref files are needed for reliable masking,
as the .app files represent older versions of the same
information.
Check the GRAILEXP environment variable.
Look in the $GRAILEXP/generation subdirectory for
files with the extension .gnr. If none
exist, you do not have any other organisms installed.
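The same lookup can be done programmatically. The sketch below simply lists the .gnr files under the generation subdirectory; the function name installed_generation_files is hypothetical, and no assumption is made about how file names map to organism tags.

```python
from pathlib import Path


def installed_generation_files(grailexp_root):
    """Return the sorted names of the .gnr files under generation/."""
    generation = Path(grailexp_root) / "generation"
    return sorted(p.name for p in generation.glob("*.gnr"))
```

An empty list corresponds to the case described above, where no other organisms are installed.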
GrailEXP expects to search a LIST of databases. The list file
can contain one database or many databases. It allows the
user to maintain multiple databases (e.g. keeping human
and mouse separate, or keeping complete cDNAs separate from
ESTs, etc.) but search them with a single list file.
The list file should simply contain the full pathname
to each search database, one per line.
Some preformatted databases
are available at the ORNL web site.
Each database in the list should be in the GrailEXP
Database Format (GXPDF), an explanation of which follows
in the next section.
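A list file of the kind described above (one full pathname per line) is simple to read. The sketch below is an illustration, not GrailEXP's own parser: skipping blank lines is an assumption, and the function names read_dblist and relative_entries are hypothetical.

```python
from pathlib import Path


def read_dblist(list_path):
    """Return the database paths in a list file, one per non-blank line."""
    paths = []
    for line in Path(list_path).read_text().splitlines():
        line = line.strip()
        if line:  # assumption: blank lines carry no meaning
            paths.append(line)
    return paths


def relative_entries(paths):
    """Return entries that are not full pathnames."""
    return [p for p in paths if not Path(p).is_absolute()]
```

Running relative_entries on the result is a cheap way to catch entries that would only resolve from one particular working directory.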
There are many issues to resolve before installing
the database. Do you want to partition the database into
pieces or maintain it as one huge database? Do you
want to keep mouse separate from human or put them in
a single database? Do you want to put computational
assemblies in with EST fragments? All these issues
must be resolved before you can build your database.
By default, GrailEXP looks for the list of search
databases in $GRAILEXP/db/dblist, so this is the
easiest place to begin.
A database my.db is said to be in GrailEXP Database Format (GXPDF)
if the following conditions are true:
- The file my.db contains only FASTA sequences.
- Each FASTA header is pipe-delimited and of the form
>organism|accession number|database|......,
where organism is the tag for that organism
(e.g. human, mouse, arab, droso) obtained from
doing grailexp --listorgs.
- The database is blastable, i.e. my.db.nhr,
my.db.nin, and my.db.nsq all exist.
- A valid GrailEXP index file my.db.gxp exists.
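The conditions above lend themselves to a small pre-flight check. The sketch below splits a header into its first three fields and reports which companion files are missing for a database. It checks only that the files exist, not that they are valid; the function names are hypothetical, and the accession in the usage note is made up for illustration.

```python
from pathlib import Path

# BLAST files (.nhr, .nin, .nsq) and the GrailEXP index (.gxp).
COMPANION_SUFFIXES = [".nhr", ".nin", ".nsq", ".gxp"]


def split_header(line):
    """Return (organism, accession, database) from a GXPDF FASTA header."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    fields = line[1:].rstrip("\n").split("|")
    if len(fields) < 3:
        raise ValueError("header needs at least three pipe-delimited fields")
    return fields[0], fields[1], fields[2]


def missing_companions(db_path):
    """List the companion files that do not exist for this database."""
    base = str(db_path)
    return [base + s for s in COMPANION_SUFFIXES
            if not Path(base + s).is_file()]
```

For instance, split_header(">human|AA000001|est|some description") yields the organism tag, accession number, and database tag as a tuple.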
The database has to be in FASTA format, with the proper
headers, and be formatdb'ed for use with BLAST and indexed
for use with the Galahad alignment program. Thus
the steps to building a database in GXPDF are:
- Build the database, altering the headers as necessary
to put the organism in the first field, the accession number
in the second field, and a database tag in the third field.
- (Optional) Partition the database into
multiple files.
- Formatdb the database(s) for use with BLAST.
- Index the database(s) for use with Galahad.
- Put the path(s) to the database(s) in any dblist files you
want to use.
Each of these steps is described in detail in the
following sections.
Databases with NCBI headers can be used, as long as you index them and formatdb them. NCBI
headers also maintain the accession number in the second
field. However, the program will not be able to get organism
or database tag information out of the header.
The database tag (3rd field) is not used by the program; it is merely
returned to the user in the alignment output file. NCBI
has a data source tag in the 3rd field (dbj, emb, gb, etc.),
so this works out all right.
The only remaining problem is the first field, which is
expected to contain organism information by Gawain (the
gene assembly program). However, the only way in which
this information is used is to weight alignments from the
matching organism higher in building gene models. This can be
important, as mouse alignments can sometimes create noise that
messes up the human gene models. The way to solve this problem is
to maintain human, mouse, etc. in separate databases,
so that distinguishing between organisms never becomes an
issue.
So the two options are to use NCBI-header databases
but separate them based on organism OR to reformat all
databases into proper GXPDF.
Use the paralleldb script in the $GRAILEXP/parallel
directory. Usage is paralleldb dbfile working_dir part_size,
where dbfile is a FASTA formatted database, working_dir is
the directory where you want to write the partitions, and
part_size is the size in MB of the desired partitions.
Paralleldb will write the partition files, write a dblist
file containing those partitions in the working directory,
formatdb the databases for use with BLAST, and index the
databases for use with Galahad.
Even in serial mode, the program searches the list of
databases in order. Instead of searching a 3GB database
of ESTs in one piece, a single workstation with 128MB of memory
could search the same database as 30 partitions of 100MB each
without any problems. In addition, BLAST is
simply more likely to hang the larger its search
space is. If the database size exceeds what a single
machine can search in one go, you should consider
partitioning your database.
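The arithmetic behind that example is a ceiling division. The sketch below assumes 1GB is counted as 1000MB, as the example above does; the function name partition_count is hypothetical, and paralleldb may split files somewhat differently in practice.

```python
import math


def partition_count(db_size_mb, part_size_mb):
    """Number of partitions needed to hold db_size_mb in part_size_mb pieces."""
    if part_size_mb <= 0:
        raise ValueError("partition size must be positive")
    return math.ceil(db_size_mb / part_size_mb)
```

With these assumptions, partition_count(3000, 100) gives the 30 partitions mentioned above.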
Use $GRAILEXP/blast/formatdb:
formatdb -p F -i my.db
You can formatdb all the databases in a list by
using $GRAILEXP/parallel/pformatdb:
pformatdb my.list
Pformatdb will not exit if a single formatdb fails;
however, it will notify you of the failure upon
completion.
Use $GRAILEXP/bin/gxpindex:
gxpindex my.db
You can gxpindex all the databases in a list by
using $GRAILEXP/parallel/pindex:
pindex my.list
Pindex will not exit if a single gxpindex fails;
however, it will notify you of the failure upon
completion.
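The keep-going behavior of pformatdb and pindex can be pictured with the sketch below. It is not the code of either script, only an illustration of the pattern: run one command per database, remember the failures, and report them at the end. The function name run_over_list is hypothetical.

```python
import subprocess


def run_over_list(command, db_paths):
    """Run command once per database; return the databases that failed."""
    failures = []
    for db in db_paths:
        try:
            status = subprocess.run(command + [db]).returncode
        except OSError:
            # The command itself could not be started.
            status = -1
        if status != 0:
            failures.append(db)  # keep going; report at the end
    return failures
```

For example, run_over_list(["gxpindex"], paths) would attempt every database in the list and return only those whose indexing exited with a nonzero status.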
DoTS is a database of computationally assembled ESTs
and mRNAs developed at the
Computational
Biology and Informatics Laboratory at the University
of Pennsylvania. For more information, see the
allgenes.org
web site.
The TIGR EGAD (Expressed Gene Anatomy Database) transcript
database is a set of
high quality mRNAs. For more information, see the
EGAD home page.
NCBI's RefSeq is an effort to create standard
reference sequences for many kinds of sequence, including
mRNAs. These mRNAs can be located within the
nt database, available via
FTP from NCBI. RefSeq mRNAs have accession numbers
that begin with NM_. For more information on
RefSeq, see the RefSeq home page.
There are many miscellaneous mRNAs in Genbank that
are not present in the dbEST database. The folks at Baylor
have grabbed as many as they could find and assembled them into
the Human Transcript Database.
The FASTA file containing the sequences is downloadable from their web site.
At the core of any GrailEXP search database must be the
ESTs. They still represent an enormous amount of information
(and disinformation). The EST database is available via
FTP from NCBI in three files: est_human,
est_mouse,
and est_others.
For more information on the dbEST database, see the
dbEST home page.
In order to make parallel search work, you will need
to complete the following steps:
- Make sure PERL is at /usr/local/bin/perl on all
machines that will be conducting the search.
- Make sure you have a list of valid GXPDF databases
to search.
- Make sure you have a hostfile containing the
list of machines that will perform the search. By default,
GrailEXP expects this to be in $GRAILEXP/parallel/hostfile.
- Download ucspi-tcp, compile the executables,
and put them in the appropriate location.
- Start up tcpserver on each of the machines that
will be performing the search.
These steps are described in more detail below.
ucspi-tcp is an amazing set of command-line tools for building
TCP/IP applications written by Daniel Bernstein.
It provides an extremely robust
server/client system that can handle multiple connections
without any problems.
ucspi-tcp is obtainable from the ucspi-tcp home page.
Compilation is effortless; just follow the instructions
on the web page. Once you have the executables for
tcpclient and tcpserver compiled, rename
them with the appropriate extensions (see above
if you've forgotten the naming convention) and
place them in the $GRAILEXP/parallel directory.
Start up the tcpserver on each machine with the following
command:
tcpserver mymachine 5600 $GRAILEXP/parallel/gxptalk server &
Or, if you are testing and wish to monitor the server
more closely:
tcpserver -v mymachine 5600 $GRAILEXP/parallel/gxptalk server
By default, GrailEXP uses port 5600. If you wish
to modify this value, you will have to edit the main
grailexp PERL script.
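To confirm that a server is reachable before launching a parallel search, a plain TCP connection test is usually enough. The sketch below is an illustration only; the function name server_is_listening is hypothetical, and 5600 is the default port given above. Note that every connection makes tcpserver start the program it was given, so each probe briefly launches gxptalk on the remote machine.

```python
import socket


def server_is_listening(host, port=5600, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or the name did not resolve.
        return False
```

Looping this over the machines named in the hostfile gives a quick picture of which servers still need to be started.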
Run some of your favorite sequences through and see what you get back!
You can also try the
test examples.
Since databases will vary from installation to installation,
it is not always easy to compare results. If you have any
questions or comments, feel free to contact
grailmail@ornl.gov.
The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov).
This FAQ applies to GrailEXP version 3.2, released February, 2001.
This FAQ was last updated February, 2001.