GrailEXP FAQ Section 7: Installation and Customization of GrailEXP


What are the supported platforms for GrailEXP?

GrailEXP runs on any standard UNIX platform (Alpha OSF1, i686 Linux, Sun Solaris, SGI IRIX).

What hardware/software is required to run GrailEXP?

A single workstation with 128MB of memory can run GrailEXP, provided the database is partitioned into sufficiently small pieces. GrailEXP requires PERL (version 5 or later). The parallel search additionally requires ucspi-tcp.

What comes in the installation package?

At the top level is the main grailexp PERL script. In addition, GrailEXP contains the following files/directories:

The only third-party software included with the installation is NCBI's blastall and formatdb executables. Everything else you have to go out and get (as explained below).

How do I tell GrailEXP where PERL is located?

GrailEXP expects PERL to be in /usr/local/bin/perl. Since GrailEXP is often used on machines of several different platforms, /usr/local/bin/perl was selected as a standard location in which to find PERL on any platform.

NOTE: PERL by default is often in /usr/bin/perl on Linux machines. It is highly recommended that you add a link on these machines from /usr/local/bin/perl to /usr/bin/perl.

If you want to use PERL somewhere else, you will have to modify each of the PERL scripts in the grailexp installation to point to the location of the PERL you wish to use.
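
A minimal sketch of the recommended link, written in POSIX sh. The helper name link_perl is hypothetical (it is not part of GrailEXP), and creating a link in /usr/local/bin normally requires root:

```sh
# link_perl SRC DEST: create a symbolic link DEST -> SRC unless DEST exists.
# (Illustrative helper, not part of the GrailEXP distribution.)
link_perl() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        echo "$dest already exists; nothing to do"
    elif [ -x "$src" ]; then
        ln -s "$src" "$dest" && echo "linked $dest -> $src"
    else
        echo "no executable found at $src" >&2
        return 1
    fi
}

# Typical use on a Linux machine (run as root):
#   link_perl /usr/bin/perl /usr/local/bin/perl
```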

What is the GRAILEXP environment variable?

Every user who will be running GrailEXP from the command line should set this environment variable. It tells GrailEXP where to look for its executables, Generation data, indexing/fetching scripts, etc. Set it to the directory containing the main grailexp PERL script and its subdirectories. It is highly recommended that you set this environment variable in your .cshrc, .tcshrc, or other appropriate shell startup file.

Everything works off this environment variable. If you encounter problems with the program not finding something, the environment variable is the first thing to check.

Syntax for setting an environment variable in csh:

setenv GRAILEXP /usr/local/grailexp
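
For users of sh or bash rather than csh, the following sketch shows the equivalent setting and a quick sanity check. The helper check_grailexp is hypothetical, and it assumes the main script is a file literally named grailexp at the top of the installation:

```sh
# Bourne-shell equivalent of the csh line above (for .profile or .bashrc):
#   GRAILEXP=/usr/local/grailexp; export GRAILEXP

# check_grailexp: succeed only if GRAILEXP is set and names a directory
# that contains the main grailexp script. (Illustrative helper.)
check_grailexp() {
    if [ -z "${GRAILEXP:-}" ]; then
        echo "GRAILEXP is not set" >&2
        return 1
    fi
    if [ ! -f "$GRAILEXP/grailexp" ]; then
        echo "no grailexp script found in $GRAILEXP" >&2
        return 1
    fi
    echo "GRAILEXP looks usable: $GRAILEXP"
}
```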

What is the naming convention for executables?

The naming convention for GrailEXP executables is:

galahad.MACHINE.OS
tcpclient.MACHINE.OS
blastall.MACHINE.OS
etc.

where MACHINE is the output of uname -m and OS is the output of uname -s. The most common extensions are i686.Linux, sun4u.SunOS, and alpha.OSF1.

What do I do if GrailEXP says it can't find executables?

First, check the GRAILEXP environment variable and make sure it is set properly. Next, issue uname -m and uname -s commands and make sure that the extension to your executables matches the output of these commands. If it does not, you may have to rename these executables as appropriate, or you may be missing a compilation for that platform.
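
The check described above can be sketched in POSIX sh as follows. The helper names gxp_ext and missing_exes are hypothetical; the sketch only builds the MACHINE.OS extension from uname and reports which expected files are absent from a directory you name:

```sh
# gxp_ext: print the MACHINE.OS extension for the current host.
gxp_ext() {
    echo "$(uname -m).$(uname -s)"
}

# missing_exes DIR NAME...: print each DIR/NAME.MACHINE.OS that is not an
# executable file; return non-zero if any are missing.
missing_exes() {
    dir=$1
    shift
    ext=$(gxp_ext)
    rc=0
    for name in "$@"; do
        if [ ! -x "$dir/$name.$ext" ]; then
            echo "$dir/$name.$ext"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   missing_exes "$GRAILEXP/parallel" tcpclient tcpserver
```

Any path it prints is either misnamed for this platform or not compiled for it.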

How do I set up blastall and formatdb?

Assuming the GRAILEXP environment variable is set, no further work need be done. You can modify the blastall script in $GRAILEXP/blast if you so desire, either to add a multithreading option (-a numthreads) or to change the timeout.

How do I change the timeout on blastall?

As anyone who has run many blastall queries knows, sometimes BLAST hangs inexplicably. In order to make GrailEXP a reliable tool for automated batch processing, it was necessary to add this timeout to deal with these cases. The default timeout on the blastall script is ONE HOUR. You can change this by editing the blastall PERL script.
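
To illustrate the general idea of such a timeout, here is a small POSIX sh sketch. It is not taken from the blastall PERL script, whose internals are not described here; run_with_timeout is a hypothetical helper that starts a command, lets a watchdog kill it after a given number of seconds, and returns the command's exit status:

```sh
# run_with_timeout SECONDS CMD [ARG...]: run CMD, killing it if it is still
# running after SECONDS. Returns CMD's exit status (non-zero if killed).
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: sleep, then terminate the command if it has not finished.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    wait "$cmd_pid"
    status=$?
    # The command is done (or was killed); the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null
    return "$status"
}

# One hour is 3600 seconds, e.g. (query and database names are placeholders):
#   run_with_timeout 3600 blastall -p blastn -d my.db -i query.fa
```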

How do I install a repetitive element database?

The only requirement for a repetitive element database is that it be a valid blastable nucleotide database. This means you must have run formatdb -p F -i my.repdb on that database. You can do this using the formatdb executable provided in the $GRAILEXP/blast subdirectory.
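
One way to confirm that a database has been formatted is to look for the three index files that formatdb -p F writes next to it (.nhr, .nin, and .nsq). The sketch below assumes a single-volume database; is_blastable is a hypothetical helper name:

```sh
# is_blastable DB: succeed if DB.nhr, DB.nin and DB.nsq all exist, i.e. the
# files written by "formatdb -p F -i DB" for a single-volume database.
is_blastable() {
    db=$1
    for suffix in nhr nin nsq; do
        if [ ! -f "$db.$suffix" ]; then
            echo "missing $db.$suffix" >&2
            return 1
        fi
    done
    return 0
}

# Example:
#   is_blastable my.repdb || "$GRAILEXP/blast/formatdb" -p F -i my.repdb
```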

What is REPBASE and where do I get it?

REPBASE is the most comprehensive up-to-date database of repetitive elements currently available. It is the database developed by Jerzy Jurka and Arian Smit and used with the extremely popular RepeatMasker program. It is available from the GIRI home page.

Once you have the FASTA version of Repbase, just cat the files together, move the database into $GRAILEXP/repbase, and format the database for use with BLAST.

An example, assuming you have all the Repbase files in the working directory:

grail /home/4ph/tmp> cat *.ref > $GRAILEXP/repbase/repbase
grail /home/4ph/tmp> cd $GRAILEXP/repbase
grail /home/4ph/grailexp/repbase> ../blast/formatdb -p F -i repbase

Only the .ref files are needed for reliable masking, as the .app files represent older versions of the same information.

What do I do if GrailEXP can't find any organisms besides human and mouse?

Check the GRAILEXP environment variable. Look in the $GRAILEXP/generation subdirectory for files with the extension .gnr. If none exist, you do not have any other organisms installed.
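
A short POSIX sh sketch of that check follows. The helper list_organisms is hypothetical; it simply prints the name of each .gnr file without its extension and returns non-zero when there are none:

```sh
# list_organisms [DIR]: print the base name of every .gnr file in DIR
# (default: $GRAILEXP/generation). Returns non-zero if none are found.
list_organisms() {
    dir=${1:-$GRAILEXP/generation}
    found=1
    for f in "$dir"/*.gnr; do
        [ -e "$f" ] || continue     # the glob matched nothing
        basename "$f" .gnr
        found=0
    done
    return $found
}
```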

How do I install a cDNA/EST/mRNA search database?

GrailEXP expects to search a LIST of databases. The list file can contain one database or many databases. It allows the user to maintain multiple databases (e.g. keeping human and mouse separate, or keeping complete cDNAs separate from ESTs, etc.) but search them with a single list file. The list file should simply contain the full pathname to each search database, one per line.

Some preformatted databases are available at the ORNL web site.

Each database in the list should be in the GrailEXP Database Format (GXPDF), an explanation of which follows in the next section.

There are many issues to resolve before installing the database. Do you want to partition the database into pieces or maintain it as one huge database? Do you want to keep mouse separate from human or put them in a single database? Do you want to put computational assemblies in with EST fragments? All these issues must be resolved before you can build your database.

By default, GrailEXP looks for the list of search databases in $GRAILEXP/db/dblist, so this is the easiest place to begin.
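
As a sketch of how such a list file could be written, the hypothetical helper below resolves each database to a full pathname and writes one per line. The database names in the example are placeholders:

```sh
# make_dblist LIST DB...: write the full pathname of each DB to LIST,
# one per line, replacing any previous contents of LIST.
make_dblist() {
    list=$1
    shift
    : > "$list"
    for db in "$@"; do
        dir=$(cd "$(dirname "$db")" && pwd) || return 1
        echo "$dir/$(basename "$db")" >> "$list"
    done
}

# Example:
#   make_dblist "$GRAILEXP/db/dblist" /data/human/est_human /data/mouse/est_mouse
```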

What is the GrailEXP Database Format (GXPDF)?

A database my.db is said to be in GrailEXP Database Format (GXPDF) if the following conditions are true:

The database has to be in FASTA format, with the proper headers, and be formatdb'ed for use with BLAST and indexed for use with the Galahad alignment program. Thus the steps to building a database in GXPDF are:

Each of these steps is described in detail in the following sections.

Can I use NCBI databases with GrailEXP?

Yes, as long as you index them and formatdb them. NCBI headers also maintain the accession number in the second field. However, the program will not be able to get organism or database tag information out of the header.

The database tag (3rd field) is not used by the program; it is merely returned to the user in the alignment output file. NCBI has a data source tag in the 3rd field (dbj, emb, gb, etc.), so this works out all right.

The only remaining problem is the first field, which is expected to contain organism information by Gawain (the gene assembly program). However, the only way in which this information is used is to weight matching alignments higher in building gene models. This can be important, as mouse alignments can sometimes create noise that messes up the human gene models. The way to solve this problem is to maintain human, mouse, etc. in separate databases, so that distinguishing between organisms never becomes an issue.

So the two options are to use NCBI-header databases but separate them based on organism OR to reformat all databases into proper GXPDF.

How do I partition the database for parallel search?

Use the paralleldb script in the $GRAILEXP/parallel directory. Usage is paralleldb dbfile working_dir part_size, where dbfile is a FASTA formatted database, working_dir is the directory where you want to write the partitions, and part_size is the size in MB of the desired partitions.

Paralleldb will write the partition files, write a dblist file containing those partitions in the working directory, formatdb the databases for use with BLAST, and index the databases for use with Galahad.
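
A hedged example of calling paralleldb with the three arguments described above. The wrapper partition_db is hypothetical, as are the example paths; it only checks that the input exists, makes sure the working directory is there, and then hands over to paralleldb:

```sh
# partition_db DBFILE WORKDIR PART_MB: run paralleldb on a FASTA database,
# writing partitions of roughly PART_MB megabytes into WORKDIR.
partition_db() {
    dbfile=$1
    workdir=$2
    part_mb=$3
    if [ ! -f "$dbfile" ]; then
        echo "no such database: $dbfile" >&2
        return 1
    fi
    mkdir -p "$workdir" || return 1
    "$GRAILEXP/parallel/paralleldb" "$dbfile" "$workdir" "$part_mb"
}

# Example: 100MB partitions of a local copy of est_human
#   partition_db /data/est_human /data/est_human_parts 100
```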

Why should I partition the database even if I am not doing a parallel search?

Even in serial mode, the program searches the databases in the list one at a time, in order. Instead of searching a 3GB database of ESTs in one piece, a single workstation with 128MB of memory can search the same 3GB as 30 partitions of 100MB each without any problems. In addition, BLAST is simply more likely to hang the larger its search space is. If the database is too large for a single machine to search in one go, you should think about partitioning it.
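
The arithmetic behind that example is a simple ceiling division, sketched here in POSIX sh with a hypothetical helper. Like the text, it treats 3GB as 3000MB:

```sh
# partitions_needed DB_MB PART_MB: number of partitions of PART_MB megabytes
# needed to hold a database of DB_MB megabytes (rounded up).
partitions_needed() {
    db_mb=$1
    part_mb=$2
    echo $(( ($db_mb + $part_mb - 1) / $part_mb ))
}

partitions_needed 3000 100   # prints 30
```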

How do I formatdb a single database?

Use $GRAILEXP/blast/formatdb:

formatdb -p F -i my.db

How do I pformatdb a list of databases?

You can formatdb all the databases in a list by using $GRAILEXP/parallel/pformatdb:

pformatdb my.list

Pformatdb will not exit if a single formatdb fails; however, it will notify you of the failure upon completion.
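
The "keep going, report failures at the end" behavior can be pictured with the following POSIX sh sketch. It is not the pformatdb source; run_on_list is a hypothetical helper that applies any per-database command to every entry in a list file:

```sh
# run_on_list LIST CMD [ARG...]: run "CMD [ARG...] DB" for each DB named in
# LIST. A failure does not stop the loop; failures are reported at the end.
run_on_list() {
    list=$1
    shift
    failed=""
    while IFS= read -r db; do
        [ -n "$db" ] || continue              # skip blank lines
        if ! "$@" "$db" </dev/null; then
            failed="$failed $db"
        fi
    done < "$list"
    if [ -n "$failed" ]; then
        echo "failed:$failed" >&2
        return 1
    fi
    return 0
}

# Example:
#   run_on_list my.list "$GRAILEXP/blast/formatdb" -p F -i
```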

How do I index a single database?

Use $GRAILEXP/bin/gxpindex:

gxpindex my.db
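
For a single database, the formatdb and gxpindex steps can be chained so that indexing only runs if formatting succeeded. A minimal sketch follows; prepare_db is a hypothetical helper, and the order of the two steps is just a choice made here:

```sh
# prepare_db DB: format DB for BLAST, then index it for Galahad.
# Stops with a non-zero status if formatdb fails.
prepare_db() {
    db=$1
    "$GRAILEXP/blast/formatdb" -p F -i "$db" || return 1
    "$GRAILEXP/bin/gxpindex" "$db"
}

# Example:
#   prepare_db my.db
```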

How do I pindex a list of databases?

You can gxpindex all the databases in a list by using $GRAILEXP/parallel/pindex:

pindex my.list

Pindex will not exit if a single gxpindex fails; however, it will notify you of the failure upon completion.

What is DoTS and where do I get it?

DoTS is a database of computationally assembled ESTs and mRNAs developed at the Computational Biology and Informatics Laboratory at the University of Pennsylvania. For more information, see the allgenes.org web site.

What is the TIGR EGAD transcript database and where do I get it?

The TIGR EGAD (Expressed Gene Anatomy Database) transcript database is a set of high quality mRNAs. For more information, see the EGAD home page.

What is NCBI RefSeq and where do I get it?

NCBI's RefSeq is an effort to create standard reference sequences for the various kinds of sequences, including mRNAs. These mRNAs can be located within the nt database, available via FTP from NCBI. RefSeq mRNAs have accession numbers that begin with NM_. For more information on RefSeq, see the RefSeq home page.
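
If you have a FASTA copy of nt, the NM_ records can be pulled out with a short awk filter such as the sketch below. It assumes NCBI-style headers in which the accession directly follows a | or the leading > (for example >gi|...|ref|NM_...); extract_refseq_mrna is a hypothetical helper name:

```sh
# extract_refseq_mrna [FILE...]: print only the FASTA records whose header
# contains an accession starting with NM_ (reads stdin if no FILE is given).
extract_refseq_mrna() {
    awk '/^>/ { keep = ($0 ~ /[>|]NM_/) } keep' "$@"
}

# Example:
#   extract_refseq_mrna nt > refseq_mrna
```

The resulting file still has to be formatted and indexed like any other search database.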

What is the Baylor Human Transcript database and where do I get it?

There are many miscellaneous mRNAs in Genbank that are not present in the dbEST database. The folks at Baylor have grabbed as many as they could find and assembled them into the Human Transcript Database. The FASTA file containing the sequences is downloadable from their web site.

What is the dbEST database and where do I get it?

At the core of any GrailEXP search database must be the ESTs. They still represent an enormous amount of information (and disinformation). The EST database is available via FTP from NCBI in three files: est_human, est_mouse, and est_others. For more information on the dbEST database, see the dbEST home page.

How do I make parallel search work?

In order to make parallel search work, you will need to complete the following steps:

These steps are described in more detail below.

What is ucspi-tcp and where do I get it?

ucspi-tcp is an amazing set of command-line tools for building TCP/IP applications written by Daniel Bernstein. It provides an extremely robust server/client system that can handle multiple connections without any problems.

ucspi-tcp is obtainable from the ucspi-tcp home page.

How do I compile and install tcpclient/tcpserver?

Compilation is effortless; just follow the instructions on the web page. Once you have the executables for tcpclient and tcpserver compiled, rename them with the appropriate extensions (see above if you've forgotten the naming convention) and place them in the $GRAILEXP/parallel directory.
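
The renaming step might look like the following POSIX sh sketch. The helper install_ucspi is hypothetical; it copies (rather than moves) the two compiled programs from the build directory you give it, adding the MACHINE.OS extension for the current host:

```sh
# install_ucspi SRCDIR [DESTDIR]: copy tcpclient and tcpserver from SRCDIR
# to DESTDIR (default: $GRAILEXP/parallel) as NAME.MACHINE.OS.
install_ucspi() {
    srcdir=$1
    destdir=${2:-$GRAILEXP/parallel}
    ext="$(uname -m).$(uname -s)"
    for prog in tcpclient tcpserver; do
        cp "$srcdir/$prog" "$destdir/$prog.$ext" || return 1
        chmod +x "$destdir/$prog.$ext"
    done
}

# Example (run once on each platform, from wherever you built ucspi-tcp):
#   install_ucspi /path/to/ucspi-tcp-build
```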

How do I start up the parallel servers?

Start up the tcpserver on each machine with the following command:

tcpserver mymachine 5600 $GRAILEXP/parallel/gxptalk server &

Or, if you are testing and wish to monitor the server more closely:

tcpserver -v mymachine 5600 $GRAILEXP/parallel/gxptalk server

By default, GrailEXP uses port 5600. If you wish to modify this value, you will have to edit the main grailexp PERL script.

How do I know if everything is working?

Run some of your favorite sequences through and see what you get back! You can also try the test examples. Since databases will vary from installation to installation, it is not always easy to compare results. If you have any questions or comments, feel free to contact grailmail@ornl.gov.


The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov). This FAQ applies to GrailEXP version 3.2, released February, 2001. This FAQ was last updated February, 2001.