GrailEXP FAQ Section 1: Introduction


What is GrailEXP?

GrailEXP is a suite of tools for analyzing DNA sequences. While its primary use is to locate protein-coding genes within DNA sequence, GrailEXP can also locate EST/mRNA alignments, certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements. GrailEXP is a gene finder, an EST alignment utility, an exon prediction program, a promoter/polyA recognizer, a CpG island finder, and a repeat masker, all rolled into one convenient package.

GrailEXP is currently used primarily to analyze human and mouse sequence, but systems for many more organisms are under development, including Arabidopsis, Drosophila, rice, corn, and wheat.

Who developed GrailEXP?

GrailEXP v3.0-v3.2 were developed by Doug Hyatt at Oak Ridge National Laboratory. GrailEXP v2.0 was developed at Oak Ridge National Laboratory by Doug Hyatt, Manesh Shah, Richard Mural, and Edward C. Uberbacher. The human and mouse donor recognition systems for v2.0-v3.2 were written by Victor Olman.

The original GrailEXP (v1.0) was developed by Ying Xu, Manesh Shah, Richard Mural, and Edward C. Uberbacher.

A complete list of credits/acknowledgments is available online. This includes all authors who have worked on all versions.

How is GrailEXP used?

GrailEXP is used at the Computational Biology Section at ORNL to annotate the entire known portion of the human and mouse genomes, including both finished and draft data. GrailEXP is also included in the annotations offered by Celera and DoubleTwist, among others.

GrailEXP provides users with many different capabilities. Some use it just to get an idea of where the coding regions are in their favorite sequence. Others are interested in using it as a sim4-like EST alignment program. Some are interested in comparing human genomic sequence with mouse ESTs. Others are more interested in the development of gene finders for model organisms such as rice, corn, and wheat. Others use the program just to align a single mRNA with a genomic sequence suspected of containing that gene. Finally, some use the program to quickly mask repetitive elements in a sequence. GrailEXP is used in many different ways, depending upon the needs of its users.

Why should I use GrailEXP?

GrailEXP is a single convenient, easy-to-use package. It runs on all the standard UNIX platforms, and has been compiled and tested on all of them (unlike much academic code). Its gene-finding capabilities are among the finest currently available, including recognition of alternative splicing based on EST evidence and clustering of ESTs associated with a particular gene model. Because its alignment utility is based on BLAST, GrailEXP runs much faster than corresponding programs that rely upon Smith-Waterman or other algorithms. Speed, sound engineering, and accuracy are all among the reasons to use GrailEXP. In addition, the known weaknesses of the program are well-documented in this FAQ; there is no attempt to mislead the reader with claims of perfection. Work to eliminate these weaknesses in future versions is ongoing.

If you're interested in obtaining fast, accurate analysis of DNA sequence, then GrailEXP may be the program for you.

What is the difference between Grail and GrailEXP?

Grail 1.3 is a suite of tools developed at the Computational Biology Section at ORNL. It recognizes simple repeats, polyadenylation sites, promoters, exon candidates, genes, and complex repetitive elements. An X client (XGrail) is among the most popular ways to access this system. Grail's gene modeling is based solely on how the exon candidates can be spliced together to form genes, not on any kind of similarity search.

GrailEXP features a Grail-like exon finder (with improved splice site recognition and other minor changes) adapted from the Grail 1.3 code. However, the gene modeling has been vastly improved by searching a database of known gene messages (complete and partial) and building gene models based on the corresponding alignments. It is these two additional powerful tools (the gene message alignment program and the gene assembly program) which distinguish GrailEXP from Grail. In addition, Grail's Smith-Waterman-like complex repeat finder, which is very slow to run, has been replaced by a BLAST-based method which is much faster, although admittedly less precise.

How is GrailEXP structured?

GrailEXP consists of three binaries wrapped by a single Perl script which calls the binaries in the appropriate way based on the user's requests. The first of these programs is the exon prediction program (called Perceval), which contains the latest version of the Grail suite, including repetitive element finding, exon prediction, CpG island location, and other features. The second is the gene message alignment program (called Galahad), which can search against any number of databases or align against a single mRNA/EST, and can optionally use Grail or Genscan exons as a seed. The final program is the gene assembly program (called Gawain), which assembles genes from the EST alignments, recognizes alternative splicing, finds 5' and 3' untranslated regions, and predicts polyA and promoter elements. The programs are called through a single, easy-to-use Perl script called 'grailexp'.
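
As a rough illustration of how a wrapper might decide which of the three programs to call, here is a minimal Python sketch. It is not the actual 'grailexp' script (which is written in Perl). The program names come from the description above, but the request keywords ("exons", "alignments", "genes"), the function name, and the planning rules are assumptions made only for this example.

```python
# Hypothetical sketch: the stage names come from the FAQ, but the request
# keywords and the planning rules below are assumptions for illustration.
STAGES = ("perceval", "galahad", "gawain")


def plan_stages(request, seed_with_exons=True):
    """Return the ordered list of stages needed for a given request.

    request: one of "exons", "alignments", "genes" (hypothetical keywords).
    seed_with_exons: whether the alignment step is seeded with exons
    predicted by the exon finder (assumed to be Perceval here).
    """
    if request == "exons":
        needed = {"perceval"}
    elif request == "alignments":
        needed = {"galahad"}
    elif request == "genes":
        # Gene assembly works from the EST alignments.
        needed = {"galahad", "gawain"}
    else:
        raise ValueError(f"unknown request: {request!r}")
    # Exon-seeded alignment needs the exon predictor to run first.
    if "galahad" in needed and seed_with_exons:
        needed.add("perceval")
    return [stage for stage in STAGES if stage in needed]


if __name__ == "__main__":
    print(plan_stages("genes"))
    print(plan_stages("alignments", seed_with_exons=False))
```

The sketch ignores the case where Genscan exons are supplied as the seed, in which case the exon predictor would presumably not need to run.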

How does GrailEXP compare to other tools?

It is difficult to compare GrailEXP to other tools because the sheer scope of what it does exceeds that of many programs to which it is compared.

As far as raw exon prediction goes, Grail 1.3 performs slightly worse than Genscan on human and mouse. However, the addition of similarity search information causes GrailEXP to outperform Genscan consistently. Genscan is particularly weak in predicting the beginning and end of genes. GrailEXP vastly outperforms Genscan in regions of EST/cDNA similarity. Where there is partial EST information, GrailEXP also outperforms Genscan. In regions where there is no similarity with known genes, Genscan predicts exon edges slightly better than GrailEXP, but GrailEXP still predicts the beginning and end of genes better. In such regions, Genscan tends to create genes that are too long; GrailEXP tends to produce genes that are too short (i.e. it breaks genes). Regardless, Genscan exons can be fed to GrailEXP's alignment program, effectively creating a GenscanEXP, if that is the user's desire.

GrailEXP also has the capability of running reliably on unmasked sequence, since it can filter its exons against a repetitive element database. This means that GrailEXP can still report real exons that overlap repetitive elements only slightly. Genscan lacks this capability; it must be run on a repeat-masked sequence (thus possibly obliterating good splice sites) or else it returns many spurious predictions. One caveat with running on unmasked sequence: there will be a few genes predicted that overlap repetitive elements. However, we consider this to be an acceptable price to pay to obtain the untranslated regions of genes that contain repetitive elements (and would be lost if GrailEXP were run on repeat-masked sequence).
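
The idea of filtering exons against repetitive elements, rather than masking the sequence first, can be sketched in a few lines of Python. This is only an illustration of the general technique, not GrailEXP's actual filter. The function names, the half-open (start, end) coordinate convention, and the 0.5 threshold are all assumptions.

```python
def overlap_length(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def filter_exons(exons, repeats, max_fraction=0.5):
    """Keep exons whose repeat-covered fraction is at most max_fraction.

    exons, repeats: lists of (start, end) tuples on the same sequence.
    Assumes the repeat intervals do not overlap one another; the default
    threshold is illustrative, not a GrailEXP parameter.
    """
    kept = []
    for exon in exons:
        length = exon[1] - exon[0]
        covered = sum(overlap_length(exon, rep) for rep in repeats)
        if length > 0 and covered / length <= max_fraction:
            kept.append(exon)
    return kept


if __name__ == "__main__":
    exons = [(100, 200), (300, 400)]
    repeats = [(190, 250), (290, 390)]
    # The first exon overlaps a repeat by 10 bases and is kept;
    # the second overlaps by 90 bases and is dropped.
    print(filter_exons(exons, repeats))
```

With this approach an exon that touches a repeat only slightly survives, whereas masking the repeat beforehand could remove the splice site at that edge.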

An additional feature of GrailEXP is that it can run reliably on draft sequence. The user can tell the program not to build gene models across gaps unless there is EST/mRNA evidence supporting such a build. This option is lacking in most other gene-finding tools.
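
The rule described above (do not join exons across a gap unless an EST/mRNA alignment supports it) might be sketched as follows. This is a simplified Python illustration under stated assumptions, not GrailEXP's gene assembly algorithm: all exons are treated as candidates for one gene, intervals are half-open (start, end) tuples, and an alignment "supports" a join if it reaches into both exons.

```python
def split_at_gaps(exons, gaps, est_alignments):
    """Group exons into gene models, breaking at unsupported gaps.

    exons, gaps, est_alignments: lists of half-open (start, end) intervals.
    A sequence gap lying between two consecutive exons breaks the model
    unless some EST alignment extends into both exons (an assumed,
    simplified definition of "support").
    """
    models = []
    current = []
    for exon in sorted(exons):
        if current:
            prev = current[-1]
            gap_between = any(
                prev[1] <= gap[0] and gap[1] <= exon[0] for gap in gaps
            )
            supported = any(
                aln[0] < prev[1] and aln[1] > exon[0] for aln in est_alignments
            )
            if gap_between and not supported:
                models.append(current)
                current = []
        current.append(exon)
    if current:
        models.append(current)
    return models


if __name__ == "__main__":
    exons = [(0, 100), (200, 300), (500, 600)]
    gaps = [(120, 150), (350, 400)]
    ests = [(50, 250)]  # spans the first gap only
    print(split_at_gaps(exons, gaps, ests))
```

In the usage example the EST bridges the first gap, so the first two exons stay in one model, while the third exon is split off because nothing supports a join across the second gap.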

GrailEXP was evaluated using the Guigo et al. test set from their recent study on gene prediction accuracy in large-scale genomic sequences. The results of running GrailEXP on this test set are available online.

GrailEXP's gene message alignment program (Galahad) is one of the best publicly available. Because it relies on exon information to seed its search AND because it uses BLAST to get its initial "ball-park" alignments AND because it runs in parallel, the program literally runs hundreds of times faster than programs like sim4. GrailEXP does not recognize short exons well currently, and repeating zinc finger genes produce some crazy-looking alignments. However, its speed and reliability in finding splice sites make it a very useful utility. In addition, the gene assembly program asks the "next step" questions, like which ESTs agree on a gene model, which ESTs indicate an alternative splice, as well as being able to assemble overlapping ESTs on the fly into gene models. A comparison of EST alignment programs was also performed.

As far as finding repetitive elements goes, RepeatMasker provides far more rigorous alignments with a repetitive element database. If your interest is in eliminating repetitive elements from your sequence quickly, however, then GrailEXP is a more useful tool. One caveat: using BLAST to locate repetitive elements is MUCH LESS SENSITIVE than RepeatMasker (by a factor of two). Even so, the BLAST method does work well to eliminate most genes containing repetitive elements.

Can I access GrailEXP online?

Yes. The home page for GrailEXP is http://compbio.ornl.gov/grailexp/. From this page, you can perform GrailEXP analysis, view precomputed annotation, and get the latest information about the program.

Where do I send bug reports/comments/questions?

Send any bugs/comments/questions to grailmail@ornl.gov.

Are there any references for GrailEXP?

Yes. An online list of references is available.

How do I obtain GrailEXP?

GrailEXP v3.2 is obtainable by academic and nonprofit institutions free of charge. Send email to grailmail@ornl.gov if you are interested in obtaining an academic copy.

It is also available commercially through ApoCom Genomics and Genome Informatics Corporation.


The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov). This FAQ applies to GrailEXP version 3.2, released February 2001. This FAQ was last updated February 2001.