GrailEXP FAQ Section 3: The Gene Message Alignment Program


What is Galahad?

GALAHAD, GrailEXP's gene message alignment program, has three different tasks it can perform:

A caveat regarding the use of the word "alignment". Throughout this FAQ, it is said that GrailEXP produces "alignments". What it actually provides are boundaries and scores; it does NOT display the actual bases lined up against each other and connected with little pipes. For the purposes of convenience, I use the word "alignment" nonetheless throughout this document to describe what GrailEXP's gene message alignment program returns.

What is a gene message alignment?

A gene message alignment is an alignment between a genomic sequence (in which genes are chopped up into pieces) and a gene message (in which a gene is a single contiguous piece of DNA built from spliced together exons). It differs from a regular global alignment because there is additional information that can help with the alignment process, most notably that the internal "edges" of each alignment piece should fall on standard splice site (GT....AG) boundaries. A gene message alignment program aligns a gene message with a genomic sequence such that the exon/intron boundaries are clearly identified.

What databases are searched by the program?

The program can search any database of gene messages, provided the database is in the GXPDF (GrailEXP Database Format). At the Computational Biology Section at ORNL, GrailEXP searches the following databases:

For more information on obtaining and installing these databases, see Section 7: Installation and Customization of GrailEXP.

How are gene message alignments calculated?

The program reads in a sequence file and a list of Grail Exon Candidates. It replaces all non-exonic portions of the sequence with "n"'s and blasts that sequence against the search database. The program examines the resulting alignments briefly to determine which ESTs/cDNAs are worth examining further. It throws away quite a few ESTs at this stage.

The program then does a second blast, in which it blasts the entire sequence against the EST/cDNAs of interest. It takes these raw BLAST alignments and applies a dynamic programming algorithm to arrange the alignments into gene structures. It does a merge check, in which it locates the pieces that BLAST failed to join and merges those alignments into one. It then optimizes all the edges, left, right, and internal, doing an exhaustive search of all splice sites in the vicinity of each edge. At this stage, the donor splice site scoring system (developed by Victor Olman) is used to help get the correct splice site. The resulting alignments, with exons, introns, and splice sites calculated, are the final gene message alignments.

How does the program handle repetitives?

The program assumes one of the following:

However, the program contains plenty of measures to avoid nonsense alignments. When it does its initial BLAST search, GrailEXP looks for the word "repetitive" in the FASTA header of the sequence that is hit. The program looks for clusters of single alignments in the same location in the sequence that hit portions of many ESTs. If ESTs in this cluster are labeled "repetitive", it throws all those alignments out. No EST is ever eliminated that hits in 2 or more places in what looks like it may be a gene; the program can occasionally be fooled by repetitives that hit in two places in gene-like fashion, but the gene modeling program seldom builds genes out of these alignments. In addition, any single-piece alignments with an EST must cover 50% of that EST, or the alignment is eliminated.

Overall, GrailEXP is very good at "taking out the trash", or eliminating the nonsensical EST alignments that do not indicate genes.

How does the parallel version work?

The gene message database can often be quite large (as of the time of this writing, ours is 2.9GB). GrailEXP searches this database in partitions. In "serial mode", it searches each partition one after the other (although the calls to BLAST can still be multithreaded). This only allows limited speedup, however, as a lot of the "aligning" is done by the GrailEXP search program itself.

The true parallel version simply runs each GrailEXP search separately on different machines, where each task searches a single partition. This has several advantages over running BLAST multithreaded against a huge database, namely that:

Of course, you can also run BLAST multithreaded on partitions! You can do whatever you like; the program's parallel search is fully configurable.

The parallel search is accomplished through TCP/IP (using Daniel Bernstein's ucspi-tcp program). So there is nothing messy in the parallelism like having to use PVM or MPI; each task just runs independently and the master PERL script cats the alignment files together when every task has completed.

What are the advantages of using Blast?

BLAST is fast. Many other programs try to use Smith-Waterman-like algorithms to do these alignments, but this only slows the program down immensely. GrailEXP uses BLAST to get the "ball park" alignments, because it is fast and everyone is familiar with it.

What are the disadvantages of using Blast?

BLAST doesn't look for short pieces. This means any short exon finding must be done by GrailEXP's alignment program. As such, the program is weak at finding internal short exons, although hopefully this will be remedied in future versions.

BLAST does simple repeat filtering that causes it to break good alignments. This is an amazingly nonsensical "feature", as fairly simple algorithms can tell when the pieces should be put back together. An example is that BLAST will align bases 1 to 100 with bases 1 to 100 of a reference, and bases 201 to 300 with bases 201 to 300 of a sequence, all with 100% identity. But bases 101 to 200 will be left out if they are a simple repeat, even if they are 100% identical. One can run BLAST with simple repeat filtering turned off, but this results in an inordinate amount of noise. Ideally, BLAST should filter for simple repeats initially, but then "de-filter" to join fragments together that it incorrectly split.

GrailEXP handles these cases, but the maximum gap allowed for merging two pieces together is 200 bases. So it is possible you may see inexplicably broken alignments (although I have not encountered any with a > 200 base gap where BLAST didn't align anything in that 200 base region, but that doesn't mean it isn't possible).

How does the program deal with repeating genes?

Repeating genes, such as back-to-back-to-back zinc fingers, cause nonsensical alignments to be reported IF the gene consists of multiple exons and the genes are relatively close together. The first exon of gene one will often be joined with the second exon of gene two, etc. This happens because GrailEXP uses dynamic programming to pick the "best" alignments, so those with the highest percent identity get joined together (although distance also plays a role; it's not going to align part of one zinc finger with part of another 100KB away).

This is a known weakness of the program, that repeating genes show up as a big mess of contradictory alignments. Hopefully, this will be remedied by a "messy region" recognizer in future versions, which will determine when it is examining a region containing repetitive genes and order them correctly.

How does the program handle short exons?

GrailEXP can identify exons as short as 10 bases (although this happens when the BLAST alignment is longer than this and gets shrunk to align to splice sites). It is likely that GrailEXP could recognize short exons very well, first determining the possible presence of a short exon through detection of a "bump" in the alignment, and then through checking the donor splice site score to see if it's good. This has simply not been done to date, however, so the current version does not recognize short exons well. It does put its alignment pieces in the proper frames, though, so missing a short exon will only cause a slight ripple in the resulting translation.

GrailEXP finds short exons (as short as 5 bases) at the edges of the alignments (with the criterion that the bases in the message must exactly match the genomic sequence). This is critical for aligning CDS portions of transcripts with genomic sequence, where one may have very short exons at the edges of the gene.

Can I align just one cDNA with one genomic sequence?

Yes. GrailEXP does this very quickly. See Section 6: Syntax for Command-Line GrailEXP for instructions on how to do this.

Can I use exons from other exon prediction programs such as Genscan?

Yes. GrailEXP knows how to read Genscan output files. You can use exons from any other program you want also, as long as you put them in the GrailEXP exon format or the Genscan output format. For a description of the Grail exon format, see Section 2: The Exon Prediction Program.

Do I have to use exons as seeds or can I align with the whole sequence?

No, you do not have to supply an exon candidate file. However, if you do not supply one, the program will align the entire sequence with the search database, which can take an extremely long time. If you do not supply an exon file, you should make absolutely certain that you are submitting a repeat-masked sequence. Submitting an unrepeat-masked sequence with no exon file to the alignment program will result in an explosion of hits.

How does the program compare to other EST alignment tools?

Speed and flexibility are its primary advantages (not to mention its accuracy, which is at least as good as whatever else is out there). See Section 1: Introduction for more information.

What is the pretty output format?

Here is an example of the pretty output format for Galahad:

--------------------------------------------------------------------------------
GrailEXP v3.2                                  http://compbio.ornl.gov/grailexp/

Authors:  Doug Hyatt, Manesh Shah, Victor Olman, Richard Mural, Ying Xu, and 
  Edward C. Uberbacher, 1996-2001

Reference:  "Automated Gene Identification in Large-Scale Genomic Sequences",
  Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4,
  Number 3, 1997

Sequence:  >GrailEXP Input Sequence (36741 bp)
--------------------------------------------------------------------------------
GALAHAD Gene Alignments (97 located: 25 displayed, 72 redundant)

 Index Std    Begin       End         Accession     Database   Organism   Length

      1 -        320        702         AA633758.1      dbest      human     383

  1 piece    Seq exons = (320..702)
 99% ident   Ref exons = (383..1)

      2 +       3936      35973            HT36257       tigr      human    1337

 12 pieces   Seq exons = (3936..4063,19230..19295,26411..26466,28908..29051,
                          29823..29938,31273..31303,32425..32496,32573..32674,
                          32851..32915,34354..34483,35100..35202,35651..35973)
 99% ident   Ref exons = (1..128,129..195,196..253,254..397,398..515,516..542,
                          543..614,615..716,717..781,782..911,912..1014,
                          1015..1337)

      3 +       3936      35975             HT1292       tigr      human    1498

 12 pieces   Seq exons = (3936..4063,19230..19291,26344..26466,28908..29051,
                          29823..29938,31176..31303,32425..32496,32573..32674,
                          32851..32915,34354..34483,35100..35202,35651..35975)
 99% ident   Ref exons = (1..128,129..190,191..313,314..457,458..573,574..701,
                          702..773,774..875,876..940,941..1070,1071..1173,
                          1174..1498)

      4 +       3939      31296         BE407945.1      dbest      human     756

  6 pieces   Seq exons = (3939..4063,19230..19291,26344..26466,28908..29055,
                          29825..29938,31176..31296)
 96% ident   Ref exons = (1..124,125..186,187..309,310..459,460..578,579..697)

      5 +       4028      31253         BE689485.1      dbest      mouse     716

  6 pieces   Seq exons = (4028..4063,19230..19291,26344..26466,28908..29051,
                          29823..29938,31176..31253)
 86% ident   Ref exons = (75..110,111..172,173..295,296..439,440..554,
                          555..632)

      6 +      19230      29939         AW658477.1      dbest      other     567

  4 pieces   Seq exons = (19230..19291,26344..26466,28908..29051,29823..29939)
 89% ident   Ref exons = (122..183,184..306,307..450,451..567)

      7 +      19230      32482         BE877635.1      dbest      human     804

  6 pieces   Seq exons = (19230..19291,26344..26466,28908..29051,29823..29938,
                          31176..31307,32428..32482)
 97% ident   Ref exons = (90..151,152..274,275..418,419..533,534..666,
                          666..724)

      8 +      19242      31278         BF681806.1      dbest      mouse    1144

  5 pieces   Seq exons = (19242..19291,26344..26466,28908..29051,29823..29938,
                          31176..31278)
 86% ident   Ref exons = (124..173,174..296,297..440,441..556,557..659)

      9 +      19242      31278            ET63508       tigr      mouse    1399

  5 pieces   Seq exons = (19242..19291,26344..26466,28908..29051,29823..29938,
                          31176..31278)
 86% ident   Ref exons = (138..187,188..310,311..454,455..570,571..673)

     10 +      26399      31260         BF558069.1      dbest      other     435

  4 pieces   Seq exons = (26399..26466,28908..29051,29823..29938,31176..31260)
 87% ident   Ref exons = (20..87,88..231,232..347,348..432)

     11 +      26404      32497         AW659603.1      dbest      other     544

  5 pieces   Seq exons = (26404..26466,28908..29051,29823..29938,31176..31303,
                          32425..32497)
 89% ident   Ref exons = (2..64,65..208,209..324,325..452,453..525)

     12 +      26457      32484         AA251421.1      dbest      human     471

  5 pieces   Seq exons = (26457..26466,28908..29051,29823..29938,31176..31312,
                          32428..32484)
 96% ident   Ref exons = (1..10,11..152,153..268,269..404,403..463)

     13 +      31276      32916         AA354600.1      dbest      human     348

  3 pieces   Seq exons = (31276..31303,32425..32674,32851..32916)
 98% ident   Ref exons = (1..28,29..278,279..344)

     14 +      32440      32619         BF550639.1      dbest      other     355

  2 pieces   Seq exons = (32440..32496,32573..32619)
 90% ident   Ref exons = (202..258,259..305)

     15 +      32440      32620         AA871702.1      dbest      mouse     567

  2 pieces   Seq exons = (32440..32496,32573..32620)
 86% ident   Ref exons = (185..241,242..289)

     16 +      32440      32620         AA871917.1      dbest      mouse     645

  2 pieces   Seq exons = (32440..32496,32573..32620)
 87% ident   Ref exons = (131..187,188..235)

     17 +      32440      32620            ET63508       tigr      mouse    1399

  2 pieces   Seq exons = (32440..32496,32573..32620)
 87% ident   Ref exons = (714..770,771..818)

     18 +      32571      35202         AW631750.1      dbest      other     572

  4 pieces   Seq exons = (32571..32674,32851..32915,34354..34483,35100..35202)
 89% ident   Ref exons = (15..118,119..181,182..313,314..416)

     19 +      32578      34473         AI877029.1      dbest      mouse     501

  3 pieces   Seq exons = (32578..32674,32851..32915,34354..34473)
 86% ident   Ref exons = (12..108,109..171,172..293)

     20 +      32597      35113         AW142307.1      dbest      other     586

  4 pieces   Seq exons = (32597..32674,32851..32915,34354..34483,35100..35113)
 89% ident   Ref exons = (10..86,87..151,151..281,282..295)

     21 +      32657      34473         AI152550.1      dbest      mouse     592

  3 pieces   Seq exons = (32657..32674,32851..32915,34354..34473)
 92% ident   Ref exons = (51..68,69..131,132..253)

     22 +      32852      34473         AA646681.1      dbest      mouse     628

  2 pieces   Seq exons = (32852..32915,34354..34473)
 91% ident   Ref exons = (245..306,307..428)

     23 +      32852      34473         AA871917.1      dbest      mouse     645

  2 pieces   Seq exons = (32852..32915,34354..34473)
 91% ident   Ref exons = (291..352,353..474)

     24 +      32852      34473            ET63508       tigr      mouse    1399

  2 pieces   Seq exons = (32852..32915,34354..34473)
 91% ident   Ref exons = (874..935,936..1057)

     25 +      34353      35976         AV735384.1      dbest      human     567

  3 pieces   Seq exons = (34353..34483,35100..35202,35651..35976)
 97% ident   Ref exons = (4..130,131..232,233..553)

--------------------------------------------------------------------------------
Unlike the GCA and raw output formats, the pretty output does redundancy checking and only shows alignments that provide unique information. Seq exons indicate the bounds of the predicted exons within the genomic sequence. Ref exons indicate the bounds of the predicted exons within the reference sequence (mRNA, EST, cDNA).

What is the raw output format?

An example of the raw output of the GrailEXP genomic alignment program follows:

begin alignments
 tigr human HT1292 none 1498 100 5 0 12 3936 4063 1 128
 tigr human HT1292 none 1498 100 12 5 6 19230 19291 129 190
 tigr human HT1292 none 1498 100 6 12 2 26344 26466 191 313
 tigr human HT1292 none 1498 100 2 6 7 28908 29051 314 457
 tigr human HT1292 none 1498 100 7 2 3 29823 29942 458 572
 tigr human HT1292 none 1498 95 3 7 10 31176 31307 573 699
 tigr human HT1292 none 1498 100 10 3 9 32425 32496 700 773
 tigr human HT1292 none 1498 100 9 10 11 32573 32674 774 875
 tigr human HT1292 none 1498 98 11 9 4 32851 32915 876 940
 tigr human HT1292 none 1498 100 4 11 8 34354 34483 941 1070
 tigr human HT1292 none 1498 100 8 4 1 35100 35202 1071 1173
 tigr human HT1292 none 1498 100 1 8 0 35651 35975 1174 1498
 genbank human K02567.1 none 1478 100 7 0 12 3955 4063 1 109
 genbank human K02567.1 none 1478 100 12 7 5 19230 19291 110 171
 genbank human K02567.1 none 1478 99 5 12 2 26344 26466 172 294
 genbank human K02567.1 none 1478 100 2 5 6 28908 29051 295 438
 genbank human K02567.1 none 1478 100 6 2 4 29823 29942 439 553
 genbank human K02567.1 none 1478 94 4 6 10 31176 31307 554 680
 genbank human K02567.1 none 1478 100 10 4 9 32425 32496 681 754
 genbank human K02567.1 none 1478 100 9 10 11 32573 32674 755 856
 genbank human K02567.1 none 1478 98 11 9 3 32851 32915 857 921
 genbank human K02567.1 none 1478 100 3 11 8 34354 34483 922 1051
 genbank human K02567.1 none 1478 100 8 3 1 35100 35202 1052 1154
 genbank human K02567.1 none 1478 99 1 8 0 35651 35975 1155 1478
end alignments

Reverse strand alignments are indicated by the final two fields being in DESCENDING order. All alignments are shown from the forward strand's perspective, in ASCENDING order. The third and fourth fields from the back will always be in ascending order.

The fields are, in order:
Database Tag, Organism Tag, Accession Number, Clone Name, Length, Percent ID, Cont Ind, Cont Left, Cont Right, Query Begin, Query End, Ref Begin, Ref End

The fields are separated by spaces. A leading space begins each data line. The alignment list is enclosed by "begin alignments" and "end alignments" tags.

What is the Genome Channel output format?

Here is a valid Genome Channel file representing the same information as in the above two sections:


align_grailexp_v3=1|HT1292|tigr|human|1498|12|f|3936|4063|1|128|100|19230|19291|129|190|100|26344|26466|191|313|100|28908|29051|314|457|100|29823|29942|458|572|100|31176|31307|573|699|95|32425|32496|700|773|100|32573|32674|774|875|100|32851|32915|876|940|98|34354|34483|941|1070|100|35100|35202|1071|1173|100|35651|35975|1174|1498|100
align_grailexp_v3=2|K02567.1|genbank|human|1478|12|f|3955|4063|1|109|100|19230|19291|110|171|100|26344|26466|172|294|99|28908|29051|295|438|100|29823|29942|439|553|100|31176|31307|554|680|94|32425|32496|681|754|100|32573|32674|755|856|100|32851|32915|857|921|98|34354|34483|922|1051|100|35100|35202|1052|1154|100|35651|35975|1155|1478|99

A technical description of the format:


align_grailexp_v3=id|acc|db|org|len|num hits|strand|(hit list)...

  id = id
  acc = accession number of db element hit
  db = database of db element hit (dots, tigr, refseq, dbest, swissprot)
  org = organism of element if known
  len = length of the element
  num hits = number of hits
  strand = f or r

  hit consists of:    seqb|seqe|refb|refe|pctid
    (Boundaries of the alignment and percent identity)

Genome Channel output format is always indexed from the TARGET STRAND'S PERSPECTIVE. This means all forward strand objects are indexed relative to the forward strand, and all reverse strand objects are indexed relative to the reverse strand.

What is planned for future versions?

The alignment module is an area of very active development, so several improvements are planned in the near future, including improved recognition of short exons and of areas with repeating genes.


The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov). This FAQ applies to GrailEXP version 3.2, released February, 2001. This FAQ was last updated February, 2001.