The program then does a second blast, in which it blasts the entire sequence against the EST/cDNAs of interest. It takes these raw BLAST alignments and applies a dynamic programming algorithm to arrange the alignments into gene structures. It does a merge check, in which it locates the pieces that BLAST failed to join and merges those alignments into one. It then optimizes all the edges, left, right, and internal, doing an exhaustive search of all splice sites in the vicinity of each edge. At this stage, the donor splice site scoring system (developed by Victor Olman) is used to help get the correct splice site. The resulting alignments, with exons, introns, and splice sites calculated, are the final gene message alignments.
Overall, GrailEXP is very good at "taking out the trash", or eliminating the nonsensical EST alignments that do not indicate genes.
The true parallel version simply runs each GrailEXP search separately on different machines, where each task searches a single partition. This has several advantages over running BLAST multithreaded against a huge database, namely that:
The parallel search is accomplished through TCP/IP (using Daniel Bernstein's ucspi-tcp program). So there is nothing messy in the parallelism like having to use PVM or MPI; each task just runs independently and the master PERL script cats the alignment files together when every task has completed.
BLAST does simple repeat filtering that causes it to break good alignments. This is an amazingly nonsensical "feature", as fairly simple algorithms can tell when the pieces should be put back together. An example is that BLAST will align bases 1 to 100 with bases 1 to 100 of a reference, and bases 201 to 300 with bases 201 to 300 of a sequence, all with 100% identity. But bases 101 to 200 will be left out if they are a simple repeat, even if they are 100% identical. One can run BLAST with simple repeat filtering turned off, but this results in an inordinate amount of noise. Ideally, BLAST should filter for simple repeats initially, but then "de-filter" to join fragments together that it incorrectly split.
GrailEXP handles these cases, but the maximum gap allowed for merging two pieces together is 200 bases. So it is possible you may see inexplicably broken alignments (although I have not encountered any with a > 200 base gap where BLAST didn't align anything in that 200 base region, but that doesn't mean it isn't possible).
This is a known weakness of the program, that repeating genes show up as a big mess of contradictory alignments. Hopefully, this will be remedied by a "messy region" recognizer in future versions, which will determine when it is examining a region containing repetitive genes and order them correctly.
GrailEXP finds short exons (as short as 5 bases) at the edges of the alignments (with the criterion that the bases in the message must exactly match the genomic sequence). This is critical for aligning CDS portions of transcripts with genomic sequence, where one may have very short exons at the edges of the gene.
--------------------------------------------------------------------------------
GrailEXP v3.2 http://compbio.ornl.gov/grailexp/
Authors: Doug Hyatt, Manesh Shah, Victor Olman, Richard Mural, Ying Xu, and
Edward C. Uberbacher, 1996-2001
Reference: "Automated Gene Identification in Large-Scale Genomic Sequences",
Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4,
Number 3, 1997
Sequence: >GrailEXP Input Sequence (36741 bp)
--------------------------------------------------------------------------------
GALAHAD Gene Alignments (97 located: 25 displayed, 72 redundant)
Index Std Begin End Accession Database Organism Length
1 - 320 702 AA633758.1 dbest human 383
1 piece Seq exons = (320..702)
99% ident Ref exons = (383..1)
2 + 3936 35973 HT36257 tigr human 1337
12 pieces Seq exons = (3936..4063,19230..19295,26411..26466,28908..29051,
29823..29938,31273..31303,32425..32496,32573..32674,
32851..32915,34354..34483,35100..35202,35651..35973)
99% ident Ref exons = (1..128,129..195,196..253,254..397,398..515,516..542,
543..614,615..716,717..781,782..911,912..1014,
1015..1337)
3 + 3936 35975 HT1292 tigr human 1498
12 pieces Seq exons = (3936..4063,19230..19291,26344..26466,28908..29051,
29823..29938,31176..31303,32425..32496,32573..32674,
32851..32915,34354..34483,35100..35202,35651..35975)
99% ident Ref exons = (1..128,129..190,191..313,314..457,458..573,574..701,
702..773,774..875,876..940,941..1070,1071..1173,
1174..1498)
4 + 3939 31296 BE407945.1 dbest human 756
6 pieces Seq exons = (3939..4063,19230..19291,26344..26466,28908..29055,
29825..29938,31176..31296)
96% ident Ref exons = (1..124,125..186,187..309,310..459,460..578,579..697)
5 + 4028 31253 BE689485.1 dbest mouse 716
6 pieces Seq exons = (4028..4063,19230..19291,26344..26466,28908..29051,
29823..29938,31176..31253)
86% ident Ref exons = (75..110,111..172,173..295,296..439,440..554,
555..632)
6 + 19230 29939 AW658477.1 dbest other 567
4 pieces Seq exons = (19230..19291,26344..26466,28908..29051,29823..29939)
89% ident Ref exons = (122..183,184..306,307..450,451..567)
7 + 19230 32482 BE877635.1 dbest human 804
6 pieces Seq exons = (19230..19291,26344..26466,28908..29051,29823..29938,
31176..31307,32428..32482)
97% ident Ref exons = (90..151,152..274,275..418,419..533,534..666,
666..724)
8 + 19242 31278 BF681806.1 dbest mouse 1144
5 pieces Seq exons = (19242..19291,26344..26466,28908..29051,29823..29938,
31176..31278)
86% ident Ref exons = (124..173,174..296,297..440,441..556,557..659)
9 + 19242 31278 ET63508 tigr mouse 1399
5 pieces Seq exons = (19242..19291,26344..26466,28908..29051,29823..29938,
31176..31278)
86% ident Ref exons = (138..187,188..310,311..454,455..570,571..673)
10 + 26399 31260 BF558069.1 dbest other 435
4 pieces Seq exons = (26399..26466,28908..29051,29823..29938,31176..31260)
87% ident Ref exons = (20..87,88..231,232..347,348..432)
11 + 26404 32497 AW659603.1 dbest other 544
5 pieces Seq exons = (26404..26466,28908..29051,29823..29938,31176..31303,
32425..32497)
89% ident Ref exons = (2..64,65..208,209..324,325..452,453..525)
12 + 26457 32484 AA251421.1 dbest human 471
5 pieces Seq exons = (26457..26466,28908..29051,29823..29938,31176..31312,
32428..32484)
96% ident Ref exons = (1..10,11..152,153..268,269..404,403..463)
13 + 31276 32916 AA354600.1 dbest human 348
3 pieces Seq exons = (31276..31303,32425..32674,32851..32916)
98% ident Ref exons = (1..28,29..278,279..344)
14 + 32440 32619 BF550639.1 dbest other 355
2 pieces Seq exons = (32440..32496,32573..32619)
90% ident Ref exons = (202..258,259..305)
15 + 32440 32620 AA871702.1 dbest mouse 567
2 pieces Seq exons = (32440..32496,32573..32620)
86% ident Ref exons = (185..241,242..289)
16 + 32440 32620 AA871917.1 dbest mouse 645
2 pieces Seq exons = (32440..32496,32573..32620)
87% ident Ref exons = (131..187,188..235)
17 + 32440 32620 ET63508 tigr mouse 1399
2 pieces Seq exons = (32440..32496,32573..32620)
87% ident Ref exons = (714..770,771..818)
18 + 32571 35202 AW631750.1 dbest other 572
4 pieces Seq exons = (32571..32674,32851..32915,34354..34483,35100..35202)
89% ident Ref exons = (15..118,119..181,182..313,314..416)
19 + 32578 34473 AI877029.1 dbest mouse 501
3 pieces Seq exons = (32578..32674,32851..32915,34354..34473)
86% ident Ref exons = (12..108,109..171,172..293)
20 + 32597 35113 AW142307.1 dbest other 586
4 pieces Seq exons = (32597..32674,32851..32915,34354..34483,35100..35113)
89% ident Ref exons = (10..86,87..151,151..281,282..295)
21 + 32657 34473 AI152550.1 dbest mouse 592
3 pieces Seq exons = (32657..32674,32851..32915,34354..34473)
92% ident Ref exons = (51..68,69..131,132..253)
22 + 32852 34473 AA646681.1 dbest mouse 628
2 pieces Seq exons = (32852..32915,34354..34473)
91% ident Ref exons = (245..306,307..428)
23 + 32852 34473 AA871917.1 dbest mouse 645
2 pieces Seq exons = (32852..32915,34354..34473)
91% ident Ref exons = (291..352,353..474)
24 + 32852 34473 ET63508 tigr mouse 1399
2 pieces Seq exons = (32852..32915,34354..34473)
91% ident Ref exons = (874..935,936..1057)
25 + 34353 35976 AV735384.1 dbest human 567
3 pieces Seq exons = (34353..34483,35100..35202,35651..35976)
97% ident Ref exons = (4..130,131..232,233..553)
--------------------------------------------------------------------------------
Unlike the GCA and raw output formats, the pretty output
does redundancy checking and only shows alignments that provide
unique information. Seq exons indicate the bounds of
the predicted exons within the genomic sequence. Ref exons
indicate the bounds of the predicted exons within the reference sequence
(mRNA, EST, cDNA).
begin alignments tigr human HT1292 none 1498 100 5 0 12 3936 4063 1 128 tigr human HT1292 none 1498 100 12 5 6 19230 19291 129 190 tigr human HT1292 none 1498 100 6 12 2 26344 26466 191 313 tigr human HT1292 none 1498 100 2 6 7 28908 29051 314 457 tigr human HT1292 none 1498 100 7 2 3 29823 29942 458 572 tigr human HT1292 none 1498 95 3 7 10 31176 31307 573 699 tigr human HT1292 none 1498 100 10 3 9 32425 32496 700 773 tigr human HT1292 none 1498 100 9 10 11 32573 32674 774 875 tigr human HT1292 none 1498 98 11 9 4 32851 32915 876 940 tigr human HT1292 none 1498 100 4 11 8 34354 34483 941 1070 tigr human HT1292 none 1498 100 8 4 1 35100 35202 1071 1173 tigr human HT1292 none 1498 100 1 8 0 35651 35975 1174 1498 genbank human K02567.1 none 1478 100 7 0 12 3955 4063 1 109 genbank human K02567.1 none 1478 100 12 7 5 19230 19291 110 171 genbank human K02567.1 none 1478 99 5 12 2 26344 26466 172 294 genbank human K02567.1 none 1478 100 2 5 6 28908 29051 295 438 genbank human K02567.1 none 1478 100 6 2 4 29823 29942 439 553 genbank human K02567.1 none 1478 94 4 6 10 31176 31307 554 680 genbank human K02567.1 none 1478 100 10 4 9 32425 32496 681 754 genbank human K02567.1 none 1478 100 9 10 11 32573 32674 755 856 genbank human K02567.1 none 1478 98 11 9 3 32851 32915 857 921 genbank human K02567.1 none 1478 100 3 11 8 34354 34483 922 1051 genbank human K02567.1 none 1478 100 8 3 1 35100 35202 1052 1154 genbank human K02567.1 none 1478 99 1 8 0 35651 35975 1155 1478 end alignments
Reverse strand alignments are indicated by the final two fields being in DESCENDING order. All alignments are shown from the forward strand's perspective, in ASCENDING order. The third and fourth fields from the back will always be in ascending order.
The fields are, in order:
Database Tag, Organism Tag, Accession Number, Clone Name, Length, Percent ID, Cont Ind, Cont Left, Cont Right, Query Begin, Query End, Ref Begin, Ref End
The fields are separated by spaces. A leading space begins each data line. The alignment list is enclosed by "begin alignments" and "end alignments" tags.
align_grailexp_v3=1|HT1292|tigr|human|1498|12|f|3936|4063|1|128|100|19230|19291|129|190|100|26344|26466|191|313|100|28908|29051|314|457|100|29823|29942|458|572|100|31176|31307|573|699|95|32425|32496|700|773|100|32573|32674|774|875|100|32851|32915|876|940|98|34354|34483|941|1070|100|35100|35202|1071|1173|100|35651|35975|1174|1498|100 align_grailexp_v3=2|K02567.1|genbank|human|1478|12|f|3955|4063|1|109|100|19230|19291|110|171|100|26344|26466|172|294|99|28908|29051|295|438|100|29823|29942|439|553|100|31176|31307|554|680|94|32425|32496|681|754|100|32573|32674|755|856|100|32851|32915|857|921|98|34354|34483|922|1051|100|35100|35202|1052|1154|100|35651|35975|1155|1478|99
A technical description of the format:
align_grailexp_v3=id|acc|db|org|len|num hits|strand|(hit list)...
id = id
acc = accession number of db element hit
db = database of db element hit (dots, tigr, refseq, dbest, swissprot)
org = organism of element if known
len = length of the element
num hits = number of hits
strand = f or r
hit consists of: seqb|seqe|refb|refe|pctid
(Boundaries of the alignment and percent identity)
Genome Channel output format is always indexed from the TARGET STRAND'S PERSPECTIVE. This means all forward strand objects are indexed relative to the forward strand, and all reverse strand objects are indexed relative to the reverse strand.