The GrailEXP gene assembly module has been rewritten from the ground up for GrailEXP v3.0+. It includes many new capabilities that were lacking in older versions, such as recognition of alternative splicing and the clustering together of all ESTs/cDNAs that support a particular gene model.
The program then uses an elaborate dynamic programming algorithm to build gene models. Each node in the dynamic programming is a Grail exon candidate or an alignment-based exon. Connections between the various nodes are scored, with bonuses for "good" connections and penalties for "bad" connections.
Once the maximally scoring dynamic programming model has been calculated, the program takes the preliminary genes and does some further calculations on them. It can join or break genes at this point, as well as manipulate the exon edges to better match splice sites. Alternative splices are then located and added to the gene table.
Finally, the program looks for the coding begin and end within each mRNA. It evaluates possible starts within the gene and attempts to determine which is the correct one. This is a fallible process; without protein similarity, there is no way to get this 100% correct. In addition, false stop codons in the genomic sequence (due either to sequencing errors or to pseudogenes) can cause incorrect start and stop codons to be identified.
Polyas and promoters are located last. Polya recognition uses a simple Markov model. The promoter system uses a neural net and looks for specific types of signals (TATA, CAAT, GC-box). Not every gene will have a polya or a promoter.
Why doesn't the program identify other possible cases of alternative splicing? The biggest problem is distinguishing between an UNSPLICED product that has slipped into the EST database, an error in the genomic alignment, and a genuine case of alternative splicing. I am skeptical of many reports of alternative splicing, because they do not take this into account. Just because one rogue EST disagrees with the pack does not necessarily mean it is an alternative splice. Even if you have 3 or 4 that say one thing, how can you be certain it isn't just a mistake in the alignment or in the ESTs?
For that reason, GrailEXP currently recognizes only the most obvious cases of alternative splicing. Outside of repeating gene regions (which can be incorrectly reported as regions with a lot of alternative splices), GrailEXP identifies inserted/omitted exons with absolute certainty.
In future versions, more cases of alternative splicing will be recognized. There are issues to resolve first with the algorithm, however, to determine when what one is looking at is really an alternative splice and not just garbage.
The promoter recognition system looks for a TATA or ATA and examines the region around these bases. In particular, the neural net is fed information on GC content, information on CAAT position and the surrounding area, GGGCGG position and the surrounding area, and ATG sequences and the surrounding areas, The neural net evaluates the scores and assigns a total score to the promoter. Again, the promoter must be within 5,000 bases of a start codon of a predicted GrailEXP gene model to be retained and cannot fall within an intron. At most one promoter element is assigned to each gene model.
Running with closed ends, on the other hand, will penalize partial genes at the ends of the sequence and will try to "close off" all gene models. This is ideal if you know a clone only contains one gene you want to examine, and you don't want GrailEXP to predict partial genes on the ends of the sequence.
GrailEXP runs with closed ends by default. This is due to the fact that there is almost always an initial or terminal Grail Exon Candidate that can be predicted near the beginning or end of a sequence. If run in open ended mode, you will see a lot of these single-exon partial genes predicted near the edges of your sequence.
You can run GrailEXP on just the forward strand or just the reverse strand. In fact, the most accurate way to perform gene modeling with GrailEXP may very well be to predict exons/alignments double-stranded, then run the gene assembly program twice, once on the forward strand, once on the reverse strand. Further tests on this subject are needed.
EST/cDNAs are flagged that are inconsistent with gene models but are not currently returned to the user. It would be relatively easy to cluster these EST/cDNAs together using GrailEXP's consistency-check function, but this is not currently being done. Suggestions are welcome on what would be useful to users in future versions. The program is primarily a gene assembly program, not an EST/cDNA clustering program, so its emphasis currently is on clustering together the EST/cDNAs that support a particular gene model and ignoring the rest.
--------------------------------------------------------------------------------
GrailEXP v3.2 http://compbio.ornl.gov/grailexp/
Authors: Doug Hyatt, Manesh Shah, Victor Olman, Richard Mural, Ying Xu, and
Edward C. Uberbacher, 1996-2001
Reference: "Automated Gene Identification in Large-Scale Genomic Sequences",
Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4,
Number 3, 1997
Sequence: >GrailEXP Input Sequence (36741 bp)
--------------------------------------------------------------------------------
GAWAIN Gene Predictions (2 predicted, 2 with database similarity)
Genes with Database Similarity (2 predicted, 0 with alternative splices)
Gene 1, Variant 1 Strand: - Bounds: 320-702 Exons: 1
Start Codon: Yes Stop Codon: Yes
Top-Scoring Reference: AA633758.1 (383 bp) (99% id, 320-702)
>human|AA633758.1|dbest|AA633758 ac27b12.s1 Stratagene ovary (#937217)
Homo sapiens cDNA clone IMAGE:857663 3'
Reference Path: AA633758.1 (383 bp) (99%, 320-702)
---Index---- --------Exons-------- ---------CDS--------- -Ph- -Fr- -Len- -Scr-
1.1.1 320 702 595 687 0 0 383 99
Gene 2, Variant 1 Strand: + Bounds: 3936-35976 Exons: 12
Start Codon: Yes Stop Codon: Yes
Top-Scoring Reference: HT1292 (1498 bp) (99% id, 3936-35975)
>human|HT1292|tigr|adenosine deaminase, alt. transcript 1, 3'
Reference Path: HT1292 (1498 bp) (99%, 3936-35975)
AV735384.1 (567 bp) (97%, 34354-35976)
---Index---- --------Exons-------- ---------CDS--------- -Ph- -Fr- -Len- -Scr-
Promoter 1978 2055 ... ... . . 78 69
2.1.1 3936 4063 4031 4063 1 1 128 100
2.1.2 19230 19291 19230 19291 0 2 62 100
2.1.3 26344 26466 26344 26466 2 1 123 100
2.1.4 28908 29051 28908 29051 2 0 144 100
2.1.5 29823 29938 29823 29938 2 0 116 100
2.1.6 31176 31303 31176 31303 1 1 128 100
2.1.7 32425 32496 32425 32496 0 0 72 100
2.1.8 32573 32674 32573 32674 0 1 102 100
2.1.9 32851 32915 32851 32915 0 0 65 98
2.1.10 34354 34483 34354 34483 2 1 130 100
2.1.11 35100 35202 35100 35202 0 2 103 100
2.1.12 35651 35976 35651 35664 1 0 326 100
PolyA 35947 35952 ... ... . . 6 89
>GrailEXP Gene 1, Var 1 mRNA|Similar to AA633758.1
cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctg
aaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacat
tgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagag
cgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggc
ctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatgg
ccattccagttgacagtcccactgagcccaggc
>GrailEXP Gene 1, Var 1 protein|Derived from similarity to AA633758.1
MTFKSLFNISPTTSHHIAEMPINLDELLVA*
>GrailEXP Gene 2, Var 1 mRNA|Similar to HT1292
accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggag
cagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcat
gtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcc
cagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcct
ggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgag
tttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggcca
actccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccct
agtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatg
cgccaccagcccaactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtag
ccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctacca
ggaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaa
gaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggcccttt
ataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctg
gaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagat
gacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactg
aagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagaggga
gcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaaga
cgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatt
tattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttc
tgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcc
cttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctca
ataaagaagcccatggctggtggcatgca
>GrailEXP Gene 2, Var 1 protein|Derived from similarity to HT1292
MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMP
AIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEG
ERDFGVKARSILCCMRHQPNWSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQEAVKSGIH
RTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAV
IRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKA
YGMPPSASAGQNL*
Genes with No Database Evidence (0 predicted)
--------------------------------------------------------------------------------
The pretty output format provides information about each
gene model, including the FASTA header for the highest scoring
reference, a reference path, a complete list of exons and CDs regions,
and the mRNA and predicted protein for that gene. Unlike the
other output formats (raw and GCA), the pretty output format does
not show all the ESTs/cDNAs associated with each gene model; it
merely shows the minimum amount of evidence that spans the gene.
begin genes 1 1 summary r 1 320 702 595 687 320 702 1 1 exon 320 702 0 0 99 1 1 evidence dbest human AA633758.1 383 320 702 383 1 99 99 1 1 mrna cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctgaaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacattgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagagcgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggcctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatggccattccagttgacagtcccactgagcccaggc 1 1 translation MTFKSLFNISPTTSHHIAEMPINLDELLVA* 2 1 summary f 12 3936 35976 4031 35664 3936 35976 2 1 promoter 1978 2055 69 2 1 exon 3936 4063 1 1 100 2 1 exon 19230 19291 0 2 100 2 1 exon 26344 26466 2 1 100 2 1 exon 28908 29051 2 0 100 2 1 exon 29823 29942 2 0 100 2 1 exon 31176 31307 1 1 95 2 1 exon 32425 32496 0 0 100 2 1 exon 32573 32674 0 1 100 2 1 exon 32851 32915 0 0 98 2 1 exon 34354 34483 2 1 100 2 1 exon 35100 35202 0 2 100 2 1 exon 35651 35976 1 0 100 2 1 polya 35947 35952 89 2 1 evidence genbank human K02567.1 1478 3955 35975 1 1478 99 97 2 1 evidence tigr human HT1292 1498 3936 35975 1 1498 99 98 2 1 mrna accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggagcagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcatgtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcccagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcctggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgagtttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggccaactccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccctagtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatgcgccaccagcccagtgaxxactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtagccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctaccaggtggxxgaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaagaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggccctttataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctggaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagatgacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactgaagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagagggagcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaagacgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatttattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttctgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcccttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctcaataaagaagcccatggctggtggcatgca 2 1 translation MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPS--WSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQV-EAVKSGIHRTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAVIRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKAYGMPPSASAGQNL* end genes
All coordinates in this format are ASCENDING, with strand indicated clearly on the summary line. For coordinates labeled "begin" and "end", this can be somewhat deceptive! The gene labeled "r 320 702", for example, begins at base 702 and ends at base 320, although its "begin" field is 320. Just be aware that everything is ordered in ascending order.
Each gene record begins with a summary line, whose fields are, in order:
Gene ID, Gene Variant, Summary Tag, Strand, Number of Exons, Gene Begin, Gene End, CDs Begin, CDs End, Evidence Begin, Evidence End
Gene ID, Gene Variant, Exon Tag, Exon Begin, Exon End, Phase, Reading Frame, Score
Each promoter or polya is indicated by a promoter or polya line, whose fields are, in order:
Gene ID, Gene Variant, Polya/Promoter Tag, Begin, End, Score
Each EST/cDNA/mRNA that is consistent with a particular gene model is presented as an evidence line, the fields of which are, in order:
Gene ID, Gene Variant, Evidence Tag, Database Tag, Organism Tag, Accession Number, Length, Query Begin, Query End, Ref Begin, Ref End, Percent ID, Percent Length
mRNAs and protein translatiosn are presented on mrna and translation lines, the fields of which are:
Gene ID, Gene Variant, mRNA/Protein Tag, mRNA/Protein.
The fields are separated by spaces. A leading space begins each data line. The gene model list is enclosed by "begin genes" and "end genes" tags.
gene_grailexp=1|1|36040|36422|36055|36147|36040|36422|1|r|36040|36422|0|0.99 evidence_gene_grailexp=1|1|AA633758.1|dbest|human|383|36040|36422|1|383|0.99|0.99 mrna_gene_grailexp=1|1|cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctgaaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacattgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagagcgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggcctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatggccattccagttgacagtcccactgagcccaggc protein_gene_grailexp=1|1|MTFKSLFNISPTTSHHIAEMPINLDELLVA* promoter_gene_grailexp=2|1|1978|2055|0.69 gene_grailexp=2|1|3936|35976|4031|35664|3936|35976|12|f|3936|4063|1|1|19230|19291|2|1|26344|26466|1|1|28908|29051|0|1|29823|29942|0|1|31176|31307|1|0.95|32425|32496|0|1|32573|32674|1|1|32851|32915|0|0.98|34354|34483|1|1|35100|35202|2|1|35651|35976|0|1 polya_gene_grailexp=2|1|35947|35952|0.89 evidence_gene_grailexp=2|1|K02567.1|genbank|human|1478|3955|35975|1|1478|0.99|0.97 evidence_gene_grailexp=2|1|HT1292|tigr|human|1498|3936|35975|1|1498|0.99|0.98 mrna_gene_grailexp=2|1|accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggagcagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcatgtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcccagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcctggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgagtttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggccaactccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccctagtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatgcgccaccagcccagtgaxxactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtagccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctaccaggtggxxgaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaagaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggccctttataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctggaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagatgacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactgaagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagagggagcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaagacgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatttattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttctgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcccttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctcaataaagaagcccatggctggtggcatgca protein_gene_grailexp=2|1|MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPS--WSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQV-EAVKSGIHRTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAVIRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKAYGMPPSASAGQNL*
A technical description of the format:
gene_grailexp=id|var|beg|end|cbeg|cend|ebeg|eend|num exons|strand|(exon list)
an exon consists of: beg|end|frame|score
id = id
var = variant, usually 1, in the case of alternative splices, a gene can
have multiple variants
beg, end = coordinates of gene
cbeg, cend = coordinates of coding portion of gene (CDs)
ebeg, eend = coordinates of boundaries of experimental evidence for gene
num exons = number of exons in gene
strand = f or r
evidence_gene_grailexp=id|var|acc|db|org|len|beg|end|pctid|lenwt
id, var = identifier of gene this goes with
acc, db, org, len = characteristics of the db entry (see align_grailexp above)
beg, end = coordinates where this est begins and ends in the gene
pctid = percent identity over all the hits for this est
lenwt = a weight of how much of the gene this est covers... lenwt * pctid is a good score
for a piece of evidence
polya_gene_grailexp=id|var|beg|end|score
promoter_gene_grailexp=id|var|beg|end|score
id, var = identifier of gene this goes with
beg, end = coordinates of the polya or promoter
score = score from 0.0 to 1.0
mrna_gene_grailexp=id|var|mrna
protein_gene_grailexp=id|var|protein
id, var = identifier of gene this goes with
self-explanatory... mrna and protein for a gene
Genome Channel output format is always indexed from the TARGET STRAND'S PERSPECTIVE. This means all forward strand objects are indexed relative to the forward strand, and all reverse strand objects are indexed relative to the reverse strand.
However, once you have these genomic alignments and computationally predicted exons, what questions do you ask next? So far, the gene assembly module tries to answer the "basic questions": Where are the genes? What ESTs/cDNAs provide evidence for those genes? Where are the regulatory regions? What genes have splice variants and what ESTs provide evidence for this?
There are many more questions that can be asked, and future versions of the program hopefully will be able to answer them.
Current planned improvements include: