GrailEXP FAQ Section 4: The Gene Assembly Program


What is Gawain?

GAWAIN stands for Gene Assemblies With Alignment INformation. Gawain, the GrailEXP gene assembly program, builds gene models out of Grail Exon Candidates and/or GrailEXP genomic alignments. A gene model consists of a 5' untranslated region, coding portion, 3' untranslated region, exons, introns, a polyadenylation site, a promoter, and all reference sequence evidence consistent with the gene model. The program also outputs the mRNA and protein translation for each gene model.

The GrailEXP gene assembly module has been rewritten from the ground up for GrailEXP v3.0+. It includes many new capabilities that were lacking in older versions, such as recognition of alternative splicing and the clustering together of all ESTs/cDNAs that support a particular gene model.

Do I need Grail Exon Candidates AND alignments?

You can build gene models from either exon candidates or alignments, or use both.

How does the algorithm work?

The program begins by doing quite a bit of postprocessing on the genomic alignments. They are clustered, and each piece is assigned a frame using a recursive frame-scoring function that considers the Grail coding score for each potential frame, the frames of Grail exons that match the alignment pieces, and splicability among the various pieces.

The program then uses an elaborate dynamic programming algorithm to build gene models. Each node in the dynamic programming is a Grail exon candidate or an alignment-based exon. Connections between the various nodes are scored, with bonuses for "good" connections and penalties for "bad" connections.

Once the maximally scoring dynamic programming model has been calculated, the program takes the preliminary genes and does some further calculations on them. It can join or break genes at this point, as well as manipulate the exon edges to better match splice sites. Alternative splices are then located and added to the gene table.

Finally, the program looks for the coding begin and end within each mRNA. It evaluates possible starts within the gene and attempts to determine which is the correct one. This is a fallible process; without protein similarity, there is no way to get this 100% correct. In addition, false stop codons in the genomic sequence (due either to sequencing errors or to pseudogenes) can cause incorrect start and stop codons to be identified.

Polyas and promoters are located last. Polya recognition uses a simple Markov model. The promoter system uses a neural net and looks for specific types of signals (TATA, CAAT, GC-box). Not every gene will have a polya or a promoter.

How does the alternative splicing recognition work?

GrailEXP currently looks for a very specific case of alternative splicing. It looks for fully determined EST evidence that indicates the insertion or deletion of an exon. In other words, if one gene model contains exons A, B, and C, and another contains only exons A and C, then the program identifies this case as an alternative splice.

Why doesn't the program identify other possible cases of alternative splicing? The biggest problem is distinguishing between an UNSPLICED product that has slipped into the EST database, an error in the genomic alignment, and a genuine case of alternative splicing. I am skeptical of many reports of alternative splicing, because they do not take this into account. Just because one rogue EST disagrees with the pack does not necessarily mean it is an alternative splice. Even if you have 3 or 4 that say one thing, how can you be certain it isn't just a mistake in the alignment or in the ESTs?

For that reason, GrailEXP currently recognizes only the most obvious cases of alternative splicing. Outside of repeating gene regions (which can be incorrectly reported as regions with a lot of alternative splices), GrailEXP identifies inserted/omitted exons with absolute certainty.

In future versions, more cases of alternative splicing will be recognized. There are issues to resolve first with the algorithm, however, to determine when what one is looking at is really an alternative splice and not just garbage.

How does the program locate 5' and 3' untranslated regions?

The primary indicator of 5' and 3' untranslated regions is an additional stop codon in the mRNA. In such cases, the program always attempts to find the 5' and 3' untranslated region boundaries. GrailEXP examines the mRNA and finds the highest scoring stop-codon-less run (the score being based on the length of the run and the Grail coding score). It then examines that "run" to find the highest scoring start codon (based on proximity to the beginning of the run, which exon it falls in, Grail start site score, and coding/noncoding scores around the boundary). GrailEXP correctly identifies the start site in a complete mRNA about 95% of the time. The only way to make this number close to 100% is to add protein similarity searching to the system, which is planned for future versions.

Can the program recognize pseudogenes?

Sort of. While it doesn't flag pseudogenes as such, it does provide certain indications that what one is looking at could be a pseudogene. If you see "-" in the protein translation, this indicates a frame shift between exons. A high number of these (or even 1, in many cases) may indicate a pseudogene. In addition, if GrailEXP reports a long mRNA but a tiny protein translation, this may indicate a pseudogene containing several stop codons; it may also indicate an error in the genomic sequence, however. While there are "clues" to search for that indicate possible pseudogenes, there is still the problem of distinguishing between actual pseudogenes and mistakes in the alignment program (such as a missed short exon which might cause a frame shift or a sequencing-error-caused stop codon which throws off the 5'/3' UTR finder).

How do the polya and promoter algorithms work?

The polyadenylation site recognizer looks for the sequence AATAAA. It then examines a 72-base region around the sequence using a simple Markov model and reports back a score for that site. If a polya site is within 5,000 bases of the stop codon of a gene and does not fall in an intron, then it is retained. At most one polya (the highest scoring, in the case of multiples) is assigned to each gene model.

The promoter recognition system looks for a TATA or ATA and examines the region around these bases. In particular, the neural net is fed information on GC content, information on CAAT position and the surrounding area, GGGCGG position and the surrounding area, and ATG sequences and the surrounding areas, The neural net evaluates the scores and assigns a total score to the promoter. Again, the promoter must be within 5,000 bases of a start codon of a predicted GrailEXP gene model to be retained and cannot fall within an intron. At most one promoter element is assigned to each gene model.

What is the difference between open and closed ends?

GrailEXP can run with either open or closed end gene modeling. Open ended gene modeling allows PARTIAL genes to be predicted at the ends, i.e. it assigns no penalties for genes running off the edges of the sequence. This is the ideal way to run the program for contigs which you know do not begin or end a region.

Running with closed ends, on the other hand, will penalize partial genes at the ends of the sequence and will try to "close off" all gene models. This is ideal if you know a clone only contains one gene you want to examine, and you don't want GrailEXP to predict partial genes on the ends of the sequence.

GrailEXP runs with closed ends by default. This is due to the fact that there is almost always an initial or terminal Grail Exon Candidate that can be predicted near the beginning or end of a sequence. If run in open ended mode, you will see a lot of these single-exon partial genes predicted near the edges of your sequence.

Can I do single strand gene modeling and when should I do this?

By default GrailEXP runs with a double-stranded gene modeling algorithm. However, this can cause problems with recognition of embedded genes, which may be omitted under this algorithm (either that or they will be reported and the gene they are embedded in will be broken).

You can run GrailEXP on just the forward strand or just the reverse strand. In fact, the most accurate way to perform gene modeling with GrailEXP may very well be to predict exons/alignments double-stranded, then run the gene assembly program twice, once on the forward strand, once on the reverse strand. Further tests on this subject are needed.

What about ESTs/cDNAs that aren't listed as gene evidence?

If an EST/cDNA is not listed as gene evidence, but it is in the same location as the gene model, then there is something about that EST/cDNA that made it inconsistent with the gene model. Either it didn't align to the same splice sites or it trickled over the edges into a predicted intron. This happens in instances where an EST's first base falls very near a splice site edge; often there are not enough bases to align with the next exon upstream, so the program winds up extending the alignment into an intron. This problem can be solved by an examination of the alignments in the gene modeling phase and subsequent correction of such cases.

EST/cDNAs are flagged that are inconsistent with gene models but are not currently returned to the user. It would be relatively easy to cluster these EST/cDNAs together using GrailEXP's consistency-check function, but this is not currently being done. Suggestions are welcome on what would be useful to users in future versions. The program is primarily a gene assembly program, not an EST/cDNA clustering program, so its emphasis currently is on clustering together the EST/cDNAs that support a particular gene model and ignoring the rest.

What is the pretty output format?

Here is an example of the pretty output format for Gawain:

--------------------------------------------------------------------------------
GrailEXP v3.2                                  http://compbio.ornl.gov/grailexp/

Authors:  Doug Hyatt, Manesh Shah, Victor Olman, Richard Mural, Ying Xu, and 
  Edward C. Uberbacher, 1996-2001

Reference:  "Automated Gene Identification in Large-Scale Genomic Sequences",
  Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4,
  Number 3, 1997

Sequence:  >GrailEXP Input Sequence (36741 bp)
--------------------------------------------------------------------------------
GAWAIN Gene Predictions (2 predicted, 2 with database similarity)



Genes with Database Similarity (2 predicted, 0 with alternative splices)


Gene 1, Variant 1        Strand: -  Bounds: 320-702  Exons: 1
                         Start Codon:  Yes     Stop Codon:  Yes

Top-Scoring Reference:  AA633758.1 (383 bp) (99% id, 320-702)
     >human|AA633758.1|dbest|AA633758 ac27b12.s1 Stratagene ovary (#937217)
      Homo sapiens cDNA clone IMAGE:857663 3'
Reference Path:  AA633758.1 (383 bp) (99%, 320-702)

 ---Index---- --------Exons-------- ---------CDS--------- -Ph- -Fr- -Len- -Scr-
 1.1.1               320        702        595        687   0    0    383   99

Gene 2, Variant 1        Strand: +  Bounds: 3936-35976  Exons: 12
                         Start Codon:  Yes     Stop Codon:  Yes

Top-Scoring Reference:  HT1292 (1498 bp) (99% id, 3936-35975)
     >human|HT1292|tigr|adenosine deaminase, alt. transcript 1, 3'
Reference Path:  HT1292 (1498 bp) (99%, 3936-35975)
                 AV735384.1 (567 bp) (97%, 34354-35976)

 ---Index---- --------Exons-------- ---------CDS--------- -Ph- -Fr- -Len- -Scr-
 Promoter           1978       2055        ...        ...   .    .     78   69
 2.1.1              3936       4063       4031       4063   1    1    128  100
 2.1.2             19230      19291      19230      19291   0    2     62  100
 2.1.3             26344      26466      26344      26466   2    1    123  100
 2.1.4             28908      29051      28908      29051   2    0    144  100
 2.1.5             29823      29938      29823      29938   2    0    116  100
 2.1.6             31176      31303      31176      31303   1    1    128  100
 2.1.7             32425      32496      32425      32496   0    0     72  100
 2.1.8             32573      32674      32573      32674   0    1    102  100
 2.1.9             32851      32915      32851      32915   0    0     65   98
 2.1.10            34354      34483      34354      34483   2    1    130  100
 2.1.11            35100      35202      35100      35202   0    2    103  100
 2.1.12            35651      35976      35651      35664   1    0    326  100
 PolyA             35947      35952        ...        ...   .    .      6   89


>GrailEXP Gene 1, Var 1 mRNA|Similar to AA633758.1
cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctg
aaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacat
tgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagag
cgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggc
ctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatgg
ccattccagttgacagtcccactgagcccaggc

>GrailEXP Gene 1, Var 1 protein|Derived from similarity to AA633758.1
MTFKSLFNISPTTSHHIAEMPINLDELLVA*

>GrailEXP Gene 2, Var 1 mRNA|Similar to HT1292
accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggag
cagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcat
gtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcc
cagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcct
ggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgag
tttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggcca
actccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccct
agtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatg
cgccaccagcccaactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtag
ccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctacca
ggaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaa
gaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggcccttt
ataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctg
gaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagat
gacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactg
aagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagaggga
gcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaaga
cgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatt
tattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttc
tgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcc
cttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctca
ataaagaagcccatggctggtggcatgca

>GrailEXP Gene 2, Var 1 protein|Derived from similarity to HT1292
MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMP
AIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEG
ERDFGVKARSILCCMRHQPNWSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQEAVKSGIH
RTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAV
IRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKA
YGMPPSASAGQNL*


Genes with No Database Evidence (0 predicted)


--------------------------------------------------------------------------------
The pretty output format provides information about each gene model, including the FASTA header for the highest scoring reference, a reference path, a complete list of exons and CDs regions, and the mRNA and predicted protein for that gene. Unlike the other output formats (raw and GCA), the pretty output format does not show all the ESTs/cDNAs associated with each gene model; it merely shows the minimum amount of evidence that spans the gene.

What is the raw output format?

An example of the raw output of the GrailEXP gene assembly program follows (with most of the evidence omitted to save space):

begin genes
 1 1 summary r 1 320 702 595 687 320 702
 1 1 exon 320 702 0 0 99
 1 1 evidence dbest human AA633758.1 383 320 702 383 1 99 99
 1 1 mrna cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctgaaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacattgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagagcgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggcctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatggccattccagttgacagtcccactgagcccaggc
 1 1 translation MTFKSLFNISPTTSHHIAEMPINLDELLVA*
 2 1 summary f 12 3936 35976 4031 35664 3936 35976
 2 1 promoter 1978 2055 69
 2 1 exon 3936 4063 1 1 100
 2 1 exon 19230 19291 0 2 100
 2 1 exon 26344 26466 2 1 100
 2 1 exon 28908 29051 2 0 100
 2 1 exon 29823 29942 2 0 100
 2 1 exon 31176 31307 1 1 95
 2 1 exon 32425 32496 0 0 100
 2 1 exon 32573 32674 0 1 100
 2 1 exon 32851 32915 0 0 98
 2 1 exon 34354 34483 2 1 100
 2 1 exon 35100 35202 0 2 100
 2 1 exon 35651 35976 1 0 100
 2 1 polya 35947 35952 89
 2 1 evidence genbank human K02567.1 1478 3955 35975 1 1478 99 97
 2 1 evidence tigr human HT1292 1498 3936 35975 1 1498 99 98
 2 1 mrna accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggagcagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcatgtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcccagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcctggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgagtttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggccaactccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccctagtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatgcgccaccagcccagtgaxxactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtagccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctaccaggtggxxgaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaagaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggccctttataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctggaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagatgacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactgaagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagagggagcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaagacgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatttattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttctgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcccttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctcaataaagaagcccatggctggtggcatgca
 2 1 translation MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPS--WSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQV-EAVKSGIHRTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAVIRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKAYGMPPSASAGQNL* 
end genes

All coordinates in this format are ASCENDING, with strand indicated clearly on the summary line. For coordinates labeled "begin" and "end", this can be somewhat deceptive! The gene labeled "r 320 702", for example, begins at base 702 and ends at base 320, although its "begin" field is 320. Just be aware that everything is ordered in ascending order.

Each gene record begins with a summary line, whose fields are, in order:

Gene ID, Gene Variant, Summary Tag, Strand, Number of Exons, Gene Begin, Gene End, CDs Begin, CDs End, Evidence Begin, Evidence End

Each exon is indicated by an exon line, whose fields are, in order:

Gene ID, Gene Variant, Exon Tag, Exon Begin, Exon End, Phase, Reading Frame, Score

Each promoter or polya is indicated by a promoter or polya line, whose fields are, in order:

Gene ID, Gene Variant, Polya/Promoter Tag, Begin, End, Score

Each EST/cDNA/mRNA that is consistent with a particular gene model is presented as an evidence line, the fields of which are, in order:

Gene ID, Gene Variant, Evidence Tag, Database Tag, Organism Tag, Accession Number, Length, Query Begin, Query End, Ref Begin, Ref End, Percent ID, Percent Length

mRNAs and protein translatiosn are presented on mrna and translation lines, the fields of which are:

Gene ID, Gene Variant, mRNA/Protein Tag, mRNA/Protein.

The fields are separated by spaces. A leading space begins each data line. The gene model list is enclosed by "begin genes" and "end genes" tags.

What is the Genome Channel output format?

Here is a valid Genome Channel file representing the same information as in the above two sections (again with much of the evidence for gene 2 omitted):


gene_grailexp=1|1|36040|36422|36055|36147|36040|36422|1|r|36040|36422|0|0.99
evidence_gene_grailexp=1|1|AA633758.1|dbest|human|383|36040|36422|1|383|0.99|0.99
mrna_gene_grailexp=1|1|cctccccagcctctcatgacattcaagtccctcttcaacatcagtcccaccacttcccaccacattgctgaaatgccaatcaacctagatgagttgctggttgcttaagggtgacacaaattatttgcaattcttcacattgcaagatggagtctaggtttcctcctcttggacctgggctggcctgagtgacttgttggagtaatagagcgtgatgggagcaagcttcctggatttccgagccttggtcataaagaagctttgtggtgtcctttggggcctcttgaacattctctctgggaacccaagcttctagaagagaccatcatgctaagagagtccacatatggccattccagttgacagtcccactgagcccaggc
protein_gene_grailexp=1|1|MTFKSLFNISPTTSHHIAEMPINLDELLVA*
promoter_gene_grailexp=2|1|1978|2055|0.69
gene_grailexp=2|1|3936|35976|4031|35664|3936|35976|12|f|3936|4063|1|1|19230|19291|2|1|26344|26466|1|1|28908|29051|0|1|29823|29942|0|1|31176|31307|1|0.95|32425|32496|0|1|32573|32674|1|1|32851|32915|0|0.98|34354|34483|1|1|35100|35202|2|1|35651|35976|0|1
polya_gene_grailexp=2|1|35947|35952|0.89
evidence_gene_grailexp=2|1|K02567.1|genbank|human|1478|3955|35975|1|1478|0.99|0.97
evidence_gene_grailexp=2|1|HT1292|tigr|human|1498|3936|35975|1|1498|0.99|0.98
mrna_gene_grailexp=2|1|accgctggccccagggaaagccgagcggccaccgagccggcagagacccaccgagcggcggcggagggagcagcgccggggcgcacgagggcaccatggcccagacgcccgccttcgacaagcccaaagtagaactgcatgtccacctagacggatccatcaagcctgaaaccatcttatactatggcaggaggagagggatcgccctcccagctaacacagcagaggggctgctgaacgtcattggcatggacaagccgctcacccttccagacttcctggccaagtttgactactacatgcctgctatcgcgggctgccgggaggctatcaaaaggatcgcctatgagtttgtagagatgaaggccaaagagggcgtggtgtatgtggaggtgcggtacagtccgcacctgctggccaactccaaagtggagccaatcccctggaaccaggctgaaggggacctcaccccagacgaggtggtggccctagtgggccagggcctgcaggagggggagcgagacttcggggtcaaggcccggtccatcctgtgctgcatgcgccaccagcccagtgaxxactggtcccccaaggtggtggagctgtgtaagaagtaccagcagcagaccgtggtagccattgacctggctggagatgagaccatcccaggaagcagcctcttgcctggacatgtccaggcctaccaggtggxxgaggctgtgaagagcggcattcaccgtactgtccacgccggggaggtgggctcggccgaagtagtaaaagaggctgtggacatactcaagacagagcggctgggacacggctaccacaccctggaagaccaggccctttataacaggctgcggcaggaaaacatgcacttcgagatctgcccctggtccagctacctcactggtgcctggaagccggacacggagcatgcagtcattcggctcaaaaatgaccaggctaactactcgctcaacacagatgacccgctcatcttcaagtccaccctggacactgattaccagatgaccaaacgggacatgggctttactgaagaggagtttaaaaggctgaacatcaatgcggccaaatctagtttcctcccagaagatgaaaagagggagcttctcgacctgctctataaagcctatgggatgccaccttcagcctctgcagggcagaacctctgaagacgccactcctccaagccttcaccctgtggagtcaccccaactctgtggggctgagcaacatttttacatttattccttccaagaagaccatgatctcaatagtcagttactgatgctcctgaaccctatgtgtccatttctgcacacacgtatacctcggcatggccgcgtcacttctctgattatgtgccctggccagggaccagcgcccttgcacatgggcatggttgaatctgaaaccctccttctgtggcaacttgtactgaaaatctggtgctcaataaagaagcccatggctggtggcatgca
protein_gene_grailexp=2|1|MAQTPAFDKPKVELHVHLDGSIKPETILYYGRRRGIALPANTAEGLLNVIGMDKPLTLPDFLAKFDYYMPAIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVEPIPWNQAEGDLTPDEVVALVGQGLQEGERDFGVKARSILCCMRHQPS--WSPKVVELCKKYQQQTVVAIDLAGDETIPGSSLLPGHVQAYQV-EAVKSGIHRTVHAGEVGSAEVVKEAVDILKTERLGHGYHTLEDQALYNRLRQENMHFEICPWSSYLTGAWKPDTEHAVIRLKNDQANYSLNTDDPLIFKSTLDTDYQMTKRDMGFTEEEFKRLNINAAKSSFLPEDEKRELLDLLYKAYGMPPSASAGQNL*

A technical description of the format:


gene_grailexp=id|var|beg|end|cbeg|cend|ebeg|eend|num exons|strand|(exon list)

  an exon consists of:    beg|end|frame|score

  id = id
  var = variant, usually 1, in the case of alternative splices, a gene can 
    have multiple variants
  beg, end = coordinates of gene
  cbeg, cend = coordinates of coding portion of gene (CDs)
  ebeg, eend = coordinates of boundaries of experimental evidence for gene
  num exons = number of exons in gene
  strand = f or r

evidence_gene_grailexp=id|var|acc|db|org|len|beg|end|pctid|lenwt

  id, var = identifier of gene this goes with
  acc, db, org, len = characteristics of the db entry (see align_grailexp above)
  beg, end = coordinates where this est begins and ends in the gene
  pctid = percent identity over all the hits for this est
  lenwt = a weight of how much of the gene this est covers... lenwt * pctid is a good score
    for a piece of evidence

polya_gene_grailexp=id|var|beg|end|score
promoter_gene_grailexp=id|var|beg|end|score

  id, var = identifier of gene this goes with
  beg, end = coordinates of the polya or promoter 
  score = score from 0.0 to 1.0

mrna_gene_grailexp=id|var|mrna
protein_gene_grailexp=id|var|protein

  id, var = identifier of gene this goes with
  self-explanatory... mrna and protein for a gene

Genome Channel output format is always indexed from the TARGET STRAND'S PERSPECTIVE. This means all forward strand objects are indexed relative to the forward strand, and all reverse strand objects are indexed relative to the reverse strand.

What is planned for future versions?

The gene assembly module is very much a "work in progress". The information sought from the previous two programs (exon prediction and genomic alignment) is fairly simple and easy to understand.

However, once you have these genomic alignments and computationally predicted exons, what questions do you ask next? So far, the gene assembly module tries to answer the "basic questions": Where are the genes? What ESTs/cDNAs provide evidence for those genes? Where are the regulatory regions? What genes have splice variants and what ESTs provide evidence for this?

There are many more questions that can be asked, and future versions of the program hopefully will be able to answer them.

Current planned improvements include:


The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov). This FAQ applies to GrailEXP version 3.2, released February, 2001. This FAQ was last updated February, 2001.