GrailEXP FAQ Section 2: The Exon Prediction Program (Perceval)


What is Perceval?

PERCEVAL stands for Protein-coding Exon, Repetitive, and CpG-Island EVALuator.

Perceval reads a DNA sequence and produces a list of possible Grail Exon Candidates, which it filters against a database of repetitive elements. It also locates repetitive elements and CpG islands.

What are Grail Exon Candidates?

A Grail Exon Candidate is a region that the Grail neural network identifies as a potential exon. Each candidate has a begin coordinate, an end coordinate, a strand, and a frame. A candidate begins with either a start codon or an AG acceptor splice site and ends with either a stop codon or a donor splice site. The candidates are subdivided into clusters, and the highest-scoring exon in each cluster is clearly indicated. These best exons are traditionally referred to as "Grail exons".

How are Grail Exon Candidates predicted?

All potential splice sites within the sequence (start, AAG acceptor, YAG acceptor, or GT donor) are examined by neural networks and assigned scores. All possible candidates within the sequence are then examined, and coding scores are calculated for each one. These scores, along with the splice site scores and GC content information, are fed to the final neural net, which produces a final score for that exon. The process is "thresholded" at each stage: the splice sites must score sufficiently high for a candidate to continue, then the coding scores, and finally the overall score. All candidates with a final score above a preset threshold are retained.
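
The staged thresholding described above can be sketched as a simple cascade. This is only an illustration: the function name, the cutoff values, and the stand-in scores are all hypothetical, since this FAQ does not state the thresholds or the network internals that Perceval actually uses.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple

# A stage is (scoring function, minimum score). Both are placeholders here.
Stage = Tuple[Callable[[Any], float], float]


def threshold_cascade(candidates: Iterable[Any], stages: Sequence[Stage]) -> List[Any]:
    """Keep candidates that meet every cutoff, checking the stages in order.

    all() stops at the first failing stage, so later (costlier) scores
    are never computed for a candidate that has already been rejected.
    """
    kept = []
    for cand in candidates:
        if all(score(cand) >= cutoff for score, cutoff in stages):
            kept.append(cand)
    return kept


# Stand-in scores; in Perceval these would come from the neural networks.
candidates = [
    {"name": "a", "splice": 0.9, "coding": 0.8, "final": 0.7},
    {"name": "b", "splice": 0.3, "coding": 0.9, "final": 0.9},
    {"name": "c", "splice": 0.9, "coding": 0.2, "final": 0.9},
]
stages = [
    (lambda c: c["splice"], 0.5),  # hypothetical splice-site cutoff
    (lambda c: c["coding"], 0.5),  # hypothetical coding cutoff
    (lambda c: c["final"], 0.5),   # hypothetical final-score cutoff
]
kept = threshold_cascade(candidates, stages)
print([c["name"] for c in kept])  # prints ['a']
```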

The raw list of exons is then organized into clusters. Each cluster is filtered for repetitives, and candidates flagged as repetitive elements are eliminated. Next, a strand resolution process is applied: overlapping exons on opposite strands are examined, and the lower-scoring cluster (containing what we call "shadow exons") is eliminated.
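
A minimal sketch of the strand resolution step follows. It assumes that a cluster can be summarized by an interval in forward-strand coordinates plus the score of its best exon, and that "overlap" means the two intervals share at least one base. The FAQ does not specify either detail, so the class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    strand: str       # "f" or "r"
    begin: int        # forward-strand coordinates, begin <= end
    end: int
    best_score: int   # score of the highest-scoring exon in the cluster
    shadow: bool = False


def overlaps(a: Cluster, b: Cluster) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a.begin <= b.end and b.begin <= a.end


def resolve_strands(clusters: List[Cluster]) -> None:
    """Flag the lower-scoring cluster of each overlapping opposite-strand pair.

    Simplifications (assumptions): ties flag the second cluster of the
    pair, and a cluster already flagged can still flag others.
    """
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            if a.strand != b.strand and overlaps(a, b):
                loser = a if a.best_score < b.best_score else b
                loser.shadow = True
```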

The final list of exon candidates is then output (with the eliminated shadow and repetitive exons clearly indicated).

How does the program handle repetitives?

The program replaces all non-exonic regions of the sequence with "n" characters and runs a BLAST search with the resulting sequence against a database of repetitive elements. If a repetitive element has a significant overlap with an exon candidate (10% of the exon if the repeat overlaps an edge of the exon, 50% if it is embedded inside the exon), then that exon candidate is eliminated.
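
The overlap rule can be written out as a small function. This is a sketch under stated assumptions: coordinates are 1-based and inclusive, "embedded" means the repeat lies strictly inside the exon, and the cutoffs are inclusive (the FAQ does not say whether exactly 10% or exactly 50% counts). The function name is hypothetical.

```python
def repeat_eliminates_exon(exon_begin: int, exon_end: int,
                           rep_begin: int, rep_end: int) -> bool:
    """Return True if the repeat overlaps enough of the exon to eliminate it."""
    ov_begin = max(exon_begin, rep_begin)
    ov_end = min(exon_end, rep_end)
    if ov_begin > ov_end:
        return False  # no overlap at all

    exon_len = exon_end - exon_begin + 1
    fraction = (ov_end - ov_begin + 1) / exon_len

    # Assumption: a repeat that touches neither exon boundary is "embedded".
    embedded = rep_begin > exon_begin and rep_end < exon_end
    required = 0.50 if embedded else 0.10
    return fraction >= required
```

For a 100 bp exon at 100..199, a repeat at 50..109 covers 10 bp of one edge and would eliminate the exon under these assumptions, while a repeat at 120..160 is embedded, covers only 41 bp, and would not.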

One could also mask a sequence for repetitives prior to submitting it to GrailEXP's exon prediction program, but this is not recommended and may lead to the loss of legitimate exons.

What is the pretty output format?

Here is a sample of the pretty output for a sequence of roughly 37 kb (36,741 bp):

--------------------------------------------------------------------------------
GrailEXP v3.2                                  http://compbio.ornl.gov/grailexp/

Authors:  Doug Hyatt, Manesh Shah, Victor Olman, Richard Mural, Ying Xu, and 
  Edward C. Uberbacher, 1996-2001

Reference:  "Automated Gene Identification in Large-Scale Genomic Sequences",
  Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4,
  Number 3, 1997

Sequence:  >GrailEXP Input Sequence (36741 bp)
--------------------------------------------------------------------------------
PERCEVAL Exon Candidates (15 predicted)

 Index Std   Begin       End     Frm    Type     Len    Scr    Quality

      1 -        200        386   2   Internal    187    79        Good
      2 -        595        693   0   Terminal     99    80        Good
      3 -       9207       9254   0   Internal     48    66    Marginal
      4 -       9910       9986   0    Initial     77    70        Good
      5 -      14287      14794   2   Terminal    508    49    Marginal
      6 +      19230      19291   0   Internal     62    99   Excellent
      7 +      26344      26466   2   Internal    123   100   Excellent
      8 +      28908      29051   2   Internal    144   100   Excellent
      9 +      29823      29938   2   Internal    116   100   Excellent
     10 +      31176      31303   1   Internal    128    98   Excellent
     11 +      32425      32496   0   Internal     72   100   Excellent
     12 +      32573      32674   0   Internal    102   100   Excellent
     13 +      32851      32915   0   Internal     65    60    Marginal
     14 +      34354      34483   2   Internal    130   100   Excellent
     15 +      35100      35202   0   Internal    103    94   Excellent
--------------------------------------------------------------------------------

Unlike the other output formats (Genome Channel (GCA) and raw), the pretty output reports only the highest-scoring exon in each cluster. All indexing is from the forward strand perspective.

What is the raw output format?

Here is a sample of the raw output for the same 36,741 bp sequence:

begin exons
 f 1 176 313 1 0 57 1 176 454 1
 f 2 412 549 3 0 48 1 391 549 1
 f 3 4031 4063 0 0 54 1 3626 4288 1
 f 4 6846 7091 3 0 43 2 6789 7091 0
 f 4 6852 7091 3 0 45 2 6789 7091 1
 f 5 19230 19291 1 0 99 0 19170 19295 1
 f 5 19234 19291 1 1 91 0 19170 19295 0
 f 6 26291 26466 1 0 86 0 26267 26470 0
 f 6 26291 26470 2 0 78 0 26267 26470 0
 f 6 26344 26466 1 2 100 0 26267 26470 1
 f 6 26344 26470 2 2 94 0 26267 26470 0
 f 6 26402 26466 0 0 84 0 26267 26470 0
 f 7 28837 29051 0 0 88 0 28750 29055 0
 f 7 28908 28976 1 2 88 0 28750 29055 0
 f 7 28908 28986 1 2 82 0 28750 29055 0
 f 7 28908 29022 1 2 91 0 28750 29055 0
 f 7 28908 29051 1 2 100 0 28750 29055 1
 f 7 28908 29055 2 2 79 0 28750 29055 0
 f 7 28954 29051 0 0 93 0 28750 29055 0
 f 7 28954 29055 3 0 13 0 28750 29055 0
 f 8 29795 29895 1 1 83 0 29662 29946 0
 f 8 29795 29906 1 1 78 0 29662 29946 0
 f 8 29795 29915 1 1 79 0 29662 29946 0
 f 8 29795 29938 1 1 91 0 29662 29946 0
 f 8 29795 29942 1 1 80 0 29662 29946 0
 f 8 29795 29946 2 1 79 0 29662 29946 0
 f 8 29823 29938 1 2 100 0 29662 29946 1
 f 8 29823 29942 1 2 89 0 29662 29946 0
 f 8 29823 29946 2 2 89 0 29662 29946 0
 f 9 31151 31303 1 0 89 0 31148 31345 0
 f 9 31151 31312 1 0 88 0 31148 31345 0
 f 9 31151 31341 1 0 84 0 31148 31345 0
 f 9 31176 31303 1 1 98 0 31148 31345 1
 f 9 31176 31312 1 1 97 0 31148 31345 0
 f 9 31176 31341 1 1 93 0 31148 31345 0
 f 9 31176 31345 2 1 89 0 31148 31345 0
 f 10 32359 32469 0 0 84 0 32335 32643 0
 f 10 32359 32496 0 0 91 0 32335 32643 0
 f 10 32394 32496 1 2 96 0 32335 32643 0
 f 10 32401 32469 0 0 87 0 32335 32643 0
 f 10 32401 32487 0 0 87 0 32335 32643 0
 f 10 32401 32496 0 0 93 0 32335 32643 0
 f 10 32425 32469 1 0 90 0 32335 32643 0
 f 10 32425 32487 1 0 89 0 32335 32643 0
 f 10 32425 32496 1 0 100 0 32335 32643 1
 f 11 32573 32674 1 0 100 0 32501 33106 1
 f 11 32573 32690 1 0 89 0 32501 33106 0
 f 11 32573 32713 1 0 96 0 32501 33106 0
 f 11 32595 32674 1 1 83 0 32501 33106 0
 f 12 32851 32915 1 0 60 0 32644 32919 1
 f 13 34289 34483 0 0 88 0 34283 34513 0
 f 13 34289 34487 0 0 77 0 34283 34513 0
 f 13 34289 34491 0 0 76 0 34283 34513 0
 f 13 34289 34513 3 0 79 0 34283 34513 0
 f 13 34354 34470 1 2 92 0 34283 34513 0
 f 13 34354 34483 1 2 100 0 34283 34513 1
 f 13 34354 34487 1 2 92 0 34283 34513 0
 f 13 34354 34491 1 2 92 0 34283 34513 0
 f 13 34354 34493 1 2 90 0 34283 34513 0
 f 13 34354 34513 2 2 95 0 34283 34513 0
 f 13 34370 34483 1 0 93 0 34283 34513 0
 f 13 34370 34487 1 0 85 0 34283 34513 0
 f 13 34392 34483 1 1 83 0 34283 34513 0
 f 14 35022 35202 0 0 72 0 35013 35273 0
 f 14 35100 35202 1 0 94 0 35013 35273 1
 f 14 35100 35206 1 0 87 0 35013 35273 0
 r 9 200 321 1 1 57 0 170 415 0
 r 9 200 386 1 2 79 0 170 415 1
 r 8 595 693 2 0 80 0 595 1047 1
 r 8 595 818 2 1 32 0 595 1047 0
 r 7 9207 9254 1 0 66 0 8943 9257 1
 r 6 9910 9977 0 0 59 0 9894 10025 0
 r 6 9910 9980 0 0 60 0 9894 10025 0
 r 6 9910 9983 0 0 68 0 9894 10025 0
 r 6 9910 9986 0 0 70 0 9894 10025 1
 r 5 14287 14794 2 2 49 0 14287 15027 1
 r 4 25704 25740 2 2 53 1 25704 25889 0
 r 4 25708 25740 1 2 61 1 25704 25889 1
 r 3 26096 26277 0 0 50 1 25453 26286 1
 r 2 29488 29570 1 0 55 1 29466 29618 1
 r 1 31214 31310 1 2 66 1 31043 31432 1
end exons

All indexing is from the forward strand's perspective. All coordinates are in ASCENDING order; strand is indicated by a separate field.

The fields are, in order:
Strand, Cluster ID, Begin, End, Type, Phase, Score, Status, ORF Begin, ORF End, Grail Exon Flag

The fields are separated by spaces. A leading space begins each data line. The exon candidate list is enclosed by "begin exons" and "end exons" tags.
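
As an illustration of the layout described above, here is a minimal parser for the raw format. The names `RawExon` and `parse_raw` are hypothetical and not part of GrailEXP. The numeric codes for Type and Status are kept as plain integers because their meanings are not defined in this section.

```python
from typing import List, NamedTuple


class RawExon(NamedTuple):
    strand: str        # "f" or "r"
    cluster_id: int
    begin: int         # forward-strand coordinates, ascending
    end: int
    exon_type: int
    phase: int
    score: int
    status: int
    orf_begin: int
    orf_end: int
    grail_exon: bool   # True for the best exon in its cluster


def parse_raw(text: str) -> List[RawExon]:
    """Collect the data lines between "begin exons" and "end exons"."""
    exons = []
    inside = False
    for line in text.splitlines():
        line = line.strip()  # drops the leading space on data lines
        if line == "begin exons":
            inside = True
        elif line == "end exons":
            inside = False
        elif inside and line:
            f = line.split()
            if len(f) != 11:
                raise ValueError(f"expected 11 fields, got {len(f)}: {line!r}")
            # Fields 1..9 are integers; the last field is the 0/1 flag.
            exons.append(RawExon(f[0], *map(int, f[1:10]), f[10] == "1"))
    return exons
```

The Grail exons can then be selected with `[e for e in parse_raw(text) if e.grail_exon]`.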

What is the Genome Channel output format?

Here is a valid Genome Channel file representing the same information as in the above two sections:

exon_grailexp_v3=1|f|1|2|19230|19291|19170|19295|0.99|1
exon_grailexp_v3=2|f|1|2|19234|19291|19170|19295|0.91|0
exon_grailexp_v3=3|f|1|1|26291|26466|26267|26470|0.86|0
exon_grailexp_v3=4|f|2|1|26291|26470|26267|26470|0.78|0
exon_grailexp_v3=5|f|1|1|26344|26466|26267|26470|1|1
exon_grailexp_v3=6|f|2|1|26344|26470|26267|26470|0.94|0
exon_grailexp_v3=7|f|0|1|26402|26466|26267|26470|0.84|0
exon_grailexp_v3=8|f|0|0|28837|29051|28750|29055|0.88|0
exon_grailexp_v3=9|f|1|0|28908|28976|28750|29055|0.88|0
exon_grailexp_v3=10|f|1|0|28908|28986|28750|29055|0.82|0
exon_grailexp_v3=11|f|1|0|28908|29022|28750|29055|0.91|0
exon_grailexp_v3=12|f|1|0|28908|29051|28750|29055|1|1
exon_grailexp_v3=13|f|2|0|28908|29055|28750|29055|0.79|0
exon_grailexp_v3=14|f|0|0|28954|29051|28750|29055|0.93|0
exon_grailexp_v3=15|f|3|0|28954|29055|28750|29055|0.13|0
exon_grailexp_v3=16|f|1|0|29795|29895|29662|29946|0.83|0
exon_grailexp_v3=17|f|1|0|29795|29906|29662|29946|0.78|0
exon_grailexp_v3=18|f|1|0|29795|29915|29662|29946|0.79|0
exon_grailexp_v3=19|f|1|0|29795|29938|29662|29946|0.91|0
exon_grailexp_v3=20|f|1|0|29795|29942|29662|29946|0.8|0
exon_grailexp_v3=21|f|2|0|29795|29946|29662|29946|0.79|0
exon_grailexp_v3=22|f|1|0|29823|29938|29662|29946|1|1
exon_grailexp_v3=23|f|1|0|29823|29942|29662|29946|0.89|0
exon_grailexp_v3=24|f|2|0|29823|29946|29662|29946|0.89|0
exon_grailexp_v3=25|f|1|1|31151|31303|31148|31345|0.89|0
exon_grailexp_v3=26|f|1|1|31151|31312|31148|31345|0.88|0
exon_grailexp_v3=27|f|1|1|31151|31341|31148|31345|0.84|0
exon_grailexp_v3=28|f|1|1|31176|31303|31148|31345|0.98|1
exon_grailexp_v3=29|f|1|1|31176|31312|31148|31345|0.97|0
exon_grailexp_v3=30|f|1|1|31176|31341|31148|31345|0.93|0
exon_grailexp_v3=31|f|2|1|31176|31345|31148|31345|0.89|0
exon_grailexp_v3=32|f|0|0|32359|32469|32335|32643|0.84|0
exon_grailexp_v3=33|f|0|0|32359|32496|32335|32643|0.91|0
exon_grailexp_v3=34|f|1|0|32394|32496|32335|32643|0.96|0
exon_grailexp_v3=35|f|0|0|32401|32469|32335|32643|0.87|0
exon_grailexp_v3=36|f|0|0|32401|32487|32335|32643|0.87|0
exon_grailexp_v3=37|f|0|0|32401|32496|32335|32643|0.93|0
exon_grailexp_v3=38|f|1|0|32425|32469|32335|32643|0.9|0
exon_grailexp_v3=39|f|1|0|32425|32487|32335|32643|0.89|0
exon_grailexp_v3=40|f|1|0|32425|32496|32335|32643|1|1
exon_grailexp_v3=41|f|1|1|32573|32674|32501|33106|1|1
exon_grailexp_v3=42|f|1|1|32573|32690|32501|33106|0.89|0
exon_grailexp_v3=43|f|1|1|32573|32713|32501|33106|0.96|0
exon_grailexp_v3=44|f|1|1|32595|32674|32501|33106|0.83|0
exon_grailexp_v3=45|f|1|0|32851|32915|32644|32919|0.6|1
exon_grailexp_v3=46|f|0|1|34289|34483|34283|34513|0.88|0
exon_grailexp_v3=47|f|0|1|34289|34487|34283|34513|0.77|0
exon_grailexp_v3=48|f|0|1|34289|34491|34283|34513|0.76|0
exon_grailexp_v3=49|f|3|1|34289|34513|34283|34513|0.79|0
exon_grailexp_v3=50|f|1|1|34354|34470|34283|34513|0.92|0
exon_grailexp_v3=51|f|1|1|34354|34483|34283|34513|1|1
exon_grailexp_v3=52|f|1|1|34354|34487|34283|34513|0.92|0
exon_grailexp_v3=53|f|1|1|34354|34491|34283|34513|0.92|0
exon_grailexp_v3=54|f|1|1|34354|34493|34283|34513|0.9|0
exon_grailexp_v3=55|f|2|1|34354|34513|34283|34513|0.95|0
exon_grailexp_v3=56|f|1|1|34370|34483|34283|34513|0.93|0
exon_grailexp_v3=57|f|1|1|34370|34487|34283|34513|0.85|0
exon_grailexp_v3=58|f|1|1|34392|34483|34283|34513|0.83|0
exon_grailexp_v3=59|f|0|2|35022|35202|35013|35273|0.72|0
exon_grailexp_v3=60|f|1|2|35100|35202|35013|35273|0.94|1
exon_grailexp_v3=61|f|1|2|35100|35206|35013|35273|0.87|0
exon_grailexp_v3=62|r|1|2|36421|36542|36327|36572|0.57|0
exon_grailexp_v3=63|r|1|2|36356|36542|36327|36572|0.79|1
exon_grailexp_v3=64|r|2|0|36049|36147|35695|36147|0.8|1
exon_grailexp_v3=65|r|2|0|35924|36147|35695|36147|0.32|0
exon_grailexp_v3=66|r|1|1|27488|27535|27485|27799|0.66|1
exon_grailexp_v3=67|r|0|1|26765|26832|26717|26848|0.59|0
exon_grailexp_v3=68|r|0|1|26762|26832|26717|26848|0.6|0
exon_grailexp_v3=69|r|0|1|26759|26832|26717|26848|0.68|0
exon_grailexp_v3=70|r|0|1|26756|26832|26717|26848|0.7|1
exon_grailexp_v3=71|r|2|0|21948|22455|21715|22455|0.49|1

A technical description of the format:

exon_grailexp_v3=id|strand|type|frame|begin|end|orfbeg|orfend|score|best_flag

  id = numeric identifier of the exon candidate (sequential in the sample above)
  strand = f or r
  type = 0-3, where 0=Initial,1=Internal,2=Terminal,3=Single
  frame = 0-2
  begin, end = coordinates
  orfbeg, orfend = open reading frame coords
  score = score from 0.0 to 1.0
  best_flag = 0 or 1, 1 indicates the exon is a BEST EXON,
    i.e. the best in a cluster, 0 indicates the exon is
    not a best exon
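
A minimal sketch of reading one line in this format is shown below. The names `GCExon`, `EXON_TYPES`, and `parse_gc_line` are hypothetical; only the field order and the type codes are taken from the description above.

```python
from typing import NamedTuple

PREFIX = "exon_grailexp_v3="
EXON_TYPES = {0: "Initial", 1: "Internal", 2: "Terminal", 3: "Single"}


class GCExon(NamedTuple):
    exon_id: int
    strand: str      # "f" or "r"
    exon_type: int   # key into EXON_TYPES
    frame: int       # 0-2
    begin: int
    end: int
    orf_begin: int
    orf_end: int
    score: float     # 0.0 to 1.0
    best: bool       # True if this is the best exon in its cluster


def parse_gc_line(line: str) -> GCExon:
    """Parse a single exon_grailexp_v3 record."""
    line = line.strip()
    if not line.startswith(PREFIX):
        raise ValueError(f"not an exon_grailexp_v3 line: {line!r}")
    f = line[len(PREFIX):].split("|")
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
    return GCExon(int(f[0]), f[1], int(f[2]), int(f[3]),
                  int(f[4]), int(f[5]), int(f[6]), int(f[7]),
                  float(f[8]), f[9] == "1")
```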

Genome Channel output format is always indexed from the TARGET STRAND'S PERSPECTIVE. This means all forward strand objects are indexed relative to the forward strand, and all reverse strand objects are indexed relative to the reverse strand.
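
To compare reverse-strand records across formats, coordinates have to be mapped between the two perspectives. The sample data above is consistent with the usual 1-based conversion, position' = length - position + 1, using the sequence length of 36741 bp. For example, the raw reverse-strand exon at 200..386 appears in the Genome Channel file at 36356..36542. The conversion is inferred from the samples rather than stated by this FAQ, and the function name is hypothetical.

```python
from typing import Tuple


def flip_strand(begin: int, end: int, seq_len: int) -> Tuple[int, int]:
    """Map a 1-based inclusive interval to the opposite strand's numbering.

    The interval is returned in ascending order in the new numbering,
    so applying the function twice gives back the original interval.
    """
    return seq_len - end + 1, seq_len - begin + 1


print(flip_strand(200, 386, 36741))  # prints (36356, 36542)
```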

What is planned for future versions?

Future planned improvements to the system include:


The author and maintainer of this FAQ is Doug Hyatt (hyattpd@ornl.gov). This FAQ applies to GrailEXP version 3.2, released in February 2001. This FAQ was last updated in February 2001.