Wednesday, April 11, 2007

HOWTO: inferring SO compliant features for splice_donor_site and splice_acceptor_site given a gene model

Correctly inferring SO compliant features for splice_donor_site and splice_acceptor_site given a gene model can be tricky.

I hope the following simplified example is useful for understanding the issue.


EXAMPLE
============================

Given this simplified gene model containing two exons, each 3 bp
long:

123456789
EEEIIIEEE
>>>--->>>

and given these SO definitions:

splice_donor_site: The junction between the 3 prime end of an exon and the following intron.
splice_acceptor_site: The junction between the 3 prime end of an intron and the following exon.

...we should encode the gene as:
exon(1,3,+)
splice_donor_site(3,3,+)
intron(4,6,+)
splice_acceptor_site(6,6,+)
exon(7,9,+)

HOWEVER, if the gene is encoded on the opposite strand, viz.

123456789
EEEIIIEEE
<<<---<<<

...we should encode it as:
exon(7,9,-)
splice_donor_site(6,6,-)
intron(4,6,-)
splice_acceptor_site(3,3,-)
exon(1,3,-)

Note that the coordinates of the exons and the intron are the same in both encodings; only the strand differs. AND, the splice sites occupy the same two coordinates (3 and 6) in both encodings, with only the donor and acceptor roles swapped. This follows from the GFF3 rule for zero-length features: "For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark"

I read "to the right of the indicated base in the direction of the landmark" as "1 plus the indicated base, in interbase coordinates". The direction is that of the landmark (the reference sequence), not the strand of the feature.

It is this understanding that I hope to have clarified by this example, demonstrating in particular that the splice sites should NOT be encoded in the second model as:

splice_donor_site(7,7,-)
splice_acceptor_site(4,4,-)
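
For anyone who wants to automate this, here is a minimal Python sketch of the rule used in both encodings above. It is only an illustration, not part of GFF3 or SO: the function name infer_splice_sites and the (type, start, end, strand) tuple layout are my own assumptions, and it assumes non-overlapping exons separated by an intron of at least 1 bp. The key point is that the zero-length splice sites are always anchored on the base to the LEFT of the junction in landmark coordinates; the strand only decides which junction is the donor and which is the acceptor.

```python
from typing import List, Tuple

# Hypothetical feature layout: (type, start, end, strand), 1-based inclusive.
Feature = Tuple[str, int, int, str]


def infer_splice_sites(exons: List[Tuple[int, int]], strand: str) -> List[Feature]:
    """Return exons, introns and splice sites in transcription order.

    Assumes exons do not overlap and that consecutive exons are separated
    by an intron of at least one base.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    ordered = sorted(exons)  # ascending landmark coordinates
    features: List[Feature] = []

    # On the + strand the left junction of an intron is the donor;
    # on the - strand it is the acceptor.
    if strand == "+":
        left_type, right_type = "splice_donor_site", "splice_acceptor_site"
    else:
        left_type, right_type = "splice_acceptor_site", "splice_donor_site"

    for i, (start, end) in enumerate(ordered):
        features.append(("exon", start, end, strand))
        if i + 1 < len(ordered):
            intron_start = end + 1
            intron_end = ordered[i + 1][0] - 1
            # Zero-length site: start == end, the site lies to the right of
            # the indicated base in the direction of the landmark.
            features.append((left_type, end, end, strand))
            features.append(("intron", intron_start, intron_end, strand))
            features.append((right_type, intron_end, intron_end, strand))

    # Report minus-strand genes 5' to 3', i.e. in descending coordinates.
    if strand == "-":
        features.reverse()
    return features


if __name__ == "__main__":
    for s in ("+", "-"):
        for ftype, start, end, fstrand in infer_splice_sites([(1, 3), (7, 9)], s):
            print(f"{ftype}({start},{end},{fstrand})")
        print()
```

Run on the toy gene model from the example, this should reproduce the two encodings listed above, one per strand.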