Thursday, June 09, 2011

Tabulating gaps in BAM files

If you're analyzing RNA-SEQ data and care about quantifying splicing, read on...
Here is an approach to tabulating the counts of gaps present in a bam file.
It is implemented in R using bioconductor packages
It takes about a minute to tabulate 33 million reads on my hardware.
It produces a data.frame for further downstream analysis of your choosing, and, optionally, a corresponding "junctions.bed" file (i.e. as is produced by Tophat and can be visualized in IGV).
Here's an example of using it and inspecting its output:
> btg <- bamTabulateGaps('t/t1.sam.sorted.bam')
> names(btg)
[1] "bedPath" "tabulatedGaps"
> btg$bedPath
[1] "t/t1.sam.sorted.junctions.bed"
> system(paste('head ', btg$bedPath))
track name=t1 graphType=junctions
2L 11343 11410 . 25
2L 11517 11779 . 51
2L 12220 12286 . 121
2L 12927 13520 . 109
2L 13624 13683 . 109
2L 14873 14933 . 114
2L 15710 17053 . 38
2L 17211 18026 . 4
2L 17211 18261 . 5
> head(btg$tabulatedGaps)
space start end score
1 2L 11344 11410 25
2 2L 11518 11779 51
3 2L 12221 12286 121
4 2L 12928 13520 109
5 2L 13625 13683 109
6 2L 14874 14933 114

Enjoy the code: