Introduction
These auxilliary Bio.SeqLoc modules provide functions to read and
write gene annotations in GTF and BED format. I separated them to
minimize the number of dependencies for the main seqloc package.
GTF
A single GTF annotation can span multiple lines and the "[o]rder of rows is not important", so the entire file must be loaded before any transcripts can be assembled.
The following program reads a full human GTF annotation, takes the
first 10 transcripts, writes them to a small test file, re-reads them,
and verifies that the results of a cycle of transcriptToGtf followed
by readGtfTranscripts does not change anything.
module Main
where
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO
import Bio.SeqLoc.GTF
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript
main :: IO ()
main = do trx <- readGtfTranscripts "/data/genomes/Homo_sapiens/hg19_knownGene.gtf"
let trx10 = take 10 trx'
BS.writeFile "test/gtf-out10.gtf" . BS.concat . map (transcriptToGtf "TestGtf") $ trx10
trx10' <- readGtfTranscripts "test/gtf-out10.gtf"
print $ (sort . map location $ trx10) == (sort . map location $ trx10')
BED
A single BED annotation occupies a single line, so it is possible to
process BED format annotations iteratively. This interface uses
Data.Iteratee iteration, as shown below.
The Iter.mapM_ function generates an Iteratee that maps a monadic
action over each element of the input stream. Here, the input stream
will be a list of transcripts, which will be written to the output
file using BS.hPutStrLn hout . transcriptToBedStd. The
bedTranscriptEnum encloses the BED format parser, allowing it to
convert an iteratee for a stream of [Transcript] into an iteratee
for a stream of BS.ByteString containing the BED format data. The
Iter.fileDriver driver then applies the transformed bedIter to the
data from a file.
module Main
where
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO
import qualified Data.Iteratee as Iter
import Bio.SeqLoc.Bed
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript
main :: IO ()
main = do withFile "test/bed-copy.bed" WriteMode $ \hout ->
let bedIter = bedTranscriptEnum $ Iter.mapM_ (BS.hPutStrLn hout . transcriptToBedStd)
in Iter.fileDriver bedIter "/data/genomes/Homo_sapiens/hg19_knownGene.bed"
A simpler interface in which the entire contents of an annotation file are read or written together is also provided, just as for GTF.