Introduction
These auxilliary Bio.SeqLoc
modules provide functions to read and
write gene annotations in GTF and BED format. I separated them to
minimize the number of dependencies for the main seqloc
package.
GTF
A single GTF annotation can span multiple lines and the "[o]rder of rows is not important", so the entire file must be loaded before any transcripts can be assembled.
The following program reads a full human GTF annotation, takes the
first 10 transcripts, writes them to a small test file, re-reads them,
and verifies that the results of a cycle of transcriptToGtf
followed
by readGtfTranscripts
does not change anything.
module Main
where
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO
import Bio.SeqLoc.GTF
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript
main :: IO ()
main = do trx <- readGtfTranscripts "/data/genomes/Homo_sapiens/hg19_knownGene.gtf"
let trx10 = take 10 trx'
BS.writeFile "test/gtf-out10.gtf" . BS.concat . map (transcriptToGtf "TestGtf") $ trx10
trx10' <- readGtfTranscripts "test/gtf-out10.gtf"
print $ (sort . map location $ trx10) == (sort . map location $ trx10')
BED
A single BED annotation occupies a single line, so it is possible to
process BED format annotations iteratively. This interface uses
Data.Iteratee
iteration, as shown below.
The Iter.mapM_
function generates an Iteratee
that maps a monadic
action over each element of the input stream. Here, the input stream
will be a list of transcripts, which will be written to the output
file using BS.hPutStrLn hout . transcriptToBedStd
. The
bedTranscriptEnum
encloses the BED format parser, allowing it to
convert an iteratee for a stream of [Transcript]
into an iteratee
for a stream of BS.ByteString
containing the BED format data. The
Iter.fileDriver
driver then applies the transformed bedIter
to the
data from a file.
module Main
where
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO
import qualified Data.Iteratee as Iter
import Bio.SeqLoc.Bed
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript
main :: IO ()
main = do withFile "test/bed-copy.bed" WriteMode $ \hout ->
let bedIter = bedTranscriptEnum $ Iter.mapM_ (BS.hPutStrLn hout . transcriptToBedStd)
in Iter.fileDriver bedIter "/data/genomes/Homo_sapiens/hg19_knownGene.bed"
A simpler interface in which the entire contents of an annotation file are read or written together is also provided, just as for GTF.