Ingolia Lab at UC Berkeley

Introduction

These auxiliary Bio.SeqLoc modules provide functions to read and write gene annotations in GTF and BED formats. I separated them to minimize the number of dependencies for the main seqloc package.

GTF

A single GTF annotation can span multiple lines and the "[o]rder of rows is not important", so the entire file must be loaded before any transcripts can be assembled.
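To illustrate why the whole file has to be in hand first, here is a minimal sketch of grouping GTF rows by their transcript_id attribute. It is not the parser used by Bio.SeqLoc.GTF; the names splitOn, transcriptId, and groupByTranscript are hypothetical helpers written only for this example, and the attribute handling assumes the common `key "value";` layout.

```haskell
import qualified Data.Map.Strict as Map

-- | Split a string on a delimiter character (hypothetical helper).
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn c rest

-- | Pull the transcript_id out of one GTF row, if the row has the
-- nine tab-separated fields and carries that attribute.
-- Illustration only; assumes attributes look like: key "value";
transcriptId :: String -> Maybe String
transcriptId row =
  case splitOn '\t' row of
    [_, _, _, _, _, _, _, _, attrs] -> lookupAttr (words attrs)
    _                               -> Nothing
  where
    lookupAttr ("transcript_id" : val : _) = Just (filter (`notElem` "\";") val)
    lookupAttr (_ : rest)                  = lookupAttr rest
    lookupAttr []                          = Nothing

-- | Collect all rows belonging to each transcript. Rows without a
-- transcript_id are dropped. Because rows for one transcript may
-- appear anywhere in the file, no group is known to be complete
-- until every row has been seen.
groupByTranscript :: [String] -> Map.Map String [String]
groupByTranscript rows =
  -- fromListWith calls the combining function as (f new old), so
  -- flip (++) appends new rows after old ones and keeps input order.
  Map.fromListWith (flip (++))
    [ (tid, [row]) | row <- rows, Just tid <- [transcriptId row] ]
```

Given rows for two transcripts interleaved in the input, Map.toList (groupByTranscript rows) yields one entry per transcript, each holding that transcript's rows in their original order. Only after the last row has been consumed can any entry be turned into a transcript.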

The following program reads a full human GTF annotation, takes the first 10 transcripts, writes them to a small test file, re-reads them, and verifies that a round trip through transcriptToGtf followed by readGtfTranscripts leaves the transcript locations unchanged.

module Main
       where

import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO

import Bio.SeqLoc.GTF
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript

main :: IO ()
main = do trx <- readGtfTranscripts "/data/genomes/Homo_sapiens/hg19_knownGene.gtf"
          let trx10 = take 10 trx
          BS.writeFile "test/gtf-out10.gtf" . BS.concat . map (transcriptToGtf "TestGtf") $ trx10
          trx10' <- readGtfTranscripts "test/gtf-out10.gtf"
          print $ (sort . map location $ trx10) == (sort . map location $ trx10')

BED

A single BED annotation occupies a single line, so it is possible to process BED format annotations iteratively. This interface uses Data.Iteratee iteration, as shown below.

The Iter.mapM_ function generates an Iteratee that maps a monadic action over each element of the input stream. Here, the input stream is a list of transcripts, each of which is written to the output file using BS.hPutStrLn hout . transcriptToBedStd. The bedTranscriptEnum function wraps the BED format parser, converting an iteratee over a stream of [Transcript] into an iteratee over a stream of BS.ByteString containing the BED format data. The Iter.fileDriver driver then applies the transformed bedIter to the data from a file.

module Main
       where

import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.List
import System.IO

import qualified Data.Iteratee as Iter

import Bio.SeqLoc.Bed
import Bio.SeqLoc.LocRepr
import Bio.SeqLoc.OnSeq
import Bio.SeqLoc.Transcript

main :: IO ()
main = do withFile "test/bed-copy.bed" WriteMode $ \hout ->
            let bedIter = bedTranscriptEnum $ Iter.mapM_ (BS.hPutStrLn hout . transcriptToBedStd)
            in Iter.fileDriver bedIter "/data/genomes/Homo_sapiens/hg19_knownGene.bed"

A simpler interface in which the entire contents of an annotation file are read or written together is also provided, just as for GTF.