Ingolia Lab at UC Berkeley

Introduction

The samtools package provides a wrapper around the C samtools library. The library maintains data in the C native data structures but provides a more idiomatic Haskell interface.

I begin this example by importing modules that I will use.

{-# LANGUAGE OverloadedStrings #-}
module Main
       where 

import Control.Exception
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.Maybe
import System.Environment
import System.Exit
import System.FilePath
import System.IO
import System.Process

import Bio.SeqLoc.LocRepr

import qualified Bio.SamTools.Bam as Bam
import qualified Bio.SamTools.BamIndex as BamIndex
import qualified Bio.SamTools.Cigar as Cigar

I also provide a small function to loop over a pair of monadic actions. The first is an input, which returns Maybe a, and the second consumes the a. The input and output actions are repeated until the input action returns Nothing. This will be used to loop over all alignments from a file and process each with a monadic action.

loop :: (IO (Maybe a)) -> (a -> IO ()) -> IO ()
loop mi mo = go
  where go = mi >>= maybe (return ()) (\i -> mo i >> go)

BAM File Headers

The samtools library keeps track of target sequences in a header structure, and internally the target is recorded as an index into the header sequence list. Here I open an alignment file, which automatically loads the header, and write just the header data to a second file. This header data includes the name and length of each alignment target.

headerToIndex :: FilePath -> IO FilePath
headerToIndex inname = let outname = dropExtension inname ++ "-index.txt"
                       in bracket (Bam.openBamInFile inname) Bam.closeInHandle $ \hin ->
                       withFile outname WriteMode $ \hout -> 
                       let hseqs = Bam.targetSeqList $ Bam.inHeader hin
                       in do forM_ hseqs $ \hseq -> hPutStrLn hout . concat $ 
                                                    [ "@SQ\tSN:", BS.unpack . Bam.name $ hseq, "\tLN:", show . Bam.len $ hseq ]
                             return outname

Reading and Writing Alignments

I start with simple functions to convert between binary and text format files by reading the alignments using one interface and writing them using another. There are separate InHandle and OutHandle data types for reading and writing files. This example opens an InHandle with openBamInFile to read binary format alignments. It then opens an OutHandle with openTamOutFile to write text format alignments. The output file requires the header from the input file, which is extracted from the InHandle with inHeader. The loop utility above is then used to first read an alignment with get1 and then write it with put1.

bamToSam :: FilePath -> IO FilePath
bamToSam inname = let outname = dropExtension inname ++ "-test.sam"
                  in bracket (Bam.openBamInFile inname) Bam.closeInHandle $ \hin ->
                  bracket (Bam.openTamOutFile outname (Bam.inHeader hin)) Bam.closeOutHandle $ \hout -> do
                    loop (Bam.get1 hin) (Bam.put1 hout)
                    return outname

An analogous conversion can be performed to read text format alignments and write then to a binary format file.

samToBam :: FilePath -> IO FilePath
samToBam inname = let outname = dropExtension inname ++ "-test.bam"
                  in bracket (Bam.openTamInFile inname) Bam.closeInHandle $ \hin ->
                  bracket (Bam.openBamOutFile outname (Bam.inHeader hin)) Bam.closeOutHandle $ \hout -> do
                    loop (Bam.get1 hin) (Bam.put1 hout)
                    return outname

Accessing Alignment Data

The Bam format alignment provides accessor functions for the alignment data. It also provides a Show instance that returns the text-format alignment data line. Here I write a function that compares the results of the accessor functions to the results of manually extracting fields from the text format data. I test four accessors against the name of the query sequence, the name of the alignment target, the position of the alignment (which is 1-based in the text SAM format), and the query sequence itself.

parseBam :: Bam.Header -> Bam.Bam1 -> IO ()
parseBam hdr b = sequence_ [ verify (Bam.queryName b)  (bamfields !! 0)
                           , verify (Bam.targetName b) (Just $ bamfields !! 2)
                           , verify (liftM (BS.pack . show . succ) . Bam.position $ b) (Just $ bamfields !! 3)
                           , verify (Bam.querySeq b)   (Just $ bamfields !! 9) 
                           ]
  where bamfields = BS.split '\t' . BS.pack . show $ b
        verify s1 s2 | s1 == s2 = return ()
                     | otherwise = error $ "Mismatch: " ++ show (s1, s2)

Querying a Sorted BAM File

One great advantage of the SAM format is that alignments can be sorted and indexed to allow rapid random access to all alignments that lie within a specific region. This makes it possible to write a genome browser that displays alignments by retrieving just those alignments that overlap the genomic region on the screen, or a script that processes alignments mapping to a single locus, without loading all alignments into memory. The BamIndex module provides access to this interface.

First I take an abritrary region corresponding to a gene near the start of the human genome annotation data file.

geneTargetName = "chr1"
geneTargetBounds = (14362 - 100, 16764 + 100)
geneRegion = BS.concat [ geneTargetName, ":"
                          , BS.pack . show . fst $ geneTargetBounds, "-"
                          , BS.pack . show . snd $ geneTargetBounds
                          ]

Next I extract the alignments within that region using the BamIndex interface. I look up the id of the named target sequence, "chr1", in the header. I use the id, along with the coordinates, to query the indexed BAM file. I then use the iterator-style interface on the query, provided by next, to retrieve alignments and re-write them to a text-format data file as well as writing some summary information as text to a second file.

extract :: FilePath -> IO FilePath
extract inname = let outname = dropExtension inname ++ "-gene.sam"
                     outname2 = dropExtension inname ++ "-gene-sploc.sam"
                 in bracket (BamIndex.open inname) (BamIndex.close) $ \idxin ->
                 let header = BamIndex.idxHeader idxin
                     tid = fromMaybe (error $ "No sequence " ++ show geneTargetName) $ 
                           Bam.lookupTarget (BamIndex.idxHeader idxin) $ geneTargetName
                 in bracket (Bam.openTamOutFile outname header) Bam.closeOutHandle $ \hout -> 
                 withFile outname2 WriteMode $ \hout2 -> do
                   unless (Bam.targetSeqName header tid == geneTargetName) $ error "Bad target name"
                   q <- BamIndex.query idxin tid geneTargetBounds
                   loop (BamIndex.next q) $ \b -> do
                     Bam.put1 hout b
                     hPutStrLn hout2 . concat $ 
                       [ maybe "n/a" (BS.unpack . repr) . Bam.refSeqLoc $ b
                       , "\t"
                       , show b
                       ]
                   return outname

Putting it All Together

I start with two small utilities to run system commands with verbose error handling.

rawSystemE :: String -> [String] -> IO ()
rawSystemE prog args = rawSystem prog args >>= checkExit
  where checkExit ExitSuccess = return ()
        checkExit (ExitFailure err) = error $ show (prog : args) ++ " => " ++ show err

rawSystem_ :: String -> [String] -> IO ()
rawSystem_ prog args = rawSystem prog args >>= checkExit >> return ()
  where checkExit ExitSuccess = return ()
        checkExit (ExitFailure err) = hPutStrLn stderr $ show (prog : args) ++ " => " ++ show err

rawSystemP :: String -> [String] -> IO ()
rawSystemP prog args = rawSystem prog args >>= checkExit >> return ()
  where checkExit ExitSuccess = hPutStrLn stderr $ show (prog : args) ++ " => 0"
        checkExit (ExitFailure err) = hPutStrLn stderr $ show (prog : args) ++ " => " ++ show err

I use these to run the samtools binary and compare the results to the functions above, which use the samtools library.

doSamTest :: FilePath -> IO ()
doSamTest bamin = do samout <- bamToSam bamin
                     let samout' = samout ++ "_samtools"
                     rawSystemE "samtools" [ "view", bamin, "-h", "-o", samout' ]
                     rawSystemP "diff" [ samout, samout' ]
                     samout2 <- samToBam samout >>= bamToSam
                     rawSystemP "diff" [ samout, samout2 ]

                     genesam <- extract bamin
                     let genesam' = genesam ++ "_samtools"                     
                     rawSystemE "samtools" [ "view", bamin, "-h", "-o", genesam', BS.unpack geneRegion ]
                     rawSystemP "diff" [ genesam, genesam' ]

                     header <- headerToIndex bamin
                     let header' = header ++ "_samtools"
                     rawSystemE "samtools" [ "view", bamin, "-H", "-o", header' ]
                     rawSystemP "diff" [ header, header' ]

                     bracket (Bam.openBamInFile bamin) Bam.closeInHandle $ \hin ->
                       Bam.get1 hin >>= maybe (return ()) (parseBam (Bam.inHeader hin))

Finally, I write a main function that takes a single BAM format alignment, sorted and indexed, and performs the above tests.

main :: IO ()
main = getArgs >>= mainWithArgs
  where mainWithArgs [ bamin ] = doSamTest bamin
        mainWithArgs _ = do prog <- getProgName
                            error . unwords $ [ "USAGE:", prog, "<IN.BAM>" ]