Introduction
The samtools
package provides a wrapper around the C
samtools library. The library
maintains data in the C native data structures but provides a more
idiomatic Haskell interface.
I begin this example by importing modules that I will use.
{-# LANGUAGE OverloadedStrings #-}
module Main
where
import Control.Exception
import Control.Monad
import qualified Data.ByteString.Char8 as BS
import Data.Maybe
import System.Environment
import System.Exit
import System.FilePath
import System.IO
import System.Process
import Bio.SeqLoc.LocRepr
import qualified Bio.SamTools.Bam as Bam
import qualified Bio.SamTools.BamIndex as BamIndex
import qualified Bio.SamTools.Cigar as Cigar
I also provide a small function to loop over a pair of monadic
actions. The first is an input, which returns Maybe a
, and the
second consumes the a
. The input and output actions are repeated
until the input action returns Nothing
. This will be used to loop
over all alignments from a file and process each with a monadic
action.
loop :: (IO (Maybe a)) -> (a -> IO ()) -> IO ()
loop mi mo = go
where go = mi >>= maybe (return ()) (\i -> mo i >> go)
BAM File Headers
The samtools
library keeps track of target sequences in a header
structure, and internally the target is recorded as an index into the
header sequence list. Here I open an alignment file, which
automatically loads the header, and write just the header data to a
second file. This header data includes the name and length of each
alignment target.
headerToIndex :: FilePath -> IO FilePath
headerToIndex inname = let outname = dropExtension inname ++ "-index.txt"
in bracket (Bam.openBamInFile inname) Bam.closeInHandle $ \hin ->
withFile outname WriteMode $ \hout ->
let hseqs = Bam.targetSeqList $ Bam.inHeader hin
in do forM_ hseqs $ \hseq -> hPutStrLn hout . concat $
[ "@SQ\tSN:", BS.unpack . Bam.name $ hseq, "\tLN:", show . Bam.len $ hseq ]
return outname
Reading and Writing Alignments
I start with simple functions to convert between binary and text
format files by reading the alignments using one interface and writing
them using another. There are separate InHandle
and OutHandle
data
types for reading and writing files. This example opens an InHandle
with openBamInFile
to read binary format alignments. It then opens
an OutHandle
with openTamOutFile
to write text format
alignments. The output file requires the header from the input file,
which is extracted from the InHandle
with inHeader
. The loop
utility above is then used to first read an alignment with get1
and
then write it with put1
.
bamToSam :: FilePath -> IO FilePath
bamToSam inname = let outname = dropExtension inname ++ "-test.sam"
in bracket (Bam.openBamInFile inname) Bam.closeInHandle $ \hin ->
bracket (Bam.openTamOutFile outname (Bam.inHeader hin)) Bam.closeOutHandle $ \hout -> do
loop (Bam.get1 hin) (Bam.put1 hout)
return outname
An analogous conversion can be performed to read text format alignments and write then to a binary format file.
samToBam :: FilePath -> IO FilePath
samToBam inname = let outname = dropExtension inname ++ "-test.bam"
in bracket (Bam.openTamInFile inname) Bam.closeInHandle $ \hin ->
bracket (Bam.openBamOutFile outname (Bam.inHeader hin)) Bam.closeOutHandle $ \hout -> do
loop (Bam.get1 hin) (Bam.put1 hout)
return outname
Accessing Alignment Data
The Bam
format alignment provides accessor functions for the
alignment data. It also provides a Show
instance that returns the
text-format alignment data line. Here I write a function that compares
the results of the accessor functions to the results of manually
extracting fields from the text format data. I test four accessors
against the name of the query sequence, the name of the alignment
target, the position of the alignment (which is 1-based in the text
SAM format), and the query sequence itself.
parseBam :: Bam.Header -> Bam.Bam1 -> IO ()
parseBam hdr b = sequence_ [ verify (Bam.queryName b) (bamfields !! 0)
, verify (Bam.targetName b) (Just $ bamfields !! 2)
, verify (liftM (BS.pack . show . succ) . Bam.position $ b) (Just $ bamfields !! 3)
, verify (Bam.querySeq b) (Just $ bamfields !! 9)
]
where bamfields = BS.split '\t' . BS.pack . show $ b
verify s1 s2 | s1 == s2 = return ()
| otherwise = error $ "Mismatch: " ++ show (s1, s2)
Querying a Sorted BAM File
One great advantage of the SAM format is that alignments can be sorted
and indexed to allow rapid random access to all alignments that lie
within a specific region. This makes it possible to write a genome
browser that displays alignments by retrieving just those alignments
that overlap the genomic region on the screen, or a script that
processes alignments mapping to a single locus, without loading all
alignments into memory. The BamIndex
module provides access to this
interface.
First I take an abritrary region corresponding to a gene near the start of the human genome annotation data file.
geneTargetName = "chr1"
geneTargetBounds = (14362 - 100, 16764 + 100)
geneRegion = BS.concat [ geneTargetName, ":"
, BS.pack . show . fst $ geneTargetBounds, "-"
, BS.pack . show . snd $ geneTargetBounds
]
Next I extract the alignments within that region using the BamIndex
interface. I look up the id of the named target sequence, "chr1"
,
in the header. I use the id, along with the coordinates, to query
the indexed BAM file. I then use the iterator-style interface on the
query, provided by next
, to retrieve alignments and re-write them to
a text-format data file as well as writing some summary information as
text to a second file.
extract :: FilePath -> IO FilePath
extract inname = let outname = dropExtension inname ++ "-gene.sam"
outname2 = dropExtension inname ++ "-gene-sploc.sam"
in bracket (BamIndex.open inname) (BamIndex.close) $ \idxin ->
let header = BamIndex.idxHeader idxin
tid = fromMaybe (error $ "No sequence " ++ show geneTargetName) $
Bam.lookupTarget (BamIndex.idxHeader idxin) $ geneTargetName
in bracket (Bam.openTamOutFile outname header) Bam.closeOutHandle $ \hout ->
withFile outname2 WriteMode $ \hout2 -> do
unless (Bam.targetSeqName header tid == geneTargetName) $ error "Bad target name"
q <- BamIndex.query idxin tid geneTargetBounds
loop (BamIndex.next q) $ \b -> do
Bam.put1 hout b
hPutStrLn hout2 . concat $
[ maybe "n/a" (BS.unpack . repr) . Bam.refSeqLoc $ b
, "\t"
, show b
]
return outname
Putting it All Together
I start with two small utilities to run system commands with verbose error handling.
rawSystemE :: String -> [String] -> IO ()
rawSystemE prog args = rawSystem prog args >>= checkExit
where checkExit ExitSuccess = return ()
checkExit (ExitFailure err) = error $ show (prog : args) ++ " => " ++ show err
rawSystem_ :: String -> [String] -> IO ()
rawSystem_ prog args = rawSystem prog args >>= checkExit >> return ()
where checkExit ExitSuccess = return ()
checkExit (ExitFailure err) = hPutStrLn stderr $ show (prog : args) ++ " => " ++ show err
rawSystemP :: String -> [String] -> IO ()
rawSystemP prog args = rawSystem prog args >>= checkExit >> return ()
where checkExit ExitSuccess = hPutStrLn stderr $ show (prog : args) ++ " => 0"
checkExit (ExitFailure err) = hPutStrLn stderr $ show (prog : args) ++ " => " ++ show err
I use these to run the samtools
binary and compare the results to
the functions above, which use the samtools
library.
doSamTest :: FilePath -> IO ()
doSamTest bamin = do samout <- bamToSam bamin
let samout' = samout ++ "_samtools"
rawSystemE "samtools" [ "view", bamin, "-h", "-o", samout' ]
rawSystemP "diff" [ samout, samout' ]
samout2 <- samToBam samout >>= bamToSam
rawSystemP "diff" [ samout, samout2 ]
genesam <- extract bamin
let genesam' = genesam ++ "_samtools"
rawSystemE "samtools" [ "view", bamin, "-h", "-o", genesam', BS.unpack geneRegion ]
rawSystemP "diff" [ genesam, genesam' ]
header <- headerToIndex bamin
let header' = header ++ "_samtools"
rawSystemE "samtools" [ "view", bamin, "-H", "-o", header' ]
rawSystemP "diff" [ header, header' ]
bracket (Bam.openBamInFile bamin) Bam.closeInHandle $ \hin ->
Bam.get1 hin >>= maybe (return ()) (parseBam (Bam.inHeader hin))
Finally, I write a main
function that takes a single BAM format
alignment, sorted and indexed, and performs the above tests.
main :: IO ()
main = getArgs >>= mainWithArgs
where mainWithArgs [ bamin ] = doSamTest bamin
mainWithArgs _ = do prog <- getProgName
error . unwords $ [ "USAGE:", prog, "<IN.BAM>" ]