DawgPack: Multiple Genome Alignment and Analysis in the Cloud
Contact
Juber Patel
PhD Student, University of Georgia
juberpatel at gmail dot com
DawgPack is an ultrafast cloud-computing framework that aligns next-generation sequencing reads against a reference genome, aggregates the results from aligning multiple genomes, and performs statistical analysis to draw biologically meaningful conclusions. As an example of a possible analysis, we describe building a statistical distribution of depth of coverage (DOC) for each position in the genome by aligning healthy genomes, and then using that distribution to predict abnormal regions in a given genome. The more genomes are added to the distribution, the more accurate the inferences drawn from it become. Many cancer genomes could also be compared with one another to identify their recurring features. With the unprecedented alignment performance of DawgPack, our aim is to align every human genome sequenced at high coverage.
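The depth-of-coverage idea can be illustrated with a small sketch. It is not DawgPack's code: the class name DocDistribution, its methods, and the z-score test with a fixed threshold are assumptions made for illustration. It keeps a running mean and variance of depth at each position (Welford's method), so healthy genomes can be added one at a time, and then flags positions in a query genome whose depth deviates strongly from that distribution.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: per-position depth-of-coverage statistics across healthy genomes. */
class DocDistribution {
    private final double[] mean; // running mean depth at each position
    private final double[] m2;   // running sum of squared deviations at each position
    private long genomes;        // number of healthy genomes added so far

    DocDistribution(int genomeLength) {
        mean = new double[genomeLength];
        m2 = new double[genomeLength];
    }

    /** Adds one healthy genome's per-position depths (Welford's online update). */
    void addGenome(int[] depth) {
        if (depth.length != mean.length) {
            throw new IllegalArgumentException("depth array length must match genome length");
        }
        genomes++;
        for (int i = 0; i < depth.length; i++) {
            double delta = depth[i] - mean[i];
            mean[i] += delta / genomes;
            m2[i] += delta * (depth[i] - mean[i]);
        }
    }

    /** Standard score of an observed depth against the healthy distribution at a position. */
    double zScore(int position, int depth) {
        if (genomes < 2) {
            throw new IllegalStateException("need at least two genomes to estimate variance");
        }
        double sd = Math.sqrt(m2[position] / (genomes - 1));
        double diff = depth - mean[position];
        if (sd == 0.0) {
            // No spread observed yet: any deviation is infinitely surprising.
            return diff == 0.0 ? 0.0 : Math.signum(diff) * Double.POSITIVE_INFINITY;
        }
        return diff / sd;
    }

    /** Positions whose depth lies more than `threshold` standard deviations from the mean. */
    List<Integer> abnormalPositions(int[] depth, double threshold) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < depth.length; i++) {
            if (Math.abs(zScore(i, depth[i])) > threshold) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

Because the statistics are updated incrementally, adding another healthy genome never requires revisiting earlier ones. A real analysis would likely work on windows of positions rather than single bases and would need a more careful statistical model than a plain z-score.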
DawgPack uses a variant of the bidirectional Burrows-Wheeler transform to align reads to a reference genome. On a single Aligner, DawgPack takes 65 seconds for the paired-end alignment of a million reads, compared with 150 seconds for Bowtie and 275 seconds for BWA with the same parameters. DawgPack's performance in the Cloud is unmatched: speed is achieved by avoiding hard disk access and efficiently managing machine-to-machine communication. Unlike aligners that emit SAM records in read order, DawgPack produces its results ordered by genomic position, which makes further analysis very fast.
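For readers unfamiliar with Burrows-Wheeler alignment, the following toy sketch shows the basic, unidirectional backward search that such aligners build on. It is not DawgPack's bidirectional variant, and the class ToyFmIndex and its method names are hypothetical. It builds the transform by naively sorting suffixes, which is only practical for tiny references, and counts exact occurrences of a read.

```java
import java.util.Arrays;

/** Hypothetical toy FM-index: exact-match counting via BWT backward search. */
class ToyFmIndex {
    private static final String ALPHABET = "$ACGT"; // symbols in sorted order
    private final int[] less = new int[ALPHABET.length()]; // less[s] = text symbols smaller than s
    private final int[][] occ; // occ[s][i] = occurrences of s in the first i BWT characters

    ToyFmIndex(String reference) {
        for (int i = 0; i < reference.length(); i++) {
            if ("ACGT".indexOf(reference.charAt(i)) < 0) {
                throw new IllegalArgumentException("unsupported base: " + reference.charAt(i));
            }
        }
        String text = reference + "$"; // unique terminator, smaller than every base
        int n = text.length();

        // Naive suffix array: sort suffix start positions lexicographically.
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));

        // The BWT character for a suffix is the character preceding it in the text.
        occ = new int[ALPHABET.length()][n + 1];
        int[] counts = new int[ALPHABET.length()];
        for (int i = 0; i < n; i++) {
            int sym = ALPHABET.indexOf(text.charAt((sa[i] + n - 1) % n));
            for (int s = 0; s < ALPHABET.length(); s++) occ[s][i + 1] = occ[s][i];
            occ[sym][i + 1]++;
            counts[sym]++;
        }
        for (int s = 1; s < ALPHABET.length(); s++) less[s] = less[s - 1] + counts[s - 1];
    }

    /** Number of exact occurrences of the read in the reference. */
    int count(String read) {
        if (read.isEmpty()) return 0;
        int lo = 0;
        int hi = occ[0].length - 1; // half-open suffix-array interval [lo, hi)
        for (int i = read.length() - 1; i >= 0 && lo < hi; i--) {
            int sym = ALPHABET.indexOf(read.charAt(i));
            if (sym <= 0) return 0; // unknown base, or the terminator
            lo = less[sym] + occ[sym][lo];
            hi = less[sym] + occ[sym][hi];
        }
        return Math.max(0, hi - lo);
    }
}
```

Each step of the loop extends the match by one base to the left and narrows the interval of matching suffixes, so the search cost depends on the read length rather than the reference length. A bidirectional index additionally allows extending matches to the right, which this sketch does not attempt.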
DawgPack aims to be a seamlessly integrated analysis pipeline that includes various modules to draw inferences from the alignment results. Apart from the copy number variation (CNV) analysis, we plan to add analysis modules for RNA-Seq, SNP calling, ChIP-Seq and Methyl-Seq. We believe that only an integrated analysis can reveal the definitive markers and underlying processes involved in pathogenesis.
DawgPack is written from scratch in Java in a modular, extensible fashion. Each module of DawgPack is fully multi-threaded to take advantage of all the CPU cores on a machine.
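As a rough illustration of using every core from Java, the sketch below splits a list of reads into one chunk per available processor and processes the chunks on a fixed-size thread pool. The class ParallelReads, the method mapInParallel, and the chunking strategy are assumptions for illustration only and do not describe how DawgPack's modules are actually organized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: apply a per-read function on all available cores. */
class ParallelReads {
    /** Applies perRead to every read in parallel and returns results in input order. */
    static <R> List<R> mapInParallel(List<String> reads, Function<String, R> perRead) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One contiguous chunk per thread keeps scheduling overhead low.
            int chunk = Math.max(1, (reads.size() + threads - 1) / threads);
            List<Future<List<R>>> pending = new ArrayList<>();
            for (int start = 0; start < reads.size(); start += chunk) {
                List<String> slice = reads.subList(start, Math.min(start + chunk, reads.size()));
                pending.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>(slice.size());
                    for (String read : slice) out.add(perRead.apply(read));
                    return out;
                }));
            }
            // Collect chunk results in submission order to preserve input order.
            List<R> results = new ArrayList<>(reads.size());
            for (Future<List<R>> future : pending) results.addAll(future.get());
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing reads", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, ParallelReads.mapInParallel(reads, String::length) would compute read lengths across all cores. The per-read function must be safe to call from several threads at once.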