This talk presents Python tools and best practices for building reproducible petabyte-scale pipelines, in the context of demonstrating a new grammar-based approach to comparative genomics. The genome grammars are produced from public data from the National Institutes of Health, streamed over a high-throughput Internet2 connection to Amazon Web Services.
We introduce a high-performance, open-source application written in Python that models genomic data with a context-free grammar (CFG), a construct from formal language theory. This approach is intended to advance fundamental science by delivering a more extensive model of the genetic interaction of diseases. Current comparative models treat genomic sequences as strings, and recent advances are little more than optimizations of the "grep approach". However, a genome is a grammar: it is parsed, follows rules, and has an inherent hierarchical structure. Understanding the structure and rules of this implied grammar is essential for mapping loci to diseases when those loci are distributed across genomic regions.
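To make the idea of a sequence-as-grammar concrete, here is a minimal sketch of a toy CFG over nucleotides and a function that expands it back into the string it derives. The rule names (`S`, `R1`, `R2`) and the sequence are hypothetical and chosen only for illustration; they are not taken from the application described here.

```python
# Toy grammar (hypothetical, for illustration only).
# Keys are nonterminals; anything that is not a key is a terminal nucleotide.
GRAMMAR = {
    "S": ["R1", "G", "R1", "R2", "R2"],  # start rule
    "R1": ["A", "C", "R2"],              # R1 reuses R2: hierarchical structure
    "R2": ["T", "A"],
}


def expand(symbol, grammar):
    """Recursively expand a symbol into the terminal string it derives."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    return "".join(expand(s, grammar) for s in grammar[symbol])


sequence = expand("S", GRAMMAR)
print(sequence)
```

The repeated motifs in the raw string appear once each as rules, and the nesting of `R2` inside `R1` is the kind of hierarchy a flat string comparison cannot express.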
To produce the CFGs, we have implemented the Sequitur algorithm to run on the AWS Elastic MapReduce platform. The application is written in Python and uses the mrjob, boto, and pandas packages. The pipeline operates at petascale because it relies on inherently scalable services and takes advantage of the 100G Internet2 connection between Amazon Web Services and the National Institutes of Health (NIH). This architecture delivers unprecedented transfer speeds and relatively low latency.
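As a rough illustration of how a grammar can be inferred from a raw sequence, the sketch below repeatedly replaces the most frequent adjacent pair of symbols with a new rule. This is not Sequitur and not the implementation described here: Sequitur builds its grammar incrementally while enforcing digram uniqueness and rule utility, whereas this is a simpler offline relative in the style of Re-Pair, run in memory on a single machine. The function names, the `START` and `R<n>` rule names, and the sample sequence are all hypothetical, and the input is assumed to consist of single-character symbols.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_index = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_index.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. the middle of AAA
        counts[pair] += 1
        last_index[pair] = i
    return counts


def replace_digram(seq, pair, symbol):
    """Replace each left-to-right, non-overlapping occurrence of pair with symbol."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_grammar(sequence):
    """Infer a grammar by pair replacement (a Re-Pair-style stand-in, not Sequitur)."""
    seq = list(sequence)
    rules = {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so nothing more to factor out
        symbol = f"R{len(rules) + 1}"  # multi-character name cannot clash with a terminal
        rules[symbol] = list(pair)
        seq = replace_digram(seq, pair, symbol)
    rules["START"] = seq
    return rules


def expand(symbol, grammar):
    """Expand a symbol back into the terminal string it derives."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])


def grammar_size(grammar):
    """Total number of symbols on the right-hand sides of all rules."""
    return sum(len(body) for body in grammar.values())


raw = "ACGTACGTACGTACGT"  # hypothetical, highly repetitive toy sequence
grammar = build_grammar(raw)
assert expand("START", grammar) == raw  # the grammar is a lossless representation
print(grammar_size(grammar), "grammar symbols vs", len(raw), "raw symbols")
```

On repetitive input the grammar holds fewer symbols than the raw string while still expanding back to it exactly. Distributing this kind of inference across a MapReduce cluster would need further design decisions that this sketch does not attempt.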
We discuss the advantages of this architecture, especially for groups without comparable local resources. In reviewing the results of our computation, we look not only at methods for measuring the utility of our CFG models but also at the computational advantages of this approach. Like the fastest alignment algorithms, this more complex approach still operates in linear space. In addition, future pairwise comparisons are faster because our CFGs act as a compressed representation of the raw sequence data. We hope that this CFG approach will be tested further as a replacement for raw sequence analysis, and that our bioinformatics pipeline will serve as an example for the SciPy community of how to perform large computations across the many petabytes made available by NIH.