Authors: Quinlan, Aaron, University of Virginia; Paila, Uma, University of Virginia; Chapman, Brad, Harvard School of Public Health; Kirchner, Rory,
The throughput of DNA sequencing has increased by five orders of magnitude in the last decade, and geneticists can now sequence a complete human genome in 24 hours for less than $5,000. This tremendous increase in efficiency has led to large-scale studies of the relationship between inherited genetic variation and human disease. While collecting genetic variation from the genomes of thousands of humans is now possible, unraveling the genetic basis of disease remains a tremendous analytical challenge. Interpretation is especially difficult because many genetic variants associated with human disease lie outside the genomic regions that encode genes. To address this challenge, we have developed GEMINI, a flexible Python analysis framework for exploring human genetic variation. By leveraging NumPy, SQLite, and several powerful Python packages in the genomics domain, GEMINI integrates genome-scale genetic variation from thousands of individuals with a wealth of genome annotations that are crucial for disease interpretation.
GEMINI lets researchers conduct otherwise complicated analyses through an easy-to-use interface. It provides methods for ad hoc data exploration, a programming interface for custom analyses, and both command-line and graphical tools for common analysis tasks. We demonstrate GEMINI's utility for exploring variation in personal genomes and family-based genetic studies. Using tools such as IPython.parallel, we further illustrate the framework's ability to scale to studies involving thousands of human samples.
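
To make the idea of integrating variants with genome annotations in SQLite concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not GEMINI's actual schema or API: the table names, column names, sample rows, and the helper functions build_demo_db and annotated_variants are all hypothetical and chosen for illustration only. The sketch labels each variant with any annotation interval it overlaps, which is the kind of ad hoc question a SQL-backed framework makes easy to ask.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variants (
            variant_id INTEGER PRIMARY KEY,
            chrom      TEXT,
            start_pos  INTEGER,  -- 0-based, half-open interval start
            end_pos    INTEGER,  -- half-open interval end
            ref        TEXT,
            alt        TEXT
        );
        CREATE TABLE annotations (
            chrom     TEXT,
            start_pos INTEGER,
            end_pos   INTEGER,
            label     TEXT
        );
        """
    )
    # Made-up example rows; these are not real variants or annotations.
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "chr1", 100, 101, "A", "G"),
            (2, "chr1", 500, 501, "C", "T"),
            (3, "chr2", 250, 251, "G", "A"),
        ],
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (?, ?, ?, ?)",
        [
            ("chr1", 90, 150, "enhancer"),
            ("chr2", 200, 300, "exon"),
        ],
    )
    return conn


def annotated_variants(conn):
    """Return each variant with the label of any overlapping annotation.

    Two half-open intervals overlap when each starts before the other ends.
    The LEFT JOIN keeps variants that overlap no annotation (label is None).
    """
    query = """
        SELECT v.variant_id, v.chrom, v.start_pos, a.label
        FROM variants AS v
        LEFT JOIN annotations AS a
          ON a.chrom = v.chrom
         AND a.start_pos < v.end_pos
         AND v.start_pos < a.end_pos
        ORDER BY v.variant_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in annotated_variants(build_demo_db()):
        print(row)
```

In this toy data, the second variant falls outside every annotation interval, so it comes back with a label of None rather than being dropped. Keeping unannotated variants visible matters here because, as noted above, many disease-associated variants lie outside gene-coding regions.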