We present scikit-bio, a library built on the Python scientific computing stack that implements core bioinformatics data structures, algorithms, and parsers. scikit-bio is useful both for bioinformatics students, who can learn topics such as iterative progressive multiple sequence alignment from the source code and accompanying documentation, and for developers of real-world bioinformatics applications.
Python is widely used in computational biology, and many high-profile bioinformatics software projects, such as Galaxy, Khmer, and QIIME, are written largely or entirely in Python. We present scikit-bio, a new library built on the standard Python scientific computing stack (e.g., numpy, scipy, and matplotlib) that implements core bioinformatics data structures, algorithms, parsers, and formatters. scikit-bio is the first bioinformatics-centric scikit. It arises from over ten years of development effort on PyCogent and QIIME, and represents an effort both to update the functionality provided by these extensively used tools and to make that functionality more accessible. scikit-bio is intended to be useful both as a resource for students, who can learn topics such as heuristic-based sequence database searching or iterative progressive multiple sequence alignment from the source code and accompanying documentation, and as a powerful library for 'real-world' bioinformatics developers. To achieve these goals, scikit-bio development centers on test-driven, peer-reviewed software development; C/Cython integration for computationally expensive algorithms; extensive API documentation and doc-testing based on the numpy docstring standards; user documentation and theoretical discussion of topics in IPython Notebooks; adherence to PEP8; and continuous integration testing. scikit-bio is available free of charge under the BSD license.
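As a rough illustration of the documentation and doc-testing practice named above, the following sketch shows a function documented in the numpy docstring style, with an Examples section that is executed as a doctest. The function `gc_content` and the helper `run_docstring_examples` are hypothetical names chosen for this example; they are not part of the scikit-bio API, and only the Python standard library is used.

```python
import doctest


def gc_content(sequence):
    """Compute the fraction of G and C characters in a nucleotide sequence.

    Hypothetical example function, not taken from scikit-bio.

    Parameters
    ----------
    sequence : str
        Nucleotide sequence; case is ignored.

    Returns
    -------
    float
        Fraction of characters that are G or C, or 0.0 for an empty
        sequence.

    Examples
    --------
    >>> gc_content("GGCC")
    1.0
    >>> gc_content("ATGC")
    0.5
    >>> gc_content("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_docstring_examples():
    """Run the examples in gc_content's docstring.

    Returns a (failed, attempted) tuple of example counts.
    """
    # Parse the docstring into a DocTest, giving the examples access
    # to gc_content through the globals dictionary.
    test = doctest.DocTestParser().get_doctest(
        gc_content.__doc__, {"gc_content": gc_content}, "gc_content", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return (results.failed, results.attempted)
```

Calling `run_docstring_examples()` reports how many docstring examples were attempted and how many failed, so the documented examples double as tests. In a continuous integration setting, a check of this kind would typically be run over every module by a test runner rather than function by function.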