PyData DC 2016
This talk discusses parallel and distributed computing in Python, particularly for ad-hoc and custom algorithms. It focuses on Dask, a Python solution for flexible distributed computing.
The Python data science stack contains efficient algorithms with intuitive interfaces for sophisticated and friendly analysis. As the data science community tackles larger problems with larger hardware we naturally ask how best to parallelize this software stack both across many cores in a single computer and across computers in a cluster. This turns out to be harder than it looks, even with traditional Big Data tools like MapReduce, Storm, and Spark. Both the complexity of the algorithms and the high expectations for interactivity raise challenges for these systems. This talk lays out the benefits and challenges of parallelizing a numeric analytic stack, and then describes Dask, a parallel framework gaining traction within the Python community for interactive performant parallel computing, and finally goes through a few domains where this work is enabling novel science today.