Dask is a flexible tool for parallelizing Python code on a single machine or across a cluster. It builds upon familiar tools in the PyData ecosystem (e.g. NumPy and Pandas) while allowing them to scale across multiple cores or machines. This tutorial will cover both the high-level use of dask collections, as well as the low-level use of dask graphs and schedulers.
Dask is a flexible tool for parallelizing Python code on a single machine or across a cluster.
We can think of dask at a high and a low level
- High level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic and build upon NumPy arrays, Python lists, and Pandas DataFrames, but that can operate in parallel on datasets that do not fit into main memory.
- Low Level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads to expose latent parallelism in procedural code. These schedulers are low-latency and run computations with a small memory footprint.
Different users operate at different levels but it is useful to understand both. This tutorial will cover both the high-level use of dask.array and dask.dataframe and the low-level use of dask graphs and schedulers. Attendees will come away
- Able to use dask.delayed to parallelize existing code
- Understanding the differences between the dask schedulers, and when to use one over another
- With a firm understanding of the different dask collections (dask.array and dask.dataframe) and how and when to use them.