Description
Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space.
Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Other tools (Spark, DuckDB, Polars) just did this better.
Fortunately, these pain points have been fixed with the following features:
- A new and vastly improved shuffle algorithm
- A logical query planning layer to improve performance and usability
- A reduced memory footprint through a more efficient data model due to pandas 2.0
We will look into how these changes work together across pandas, Arrow, and Dask to provide a better UX and a more robust and faster system overall. Additionally, we will look into a comparison of Dask against other tools in the big data space, including Spark, Polars and DuckDB.
We will use the TPC-H benchmarks to compare these tools. We will look ahead into what the future will bring for pandas and Dask and how the logical query planning layer can be extended to fit other frameworks like Dask Array and XArray.