Contribute Media
A thank you to everyone who has made this possible: Read More

Distributed computing with Dask


Dask is a modern parallel computing library completely written in Python. It is extremely flexible, being able to work well on a laptop, using all available cores in parallel, or scale up to a cluster of hundreds of nodes.

Instead of forcing you to wrap your code to use the map-reduce paradigm, it mimics the numpy array and pandas dataframe interfaces, so you can continue doing everything the same way you always do.

Dask main abstraction is a Directed Acyclic Graph called "dask" (distributed task) implemented as a simple dictionary. The different interfaces (bag, array, dataframe) create these dasks, that are later computed in a distributed fashion using a suitable scheduler.

Forget about the JVM overhead. The future is now, the future is dask!

Slides available at

Improve this page