Description
Distributed systems are neat to demo, but hard to use in reality.
This talk goes through lessons learned from running hundreds of thousands of Dask clusters and billions of Python functions for users in critical production settings across many companies and research groups.
We'll cover lessons such as:
- GIL Vigilance is Good
- Kubernetes is too heavyweight if all you want is lots of jobs
- ARM is underused
- Docker doesn't work well for data science folks
- Availability-Zones are key for spot/GPU availability
- Adaptive is underused (but hard)
- Most workloads are small
- Most workloads are fast
- Most users don't scale up properly
- Most people overestimate costs

These lessons will be motivated by tons of metadata collected and aggregated from real-world workloads.