Process bigger-than-RAM data using Pandas, Dask and Vaex
Larger datasets can't fit into RAM - suddenly you can't use Pandas any more - but we need to analyse that data! First we'll review techniques to compress our data (maybe cutting our DataFrame RAM usage in half!) so we can process more rows using regular Pandas. Next we'll look at clever ways to make common operations run faster on DataFrames including dropping down to numpy, compiling with Numba and running multi-core. Finally for still-larger datasets we'll review Dask on Pandas and the new Vaex competitor solution. You'll leave with new techniques to make your DataFrames smaller and ideas for processing your data faster. This talk is inspired by Ian's work updating his O'Reilly book High Performance Python to the 2nd edition for 2020. With over 10 years of evolution the Pandas DataFrame library has gained a huge amount of functionality and it is used by millions of Pythonistas - but the most obvious way to solve a task isn't always the fastest or most RAM efficient. This talk will help any Pandas user (beginner or beyond) process more data faster, making them more effective at their jobs.