Description
Pandas is a great library for working with tabular data. But what if you need to handle an amount of data that doesn't fit into memory? What if you want to distribute your computations across multiple machines?
Starting from a real scenario, Apache Spark will be presented as the main tool for reading and processing the collected data. We will see how a Pandas-like syntax comes in handy for running aggregations, filters, and groupings on a Spark DataFrame.
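As a taste of what this looks like, here is a minimal sketch of filtering, grouping, and aggregating with PySpark; the column names ("city", "price") and the Parquet input file are illustrative placeholders, since the talk reads its data from MongoDB instead:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Illustrative dataset; in the talk, data is fetched from MongoDB.
    df = spark.read.parquet("listings.parquet")

    result = (
        df.filter(F.col("price") > 100)           # filtering, like df[df.price > 100] in Pandas
          .groupBy("city")                        # grouping, like df.groupby("city")
          .agg(F.avg("price").alias("avg_price")) # aggregation, like .mean()
          .orderBy(F.desc("avg_price"))
    )

    result.show()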
Prior knowledge of Docker and Docker Compose will be very useful, while familiarity with MongoDB (from which the data will be fetched) is not mandatory. Basics of functional programming will help in understanding Spark's inner logic.