In data science (in all its variants), a significant part of an individual's time is spent preparing data into a digestible format. In general, a data science pipeline starts with the acquisition of raw data, which is then manipulated through ETL processes, and ends with a series of analytics. Good data pipelines can be used to automate and schedule these steps, help with monitoring tasks, and even dynamically train models. On top of that, they make the analyses easier to reproduce and productise.
In this workshop, you will learn how to migrate from 'script soups' (sets of scripts that must be run in a particular order) to robust, reproducible and easy-to-schedule data pipelines in Airflow. First, we will learn how to write simple recurrent ETL pipelines. We will then integrate logging and monitoring capabilities. Finally, we will use Airflow together with Jupyter Notebooks and Papermill to produce reproducible analytics reports.
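To illustrate the starting point, here is a minimal sketch of a 'script soup' refactored into discrete extract/transform/load steps. All names and data are hypothetical; in Airflow, each function would typically become its own task (for example via the `PythonOperator` or the `@task` decorator), giving you scheduling, retries and monitoring for free.

```python
# Hypothetical ETL steps that previously lived in separate, order-dependent
# scripts. Splitting them into functions is the first step towards mapping
# each one onto an Airflow task.

def extract():
    # Stand-in for raw data acquisition (an API call, a file read, ...)
    return [{"city": "London", "temp_c": 15}, {"city": "Oslo", "temp_c": -5}]

def transform(rows):
    # Keep only above-freezing readings and convert Celsius to Fahrenheit
    return [
        {"city": r["city"], "temp_f": r["temp_c"] * 9 / 5 + 32}
        for r in rows
        if r["temp_c"] > 0
    ]

def load(rows):
    # Stand-in for writing to a database or a report
    return {r["city"]: r["temp_f"] for r in rows}

result = load(transform(extract()))
```

Because each step takes its predecessor's output as input, the dependency order that was implicit in the script soup becomes explicit, which is exactly what a DAG-based scheduler like Airflow needs.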