At Shopify we have over 3000 Python batch ETL jobs. These jobs depend upon each other’s output, forming a directed acyclic graph that, when visualized, is indistinguishable from the hairballs that my cat pukes up.
These jobs are created by a team of over 100 analysts and engineers who deploy, on average, 15 changes to production per day. With so many people and such a rapid pace of change, understanding how a dataset is constructed, debugging relationships, tracing the flow of data, or even just asking how prevalent a feature or type of relationship is has been a daunting task, requiring us to trace through not only 20k lines of YAML schedule files but also 50k lines of Python code.
To make asking questions about these jobs tractable, we’ve created a series of CLI tools that, when combined with Unix tools, make answering questions about our schedule possible.
I’ll cover how we flatten that graph into a series of tables that we can output using a CLI tool and then how one can use grep, awk, sort, join, and column to answer some real questions that we had about our schedule.