PyData Berlin 2016
Floto is an open source tool to programmatically author, schedule and run scalable data pipelines using AWS Simple Workflow - without the need to maintain a master server or queue or the state of workers.
There are quite a few great tools for building effective and robust distributed data processing pipelines, especially Luigi from Spotify and Airflow from Airbnb.
To scale out, though, they all require a queue or a master server, and those need maintenance.
We wrote floto (https://github.com/babbel/floto), an open source tool to programmatically author, schedule and run scalable data pipelines on AWS - without the maintenance overhead.
It uses AWS Simple Workflow, but I'll talk mostly about some general topics in data workflow orchestration:
- separation of concerns
- managing complexity through dependency reduction
- idempotent (or re-runnable) jobs
- transactional jobs (either completely fail, or completely succeed)
- failures and reruns
- evolving changes
- organizational scaling
- heterogeneous systems
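
To make two of the topics above concrete (idempotent and transactional jobs), here is a minimal Python sketch. It is a generic illustration and does not use floto's API. The names `run_job` and `compute` are hypothetical, and it assumes a job whose result is a single text file on a local filesystem.

```python
import os
import tempfile


def run_job(output_path, compute):
    """Run `compute` (hypothetical: returns a string) and publish it at `output_path`.

    Returns True if the job ran, False if the output already existed.
    """
    # Idempotent: if the output is already there, a rerun does nothing.
    if os.path.exists(output_path):
        return False

    directory = os.path.dirname(os.path.abspath(output_path))
    # Write to a temp file in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        # Transactional: the output appears in a single rename step, or not at all.
        os.replace(tmp_path, output_path)
    except BaseException:
        # On failure, remove the partial temp file so no half-written
        # result is left behind, then let the error propagate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, a failed job leaves nothing behind and can simply be rerun, and rerunning a job that already succeeded is harmless.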