
Zero-Administration Data Pipelines using AWS Simple Workflow

Description

PyData Berlin 2016

Floto is an open source tool to programmatically author, schedule and run scalable data pipelines using AWS Simple Workflow - without the need to maintain a master server or queue or the state of workers.

There are quite a few great tools for building effective and robust distributed data processing pipelines, notably Luigi from Spotify and Airflow from Airbnb.

To scale out, though, they all require a queue or a master server, and those need maintenance.

We wrote floto (https://github.com/babbel/floto), an open source tool to programmatically author, schedule and run scalable data pipelines on AWS - without the maintenance overhead.

It uses AWS Simple Workflow, but I'll talk mostly about some general topics in data workflow orchestration:

  • separation of concerns
  • managing complexity through dependency reduction
  • idempotent (or re-runnable) jobs
  • transactional jobs (either completely fail, or completely succeed)
  • failures and reruns
  • evolving changes
  • organizational scaling
  • heterogeneous systems
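
Two of the topics above, idempotent jobs and transactional jobs, can be illustrated with a minimal sketch. It is a generic pattern and not floto's API: the names `run_job`, `compute` and `output_path` are hypothetical, and it assumes a job whose only side effect is writing one file on a local filesystem.

```python
import os
import tempfile


def run_job(output_path, compute):
    """Publish the result of compute() at output_path.

    Hypothetical helper for illustration only, not part of floto.
    Returns True if the job ran, False if it was skipped.
    """
    # Idempotent: if an earlier run already published the output, skip.
    if os.path.exists(output_path):
        return False

    # Transactional: write to a temporary file in the same directory,
    # so a crash never leaves a partial file at output_path.
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        # Renaming within one filesystem is atomic, so the output
        # appears either completely or not at all.
        os.replace(tmp_path, output_path)
    except BaseException:
        # On failure, clean up so a rerun starts from a clean state.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, a failed run can simply be rerun: it either finds the finished output and does nothing, or finds no output and starts over.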
