
Self-Healing Systems: The Road to 99.99% Uptime


Stop firefighting and start fireproofing! Many tools make on-call easier and increase availability, but we'll mostly focus on a few principles and design patterns that help make your systems more robust.

Abstract: Feature velocity is typically the higher priority early in a software system's lifecycle, but as the system matures, effort shifts toward fireproofing it. On the Yelp Transactions Platform team we've used a combination of circuit breakers, queues, and idempotent operations to minimize downtime and middle-of-the-night pages.

We'll take a look at how these design patterns help us in a distributed system, when they should be used, and the common pitfalls associated with them.
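To make the first of these patterns concrete, here is a minimal circuit breaker sketch in Python. It is an illustration only, not the implementation used at Yelp: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_timeout` are all hypothetical. The idea is that after a run of consecutive failures the breaker "opens" and rejects calls immediately, sparing the struggling dependency, then lets a single trial call through once a timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (single-threaded sketch)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens
            # the circuit and restarts the timeout.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_order, order_id)` with a hypothetical `fetch_order`, and treat `CircuitOpenError` as a signal to fall back or to enqueue the work for later. A production version would also need thread safety and metrics, which this sketch omits.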

Bio: William Ting is a longtime FOSS advocate who has contributed to various projects (Pelican, autojump, pyramid_swagger, Rust, GNOME). He's currently an infrastructure engineer at Reddit and was previously on the Yelp Transactions Platform team.

