Description
The Wayback Machine is a high-traffic website that has been online for over a decade. It was written mostly in Java. One component of the application is the Liveweb proxy, an HTTP proxy that archives each resource requested through it and serves as the core data source for the Wayback Machine. The Liveweb proxy was rewritten from scratch in Python and deployed on the production website, where it has been running for a few months without a single hitch. Getting there meant working around limitations in the standard library, carefully tuning parameters to balance disk I/O against memory usage, and understanding and respecting fine details of the HTTP protocol. This talk discusses the architecture and design of the new system, how it handles the traffic and access patterns expected of an archiving proxy, and how it was deployed.