Description
Audience: beginner
Description The goal of this talk is to explain how Athena, a serverless, SQL-based query service provided by Amazon's AWS, combined with a Python library called PyAthena, made it possible for us to store and query as much data as we needed at low cost, with high performance, and in a Pythonic way.
Abstract We found ourselves in a sticky situation: for monitoring and debugging purposes we needed to store a large amount of data (around 200 million rows) without spending the entire year's budget, while still being able to query it efficiently in an interactive setting. With data of this size, we could not simply resort to Data Science tools like Pandas and hope for the best. Our first idea was to shove it all into our Postgres DB: since both the data and the database were hosted on Amazon's AWS infrastructure, all we had to do was write ad-hoc import and update queries. Sadly, our poor Postgres machine took the hit and could not meet our requirements without greatly increasing our costs. Then we found out about Athena: a serverless, Presto-based, SQL-compliant query service that reads directly from S3 folders and exposes them as virtual tables on which you can run SQL queries. Using Athena's Python library (PyAthena), our query execution time dropped from hours to seconds, and we simplified the infrastructure and decreased our costs, with no need to pay for and maintain a dedicated server. In this talk we will show why Athena was the right solution for our use case and present its Python library and its features.
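To give a rough idea of what querying Athena from Python can look like, here is a minimal sketch, not the code from the talk. It assumes PyAthena is installed and AWS credentials are configured; the staging bucket, region, table name (`monitoring.events`) and column names are all hypothetical. PyAthena cursors follow the Python DB-API, so the helper works with any DB-API cursor.

```python
from typing import Any, Dict, List, Optional


def open_cursor(staging_dir: str, region: str):
    """Open an Athena cursor. PyAthena is imported lazily so that
    fetch_dicts() below can be exercised without AWS access."""
    from pyathena import connect  # third-party: pip install pyathena

    return connect(s3_staging_dir=staging_dir, region_name=region).cursor()


def fetch_dicts(cursor, sql: str,
                params: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Run a query on a DB-API cursor and return the rows as dicts."""
    cursor.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical table and columns; %(name)s is the pyformat placeholder style.
ERRORS_PER_DAY = """
SELECT event_date, count(*) AS errors
FROM monitoring.events
WHERE severity = %(severity)s
GROUP BY event_date
ORDER BY event_date
"""


def print_errors_per_day() -> None:
    """Example entry point; the bucket and region are placeholders."""
    cursor = open_cursor("s3://my-athena-results/staging/", "eu-west-1")
    for row in fetch_dicts(cursor, ERRORS_PER_DAY, {"severity": "ERROR"}):
        print(row)
```

Calling `print_errors_per_day()` with a real staging bucket and an existing table would run the query on Athena. Query results are written to the S3 staging directory and read back from there, so no database server is involved.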
on Sunday 22 April at 15:30