
Scaling your Data infrastructure


This talk aims to answer a few questions:

  • What do you do when you need to move your model from your laptop to production?
  • Is “big data == I need to use the JVM” the right assumption?
  • How can I put my Jupyter notebook in production?
  • How do you apply software engineering best practices (testing and CI, for example) inside your data science process?
  • How do you “decouple” your data scientists, developers and devops teams?
  • How do you guarantee the reproducibility of your models?
  • How do you scale your training process when your data no longer fits in memory?
  • How do you serve your models and provide an easy rollback system?

The Agenda:

  • The Data Science workflow
  • Scaling is not just a matter of the size of your Data
  • Scaling when the size of your Data matters
  • DDS, Dockerized Data Science
  • Cassiny

I’ll share my experience, highlighting some of the challenges I faced and the solutions I came up with to answer these questions.

During this presentation I will mention libraries and tools like Jupyter, Atom, scikit-learn, Dask, Ray, Parquet, Arrow and many others.

The principles and best practices I will share can be applied, more or less easily, if you are running, or are in the process of setting up, a production system based on the Python stack.

This talk will focus on (my) best practices for running the Python data stack together. I will also talk about Cassiny, an open source project I started that aims to simplify your life if you want a completely Python-based solution for your data science workflow.

Friday 20 April at 11:00
