PyData Amsterdam 2016
Data cleaning is the first step of every Data Science project; next comes the Data Science itself. This talk covers the missing step: deploying and scaling Data Applications in production. We will go through all the major steps of the process, such as Dockerizing an application, Continuous Deployment with AWS stack creation, and rolling deploys, and will also cover new trends in Serverless architecture.
Data Science is quite a young field. One definition of a Data Scientist is: a person who is better at statistics than any software engineer and better at software engineering than any statistician. Hence, it is important to talk not only about best practices for feature generation and avoiding overfitting but also about software engineering topics.
The talk is based on our experience of Data Science development at Stylight, an international fashion e-commerce company that operates in 15 countries worldwide. We refer to our Data Applications written in R, Python, and Scala, but the content is not limited to these languages and applies to others as well.
The talk consists of three main parts. The first part introduces best practices of development: how to structure your development, how to make deployment easy and reproducible, and how to set up Continuous Integration and commit-triggered deployments. The second part covers production deployment to an AWS stack, focusing in particular on the concepts of immutable infrastructure and infrastructure as code. The last part is about using serverless architecture for data applications. We introduce an example, our outlier detection system, which scales automatically based on this approach.
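To give a flavour of the serverless part, here is a minimal sketch of what a function-as-a-service outlier detector could look like. It is not the system presented in the talk: the `handler(event, context)` shape follows the AWS Lambda Python convention, while the `"values"` event key, the `find_outliers` helper, and the choice of a median-absolute-deviation rule with a 3.5 threshold are assumptions made purely for illustration.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    Illustrative rule only (median absolute deviation); the detection
    method used in the talk is not specified in the abstract.
    """
    median = statistics.median(values)
    # Median absolute deviation: a robust estimate of spread.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread around the median, so nothing is flagged.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


def handler(event, context):
    """Lambda-style entry point; the 'values' key is a hypothetical event field."""
    return {"outliers": find_outliers(event["values"])}
```

Because the function keeps no state between invocations, the platform can run as many copies in parallel as the incoming load requires, which is the property that makes such a design scale automatically.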