Scalable, Distributed, and Reproducible Machine Learning

Description

The recent advances in machine learning and artificial intelligence are amazing! Yet for these models to deliver real value within a company, data scientists must be able to get them off their laptops and deployed within the company’s data pipelines and infrastructure. Those models must also scale to production-size data. In this talk, we will implement a model locally in Python. We will then take that model and deploy both its training and inference in a scalable manner to a production cluster with Pachyderm, an open source framework for distributed pipelining and data versioning. We will also learn how to update the production model online, track changes in our model and data, and explore our results.


Daniel Whitenack (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world (ODSC, Spark Summit, Datapalooza, DevFest Siberia, GopherCon, and more), teaches data science and engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and actively helps organize contributions to various open source data science projects.

