Fighting Against Chaotically Separated Values with Embulk

Description

PyData SF 2016

Python is a great tool for data analysis, but often the hardest part is getting access to data that lives in a variety of business systems: files, databases, and SaaS applications. Productionizing this process is even harder: scripts frequently fail and require precious time to fix and re-test. In this talk, I will review some open source tools I authored and show you how

In this talk we will cover:

  • How we created a data collection tool that can read the chaotically formatted files called "CSV" by guessing their structure automatically
  • Explore the plugin-based architecture that makes it easy to load data from external sources, from files to business systems such as Salesforce and Mixpanel, and publish it to production systems
  • Review current plugins (over 100 released by the OSS community) and use cases
  • Explain how distributed execution enhances stability and scalability
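
To give a feel for what "guessing the structure" of a CSV file can mean, here is a minimal sketch using Python's standard-library `csv.Sniffer`. It is not Embulk's guessing logic, which this abstract does not describe; it is a stand-in that infers the delimiter and the presence of a header row from a small sample. The function name `guess_csv_structure` and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io


def guess_csv_structure(sample: str) -> dict:
    """Guess delimiter, quote character, and header presence from a text sample.

    Illustrative only: relies on csv.Sniffer heuristics, not on Embulk.
    """
    sniffer = csv.Sniffer()
    # Restrict candidates to common delimiters so ordinary characters
    # inside values are never considered.
    dialect = sniffer.sniff(sample, delimiters=",;\t|")
    has_header = sniffer.has_header(sample)

    # Parse the sample with the guessed dialect to recover column names.
    rows = list(csv.reader(io.StringIO(sample), dialect))
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "has_header": has_header,
        "columns": rows[0] if has_header else None,
    }


# Hypothetical sample: semicolon-separated values with a header row.
sample = (
    "id;name;signup_date\n"
    "1;alice;2016-01-05\n"
    "2;bob;2016-02-11\n"
    "3;carol;2016-03-20\n"
)

structure = guess_csv_structure(sample)
print(structure)
```

Heuristics like these work from a sample of the file, so a guessed structure is best treated as a starting point to review rather than a guarantee.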
