
Frictionless Data, Frictionless Development

Description

A common problem in Data Engineering is how to create a platform capable both of importing and exporting tabular data in numerous formats and of maintaining a change history of the data while users update and query it.

Tools like Google Cloud Dataprep by Trifacta provide a turnkey solution to part of the pipeline, but the open source Frictionless Data tools from OKFN can provide a simpler subset of these features tailored to your requirements.

Just as Pandas is built around the DataFrame, the Frictionless Data approach is built around data packages, each consisting of a JSON table schema and a data URI. These schemata can be easily generated for any dataset and work well for a number of applications such as:

  • Validating new data with tools like Goodtables or tableschema-py
  • Building a data update interface with tools such as Handsontable JS
  • Creating declarative data processing pipelines that a front end can easily interact with via datapackage-pipelines and Kubernetes
  • Pushing data into various databases and repository tools such as CKAN datastore
  • Extending the schema to allow export to linked data formats such as IIIF
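
To make the idea of a schema plus a data URI concrete, here is a minimal sketch using only the Python standard library. It is not the tableschema-py or Goodtables API: the helper names (infer_type, infer_schema, validate_row), the sample data, and the people.csv path are all hypothetical, and only three field types are handled. It infers a table schema from a small CSV, wraps it in a data package style descriptor, and checks a new row against the schema.

```python
import csv
import io

# Hypothetical sample data; in practice it would live at the resource "path".
CSV_TEXT = "id,name,height\n1,Ada,1.65\n2,Grace,1.70\n"

# Simplified assumption: only three of the Table Schema field types.
CASTS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Return the narrowest of the supported types that fits every value."""
    for type_name in ("integer", "number"):
        try:
            for value in values:
                CASTS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(rows):
    """Build a table-schema-style dict from a list of row dicts."""
    names = list(rows[0].keys())
    return {
        "fields": [
            {"name": n, "type": infer_type([r[n] for r in rows])} for n in names
        ]
    }


def validate_row(schema, row):
    """Return a list of error messages for one row (empty if the row is valid)."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in row:
            errors.append(f"missing field: {name}")
            continue
        try:
            CASTS[field["type"]](row[name])
        except ValueError:
            errors.append(f"{name}: {row[name]!r} is not a valid {field['type']}")
    return errors


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
schema = infer_schema(rows)

# The descriptor pairs the schema with the location of the data.
descriptor = {
    "name": "people",
    "resources": [{"name": "people", "path": "people.csv", "schema": schema}],
}
```

Because the descriptor is a plain dict, json.dumps(descriptor) yields a document that a validator, an editing front end, or a database loader could all read, which is the property the use cases above rely on. Calling validate_row(schema, new_row) before accepting an update returns an empty list for a conforming row and one message per failing field otherwise.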

The talk will cover these use cases and compare them with the approaches taken by other open-source data science / BI tools such as Datashape with odo from Continuum and Superset from Airbnb. I will aim to demonstrate that lightweight web standards like data packages speed up the development process.
