PyVideo.org · Dataframe Validation In Python

Description

Machine learning models rely on data to make their predictions, so evaluating data quality is a crucial part of any ML pipeline. As engineers and data scientists, we should validate our data as rigorously as we validate our code. Data errors can lead to bad and costly decisions, inaccurate predictions, and wasted time. Many libraries perform various kinds of data integrity checks; I will focus specifically on DataFrame validation.

In this talk, I will present the problem and give a practical overview (accompanied by Jupyter Notebook code examples) of three libraries that aim to address it:

Voluptuous - Uses schema definitions to validate data [https://github.com/alecthomas/voluptuous]
Engarde - A lightweight way to explicitly state your assumptions about the data and check that they're actually true [https://github.com/TomAugspurger/engarde]
TDDA - Test-Driven Data Analysis [https://github.com/tdda/tdda]
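
To give a rough sense of what stating assumptions about a DataFrame and checking them can look like, here is a minimal sketch in plain pandas. It does not use the API of any of the three libraries above; the validate_dataframe helper, the rule keys (min, max, allowed, nullable), and the sample data are all hypothetical and invented for this illustration.

```python
import pandas as pd


def validate_dataframe(df, schema):
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    for column, rules in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        series = df[column]
        # Missing values are rejected unless the rule explicitly allows them.
        if not rules.get("nullable", False) and series.isna().any():
            problems.append(f"{column}: contains missing values")
        values = series.dropna()
        if "min" in rules and (values < rules["min"]).any():
            problems.append(f"{column}: values below {rules['min']}")
        if "max" in rules and (values > rules["max"]).any():
            problems.append(f"{column}: values above {rules['max']}")
        if "allowed" in rules and not values.isin(rules["allowed"]).all():
            problems.append(f"{column}: unexpected categories")
    return problems


# Hypothetical schema: each column maps to the assumptions we make about it.
SCHEMA = {
    "age": {"min": 0, "max": 120},
    "country": {"allowed": ["IL", "US", "DE"]},
    "income": {"min": 0, "nullable": True},
}

# Hypothetical sample data: one frame that satisfies the schema, one that does not.
good = pd.DataFrame(
    {"age": [31, 45], "country": ["IL", "US"], "income": [52000.0, None]}
)
bad = pd.DataFrame({"age": [31, -4], "country": ["IL", "XX"]})

good_problems = validate_dataframe(good, SCHEMA)
bad_problems = validate_dataframe(bad, SCHEMA)
print(good_problems)
print(bad_problems)
```

Collecting every problem in a list, rather than raising on the first failure, makes it easier to report all violations at once before deciding whether a pipeline step should continue.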

By the end of this talk, you will understand the importance of data validation and get a sense of how to integrate data validation principles into an ML pipeline.

Dataframe Validation In Python - A Practical Introduction