As Machine Learning models rely on data in order to make their predictions, data quality evaluation is a crucial aspect of any ML pipeline. We as Engineers/Data-Scientists, should validate our data in the same manner in which we validate our code. Data errors can lead to: Bad and costly decisions, Inaccurate predictions due to invalid data and Time waste. There is an abundance of different libraries that perform various kinds of data integrity checks. I will specifically focus on Dataframe validation.
In this talk, I will present the problem and give a practical overview (accompanied by Jupyter Notebook code examples) of three libraries that aim to address it:
- Voluptuous - Which uses Schema definitions in order to validate data [https://github.com/alecthomas/voluptuous]
- Engarde - A lightweight way to explicitly state your assumptions about the data and check that they're actually true [https://github.com/TomAugspurger/engarde]
- TDDA - Test Driven Data Analysis [ https://github.com/tdda/tdda]
By the end of this talk, you will understand the Importance of data validation and get a sense of how to integrate data validation principles as part of the ML pipeline.