Constrained Data Synthesis

YouTube

Description

Synthetic data is useful in many contexts, including

providing "safe", non-private alternatives to data containing personally identifiable information
software and pipeline testing
software and service development
enhancing datasets for machine learning.

Synthetic data is often created on a bespoke basis, and since the advent of generative adverserial networks (GANs) there has been considerable interest and experimentation with using those as the basis for creating synthetic data.

We have taken a different approach. We have worked for some years on developing methods for automatically finding constraints that characterise data, and which can be used for testing data validity (so-called "test-driven data analysis", TDDA). Such constraints form (by design) a useful characterisation of the data from which they were generated. As a result, methods that generate datasets that match the constraints necessarily construct datasets that match many of the original characteristics of the data from which the constraints were extracted.

An important aspect of datasets is the relationship between "good" (~ valid) and "bad" (~ invalid) data, both of which are typically present. Systems for creating useful, realistic synthetic data generally need to be able to synthesize both kinds, in realistic mixtures.

This talk will discuss data synthesis from constraints, describing what has been achieved so far (which includes synthesizing good and bad data) and future research directions.

We introduce a method for creating synthetic data "to order" based on learned (or provided) constraints and data classifications. This includes "good" and "bad" data.

PyVideo

Constrained Data Synthesis

Description

Details