We live in the world of "big data," where more data is almost always considered better. Data is usually seen as a raw material: the stuff from which models are built and decisions are made. We may not know, see, or even care where the underlying data comes from. But sometimes our instincts run afoul of laws that decree that too much data, or data from the wrong source, is illegal. This talk is an exploration of the legal concepts of ownership and provenance: how the law restricts what data we can use, and ways we can act within the law and reduce risk.
We sometimes think of data as just "facts," existing outside of any sort of ownership or legal structure. But the law doesn't always see it that way. Sometimes observations are owned by those who observe and sometimes by those who are observed. There are certain types of data, such as market-moving information, that may be legal or illegal to use depending on how you learned it.
The problem is that the boundary of ownership, and the duties associated with protecting that data, change as data moves through your data pipeline. Moreover, the duties associated with the data you hold change over time, both as the law changes and as our ability to combine or de-anonymize datasets grows.
Further, once we have established ownership, there is the difficulty of proving where we learned certain pieces of information and of tracking that metadata through a processing pipeline. We also need to build controls into certain sorts of data applications, because some types of information are legal to use separately but may be illegal to bring together under certain circumstances.
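
As a rough illustration of the last two points, here is a minimal Python sketch of carrying provenance through a pipeline and refusing forbidden combinations. Everything in it is an assumption made for illustration: the names `Tracked`, `combine`, `check_policy`, and `FORBIDDEN_COMBINATIONS`, the source labels, and the policy itself are hypothetical and do not reflect any particular law or library.

```python
from dataclasses import dataclass

# Hypothetical policy: sets of source labels that must never be combined.
# Which combinations are actually restricted is a legal question.
FORBIDDEN_COMBINATIONS = {frozenset({"patient_records", "voter_rolls"})}


@dataclass(frozen=True)
class Tracked:
    """A value bundled with the set of sources it was derived from."""
    value: object
    sources: frozenset


def check_policy(sources):
    """Raise if the given sources include a forbidden combination."""
    for combo in FORBIDDEN_COMBINATIONS:
        if combo <= sources:
            raise PermissionError(f"forbidden combination: {sorted(combo)}")


def combine(func, *inputs):
    """Apply func to the input values, carrying the union of their sources."""
    sources = frozenset().union(*(item.sources for item in inputs))
    check_policy(sources)  # the control runs before any data is joined
    return Tracked(func(*(item.value for item in inputs)), sources)
```

Every derived value keeps the full set of sources that fed into it, so a later question of "where did we learn this?" can be answered from the record itself, and the combination check runs at the point where datasets meet rather than after the fact.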