Gold standard data: lessons from the trenches

Description

The first stage in a data science project is often to collect training data. However, getting a good data set is surprisingly tricky and takes longer than one expects. This talk describes our experiences in labelling gold-standard data and the lessons we learnt the hard way. We will present three case studies from natural language processing and discuss the challenges we encountered.

Abstract

It is often said that rather than spending a month figuring out how to apply unsupervised learning to a problem domain, a data scientist should spend a week labelling data. However, the difficulty of annotating data is often underestimated. Gathering a sufficiently large collection of good-quality labelled data requires careful problem definition and multiple iterations. In this talk, I will describe three case studies and the lessons learnt from them. Each case shows several aspects of the process that should be considered in advance to ensure the project is successful.
