Description
The first stage in a data science project is often to collect training data. However, getting a good data set is surprisingly tricky and often takes longer than expected. This talk describes our experiences in labelling gold-standard data and the lessons we learnt the hard way. We will present three case studies from natural language processing and discuss the challenges we encountered.
Abstract
It is often said that rather than spending a month figuring out how to apply unsupervised learning to a problem domain, a data scientist should spend a week labelling data. However, the difficulty of annotating data is often underestimated. Gathering a sufficiently large collection of good-quality labelled data requires careful problem definition and multiple iterations. In this talk, I will describe three case studies and the lessons learnt from them. Each case shows several aspects of the process that should be considered in advance to ensure the project is successful.