We discuss how item response theory (IRT), originally developed as a psychometric tool for assessing a person's intellectual or academic ability from their performance on a standardized test, can be used as a data quality tool. Assuming that a dataset has an underlying "ability" to train predictive models (where the ability is specific to the type of dependent variable being predicted), we build many models on top of a variety of datasets to assess simultaneously which dataset is best for a given dependent variable and which cases are the most "difficult" for a dataset to predict correctly. The product of this work is an understanding of both which predictions are the "hardest" to get correct for any dataset and which dataset is expected to give the best predictions on a new dependent variable.
The first step in this study is to build a laboratory in which many related models can be trained and validated, reproducibly and in a self-documenting way. By running many models on related dependent variables (for example, a number of variables that capture different aspects of political behavior), we can characterize a baseline expected performance for any new model similar to those already built. We call this suite of related models a market basket, after the terminology and methodology used by economists to summarize the state of a market.
Then, when we investigate new data sources or formats, we have a well-defined process for determining whether the new data makes the models better: we rebuild our market basket and compare the results with the new data to the results without it (performance, model build time, data storage constraints). This lets us assess the quality of our data in a way that is driven by the models and the data themselves.
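The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not the laboratory itself: the function name `compare_baskets`, the model names, and the scores are all hypothetical placeholders, and we assume each model in the basket is summarized by a single performance metric where higher is better.

```python
from statistics import mean


def compare_baskets(baseline, candidate):
    """Compare per-model scores for a basket built without and with new data.

    Both arguments map a model name to a performance score (higher is better).
    Only models present in both baskets are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("the two baskets share no models")
    # Positive delta means the new data improved that model.
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    return {
        "deltas": deltas,
        "mean_delta": mean(deltas.values()),
        "n_improved": sum(delta > 0 for delta in deltas.values()),
    }


# Hypothetical basket of political-behavior models; scores are placeholders.
without_new_data = {"turnout": 0.70, "donation": 0.60, "volunteer": 0.80}
with_new_data = {"turnout": 0.75, "donation": 0.58, "volunteer": 0.80}

summary = compare_baskets(without_new_data, with_new_data)
print(summary["n_improved"], "of", len(summary["deltas"]), "models improved")
```

The same structure extends to the other criteria mentioned above (build time, storage) by running the comparison once per metric.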
An open question is how to assess whether a given dataset or feature is "better" for a given basket of models. A promising answer comes from the field of psychometrics, which uses a set of tools called item response theory both to assess exams (such as the SAT and GRE) and to use those exams to rank students by intellectual or academic ability.
Borrowing the terminology of IRT, we draw the analogy that a dataset is like a student (it has an inherent capability to accomplish certain tasks, like building good models), a single model prediction is a test question, and the full set of test predictions is an exam. IRT parameterizes both the (unknown) student ability and the (also unknown) test question difficulty, and uses the expectation-maximization (EM) algorithm to estimate both sets of parameters simultaneously. This allows a researcher both to know how "smart" a dataset is for solving a given basket of models and to rank-order "exam questions" (model predictions) by difficulty. The result is a single methodology with applications both to data quality and to assessing the difficulty of making a given prediction (useful, e.g., for outlier identification).
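To make the analogy concrete, the sketch below fits the simplest IRT model, the one-parameter (Rasch) model, in which the probability that dataset i gets prediction j right is sigmoid(ability_i - difficulty_j). Several things here are assumptions for illustration only: the function name `fit_rasch`, the toy response matrix, and the choice of estimator. In particular, for brevity we swap the EM procedure described above for plain gradient ascent on a lightly penalized joint likelihood; the small L2 penalty is our addition to keep estimates finite when a dataset or a prediction is all-correct or all-wrong.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_rasch(responses, n_iter=2000, lr=0.05, l2=0.01):
    """Estimate Rasch abilities and difficulties by penalized gradient ascent.

    responses: 2-D array of 0/1 values, one row per dataset ("student") and
    one column per model prediction ("exam question"); 1 means correct.
    Returns (ability, difficulty) as 1-D arrays.
    Note: this is joint maximum likelihood, not the EM approach; lr is
    assumed small relative to the matrix dimensions.
    """
    responses = np.asarray(responses, dtype=float)
    n_datasets, n_items = responses.shape
    ability = np.zeros(n_datasets)
    difficulty = np.zeros(n_items)
    for _ in range(n_iter):
        # Model probability that each dataset gets each prediction right.
        p_correct = sigmoid(ability[:, None] - difficulty[None, :])
        residual = responses - p_correct
        # Log-likelihood gradients, each shrunk toward zero by the penalty.
        ability += lr * (residual.sum(axis=1) - l2 * ability)
        difficulty += lr * (-residual.sum(axis=0) - l2 * difficulty)
    return ability, difficulty


# Made-up response matrix: 4 datasets (rows) by 5 predictions (columns).
toy_responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
])
ability, difficulty = fit_rasch(toy_responses)
print("datasets, most to least able:", np.argsort(-ability))
print("predictions, easiest to hardest:", np.argsort(difficulty))
```

In the Rasch model, ability ordering follows the number of correct predictions per dataset and difficulty ordering follows the number of datasets that miss each prediction; richer IRT models (for example, ones with a per-question discrimination parameter) would relax this.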