Skdata: Data sets and algorithm evaluation protocols in Python; SciPy 2013 Presentation

Summary

Authors: Bergstra, James, University of Waterloo; Pinto, Nicolas, Massachusetts Institute of Technology; Cox, David D., Harvard University

Track: Machine Learning

Machine learning benchmark data sets come in all shapes and sizes, yet classification algorithm implementations often insist on operating on sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners are well aware of how much work (and sometimes even judgement) is required to get from the URL of a new data set to an ndarray fit for, e.g., pandas or sklearn. The skdata library [1] handles that work for a growing number of benchmark data sets, so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented.

Skdata consists primarily of independent submodules that deal with individual data sets. Each [new-style] submodule has three important sub-sub-module files:

a 'dataset' file with the nitty-gritty details of how to download, extract, and parse a particular data set;

a 'view' file with any standard evaluation protocols from relevant literature; and

a 'main' file with CLI entry points for e.g. downloading and visualizing the data set.
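The division of labour among the three files can be sketched as follows. This is a minimal illustration that collapses the three roles into one file. The names ToyDataset, ToyClassificationView, fetch, build_meta, and main, as well as the inline CSV data, are hypothetical and chosen for this sketch only; they are not taken from the skdata source.

```python
# Hypothetical layout, collapsed into a single file for illustration:
#   toy/dataset.py -> ToyDataset
#   toy/view.py    -> ToyClassificationView
#   toy/main.py    -> main()
import argparse
import csv
import io

# Stand-in for a downloaded file (assumption: a tiny three-column CSV).
RAW_CSV = "5.1,3.5,setosa\n7.0,3.2,versicolor\n6.3,3.3,virginica\n4.9,3.0,setosa\n"


class ToyDataset:
    """'dataset' role: obtain the raw data and parse it into records."""

    def fetch(self):
        # A real submodule would download and extract here.
        return io.StringIO(RAW_CSV)

    def build_meta(self):
        rows = csv.reader(self.fetch())
        return [{"features": [float(a), float(b)], "name": name}
                for a, b, name in rows]


class ToyClassificationView:
    """'view' role: turn parsed records into a standard (x, y) task."""

    def __init__(self, dataset=None):
        meta = (dataset or ToyDataset()).build_meta()
        self.label_names = sorted({m["name"] for m in meta})
        self.x = [m["features"] for m in meta]
        self.y = [self.label_names.index(m["name"]) for m in meta]


def main(argv=None):
    """'main' role: command-line entry points."""
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("command", choices=["fetch", "show"])
    args = parser.parse_args(argv)
    if args.command == "fetch":
        ToyDataset().fetch()
        return "fetched"
    view = ToyClassificationView()
    return "%d examples, %d classes" % (len(view.x), len(view.label_names))
```

Keeping parsing in the 'dataset' role and label encoding in the 'view' role means that several views can share one parsed data set.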

Various skdata utilities help to manage the data sets themselves, which are stored in the user's "~/.skdata" directory.
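As an illustration of how a per-user data directory might be resolved, the sketch below builds paths under "~/.skdata". The helper name data_home and its root override parameter are assumptions made for this example, not the library's documented interface.

```python
import os


def data_home(*names, root=None):
    """Return a path under the per-user data directory (hypothetical helper).

    `root` overrides the default location, which is convenient in tests.
    """
    base = root or os.path.expanduser(os.path.join("~", ".skdata"))
    return os.path.join(base, *names)
```

For example, data_home("iris") would name a per-data-set subdirectory beneath the user's home directory.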

The evaluation protocols represent the logic that turns parsed (but potentially idiosyncratic) data into one or more standardized learning tasks. The basic approach has been developed over years of combined experience by the authors, and has been used extensively in recent work (e.g. [2]). The presentation will cover the design of data set submodules, and the basic interactions between a learning algorithm and an evaluation protocol.
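One way to picture the interaction between a learning algorithm and an evaluation protocol is sketched below, under stated assumptions. The protocol owns the train/test splits and drives the algorithm through a small interface. The class names KFoldProtocol and NearestCentroid and the methods run, fit, and loss are invented for this sketch and do not reproduce the skdata API.

```python
class KFoldProtocol:
    """Hypothetical protocol: k-fold cross-validation over (x, y)."""

    def __init__(self, x, y, k=2):
        self.x, self.y, self.k = x, y, k

    def folds(self):
        n = len(self.x)
        for i in range(self.k):
            # Interleaved folds: example j is tested in fold j % k.
            test = [j for j in range(n) if j % self.k == i]
            train = [j for j in range(n) if j % self.k != i]
            yield train, test

    def run(self, algo):
        """Drive `algo` through every fold and return the mean test loss."""
        losses = []
        for train, test in self.folds():
            model = algo.fit([self.x[j] for j in train],
                             [self.y[j] for j in train])
            losses.append(algo.loss(model,
                                    [self.x[j] for j in test],
                                    [self.y[j] for j in test]))
        return sum(losses) / len(losses)


class NearestCentroid:
    """Toy learning algorithm exposing the interface the protocol expects."""

    def fit(self, x, y):
        sums = {}
        for xi, yi in zip(x, y):
            total, count = sums.get(yi, ([0.0] * len(xi), 0))
            sums[yi] = ([t + v for t, v in zip(total, xi)], count + 1)
        # Model: one mean vector per class label.
        return {label: [t / c for t in total]
                for label, (total, c) in sums.items()}

    def predict(self, model, xi):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(xi, centroid))
        return min(model, key=lambda label: sq_dist(model[label]))

    def loss(self, model, x, y):
        # Zero-one loss: fraction of misclassified examples.
        errors = sum(self.predict(model, xi) != yi for xi, yi in zip(x, y))
        return errors / len(x)


x = [[0.0], [0.1], [1.0], [1.1]]
y = [0, 0, 1, 1]
mean_loss = KFoldProtocol(x, y, k=2).run(NearestCentroid())
```

Because the protocol only calls fit and loss, any algorithm offering those two methods could be evaluated on the same splits without touching the data set code.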