Description
PyData DC 2016
Data science is the backbone of modern scientific discovery and industry. Unfortunately, multiple recent studies have been found to be unreliable and non-reproducible. Adopting techniques from software engineering might help mitigate some of these problems.
Data science is the backbone of modern scientific discovery and industry. It makes sense of everything from cancer trials to package delivery logistics. But all is not well with data science. Over the past decade, multiple studies have been found to be unreliable and non-reproducible when other scientists tried to recreate their results. This is due to a variety of factors, including fraud, pressure to publish, improper data handling practices, and bugs in analytic tools.
The problems faced by data science mirror problems that software engineering has been trying to solve. While there are no silver bullets to guarantee quality software, techniques have been developed over time that have improved quality and reliability. Some of these techniques, including open source, version control, automation, and fuzzing could be adapted to the data science domain to improve reliability and help address the reproducibility crisis.