I present tools for collecting data generated by the Scientific Python community's development infrastructure (mailing list archives, pull requests, issue trackers) and analyzing it with Pandas and NetworkX. Using preliminary results from social network analysis and complex systems modeling, I demonstrate how reflexive data science can enrich our understanding of open source development.
The Scientific Python community's contributions to greater scientific understanding have been underappreciated by academic institutions. One reason is that software engineering is widely misunderstood and, unlike paper publication and patents, is not recognized as research work in its own right. A better understanding of the open source software development process itself will help academic institutions recognize the contributions of open source developers.
I collect historical data from the development of Scientific Python projects and render it into formats suitable for analysis with SciPy tools. To demonstrate the potential of this work, I will show two ways of analyzing this data scientifically: as a self-exciting Hawkes process exhibiting shock behavior, and as information diffusion over a social network.
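To make the first of these models concrete, here is a minimal sketch of the conditional intensity of a Hawkes process with an exponential kernel, lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)). The exponential kernel, the parameter values, and the event times are illustrative assumptions, not results or choices taken from this work.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a Hawkes process with an exponential kernel.

    mu is the baseline rate; each past event at time t_i adds
    alpha * exp(-beta * (t - t_i)), so recent activity raises the
    expected rate of further activity. All parameter values here are
    illustrative assumptions.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + excitation


# Hypothetical message timestamps (arbitrary time units), with a small burst.
event_times = [1.0, 1.2, 1.3, 5.0]

quiet = hawkes_intensity(0.5, event_times)  # before any event: baseline only
burst = hawkes_intensity(1.4, event_times)  # just after three close events
```

In this sketch the intensity right after the cluster of events is well above the baseline, which is the kind of bursty, shock-like behavior the model is meant to capture.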
The purpose of this talk is twofold.
First, it introduces tools and techniques for turning data from open source software production into scientific data suitable for analysis. It proposes that SciPy has an opportunity to engage in reflexive data science, using its own data to learn how the community functions and how it could operate more efficiently.
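As a rough illustration of what turning development data into analyzable records can look like, the sketch below parses raw email text with Python's standard library into flat dictionaries. The sample messages, the addresses, and the helper name message_to_record are hypothetical and are not part of the tools presented in this talk.

```python
import email
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical raw messages standing in for a mailing list archive.
RAW_MESSAGES = [
    "Message-ID: <1@example.org>\n"
    "From: Alice <alice@example.org>\n"
    "Date: Mon, 06 Jan 2014 10:00:00 +0000\n"
    "Subject: Release plan\n"
    "\n"
    "Proposal text.\n",
    "Message-ID: <2@example.org>\n"
    "From: Bob <bob@example.org>\n"
    "Date: Mon, 06 Jan 2014 11:30:00 +0000\n"
    "Subject: Re: Release plan\n"
    "In-Reply-To: <1@example.org>\n"
    "\n"
    "Sounds good.\n",
    "Message-ID: <3@example.org>\n"
    "From: Carol <carol@example.org>\n"
    "Date: Tue, 07 Jan 2014 09:15:00 +0000\n"
    "Subject: Re: Release plan\n"
    "In-Reply-To: <1@example.org>\n"
    "\n"
    "One concern.\n",
]


def message_to_record(raw):
    """Turn one raw email into a flat dict (hypothetical helper)."""
    msg = email.message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "sender": parseaddr(msg["From"])[1],  # keep the address only
        "date": parsedate_to_datetime(msg["Date"]),
        "in_reply_to": msg["In-Reply-To"],  # None for thread starters
        "subject": msg["Subject"],
    }


records = [message_to_record(raw) for raw in RAW_MESSAGES]
```

A list of dictionaries like this can be passed directly to pandas.DataFrame(records) for analysis. For a real archive in mbox format, the standard library's mailbox.mbox class could supply the messages instead of the hard-coded strings.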
Second, this talk will present visualizations of the data informed by complex systems research and social network analysis. These results focus on the role of productive bursts in communications. Drawing on prior work on roles in Usenet communities and open source communities, this talk will provide historical insight into the interactions among SciPy communities.
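One simple way to approach the social network side is a directed reply graph, sketched below with NetworkX. The participant names and reply pairs are made up for illustration, and in-degree is only one possible proxy for a participant's role; it is not necessarily the measure used in this work.

```python
import networkx as nx

# Hypothetical (replier, original_author) pairs extracted from threads.
replies = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("dave", "carol"),
]

# Directed graph: an edge u -> v means u replied to v.
# The edge weight counts how many times that happened.
graph = nx.DiGraph()
for replier, author in replies:
    if graph.has_edge(replier, author):
        graph[replier][author]["weight"] += 1
    else:
        graph.add_edge(replier, author, weight=1)

# Total replies each participant received (weighted in-degree).
replies_received = dict(graph.in_degree(weight="weight"))

# Share of other participants who replied to each person at least once.
centrality = nx.in_degree_centrality(graph)
```

Participants with a high weighted in-degree attract many responses, while those with high out-degree and low in-degree mostly answer others. Distinctions like these are a starting point for describing roles in a community.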