This tutorial will explore the use of tools in the Pandas data analysis library for analyzing unevenly spaced time series data. The tutorial will start with a brief primer on Pandas and the data.world API, and then demonstrate how to use Pandas tools to analyze data on episodes of The Simpsons hosted at data.world.
Indeed data scientists occasionally analyze time series data in which the events of interest are unevenly spaced. For example, when we want to understand how a change to a user interface for Indeed Hire recruiters affects the time it takes them to review candidates, we might look at changes in the time intervals between individual candidate dispositions in our logs. When we want to understand the ratio of new business to repeat business, or explore different definitions of repeat business, we analyze the intervals between the creation dates of new requisitions from the same client.
The Pandas data analysis library offers powerful tools for conducting time series analysis. When working on unevenly spaced time series, we have found the shift() and transform() DataFrame methods particularly helpful. Many of the examples we found on the web applied these methods only to small, artificial datasets. Determining how best to apply them to real datasets was not always as straightforward as we would have hoped.
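To make the roles of the two methods concrete, here is a minimal sketch of one way they can be combined on unevenly spaced events. It assumes a tiny, made-up log with hypothetical column names (client, created, gap_days); it is not the tutorial's dataset or necessarily the approach taken in the tutorial itself. shift() lines each event up with the previous event for the same client, and transform() broadcasts a per-client summary back onto every row.

```python
import pandas as pd

# Hypothetical, made-up event log (not the tutorial's data): one row per
# event, with irregular gaps between events for the same client.
events = pd.DataFrame(
    {
        "client": ["a", "a", "a", "b", "b"],
        "created": pd.to_datetime(
            ["2017-01-01", "2017-01-04", "2017-02-15", "2017-01-10", "2017-03-01"]
        ),
    }
).sort_values(["client", "created"]).reset_index(drop=True)

# shift(1) within each client pairs every event with the one before it;
# the first event for each client has no predecessor, so it gets NaT.
events["previous"] = events.groupby("client")["created"].shift(1)

# Interval between consecutive events, in whole days (NaN for first events).
events["gap_days"] = (events["created"] - events["previous"]).dt.days

# transform() returns one value per original row, so the per-client mean
# gap can sit alongside each individual gap for comparison.
events["mean_gap_days"] = events.groupby("client")["gap_days"].transform("mean")

print(events)
```

Because transform() preserves the shape of the original data, each row can be compared directly against its group's summary without a separate merge step.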
Rather than use internal proprietary data to illustrate how these methods can be applied to unevenly spaced time series data, we will use a publicly available dataset of episodes of The Simpsons at data.world (https://data.world/data-society/the-simpsons-by-the-data). In doing so, we will also introduce the data.world API.
The purpose of this tutorial is to:
- Provide a brief, focused primer on some basic aspects of Pandas
- Provide an overview of data.world datasets and accessing them via the API
- Show how advanced Pandas tools can be used for analyzing unevenly spaced time series data
Participants will be best prepared for this tutorial if they:
- Understand Python basics
- Have Python 2 or Python 3 installed on their computers
- Install the latest versions of Pandas and Jupyter Notebook (recommended: use Anaconda)
- Install the data.world Python API (pip install git+https://github.com/datadotworld/data.world-py.git)
- Create a data.world account and an API key via the data.world Advanced Settings page
Update: Jupyter notebooks associated with the tutorial have been uploaded to a GitHub repository (https://github.com/gumption/pydata-simpsons).