Presenters: Ben Zaitlen, Clayton Davis
This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and pandas, we will load, manage, clean, and explore real-world data right off the instrument. Next, we will return to NumPy and continue with scikit-learn, focusing on a common dimensionality-reduction technique: PCA.
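As a rough illustration of what PCA does, here is a minimal sketch that uses only NumPy's SVD rather than scikit-learn (whose `PCA` class packages the same idea). The helper name `pca` and the synthetic data are assumptions made for this example and are not taken from the tutorial materials.

```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto its top principal components.

    Hypothetical helper for illustration; not part of any library.
    """
    # PCA operates on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of `vt` are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Variance explained by each retained axis.
    variance = (singular_values ** 2) / (len(data) - 1)
    return centered @ components.T, components, variance[:n_components]


# Synthetic data (an assumption): 200 points near a line in 3-D space.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
samples = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))

projected, components, variance = pca(samples, n_components=1)
```

Because the synthetic points lie almost on a single line, one component captures nearly all of the variance, which is the sense in which PCA reduces dimensionality.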
In the second half of the course, we will turn to Python for big data analysis and introduce two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco, a Python MapReduce framework, we will introduce the concept of MapReduce and build up several scripts that can process a variety of public data sets. Attendees will also learn how to launch and manage their own clusters using AWS and StarCluster.
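The MapReduce concept can be sketched in plain Python with a word count. This is a single-process illustration of the map, shuffle, and reduce phases and does not use the Disco API; all function names here are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce phase: collapse the values for one key into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


# Toy input (an assumption), standing in for a real data set.
lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(lines)
```

In a real framework the map and reduce calls run on different machines and the shuffle moves data between them; the structure of the user-written functions stays much the same.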
- Setup/Install Check (15)
- NumPy/pandas (30)
  - Missing Data
- PCA (15)
  - scikit-learn
- MapReduce (30)
  - Count Words
- EC2 and StarCluster (15)
- IPython Parallel (30)
- Bitly Links Example (30)
- Wiki Log Analysis (30)
An extra 45 minutes is reserved for questions, pitfalls, and a break.
Each student will have access to a three-node EC2 cluster on which they will modify and execute examples. Each cluster will have Anaconda, IPython Notebook, Disco, and Hadoop preconfigured.
All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods and with common NumPy routines. Attendees should come with the latest version of Anaconda pre-installed on their laptops and a working SSH client.
Preliminary work can be found at: https://github.com/ContinuumIO/tutorials