PyOhio 2010: Processing Large Datasets with Hadoop and Python

Description


Presented by William McVey

This talk will explore how Hadoop and Python can be used together to process large datasets. It will begin with an overview of the Apache Hadoop project, introduce the map/reduce concept, and explore some methods of coding the data processing routines in Python. Real-world examples will illustrate how this approach can parallelize computationally expensive operations across multiple cluster nodes.

The talk will assume familiarity with the Python language during the demos, but a deep knowledge of Python will not be required to understand the concepts introduced.
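
As a rough illustration of the map/reduce concept mentioned above, here is a minimal word-count sketch in plain Python. It is not taken from the talk: the function names (map_words, reduce_counts, run_local) and the sample input are hypothetical, and the whole pipeline runs in a single process rather than on a Hadoop cluster. With Hadoop Streaming, the map and reduce steps would typically be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; the talk itself may well use a different approach.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is how a map/reduce framework's
    shuffle/sort step would deliver them to a reducer.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (no Hadoop involved)."""
    shuffled = sorted(map_words(lines), key=itemgetter(0))
    return dict(reduce_counts(shuffled))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a large dataset.
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in sorted(run_local(sample).items()):
        print(f"{word}\t{count}")
```

Because each map call looks at one line in isolation and each reduce call looks at one key in isolation, a framework such as Hadoop can spread both phases across many cluster nodes; only the sort step in between needs to bring matching keys together.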