Python MapReduce Programming with Pydoop

Summary

[EuroPython 2011] Simone Leo - 24 June 2011 in "Track Lasagne"

Description

Hadoop is the leading open-source implementation of MapReduce, Google's large-scale distributed computing paradigm. Hadoop's native API is in Java, and its built-in options for Python programming, Streaming and Jython, have several drawbacks: the former gives access to only a small subset of Hadoop's features, while the latter carries with it all of the limitations of Jython with respect to CPython.

Pydoop is an API for Hadoop that makes most of its features available to Python programmers while allowing CPython development. Its core consists of Boost.Python wrappers for Hadoop's C/C++ interface.

The talk consists of a MapReduce/Hadoop tutorial and a presentation of the Pydoop API, with the main goal of bridging the gap between the Hadoop and Python communities. A basic knowledge of distributed programming is helpful but not strictly required.
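
For readers new to the paradigm, the MapReduce model covered in the tutorial part can be sketched in plain Python as a word count. This is only an illustrative, single-process simulation: it uses neither Hadoop nor the Pydoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, mimicking what the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence, in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello python"]))
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would handle the grouping step; here all three steps run locally so that the data flow is easy to follow.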