The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo (#120)

Presented by Nitin Madnani (University of Maryland, College Park); Dr. Jimmy J Lin (University of Maryland)

PyCon US 2010

A practical look at NLTK and Dumbo, two open-source, Python-powered toolkits and APIs for processing natural language on a large scale.

For people like us who make a living trying to make a computer "understand" human language, Python is a very powerful language, given its rapid prototyping abilities, native Unicode support and stellar standard library. This relationship has been strengthened further by the open-source, Python-based Natural Language Toolkit (nltk.org), which is widely used in the natural language processing community for both teaching and research and is gaining traction in the general Python community as well (pypi.python.org/pypi/nltk). Recently, the Python community has seen the release of Dumbo (github.com/klbostee/dumbo), an open-source, Python-based cloud-computing API built on Hadoop and written by Klaas Bosteels.

In this talk, we show how combining Python, NLTK and Dumbo makes very large-scale natural language processing both efficient and elegant.
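
To give a flavor of how the pieces fit together, here is a minimal word-count sketch. It is not taken from the talk. It assumes Dumbo's mapper/reducer convention (generator functions passed to `dumbo.run`) and uses NLTK's `wordpunct_tokenize` for tokenization. The regex fallback and the `simulate` and `run_job` helpers are hypothetical additions so the example can be tried on a single machine without NLTK, Dumbo or Hadoop installed.

```python
import re
from collections import defaultdict

try:
    # NLTK's regex-based tokenizer; needs no extra corpus downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Assumption: stand-in that splits text the same way when NLTK is absent.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def mapper(key, value):
    """Emit (word, 1) for every alphabetic token in one line of text.

    For plain-text input the key is assumed to be the line's offset
    and is ignored here.
    """
    for token in wordpunct_tokenize(value):
        if token.isalpha():
            yield token.lower(), 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def simulate(lines):
    """Hypothetical local stand-in for the map, shuffle and reduce phases."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, count in mapper(offset, line):
            grouped[word].append(count)
    result = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            result[key] = total
    return result


def run_job():
    """Hand the same two functions to Dumbo (requires Dumbo and Hadoop)."""
    import dumbo
    dumbo.run(mapper, reducer)
```

Calling `simulate(["The cat sat.", "the CAT ran!"])` exercises the mapper and reducer locally. On a cluster, the script would instead call `run_job()` from its `__main__` guard and be launched through Dumbo's command-line tool.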

Recorded: 2010-02-19

Tags: dumbo, nltk, pycon, pycon2010