
A brief introduction to Distributed Computing with PySpark

Description

Apache Spark is a fast, general-purpose engine for distributed computing and big data processing, with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) through hands-on exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory word count example that comes with every big data tutorial, and discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL and Spark ML) from Python. While Spark is available in a variety of languages, this workshop will focus on using Spark and Python together.
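
As a taste of what the word count example looks like, here is a minimal sketch using the RDD API; the local master URL and the input path "input.txt" are placeholders for illustration, not part of the tutorial materials.

    # Classic word count with the PySpark RDD API.
    from pyspark import SparkContext

    # Local mode with all cores; the master URL is an assumption for local testing.
    sc = SparkContext("local[*]", "WordCount")

    # Read the file into an RDD of lines (the path is hypothetical).
    lines = sc.textFile("input.txt")

    # Split lines into words, pair each word with 1, then sum the counts per word.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Bring the results back to the driver and print them.
    for word, count in counts.collect():
        print(word, count)

    sc.stop()

One detail worth noticing is reduceByKey: it combines values for each key on every node before shuffling data across the network, which is part of why this pattern scales well.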

This tutorial is intended for people new to Spark/PySpark. Please install Spark (1.3.1 or later) from http://spark.apache.org/downloads.html before class; we are working to have cluster resources available, but a local install is sufficient for the workshop and a good backup in case the WiFi isn't cooperating.
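
To sanity-check a local install, you can start Spark's interactive Python shell with bin/pyspark from the unpacked download, which pre-creates a SparkContext named sc, and run a tiny job like this sketch:

    # Inside the bin/pyspark shell, `sc` already exists.
    rdd = sc.parallelize(range(100))
    print(rdd.reduce(lambda a, b: a + b))  # should print 4950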

Materials available here:

Slides: http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1
Notebook: https://github.com/holdenk/intro-to-pyspark-demos/blob/master/ipython/Super-Fast-PySpark-Intro-PyData-Seattle-2015.ipynb
