Processing large datasets in R has long been limited by the amount of memory available on the local system. To overcome this limitation, several cluster computing alternatives have recently emerged, including Apache Spark. In this session, we will discuss the architecture of Spark and introduce the SparkR library. We will work through examples of the API and point to additional resources for learning more.
In this tutorial, we will focus on SparkR. The outline of the tutorial is as follows:
- Introduction to cluster computing with Spark
- Getting started with SparkR
- Deep dive into SparkR DataFrame API
- Additional resources
In preparation for this tutorial, please install the SparkR package on your system by running install.packages("SparkR").