NLP on a Billion Documents: Scalable Machine Learning with Spark

Summary

Apache Spark is becoming the new lingua franca for distributed computing. In this talk I'll show how many machine learning tasks can be scaled up almost trivially using Spark. For instance, we'll see how a semi-supervised NLP algorithm can be trained on a billion training examples using a Spark cluster.

Description

Apache Spark is becoming the new lingua franca for distributed computing. In this talk I'll show how many machine learning tasks can be scaled up almost trivially using Spark.

After introducing the Spark computational model I'll detail some useful design principles for running Spark programs on large datasets. I'll also give some tips for effective configuration of a PySpark cluster.
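The talk itself introduces the Spark computational model; as a rough local stand-in (an assumption, not material from the talk), the core pattern is: split the data into partitions, run a map step independently on each partition, then merge the per-partition results in a reduce step. This is the same shape that PySpark's `RDD.mapPartitions` and `reduce` distribute across a cluster. Here is a minimal pure-Python sketch of that pattern, using a token-count task:

```python
from functools import reduce
from collections import Counter

# A minimal local sketch of the map/reduce pattern Spark distributes
# (illustrative data and partitioning; not code from the talk).

docs = ["spark makes scaling easy", "spark runs on clusters",
        "nlp at scale", "scaling nlp with spark"]

# Split the dataset into partitions (on a cluster, Spark does this).
partitions = [docs[0:2], docs[2:4]]

def count_tokens(partition):
    # Map step: count tokens within one partition, independently
    # of all other partitions.
    counts = Counter()
    for doc in partition:
        counts.update(doc.split())
    return counts

mapped = [count_tokens(p) for p in partitions]

# Reduce step: merge the per-partition counts into a global result.
total = reduce(lambda a, b: a + b, mapped)

print(total["spark"])  # -> 3
```

Because the map step never looks outside its own partition, the same code scales from two in-memory lists to billions of documents spread over a cluster, which is the design principle the talk builds on.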

The talk will include a step-by-step walkthrough of the scaling-up of several NLP algorithms. For instance, we'll see how a semi-supervised NLP algorithm can be trained on a billion training examples using a PySpark cluster.
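The abstract does not specify which semi-supervised algorithm the walkthrough uses, so the following is a generic self-training sketch (an assumption): train on a small labeled seed set, pseudo-label the unlabeled pool, and retrain on the expanded set. The pseudo-labeling pass over the unlabeled pool is the embarrassingly parallel step that a PySpark cluster would distribute; the classifier here is a deliberately trivial bag-of-words overlap scorer, purely for illustration:

```python
from collections import Counter

# Generic self-training sketch (hypothetical example data and a toy
# classifier; the talk's actual algorithm is not specified).

labeled = [("great movie loved it", "pos"),
           ("terrible movie hated it", "neg")]
unlabeled = ["loved the acting great fun",
             "hated every terrible minute"]

def train(examples):
    # Build a per-class bag-of-words profile.
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.split())
    return model

def predict(model, text):
    # Score each class by token overlap with its profile.
    tokens = text.split()
    scores = {label: sum(counts[t] for t in tokens)
              for label, counts in model.items()}
    return max(scores, key=scores.get)

model = train(labeled)

# Pseudo-label the unlabeled pool -- on a cluster, this map over
# unlabeled examples is the step Spark would parallelize.
pseudo = [(text, predict(model, text)) for text in unlabeled]

# Retrain on the labeled seed plus the pseudo-labeled examples.
model = train(labeled + pseudo)

print(predict(model, "loved it"))  # -> "pos"
```

With a billion unlabeled examples, only the trained model needs to be broadcast to the workers; each partition scores its own slice and the pseudo-labels flow back for the next training round.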
