Big Data Processing with Apache Beam Python

Description

Two trends in data analysis are the ever-increasing size of data sets and the drive for lower-latency results. In this talk, we present Apache Beam, a parallel programming model for implementing batch and streaming data processing jobs that run on a variety of scalable execution engines such as Spark and Dataflow, along with its new Python SDK. We discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing, and show how Beam lets the user write a Python pipeline once and run it in both batch and streaming mode. We walk through a few examples of data processing pipelines in Beam for use cases such as real-time data analytics and feature engineering with TensorFlow for machine learning pipelines.
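The "write once, run in batch or streaming" idea can be illustrated with a small plain-Python toy. This sketch does not use the Beam SDK and is not taken from the talk: the function and source names below are hypothetical, and it ignores the windowing, triggers, and distributed runners that a real Beam pipeline relies on. It only shows one piece of processing logic applied unchanged to a bounded (batch) input and to an unbounded (streaming) input.

```python
import itertools
from collections import Counter


def running_word_counts(lines):
    """Pipeline logic, written once: yield a snapshot of the word
    counts after each input line is processed."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
        yield dict(counts)


def batch_source():
    # Hypothetical bounded input: a finite list of lines.
    return ["to be or not to be", "be quick"]


def streaming_source():
    # Hypothetical unbounded input: the same lines repeated forever.
    return itertools.cycle(["to be or not to be", "be quick"])


# Batch mode: consume the whole bounded input and keep the final snapshot.
*_, batch_result = running_word_counts(batch_source())

# Streaming mode: the input never ends, so take results as they arrive.
stream_result = list(
    itertools.islice(running_word_counts(streaming_source()), 3)
)
```

Here `batch_result` holds the counts for the complete input, while `stream_result` holds the first three incremental snapshots from the endless input. A real Beam pipeline would instead express the logic as transforms and leave execution to the chosen runner.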
