With the trend towards data streams, building successful streaming analysis systems means building a community comfortable with streaming tech. But getting started with stream processing can be intimidating for anyone. In this talk, I’ll talk about designing and deploying a mini-testbed system to scale down the stream and how you can practice your favorite algorithm on an astronomical data stream.
The increasing availability of real-time data sources and the Internet of Things movement have pushed data analysis pipelines towards stream processing. But what does this really mean for my applications, and how do I have to change my code and workflow? In a new era of “Kappa architecture,” it’s easier than ever to use the same programming model for both batch and stream processing.
For those interested in the design and operations side, I will cover high-level design considerations for architecting a modular and scalable stream processing infrastructure that can support the flexibility of different use cases and can welcome a community of users who are more familiar with batch processing.
For the fast-batching Pythonistas, I’ll talk about some of the advantages of using streaming tech in a data processing pipeline and how to make your life easier with
- built-in replication, scalability, and stream “rewind” for data distribution with Kafka,
- structured messages with strictly enforced schemas and dynamic typing for fast parsing with Avro, and
- a stream processing interface that is similar to batch with Spark that you can even use in a Jupyter notebook.
When you’re ready to jump into the stream, or at least take a drink from the fountain, I’ll point you to an open source, containerized (with Docker), streaming ecosystem testbed that you can deploy to mock a stream of data and take your streaming analytics on a dry run over an astronomical data stream.