Streaming versus batch has been an ongoing debate since the large-scale adoption of “Big Data”. In this talk, we discuss the pros & cons of batch vs. streaming processing, especially with respect to the workflows of data engineers and data scientists. We also present a demo using PySpark Logistic Regression for offline and online model training to illustrate the difference between the two.
The large-scale adoption of “Big Data” has created a multitude of exciting new job roles and technologies. In line with this, data scientists and data engineers have both become key members of many technology teams, a coexistence that has often motivated the debate: streaming or batch? In this talk, we discuss the pros & cons of batch vs. streaming processing, especially with respect to finding common ground between the workflows of data engineers and data scientists. We address the types of use cases suited to batch processing, streaming processing, or both. Finally, we present a demo using PySpark Logistic Regression for offline or online decisioning and model updating, illustrating different ways to use batch and/or streaming processing to apply machine learning to real-time data.
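For readers who want a feel for the offline vs. online distinction before the talk, here is a minimal sketch in plain Python. It is not the PySpark demo and does not use any Spark API: the function names (`train_batch`, `train_online`, `sgd_update`, `make_data`), the synthetic data, and the learning rates are all assumptions made for illustration. Batch training repeatedly computes a gradient over the whole stored dataset, while online training updates the weights once per arriving example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    # weights[0] is the bias term; x is a list of feature values.
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)


def sgd_update(weights, x, y, lr=0.1):
    # One online step: adjust the weights using a single labeled example.
    err = predict(weights, x) - y
    return [weights[0] - lr * err] + [
        w - lr * err * xi for w, xi in zip(weights[1:], x)
    ]


def train_batch(data, n_features, epochs=200, lr=0.5):
    # Offline training: full-gradient descent over the whole stored dataset.
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in data:
            err = predict(weights, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


def train_online(stream, n_features, lr=0.1):
    # Online training: the model is updated as each example arrives,
    # and no example needs to be kept after it has been processed.
    weights = [0.0] * (n_features + 1)
    for x, y in stream:
        weights = sgd_update(weights, x, y, lr)
    return weights


def accuracy(weights, data):
    hits = sum((predict(weights, x) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


def make_data(n, seed=0):
    # Hypothetical synthetic data: label is 1 when x0 + x1 > 0.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data


if __name__ == "__main__":
    data = make_data(500)
    print("batch accuracy: ", accuracy(train_batch(data, 2), data))
    print("online accuracy:", accuracy(train_online(iter(data), 2), data))
```

The trade-off mirrors the one discussed in the talk: the batch version needs the full dataset available up front but sees every example many times, whereas the online version can start making decisions immediately and keeps adapting as new data arrives.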