Many Data Scientists and Engineers don't come from a Software Engineering background, and even if they have experience writing Spark code, they may lack knowledge of application structure principles. This talk is designed to help them write better, more readable code.
PySpark has become very popular over the last couple of years and is now a go-to tool for building and managing data-heavy applications. One of the most common uses of Spark is moving data around by writing ETL/ELT jobs. When doing that, your code should be manageable and understandable to others. In this talk I will introduce good practices for structuring a PySpark application and writing jobs, along with some naming conventions.
I will start the talk with an example of a badly written PySpark job, and over its course we will gradually improve it, so that by the end our application is production ready, easy to manage, and easy to share with other developers.
During this talk I will try to answer these questions:
- How to structure a PySpark ETL application
- How to write an ETL job
- How to package your code and dependencies
- What are some coding and naming conventions
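To give a flavour of the first two questions, here is a minimal sketch of one common way to structure a PySpark ETL job: separate extract, transform, and load steps, with the transform kept as a pure DataFrame-in, DataFrame-out function so it can be tested without touching storage. The module name, paths, and column names are illustrative assumptions, not material from the talk itself.

```python
# etl_job.py - a minimal, illustrative PySpark ETL job skeleton.
# All names (paths, columns) are hypothetical examples.

def extract(spark, source_path):
    """Read raw input data into a DataFrame."""
    return spark.read.parquet(source_path)

def transform(df):
    """Pure transformation: DataFrame in, DataFrame out.

    Keeping this free of I/O makes it easy to unit-test
    with a small in-memory DataFrame.
    """
    return (
        df.filter(df.amount > 0)
          .withColumnRenamed("amount", "amount_usd")
    )

def load(df, target_path):
    """Write the transformed DataFrame out."""
    df.write.mode("overwrite").parquet(target_path)

def run(spark, source_path, target_path):
    """Wire the steps together; the entry point spark-submit would call."""
    load(transform(extract(spark, source_path)), target_path)

if __name__ == "__main__":
    # Only create a SparkSession when run as a script,
    # so importing this module stays cheap in tests.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example-etl").getOrCreate()
    run(spark, "/data/raw/orders", "/data/clean/orders")
    spark.stop()
```

Because `transform` neither reads nor writes anything, it can be exercised in a plain unit test with a tiny DataFrame built via `spark.createDataFrame`, which is one of the main payoffs of this layout.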