When working on a data project, you will often face messy input files with lots of missing or ill-formatted values. Data providers may update the data manually, making the source even more error-prone. Once you feed the data to a data visualization or a dashboard, this will create many issues. I will show how to create a data preparation pipeline using pandas running on AWS Lambda.
In the talk, I will first review typical cases where a data scientist or data application developer may be faced with dirty data in impractical formats (think Excel files). In particular, I will discuss my experience building data visualizations in a data journalism environment where data is gathered and updated manually.
I will present alternative tools that are available on the market (Talend Dataprep and Trifacta Wrangler, for example), and explain why you may want to roll your own solution. Then we will see how we can use Python and pandas to clean the data, first by interacting with it in a Jupyter notebook, then by turning it into a script.
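To give a flavour of the cleaning step, here is a minimal sketch of the kind of function one might prototype in a notebook and then move into a script. The sample data, the column names and the `clean` function are made up for illustration; they are assumptions, not the dataset or code used in the talk.

```python
import io

import pandas as pd

# Hypothetical messy input: padded headers, inconsistent casing,
# thousands separated by spaces, placeholder text, a duplicated row.
RAW_CSV = """Country , Population ,Updated
France,67 000 000,2019-01-01
 germany ,n/a,2019-02-15
France,67 000 000,2019-01-01
Spain,46 900 000,not yet
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a tidied copy of the raw frame (illustrative rules only)."""
    df = df.copy()
    # Normalise header names: trim whitespace, lower-case.
    df.columns = [c.strip().lower() for c in df.columns]
    # Trim and title-case the country names.
    df["country"] = df["country"].str.strip().str.title()
    # Remove the space separators, then coerce unparseable values to NaN.
    df["population"] = pd.to_numeric(
        df["population"].str.replace(" ", "", regex=False), errors="coerce"
    )
    # Unparseable dates become NaT instead of raising.
    df["updated"] = pd.to_datetime(df["updated"], errors="coerce")
    # Drop rows that are exact duplicates once cleaned.
    return df.drop_duplicates().reset_index(drop=True)


# Read everything as text so that no value is silently reinterpreted.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str, keep_default_na=False)
cleaned = clean(raw)
print(cleaned)
```

Reading every column as text first and converting explicitly with `errors="coerce"` is one possible design choice: bad values show up as missing data that can be counted and reported, rather than stopping the whole run.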
Finally, we will see how to streamline the preparation using AWS Lambda, in an example where we will automatically run our process whenever data is updated in a Google spreadsheet and upload the clean dataset to AWS S3.
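As a rough idea of what the Lambda side could look like, here is a minimal sketch of a handler that downloads a spreadsheet exported as CSV, cleans it with pandas and writes the result to S3. Everything specific here is an assumption made for illustration: the environment variable names (`SHEET_CSV_URL`, `CLEAN_BUCKET`, `CLEAN_KEY`), the `prepare` rules, and the idea that the sheet is reachable through a CSV export URL. How the function gets triggered when the spreadsheet changes is not shown.

```python
import io
import os
from urllib.request import urlopen

import pandas as pd


def prepare(raw_csv: str) -> str:
    """Clean the raw CSV text and return the cleaned CSV text."""
    df = pd.read_csv(io.StringIO(raw_csv), dtype=str, keep_default_na=False)
    # Illustrative rules only: normalise headers and drop duplicate rows.
    df.columns = [c.strip().lower() for c in df.columns]
    df = df.drop_duplicates()
    return df.to_csv(index=False)


def fetch_sheet_csv(url: str) -> str:
    """Download the spreadsheet as CSV text (assumes a CSV export URL)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def handler(event, context, fetch=fetch_sheet_csv, s3_client=None):
    """Lambda entry point; `fetch` and `s3_client` can be swapped in tests."""
    if s3_client is None:
        # Imported lazily so the module can be exercised without AWS access.
        import boto3

        s3_client = boto3.client("s3")

    # Hypothetical configuration, read from the function's environment.
    url = os.environ["SHEET_CSV_URL"]
    bucket = os.environ["CLEAN_BUCKET"]
    key = os.environ.get("CLEAN_KEY", "clean/dataset.csv")

    cleaned = prepare(fetch(url))
    s3_client.put_object(Bucket=bucket, Key=key, Body=cleaned.encode("utf-8"))
    return {"bucket": bucket, "key": key, "bytes": len(cleaned)}
```

Passing the download function and the S3 client as optional arguments keeps the handler easy to run locally with stand-ins, while Lambda itself only ever calls it with `event` and `context`.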