Description
PyData Berlin 2016
Does your source data come in multiple formats? Do you have multiple APIs to pull data from? This talk goes through some common problems you will face, and solutions to them, when trying to combine multiple research data sources programmatically. We go through a real-world web project that visualizes poverty data, with JSON APIs, Shapefiles and Excel spreadsheets as data sources.
Introduction
- Give talk goals: This talk aims to give you the tools to solve the engineering challenges of combining different data sources.
- Set the context: We use geodata from humanitarian projects as an example, but the solutions apply to other areas as well.
- Go through talk outline
Part 2: Quickly introduce the project
Give the audience an idea of a real-world project in preparation for Part 3
- Show screenshots of the final project
- Go through the technologies used (ESRI Shapefiles, GeoJSON/TopoJSON, XLS, APIs, Python libraries)
- Introduce the data pipeline
Part 3: Explain common problems and our solutions for them
This is the meat of the talk. Each point introduces a problem and suggests at least one solution. Solutions are based on Python technologies.
- Handling different data formats
- Managing the data sources (validation, automation, etc.)
- Normalizing units
- Mapping problems (different projects may follow different standards for the IDs)
- Normalizing data and metadata
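
As a rough illustration of the points above, the following sketch reads records from two differently shaped sources, maps each source's region ID to one shared code, and converts the values to a common unit. It is not the project's actual code. The field names, the `ID_CROSSWALK` and `UNIT_FACTORS` tables, the region codes and the sample values are all made up for the example, and a CSV export stands in for the Excel spreadsheet so that only the standard library is needed.

```python
import csv
import io
import json

# Hypothetical crosswalk: (source, source-specific id) -> shared region code.
ID_CROSSWALK = {
    ("api", "KE-30"): "KEN.30",
    ("sheet", "Nairobi"): "KEN.30",
}

# Hypothetical factors that convert each source's unit to a fraction (0..1).
UNIT_FACTORS = {"percent": 0.01, "fraction": 1.0}


def normalize(source, raw_id, value, unit):
    """Return one record in the shared schema; fail loudly on unknown input."""
    if (source, raw_id) not in ID_CROSSWALK:
        raise ValueError(f"unmapped id {raw_id!r} from source {source!r}")
    if unit not in UNIT_FACTORS:
        raise ValueError(f"unknown unit {unit!r} from source {source!r}")
    return {
        "region": ID_CROSSWALK[(source, raw_id)],
        "poverty_rate": float(value) * UNIT_FACTORS[unit],
        "source": source,  # keep provenance as metadata
    }


def from_json(text):
    """Assumed API shape: a list of {"id": ..., "rate": <percent>} objects."""
    for item in json.loads(text):
        yield normalize("api", item["id"], item["rate"], "percent")


def from_csv(text):
    """Assumed sheet shape: columns "name" and "share" (a fraction)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield normalize("sheet", row["name"], row["share"], "fraction")


# Made-up sample inputs, one per source.
api_payload = '[{"id": "KE-30", "rate": 21.5}]'
sheet_export = "name,share\nNairobi,0.22\n"

records = list(from_json(api_payload)) + list(from_csv(sheet_export))
for record in records:
    print(record)
```

Raising on an unmapped ID or unknown unit, rather than silently skipping the row, is one way to make validation part of the pipeline, so that a change in an upstream source shows up as an error instead of as missing data.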
Part 4: Wrap up
- Quickly explain how we applied these solutions in the project
- Sum up the things you should consider (check-list)