Helping companies become better citizens of the world means providing them with information about a myriad of issues, such as human rights, diversity, climate change and carbon emissions, and helping them prioritise these different signals. For us Python developers and data scientists, this means working with thousands of sources of different types (PDF, HTML, text, Tweets, etc.) and building a scalable, flexible data pipeline that can ingest, analyse, normalise and summarise all these signals.
We decided to use Python to hook up all the components of our stack. At the core of our data application lies spaCy, the natural language processing engine that enables us to extract meaningful information from large amounts of textual data.
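As a flavour of the kind of extraction spaCy makes possible, the minimal sketch below uses a rule-based `PhraseMatcher` on a blank English pipeline to surface ESG-related phrases in a snippet of text. The term list and the sample sentence are illustrative only, not taken from our production pipeline.

```python
# Hypothetical sketch: rule-based extraction of ESG-related phrases with spaCy.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # lightweight tokenizer-only pipeline, no model download needed

# Match case-insensitively by comparing lowercased token text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["climate change", "carbon emissions", "human rights"]
matcher.add("ESG_SIGNAL", [nlp.make_doc(t) for t in terms])

doc = nlp("The report discusses Carbon Emissions and climate change policy.")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # → ['Carbon Emissions', 'climate change']
```

In practice a statistical pipeline such as `en_core_web_sm` would add named-entity recognition and other annotations on top of this, but even rule-based matching scales well across thousands of documents.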
We will present our workflow at a conceptual level (collecting data, analysing text, creating insights). We will then describe the different components of our stack and why we chose them (MongoDB, Elasticsearch, spaCy, AWS). Finally, we will share the lessons we have learned along the way on this challenging journey. Examples and code illustrating the main points will also be shown during the talk.