Description
Access to publicly available data has never been easier than it is today. FranceArchives and the Muséum National d’Histoire Naturelle are two examples among many of public institutions that provide access to their data. FranceArchives is a service run by the Service interministériel des archives de France (SIAF) that helps researchers and enthusiasts explore the French archives by grouping and enriching information about their contents. The Muséum National d’Histoire Naturelle (MNHN) is launching the « data.mnhn.fr » project, which aims to unify its datasets and integrate additional information in order to present its naturalists' academic careers. To help users navigate these enormous amounts of data, both institutions aim to enrich their datasets with information from third-party sources such as Wikidata and Geonames.
To this end, tools such as Nazca are used to align the datasets in question. There are two main challenges: the size of the datasets and the heterogeneity between them. Nazca is an open-source, highly adaptable Python tool that addresses both challenges: blocking algorithms reduce the number of comparisons, and normalization functions clean up the data. The tool provides a customizable pipeline comprising normalization, blocking and alignment steps, and automatically collects and reassembles the results at the end.
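The three steps can be illustrated with a minimal, standard-library-only sketch. It does not use Nazca's API: the function names (`normalize`, `block_key`, `align`), the prefix-based blocking key, the `difflib` similarity measure, the threshold and the sample records are all assumptions chosen for illustration.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase, strip accents, and unify hyphens and whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().replace("-", " ").split())


def block_key(normalized_label, size=3):
    """Hypothetical blocking key: the first `size` characters of the label."""
    return normalized_label[:size]


def align(reference, target, threshold=0.85):
    """Return (reference_id, target_id, score) for pairs above the threshold."""
    # Blocking: index the target records by key, so that each reference
    # record is compared only with the candidates that share its key.
    blocks = defaultdict(list)
    for target_id, label in target:
        norm = normalize(label)
        blocks[block_key(norm)].append((target_id, norm))

    matches = []
    for ref_id, label in reference:
        norm = normalize(label)
        for target_id, candidate in blocks.get(block_key(norm), []):
            # Alignment: string similarity between the normalized labels.
            score = SequenceMatcher(None, norm, candidate).ratio()
            if score >= threshold:
                matches.append((ref_id, target_id, score))
    return matches


# Invented sample records; the identifiers are placeholders, not real ones.
archive_places = [("a1", "Saint-Étienne"), ("a2", "Orléans"), ("a3", "Nîmes")]
gazetteer = [("g1", "Saint Etienne"), ("g2", "Orleans"), ("g3", "Nice"), ("g4", "Nimes")]

for ref_id, target_id, score in align(archive_places, gazetteer):
    print(ref_id, target_id, round(score, 2))
```

In this sketch, "Nîmes" is never compared with "Nice" because the two labels fall into different blocks. This shows how blocking reduces the number of comparisons, at the cost of missing matches whose keys differ.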
The talk is split into two major parts: we begin by briefly introducing Nazca and its pipeline. We then illustrate how it is used in FranceArchives to align geographical data with Geonames, and close by explaining how it is used to create controlled vocabularies for names in data.mnhn.fr.