Filmed at PyData London 2017
Description
Most companies are well aware of the potential behind Big Data solutions today and happily start collecting every piece of information, building huge pools of Dark Data. How can Data Science teams create an initial overview of what is available? A simple search strategy, optimised and refined to scale, could be a promising way to start.
Abstract
This talk reflects on the author's journey of making the pool of Dark Data available to teams with quite different goals. The emphasis is on creating a simple and robust set of tools that fit together and address the various needs of those teams, based mainly on dask distributed, dask-based dataframes, bokeh and flask.
The key to success was to avoid imposing too much structure at the very beginning and to postpone that task to the individual projects of the users consuming the results of these services, giving them the freedom to create and use their own models.
The talk shows how we implemented a distributed filesystem scanning utility that crawls our 1.5 PB storage system for data every night and produces a simple yet useful table of contents, and how this result set is processed further to fulfil all the project teams' requirements.
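To make the idea of a "table of contents" concrete, here is a minimal sketch of a filesystem scan using only the Python standard library. It is not the utility from the talk: the function names `scan_tree` and `build_table_of_contents` and the record fields are assumptions made for illustration, and the sketch runs serially on one machine. A distributed version might hand each subdirectory to a separate worker (for instance through dask distributed), but that part is not shown here.

```python
import os
from typing import Dict, Iterator, List


def scan_tree(root: str) -> Iterator[Dict[str, object]]:
    """Yield one metadata record per regular file below root.

    Hypothetical helper: symlinks are neither followed nor reported.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        # Visit subdirectories later instead of recursing.
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        info = entry.stat(follow_symlinks=False)
                        yield {
                            "path": entry.path,
                            "size_bytes": info.st_size,
                            "modified": info.st_mtime,
                            "accessed": info.st_atime,
                        }
        except OSError:
            # An unreadable directory (or a file vanishing mid-scan) ends
            # the scan of that one directory only, not the whole crawl.
            continue


def build_table_of_contents(root: str) -> List[Dict[str, object]]:
    """Collect all records below root, sorted by path for stable output."""
    return sorted(scan_tree(root), key=lambda record: record["path"])
```

Calling `build_table_of_contents("/some/data/root")` returns a flat list of records that can be loaded into a dataframe for further processing. Keeping the records flat and unstructured is in the spirit of the approach described above, where structuring is left to the consuming projects.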
These services are used, for example, to

- find expensive duplicates of datasets
- create customer as well as product and service orientated views on the available data
- find data suitable to test algorithms, software and procedures, and to derive current performance
- serve training and education material
- show the usage frequency of the datasets to support an optimised data tiering process

Finally, the procedures involved helped to raise awareness of the value of the available data. This both built more trust in Big Data based solutions and reduced the volume of data kept online, which in turn keeps the corresponding costs at a reasonable level.
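As an illustration of the first use mentioned above, finding expensive duplicates, the following sketch groups file records by size and then confirms candidates with a content hash. The talk description does not say how duplicates were actually detected, so this is one common approach under stated assumptions: each record is a dict with `path` and `size_bytes` keys, and the names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Group files with identical content, most wasteful group first."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for record in records:
        by_size[record["size_bytes"]].append(record["path"])

    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # empty files and unique sizes are not interesting
        # Expensive second pass: hash only the remaining candidates.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        for same in by_digest.values():
            if len(same) > 1:
                groups.append({
                    "paths": sorted(same),
                    "size_bytes": size,
                    # Bytes that could be freed by keeping a single copy.
                    "wasted_bytes": size * (len(same) - 1),
                })
    return sorted(groups, key=lambda g: g["wasted_bytes"], reverse=True)
```

Sorting by `wasted_bytes` puts the most expensive duplicates first, so the largest savings are visible at the top. On a storage system of petabyte scale, hashing every candidate would be costly, so a real service would likely need to sample or parallelise this step.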