Description
This talk will present, with examples, how a software engineering team can work together with data scientists (both in-house and external collaborators) to leverage their unique domain knowledge and skills in analyzing data, while supporting them in working independently and making sure that their work is constantly tested and evaluated and can be easily integrated into the larger product.
Abstract
Collaboration between data scientists and software engineers can run into the following issues:
- Data scientists and engineers use different tools (more interactive vs. more automated, for example Jupyter/IPython notebooks vs. the command line)
- If getting the latest data requires ops/engineering knowledge, the analysis may be done on "stale" data or on too small a subset of the data (for example, data scientists working from manual exports)
- Regression testing, parameter tuning, evaluation of results, backfills, and other common scenarios in data-driven applications also require engineering knowledge. The engineers are in the best position to provide tools and processes for the data science team, but this potential often goes untapped
These issues lead to longer time to production, unhappiness in the data science team (who end up fighting operations work instead of doing the work they enjoy), less trustworthy results, and less trust between teams in general. When collaboration is done right, however, data science and engineering teams can have a very good symbiotic relationship in which each person plays to their strengths towards a common goal.
Some collaboration patterns that foster a good relationship between data scientists and engineers are the following:
- Continuous evaluation – making sure the data science algorithm continues to give good results with every commit (or combination of commits, in case there are several repositories with different data scientists working on them); a minimal sketch of such a check follows this list
- Report templating – data scientists can work with Jupyter notebooks using an extension that allows those ipynb files to be used as templates (i.e., where some variable values can be filled in later). Those notebooks can then be applied to different datasets to quickly diagnose issues (see the templating sketch after this list)
- Data API – have a well-documented API that gives data scientists easy access to the data, so that they can do their exploration without needing the software engineering team to manually provide exports (see the client sketch after this list)
- Some flexibility regarding tools – if domain experts prefer to use SFTP to upload files to the server for analysis, let them. Too much flexibility, however, can become an anti-pattern.
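
To make continuous evaluation concrete, here is a minimal sketch of a quality gate that could run in CI on every commit. The `recommender` module, golden dataset path, and metric thresholds are all hypothetical placeholders, not part of any actual codebase from the talk:

```python
# Minimal continuous-evaluation sketch: fail the build if a commit degrades
# results on a fixed "golden" dataset. Everything project-specific here
# (the recommender package, file paths, thresholds) is a placeholder.
import json

import pytest

from recommender import train_model, score_predictions  # hypothetical API

GOLDEN_DATASET = "eval/golden_set.json"   # fixed, versioned evaluation data
MIN_PRECISION = 0.80                      # agreed-upon quality floor
MIN_RECALL = 0.65


@pytest.fixture(scope="session")
def golden_examples():
    with open(GOLDEN_DATASET) as f:
        return json.load(f)


def test_model_quality_has_not_regressed(golden_examples):
    """A failing metric blocks the merge like any failing unit test."""
    model = train_model(golden_examples["train"])
    metrics = score_predictions(model, golden_examples["holdout"])
    assert metrics["precision"] >= MIN_PRECISION
    assert metrics["recall"] >= MIN_RECALL
```

Running this under `pytest` in the CI pipeline turns "the algorithm still gives good results" into a property checked on every commit, rather than something verified by hand.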
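One concrete way to do report templating is papermill (an assumption here; the talk does not name a specific extension). Default parameter values live in a notebook cell tagged `parameters`, and papermill injects new values when executing a copy of the notebook:

```python
# Sketch of report templating with papermill; the notebook name, parameter
# names, and dataset paths are illustrative.
import papermill as pm

# templates/diagnostics.ipynb contains a cell tagged "parameters" that
# defines defaults for `dataset_path` and `min_date`; papermill overrides
# them for each executed copy.
for path in ["exports/sales_jan.csv", "exports/sales_feb.csv"]:
    report_name = path.split("/")[-1].replace(".csv", "")
    pm.execute_notebook(
        "templates/diagnostics.ipynb",
        f"reports/diagnostics-{report_name}.ipynb",
        parameters={"dataset_path": path, "min_date": "2019-01-01"},
    )
```

The output notebooks are ordinary ipynb files, so data scientists can open a generated report, tweak it interactively, and fold improvements back into the template.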
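A data API can be as thin as a small client function that data scientists import in their notebooks. The sketch below assumes a hypothetical internal HTTP service; the endpoint, table names, and parameters stand in for whatever the real service exposes:

```python
# Sketch of a thin client for a documented internal data API; the endpoint,
# query parameters, and response shape are assumptions, not a real service.
import pandas as pd
import requests

BASE_URL = "https://data.internal.example.com/api/v1"  # hypothetical endpoint


def fetch_events(table: str, since: str, limit: int = 100_000) -> pd.DataFrame:
    """Pull fresh rows straight into a DataFrame, no manual exports needed."""
    resp = requests.get(
        f"{BASE_URL}/{table}",
        params={"since": since, "limit": limit},
        timeout=60,
    )
    resp.raise_for_status()
    return pd.DataFrame(resp.json()["rows"])


# Exploration then starts from live data instead of a stale export:
# df = fetch_events("user_events", since="2019-01-01")
```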