Contribute Media
A thank you to everyone who makes this possible: Read More

Orchestrating a climate modeling data pipeline

Summary

PyCon Canada 2015:

Talk Description: In order to run high-resolution regional climate models, it is necessary to interpolate and pre-process large amounts of data from a global climate model at the boundaries of the regional model. Several C and Fortran tools are available in the scientific community to achieve different aspects of this task, but communication between these tools is limited to the filesystem (the program/tool reads input from a file and writes output to a file). In a High Performance Computing (HPC) environment, filesystem access is a bottleneck and temporary files should be avoided.

In this talk I will show how a Python driver module and an in-memory filesystem (RAM disk) can be used to orchestrate the data flow between various tools without creating temporary files on disk and fully automate the entire process. Except for the first input and the last output step, all file I/O is redirected to the RAM disk. The process can also be parallelized in the Python driver module by distributing different input files to different processes using Python multiprocessing. The use of this technique leads to a speed-up of 800% compared to traditional methods, and requires no human intervention.

Different input datasets are supported and new datasets can be added easily due to the object oriented implementation: at every stage of the pre-processing pipeline a dataset method can be overloaded and a different tool can be used, depending on the input dataset. This would not have been possible in a simple scripting language that might otherwise be used to automate such a process.

This module (called PyWPS), is part of the WRF Tools package, a set of Python modules and shell scripts designed to facilitate the operation of a regional climate model (the Weather Research and Forecasting model - WRF) in a HPC environment. It is capable of autonomously running the model over extended periods of time (including automatic crash handling and restarts), automatic pre- and post-processing and archiving.

In the presentation I will first provide some context on regional climate modeling and its computational challenges, before detailing the main design features of the Python WRF pre-processing system (PyWPS).

The package is available on GitHub: https://github.com/aerler/WRF-Tools

Details

Improve this page