The IceCube data pipeline from the South Pole to publication

YouTube

Description

PyData Berlin 2016

Transforming data produced at the IceCube Neutrino Observatory into scientific results presents several interesting challenges. My talk will give an overview of our software stack, written in a colorful mix of Python, C++, Java, and FORTRAN, present some of our custom solutions, and highlight areas where we use standard libraries and design patterns.

The IceCube Neutrino Observatory (http://icecube.wisc.edu/) uses a grid of more than 5000 light sensors embedded in a cubic kilometer of glacial ice at the geographic South Pole to detect neutrinos, some of the most elusive particles in the universe. These neutrinos are interesting because the most energetic of them likely hold the key to one of the longest-standing problems in astrophysics, the origin of ultra-high-energy cosmic rays.

Cosmic rays were discovered over 100 years ago, and in the mean time we know that they are electrically charged particles (mostly protons) from outer space. The most energetic of them pack the kinetic energy of a rifle bullet into a single atomic nucleus, well beyond the reach of any particle accelerator that we could ever build on Earth. What we don't know, however, is where or how they are accellerated to such energies, because their directions are scrambled by intervening magnetic fields. This is where neutrinos enter the picture. They are produced in large numbers whenever particles like protons are smashed into each other at high energies, something that is bound to happen in the cosmic accelerators that produce the cosmic rays we observe at Earth. Neutrinos are, however, electrically neutral and only interact via the weak force, which makes them extremely difficult to stop. That makes high-energy neutrinos a very interesting way to do astronomy: unlike photons, they can't be absorbed in dust clouds and radiation fields, and unlike cosmic rays, they can't be deflected by magnetic fields.

The elusiveness of neutrinos is also what makes IceCube data analysis interesting and challenging. The first challenge on the way from raw data to scientific result is data preparation: one needs to actually find neutrino events in the data. For every neutrino event, there are millions of events caused by cosmic-ray air showers that penetrate the 1.5 km of ice over the detector. We separate the neutrinos from uninteresting background events using the features of each event. The optimal set of features and the separation strategy depend on the kind of measurement one wants to do, and there are many different directions of interest within the 300-strong IceCube Collaboration. This means that the event processing chain and data format need to be extremely flexible, allowing physicists with limited software skills to add new features to the data. Our data processing framework, called IceTray, represents each event as a frame, essentially a dictionary of string-key pairs. Each frame is passed through a series of modules that can add new features of arbitrary type based on the data already in the frame, or drop frames from further processing. While the core of IceTray and many of its modules are written in C++, all the configuration is done in Python, and modules themselves can be written in Python. This allows for quite rapid and accessible development. The serialized form of these frames, called an "I3" file, also serves as our long-term archive format.

Once a data sample has been prepared, it's time for visualization and analysis. Here we move out of our custom world and start using standard tools and data formats. We support two analysis universes: one based on HDF5 (https://www.hdfgroup.org/HDF5/), which is convenient to read with PyTables (http://www.pytables.org/), and another based on ROOT (https://root.cern.ch/), an analysis framework and associated data format specific to high-energy physics. We write to such "tabular" formats using an internal framework called tableio that hammers each object in a frame into a series of table rows. Backends for HDF5 and ROOT then emit these rows in a specific format, along with a description and unit string for each column.

I will illustrate a typical Python-based analysis using the discovery of the first high-energy astrophysical neutrinos as an example (http://science.sciencemag.org/content/342/6161/1242856).

PyVideo

The IceCube data pipeline from the South Pole to publication

Description

Details