Caterva: A Compressed And Multidimensional Container For Big Data

Description

Caterva is a C library on top of C-Blosc2 that implements a simple multidimensional container for compressed binary data. It adds the capability to store, extract, and transform data in these containers, either in-memory or on-disk.

While there are several existing solutions for this scenario (HDF5 is one of the best known), Caterva brings novel features that, taken together, set it apart from them:

Leverage important features of C-Blosc2. C-Blosc2 is the next generation of the well-known, high-performance C-Blosc compression library (see below for a more in-depth description).
Fast and seamless interface with the compression engine. While in other solutions compression seems an afterthought and can imply several internal copies of buffers, the interface between Caterva and C-Blosc2 (its internal compression engine) is meant to be as direct as possible, minimizing copies and hence increasing performance.
Both in-memory and on-disk paradigms are supported the same way. This allows the same API to be used for data that is either in-memory or on-disk.
Support for a plain buffer data layout. This allows for essentially no-copy data sharing with existing libraries (such as NumPy), so their existing functionality can be used directly on Caterva data without losing performance.

Along with these features, there is an important 'mis-feature': Caterva is typeless. Lacking the notion of data type means that Caterva containers are not meant to be used in computations directly, but rather in combination with other higher-level libraries. While this can be seen as a drawback, it actually favors simplicity and leaves it up to users to add the types they are most interested in, which is far more flexible than type-aware libraries (HDF5, NumPy and many others).
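As a rough illustration of what typeless means in practice, here is a minimal sketch using only the Python standard library. It is not the Caterva or cat4py API: the class name TypelessContainer is hypothetical, and zlib stands in for C-Blosc2. The container records only a shape and an item size in bytes, and the caller decides how to interpret those bytes.

```python
import struct
import zlib


class TypelessContainer:
    """Hypothetical stand-in: keeps compressed bytes plus shape and itemsize only."""

    def __init__(self, shape, itemsize, raw):
        nitems = 1
        for dim in shape:
            nitems *= dim
        if len(raw) != nitems * itemsize:
            raise ValueError("buffer size does not match shape * itemsize")
        self.shape = tuple(shape)
        self.itemsize = itemsize
        self._payload = zlib.compress(raw)  # assumption: zlib replaces C-Blosc2 here

    def to_buffer(self):
        """Return the decompressed bytes; no data type is attached."""
        return zlib.decompress(self._payload)


# The caller, not the container, knows these are little-endian float64 values.
values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
raw = struct.pack("<6d", *values)
container = TypelessContainer(shape=(2, 3), itemsize=8, raw=raw)

# The type is applied only when the bytes are read back.
as_floats = list(struct.unpack("<6d", container.to_buffer()))
# The very same bytes can equally be viewed as unsigned 64-bit integers.
as_uints = list(struct.unpack("<6Q", container.to_buffer()))
```

The container never needs to change when a new data type comes along; only the code that packs and unpacks the buffer does.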

During our talk, we will describe all these Caterva features using cat4py, a Python wrapper for Caterva. Among the points to be discussed are:

Introduction to the main features of Caterva.
Description of the basic data container and its usage.
Short discussion of different use cases:
Create and fill high-dimensional arrays.
Get multidimensional slices out of the arrays.
How different compression codecs and filters in the pipeline affect storage/retrieval performance.
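To give a feel for the slicing use case, the following sketch splits a 2-D row-major buffer into independently compressed chunks and reads a slice by decompressing only the chunks it overlaps. It is an illustration under stated assumptions, not cat4py code: the function names build_chunks and get_slice are hypothetical, zlib stands in for C-Blosc2, and the layout is restricted to two dimensions for brevity.

```python
import struct
import zlib

ITEMSIZE = 4  # bytes per item; the container itself stays typeless


def build_chunks(raw, shape, chunkshape):
    """Split a row-major 2-D buffer into independently compressed chunks."""
    rows, cols = shape
    crows, ccols = chunkshape
    chunks = {}
    for r0 in range(0, rows, crows):
        for c0 in range(0, cols, ccols):
            parts = []
            for r in range(r0, min(r0 + crows, rows)):
                start = (r * cols + c0) * ITEMSIZE
                stop = (r * cols + min(c0 + ccols, cols)) * ITEMSIZE
                parts.append(raw[start:stop])
            chunks[(r0 // crows, c0 // ccols)] = zlib.compress(b"".join(parts))
    return chunks


def get_slice(chunks, shape, chunkshape, row_range, col_range):
    """Return the bytes of a 2-D slice and the number of chunks decompressed."""
    rows, cols = shape
    crows, ccols = chunkshape
    (r_start, r_stop), (c_start, c_stop) = row_range, col_range
    out_cols = c_stop - c_start
    out = bytearray((r_stop - r_start) * out_cols * ITEMSIZE)
    touched = 0
    for ci in range(r_start // crows, (r_stop - 1) // crows + 1):
        for cj in range(c_start // ccols, (c_stop - 1) // ccols + 1):
            data = zlib.decompress(chunks[(ci, cj)])
            touched += 1
            r0, c0 = ci * crows, cj * ccols
            width = min(c0 + ccols, cols) - c0  # edge chunks may be narrower
            for r in range(max(r0, r_start), min(r0 + crows, rows, r_stop)):
                lo = max(c0, c_start)
                hi = min(c0 + width, c_stop)
                src = ((r - r0) * width + (lo - c0)) * ITEMSIZE
                dst = ((r - r_start) * out_cols + (lo - c_start)) * ITEMSIZE
                nbytes = (hi - lo) * ITEMSIZE
                out[dst:dst + nbytes] = data[src:src + nbytes]
    return bytes(out), touched


# A 6x8 array of int32 values 0..47, stored as six 2x4 chunks.
shape, chunkshape = (6, 8), (2, 4)
raw = struct.pack("<48i", *range(48))
chunks = build_chunks(raw, shape, chunkshape)

# Rows 2..3, columns 4..7 fall entirely inside a single chunk.
sliced, touched = get_slice(chunks, shape, chunkshape, (2, 4), (4, 8))
result = list(struct.unpack("<8i", sliced))
```

Because each chunk is compressed on its own, the cost of a slice depends on how many chunks it crosses rather than on the size of the whole array, which is one reason the chunk shape matters for retrieval performance.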

We have been using Caterva in one of our internal projects for several months now, and we are pretty happy with the flexibility and ease of use that it brings to us. This is why we decided to open-source it, in the hope that it will benefit others, but also that others may help us develop it further ;-)

About C-Blosc and C-Blosc2

C-Blosc is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc is the first compressor (that we are aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.

C-Blosc2 is the new major version of C-Blosc, with a revamped API and support for new compressors and new filters (data transformations), including filter pipelining, that is, the capability to apply different filters during the compression pipeline, allowing for more adaptability to the data to be compressed. Dictionaries are also introduced, allowing better handling of redundancies among independent blocks and generally increasing compression ratio and performance. Last but not least, there are new data containers that are meant to overcome the 32-bit limitation of the original C-Blosc. Furthermore, the new data containers are available in various formats, including in-memory and on-disk implementations.
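The idea of a filter pipeline can be sketched in a few lines of standard-library Python. This is only an illustration, not the C-Blosc2 API: the function names are hypothetical, zlib stands in for the codec, and the single filter shown is a plain byte shuffle (grouping the k-th byte of every item together), which is the kind of data transformation a filter performs before the codec runs.

```python
import struct
import zlib


def shuffle(data, itemsize):
    """Byte-shuffle filter: group the k-th byte of every item together."""
    return b"".join(data[k::itemsize] for k in range(itemsize))


def unshuffle(data, itemsize):
    """Inverse of shuffle: interleave the byte planes back into items."""
    nitems = len(data) // itemsize
    out = bytearray(len(data))
    for k in range(itemsize):
        out[k::itemsize] = data[k * nitems:(k + 1) * nitems]
    return bytes(out)


def compress(data, itemsize, filters=()):
    """Apply each filter in order, then hand the result to the codec."""
    for forward, _ in filters:
        data = forward(data, itemsize)
    return zlib.compress(data)  # assumption: zlib stands in for a C-Blosc2 codec


def decompress(payload, itemsize, filters=()):
    """Undo the codec, then undo the filters in reverse order."""
    data = zlib.decompress(payload)
    for _, backward in reversed(filters):
        data = backward(data, itemsize)
    return data


# 10000 consecutive little-endian int32 values: only the low bytes vary much.
raw = struct.pack("<10000i", *range(10000))
pipeline = [(shuffle, unshuffle)]

plain_size = len(compress(raw, 4))
shuffled_size = len(compress(raw, 4, pipeline))
restored = decompress(compress(raw, 4, pipeline), 4, pipeline)
```

On slowly varying numeric data like this, the shuffled stream tends to compress better because the mostly constant high bytes end up next to each other. Which filters help depends on the data, which is why being able to choose and chain them is useful.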

Caterva is a library on top of the Blosc2 compressor that implements a simple multidimensional container for compressed binary data. It adds the capability to store, extract, and transform data in these containers, either in-memory or on-disk.
