Out-of-Core Columnar Datasets

Summary

Tables are a very handy data structure for storing datasets and performing data analysis (filters, groupings, sorts, alignments...).

But it turns out that how the tables are actually implemented has a large impact on how they perform.

Learn what you can expect from the current tabular offerings in the Python ecosystem.

Description

It is a fact: we have just entered the Big Data era. More sensors and more computers, more evenly distributed throughout space and time than ever, are forcing data analysts to navigate through oceans of data before gaining insight into what this data means.

Tables are a very handy and widely used data structure for storing datasets and performing data analysis (filters, groupings, sorts, alignments...). However, the actual table implementation, and especially whether the data is stored row-wise or column-wise, whether it is chunked or sequential, and whether it is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done.
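
As a minimal illustration of the row-wise versus column-wise distinction, here is a sketch using only the Python standard library. The table contents and the names `rows` and `columns` are made up for this example; they are not taken from any of the libraries discussed in the talk.

```python
from array import array

# Hypothetical table with three fields: id (int), price (float), qty (int).
N = 1000

# Row-wise layout: one record per row, the fields of a row stored together.
rows = [(i, i * 0.5, i % 7) for i in range(N)]

# Column-wise layout: one contiguous typed array per field.
columns = {
    "id": array("q", range(N)),
    "price": array("d", (i * 0.5 for i in range(N))),
    "qty": array("q", (i % 7 for i in range(N))),
}

# Aggregating one field in the row-wise layout walks every record ...
total_rowwise = sum(row[1] for row in rows)

# ... while the column-wise layout only touches the one column it needs.
total_colwise = sum(columns["price"])

# Fetching a complete record is the opposite case: a single lookup
# row-wise, but one lookup per column in the column-wise layout.
record_rowwise = rows[42]
record_colwise = tuple(columns[name][42] for name in ("id", "price", "qty"))
```

Both layouts hold the same data and give the same answers; what differs is how much unrelated data each operation has to pass over, which is why the best layout depends on the workload.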

My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data (PyTables, relational databases, BLZ, Blaze...) or with in-memory data containers (NumPy, DyND, Pandas, BLZ, Blaze...).

Special emphasis will be put on the on-disk (also called out-of-core) databases, which are the most commonly used ones for handling extremely large tables.
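
To give a rough idea of what out-of-core processing of a chunked, compressed column can look like, here is a small standard-library sketch. The on-disk format (a size prefix followed by a zlib-compressed chunk), the helper names `write_column`, `iter_chunks` and `column_sum`, and the chunk length are all assumptions invented for this example; this is not the storage format of PyTables, BLZ or any other library mentioned above.

```python
import os
import struct
import tempfile
import zlib
from array import array

CHUNK_LEN = 4096  # values per chunk (arbitrary choice for this sketch)


def write_column(path, values):
    """Store values as a sequence of independently compressed chunks."""
    with open(path, "wb") as f:
        for start in range(0, len(values), CHUNK_LEN):
            chunk = array("d", values[start:start + CHUNK_LEN])
            payload = zlib.compress(chunk.tobytes())
            # Each chunk is prefixed with its compressed size in bytes.
            f.write(struct.pack("<Q", len(payload)))
            f.write(payload)


def iter_chunks(path):
    """Yield one decompressed chunk at a time."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (size,) = struct.unpack("<Q", header)
            chunk = array("d")
            chunk.frombytes(zlib.decompress(f.read(size)))
            yield chunk


def column_sum(path):
    """Sum the column while holding only one chunk in memory at a time."""
    return sum(sum(chunk) for chunk in iter_chunks(path))


# Hypothetical, highly repetitive data, so that it compresses well.
values = [float(i % 100) for i in range(10_000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price.col")
    write_column(path, values)
    on_disk_bytes = os.path.getsize(path)
    total = column_sum(path)
```

Because every chunk is compressed independently, the reader never needs more than one decompressed chunk in memory, regardless of how large the column on disk is.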

The hope is that, after this lecture, the audience will gain better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, on which ones adapt better to their needs.