HDF5 is for Lovers part 2

Summary

HDF5 is a hierarchical, binary database format that has become the de facto standard for scientific computing. While the specification may be used in a relatively simple way, it also supports several high-level features that prove invaluable. HDF5 bindings exist for almost every language, including two Python libraries (PyTables and h5py). This tutorial will cover HDF5 through the lens of PyTables.

Description

HDF5 is a hierarchical, binary database format that has become the de facto standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays), it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language, including two Python libraries (PyTables and h5py). This tutorial will cover HDF5 itself through the lens of PyTables.
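
As a rough illustration of three of the features named above (chunking, extensible data, and compression), here is a minimal PyTables sketch. It is not taken from the tutorial materials: the file name, the group and array names, the chunk shape, and the compression level are all assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# zlib is the default compression library in PyTables; level 5 is an
# arbitrary middle-of-the-road choice for this sketch.
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5:
    # Groups give the file its hierarchy, much like directories.
    group = h5.create_group("/", "experiment")

    # A 0 in the shape marks the dimension that may grow (extensible data).
    # chunkshape sets how many rows are stored and compressed together.
    readings = h5.create_earray(
        group,
        "readings",
        atom=tables.Float64Atom(),
        shape=(0, 4),
        chunkshape=(1024, 4),
        filters=filters,
    )

    # Append three batches of 1000 rows; the array grows on disk each time.
    for _ in range(3):
        readings.append(np.random.rand(1000, 4))

    final_shape = readings.shape
    chunkshape = readings.chunkshape

with tables.open_file(path, mode="r") as h5:
    # Slicing pulls only the requested rows into memory as a NumPy array.
    window = h5.root.experiment.readings[10:20, :]
```

The chunk shape here is only a placeholder; a sensible value depends on how the data will be read back.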

This tutorial will discuss tools, strategies, and hacks for squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application, making both the code and the data faster and smaller. With such powerful features at the developer's disposal, what is not to love?!

Knowledge of Python, NumPy, C or C++, and basic HDF5 is recommended but not required.

Outline

  • Meaning in layout (20 min)
    • Tips for choosing your hierarchy
  • Advanced datatypes (20 min)
    • Tables
    • Nested types
    • Tricks with malloc() and byte-counting
  • Exercise on above topics (20 min)
  • Chunking (20 min)
    • How it works
    • How to properly select your chunksize
  • Queries and Selections (20 min)
    • In-core vs Out-of-core calculations
    • Table.where() in PyTables
    • Datasets vs Dataspaces
  • Exercise on above topics (20 min)
  • The Starving CPU Problem (1 hr)
    • Why you should always use compression
    • Compression algorithms available
    • Choosing the correct one
    • Exercise
  • Integration with other databases (1 hr)
    • Migrating to/from SQL
    • HDF5 in other databases (JSON example)
    • Other Databases in HDF5 (JSON example)
    • Exercise
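
To give a flavor of the "Tables" and "Queries and Selections" items in the outline, the following is a small sketch of a PyTables table queried with `Table.where()` and `Table.read_where()`. The `Particle` description, its column names, and the generated values are hypothetical and exist only for this example; they are not from the tutorial itself.

```python
import os
import tempfile

import tables


class Particle(tables.IsDescription):
    """Hypothetical row layout: one fixed-width record per particle."""
    idx = tables.Int64Col()
    energy = tables.Float64Col()
    charge = tables.Int32Col()


# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "particles.h5")

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "particles", Particle)

    # Fill the table one row at a time through the reusable Row object.
    row = table.row
    for i in range(100):
        row["idx"] = i
        row["energy"] = i * 0.5
        row["charge"] = i % 3 - 1
        row.append()
    table.flush()

    # where() takes the condition as a string and evaluates it inside
    # PyTables, so only the matching rows are handed back to Python.
    hits = [int(r["idx"]) for r in table.where("(energy > 40.0) & (charge == 1)")]

    # read_where() returns the matches as a NumPy array; field= keeps a
    # single column instead of the whole record.
    energies = table.read_where("energy > 40.0", field="energy")
```

Passing the condition as a string, rather than filtering in a Python loop, is what lets the selection run without first loading the whole table into memory.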
