Description
Twenty years ago, in 2003, Python 2.3 was released with csv.reader(), a function that provided support for parsing CSV files. The C implementation, proposed in PEP 305, defines a core tokenizer that has been a reference for many subsequent projects. Two commonly needed features, however, were not addressed in csv.reader(): determining the type of each column, and converting strings to those types (or columns to arrays). Pandas read_csv() implements automatic type conversion and realization of columns as NumPy arrays (delivered in a DataFrame), with performance good enough to be widely regarded as a benchmark. The Pandas implementation, however, does not support all NumPy dtypes. NumPy offers loadtxt() and genfromtxt() for similar purposes, but the former (recently re-implemented in C) does not implement automatic type discovery, and the latter (implemented in Python) suffers from poor performance at scale.
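To make the gap concrete, here is a minimal sketch of what csv.reader() returns. The sample data is made up for illustration; the point is only that every field arrives as a string, with no per-column type.

```python
import csv
import io

# Hypothetical sample data, used only for illustration.
text = "id,price,active\n1,2.5,True\n2,4.0,False\n"

rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# Every field comes back as str; no per-column type is determined.
field_types = {type(value).__name__ for record in records for value in record}

print(header)       # ['id', 'price', 'active']
print(records[0])   # ['1', '2.5', 'True']
print(field_types)  # {'str'}
```

Any type discovery or conversion to arrays has to happen in a separate pass over these strings.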
To support reading delimited files in StaticFrame (a DataFrame library built on an immutable data model), I needed something different: the full configuration options of Python's csv.reader(); optional type discovery for one or more columns; support for all NumPy dtypes; and performance competitive with Pandas read_csv().
Following the twenty-year tradition of extending csv.reader(), I implemented delimited_to_arrays() as a C extension to meet these needs. Using a family of C functions and structs, Unicode code points are collected per column (with optional type discovery), converted to C types, and written into NumPy arrays, all with minimal PyObject creation or reference counting. With delimited_to_arrays() incorporated in StaticFrame, performance tests across a range of DataFrame shapes and degrees of type heterogeneity show significant performance advantages over Pandas. Independent of its usage in StaticFrame, delimited_to_arrays() provides a powerful new resource for converting CSV files to NumPy arrays. This presentation will review the background, architecture, and performance characteristics of this new implementation.
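The per-column flow described above (collect values by column, discover a type, convert, write into an array) can be sketched in pure Python. This is only an illustration of the idea, not the C implementation or the API of delimited_to_arrays(): the names discover_dtype(), to_array(), and columns_to_arrays() are hypothetical, the discovery rules are simplified to bool, int, float, and str, and every value here is a Python object, which is exactly the overhead the C extension is designed to avoid.

```python
import csv
import io

import numpy as np


def discover_dtype(values):
    """Hypothetical helper: pick bool, int64, float64, or str for a column."""
    if all(v in ("True", "False") for v in values):
        return np.dtype(bool)
    for dtype, parse in ((np.dtype(np.int64), int), (np.dtype(np.float64), float)):
        try:
            for v in values:
                parse(v)
        except ValueError:
            continue  # at least one value does not fit; try the next type
        return dtype
    return np.dtype(str)


def to_array(values, dtype):
    """Hypothetical helper: convert a column of strings to a NumPy array."""
    if dtype == np.dtype(bool):
        return np.array([v == "True" for v in values], dtype=bool)
    if dtype.kind in "if":
        parse = int if dtype.kind == "i" else float
        return np.array([parse(v) for v in values], dtype=dtype)
    return np.array(values, dtype=str)


def columns_to_arrays(text):
    """Parse delimited text with a header row into one array per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    columns = zip(*records)  # transpose rows into columns
    return {name: to_array(col, discover_dtype(col)) for name, col in zip(header, columns)}


# Hypothetical sample data, used only for illustration.
arrays = columns_to_arrays("id,price,active,label\n1,2.5,True,a\n2,4,False,b\n")
for name, array in arrays.items():
    print(name, array.dtype.kind, array.tolist())
```

Here each column is scanned once for discovery and once for conversion, and each field is a separate Python string along the way.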