Efficient ML pipelines using Parquet and PyArrow - PyCon Italia 2022
Parquet is a high-performance columnar data format that has become the de facto standard in the ML world. By leveraging the powerful PyArrow API, I’ll show how to manage Parquet datasets, ranging from a single local file to a partitioned cloud-based dataset updated in real time.
Advanced analytics and Machine Learning (ML) are increasingly used to drive business decisions or provide real-time services for end-users in virtually every industry. Tabular data is the most common type of data. Therefore, efficient processing of tabular datasets is a critical requirement to deliver performant products or services.
In a prototypical production ML workflow, an “ingestion pipeline” needs to store large datasets on the cloud and continuously update them as new data becomes available. An “analytics pipeline” usually needs to process the entire dataset by reading it in batches, because the full dataset would be too large to fit in RAM. An “inference pipeline” provides real-time results (e.g. model predictions or other online statistics) and needs to process small batches of data in near real time. Finally, presenting analytics results requires not only showing the output from the models but also providing context through “historical data” for an arbitrary set of features. Therefore, low-latency access to a small group of columns from a large dataset represents an additional requirement.
In the Python ecosystem, we can leverage tools such as Parquet and PyArrow to address such a complex workflow.
Apache Parquet is a columnar storage format initially created to address similar storage challenges in the Hadoop ecosystem. It has since become a standard for efficient storage of large datasets in all the major languages, including Python.
The Apache Arrow project provides a cross-language in-memory representation and query engine for tabular datasets and has a performant IO interface for Parquet datasets. Its Python interface, PyArrow, makes it possible to query and process large partitioned datasets distributed across multiple files and folders on local and cloud storage.
In this talk, we will combine PyArrow and Parquet datasets to explore several techniques that address the use cases of the typical production ML workflow outlined above.