
Getting Scikit-Learn To Run On Top Of Pandas



Ami Tavory

Ami is a data scientist at Facebook Research's Core Data Science group. He previously worked as a machine learning researcher in the fields of bioinformatics and algorithmic trading. In 2010 he received a Ph.D. in Electrical Engineering from Tel Aviv University, in the field of financial information theory. His bachelor's and master's degrees are also from Tel Aviv University.

Ami uses Python and C++ for data analysis. He has contributed to various open source projects, and is the author of a libstdc++ extension shipped with g++ (pb_ds: policy-based data structures).


Scikit-Learn is built directly over NumPy, Python's numerical array library. Pandas adds metadata and higher-level munging capabilities to NumPy. This talk describes how to intelligently auto-wrap Scikit-Learn to create a version that can leverage Pandas's added features.


Scikit-Learn is the de facto standard Python library for general-purpose machine learning. It operates over NumPy, an efficient but low-level, homogeneous array library. Pandas adds metadata, heterogeneity, and higher-level munging capabilities to NumPy.

In the field of visualization, newer-generation libraries such as Seaborn and Bokeh provide safer, more readable, and higher-level functionality by operating over Pandas data structures. Some of these are implemented using Matplotlib, a lower-level NumPy-based plotting library.

This talk describes a library for a Pandas-based version of Scikit-Learn. Here, too, giving a Pandas interface to a machine-learning library provides code that is safer to use, is more readable, and allows direct integration with Pandas's higher-level munging capabilities.
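To make the safety claim concrete, here is a minimal sketch of one way a Pandas-facing estimator might guard against misordered features. It is not the library presented in the talk: `FrameEstimator` and `LeastSquares` are hypothetical names invented for this illustration, and `LeastSquares` is a NumPy-only stand-in for a Scikit-Learn regressor.

```python
import numpy as np
import pandas as pd


class LeastSquares:
    """NumPy-only stand-in for a Scikit-Learn regressor (hypothetical)."""

    def fit(self, X, y):
        # Ordinary least squares without an intercept term.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


class FrameEstimator:
    """Hypothetical wrapper that keeps Pandas metadata around an estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Remember the column names seen during training.
        self.columns_ = list(X.columns)
        self.estimator.fit(X.to_numpy(dtype=float), y.to_numpy(dtype=float))
        return self

    def predict(self, X):
        # Refuse frames whose columns differ from the training columns.
        if list(X.columns) != self.columns_:
            raise ValueError(
                f"expected columns {self.columns_}, got {list(X.columns)}"
            )
        values = self.estimator.predict(X.to_numpy(dtype=float))
        # Return a Series that keeps the row labels of the input.
        return pd.Series(values, index=X.index, name="prediction")


train = pd.DataFrame({"a": [1.0, 0.0, 1.0], "b": [0.0, 1.0, 1.0]})
target = pd.Series([2.0, 3.0, 5.0])
model = FrameEstimator(LeastSquares()).fit(train, target)

new_rows = pd.DataFrame({"a": [2.0], "b": [1.0]}, index=["row_x"])
prediction = model.predict(new_rows)

try:
    model.predict(new_rows[["b", "a"]])  # columns in the wrong order
    swapped_rejected = False
except ValueError:
    swapped_rejected = True
```

With plain NumPy arrays the swapped columns would be accepted silently and produce a different number; here the column labels let the wrapper reject the call, and the prediction comes back indexed by the original row labels.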

Due to the large scale and evolving nature of Scikit-Learn's codebase, it is infeasible to wrap it manually. Except for a small number of intentional deviations from Scikit-Learn, the library wraps Scikit-Learn modules lazily through module and class introspection, and dynamic module loading.
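The combination of lazy loading, introspection, and decoration can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the talk's actual implementation: `toy_learn`, `MeanRegressor`, `LazyWrappedModule`, `wrap_class`, and `logged` are all hypothetical names, the toy module stands in for a Scikit-Learn module, and the decorator only records calls where a real wrapper would convert Pandas objects.

```python
import functools
import importlib
import inspect
import sys
import types

# Hypothetical stand-in for a Scikit-Learn module, registered in
# sys.modules so that importlib can find it by name.
toy = types.ModuleType("toy_learn")


class MeanRegressor:
    """Toy estimator that predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


toy.MeanRegressor = MeanRegressor
sys.modules["toy_learn"] = toy

CALLS = []  # names of wrapped methods, in the order they ran


def logged(method):
    """Placeholder decorator; a real one would adapt Pandas inputs."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        CALLS.append(method.__name__)
        return method(self, *args, **kwargs)
    return wrapper


def wrap_class(cls, method_names=("fit", "predict")):
    """Build a subclass of cls with the selected methods decorated."""
    overrides = {
        name: logged(func)
        for name, func in inspect.getmembers(cls, inspect.isfunction)
        if name in method_names
    }
    return type(cls.__name__, (cls,), overrides)


class LazyWrappedModule(types.ModuleType):
    """Module proxy that imports and wraps its target on first use."""

    def __init__(self, target_name):
        super().__init__("wrapped." + target_name)
        self._target_name = target_name
        self._target = None  # nothing is imported yet

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._target is None:
            self._target = importlib.import_module(self._target_name)
        attr = getattr(self._target, name)
        if inspect.isclass(attr):
            attr = wrap_class(attr)
        setattr(self, name, attr)  # cache, so wrapping happens once
        return attr


lazy = LazyWrappedModule("toy_learn")
model = lazy.MeanRegressor().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
predictions = model.predict([[10], [20]])
```

Because classes are discovered and wrapped on attribute access, nothing in this sketch has to enumerate the wrapped package's contents up front, which is the property that makes the approach tolerant of a large and changing codebase.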

Following a short review of the relevant points of Pandas and Scikit-Learn, the talk is roughly divided into two aspects:

Scikit-Learn And Pandas User Perspective
    Safety Advantages Of Pandas-Based Estimators
    Using Metadata For Inter-Instance Aggregated Features And Cross-Validation
    Using Metadata For Advanced Meta-Algorithms: Stacking, Nested Labeled And Stratified Cross-Validation

Python Developer Perspective
    Unique Challenges Of Scikit-Learn Introspection And Decoration
    Two Approaches For Wrapping Scikit-Learn Estimators
    Lazy Dynamic Module Loading

Recorded at PyCon.DE 2017 in Karlsruhe.

Video editing: Sebastian Neubauer & Andrei Dan

Tools: Blender, Avidemux & Sonic Pi

