Description
Scikit-learn has traditionally centered its data model around NumPy arrays. However, in an important subset of scikit-learn's use cases, the original data in the machine learning pipeline is tabular: heterogeneously typed and labeled. Meanwhile, pandas has become very popular and is increasingly used to represent such tabular data, but scikit-learn does not always play well with heterogeneous DataFrames. This talk will give an overview of the challenges and current bottlenecks when working with tabular data and scikit-learn. It will then show the ongoing developments in scikit-learn to improve this situation and highlight some third-party libraries that try to ease those problems.
Presenter(s): Speaker: Joris Van den Bossche, Université Paris-Saclay Center for Data Science