Description
Selecting the optimal set of features is a key step in the ML modeling process. This talk presents research that tested five approaches to feature selection for a classification model built on the Lending Club dataset. The approaches combine widely used methods with novel ones that rely on open-source libraries.
A central component of the machine learning process is feature selection. Selecting the optimal set of features is important for producing a well-fitting model that generalizes to unseen data. A widely used approach calculates Gini Importance (Gain) to identify the best set of features. However, recent work from Scott Lundberg has found challenges with the consistency of the Gain attribution method. This talk will present model metrics on the Lending Club dataset for five different feature selection approaches, which combine widely used methods with novel ones.
Through the experimental design of the five feature selection approaches tested, attendees will gain clarity on the impact of:
- Data splitting method
- Including relevant two-way and three-way interactions (xgbfir library)
- Backwards stepwise feature selection as opposed to a singular feature selection step
- Backwards stepwise feature selection using Shapley values (shap library)
The knowledge from this research can help data scientists gain predictive power and speed up the feature selection process.
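To make the idea of backwards stepwise feature selection concrete, here is a minimal sketch of the elimination loop. It is not the talk's implementation: the function name `backward_stepwise_select`, the `importance_fn` and `score_fn` callables, and the toy feature names and weights are all hypothetical and chosen only for illustration. In practice, `importance_fn` could refit a model and return the mean absolute Shapley value per feature, and `score_fn` could return a validation metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def backward_stepwise_select(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    score_fn: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Drop the least important feature each round; return the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_features:
        # Re-rank on every round: importances can shift once a feature is removed.
        importances = importance_fn(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # ties favor the smaller feature set
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins (assumptions, not real data): fixed weights play the role of
# per-feature importances, and the score penalizes every extra feature slightly.
TOY_WEIGHTS = {"income": 0.5, "dti": 0.3, "zip_noise": 0.0, "id_noise": 0.0}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {f: TOY_WEIGHTS[f] for f in subset}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in subset) - 0.01 * len(subset)


selected, score = backward_stepwise_select(list(TOY_WEIGHTS), toy_importance, toy_score)
print(selected, round(score, 2))
```

A singular feature selection step would rank the features once and cut at a threshold. The stepwise loop instead recomputes importances after each removal, which is the contrast the list above draws.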