Ensemble methods deliver excellent predictive performance, but they are hard to interpret. Feature importance should not only count how many times a feature is used in the weak learners, but also measure how much that feature contributes to the result. A detailed example and implementation are provided in a Jupyter notebook in Python, using "xgboost", a library for extreme gradient boosting.
I - Feature importance in ensemble algorithms - state of the art
- Feature importance in sklearn/xgboost: the "weight" score in xgboost simply counts the occurrences of a feature across all the weak learners, while sklearn's tree ensembles report an impurity-based score
- Construction of the trees in xgboost: if the trees are deep enough, every feature is going to be used
- Global feature importance is misleading: a given feature might be critical for one subpopulation but completely irrelevant for another (e.g. multi-class classification)
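
To make the counting idea concrete, here is a minimal sketch in plain Python. The nested-dict tree layout, the feature names and the helper names are hypothetical, chosen only for illustration; they are not the xgboost data structures. In xgboost itself, a count of this kind is what `Booster.get_score(importance_type="weight")` reports.

```python
from collections import Counter

# Hypothetical toy ensemble: each tree is a nested dict and leaves hold a "value".
TREES = [
    {"feature": "age", "threshold": 30,
     "left": {"value": -0.4},
     "right": {"feature": "income", "threshold": 50,
               "left": {"value": 0.1}, "right": {"value": 0.3}}},
    {"feature": "income", "threshold": 40,
     "left": {"feature": "income", "threshold": 20,
              "left": {"value": -0.2}, "right": {"value": -0.1}},
     "right": {"value": 0.2}},
]

def count_splits(node, counts):
    """Recursively add one to the counter for every split node in a tree."""
    if "value" in node and "feature" not in node:
        return counts  # leaf: nothing to count
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)
    return counts

def split_count_importance(trees):
    """Count-based importance: number of splits per feature over all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return dict(counts)

importance = split_count_importance(TREES)
```

In this toy ensemble "income" is counted three times and "age" once, even though "age" drives the very first split of the first tree. The count says nothing about how much each split moves the prediction.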
II - Xgboost real feature importance
- Prediction influence: early splits influence the prediction more than later splits, so the importance of a feature must be weighted by the discrimination it provides
- Point-to-point feature importance: following the path of a given prediction, it is possible to weigh the importance of every used feature
- A relevant assessment of feature importance: explanation of a given prediction, and aggregation on a set of data points
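
The path-following idea can be sketched as follows. This is one common way to attribute a single prediction to features (crediting each split with the change in node value it causes), and it is an assumption that it matches the method presented in the talk. The tree layout, the per-node "value" field (read as the average prediction of the training points reaching that node) and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy tree: every node stores a "value"; internal nodes also
# store the split feature and threshold.
TREE = {
    "value": 0.0, "feature": "age", "threshold": 30,
    "left": {"value": -0.4},
    "right": {"value": 0.2, "feature": "income", "threshold": 50,
              "left": {"value": 0.1}, "right": {"value": 0.5}},
}

def path_contributions(tree, x):
    """Follow the decision path of x and credit each split feature with
    the change in node value caused by that split."""
    contributions = defaultdict(float)
    node = tree
    while "feature" in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        child = node[branch]
        contributions[node["feature"]] += child["value"] - node["value"]
        node = child
    # Root value (bias), per-feature contributions, and the leaf prediction.
    return tree["value"], dict(contributions), node["value"]

bias, contribs, prediction = path_contributions(TREE, {"age": 45, "income": 80})
```

By construction, the root value plus the sum of the contributions equals the leaf prediction, so the explanation of a single point is complete. For a boosted ensemble, the same decomposition would be summed over all trees.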
III - Implementation and examples
- Point-to-point feature importance illustration and implementation explanation
- Evolution of feature importance with respect to learning iterations
- Noisy variables cancellation
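
Aggregating per-point explanations over a set of data points could look like the sketch below. The contribution values are made up for illustration and are not results from the notebook; the function name is hypothetical. Whether this is how the talk handles noisy variables is an assumption.

```python
from collections import defaultdict

# Hypothetical per-point contributions (illustrative values, not real results).
POINT_CONTRIBUTIONS = [
    {"age": 0.2, "income": 0.3, "noise": 0.05},
    {"age": -0.4, "noise": -0.05},
    {"age": 0.2, "income": -0.1},
]

def aggregate(contributions):
    """Return the signed mean and the mean absolute contribution per feature.
    A feature absent from a point's path counts as a zero contribution."""
    n = len(contributions)
    signed = defaultdict(float)
    absolute = defaultdict(float)
    for contrib in contributions:
        for feature, value in contrib.items():
            signed[feature] += value / n
            absolute[feature] += abs(value) / n
    return dict(signed), dict(absolute)

mean_signed, mean_abs = aggregate(POINT_CONTRIBUTIONS)
```

With these toy numbers, the signed mean of "noise" cancels to zero, but so does that of "age", whose contributions are large and of opposite sign. The mean absolute contribution keeps "age" on top and leaves "noise" near zero, which is why the choice of aggregation matters.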
IV - Limits and ways forward
- A word on correlated variables
- Is there a trade-off between performance and interpretability?