Machine Learning with Imbalanced Data Sets

Summary

Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exists for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. This talk looks at these different approaches in the context of fraud detection.

Description

Classification algorithms tend to perform poorly when data is skewed towards one class. Real-world examples include fraud detection, medical diagnosis and oil spill detection, where the class of interest is generally the minority class. A key assumption built into many classification algorithms is that maximising accuracy is the goal. However, when positive instances account for only 1% of the data set, a model that labels every instance as negative reaches 99% accuracy while detecting none of the cases that matter, so accuracy alone is a poor measure of performance.
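The 1% scenario can be checked with a few lines of plain Python. This is a minimal sketch, not material from the talk: the synthetic labels, the always-negative "classifier", and the helper names `accuracy` and `recall` are all assumptions made for illustration.

```python
# Illustrative only: synthetic labels with a 1% positive (e.g. fraud) rate.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# 10 positives and 990 negatives: positives are 1% of the data.
y_true = [1] * 10 + [0] * 990

# A trivial model that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred)

print(f"accuracy: {majority_accuracy:.2f}")
print(f"recall:   {majority_recall:.2f}")
```

The accuracy looks excellent while the recall on the minority class is zero, which is why metrics that focus on the class of interest (recall, precision, or similar) are usually reported alongside or instead of accuracy on skewed data.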

A common way to address class imbalance is to rebalance the data set artificially using sampling techniques, such as over-sampling the minority class or under-sampling the majority class. However, one-class learning and cost-sensitive learning algorithms are growing in popularity. This talk looks at different approaches to tackling imbalanced classes in the context of fraud detection, as recently explored by the GoCardless data team.
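Two of the approaches named here, re-sampling and cost-sensitive learning, can be sketched in plain Python. This is an illustration under stated assumptions, not the GoCardless team's method: the helper names `random_oversample` and `balanced_class_weights`, the synthetic data, and the fixed seed are all hypothetical. The weighting rule (total samples divided by the number of classes times the class count) is one common heuristic among several.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class
    has as many rows as the largest class (sampling with replacement)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Per-class misclassification weights, inversely proportional
    to class frequency: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Synthetic data: 10 positives, 990 negatives, one dummy feature per row.
labels = [1] * 10 + [0] * 990
rows = [[float(i)] for i in range(len(labels))]

res_rows, res_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)

print(Counter(res_labels))  # both classes now have the same count
print(weights)              # the minority class gets the larger weight
```

Re-sampling changes the data the model sees, whereas class weights leave the data untouched and instead make errors on the minority class more expensive during training. If re-sampling is used, it should be applied to the training split only, so that evaluation still reflects the real class distribution.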