Hierarchical Bayesian Modelling with PyMC3 and PySTAN

Description

PyData London 2016

Can we use Bayesian inference to detect unusual car emissions test results for Volkswagen? In this worked example, I'll demonstrate hierarchical linear regression using both PyMC3 and PySTAN, and compare the flexibility and modelling strengths of each framework.

Overview

Bayesian inference bridges the gap between white-box model introspection and black-box predictive performance. We gain the ability to fully specify a model and fit it to observed data according to our prior knowledge. Small datasets are handled well, and the overall method and results are very intuitive, lending themselves to both statistical insight and future prediction.

This talk will demonstrate the use of Bayesian inference in a real-world scenario: using a set of hierarchical models to compare exhaust emissions data from a set of vehicle manufacturers.
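To make the idea of a hierarchical model concrete, here is a minimal sketch of partial pooling, the mechanism by which groups with little data borrow strength from the rest. It is an illustration only, not the model used in the talk: it assumes a normal-normal model with known `sigma` and `tau`, plugs in the mean of the group means for the population mean, and uses made-up group names and readings.

```python
from statistics import mean

def partial_pool(groups, sigma, tau):
    """Posterior mean of each group's mean under a normal-normal model.

    Assumes (for illustration) observations y ~ Normal(theta_g, sigma),
    group means theta_g ~ Normal(mu, tau), with sigma and tau known and
    mu replaced by a plug-in estimate: the mean of the group means.
    """
    mu = mean(mean(ys) for ys in groups.values())
    pooled = {}
    for name, ys in groups.items():
        n = len(ys)
        # Weight on the group's own mean: data precision vs. prior precision.
        w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
        pooled[name] = w * mean(ys) + (1 - w) * mu
    return pooled

# Hypothetical emissions-like readings for three made-up groups.
groups = {
    "maker_a": [100.0, 104.0, 98.0, 102.0, 101.0, 99.0],
    "maker_b": [120.0, 118.0],
    "maker_c": [90.0],
}
estimates = partial_pool(groups, sigma=5.0, tau=5.0)
```

In this sketch the single-observation group `maker_c` is pulled much further towards the overall mean than the six-observation group `maker_a`. A full Bayesian treatment, as in PyMC3 or PySTAN, would also place priors on `mu`, `sigma` and `tau` rather than fixing them.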

This talk should interest people who work on the "Type A" (analysis-focused) side of data science; it demonstrates usage of the tools as well as some theory.

The Frameworks

PyMC3 and PySTAN are two of the leading frameworks for Bayesian inference in Python, offering concise model specification, MCMC sampling, and a growing number of built-in conveniences for model validation, verification and prediction.

PyMC3 is the successor to PyMC2 and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers, using the Theano library (of deep-learning fame) for gradient computation. Of particular interest is that it includes the No-U-Turn Sampler (NUTS), developed by Hoffman & Gelman (2014), which is otherwise only available in STAN.

PySTAN is a wrapper around STAN, a major open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. STAN also has HMC and NUTS samplers and, more recently, Variational Inference, which is a very efficient way to approximate the joint probability distribution. Models are specified in a custom syntax and compiled to C++.
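The gradient-based samplers in both frameworks build on Hamiltonian Monte Carlo (HMC). The following is a minimal, standard-library-only sketch of plain HMC with a leapfrog integrator on a one-dimensional standard normal target. It is not the PyMC3 or STAN implementation, and it is not NUTS, which additionally chooses the trajectory length automatically; the function names, step size and number of steps are arbitrary choices for illustration.

```python
import math
import random

def leapfrog(q, p, grad_u, step, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p -= 0.5 * step * grad_u(q)          # initial half step for momentum
    for i in range(n_steps):
        q += step * p                    # full step for position
        if i < n_steps - 1:
            p -= step * grad_u(q)        # full step for momentum
    p -= 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p

def hmc(u, grad_u, q0, n_samples, step=0.2, n_steps=10, seed=0):
    """Draw samples from the density proportional to exp(-u(q))."""
    rng = random.Random(seed)
    q, samples = q0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)          # resample momentum each iteration
        q_new, p_new = leapfrog(q, p, grad_u, step, n_steps)
        # Metropolis correction based on the change in total energy.
        h_old = u(q) + 0.5 * p * p
        h_new = u(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

# Target: standard normal, so u(q) = q^2 / 2 and its gradient is q.
samples = hmc(lambda q: 0.5 * q * q, lambda q: q, q0=0.0, n_samples=5000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance should land close to 0 and 1 respectively. In practice the frameworks compute `grad_u` automatically from the model specification, which is what Theano provides for PyMC3.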

The Real-World Problem & Dataset

I'm currently quite interested in road traffic and vehicle insurance, so I've dug into the UK VCA Vehicle Type Approval to find their Car Fuel and Emissions Information for August 2015. The raw dataset is available for direct download and is small but varied enough for our use here: roughly 2500 cars and 10 features, including the hierarchy of parent-manufacturer, manufacturer and model.
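Hierarchical models are usually fed such a hierarchy as integer index arrays, one per level, plus a lookup from each child to its parent. The sketch below shows one way to build them; the helper `encode_level` and the placeholder rows are hypothetical and are not taken from the VCA dataset.

```python
def encode_level(labels):
    """Map each distinct label to an integer code, in first-seen order."""
    codes = {}
    idx = [codes.setdefault(label, len(codes)) for label in labels]
    return idx, codes

# Illustrative placeholder rows: (parent manufacturer, manufacturer, model).
rows = [
    ("parent_x", "maker_a", "model_1"),
    ("parent_x", "maker_a", "model_2"),
    ("parent_x", "maker_b", "model_3"),
    ("parent_y", "maker_c", "model_4"),
]
parent_idx, parent_codes = encode_level(r[0] for r in rows)
maker_idx, maker_codes = encode_level(r[1] for r in rows)

# For each manufacturer code, record the code of its parent, so that a
# manufacturer-level parameter can be centred on its parent's parameter.
maker_to_parent = [None] * len(maker_codes)
for p, m in zip(parent_idx, maker_idx):
    maker_to_parent[m] = p
```

With these arrays, a per-car observation can look up its manufacturer-level parameter through `maker_idx`, and each manufacturer can look up its parent-level parameter through `maker_to_parent`.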

I will investigate the car emissions data from the point of view of the Volkswagen Emissions Scandal, which seems to have meaningfully damaged their sales. Perhaps we can find unusual results in the emissions data for Volkswagen.

GitHub repo: https://github.com/jonsedar/pymc3_vs_pystan
