This tutorial will focus on analyzing a dataset and building statistical models from it. We will describe and visualize the data. We will then build and analyze statistical models, including linear and logistic regression, as well as chi-square tests of independence. We will then apply 4 machine learning techniques to the dataset: decision trees, random forests, lasso regression, and clustering.
I would be happy to conduct an introductory level tutorial on exploring a dataset with the pandas/StatsModels/scikit-learn framework:
- Descriptive statistics. Here we will describe each variable depending on its type, as well as the dataset overall.
- Visualization for categorical and quantitative variables. We will learn effective visualization techniques for each type of variable in the dataset.
- Statistical modeling for quantitative and categorical, explanatory and response variables: chi-square tests of independence, linear regression and logistic regression. We will learn to test hypotheses, and to interpret our models, their strengths, and their limitations.
- I will then expand to the application of machine learning techniques, including decision trees, random forests, lasso regression, and clustering. Here we will explore the advantages and disadvantages of each of these techniques, as well as apply them to the dataset.
This would be a very applied, introductory tutorial, to the statistical exploration of a dataset and the building of statistical models from it. I would be happy to send you the ipython notebook for this tutorial as well.