This tutorial will offer a hands-on introduction to machine learning and to the process of applying its concepts in a Kaggle competition. We will introduce attendees to machine learning concepts, examples and workflows, while building up their skills to solve an actual problem. By the end of the tutorial, attendees will be familiar with a real data science flow: feature preparation, modeling, optimization and validation.
The tutorial will use the IPython notebook, scikit-learn, pandas and NLTK. We’ll use the IPython notebook for interactive exploration and visualization, in order to gain a basic understanding of what’s in the data. From there, we’ll extract features and train a model using scikit-learn. This will bring us to our first submission. We’ll then learn how to structure the problem for offline evaluation, and use scikit-learn’s clean model API to train many models simultaneously and to perform feature selection and hyperparameter optimization.
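To give a flavor of that flow, here is a minimal sketch of feature extraction, modeling and cross-validated hyperparameter search with scikit-learn. It is not the tutorial's actual material: the toy texts and labels are hypothetical stand-ins for a competition's training data, and the choice of TF-IDF features, logistic regression and this particular parameter grid is an assumption made only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for a competition's training set.
texts = [
    "great product, works well",
    "loved it, highly recommend",
    "excellent quality and fast shipping",
    "really happy with this",
    "terrible, broke after a day",
    "awful experience, do not buy",
    "poor quality and slow shipping",
    "really unhappy with this",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain feature extraction and the model so both are fit inside each
# cross-validation fold, which keeps the offline evaluation honest.
pipeline = Pipeline([
    ("features", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Illustrative grid (an assumption): every combination is trained and scored.
param_grid = {
    "features__ngram_range": [(1, 1), (1, 2)],
    "model__C": [0.1, 1.0, 10.0],
}

# 2-fold stratified cross-validation serves as the offline evaluation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_

# After the search, the best pipeline is refit on all the data and can
# produce predictions for unseen rows, as one would for a submission.
predictions = search.predict(["works great, recommend", "broke, poor quality"])
```

Because the vectorizer and the classifier share the same fit/predict interface, one grid can tune both at once. Passing `n_jobs=-1` to `GridSearchCV` would let the candidate models train in parallel.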
At the end of the session, attendees will have time to work on their own to improve their models and make multiple submissions to climb the leaderboard, just like in a real competition. We hope attendees will leave the tutorial not only having learned the core data science concepts and flow, but also having had a great time doing it.