PyVideo.org · Robert Coup - /me wants it. Scraping sites to get data.


Description

/me wants it. Scraping sites to get data.

Presented by Robert Coup

Abstract

Building scrapers for grabbing data from websites. Tools, techniques, and tips.

Outline

Life would be so much easier if the data contained in websites were available raw via APIs. Alas, until that mythical day comes we either need to deal with unhelpful people via email and phone, or just get it ourselves. Python has some great tools to help with building scrapers and with parsing and formatting the data we get. We'll start with the basics: tracking what needs to be done, making web requests, parsing HTML, following links, and extracting data from Excel and PDF documents. Our scraper needs to be resilient against too-clever content management systems, FrontPage-era HTML, and plain dodgy data. We may need to pass through logins and other messiness. We'll cover techniques and tips for approaching these problems while keeping your solution flexible and as simple as possible. We'll discuss some scrapers built for New Zealand data, and introduce a new project from the NZ open government data group to provide a RESTful interface to scrapers, effectively creating a nice API where there isn't one.
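
As a rough illustration of the "parsing HTML" and "following links" basics mentioned in the outline, here is a minimal sketch using only the Python standard library. It is not taken from the talk or its slides: the class name `LinkAndCellParser`, the helper `scrape_page`, and the example URL and table values are all hypothetical. It assumes the data of interest sits in table cells and that unclosed `<td>` and `<tr>` tags (common in old hand-edited HTML) should be tolerated.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndCellParser(HTMLParser):
    """Collect absolute link URLs and table-cell text from one HTML page.

    Hypothetical helper for illustration; tolerant of unclosed <td>/<tr>.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute URLs found in <a href="...">
        self.rows = []         # one list of cell strings per table row
        self._in_cell = False  # True while text belongs to the current cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "tr":
            # A new row implicitly ends any cell left open by sloppy markup.
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # cell without an enclosing <tr>
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_page(html, base_url):
    """Return (links, rows) for one page; whitespace in cells is normalised."""
    parser = LinkAndCellParser(base_url)
    parser.feed(html)
    parser.close()
    rows = [[" ".join(cell.split()) for cell in row]
            for row in parser.rows if row]
    return parser.links, rows
```

In a real scraper the `html` string would come from a web request (for example via `urllib.request.urlopen`), and the returned links would be added to a queue of pages still to visit, which is one simple way of tracking what needs to be done. Keeping the parsing step separate from the fetching step also makes it possible to test the parser against saved copies of awkward pages.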

Slides: http://www.slideshare.net/rcoup/me-wants-it-scraping-sites-to-get-data

[VIDEO HAS ISSUES: Sound and video are poor. Slides are hard to read.]
