Description
PyData Madrid 2016
Most of the talks and workshop tutorials can be found here: https://github.com/PyDataMadrid2016/Conference-Info
When we want to automate the extraction of a website's contents, we often find that the site offers no API for the data we need, so scraping techniques are required to retrieve the data from the web automatically. Some of the most powerful tools for extracting data from web pages can be found in the Python ecosystem.
Introduction to webscraping
Web scraping is the process of collecting or extracting data from web pages automatically. Nowadays it is a very active field that shares goals with the semantic web, natural language processing, artificial intelligence, and human-computer interaction.
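As a minimal illustration of extracting data from a page automatically, the sketch below collects every link and its text using only Python's standard library. The `LinkCollector` class and the sample HTML are hypothetical and exist only for this example; in practice the HTML would be the body of an HTTP response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content used only for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Talks</h1>
  <ul>
    <li><a href="/talks/1">Intro to <b>scraping</b></a></li>
    <li><a href="/talks/2">Scrapy in practice</a></li>
  </ul>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

The libraries covered in the talk wrap this kind of parsing in a much more convenient interface; the standard-library version is shown only to make the idea concrete.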
Python tools for webscraping
Some of the most powerful tools for extracting data can be found in the Python ecosystem, among which we highlight Beautiful Soup, Webscraping, PyQuery, and Scrapy.
Comparison between webscraping tools
The tools mentioned above will be compared, showing the advantages and disadvantages of each one and highlighting the mechanisms each offers for data extraction, such as regular expressions, CSS selectors, and XPath expressions.
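To make the difference between two of these techniques concrete, here is a small sketch that extracts the same values with a regular expression and with an XPath-style query. It assumes well-formed markup and uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath; the sample markup and variable names are hypothetical. CSS selectors are not shown because they require one of the third-party libraries.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this example.
SAMPLE = """
<html><body>
  <div class="talk"><span class="title">Intro to scraping</span><span class="room">A1</span></div>
  <div class="talk"><span class="title">Scrapy in practice</span><span class="room">B2</span></div>
</body></html>
"""

# 1) Regular expression: quick to write, but tied to the exact markup text.
titles_regex = re.findall(r'<span class="title">(.*?)</span>', SAMPLE)

# 2) XPath-style query: selects nodes by document structure and attributes.
root = ET.fromstring(SAMPLE)
titles_xpath = [
    el.text
    for el in root.findall(".//div[@class='talk']/span[@class='title']")
]

print(titles_regex)
print(titles_xpath)
```

Both approaches return the same titles here, but the regular expression would stop matching if, for example, the attribute order or quoting in the page changed, while the tree query would not.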
Project example with Scrapy
Scrapy is a framework written in Python for automated data extraction that can be used for a wide range of applications, such as data mining. When using Scrapy we have to create a project, and each project consists of:
- Items: We define the elements to be extracted.
- Spiders: The heart of the project; here we define the data extraction procedure.
- Pipelines: The procedures that process the extracted elements: data validation and cleansing of HTML code.
Outline
Introduction to webscraping (5 min): I will mention the main scraping techniques.
1.1. WebScraping
1.2. Screen scraping
1.3. Report mining
1.4. Spiders
Python tools for webscraping (10 min): For each library I will give an introduction with a basic example. In some examples I will use the requests library for sending HTTP requests.
2.1. BeautifulSoup
2.2. Webscraping
2.3. PyQuery
Comparing scraping tools (5 min)
3.1. Introduction to techniques for obtaining data from web pages, such as regular expressions, CSS selectors, and XPath expressions
3.2. Comparative table of the main features of each tool
Project example with Scrapy (10 min)
4.1. Project structure with Scrapy
4.2. Components (Scheduler, Spider, Pipeline, Middlewares)
4.3. Generating reports in JSON, CSV, and XML formats
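
As an illustration of the item, spider, and pipeline roles in a Scrapy project and of the report generation in 4.3, here is a minimal plain-Python sketch. It deliberately does not use Scrapy's own classes or its export machinery; `TalkItem`, `spider`, `pipeline`, the `to_*` helpers, and the sample page are all hypothetical names chosen for this example, and only the standard library is assumed.

```python
import csv
import io
import json
import re
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class TalkItem:
    """Item: declares the fields we want to extract."""
    title: str
    minutes: int


def spider(page):
    """Spider: defines how raw items are extracted from a page."""
    for title, minutes in re.findall(r"<li>(.*?)\((\d+) min\)</li>", page):
        yield TalkItem(title=title, minutes=int(minutes))


def pipeline(items):
    """Pipeline: cleans each item and drops the ones that fail validation."""
    for item in items:
        item.title = item.title.strip()
        if item.title and item.minutes > 0:
            yield item


def to_json(items):
    return json.dumps([asdict(i) for i in items])


def to_csv(items):
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "minutes"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(asdict(i) for i in items)
    return buffer.getvalue()


def to_xml(items):
    root = ET.Element("items")
    for i in items:
        node = ET.SubElement(root, "item")
        for key, value in asdict(i).items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical page; the second entry has an empty title and is dropped.
PAGE = (
    "<ul><li> Python tools (10 min)</li>"
    "<li>  (5 min)</li>"
    "<li>Project example (10 min)</li></ul>"
)

items = list(pipeline(spider(PAGE)))
print(to_json(items))
print(to_csv(items))
print(to_xml(items))
```

In a real Scrapy project the framework's scheduler and middlewares handle the requests, and serialization to these formats is handled by Scrapy rather than by hand-written helpers like the ones above.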