Alexander Sibiryakov - Frontera: open source large-scale web crawling framework [EuroPython 2015] [20 July 2015] [Bilbao, Euskadi, Spain]
In this talk I'm going to introduce Scrapinghub's new open source framework, [Frontera]. Frontera makes it possible to build real-time distributed web crawlers as well as crawlers focused on individual websites. It offers:
- customizable URL metadata storage (RDBMS or Key-Value based),
- crawling strategies management,
- transport layer abstraction,
- fetcher abstraction.
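
To make the first two points more concrete, here is a minimal sketch of how pluggable metadata storage and a pluggable crawling strategy could fit together in a crawl frontier. It is not Frontera's actual API: every class and method name below (`MetadataStorage`, `CrawlingStrategy`, `Frontier`, `next_batch`) is hypothetical and invented for illustration, and the transport and fetcher layers are left out entirely.

```python
import heapq
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class MetadataStorage(ABC):
    """Hypothetical interface for per-URL metadata (not Frontera's API)."""

    @abstractmethod
    def put(self, url: str, meta: dict) -> None: ...

    @abstractmethod
    def get(self, url: str) -> Optional[dict]: ...


class InMemoryStorage(MetadataStorage):
    """Dict-backed stand-in for an RDBMS or key-value backend."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def put(self, url: str, meta: dict) -> None:
        self._data[url] = meta

    def get(self, url: str) -> Optional[dict]:
        return self._data.get(url)


class CrawlingStrategy(ABC):
    """Hypothetical interface: decides how urgently a URL should be fetched."""

    @abstractmethod
    def score(self, url: str, depth: int) -> float: ...


class BreadthFirstStrategy(CrawlingStrategy):
    """Shallower pages get higher scores, so they are fetched first."""

    def score(self, url: str, depth: int) -> float:
        return 1.0 / (1 + depth)


class Frontier:
    """Combines one storage backend and one strategy behind a small API."""

    def __init__(self, storage: MetadataStorage, strategy: CrawlingStrategy) -> None:
        self.storage = storage
        self.strategy = strategy
        self._queue: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: equal scores keep insertion order

    def add(self, url: str, depth: int = 0) -> bool:
        """Schedule a URL unless it is already known; return True if added."""
        if self.storage.get(url) is not None:
            return False
        score = self.strategy.score(url, depth)
        self.storage.put(url, {"depth": depth, "score": score, "state": "queued"})
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._queue, (-score, self._counter, url))
        self._counter += 1
        return True

    def next_batch(self, max_n: int) -> List[str]:
        """Return up to max_n URLs, highest score first, marking them as fetching."""
        batch: List[str] = []
        while self._queue and len(batch) < max_n:
            _, _, url = heapq.heappop(self._queue)
            meta = self.storage.get(url) or {}
            meta["state"] = "fetching"
            self.storage.put(url, meta)
            batch.append(url)
        return batch
```

In this sketch, swapping `InMemoryStorage` for a database-backed class, or `BreadthFirstStrategy` for a different scoring rule, would not require touching `Frontier` itself, which is the kind of separation the list above describes.
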
Along with a description of the framework, I'll demonstrate how to build a distributed crawler using [Scrapy], Kafka and HBase, and hopefully present some statistics on the Spanish internet collected with the newly built crawler. Happy EuroPythoning!