Extracting structured information from a webpage is a relatively simple task in Python, given the innumerable tools at our disposal, namely BeautifulSoup, PyQuery, lxml, etc. However, crawling and scraping data from multiple websites makes the job difficult, because everyone on the internet likes to structure their information differently.
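To show how simple single-page extraction can be, here is a minimal sketch using only Python's standard-library `html.parser` (in practice you would likely reach for BeautifulSoup or lxml, as mentioned above); the HTML snippet is made up for illustration:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside every <h2> tag on a page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Only record text while we are inside an <h2> element.
        if self._in_h2:
            self.titles.append(data.strip())

html = "<html><body><h2>First story</h2><p>...</p><h2>Second story</h2></body></html>"
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['First story', 'Second story']
```

The whole "parser" is a dozen lines, but note how the extraction logic is tied to one page's markup: a second site with different HTML would need its own extractor.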
Crawling up to 10 portals is manageable; beyond that it becomes a menace. What we need, then, is a framework that keeps the crawling and parsing logic separate and helps us manage the parsers. This is where Scrapy comes to our assistance. It is the most Pythonic way of scraping the web.
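To make that separation concrete before reaching for a framework, here is a plain-Python sketch (all site names and functions are hypothetical) of keeping a generic crawler apart from per-site parsers, which is essentially the pattern Scrapy formalizes with its Spider classes:

```python
def parse_site_a(html: str) -> dict:
    """Per-site parser: knows only site A's markup (title lives in <h1>)."""
    title = html.split("<h1>")[1].split("</h1>")[0]
    return {"title": title}

def parse_site_b(html: str) -> dict:
    """Per-site parser: knows only site B's markup (title lives in <title>)."""
    title = html.split("<title>")[1].split("</title>")[0]
    return {"title": title}

# Registry mapping a domain to its parser, so the crawler stays generic.
PARSERS = {
    "site-a.example": parse_site_a,
    "site-b.example": parse_site_b,
}

def crawl(pages: dict) -> list:
    """Generic crawling loop: dispatches each page to the right parser.
    (Here `pages` is pre-fetched HTML; a real crawler would do HTTP requests.)"""
    results = []
    for domain, html in pages.items():
        results.append(PARSERS[domain](html))
    return results

pages = {
    "site-a.example": "<html><h1>Story A</h1></html>",
    "site-b.example": "<html><head><title>Story B</title></head></html>",
}
print(crawl(pages))  # [{'title': 'Story A'}, {'title': 'Story B'}]
```

Adding an eleventh portal now means writing one new parser function and one registry entry; the crawling loop never changes. Scrapy takes this further by also handling scheduling, retries, and item pipelines for you.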