Scraping Techniques to Extract Advertisements from Web Pages
[EuroPython 2011] Mirko Urru, Stefano Cotta Ramusino - 24 June 2011 in "Track Tagliatelle"
Online Advertising is an emerging research field at the intersection of Information Retrieval, Machine Learning, Optimization, and Microeconomics. Its main goal is to choose the right ads to present to a user engaged in a given task. Its two main forms are Sponsored Search Advertising and Contextual Advertising. The former places ads on the results page that a Web search engine returns for a query. The latter places ads within the content of a generic, third-party Web page. The ads themselves are selected and served by automated systems based on the content displayed to the user.
Web scraping is the set of techniques used to extract information from a website automatically instead of copying it by hand. In particular, we are interested in studying and adopting scraping techniques for: i. accessing tags as object members; ii. finding tags whose name, contents, or attributes match given selection criteria; iii. accessing tag attributes through a dictionary-like syntax.
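As a rough illustration of the three operations listed above, here is a minimal sketch built only on Python's standard `html.parser` module. The `Tag` and `TreeBuilder` classes, the `parse` function, and the sample markup are hypothetical names chosen for this example; they are not the tools presented in the talk. Matching is limited to tag names and exact attribute values, so matching on tag contents is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Tag:
    """A minimal tag node (hypothetical helper, not a real library class)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.children = []  # child Tag objects, in document order
        self.text = ""      # text found directly inside this tag

    def __getattr__(self, name):
        # (i) Tags as object members: page.body.a is the first <a> under <body>.
        if name.startswith("_"):
            raise AttributeError(name)
        found = self.find_all(name)
        return found[0] if found else None

    def __getitem__(self, key):
        # (iii) Dictionary-like access to attributes: tag["href"].
        return self.attrs[key]

    def find_all(self, name=None, **attrs):
        # (ii) All descendants whose name and attributes match the criteria.
        matches = []
        for child in self.children:
            name_ok = name is None or child.name == name
            attrs_ok = all(child.attrs.get(k) == v for k, v in attrs.items())
            if name_ok and attrs_ok:
                matches.append(child)
            matches.extend(child.find_all(name, **attrs))
        return matches


class TreeBuilder(HTMLParser):
    """Builds a tree of Tag nodes while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Tag("[document]")
        self.current = self.root

    def handle_starttag(self, name, attrs):
        tag = Tag(name, attrs, parent=self.current)
        self.current.children.append(tag)
        if name not in VOID:
            self.current = tag

    def handle_endtag(self, name):
        # Close the nearest open tag with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.name != name:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = parse(
    '<html><body><div id="banner">'
    '<a href="http://ads.example.com/click?id=1" rel="nofollow">Buy now</a>'
    '</div><p><a href="/about">About</a></p></body></html>'
)
first_link = page.body.a                            # (i) object members
nofollow = page.find_all("a", rel="nofollow")       # (ii) selection criteria
print(first_link["href"])                           # (iii) dictionary-like syntax
print([a.text for a in page.find_all("a")])
```

A dedicated scraping library would also cope with malformed markup and richer selection criteria; the sketch only shows the shape of the three operations.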
In this talk, we focus on the adoption of scraping techniques in the contextual advertising field. In particular, we present a system aimed at finding the most relevant ads for a generic web page p. Starting from p, the system selects a set of its inlinks (i.e., the pages that link to p) and extracts the ads they contain. Selection is performed by querying the Google search engine, whereas extraction relies on suitable scraping techniques.
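The extraction step could look roughly like the sketch below. It assumes the inlink pages have already been selected and downloaded, so the search-engine query is left out, and it assumes that an ad is any link pointing at a known ad-serving host. The `AD_HOSTS` set, the `AdLinkExtractor` class, the two functions, and all URLs are hypothetical; the abstract does not say how the presented system recognizes ads.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical ad-serving hosts; a real system needs its own detection rule.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


class AdLinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs for links that point at an ad host."""

    def __init__(self):
        super().__init__()
        self.ads = []
        self._href = None  # href of the ad link currently open, if any
        self._text = []    # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if urlparse(href).hostname in AD_HOSTS:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.ads.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_ads(html):
    parser = AdLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.ads


def ads_for_page(inlink_pages):
    """inlink_pages maps each inlink URL to its already-downloaded HTML."""
    seen, result = set(), []
    for url, html in inlink_pages.items():
        for href, text in extract_ads(html):
            if href not in seen:  # keep each ad once, at its first occurrence
                seen.add(href)
                result.append({"ad": href, "text": text, "found_on": url})
    return result


inlinks = {
    "http://blog.example.org/post":
        '<p>Text <a href="http://ads.example.com/c?id=7">Cheap flights</a></p>',
    "http://news.example.org/item":
        '<a href="/home">Home</a>'
        '<a href="http://ads.example.com/c?id=7">Cheap flights</a>'
        '<a href="http://adserver.example.net/x">Hotel deals</a>',
}
ads = ads_for_page(inlinks)
for ad in ads:
    print(ad["ad"], "|", ad["text"], "|", ad["found_on"])
```

The sketch only gathers candidate ads from the inlinks; deciding which of them are most relevant to p is a separate step that is not shown here.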