Web Data Extraction on DIADEM

What’s the best apartment for your needs? Where can you buy your preferred camera in your area?
Search engines are not your best friends here: to find an answer, you need to manually examine dozens of pages. Even with large aggregators, there is no guarantee that you won’t miss your best apartment, just one more click away! In DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), we focus on automatic web data extraction to answer object queries, returning objects rather than pages.
Without human supervision, our system finds, navigates, and analyses the websites of a specific domain, generating wrappers that extract all the objects they contain. DIADEM replaces the human annotator of traditional wrapper induction systems with domain knowledge. This domain knowledge describes both the ontology and the phenomenology of the domain: what the entities and their relations are, and how they occur on websites.
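
To make the split between ontology and phenomenology concrete, here is a minimal Python sketch. It is an illustration only and not DIADEM's actual representation: the class names, the real-estate attributes, and the regular expressions are all assumptions invented for this example. The ontology lists which attributes an entity has; the phenomenology records how each attribute tends to look on a page.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Ontology:
    # Hypothetical: entity type -> attribute names (what exists in the domain).
    entities: Dict[str, List[str]]


@dataclass
class Phenomenology:
    # Hypothetical: attribute name -> regex for how it tends to appear on a page.
    patterns: Dict[str, str]


# Assumed toy domain knowledge for UK real estate listings.
REAL_ESTATE = Ontology(entities={"Property": ["price", "bedrooms", "postcode"]})
APPEARANCE = Phenomenology(patterns={
    "price": r"£\s?\d[\d,]*",
    "bedrooms": r"\d+\s+bed(?:room)?s?",
    "postcode": r"\b[A-Z]{1,2}\d{1,2}\s?\d[A-Z]{2}\b",
})


def annotate(text: str, entity: str,
             ontology: Ontology, phenomenology: Phenomenology) -> Dict[str, str]:
    """Return the first textual match for each attribute of `entity` in `text`."""
    record = {}
    for attribute in ontology.entities[entity]:
        match = re.search(phenomenology.patterns[attribute], text)
        if match:
            record[attribute] = match.group(0)
    return record


snippet = "Lovely 2 bedroom flat, OX1 3QD, £1,250 pcm"
print(annotate(snippet, "Property", REAL_ESTATE, APPEARANCE))
```

In this toy version no page is labelled by hand: the only input besides the text is the domain description, which is the role domain knowledge plays in place of the human annotator.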

The main components of DIADEM are:

  • OPAL, Form Understanding
  • AMBER, Result Page Analysis
  • MALACHITE, Web Block Classification
  • OXTRACTOR, Natural Text Analysis
  • OXPath, Wrapper Specification and Scalable Data Extraction

DIADEM is supported by a five-year ERC Advanced Investigator Grant. A preliminary demo of our first prototype was presented at WWW’12 (see publications).