
In data record extraction, wrapper induction depends heavily on example quality, but manual example creation severely limits its scalability. For web-scale wrapper induction to work, automatic example generation is necessary, at an accuracy and robustness well beyond the reach of current approaches.
To bridge this gap, we present AMBER (Adaptable Model-based Extraction of Result Pages), a template-independent but domain-aware system for fully automated generation of human-quality examples from any result page of a given domain. AMBER proceeds in three phases:
- In the data area identification phase, AMBER locates the data areas that contain the relevant data.
- In the record segmentation phase, AMBER analyzes these areas for repeated structures to subdivide them into records.
- In the attribute alignment phase, domain knowledge guides the alignment of the attributes within the records, dealing with missing and noisy annotations.
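The three phases can be illustrated with a deliberately simplified sketch. This is not AMBER's actual algorithm: the `Node` class, the function names, the heuristics (most annotated children, most frequent child tag, first annotation wins) and the `TITLE`/`PRICE` attribute types are all assumptions made for illustration. It only shows how domain annotations on a page tree could drive area identification, segmentation, and alignment in turn.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node carrying domain annotations."""
    tag: str
    text: str = ""
    annotations: set = field(default_factory=set)  # e.g. {"PRICE"}
    children: list = field(default_factory=list)


def descendants(node):
    """Yield the node itself and every node below it (pre-order)."""
    yield node
    for child in node.children:
        yield from descendants(child)


def find_data_area(root, min_annotated_children=2):
    """Phase 1 (toy heuristic): the node with the most annotated subtrees."""
    best, best_count = None, 0
    for node in descendants(root):
        count = sum(
            1 for child in node.children
            if any(d.annotations for d in descendants(child))
        )
        if count >= min_annotated_children and count > best_count:
            best, best_count = node, count
    return best


def segment_records(area):
    """Phase 2 (toy heuristic): children with the most frequent tag are records."""
    if area is None or not area.children:
        return []
    tag, _ = Counter(c.tag for c in area.children).most_common(1)[0]
    return [c for c in area.children if c.tag == tag]


def align_attributes(records, attribute_types):
    """Phase 3 (toy heuristic): one column per type; missing values stay None."""
    rows = []
    for record in records:
        row = dict.fromkeys(attribute_types)
        for node in descendants(record):
            for annotation in node.annotations:
                # Keep the first value seen; ignore types outside the domain.
                if annotation in row and row[annotation] is None:
                    row[annotation] = node.text
        rows.append(row)
    return rows


def extract(root, attribute_types):
    """Run the three phases in order."""
    return align_attributes(segment_records(find_data_area(root)), attribute_types)


# A made-up result page: three listings, one of them missing its price.
page = Node("body", children=[
    Node("div", text="Search results"),
    Node("ul", children=[
        Node("li", children=[Node("span", "2 bed flat", {"TITLE"}),
                             Node("span", "900 GBP", {"PRICE"})]),
        Node("li", children=[Node("span", "Studio", {"TITLE"})]),
        Node("li", children=[Node("span", "House", {"TITLE"}),
                             Node("span", "1500 GBP", {"PRICE"})]),
        Node("p", text="Sponsored"),
    ]),
])

rows = extract(page, ["TITLE", "PRICE"])
```

In this toy run, the `ul` element is chosen as the data area, its three `li` children become records, and the record without a price annotation keeps `None` in the `PRICE` column rather than shifting the other attributes.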
The records and attributes are then adapted as examples for a wrapper inducer to obtain an efficient wrapper for a given site. AMBER has been evaluated on multiple domains, covering hundreds of sites, and achieves an accuracy above 98%, comparable to that of skilled human annotators.
For more details, see AMBER’s page in DIADEM or our publications.