Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web.
OPAL (ontology-based web pattern analysis with logic) is a domain-aware form understanding system that combines visual, textual, and structural features with a thin layer of domain knowledge. The visual, textual, and structural features are used in a domain-independent analysis that produces a highly accurate form labeling. In addition, OPAL produces a form model consistent with a given domain schema, in which every field is associated with one of the given types. The domain schema is used not only to classify the fields and segments of the form model, but also to improve the form model through a set of structural constraints that describe typical fields and their arrangement in forms of the domain, e.g., how price ranges are presented.
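To make the role of the domain schema more concrete, here is a minimal sketch in Python. It is not OPAL's implementation: the `Field` class, the `TYPE_KEYWORDS` table, and both functions are hypothetical names invented for illustration, and simple keyword matching stands in for the actual classification. It assumes the domain-independent analysis has already attached a label to each field, then assigns domain types and applies one structural constraint for price ranges.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Field:
    name: str                    # field name taken from the form markup
    label: str                   # label from the domain-independent analysis
    type: Optional[str] = None   # domain type, filled in by classify()


# Hypothetical domain schema: each type lists keywords that hint at it.
# More specific types come first so "Min price" is not typed as plain "price".
TYPE_KEYWORDS = {
    "min_price": ["min price", "price from"],
    "max_price": ["max price", "price to"],
    "price": ["price"],
    "location": ["location", "city", "postcode"],
}


def classify(fields: List[Field]) -> List[Field]:
    """Assign each field the first schema type whose keyword occurs in its label."""
    for field in fields:
        text = field.label.lower()
        for type_name, keywords in TYPE_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                field.type = type_name
                break
    return fields


def find_price_ranges(fields: List[Field]) -> List[Tuple[str, str]]:
    """Structural constraint (assumed): a min_price field directly followed
    by a max_price field forms one price-range segment."""
    ranges = []
    for first, second in zip(fields, fields[1:]):
        if first.type == "min_price" and second.type == "max_price":
            ranges.append((first.name, second.name))
    return ranges


form = classify([
    Field("loc", "Location"),
    Field("pmin", "Min price"),
    Field("pmax", "Max price"),
])
price_ranges = find_price_ranges(form)
```

In this sketch the schema does two jobs, mirroring the description above: it types individual fields, and it groups adjacent fields into a segment when their arrangement matches a pattern typical for the domain.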
OPAL was presented at WWW’12. For more details, see OPAL’s page on DIADEM or our publications.
