•Earliest systems used hard-coded wrappers
–Content-focused (e.g., largest table cell)
–Didn’t scale
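A minimal sketch of the kind of content-focused heuristic mentioned above: return the table cell holding the most text. The class and function names are hypothetical, and this is an illustration using only Python's standard library, not a reconstruction of any particular early system.

```python
from html.parser import HTMLParser


class LargestCellFinder(HTMLParser):
    """Collects the text of every <td>/<th> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._open = []   # one text buffer per currently open cell
        self.cells = []   # normalized text of every closed cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._open.append([])

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._open:
            # Collapse runs of whitespace before measuring length.
            text = " ".join("".join(self._open.pop()).split())
            self.cells.append(text)

    def handle_data(self, data):
        # Text is credited only to the innermost open cell.
        if self._open:
            self._open[-1].append(data)


def largest_table_cell(html):
    """Return the text of the cell with the most characters, or ""."""
    finder = LargestCellFinder()
    finder.feed(html)
    finder.close()
    return max(finder.cells, key=len, default="")
```

A rule like this works on one page layout and fails on the next, which is one way to read "didn't scale".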
•Now, multi-class classification using mixed features: lexical, structural, and spatial.
–HTML path structure
•(Yang et al. 03, Shih and Karger 04)
–Spatial random walk using browser-exposed DOM tree
•Allows precise layout information (Xin and Lee 04)
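As a rough illustration of mixed features, the sketch below pairs a structural feature (the HTML tag path) with a simple lexical one (word count) for each text block. The names `PathFeatureExtractor`, `path_features`, and `VOID_TAGS` are hypothetical, and this is not the method of the cited papers. Spatial features are omitted because they need layout data from a rendering browser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path.
# (Assumed short list for illustration; not exhaustive.)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathFeatureExtractor(HTMLParser):
    """Records (tag_path, word_count) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._path = []    # currently open tags, root first
        self.blocks = []   # list of (tag_path, word_count) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._path:
            while self._path.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        if words:
            self.blocks.append(("/".join(self._path), len(words)))


def path_features(html):
    """Return one (structural, lexical) feature pair per text block."""
    extractor = PathFeatureExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.blocks
```

Feature pairs like these could then be fed to any multi-class classifier that labels each block (for example content, navigation, or advertisement).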