Approaches to classification
Earliest systems used hard-coded wrappers
Content-focused (e.g., pick the largest table cell)
Did not scale: wrappers had to be hand-written for each site layout
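A minimal sketch of the "largest table cell" heuristic, to show how simple (and brittle) such a wrapper is. It assumes the main content is the `<td>` with the most text; the names `LargestCellFinder` and `largest_table_cell` are hypothetical and not taken from any cited system.

```python
from html.parser import HTMLParser


class LargestCellFinder(HTMLParser):
    """Collect the text of every <td> and remember the longest one."""

    def __init__(self):
        super().__init__()
        self._open_cells = []  # one text buffer per currently open <td>
        self.best = ""

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._open_cells.append([])

    def handle_data(self, data):
        # Text counts toward every open cell, so an outer cell is
        # credited with the text of any table nested inside it.
        for buf in self._open_cells:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._open_cells:
            # Normalize whitespace before comparing lengths.
            text = " ".join("".join(self._open_cells.pop()).split())
            if len(text) > len(self.best):
                self.best = text


def largest_table_cell(html):
    """Return the whitespace-normalized text of the largest <td>."""
    finder = LargestCellFinder()
    finder.feed(html)
    finder.close()
    return finder.best
```

The rule breaks as soon as a site lays out its pages without tables, or places a long navigation list in a cell, which is the scaling problem noted above.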
Now, multi-class classification using mixed features:
lexical, structural and spatial.
HTML path structure
(Yang et al. 03; Shih and Karger 04)
Spatial random walk using the browser-exposed DOM tree
Gives access to precise layout information (Xin and Lee 04)
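One way to picture the HTML path structure feature: record the root-to-node tag path of each text node and use it as a structural feature for a classifier. The sketch below is an illustration under that assumption, not the method of the cited papers; `TagPathCollector` and `tag_paths` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of currently open tags
        self.paths = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open tag; ignore stray end tags.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self._open), text))


def tag_paths(html):
    """Return a list of (tag path, text) pairs in document order."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths
```

Text nodes that share a path (for example, every `html/body/ul/li` entry in a menu) tend to play the same role on the page, which is what makes the path useful alongside lexical and spatial features.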