• Earliest systems used hard-coded wrappers
    – Content-focused (e.g., largest table cell)
    – Didn’t scale
• Now, multi-class classification using mixed features: lexical, structural and spatial
    – HTML path structure
        • (Yang et al. 03, Shih and Karger 04)
    – Spatial random walk using browser-exposed DOM tree
        • Allows precise layout information (Xin and Lee 04)
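As a rough illustration of two ideas from the list above, the sketch below records the HTML tag path of every text node (a simple structural feature) and applies the "largest table cell" heuristic by returning the text of the `td` that holds the most characters. It is a minimal sketch using only the Python standard library, not the method of any cited system; the names `PathTextParser`, `extract_nodes`, `largest_table_cell` and `SAMPLE` are hypothetical, and the void-tag list and whitespace handling are simplifying assumptions.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextParser(HTMLParser):
    """Collects (tag_path, cell_id, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, element_id), outermost first
        self.next_id = 0   # running counter used to tell elements apart
        self.nodes = []    # collected (path, innermost td id or None, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.next_id))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        path = "/".join(open_tag for open_tag, _ in self.stack)
        # The innermost enclosing <td>, if any, identifies the table cell.
        cell = next((i for t, i in reversed(self.stack) if t == "td"), None)
        self.nodes.append((path, cell, text))


def extract_nodes(html):
    """Return (tag_path, cell_id, text) triples for all text nodes."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def largest_table_cell(html):
    """Hard-coded heuristic: return the text of the cell with the most characters."""
    texts = {}
    for _, cell, text in extract_nodes(html):
        if cell is not None:
            texts.setdefault(cell, []).append(text)
    if not texts:
        return ""
    best = max(texts.values(), key=lambda parts: sum(len(p) for p in parts))
    return " ".join(best)


# Hypothetical page: a navigation cell next to an article cell.
SAMPLE = """
<html><body><table><tr>
<td><a href="/">Home</a> <a href="/sports">Sports</a></td>
<td><h1>Headline</h1><p>The article body is much longer than the navigation links.</p></td>
</tr></table></body></html>
"""

if __name__ == "__main__":
    for path, _, text in extract_nodes(SAMPLE):
        print(path, "->", text)
    print(largest_table_cell(SAMPLE))
```

The heuristic is tied to one page layout, which is the scaling problem noted above. The tag paths, by contrast, could feed a classifier as one structural feature among several. Spatial features would need rendered layout coordinates from a browser, which this sketch does not attempt.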