Stylistic and lexical co-training for webpage block classification

13 Nov 2004

WIDM 04: Lee et al. Co-training Web Block Classification

Lexical and Stylistic Co-training

1. Split the document into blocks using the DOM tree

– Nontrivial (blocks can overlap, and visual segments can differ from DOM nodes)

2. Co-train

– Learner 1: stylistic learner

• Spatial and structural relationships

• External: relationships to other blocks

– Learner 2: lexical learner

• Part-of-speech (POS) and link-related features

• Internal: classification irrespective of other blocks
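
The co-training step above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from the talk: the `Block` fields, the feature values, the class labels ("navigation", "main_content"), and the toy nearest-centroid classifier are all hypothetical stand-ins. The only idea taken from the slide is that two learners see different views of the same block (stylistic vs. lexical) and each one labels the unlabeled blocks it is most confident about for the shared training pool.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Block:
    stylistic: tuple  # hypothetical view 1, e.g. (relative depth, relative position)
    lexical: tuple    # hypothetical view 2, e.g. (link density, noun ratio)


class CentroidLearner:
    """Toy nearest-centroid classifier standing in for either learner."""

    def __init__(self, view):
        self.view = view  # name of the Block attribute this learner sees
        self.centroids = {}

    def fit(self, labeled):
        groups = {}
        for block, label in labeled:
            groups.setdefault(label, []).append(getattr(block, self.view))
        # One centroid per class: the per-dimension mean of its feature vectors.
        self.centroids = {
            label: tuple(sum(col) / len(col) for col in zip(*vecs))
            for label, vecs in groups.items()
        }

    def predict(self, block):
        """Return (label, margin); a larger margin means higher confidence."""
        x = getattr(block, self.view)
        ranked = sorted((dist(x, c), label) for label, c in self.centroids.items())
        margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
        return ranked[0][1], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    learners = [CentroidLearner("stylistic"), CentroidLearner("lexical")]
    for _ in range(rounds):
        for learner in learners:
            if not unlabeled:
                break
            learner.fit(labeled)
            # Score every unlabeled block, then keep the most confident ones.
            scored = sorted(
                ((learner.predict(b), i) for i, b in enumerate(unlabeled)),
                key=lambda item: item[0][1],
                reverse=True,
            )[:per_round]
            # The learner's own guesses join the pool shared by both learners.
            for (label, _), i in scored:
                labeled.append((unlabeled[i], label))
            picked = {i for _, i in scored}
            unlabeled = [b for i, b in enumerate(unlabeled) if i not in picked]
    for learner in learners:
        learner.fit(labeled)
    return learners


# Hypothetical seed labels and unlabeled pool, for illustration only.
seed = [
    (Block((0.1, 0.0), (0.9, 0.1)), "navigation"),
    (Block((0.5, 0.5), (0.1, 0.6)), "main_content"),
]
pool = [
    Block((0.15, 0.05), (0.8, 0.2)),
    Block((0.55, 0.45), (0.2, 0.5)),
]
stylistic, lexical = co_train(seed, pool)
print(stylistic.predict(Block((0.12, 0.02), (0.85, 0.15)))[0])
```

Note that this sketch reduces both views to fixed feature vectors. A stylistic learner as described on the slide would also need features derived from a block's relationship to its neighbours, which is not modelled here.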