1
|
- Chee How Lee, Min-Yen Kan and Sandra Lai
- National University of Singapore
- kanmy@comp.nus.edu.sg
|
2
|
- What’s a web page block?
- Parts of a web page with different functions
- e.g., main content, site navigation, advertisements
- This distinction matters for many applications (next slide)
- Different names for the same thing
- fragments, elements, blocks, tiles
|
3
|
- Extracting content for mobile devices
- “Just the facts, ma’am”
- summarization → better (whole-)page classification
- Advertisement blocking
- Fragment versioning
- Distinguishing navigation from content
- → better link-based ranking
|
4
|
- Earliest systems used hard-coded wrappers
- Content-focused (e.g., largest table cell)
- Didn’t scale
- Now: multi-class classification using mixed lexical, structural and spatial features
- HTML path structure (Yang et al. 03, Shih and Karger 04)
- Spatial random walk over the browser-exposed DOM tree
- Allows precise layout information (Xin and Lee 04)
|
5
|
- An obvious approach: build a supervised classifier (toy sketch below)
- Train on labeled examples (f1, f2, …, fi, …, fn, C)
- Test by distilling features (f1, f2, …, fi, …, fn) = ?
- Labeled training data is costly → need to exploit unlabeled data
- The two feature sets are largely orthogonal
- → Try co-training!
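A minimal sketch of that supervised baseline, assuming scikit-learn; the three features (word count, link density, DOM depth) are invented for illustration and are not the paper's feature set:

```python
# Toy supervised baseline: each block is a feature vector
# (f1, ..., fn) paired with a class label C.
from sklearn.linear_model import LogisticRegression

X_train = [[120, 0.05, 2],   # main content: many words, few links
           [  8, 0.90, 1],   # site navigation: mostly links
           [ 15, 0.75, 3]]   # links to related articles
y_train = ["content", "navigation", "related"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)    # train on labeled (f1, ..., fn, C)

X_test = [[95, 0.10, 2]]     # unseen block: (f1, ..., fn) = ?
print(clf.predict(X_test))   # -> predicted class C
```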
|
6
|
- Two learners with separate views of the same problem
- Canonical example (Blum & Mitchell): classifying web pages using two views
- Link structure
- Text on the page
|
7
|
- Use one classifier to help the other (loop sketched below)
- e.g., pages that the link classifier is confident about are passed as correct answers to the text-based classifier
- Assumes the individual classifiers are reasonably accurate to start with
- Otherwise the noise level escalates
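A simplified sketch of the loop just described, assuming scikit-learn classifiers with predict_proba; in this variant both views feed one shared pool of noisily labeled examples, and the threshold and round count are illustrative, not the paper's settings:

```python
# Simplified co-training: two classifiers over disjoint views of
# the same blocks; each round, confident predictions on unlabeled
# blocks are promoted to (noisy) labeled examples for both views.
from sklearn.linear_model import LogisticRegression

def cotrain(Xa, Xb, y, Ua, Ub, rounds=5, threshold=0.9):
    """Xa/Xb: two feature views of the labeled blocks; y: labels.
    Ua/Ub: the same two views of the unlabeled blocks."""
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    clf_a = LogisticRegression(max_iter=1000)
    clf_b = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_a.fit(Xa, y)
        clf_b.fit(Xb, y)
        keep = []
        for i in range(len(Ua)):
            pa = clf_a.predict_proba([Ua[i]])[0]
            pb = clf_b.predict_proba([Ub[i]])[0]
            if pa.max() >= threshold:      # view A is confident
                label = clf_a.classes_[pa.argmax()]
            elif pb.max() >= threshold:    # else ask view B
                label = clf_b.classes_[pb.argmax()]
            else:
                keep.append(i)             # still unlabeled
                continue
            Xa.append(Ua[i]); Xb.append(Ub[i]); y.append(label)
        Ua = [Ua[i] for i in keep]         # shrink unlabeled pool
        Ub = [Ub[i] for i in keep]
    return clf_a, clf_b
```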
|
8
|
- B&M co-training handles only binary classification
- Our adaptation handles multi-class labels and distribution skewing
|
9
|
- PARser for Content Extraction & Layout Structure
- Goals:
- Coarse-grained classification
- Fine-grained information extraction
- Work on a variety of sources
- Open-source, reference implementation
|
10
|
- News stories: 50 sites sampled from Google News
- Pages often not well-formed: run Tidy first
- Needs XHTML / CSS support
- Blocks can have multiple classes
- Multi-class forced to single: assessor picks the most prominent class
- Resulting corpus has a skewed class distribution
- Classes (domain-specific fine-grained classes denoted by *):
- Main Content
- Site Navigation
- Search
- Supporting content
- Links supporting content
- Image supporting content
- Sub headers
- Site image
- Advertisements*
- Links to related articles*
- Newsletter / alert links*
- Date or Time of article*
- Source Station (country of report)*
- Reporter Name*
|
11
|
- Split the document into blocks using the DOM tree (rough sketch below)
- Nontrivial (blocks overlap; visual segments differ from DOM segments)
- Co-train two learners:
- Learner 1 – stylistic learner
- Spatial and structural relationships
- External: relates each block to other blocks
- Learner 2 – lexical learner
- POS and link-related features
- Internal: classifies each block irrespective of other blocks
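A rough sketch of the DOM-based splitting step, using BeautifulSoup as a stand-in parser; the tag list is illustrative, and the real splitter must handle the overlapping-block and visual-segment issues noted above:

```python
# Rough DOM-based block splitting: treat block-level elements that
# directly contain text as candidate blocks.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["div", "td", "p", "ul", "h1", "h2", "h3"]

def split_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for node in soup.find_all(BLOCK_TAGS):
        # Keep only the text this node contributes directly, so a
        # <div> wrapping other blocks does not swallow them.
        own_text = " ".join(
            s.strip() for s in node.find_all(string=True, recursive=False)
            if s.strip())
        if own_text:
            blocks.append((node.name, own_text))
    return blocks

html = """<table><tr><td>Home | News | Sports</td>
<td><p>Storm hits coast, thousands evacuated...</p></td></tr></table>"""
for tag, text in split_blocks(html):
    print(tag, "->", text[:40])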
|
12
|
- Layout: inferred from first-level DOM nodes
- Linear
- <TABLE>: use reading order, cell-type propagation
- XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth (sketch below)
- Font (CSS too): relative features
- Image size
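A toy sketch of the relative-to-absolute translation for nested <DIV>s; the Node layout data is a hypothetical stand-in for what a CSS-aware parser would supply:

```python
# Each node's offset is relative to its parent (CSS-style), so
# absolute coordinates come from summing offsets down the tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    dx: int = 0          # offset relative to parent
    dy: int = 0
    children: list = field(default_factory=list)

def absolutize(node, x=0, y=0, depth=0, out=None):
    """Walk the DOM, turning relative offsets into absolute (x, y)
    and recording tree depth: two of the spatial features."""
    out = [] if out is None else out
    x, y = x + node.dx, y + node.dy
    out.append((node.name, x, y, depth))
    for child in node.children:
        absolutize(child, x, y, depth + 1, out)
    return out

page = Node("body", children=[
    Node("div#nav", dx=0, dy=0),
    Node("div#main", dx=200, dy=0, children=[
        Node("div#story", dx=10, dy=40)])])
for name, x, y, depth in absolutize(page):
    print(f"{name}: abs=({x},{y}) depth={depth}")
```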
|
13
|
- For each block:
- POS tag distribution in text
- Stemmed tokens weighted by TF×IDF (toy sketch below)
- IDF statistics from Stanford’s WebBase
- Number of words
- Alt text of images
- Hyperlink type (e.g., embedded image, text, mailto)
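A toy version of the TF×IDF weighting for one block's tokens; the IDF table is invented, standing in for statistics drawn from Stanford's WebBase:

```python
# TFxIDF weights over a block's stemmed tokens, plus word count.
# (The POS-distribution feature would be computed separately.)
import math
from collections import Counter

IDF = {"storm": 4.2, "coast": 3.8, "home": 1.1, "news": 0.9}  # toy values

def lexical_features(tokens, default_idf=5.0):
    tf = Counter(tokens)
    weights = {t: (1 + math.log(c)) * IDF.get(t, default_idf)
               for t, c in tf.items()}
    return {"num_words": len(tokens), "tfidf": weights}

print(lexical_features(["storm", "storm", "coast", "evacu"]))
```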
|
14
|
- Adapted co-training:
- Sample balancing: preserve the class ratio among noisily labeled examples; performance is poor without it (sketch after this list)
- Replace unlabeled data at each round
- Use BoosTexter: handles word features easily
- Five-fold cross validation
- General performance?
- Specific performance on:
- Fine-grained classification?
- XHTML / DIV pages?
- Others’ tasks?
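One way the sample-balancing step could look, assuming per-class quotas proportional to the labeled class distribution; names, counts, and the quota rule are illustrative, not the paper's exact procedure:

```python
# When promoting noisily labeled examples, allocate per-class slots
# in proportion to the labeled class distribution, so majority
# classes cannot flood the pool.
from collections import Counter

def balanced_take(candidates, labeled_y, budget=20):
    """candidates: (example, predicted_label, confidence) triples,
    e.g. the confident predictions from one co-training round."""
    share = Counter(labeled_y)
    total = sum(share.values())
    quota = {c: max(1, round(budget * n / total)) for c, n in share.items()}
    taken = []
    for ex, label, conf in sorted(candidates, key=lambda t: -t[2]):
        if quota.get(label, 0) > 0 and len(taken) < budget:
            taken.append((ex, label))
            quota[label] -= 1
    return taken

y_labeled = ["content"] * 6 + ["nav"] * 3 + ["ads"]
cands = [("b1", "content", .99), ("b2", "content", .97), ("b3", "nav", .95)]
print(balanced_take(cands, y_labeled, budget=2))
```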
|
15
|
- Statistically significant improvement
- Improvement on large classes comes at the expense of minority classes
- No fine-grained classes detected
|
16
|
- Smaller dataset: 1/5 the size, from a limited sample of sites
- Both annotated and unannotated data sets were smaller
- As a result, fewer co-training iterations
- The single-view model still seems to do better
|
17
|
- Slightly different block-splitting model than earlier work
- Fewer training examples
- No significant gain from co-training, but comparable to other work (19.5% error vs. 14-18% error)
|
18
|
- Co-training model for web block classification
- Achieves a 28.5% reduction in error on the main task
- However, it fails at:
- Detecting fine-grained classes
- → Exploit templates, IE methods, path similarity and context
- Likely needs more unlabeled data
- → Re-run with more experimental data
- Dependent on the learning model
- → Looking to change the learning package
|
19
|
- Any questions?
- http://parcels.sourceforge.net/
- Available in late November 2004
- Annotator, evaluation tools provided
- Handles XHTML and DIV / CSS
- Open source, GPL’ed code
|