Stylistic and lexical co-training for webpage block classification
Chee How Lee, Min-Yen Kan and Sandra Lai
National University of Singapore
kanmy@comp.nus.edu.sg

Web page blocks
What’s a web page block?
Parts of a web page with different functions
e.g., main content, site navigation, advertisements
This distinction is important for many downstream tasks
Different names for the same thing
fragments, elements, blocks, tiles

Uses of block classification
Extract main content for mobile devices
“Just the facts, ma’am”
Summarization → better (whole-)page classification
Advertisement blocking
Fragment versioning
Distinguish navigation from content
better link-based ranking

Approaches to classification
Earliest systems used hard-coded wrappers
Content-focused heuristics (e.g., take the largest table cell)
These didn’t scale
Now, multi-class classification using mixed features: lexical, structural and spatial.
HTML Path structure
(Yang et al. 03, Shih and Karger 04)
Spatial random walk over the browser-exposed DOM tree
Allows precise layout information (Xin and Lee 04)

Which approach to use
An obvious approach is to build a supervised classifier
Train on labeled examples (f1, f2, …, fi, …, fn, C)
Test by extracting features (f1, f2, …, fi, …, fn) and predicting C = ?
Labeled training data is costly, so we need to use unlabeled data
The feature sets are largely orthogonal
= Try co-training!

Co-training (Blum and Mitchell)
Two learners with separate views of the same problem
Their canonical example: classifying web pages with two views
Link structure
Text on the page

Co-training (cont’d)
Use one classifier to help the other
e.g., pages that the link-based classifier labels confidently are passed as (noisily) correct answers to the text-based classifier
Assumes the individual classifiers are reasonably accurate to start with
Otherwise the noise level escalates (see the loop sketch below)
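A minimal sketch of this loop over our two views, assuming generic scikit-learn-style learners. MultinomialNB and every name below are illustrative stand-ins (PARCELS actually uses BoosTexter), and pool replacement is simplified to pool shrinking:

```python
# Sketch only: not the PARCELS implementation.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(Xs_l, Xl_l, y, Xs_u, Xl_u, rounds=10, k=5):
    """Xs_* / Xl_*: stylistic and lexical views of the labeled (_l)
    and unlabeled (_u) pools, as non-negative feature matrices."""
    for _ in range(rounds):
        clf_s = MultinomialNB().fit(Xs_l, y)   # stylistic learner
        clf_l = MultinomialNB().fit(Xl_l, y)   # lexical learner
        if len(Xs_u) < 2 * k:
            break
        keep = np.ones(len(Xs_u), dtype=bool)
        # Each learner labels the pool; its k most confident guesses
        # are promoted to the shared labeled set for the *other* view.
        # (Duplicate picks across the two learners are tolerated here.)
        for clf, X_u in ((clf_s, Xs_u), (clf_l, Xl_u)):
            proba = clf.predict_proba(X_u)
            picks = np.argsort(proba.max(axis=1))[-k:]
            y = np.concatenate([y, clf.classes_[proba[picks].argmax(axis=1)]])
            Xs_l = np.vstack([Xs_l, Xs_u[picks]])
            Xl_l = np.vstack([Xl_l, Xl_u[picks]])
            keep[picks] = False
        Xs_u, Xl_u = Xs_u[keep], Xl_u[keep]    # shrink the unlabeled pool
    return clf_s, clf_l
```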

Architecture
B&M co-training handles only binary classification
Our adapted architecture supports multi-class labels and handles the skewed class distribution (one possible multi-class lifting is sketched below)
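One way to lift binary co-training to an n-way problem is a one-vs-rest recoding plus a confidence-based combiner. This is an illustrative assumption, not necessarily the exact adaptation used here:

```python
# Sketch of a one-vs-rest lifting; the paper's exact scheme may differ.
import numpy as np

def one_vs_rest_labels(y, classes):
    """Recode an n-way label vector into one binary problem per class."""
    y = np.asarray(y)
    return {c: (y == c).astype(int) for c in classes}

def predict_class(models, x_style, x_lex):
    """models: class -> (stylistic clf, lexical clf), each pair trained
    on one binary problem; pick the class whose pair is most confident."""
    def p_pos(clf, x):
        proba = clf.predict_proba([x])[0]
        return proba[list(clf.classes_).index(1)]   # P(block is this class)
    return max(models, key=lambda c: (p_pos(models[c][0], x_style) +
                                      p_pos(models[c][1], x_lex)) / 2)
```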

PARCELS
PARser for Content Extraction & Layout Structure
Goals:
Coarse-grained classification
Fine-grained information extraction
Work on a variety of sources
Open-source, reference implementation

Target Classification
News stories
Domain-specific fine-grained classes (denoted by *)
Needs XHTML / CSS support
Blocks can have multiple classes
Multi-class forced to single: the assessor picks the most prominent class
Resulting corpus has a skewed class distribution
50 sites from Google News
Many pages are not well-formed: cleaned with HTML Tidy first
The classes (fine-grained ones starred):
Main content
Site navigation
Search
Supporting content
Links supporting content
Image supporting content
Sub-headers
Site image
Advertisements*
Links to related articles*
Newsletter / alert links*
Date or time of article*
Source station (country of report)*
Reporter name*

Lexical and Stylistic Co-training
Split the document into blocks using the DOM tree
Nontrivial: blocks overlap, and visual segments differ from DOM segments (see the splitting sketch after this list)
Co-train
Learner 1 – Stylistic learner
Spatial and structural relationships
External relationships to other blocks
Learner 2 – Lexical learner
POS and link-related features
Internal classification, irrespective of other blocks
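A minimal sketch of the DOM-based splitting step, with BeautifulSoup standing in for PARCELS's own parser; the block-level tag set and the lowest-block policy are illustrative assumptions, and overlap resolution is omitted:

```python
# Sketch only: tag set and splitting policy are assumptions.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["table", "td", "div", "p", "ul", "form"]   # assumed set

def split_blocks(html):
    """Yield candidate blocks: block-level nodes with no block-level
    descendants (overlap resolution is omitted here)."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(BLOCK_TAGS):
        if node.find(BLOCK_TAGS) is None and node.get_text(strip=True):
            yield node
```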

Stylistic Features
Layout type: guessed from first-level DOM nodes (feature sketch below)
Linear
<TABLE>: use reading order and cell-type propagation
XHTML / CSS (e.g., <DIV>): translate relative to absolute positioning, model depth
Font (CSS too): relative features
Image size
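An illustrative extraction of a few of these features from one block node (continuing the BeautifulSoup sketch); the real feature set also models CSS absolute positioning and relative font properties:

```python
# Sketch only: a subset of plausible stylistic features.
def _px(value):
    """Parse an inline pixel dimension, ignoring '%' and missing values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0

def stylistic_features(node):
    path = [p.name for p in node.parents
            if p.name and p.name != "[document]"]        # tag path to root
    return {
        "depth": len(path),                              # DOM nesting depth
        "in_table": int("table" in path),                # table vs. CSS layout
        "sibling_pos":                                   # crude reading order
            node.parent.contents.index(node) if node.parent else 0,
        "n_images": len(node.find_all("img")),
        "img_area": sum(_px(i.get("width")) * _px(i.get("height"))
                        for i in node.find_all("img")),  # image prominence
    }
```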

Lexical Features
For each block:
POS tag distribution in text
Stemmed tokens weighted by TF×IDF
IDF computed from Stanford’s WebBase collection
Number of words
Alt text of images
Hyperlink type (e.g., embedded image, text, mailto); a feature sketch follows
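An illustrative extraction of the per-block lexical features; NLTK's Porter stemmer and POS tagger stand in for the tools actually used, the IDF table is assumed precomputed offline, and alt-text / hyperlink-type counts would be added the same way:

```python
# Sketch only; nltk.pos_tag needs the 'averaged_perceptron_tagger' data.
import re
from collections import Counter
import nltk
from nltk.stem import PorterStemmer

_stem = PorterStemmer().stem

def lexical_features(text, idf):
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = {"n_words": len(tokens)}
    for stem, tf in Counter(_stem(t) for t in tokens).items():
        feats[f"tok={stem}"] = tf * idf.get(stem, 0.0)   # TF x IDF weight
    for _, tag in nltk.pos_tag(tokens):                  # POS distribution
        feats[f"pos={tag}"] = feats.get(f"pos={tag}", 0) + 1
    return feats
```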

Evaluations
Adapted co-training:
Sample balancing: preserve the class ratio among the noisily labeled examples added each round; performance is poor without it (sketched after this list)
Replace unlabeled data at each round
Use BoosTexter: handles word features easily
Five-fold cross-validation
General performance?
Specific performance on:
Fine-grained classification?
XHTML / DIV pages?
Others’ tasks?
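A sketch of the sample-balancing step: confident examples are promoted each round under per-class quotas that mirror the seed set's class mix. The exact quota policy here is an assumption:

```python
# Sketch only: quota policy is assumed, not taken from the paper.
import numpy as np
from collections import Counter

def balanced_picks(pred_labels, confidences, seed_labels, budget):
    """Pick up to `budget` confident examples, with per-class quotas
    set by the labeled seed set's class distribution."""
    counts = Counter(seed_labels)
    total = sum(counts.values())
    picks = []
    for c, n in counts.items():
        quota = max(1, round(budget * n / total))        # class c's share
        idx = np.where(np.asarray(pred_labels) == c)[0]
        best = idx[np.argsort(np.asarray(confidences)[idx])[-quota:]]
        picks.extend(best.tolist())                      # most confident of c
    return picks
```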

General performance
Statistically significant improvement
Improvement on large classes comes at the expense of minority classes
Despite sample balancing
No fine-grained classes were detected

XHTML / DIV Evaluation
Smaller dataset
1/5 the size; few suitable sites to sample from
Both the annotated and unannotated data sets were smaller
As a result, fewer co-training iterations were possible
The single-view model still seems to do better

Coarse-grained model
Slightly different block-splitting model than in earlier work
Fewer training examples
No significant gain from co-training, but comparable to other work (19.5% error vs. 14–18% error)

Conclusion
Co-training model for web block classification
Achieves a 28.5% reduction in error on the main task
However, it fails at:
Detecting fine-grained classes
→ Exploit templates, IE methods, path similarity and context
Likely needs a sufficient amount of unlabeled data
→ Re-run using more experimental data
Dependent on learning model
→ Looking to change learning package

Question time!
Any questions?
http://parcels.sourceforge.net/
Available in late November 2004
Annotator, evaluation tools provided
Handles XHTML and DIV / CSS
Open source, GPL’ed code